├── model └── README.md ├── cache └── README.md ├── images ├── fig1.png └── fig2.png ├── README.md ├── metrics.py ├── nnLayer.py ├── SecondStructurePredictor.py ├── DL_ClassifierModel.py └── utils.py /model/README.md: -------------------------------------------------------------------------------- 1 | 此文件夹主要用于保存训练的模型 -------------------------------------------------------------------------------- /cache/README.md: -------------------------------------------------------------------------------- 1 | 此文件夹用于保存训练过程中产生的词向量等 -------------------------------------------------------------------------------- /images/fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wudejian789/2020TIANCHI-ProteinSecondaryStructurePrediction-TOP1/HEAD/images/fig1.png -------------------------------------------------------------------------------- /images/fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wudejian789/2020TIANCHI-ProteinSecondaryStructurePrediction-TOP1/HEAD/images/fig2.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 蛋白质结构预测大赛top1方案分享 2 | 3 | >PS:本人目前就读于中南大学研一,比较喜欢参加各类数据挖掘竞赛,有想组队的小伙伴可以带我一个,欢迎加QQ:793729558进行交流。以下都是个人理解,若有错误,欢迎指正。 4 | 5 | ## 1. 赛题介绍 6 | 7 | 赛题链接:https://tianchi.aliyun.com/competition/entrance/231781/introduction 8 | 9 | 本题为根据蛋白质的一级结构预测其二级结构,经过比赛期间组内师兄的讲解,我对蛋白质一级结构二级结构的理解如下,如有错误,欢迎指正。 10 | 11 | 蛋白质可以看成是一条氨基酸序列,在空间中是一种相互交错螺旋的结构,像一条互相缠绕的绳子: 12 | 13 | ![img](https://github.com/wudejian789/2020TIANCHI-ProteinSecondaryStructurePrediction-TOP1/blob/master/images/fig1.png) 14 | 15 | 这种三维结构叫做蛋白质的三级结构,而如果不考虑结构的三维性,或者说把这整条序列拉直,用一个一维的序列表示,这便是得到了蛋白质的一级结构: 16 | 17 | >GPTGTGESKCPLMVKVLDAV······ 18 | 19 | 这些字母**G、A、V**等便是代表一个个的氨基酸,其中主要包含有20种常见的氨基酸。 20 | 21 | 用这样的序列表示蛋白质比起原始的三维结构确实方便不少,但却丢失了三维的结构信息,蛋白质的结构决定其功能,这里的结构不止是序列本身,更多的还依赖其三维结构。因此,便出现了蛋白质的二级结构,它是一条与一级结构长度相等的一维序列,用以表征一级结构种的各位置的氨基酸在三维空间种的形态,以保留一部分的三维结构信息,例如以上蛋白质一节结构对应的二级结构为: 22 | 23 | > EEEEEEETT······ 24 | 25 | ==注意:以上二级结构最前面有11个空白符,这个空白符也是在蛋白质在三维空间种的一种松散结构表示。== 26 | 27 | 这里的' '、'**E**'、'**T**'等都是对应位置的氨基酸在空间种的形态(与一级结构 **GPTGTGESKCPLMVKVLDAV······** 是一一对应的),例如'**T**'代表的就是该位置的氨基酸在空间中是一种**氢键转折**的形态。 28 | 29 | 本赛题就是需要通过蛋白质的一级结构,预测其二级结构,在深度学习种是一种典型的N-N的seq2seq问题。 30 | 31 | ## 2. 赛题理解 32 | 33 | 不难想到,蛋白质三维结构的形成,其实主要是受某些力的作用,不同氨基酸的分子量、体积、质量等性质都有差异,这些小分子间会受到分子间作用力的影响,换句话说,分子间作用力等多种因素共同作用,让蛋白质形成了这样的一种相对稳定的空间结构,以达到一种稳态;而倘若你强行把它拉直,它也会由于受力不均,又开始相互缠绕,以达到稳态。 34 | 35 | 因此,对于某条蛋白质的二级结构中第**i**个位置的空间形态,其不止是取决于对应一级结构中位置**i**的氨基酸,还取决于位置**i**周围氨基酸甚至整条序列的情况。 36 | 37 | 定义一级结构中位置**i**及其上下文的整条片段为**X**,对应的二级结构中位置**i**的形态为**Y**,我统计了整个训练数据中 **P(Y|X)** 的情况,并计算了在不同窗口大小时,**P(Y|X)>0.95** 在所有 **P(Y|X)** 中的占比情况如下表: 38 | 39 | 窗口大小|1|3|5|7|9|13|… 40 | :-:|:-:|:-:|:-:|:-:|:-:|:-:|:-: 41 | **P(P(Y\|X)>0.95)**|0.00588|0.02188|0.55728|0.83511|0.84413|0.85431|… 42 | 43 | 以上结果也验证了之前的理解,且不难看出,当窗口大小达到**7**以上时,可以达到较好的预测。 44 | 45 | ## 3. 思路分享 46 | 这类题首先需要解决的是输入序列的编码问题,很自然的可以想到onehot和word2vec两种编码方法,本次赛题我们都进行了尝试。 47 | 48 | ### 3.1 Onehot与基本理化性质编码+滑窗法+浅层NN 49 | 氨基酸的基本理化性质包括分子量、等电点、解离常数、范德华半径、水中溶解度、侧脸疏水性,以及形成α螺旋可能性、形成β螺旋可能性、转向概率等(来自Chou-Fasman算法),这些数据百度都很容易找到。 50 | 51 | 然后是窗口大小的选择。经过测试,隐层节点数为1024,当窗口大小达到79以上时,线下MaF达到饱和,为**0.749**。再调节隐层大小为2048,最后的线下MaF为**0.767**。 52 | 53 | ==(注:此处为氨基酸级别的MaF得分,非官方评分方法的序列级MaF再平均的结果;且此处未进行交叉验证,仅仅是单模的结果,后面的线上结果也是,所以可能会有一定偏差)。== 54 | 55 | 该模型提交后线上结果为**0.7312**。(滑窗模型其实等价于基于整条序列的CNN模型) 56 | 57 | ### 3.2 Word2vec+深层NN 58 | NN的结构设计主要参考论文《Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks》[1],这是一篇使用深度学习进行蛋白质二级结构预测的经典论文,文中使用了CNN+BiGRU的结构进行蛋白质二级结构预测,模型结构如下: 59 | 60 | ![img](https://github.com/wudejian789/2020TIANCHI-ProteinSecondaryStructurePrediction-TOP1/blob/master/images/fig2.png) 61 | 62 | 该模型先通过CNN捕获局部信息,再通过RNN融入全局信息,是NLP长文本任务的常见baseline模型。这里基本照搬了模型结构,但将编码部分改为了word2vec预训练的结构,词向量大小为128,其它结构和参数与原文一致,文章可从github项目目录进行下载。 63 | 64 | 此处还需注意一定的是,如果embedding层是单独对每个氨基酸进行编码的话,那么词表大小为23(数据集中共23种字母)。而在NLP种经常用到的一种叫做n-gram的技术,即将多个词绑定在以此形成整体,这个技术在蛋白质序列种也用得比较多,成为k-mers。倘若使用k-mers构建词表的话,假定k=3,那么词表的大小就是23*23*23=12167,这样相当于在编码时将上下文也考虑了进去,增加了词的多样性,在一定程度上可以提高模型的学习能力,但也会增大过拟合的风险。 65 | 66 | 这里我也分别尝试了k=1和k=3的两个模型情况,线下分别为**0.719**和**0.706**,线上分别为**0.7576**和**0.7518**。 67 | 68 | ==(注:此处计算线下分数的方法与官方提供的是一致的,但不知道为何线下比线上低了这么多,暂时还未找到原因;且此处未进行交叉验证,仅仅是单模的结果,后面的线上结果也是,所以可能会有一定偏差)== 69 | 70 | 这两个模型的输入和数据划分都有较大差异,显然会有一定的融合收益,将二者的结果进行加权平均后,线上结果为**0.7702**。 71 | 72 | ### 3.3 最终模型 73 | 将以上几个模型的特征输入都有着较大不同,进行简单加权融合后,线上结果到达**0.7770**。 74 | 75 | 在得到以上结果后,进一步分析问题: 76 | 77 | **1.模型真正学会的到底是什么:** 我们结合赛题理解部分的统计结果,不难想到,与其说模型是学会了推理,不如说模型主要是记忆了大量由**X->Y**的固定映射或搭配,然后根据不同搭配的置信度进行决策,学会如何权衡不同搭配以得到更加正确的结果,这也是选用单层小窗口CNN时严重欠拟合的原因。 78 | 79 | **2.氨基酸的编码表示:** 在NLP任务中的字词编码由于词表过大导致维度过多且稀疏,这才出现了word2vec算法以得到词语的低维稠密表示,且不同字词间有很大的联系,而这一联系可以通过词向量间的cos距离等来刻画。而蛋白质总共包括的氨基酸种类较少,在本数据中只有23种,只需要一个23维的onehot向量就可以表示,且不同氨基酸间的关联度很小,更多的是差异性,而onehot向量是可以充分表达这一差异性的(不同onehot向量在高维空间中相互垂直)。这也是简单的onehot编码+大窗口CNN能如此有效的原因,也验证了前面的观点,即模型主要是记忆了大量由**X->Y**的隐射,预测时根据所记忆的大量先验知识,对输出进行决策。 80 | 81 | 综上,我们设计了最终的模型:在**3.2**部分的模型中,将embedding部分改成了**25维onehot编码+14维理化特征+25维word2vec特征**,其中onehot和理化特征部分在训练过程中是frozen的,而word2vec会随着训练进行finetune;其次是加大了CNN部分的窗口,设置成了[1,9,81]。 82 | 83 | >PS:这部分没有尝试其它的数值,这里是玄学设计,取了一个单粒度的窗口1(相当于最普通的神经网络,仅仅是对特征进行了非线性变换);大窗口81(为了达到之前的最优窗口79);以及大窗口开根号的数值——9,以折个中) 84 | 85 | 最终按次方案训练了一个3折的模型,线下MaF平均为**0.756**。 86 | 87 | ==(注:此处计算线下分数的方法是padding后序列级别的MaF再平均,理论上应该会高于去掉padding后的结果,这里同样线上的结果也好于线下== 88 | 89 | 将3折的模型加权平均后线上分数为**0.7832**。(榜上最优结果0.7855是融合了之前的几个模型,但没多大参考价值,最终模型可以说是融合了之前的所有模型,融合没多大价值,收益基本来自数据的分布差异) 90 | 91 | ## 4. 代码开源 92 | 代码在github进行了开源,基于pytorch,其中主要包含: 93 | >nnLayer:基本神经网络结构的封装。 94 | >DL_ClassifierModel:整个模型的封装,包含训练、模型的加载保存等部分。 95 | >utils:数据接口部分的封装。 96 | >metrics:评价指标函数的封装。 97 | >SecondStructurePredictor:模型的预测接口类。 98 | 99 | 使用方法如下: 100 | ```python 101 | # 导入相关类 102 | from utils import * 103 | from DL_ClassifierModel import * 104 | from SecondStructurePredictor import * 105 | # 初始化数据类 106 | dataClass = DataClass('data_seq_train.txt', 'data_sec_train.txt', k=1, validSize=0.3, minCount=0) 107 | # 词向量预训练 108 | dataClass.vectorize(method='char2vec', feaSize=25, sg=1) 109 | # onehot+理化特征获取 110 | dataClass.vectorize(method='feaEmbedding') 111 | # 初始化模型对象 112 | model = FinalModel(classNum=dataClass.classNum, embedding=dataClass.vector['embedding'], feaEmbedding=dataClass.vector['feaEmbedding'], 113 | useFocalLoss=True, device=torch.device('cuda')) 114 | # 开始训练 115 | model.cv_train( dataClass, trainSize=64, batchSize=64, epoch=1000, stopRounds=100, earlyStop=30, saveRounds=1, 116 | savePath='model/FinalModel', lr=3e-4, augmentation=0.1, kFold=3) 117 | # 预测, 得到的输出是一个N × L × C的矩阵,N为样例数,L为序列最大长度,C为类别数,即得到的是各序列各位置得到各类别上的概率。 118 | model = Predictor_final('model/FinalModelxxx.pkl', device='xxx', map_location='xxx') 119 | model.predict('seqData.txt', batchSize=128) 120 | ``` 121 | 122 | ## 参考文献 123 | [1]Li Z, Yu Y. Protein secondary structure prediction using cascaded convolutional and recurrent neural networks[J]. arXiv preprint arXiv:1604.07176, 2016. -------------------------------------------------------------------------------- /metrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn import metrics as skmetrics 3 | import warnings 4 | warnings.filterwarnings("ignore") 5 | 6 | def lgb_MaF(preds, dtrain): 7 | Y = np.array(dtrain.get_label(), dtype=np.int32) 8 | preds = preds.reshape(-1,len(Y)) 9 | Y_pre = np.argmax( preds, axis=0 ) 10 | return 'macro_f1', float(F1(preds.shape[0], Y_pre, Y, 'macro')), True 11 | 12 | def lgb_precision(preds, dtrain): 13 | Y = dtrain.get_label() 14 | preds = preds.reshape(-1,len(Y)) 15 | Y_pre = np.argmax( preds, axis=0 ) 16 | return 'precision', float(Counter(Y==Y_pre)[True]/len(Y)), True 17 | 18 | class Metrictor: 19 | def __init__(self, classNum): 20 | self.classNum = classNum 21 | self._reporter_ = {"Score":self.Score, 22 | "MaF":self.MaF, "MiF":self.MiF, 23 | "ACC":self.ACC, 24 | "MaAUC":self.MaAUC, "MiAUC":self.MiAUC, 25 | "MaMCC":self.MaMCC, "MiMCC":self.MiMCC} 26 | def __call__(self, report, end='\n'): 27 | res = {} 28 | for mtc in report: 29 | v = self._reporter_[mtc]() 30 | print(f" {mtc}={v:6.3f}", end=';') 31 | res[mtc] = v 32 | print(end=end) 33 | return res 34 | def set_data(self, Y_prob_pre, Y): 35 | self.raw_Y_pre,self.raw_Y = Y_prob_pre.argmax(axis=-1),Y 36 | Y_prob_pre,Y = Y_prob_pre.reshape(-1,self.classNum),Y.reshape(-1) 37 | self.Y_prob_pre,self.Y = Y_prob_pre,Y 38 | self.Y_pre = self.Y_prob_pre.argmax(axis=1) 39 | self.N = len(self.Y) 40 | @staticmethod 41 | def table_show(resList, report, rowName='CV'): 42 | lineLen = len(report)*8 + 6 43 | print("="*(lineLen//2-6) + "FINAL RESULT" + "="*(lineLen//2-6)) 44 | print(f"{'-':^6}" + "".join([f"{i:>8}" for i in report])) 45 | for i,res in enumerate(resList): 46 | print(f"{rowName+'_'+str(i+1):^6}" + "".join([f"{res[j]:>8.3f}" for j in report])) 47 | print(f"{'MEAN':^6}" + "".join([f"{np.mean([res[i] for res in resList]):>8.3f}" for i in report])) 48 | print("======" + "========"*len(report)) 49 | def each_class_indictor_show(self, id2lab): 50 | id2lab = np.array(id2lab) 51 | Yarr = np.zeros((self.N, self.classNum), dtype='int32') 52 | Yarr[list(range(self.N)),self.Y] = 1 53 | TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(self.classNum, self.Y_pre, self.Y, ignore=False) 54 | MCCi = (TPi*TNi - FPi*FNi) / (np.sqrt( (TPi+FPi)*(TPi+FNi)*(TNi+FPi)*(TNi+FNi) ) + 1e-10) 55 | Pi = TPi/(TPi+FPi+1e-10) 56 | Ri = TPi/(TPi+FNi+1e-10) 57 | Fi = 2*Pi*Ri/(Pi+Ri+1e-10) 58 | sortedIndex = np.argsort(id2lab) 59 | classRate = Yarr.sum(axis=0)[sortedIndex] / self.N 60 | id2lab,MCCi,Pi,Ri,Fi = id2lab[sortedIndex],MCCi[sortedIndex],Pi[sortedIndex],Ri[sortedIndex],Fi[sortedIndex] 61 | print("-"*28 + "MACRO INDICTOR" + "-"*28) 62 | print(f"{'':30}{'rate':<8}{'MCCi':<8}{'Pi':<8}{'Ri':<8}{'Fi':<8}") 63 | for i,c in enumerate(id2lab): 64 | print(f"{c:30}{classRate[i]:<8.2f}{MCCi[i]:<8.3f}{Pi[i]:<8.3f}{Ri[i]:<8.3f}{Fi[i]:<8.3f}") 65 | print("-"*70) 66 | def MaF(self): 67 | return F1(self.classNum, self.Y_pre, self.Y, average='macro') 68 | def MiF(self): 69 | return F1(self.classNum, self.Y_pre, self.Y, average='micro') 70 | def ACC(self): 71 | return ACC(self.classNum, self.Y_pre, self.Y) 72 | def MaMCC(self): 73 | return MCC(self.classNum, self.Y_pre, self.Y, average='macro') 74 | def MiMCC(self): 75 | return MCC(self.classNum, self.Y_pre, self.Y, average='micro') 76 | def MaAUC(self): 77 | return AUC(self.classNum, self.Y_prob_pre, self.Y, average='macro') 78 | def MiAUC(self): 79 | return AUC(self.classNum, self.Y_prob_pre, self.Y, average='micro') 80 | def Score(self): 81 | res = 0 82 | for y_pre,y in zip(self.raw_Y_pre,self.raw_Y): 83 | res += skmetrics.f1_score(y, y_pre, average='macro') 84 | return res/len(self.raw_Y) 85 | 86 | def _TPiFPiTNiFNi(classNum, Y_pre, Y, ignore=True): 87 | Yarr, Yarr_pre = np.zeros((len(Y), classNum), dtype='int32'), np.zeros((len(Y), classNum), dtype='int32') 88 | Yarr[list(range(len(Y))),Y] = 1 89 | Yarr_pre[list(range(len(Y))),Y_pre] = 1 90 | if ignore: 91 | isValid = (Yarr.sum(axis=0) + Yarr_pre.sum(axis=0))>0 92 | Yarr,Yarr_pre = Yarr[:,isValid],Yarr_pre[:,isValid] 93 | TPi = np.array([Yarr_pre[:,i][Yarr[:,i]==1].sum() for i in range(Yarr.shape[1])], dtype='float32') 94 | FPi = Yarr_pre.sum(axis=0) - TPi 95 | TNi = (1^Yarr).sum(axis=0) - FPi 96 | FNi = Yarr.sum(axis=0) - TPi 97 | return TPi,FPi,TNi,FNi 98 | 99 | def ACC(classNum, Y_pre, Y): 100 | TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y) 101 | return TPi.sum() / len(Y) 102 | 103 | def AUC(classNum, Y_prob_pre, Y, average='micro'): 104 | assert average in ['micro', 'macro'] 105 | Yarr = np.zeros((len(Y), classNum), dtype='int32') 106 | Yarr[list(range(len(Y))),Y] = 1 107 | return skmetrics.roc_auc_score(Yarr, Y_prob_pre, average=average) 108 | 109 | def MCC(classNum, Y_pre, Y, average='micro'): 110 | assert average in ['micro', 'macro'] 111 | TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y) 112 | if average=='micro': 113 | TP,FP,TN,FN = TPi.sum(),FPi.sum(),TNi.sum(),FNi.sum() 114 | MiMCC = (TP*TN - FP*FN) / (np.sqrt( (TP+FP)*(TP+FN)*(TN+FP)*(TN+FN) ) + 1e-10) 115 | return MiMCC 116 | else: 117 | MCCi = (TPi*TNi - FPi*FNi) / (np.sqrt((TPi+FPi)*(TPi+FNi)*(TNi+FPi)*(TNi+FNi)) + 1e-10) 118 | return MCCi.mean() 119 | 120 | def Precision(classNum, Y_pre, Y, average='micro'): 121 | assert average in ['micro', 'macro'] 122 | TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y) 123 | if average=='micro': 124 | MiP = TPi.sum() / (TPi.sum() + FPi.sum() + 1e-10) 125 | return MiP 126 | else: 127 | Pi = TPi/(TPi+FPi+1e-10) 128 | return Pi.mean() 129 | 130 | def Recall(classNum, Y_pre, Y, average='micro'): 131 | assert average in ['micro', 'macro'] 132 | TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y) 133 | if average=='micro': 134 | MiR = TPi.sum() / (TPi.sum() + FNi.sum() + 1e-10) 135 | return MiR 136 | else: 137 | Ri = TPi/(TPi + FNi + 1e-10) 138 | return Ri.mean() 139 | 140 | def F1(classNum, Y_pre, Y, average='micro'): 141 | assert average in ['micro', 'macro'] 142 | if average=='micro': 143 | MiP,MiR = Precision(classNum, Y_pre, Y, average='micro'),Recall(classNum, Y_pre, Y, average='micro') 144 | MiF = 2*MiP*MiR/(MiP+MiR+1e-10) 145 | return MiF 146 | else: 147 | TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y) 148 | Pi,Ri = TPi/(TPi + FPi + 1e-10),TPi/(TPi + FNi + 1e-10) 149 | Fi = 2*Pi*Ri/(Pi+Ri+1e-10) 150 | return Fi.mean() 151 | 152 | -------------------------------------------------------------------------------- /nnLayer.py: -------------------------------------------------------------------------------- 1 | from torch import nn as nn 2 | from torch.nn import functional as F 3 | import torch,time,os 4 | import numpy as np 5 | 6 | class TextEmbedding(nn.Module): 7 | def __init__(self, embedding, dropout=0.3, freeze=False, name='textEmbedding'): 8 | super(TextEmbedding, self).__init__() 9 | self.name = name 10 | self.embedding = nn.Embedding.from_pretrained(embedding, freeze=freeze) 11 | self.dropout = nn.Dropout(p=dropout) 12 | def forward(self, x): 13 | # x: batchSize × seqLen 14 | return self.dropout(self.embedding(x)) 15 | 16 | class TextCNN(nn.Module): 17 | def __init__(self, feaSize, contextSizeList, filterNum, name='textCNN'): 18 | super(TextCNN, self).__init__() 19 | self.name = name 20 | moduleList = [] 21 | for i in range(len(contextSizeList)): 22 | moduleList.append( 23 | nn.Sequential( 24 | nn.Conv1d(in_channels=feaSize, out_channels=filterNum, kernel_size=contextSizeList[i], padding=contextSizeList[i]//2), 25 | nn.BatchNorm1d(filterNum), 26 | nn.ReLU(), 27 | ) 28 | ) 29 | self.conv1dList = nn.ModuleList(moduleList) 30 | def forward(self, x): 31 | # x: batchSize × seqLen × feaSize 32 | x = x.transpose(1,2) # => batchSize × feaSize × seqLen 33 | x = [conv(x) for conv in self.conv1dList] # => scaleNum * (batchSize × filterNum × seqLen) 34 | return torch.cat(x, dim=1).transpose(1,2) # => batchSize × seqLen × scaleNum*filterNum 35 | 36 | class TextDeepCNN(nn.Module): 37 | def __init__(self, feaSize, filterNum, name='textDeepCNN'): 38 | super(TextDeepCNN, self).__init__() 39 | self.name = name 40 | self.conv1 = nn.Sequential( 41 | nn.Conv1d(in_channels=feaSize, out_channels=feaSize*2, kernel_size=1, padding=0), 42 | nn.BatchNorm1d(feaSize*2), 43 | nn.ReLU(), 44 | nn.Conv1d(in_channels=feaSize*2, out_channels=filterNum, kernel_size=1, padding=0), 45 | nn.BatchNorm1d(filterNum), 46 | nn.ReLU(), 47 | ) 48 | self.conv2 = nn.Sequential( 49 | nn.Conv1d(in_channels=feaSize, out_channels=feaSize*2, kernel_size=3, padding=1), 50 | nn.BatchNorm1d(feaSize*2), 51 | nn.ReLU(), 52 | nn.Conv1d(in_channels=feaSize*2, out_channels=feaSize*2, kernel_size=3, padding=1), 53 | nn.BatchNorm1d(feaSize*2), 54 | nn.ReLU(), 55 | nn.Conv1d(in_channels=feaSize*2, out_channels=filterNum, kernel_size=3, padding=1), 56 | nn.BatchNorm1d(filterNum), 57 | nn.ReLU(), 58 | ) 59 | self.conv3 = nn.Sequential( 60 | nn.Conv1d(in_channels=feaSize, out_channels=feaSize*2, kernel_size=7, padding=3), 61 | nn.BatchNorm1d(feaSize*2), 62 | nn.ReLU(), 63 | nn.Conv1d(in_channels=feaSize*2, out_channels=feaSize*4, kernel_size=3, padding=1), 64 | nn.BatchNorm1d(feaSize*4), 65 | nn.ReLU(), 66 | nn.Conv1d(in_channels=feaSize*4, out_channels=feaSize*4, kernel_size=3, padding=1), 67 | nn.BatchNorm1d(feaSize*4), 68 | nn.ReLU(), 69 | nn.Conv1d(in_channels=feaSize*4, out_channels=filterNum, kernel_size=3, padding=1), 70 | nn.BatchNorm1d(filterNum), 71 | nn.ReLU() 72 | ) 73 | def forward(self, x): 74 | # x: batchSize × seqLen × feaSize 75 | x = x.transpose(1,2) # => batchSize × feaSize × seqLen 76 | x = [self.conv1(x),self.conv2(x),self.conv3(x)] # => scaleNum * (batchSize × filterNum × seqLen) 77 | return torch.cat(x, dim=1).transpose(1,2) # => batchSize × seqLen × scaleNum*filterNum 78 | 79 | class TextBiGRU(nn.Module): 80 | def __init__(self, feaSize, hiddenSize, num_layers=1, dropout=0.0, name='textBiGRU'): 81 | super(TextBiGRU, self).__init__() 82 | self.name = name 83 | self.biGRU = nn.GRU(feaSize, hiddenSize, bidirectional=True, batch_first=True, num_layers=num_layers, dropout=dropout) 84 | def forward(self, x, xlen=None): 85 | # x: batchSizeh × seqLen × feaSize 86 | if xlen is not None: 87 | xlen, indices = torch.sort(xlen, descending=True) 88 | _, desortedIndices = torch.sort(indices, descending=False) 89 | 90 | x = nn.utils.rnn.pack_padded_sequence(x[indices], xlen, batch_first=True) 91 | output, hn = self.biGRU(x) # output: batchSize × seqLen × hiddenSize*2; hn: numLayers*2 × batchSize × hiddenSize 92 | if xlen is not None: 93 | output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True) 94 | return output[desortedIndices] 95 | return output # output: batchSize × seqLen × hiddenSize*2 96 | 97 | class TextTransformer(nn.Module): 98 | def __init__(self, featureSize, dk, multiNum, seqMaxLen, dropout=0.1, name='textTransformer'): 99 | super(TextTransformer, self).__init__() 100 | self.name = name 101 | self.dk = dk 102 | self.multiNum = multiNum 103 | self.WQ = nn.ModuleList([nn.Linear(featureSize, self.dk) for i in range(multiNum)]) 104 | self.WK = nn.ModuleList([nn.Linear(featureSize, self.dk) for i in range(multiNum)]) 105 | self.WV = nn.ModuleList([nn.Linear(featureSize, self.dk) for i in range(multiNum)]) 106 | self.WO = nn.Linear(self.dk*multiNum, featureSize) 107 | self.layerNorm1 = nn.LayerNorm([seqMaxLen, featureSize]) 108 | self.layerNorm2 = nn.LayerNorm([seqMaxLen, featureSize]) 109 | self.Wffn = nn.Sequential( 110 | nn.Linear(featureSize, featureSize*4), 111 | nn.ReLU(), 112 | nn.Linear(featureSize*4, featureSize) 113 | ) 114 | self.dropout = nn.Dropout(p=dropout) 115 | 116 | def forward(self, x): 117 | # x: batchSize × seqLen × feaSize 118 | queries = [self.WQ[i](x) for i in range(self.multiNum)] # => multiNum*(batchSize × seqLen × dk) 119 | keys = [self.WK[i](x) for i in range(self.multiNum)] # => multiNum*(batchSize × seqLen × dk) 120 | values = [self.WQ[i](x) for i in range(self.multiNum)] # => multiNum*(batchSize × seqLen × dk) 121 | score = [torch.bmm(queries[i], keys[i].transpose(1,2))/np.sqrt(self.dk) for i in range(self.multiNum)] # => multiNum*(batchSize × seqLen × seqLen) 122 | z = [self.dropout(torch.bmm(F.softmax(score[i], dim=2), values[i])) for i in range(self.multiNum)] # => multiNum*(batchSize × seqLen × dk) 123 | z = self.WO(torch.cat(z, dim=2)) # => batch × seqLen × feaSize 124 | z = self.layerNorm1(x + z) # => batchSize × seqLen × feaSize 125 | ffnx = self.Wffn(z) # => batchSize × seqLen × feaSize 126 | return self.layerNorm2(z + ffnx) # => batchSize × seqLen × feaSize 127 | 128 | class FocalCrossEntropyLoss(nn.Module): 129 | def __init__(self, gama=2, weight=-1, logit=True): 130 | super(FocalCrossEntropyLoss, self).__init__() 131 | self.weight = torch.nn.Parameter(torch.tensor(weight, dtype=torch.float32), requires_grad=False) 132 | self.gama = gama 133 | self.logit = logit 134 | def forward(self, Y_pre, Y): 135 | if self.logit: 136 | Y_pre = F.softmax(Y_pre, dim=1) 137 | P = Y_pre[list(range(len(Y))), Y] 138 | if self.weight != -1: 139 | w = self.weight[Y] 140 | else: 141 | w = torch.tensor([1.0 for i in range(len(Y))], device=self.weight.device) 142 | w = (w/w.sum()).reshape(-1) 143 | return (-w*((1-P)**self.gama * torch.log(P))).sum() 144 | 145 | class LinearRelu(nn.Module): 146 | def __init__(self, inSize, outSize, name='linearRelu'): 147 | super(LinearRelu, self).__init__() 148 | self.name = name 149 | self.layer = nn.Sequential( 150 | nn.ReLU(), 151 | nn.Linear(inSize, outSize), 152 | nn.ReLU() 153 | ) 154 | def forward(self, x): 155 | return self.layer(x) 156 | 157 | class MLP(nn.Module): 158 | def __init__(self, inSize, outSize, hiddenList=[], dropout=0.1, name='MLP', actFunc=nn.ReLU): 159 | super(MLP, self).__init__() 160 | self.name = name 161 | layers = nn.Sequential() 162 | for i,os in enumerate(hiddenList): 163 | layers.add_module(str(i*2), nn.Linear(inSize, os)) 164 | layers.add_module(str(i*2+1), actFunc()) 165 | inSize = os 166 | self.hiddenLayers = layers 167 | self.dropout = nn.Dropout(p=dropout) 168 | self.out = nn.Linear(inSize, outSize) 169 | def forward(self, x): 170 | x = self.hiddenLayers(x) 171 | return self.out(self.dropout(x)) 172 | 173 | class ContextNN(nn.Module): 174 | def __init__(self, seqMaxLen, name='contextNN'): 175 | super(ContextNN, self).__init__() 176 | self.name = name 177 | self.linear = nn.Linear(seqMaxLen, seqMaxLen) 178 | self.act = nn.ReLU() 179 | def forward(self, x): 180 | # x: batchSize × seqLen × feaSize 181 | x = x.transpose(1,2) # => batchSize × feaSize × seqLen 182 | x = self.linear(x).transpose(1,2) # => batchSize × seqLen × feaSize 183 | return self.act(x) -------------------------------------------------------------------------------- /SecondStructurePredictor.py: -------------------------------------------------------------------------------- 1 | from nnLayer import * 2 | from torch.nn import functional as F 3 | from tqdm import tqdm 4 | import pickle 5 | 6 | class Predictor_DCRNN: 7 | def __init__(self, weightPath, classNum=9, 8 | feaSize=128, filterNum=64, contextSizeList=[3,7,11], 9 | hiddenSize=512, num_layers=3, 10 | hiddenList=[2048], 11 | map_location="cpu", device=torch.device("cpu")): 12 | stateDict = torch.load(weightPath, map_location=map_location) 13 | self.seqItem2id,self.id2seqItem = stateDict['seqItem2id'],stateDict['id2seqItem'] 14 | self.secItem2id,self.id2secItem = stateDict['secItem2id'],stateDict['id2secItem'] 15 | self.k = int(weightPath[:-4].split('_')[-2][1:]) 16 | self.trainIdList,self.validIdList = stateDict['trainIdList'],stateDict['validIdList'] 17 | self.seqItem2id,self.id2seqItem = stateDict['seqItem2id'],stateDict['id2seqItem'] 18 | self.secItem2id,self.id2secItem = stateDict['secItem2id'],stateDict['id2secItem'] 19 | self.textEmbedding = TextEmbedding( torch.zeros((len(self.id2seqItem),feaSize), dtype=torch.float32) ).to(device) 20 | self.textCNN = TextCNN( feaSize, contextSizeList, filterNum ).to(device) 21 | self.textBiGRU = TextBiGRU(len(contextSizeList)*filterNum, hiddenSize, num_layers=num_layers).to(device) 22 | self.fcLinear = MLP(len(contextSizeList)*filterNum+hiddenSize*2, classNum, hiddenList).to(device) 23 | self.moduleList = nn.ModuleList([self.textEmbedding,self.textCNN,self.textBiGRU,self.fcLinear]) 24 | for module in self.moduleList: 25 | module.load_state_dict(stateDict[module.name]) 26 | module.eval() 27 | self.device = device 28 | print("%d epochs and %.3lf val Score 's model load finished."%(stateDict['epochs'], stateDict['bestMtc'])) 29 | 30 | def predict(self, seqData, batchSize=32): 31 | if type(seqData)==str: 32 | with open(seqData, 'r') as f: 33 | seqData = f.readlines() 34 | k = self.k 35 | seqData = [' '*(k//2)+i[:-1]+' '*(k//2) for i in seqData] 36 | seqData = [[seq[i-k//2:i+k//2+1] for i in range(k//2,len(seq)-k//2)] for seq in seqData] 37 | tokenizedSeq = np.array([[self.seqItem2id[i] if i in self.seqItem2id else self.seqItem2id[''] for i in seq] for seq in seqData]) 38 | seqMaxLen = np.array([len(seq)+1 for seq in seqData]).max() 39 | 40 | secPre = [] 41 | idList = list(range(len(tokenizedSeq))) 42 | print('Predicting...') 43 | for i in tqdm(range((len(idList)+batchSize-1)//batchSize)): 44 | samples = idList[i*batchSize:(i+1)*batchSize] 45 | batchSeq = torch.tensor([i+[0]*(seqMaxLen-len(i)) for i in tokenizedSeq[samples]], dtype=torch.long).to(self.device) 46 | batchSec = F.softmax(self._calculate_y_logit(batchSeq), dim=2).cpu().data.numpy() 47 | secPre.append(batchSec) 48 | secPre = np.vstack(secPre).astype('float32') 49 | print('Finished!') 50 | return secPre,[len(seq) for seq in seqData] 51 | 52 | def _calculate_y_logit(self, X): 53 | X = self.textEmbedding(X) # => batchSize × seqLen × feaSize 54 | X_conved = self.textCNN(X) # => batchSize × seqLen × scaleNum*filterNum 55 | X_BiGRUed = self.textBiGRU(X_conved, None) # => batchSize × seqLen × hiddenSize*2 56 | X = torch.cat([X_conved,X_BiGRUed], dim=2) # => batchSize × seqLen × (scaleNum*filterNum+hiddenSize*2) 57 | return self.fcLinear(X) # => batchSize × seqLen × classNum 58 | 59 | class Predictor_OneHotBP: 60 | def __init__(self, weightPath, classNum=8, 61 | feaSize=39, seqLen=79, hiddenList=[2048], 62 | map_location="cpu", device=torch.device("cpu")): 63 | stateDict = torch.load(weightPath, map_location=map_location) 64 | self.seqItem2id,self.id2seqItem = stateDict['seqItem2id'],stateDict['id2seqItem'] 65 | self.secItem2id,self.id2secItem = stateDict['secItem2id'],stateDict['id2secItem'] 66 | self.window = seqLen 67 | self.trainIdList,self.validIdList = stateDict['trainIdList'],stateDict['validIdList'] 68 | self.seqItem2id,self.id2seqItem = stateDict['seqItem2id'],stateDict['id2seqItem'] 69 | self.secItem2id,self.id2secItem = stateDict['secItem2id'],stateDict['id2secItem'] 70 | self.textEmbedding = TextEmbedding( torch.zeros((len(self.id2seqItem),feaSize), dtype=torch.float32) ).to(device) 71 | self.fcLinear = MLP(feaSize*seqLen, classNum, hiddenList).to(device) 72 | self.moduleList = nn.ModuleList([self.textEmbedding,self.fcLinear]) 73 | for module in self.moduleList: 74 | module.load_state_dict(stateDict[module.name]) 75 | module.eval() 76 | self.device = device 77 | print("%d epochs and %.3lf val Score 's model load finished."%(stateDict['epochs'], stateDict['bestMtc'])) 78 | 79 | def predict(self, seqData, batchSize=10240): 80 | window = self.window 81 | if type(seqData)==str: 82 | with open(seqData, 'r') as f: 83 | rawData = [i[:-1] for i in f.readlines()] 84 | 85 | seqData = [] 86 | for seq in rawData: 87 | seq = ' '*(window//2) + seq + ' '*(window//2) 88 | seqData += [seq[i-window//2:i+window//2+1] for i in range(window//2,len(seq)-window//2)] 89 | tokenizedSeq = np.array([[self.seqItem2id[i] if i in self.seqItem2id else self.seqItem2id[''] for i in seq] for seq in seqData]) 90 | 91 | secPre = [] 92 | idList = list(range(len(tokenizedSeq))) 93 | print('Predicting...') 94 | for i in tqdm(range((len(idList)+batchSize-1)//batchSize)): 95 | samples = idList[i*batchSize:(i+1)*batchSize] 96 | batchSeq = torch.tensor(tokenizedSeq[samples], dtype=torch.long).to(self.device) 97 | batchSec = F.softmax(self._calculate_y_logit(batchSeq), dim=1).cpu().data.numpy() 98 | secPre.append(batchSec) 99 | secPre = np.vstack(secPre).astype('float32') 100 | print('Finished!') 101 | return secPre,[len(seq) for seq in rawData] 102 | 103 | def _calculate_y_logit(self, X): 104 | X = self.textEmbedding(X) # => batchSize × seqLen × feaSize 105 | X = torch.flatten(X, start_dim=1) # => batchSize × seqLen*feaSize 106 | return self.fcLinear(X) # => batchSize × classNum 107 | 108 | class Predictor_final: 109 | def __init__(self, weightPath, classNum=9, 110 | feaSize=64, filterNum=128, contextSizeList=[1,9,81], 111 | hiddenSize=512, num_layers=3, 112 | hiddenList=[2048], 113 | map_location="cpu", device=torch.device("cpu")): 114 | stateDict = torch.load(weightPath, map_location=map_location) 115 | self.seqItem2id,self.id2seqItem = stateDict['seqItem2id'],stateDict['id2seqItem'] 116 | self.secItem2id,self.id2secItem = stateDict['secItem2id'],stateDict['id2secItem'] 117 | self.trainIdList,self.validIdList = stateDict['trainIdList'],stateDict['validIdList'] 118 | self.seqItem2id,self.id2seqItem = stateDict['seqItem2id'],stateDict['id2seqItem'] 119 | self.secItem2id,self.id2secItem = stateDict['secItem2id'],stateDict['id2secItem'] 120 | self.textEmbedding = TextEmbedding( torch.zeros((len(self.id2seqItem),feaSize-39), dtype=torch.float32) ).to(device) 121 | self.feaEmbedding = TextEmbedding( torch.zeros((len(self.id2seqItem),39), dtype=torch.float32), freeze=True, name='feaEmbedding' ).to(device) 122 | self.textCNN = TextCNN( feaSize, contextSizeList, filterNum ).to(device) 123 | self.textBiGRU = TextBiGRU(len(contextSizeList)*filterNum, hiddenSize, num_layers=num_layers).to(device) 124 | self.fcLinear = MLP(len(contextSizeList)*filterNum+hiddenSize*2, classNum, hiddenList).to(device) 125 | self.moduleList = nn.ModuleList([self.textEmbedding,self.feaEmbedding,self.textCNN,self.textBiGRU,self.fcLinear]) 126 | for module in self.moduleList: 127 | module.load_state_dict(stateDict[module.name]) 128 | module.eval() 129 | self.device = device 130 | print("%d epochs and %.3lf val Score 's model load finished."%(stateDict['epochs'], stateDict['bestMtc'])) 131 | 132 | def predict(self, seqData, batchSize=32): 133 | if type(seqData)==str: 134 | with open(seqData, 'r') as f: 135 | seqData = f.readlines() 136 | seqData = [i[:-1] for i in seqData] 137 | tokenizedSeq = np.array([[self.seqItem2id[i] if i in self.seqItem2id else self.seqItem2id[''] for i in seq] for seq in seqData]) 138 | seqMaxLen = np.array([len(seq)+1 for seq in seqData]).max() 139 | 140 | secPre = [] 141 | idList = list(range(len(tokenizedSeq))) 142 | print('Predicting...') 143 | for i in tqdm(range((len(idList)+batchSize-1)//batchSize)): 144 | samples = idList[i*batchSize:(i+1)*batchSize] 145 | batchSeq = torch.tensor([i+[0]*(seqMaxLen-len(i)) for i in tokenizedSeq[samples]], dtype=torch.long).to(self.device) 146 | batchSec = F.softmax(self._calculate_y_logit(batchSeq), dim=2).cpu().data.numpy() 147 | secPre.append(batchSec) 148 | secPre = np.vstack(secPre).astype('float32') 149 | print('Finished!') 150 | return secPre,[len(seq) for seq in seqData] 151 | 152 | def _calculate_y_logit(self, X): 153 | X = torch.cat([self.textEmbedding(X),self.feaEmbedding(X)], dim=2) # => batchSize × seqLen × feaSize 154 | X_conved = self.textCNN(X) # => batchSize × seqLen × scaleNum*filterNum 155 | X_BiGRUed = self.textBiGRU(X_conved, None) # => batchSize × seqLen × hiddenSize*2 156 | X = torch.cat([X_conved,X_BiGRUed], dim=2) # => batchSize × seqLen × (scaleNum*filterNum+hiddenSize*2) 157 | return self.fcLinear(X) # => batchSize × seqLen × classNum -------------------------------------------------------------------------------- /DL_ClassifierModel.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch,time,os,pickle 3 | from torch import nn as nn 4 | from nnLayer import * 5 | from metrics import * 6 | from collections import Counter,Iterable 7 | from sklearn.model_selection import KFold 8 | 9 | class BaseClassifier: 10 | def __init__(self): 11 | pass 12 | def calculate_y_logit(self, X, XLen): 13 | pass 14 | def cv_train(self, dataClass, trainSize=256, batchSize=256, epoch=100, stopRounds=10, earlyStop=10, saveRounds=1, augmentation=0.05, 15 | optimType='Adam', lr=0.001, weightDecay=0, kFold=5, isHigherBetter=True, metrics="Score", report=["ACC", "MaF", "Score"], 16 | savePath='model'): 17 | kf = KFold(n_splits=kFold) 18 | validRes = [] 19 | for i,(trainIndices,validIndices) in enumerate(kf.split(range(dataClass.totalSampleNum))): 20 | print(f'CV_{i+1}:') 21 | self.reset_parameters() 22 | dataClass.trainIdList,dataClass.validIdList = trainIndices,validIndices 23 | dataClass.trainSampleNum,self.validSampleNum = len(trainIndices),len(validIndices) 24 | dataClass.describe() 25 | res = self.train(dataClass,trainSize,batchSize,epoch,stopRounds,earlyStop,saveRounds,augmentation,optimType,lr,weightDecay, 26 | isHigherBetter,metrics,report,f"{savePath}_cv{i+1}") 27 | validRes.append(res) 28 | Metrictor.table_show(validRes, report) 29 | def train(self, dataClass, trainSize=256, batchSize=256, epoch=100, stopRounds=10, earlyStop=10, saveRounds=1, augmentation=0.05, 30 | optimType='Adam', lr=0.001, weightDecay=0, isHigherBetter=True, metrics="Score", report=["ACC", "MaF", "Score"], 31 | savePath='model'): 32 | assert batchSize%trainSize==0 33 | metrictor = Metrictor(dataClass.classNum) 34 | self.stepCounter = 0 35 | self.stepUpdate = batchSize//trainSize 36 | optimizer = torch.optim.Adam(self.moduleList.parameters(), lr=lr, weight_decay=weightDecay) 37 | trainStream = dataClass.random_batch_data_stream(batchSize=trainSize, type='train', device=self.device, augmentation=augmentation) 38 | itersPerEpoch = (dataClass.trainSampleNum+trainSize-1)//trainSize 39 | mtc,bestMtc,stopSteps = 0.0,0.0,0 40 | if dataClass.validSampleNum>0: validStream = dataClass.random_batch_data_stream(batchSize=trainSize, type='valid', device=self.device, augmentation=augmentation) 41 | st = time.time() 42 | for e in range(epoch): 43 | for i in range(itersPerEpoch): 44 | self.to_train_mode() 45 | X,Y = next(trainStream) 46 | loss = self._train_step(X,Y, optimizer) 47 | if stopRounds>0 and (e*itersPerEpoch+i+1)%stopRounds==0: 48 | self.to_eval_mode() 49 | print(f"After iters {e*itersPerEpoch+i+1}: [train] loss= {loss:.3f};", end='') 50 | if dataClass.validSampleNum>0: 51 | X,Y = next(validStream) 52 | loss = self.calculate_loss(X,Y) 53 | print(f' [valid] loss= {loss:.3f};', end='') 54 | restNum = ((itersPerEpoch-i-1)+(epoch-e-1)*itersPerEpoch)*trainSize 55 | speed = (e*itersPerEpoch+i+1)*trainSize/(time.time()-st) 56 | print(" speed: %.3lf items/s; remaining time: %.3lfs;"%(speed, restNum/speed)) 57 | if dataClass.validSampleNum>0 and (e+1)%saveRounds==0: 58 | self.to_eval_mode() 59 | print(f'========== Epoch:{e+1:5d} ==========') 60 | #Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='train', device=self.device)) 61 | #metrictor.set_data(Y_pre, Y) 62 | #print(f'[Total Train]',end='') 63 | #metrictor(report) 64 | print(f'[Total Valid]',end='') 65 | Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='valid', device=self.device)) 66 | metrictor.set_data(Y_pre, Y) 67 | res = metrictor(report) 68 | mtc = res[metrics] 69 | print('=================================') 70 | if (mtc>bestMtc and isHigherBetter) or (mtc=earlyStop: 78 | print(f'The val {metrics} has not improved for more than {earlyStop} steps in epoch {e+1}, stop training.') 79 | break 80 | self.load("%s.pkl"%savePath) 81 | os.rename("%s.pkl"%savePath, "%s_%s.pkl"%(savePath, ("%.3lf"%bestMtc)[2:])) 82 | print(f'============ Result ============') 83 | Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='train', device=self.device)) 84 | metrictor.set_data(Y_pre, Y) 85 | print(f'[Total Train]',end='') 86 | metrictor(report) 87 | Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='valid', device=self.device)) 88 | metrictor.set_data(Y_pre, Y) 89 | print(f'[Total Valid]',end='') 90 | res = metrictor(report) 91 | metrictor.each_class_indictor_show(dataClass.id2secItem) 92 | print(f'================================') 93 | return res 94 | def reset_parameters(self): 95 | for module in self.moduleList: 96 | for subModule in module.modules(): 97 | if hasattr(subModule, "reset_parameters"): 98 | subModule.reset_parameters() 99 | def save(self, path, epochs, bestMtc=None, dataClass=None): 100 | stateDict = {'epochs':epochs, 'bestMtc':bestMtc} 101 | for module in self.moduleList: 102 | stateDict[module.name] = module.state_dict() 103 | if dataClass is not None: 104 | stateDict['trainIdList'],stateDict['validIdList'] = dataClass.trainIdList,dataClass.validIdList 105 | stateDict['seqItem2id'],stateDict['id2seqItem'] = dataClass.seqItem2id,dataClass.id2seqItem 106 | stateDict['secItem2id'],stateDict['id2secItem'] = dataClass.secItem2id,dataClass.id2secItem 107 | torch.save(stateDict, path) 108 | print('Model saved in "%s".'%path) 109 | def load(self, path, map_location=None, dataClass=None): 110 | parameters = torch.load(path, map_location=map_location) 111 | for module in self.moduleList: 112 | module.load_state_dict(parameters[module.name]) 113 | if dataClass is not None: 114 | dataClass.trainIdList,dataClass.validIdList = parameters['trainIdList'],parameters['validIdList'] 115 | dataClass.seqItem2id,dataClass.id2seqItem = parameters['seqItem2id'],parameters['id2seqItem'] 116 | dataClass.secItem2id,dataClass.id2secItem = parameters['secItem2id'],parameters['id2secItem'] 117 | print("%d epochs and %.3lf val Score 's model load finished."%(parameters['epochs'], parameters['bestMtc'])) 118 | def calculate_y_prob(self, X): 119 | Y_pre = self.calculate_y_logit(X) 120 | return torch.softmax(Y_pre, dim=1) 121 | def calculate_y(self, X): 122 | Y_pre = self.calculate_y_prob(X) 123 | return torch.argmax(Y_pre, dim=1) 124 | def calculate_loss(self, X, Y): 125 | Y_logit = self.calculate_y_logit(X) 126 | return self.criterion(Y_logit, Y) 127 | def calculate_indicator_by_iterator(self, dataStream, classNum, report): 128 | metrictor = Metrictor(classNum) 129 | Y_prob_pre,Y = self.calculate_y_prob_by_iterator(dataStream) 130 | metrictor.set_data(Y_prob_pre, Y) 131 | return metrictor(report) 132 | def calculate_y_prob_by_iterator(self, dataStream): 133 | YArr,Y_preArr = [],[] 134 | while True: 135 | try: 136 | X,Y = next(dataStream) 137 | except: 138 | break 139 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 140 | YArr.append(Y) 141 | Y_preArr.append(Y_pre) 142 | YArr,Y_preArr = np.hstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 143 | return Y_preArr, YArr 144 | def calculate_y_by_iterator(self, dataStream): 145 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 146 | return Y_preArr.argmax(axis=1), YArr 147 | def to_train_mode(self): 148 | for module in self.moduleList: 149 | module.train() 150 | def to_eval_mode(self): 151 | for module in self.moduleList: 152 | module.eval() 153 | def _train_step(self, X, Y, optimizer): 154 | self.stepCounter += 1 155 | if self.stepCounter batchSizeh × seqLen × feaSize 181 | X = self.textCNN(X) # => batchSize × seqLen × scaleNum*filterNum 182 | return self.fcLinear(X) # => batchSize × seqLen × classNum 183 | def calculate_y_prob(self, X): 184 | Y_pre = self.calculate_y_logit(X) 185 | return torch.softmax(Y_pre, dim=2) 186 | def calculate_y(self, X): 187 | Y_pre = self.calculate_y_prob(X) 188 | return torch.argmax(Y_pre, dim=2) 189 | def calculate_y_by_iterator(self, dataStream): 190 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 191 | return Y_preArr.argmax(axis=2), YArr 192 | def calculate_loss(self, X, Y): 193 | Y_logit = self.calculate_y_logit(X) 194 | Y = Y.reshape(-1) 195 | Y_logit = Y_logit.reshape(len(Y),-1) 196 | return self.criterion(Y_logit, Y) 197 | def calculate_y_prob_by_iterator(self, dataStream): 198 | YArr,Y_preArr = [],[] 199 | while True: 200 | try: 201 | X,Y = next(dataStream) 202 | except: 203 | break 204 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 205 | YArr.append(Y) 206 | Y_preArr.append(Y_pre) 207 | YArr,Y_preArr = np.vstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 208 | return Y_preArr, YArr 209 | 210 | 211 | class TextClassifier_Transformer(BaseClassifier): 212 | def __init__(self, classNum, embedding, seqMaxLen, feaSize=128, dk=64, multiNum=8, hiddenList=[], 213 | embDropout=0.3, fcDropout=0.5, 214 | useFocalLoss=False, weight=-1, device=torch.device("cuda:0")): 215 | self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float),dropout=embDropout ).to(device) 216 | self.textTransformer1 = TextTransformer(feaSize, dk, multiNum, seqMaxLen).to(device) 217 | self.fcLinear1 = LinearRelu(feaSize, feaSize*2, name='fcLinear1').to(device) 218 | self.textTransformer2 = TextTransformer(feaSize*2, dk, multiNum, seqMaxLen).to(device) 219 | self.fcLinear2 = LinearRelu(feaSize*2, feaSize*4, name='fcLinear2').to(device) 220 | self.textTransformer3 = TextTransformer(feaSize*4, dk, multiNum, seqMaxLen).to(device) 221 | self.fcLinear = MLP(feaSize*4, classNum, hiddenList, fcDropout).to(device) 222 | self.fcLinear1.name,self.fcLinear2.name = 'fcLinear1','fcLinear2' 223 | self.moduleList = nn.ModuleList([self.textEmbedding, 224 | self.textTransformer1, self.textTransformer2, self.textTransformer3, 225 | self.fcLinear1,self.fcLinear2,self.fcLinear]) 226 | self.classNum = classNum 227 | self.device = device 228 | self.feaSize = feaSize 229 | self.criterion = nn.CrossEntropyLoss() if not useFocalLoss else FocalCrossEntropyLoss(weight=weight).to(device) 230 | 231 | self.PE = torch.tensor([[[np.sin(pos/10000**(2*i/feaSize)) if i%2==0 else np.cos(pos/10000**(2*i/feaSize)) for i in range(1,feaSize+1)] for pos in range(1,seqMaxLen+1)]], dtype=torch.float32, device=device) 232 | def calculate_y_logit(self, X): 233 | X = self.textEmbedding(X['seqArr'])+self.PE # => batchSizeh × seqLen × feaSize 234 | X = self.textTransformer1(X) # => batchSize × seqLen × feaSize 235 | X = self.fcLinear1(X) # => batchSize × seqLen × feaSize*2 236 | X = self.textTransformer2(X) # => batchSize × seqLen × feaSize*2 237 | X = self.fcLinear2(X) # => batchSize × seqLen × feaSize*4 238 | X = self.textTransformer3(X) # => batchSize × seqLen × feaSize*4 239 | return self.fcLinear(X) # => batchSize × seqLen × classNum 240 | def calculate_y_prob(self, X): 241 | Y_pre = self.calculate_y_logit(X) 242 | return torch.softmax(Y_pre, dim=2) 243 | def calculate_y(self, X): 244 | Y_pre = self.calculate_y_prob(X) 245 | return torch.argmax(Y_pre, dim=2) 246 | def calculate_y_by_iterator(self, dataStream): 247 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 248 | return Y_preArr.argmax(axis=2), YArr 249 | def calculate_loss(self, X, Y): 250 | Y_logit = self.calculate_y_logit(X) 251 | Y = Y.reshape(-1) 252 | Y_logit = Y_logit.reshape(len(Y),-1) 253 | return self.criterion(Y_logit, Y) 254 | def calculate_y_prob_by_iterator(self, dataStream): 255 | YArr,Y_preArr = [],[] 256 | while True: 257 | try: 258 | X,Y = next(dataStream) 259 | except: 260 | break 261 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 262 | YArr.append(Y) 263 | Y_preArr.append(Y_pre) 264 | YArr,Y_preArr = np.vstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 265 | return Y_preArr, YArr 266 | 267 | class TextClassifier_BiGRU(BaseClassifier): 268 | def __init__(self, classNum, embedding, feaSize=128, hiddenSize=256, hiddenList=[], 269 | embDropout=0.3, fcDropout=0.5, num_layers=1, 270 | useFocalLoss=False, weight=-1, device=torch.device("cuda:0")): 271 | self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float),dropout=embDropout ).to(device) 272 | self.textBiGRU = TextBiGRU(feaSize, hiddenSize, num_layers=num_layers, dropout=0.1).to(device) 273 | self.fcLinear = MLP(hiddenSize*2, classNum, hiddenList, fcDropout).to(device) 274 | self.moduleList = nn.ModuleList([self.textEmbedding, self.textBiGRU, self.fcLinear]) 275 | self.classNum = classNum 276 | self.device = device 277 | self.feaSize = feaSize 278 | self.criterion = nn.CrossEntropyLoss() if not useFocalLoss else FocalCrossEntropyLoss(weight=weight).to(device) 279 | def calculate_y_logit(self, X): 280 | X,XLen = X['seqArr'],X['seqLenArr'] 281 | X = self.textEmbedding(X) 282 | X = self.textBiGRU(X, None) # => batchSize × seqLen × hiddenSize*2 283 | return self.fcLinear(X) # => batchSize × seqLen × classNum 284 | def calculate_y_prob(self, X): 285 | Y_pre = self.calculate_y_logit(X) 286 | return torch.softmax(Y_pre, dim=2) 287 | def calculate_y(self, X): 288 | Y_pre = self.calculate_y_prob(X) 289 | return torch.argmax(Y_pre, dim=2) 290 | def calculate_y_by_iterator(self, dataStream): 291 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 292 | return Y_preArr.argmax(axis=2), YArr 293 | def calculate_loss(self, X, Y): 294 | Y_logit = self.calculate_y_logit(X) 295 | Y = Y.reshape(-1) 296 | Y_logit = Y_logit.reshape(len(Y),-1) 297 | return self.criterion(Y_logit, Y) 298 | def calculate_y_prob_by_iterator(self, dataStream): 299 | YArr,Y_preArr = [],[] 300 | while True: 301 | try: 302 | X,Y = next(dataStream) 303 | except: 304 | break 305 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 306 | YArr.append(Y) 307 | Y_preArr.append(Y_pre) 308 | YArr,Y_preArr = np.vstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 309 | return Y_preArr, YArr 310 | 311 | class DCRNN(BaseClassifier): 312 | def __init__(self, classNum, embedding, feaSize=64, 313 | filterNum=64, contextSizeList=[3,7,11], 314 | hiddenSize=512, num_layers=3, 315 | hiddenList=[2048], 316 | embDropout=0.2, BiGRUDropout=0.2, fcDropout=0.4, 317 | useFocalLoss=False, weight=-1, device=torch.device("cuda:0")): 318 | self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float),dropout=embDropout ).to(device) 319 | self.textCNN = TextCNN( feaSize, contextSizeList, filterNum ).to(device) 320 | self.textBiGRU = TextBiGRU(len(contextSizeList)*filterNum, hiddenSize, num_layers=num_layers, dropout=BiGRUDropout).to(device) 321 | self.fcLinear = MLP(len(contextSizeList)*filterNum+hiddenSize*2, classNum, hiddenList, fcDropout).to(device) 322 | self.moduleList = nn.ModuleList([self.textEmbedding,self.textCNN,self.textBiGRU,self.fcLinear]) 323 | self.classNum = classNum 324 | self.device = device 325 | self.feaSize = feaSize 326 | self.criterion = nn.CrossEntropyLoss() if not useFocalLoss else FocalCrossEntropyLoss(weight=weight).to(device) 327 | def calculate_y_logit(self, X): 328 | X = X['seqArr'] 329 | X = self.textEmbedding(X) # => batchSize × seqLen × feaSize 330 | X_conved = self.textCNN(X) # => batchSize × seqLen × scaleNum*filterNum 331 | X_BiGRUed = self.textBiGRU(X_conved, None) # => batchSize × seqLen × hiddenSize*2 332 | X = torch.cat([X_conved,X_BiGRUed], dim=2) # => batchSize × seqLen × (scaleNum*filterNum+hiddenSize*2) 333 | return self.fcLinear(X) # => batchSize × seqLen × classNum 334 | def calculate_y_prob(self, X): 335 | Y_pre = self.calculate_y_logit(X) 336 | return torch.softmax(Y_pre, dim=2) 337 | def calculate_y(self, X): 338 | Y_pre = self.calculate_y_prob(X) 339 | return torch.argmax(Y_pre, dim=2) 340 | def calculate_y_by_iterator(self, dataStream): 341 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 342 | return Y_preArr.argmax(axis=2), YArr 343 | def calculate_loss(self, X, Y): 344 | Y_logit = self.calculate_y_logit(X) 345 | Y = Y.reshape(-1) 346 | Y_logit = Y_logit.reshape(len(Y),-1) 347 | return self.criterion(Y_logit, Y) 348 | def calculate_y_prob_by_iterator(self, dataStream): 349 | YArr,Y_preArr = [],[] 350 | while True: 351 | try: 352 | X,Y = next(dataStream) 353 | except: 354 | break 355 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 356 | YArr.append(Y) 357 | Y_preArr.append(Y_pre) 358 | YArr,Y_preArr = np.vstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 359 | return Y_preArr, YArr 360 | 361 | class NormalNN(BaseClassifier): 362 | def __init__(self, classNum, embedding, seqMaxLen, feaSize, 363 | hiddenList=[2048], 364 | embDropout=0.2, fcDropout=0.4, 365 | useFocalLoss=False, weight=-1, device=torch.device("cuda:0")): 366 | self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float),freeze=True,dropout=embDropout ).to(device) 367 | self.textContextNN = ContextNN( seqMaxLen ).to(device) 368 | self.fcLinear = MLP(feaSize*2, classNum, hiddenList, fcDropout).to(device) 369 | self.moduleList = nn.ModuleList([self.textEmbedding,self.textContextNN,self.fcLinear]) 370 | self.classNum = classNum 371 | self.device = device 372 | self.feaSize = feaSize 373 | self.criterion = nn.CrossEntropyLoss() if not useFocalLoss else FocalCrossEntropyLoss(weight=weight).to(device) 374 | def calculate_y_logit(self, X): 375 | X = X['seqArr'] 376 | X = self.textEmbedding(X) # => batchSize × seqLen × feaSize 377 | context = self.textContextNN(X) # => batchSize × seqLen × feaSize 378 | X = torch.cat([X, context], dim=2) # => batchSize × seqLen × 2*feaSize 379 | return self.fcLinear(X) # => batchSize × seqLen × classNum 380 | def calculate_y_prob(self, X): 381 | Y_pre = self.calculate_y_logit(X) 382 | return torch.softmax(Y_pre, dim=2) 383 | def calculate_y(self, X): 384 | Y_pre = self.calculate_y_prob(X) 385 | return torch.argmax(Y_pre, dim=2) 386 | def calculate_y_by_iterator(self, dataStream): 387 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 388 | return Y_preArr.argmax(axis=2), YArr 389 | def calculate_loss(self, X, Y): 390 | Y_logit = self.calculate_y_logit(X) 391 | Y = Y.reshape(-1) 392 | Y_logit = Y_logit.reshape(len(Y),-1) 393 | return self.criterion(Y_logit, Y) 394 | def calculate_y_prob_by_iterator(self, dataStream): 395 | YArr,Y_preArr = [],[] 396 | while True: 397 | try: 398 | X,Y = next(dataStream) 399 | except: 400 | break 401 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 402 | YArr.append(Y) 403 | Y_preArr.append(Y_pre) 404 | YArr,Y_preArr = np.vstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 405 | return Y_preArr, YArr 406 | 407 | class OnehotNN(BaseClassifier): 408 | def __init__(self, classNum, embedding, seqLen, feaSize, 409 | hiddenList=[2048], 410 | embDropout=0.2, fcDropout=0.4, 411 | useFocalLoss=False, weight=-1, device=torch.device('cuda')): 412 | self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float),freeze=True,dropout=embDropout ).to(device) 413 | self.fcLinear = MLP(feaSize*seqLen, classNum, hiddenList, fcDropout).to(device) 414 | self.moduleList = nn.ModuleList([self.textEmbedding, self.fcLinear]) 415 | self.classNum = classNum 416 | self.device = device 417 | self.feaSize = feaSize 418 | self.criterion = nn.CrossEntropyLoss() if not useFocalLoss else FocalCrossEntropyLoss(weight=weight).to(device) 419 | def calculate_y_logit(self, X): 420 | X = X['seqArr'] 421 | X = self.textEmbedding(X) # => batchSize × seqLen × feaSize 422 | X = torch.flatten(X, start_dim=1) # => batchSize × seqLen*feaSize 423 | return self.fcLinear(X) # => batchSize × classNum 424 | 425 | 426 | class FinalModel(BaseClassifier): 427 | def __init__(self, classNum, embedding, feaEmbedding, feaSize=64, 428 | filterNum=128, contextSizeList=[1,9,81], 429 | hiddenSize=512, num_layers=3, 430 | hiddenList=[2048], 431 | embDropout=0.2, BiGRUDropout=0.2, fcDropout=0.4, 432 | useFocalLoss=False, weight=-1, device=torch.device("cuda:0")): 433 | self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float),dropout=embDropout ).to(device) 434 | self.feaEmbedding = TextEmbedding( torch.tensor(feaEmbedding, dtype=torch.float),dropout=embDropout//2,name='feaEmbedding',freeze=True ).to(device) 435 | self.textCNN = TextCNN( feaSize, contextSizeList, filterNum ).to(device) 436 | self.textBiGRU = TextBiGRU(len(contextSizeList)*filterNum, hiddenSize, num_layers=num_layers, dropout=BiGRUDropout).to(device) 437 | self.fcLinear = MLP(len(contextSizeList)*filterNum+hiddenSize*2, classNum, hiddenList, fcDropout).to(device) 438 | self.moduleList = nn.ModuleList([self.textEmbedding,self.feaEmbedding,self.textCNN,self.textBiGRU,self.fcLinear]) 439 | self.classNum = classNum 440 | self.device = device 441 | self.feaSize = feaSize 442 | self.criterion = nn.CrossEntropyLoss() if not useFocalLoss else FocalCrossEntropyLoss(weight=weight).to(device) 443 | def calculate_y_logit(self, X): 444 | X = X['seqArr'] 445 | X = torch.cat([self.textEmbedding(X),self.feaEmbedding(X)], dim=2) # => batchSize × seqLen × feaSize 446 | X_conved = self.textCNN(X) # => batchSize × seqLen × scaleNum*filterNum 447 | X_BiGRUed = self.textBiGRU(X_conved, None) # => batchSize × seqLen × hiddenSize*2 448 | X = torch.cat([X_conved,X_BiGRUed], dim=2) # => batchSize × seqLen × (scaleNum*filterNum+hiddenSize*2) 449 | return self.fcLinear(X) # => batchSize × seqLen × classNum 450 | def calculate_y_prob(self, X): 451 | Y_pre = self.calculate_y_logit(X) 452 | return torch.softmax(Y_pre, dim=2) 453 | def calculate_y(self, X): 454 | Y_pre = self.calculate_y_prob(X) 455 | return torch.argmax(Y_pre, dim=2) 456 | def calculate_y_by_iterator(self, dataStream): 457 | Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream) 458 | return Y_preArr.argmax(axis=2), YArr 459 | def calculate_loss(self, X, Y): 460 | Y_logit = self.calculate_y_logit(X) 461 | Y = Y.reshape(-1) 462 | Y_logit = Y_logit.reshape(len(Y),-1) 463 | return self.criterion(Y_logit, Y) 464 | def calculate_y_prob_by_iterator(self, dataStream): 465 | YArr,Y_preArr = [],[] 466 | while True: 467 | try: 468 | X,Y = next(dataStream) 469 | except: 470 | break 471 | Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy() 472 | YArr.append(Y) 473 | Y_preArr.append(Y_pre) 474 | YArr,Y_preArr = np.vstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32') 475 | return Y_preArr, YArr -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | from sklearn.model_selection import train_test_split 2 | from sklearn.preprocessing import OneHotEncoder 3 | from gensim.models import Word2Vec 4 | import numpy as np 5 | from tqdm import tqdm 6 | import os,logging,pickle,random,torch 7 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 8 | 9 | A_pssm = [0.872226004,0.35351513,0.291072682,0.312680229,0.340611722,0.221741296,0.186079503,0.158243969,0.323860653,0.284587638,0.245009849,0.335872147,0.414415237,0.3854859,0.462152589,0.415184604,0.415678055,0.162960674,0.301396213,0.20685564] 10 | C_pssm = [0.131367262,0.993210741,0.051014628,0.070738663,0.141343773,0.100584273,0.083674474,0.252771442,0.099405786,0.13155555,0.076118584,0.084907803,0.09452313,0.097157337,0.099119889,0.088962404,0.12053233,0.263446631,0.118959045,0.039528232] 11 | E_pssm = [0.244880490,0.296275879,0.647010458,0.882901417,0.374764691,0.099624506,0.069595497,0.203147438,0.418484161,0.264321695,0.066371629,0.662291457,0.444100017,0.393008897,0.439249980,0.373571161,0.322208045,0.082964380,0.166934354,0.131134815] 12 | D_pssm = [0.296141334,0.249709671,0.850820407,0.512991913,0.265338090,0.166939991,0.110098353,0.170262548,0.507233108,0.157820406,0.104848505,0.431656491,0.494132735,0.374609189,0.448848640,0.442543182,0.263501172,0.173610625,0.197364754,0.224008087] 13 | G_pssm = [0.134500843,0.138081132,0.083494184,0.077593987,0.092954086,0.914093711,0.289071932,0.239061378,0.087289090,0.205822566,0.465739703,0.147806816,0.154850238,0.122176565,0.150144272,0.147839920,0.183952228,0.602706325,0.333488820,0.761599454] 14 | F_pssm = [0.342548214,0.270339354,0.195856200,0.171412621,0.922893893,0.103036555,0.088571455,0.214378914,0.223065930,0.095590198,0.063541687,0.275854077,0.281346349,0.229726456,0.385095037,0.233176818,0.209489150,0.096825142,0.123902125,0.076653113] 15 | I_pssm = [0.176415172,0.188443798,0.308357998,0.275629523,0.235234856,0.260304213,0.082515323,0.932749953,0.306385104,0.140879156,0.107014411,0.360545061,0.366051301,0.258562316,0.334060816,0.326275232,0.232563130,0.174383626,0.204813114,0.460118278] 16 | H_pssm = [0.143569076,0.134224827,0.104347336,0.128179912,0.109001577,0.199901391,0.908161712,0.144819942,0.130959143,0.426831691,0.605280251,0.196738287,0.138929626,0.151766316,0.136858734,0.137098916,0.232657396,0.400298615,0.673472124,0.211236731] 17 | K_pssm = [0.298524389,0.266401812,0.568084594,0.301019405,0.265933301,0.195842778,0.121579844,0.212033239,0.861729735,0.150797931,0.154139928,0.407340393,0.416233547,0.350466473,0.424457415,0.672058897,0.319754166,0.293990392,0.199877922,0.308680273] 18 | M_pssm = [0.173967682,0.197009826,0.081423949,0.105579666,0.094623978,0.336778656,0.659291616,0.172585245,0.174438823,0.524163387,0.886962625,0.179483812,0.216780932,0.145710941,0.134969375,0.170362862,0.206733556,0.533262289,0.384640005,0.244044113] 19 | L_pssm = [0.225244504,0.376776187,0.138589819,0.183133318,0.135891614,0.295921134,0.491456844,0.175555301,0.226657379,0.866720079,0.660651867,0.275622096,0.322198802,0.142342904,0.229513750,0.280515594,0.271416295,0.449116352,0.328244811,0.313903458] 20 | N_pssm = [0.269160755,0.292945305,0.336069790,0.480610819,0.406244664,0.119259286,0.077119964,0.359851389,0.465564987,0.193742239,0.102707801,0.890991538,0.404603890,0.356844609,0.474599142,0.378478471,0.359555731,0.113931650,0.148822188,0.178191469] 21 | Q_pssm = [0.207780410,0.198091695,0.225152747,0.290446646,0.204574963,0.199898084,0.131305856,0.143066821,0.236687261,0.095375846,0.119815893,0.248063241,0.214212144,0.911767496,0.213092991,0.278199811,0.203782752,0.053982761,0.129363584,0.077917397] 22 | P_pssm = [0.284015973,0.314122654,0.654988851,0.293965755,0.266342018,0.169070694,0.102684234,0.317582459,0.517885273,0.213771826,0.161182310,0.388242052,0.854016544,0.322383243,0.527796489,0.544931276,0.312645155,0.198176354,0.259373216,0.319545440] 23 | S_pssm = [0.219297725,0.229428948,0.421884074,0.246614784,0.192544267,0.166972270,0.143466396,0.188121950,0.655457815,0.140421669,0.150345757,0.387298283,0.343485616,0.290221367,0.344720375,0.847964449,0.282228510,0.254715316,0.202529035,0.318086834] 24 | R_pssm = [0.512894266,0.184875405,0.348063977,0.355832670,0.357881997,0.152903417,0.118138694,0.158965184,0.357595661,0.171367320,0.154680339,0.465237872,0.331957284,0.346368584,0.790428314,0.313318200,0.560004204,0.163479257,0.263011596,0.192724756] 25 | T_pssm = [0.377764625,0.178184577,0.297722653,0.383766569,0.254015265,0.11150642,0.219367283,0.298850656,0.371877308,0.252202659,0.215385273,0.354912342,0.351367517,0.380481386,0.542486991,0.293440184,0.850656294,0.247789053,0.38727193,0.244643665] 26 | W_pssm = [0.241772670,0.234871212,0.150171553,0.132149434,0.111593941,0.244663376,0.816923667,0.158818698,0.163135105,0.336656664,0.523681246,0.225822054,0.203272993,0.182922468,0.198275408,0.241880843,0.324984112,0.158786345,0.797769378,0.217122098] 27 | V_pssm = [0.170517317,0.249649453,0.100918838,0.061540932,0.141034882,0.376889735,0.157983778,0.181397842,0.096812917,0.149128782,0.135011831,0.127537528,0.116118235,0.112229974,0.112107972,0.141627238,0.170548818,0.984635386,0.174473134,0.542319867] 28 | Y_pssm = [0.500000000,0.119202922,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.268941421,0.119202922,0.500000000,0.268941421,0.500000000,0.119202922,0.268941421,0.268941421] 29 | X_pssm = [0.203265404,0.327932961,0.196348043,0.162669050,0.179471469,0.730270333,0.182171579,0.397550899,0.153009421,0.190944510,0.321001024,0.269707976,0.198415836,0.253520609,0.200279215,0.259147101,0.267143332,0.775315115,0.313386476,0.987981675] 30 | Z_pssm = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] 31 | 32 | featureDict = { 33 | 'G':[57, 75, 156, 0.102, 0.085, 0.190, 0.152, 75.06714, 6.06, 2.34, 9.60, 48, 249.9, 0], # Glycine 甘氨酸 34 | 'P':[57, 55, 152, 0.102, 0.301, 0.034, 0.068, 115.13194, 6.30, 1.99, 10.96, 90, 1620.0, 10.87], # Proline 脯氨酸 35 | 'T':[83, 119, 96, 0.086, 0.108, 0.065, 0.079, 119.12034, 5.60, 2.09, 9.10, 93, 13.2, 1.67], # Threonine 苏氨酸 36 | 'E':[151, 37, 74, 0.056, 0.060, 0.077, 0.064, 147.13074, 3.15, 2.10, 9.47, 109, 8.5, 2.09], # Glutamic Acid 谷氨酸 37 | 'S':[77, 75, 143, 0.120, 0.139, 0.125, 0.106, 105.09344, 5.68, 2.21, 9.15, 73, 422.0, 1.25], # Serine 丝氨酸 38 | 'K':[114, 74, 101, 0.055, 0.115, 0.072, 0.095, 146.18934, 9.60, 2.16, 9.06, 135, 739.0, 5.223888888888889], # Lysine 赖氨酸 39 | 'C':[70, 119, 119, 0.149, 0.050, 0.117, 0.128, 121.15404, 5.05, 1.92, 10.70, 86, 280, 4.18], # Cysteine 半胱氨酸 40 | 'L':[121, 130, 59, 0.061, 0.025, 0.036, 0.070, 131.17464, 6.01, 2.32, 9.58, 124, 21.7, 9.61], # Leucine 亮氨酸 41 | 'M':[145, 105, 60, 0.068, 0.082, 0.014, 0.055, 149.20784, 5.74, 2.28, 9.21, 124, 56.2, 5.43], # Methionine 蛋氨酸 42 | 'V':[106, 170, 50, 0.062, 0.048, 0.028, 0.053, 117.14784, 6.00, 2.29, 9.74, 105, 58.1, 6.27], # Valine 缬氨酸 43 | 'D':[67, 89, 156, 0.161, 0.083, 0.191, 0.091, 133.10384, 2.85, 1.99, 9.90, 96, 5.0, 2.09], # Asparagine 天冬氨酸 44 | 'A':[142, 83, 66, 0.06, 0.076, 0.035, 0.058, 89.09404, 6.01, 2.35, 9.87, 67, 167.2, 2.09], # Alanine 丙氨酸 45 | 'R':[98, 93, 95, 0.070, 0.106, 0.099, 0.085, 174.20274, 10.76, 2.17, 9.04, 148, 855.6, 5.223888888888889], # Arginine 精氨酸 46 | 'I':[108, 160, 47, 0.043, 0.034, 0.013, 0.056, 131.17464, 6.05, 2.32, 9.76, 124, 34.5, 12.54], # Isoleucine 异亮氨酸 47 | 'N':[101, 54, 146, 0.147, 0.110, 0.179, 0.081, 132.11904, 5.41, 2.02, 8.80, 91, 28.5, 0], # Aspartic Acid 天冬酰胺 48 | 'H':[100, 87, 95, 0.140, 0.047, 0.093, 0.054, 155.15634, 7.60, 1.80, 9.33, 118, 41.9, 2.09], # Histidine 组氨酸 49 | 'F':[113, 138, 60, 0.059, 0.041, 0.065, 0.065, 165.19184, 5.49, 2.20, 9.60, 135, 27.6, 10.45], # Phenylalanine 苯丙氨酸 50 | 'W':[108, 137, 96, 0.077, 0.013, 0.064, 0.167, 204.22844, 5.89, 2.46, 9.41, 163, 13.6, 14.21], # Tryptophan 色氨酸 51 | 'Y':[69, 147, 114, 0.082, 0.065, 0.114, 0.125, 181.19124, 5.64, 2.20, 9.21, 141, 0.4, 9.61], # Tyrosine 酪氨酸 52 | 'Q':[111, 110, 98, 0.074, 0.098, 0.037, 0.098, 146.14594, 5.65, 2.17, 9.13, 114, 4.7, -0.42], # Glutamine 谷氨酰胺 53 | 'X':[99.9, 102.85, 99.15, 0.0887, 0.08429999999999999, 0.0824, 0.0875, 136.90127, 6.027, 2.1690000000000005, 0.0875, 109.2, 232.37999999999997, 5.223888888888889], 54 | 'U':[99.9, 102.85, 99.15, 0.08870000000000001, 0.08430000000000001, 0.0824, 0.08750000000000001, 169.06, 6.026999999999999, 2.1690000000000005, 9.081309523809526, 109.19999999999999, 232.37999999999997, 5.223888888888889], 55 | 'Z':[99.9, 102.85, 99.15, 0.08870000000000001, 0.08430000000000001, 0.0824, 0.08750000000000001, 136.90126999999998, 6.026999999999999, 2.1690000000000005, 9.081309523809526, 109.19999999999999, 232.37999999999997, 5.223888888888889], 56 | } 57 | 58 | class DataClass: 59 | def __init__(self, seqPath, secPath, validSize=0.3, k=3, minCount=10): 60 | # Open files and load data 61 | with open(seqPath,'r') as f: 62 | seqData = [' '*(k//2)+i[:-1]+' '*(k//2) for i in f.readlines()] 63 | with open(secPath,'r') as f: 64 | secData = [i[:-1] for i in f.readlines()] 65 | self.tmp,self.k = seqData,k 66 | seqData = [[seq[i-k//2:i+k//2+1] for i in range(k//2,len(seq)-k//2)] for seq in seqData] 67 | # Dropping uncommon items 68 | itemCounter = {} 69 | for seq in seqData: 70 | for i in seq: 71 | itemCounter[i] = itemCounter.get(i,0)+1 72 | seqData = [[i if itemCounter[i]>=minCount else "" for i in seq] for seq in seqData] 73 | self.rawSeq,self.rawSec = seqData,secData 74 | self.minCount = minCount 75 | # Get mapping variables 76 | self.seqItem2id,self.id2seqItem = {"":0, "":1},["", ""] 77 | self.secItem2id,self.id2secItem = {"":0},[""] 78 | cnt = 2 79 | for seq in seqData: 80 | for i in seq: 81 | if i not in self.seqItem2id: 82 | self.seqItem2id[i] = cnt 83 | self.id2seqItem.append(i) 84 | cnt += 1 85 | self.seqItemNum = cnt 86 | cnt = 1 87 | for sec in secData: 88 | for i in sec: 89 | if i not in self.secItem2id: 90 | self.secItem2id[i] = cnt 91 | self.id2secItem.append(i) 92 | cnt += 1 93 | self.classNum = cnt 94 | # Tokenized the seq 95 | self.tokenizedSeq,self.tokenizedSec = np.array([[self.seqItem2id[i] for i in seq] for seq in seqData]),np.array([[self.secItem2id[i] for i in sec] for sec in secData]) 96 | self.seqLen,self.secLen = np.array([len(seq)+1 for seq in seqData]),np.array([len(sec)+1 for sec in secData]) 97 | self.trainIdList,self.validIdList = train_test_split(range(len(seqData)), test_size=validSize) if validSize>0.0 else (list(range(seqData)),[]) 98 | self.trainSampleNum,self.validSampleNum = len(self.trainIdList),len(self.validIdList) 99 | self.totalSampleNum = self.trainSampleNum+self.validSampleNum 100 | self.vector = {} 101 | print('classNum:',self.classNum) 102 | print(f'seqItemNum:{self.seqItemNum}') 103 | print('train sample size:',len(self.trainIdList)) 104 | print('valid sample size:',len(self.validIdList)) 105 | def describe(self): 106 | pass 107 | ''' 108 | trainSec,validSec = np.hstack(self.tokenizedSec[self.trainIdList]),np.hstack(self.tokenizedSec[self.validIdList]) 109 | trainPad,validPad = self.trainSampleNum*self.seqLen.max()-len(trainSec),self.validSampleNum*self.seqLen.max()-len(validSec) 110 | trainSec,validSec = np.hstack([trainSec,[0]*trainPad]),np.hstack([validSec,[0]*validPad]) 111 | print('===========DataClass Describe===========') 112 | print(f'{"CLASS":<16}{"TRAIN":<8}{"VALID":<8}') 113 | for i,c in enumerate(self.id2secItem): 114 | trainIsC = sum(trainSec==i)/self.trainSampleNum if self.trainSampleNum>0 else -1.0 115 | validIsC = sum(validSec==i)/self.validSampleNum if self.validSampleNum>0 else -1.0 116 | print(f'{c:<16}{trainIsC:<8.3f}{validIsC:<8.3f}') 117 | print('========================================') 118 | ''' 119 | def vectorize(self, method="char2vec", feaSize=128, window=13, sg=1, 120 | workers=8, loadCache=True): 121 | if method=='feaEmbedding': loadCache = False 122 | vecPath = f'cache/{method}_k{self.k}_d{feaSize}.pkl' 123 | if os.path.exists(vecPath) and loadCache: 124 | with open(vecPath, 'rb') as f: 125 | self.vector['embedding'] = pickle.load(f) 126 | print(f'Loaded cache from cache/{vecPath}.') 127 | return 128 | if method == 'char2vec': 129 | doc = [list(i)+[''] for i in self.rawSeq] 130 | model = Word2Vec(doc, min_count=self.minCount, window=window, size=feaSize, workers=workers, sg=sg, iter=10) 131 | char2vec = np.random.random((self.seqItemNum, feaSize)) 132 | for i in range(self.seqItemNum): 133 | if self.id2seqItem[i] in model.wv: 134 | char2vec[i] = model.wv[self.id2seqItem[i]] 135 | else: 136 | print(self.id2seqItem[i],'not in training docs...') 137 | self.vector['embedding'] = char2vec 138 | with open(vecPath, 'wb') as f: 139 | pickle.dump(self.vector['embedding'], f, protocol=4) 140 | elif method == 'feaEmbedding': 141 | oh = np.eye(self.seqItemNum) 142 | feaAppend = [] 143 | for i in range(self.seqItemNum): 144 | item = self.id2seqItem[i] 145 | if item in featureDict: 146 | feaAppend.append( featureDict[item] ) 147 | else: 148 | feaAppend.append( np.random.random(14) ) 149 | emb = np.hstack([oh, np.array(feaAppend)]).astype('float32') 150 | mean,std = emb.mean(axis=0),emb.std(axis=0) 151 | self.vector['feaEmbedding'] = (emb-mean)/(std+1e-10) 152 | 153 | def vector_merge(self, vecList, mergeVecName='mergeVec'): 154 | self.vector[mergeVec] = np.hstack([self.vector[i] for i in vecList]) 155 | print(f'Get a new vector "{mergeVec}" with shape {self.vector[mergeVec].shape}...') 156 | 157 | def random_batch_data_stream(self, batchSize=128, type='train', device=torch.device('cpu'), augmentation=0.05): 158 | idList = [i for i in self.trainIdList] if type=='train' else [i for i in self.validIdList] 159 | X,XLen,Y = self.tokenizedSeq,self.seqLen,self.tokenizedSec 160 | seqMaxLen = XLen.max() 161 | while True: 162 | random.shuffle(idList) 163 | for i in range((len(idList)+batchSize-1)//batchSize): 164 | samples = idList[i*batchSize:(i+1)*batchSize] 165 | yield { 166 | "seqArr":torch.tensor([[i if random.random()>augmentation else self.seqItem2id[''] for i in seq]+[0]*(seqMaxLen-len(seq)) for seq in X[samples]], dtype=torch.long).to(device), \ 167 | "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device) 168 | }, torch.tensor([i+[0]*(seqMaxLen-len(i)) for i in Y[samples]], dtype=torch.long).to(device) 169 | 170 | def one_epoch_batch_data_stream(self, batchSize=128, type='valid', device=torch.device('cpu')): 171 | idList = [i for i in self.trainIdList] if type=='train' else [i for i in self.validIdList] 172 | X,XLen,Y = self.tokenizedSeq,self.seqLen,self.tokenizedSec 173 | seqMaxLen = XLen.max() 174 | for i in range((len(idList)+batchSize-1)//batchSize): 175 | samples = idList[i*batchSize:(i+1)*batchSize] 176 | yield { 177 | "seqArr":torch.tensor([i+[0]*(seqMaxLen-len(i)) for i in X[samples]], dtype=torch.long).to(device), \ 178 | "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device) 179 | }, torch.tensor([i+[0]*(seqMaxLen-len(i)) for i in Y[samples]], dtype=torch.long).to(device) 180 | 181 | ''' 182 | P(α):形成α螺旋的可能性 183 | P(β):形成β螺旋的可能性 184 | P(turn):转向概率 185 | f(i): 186 | f(i+1): 187 | f(i+2): 188 | f(i+3): 189 | 分子量(Da): 190 | pl:等电点 191 | pK1(α-COOH):解离常数 192 | pK2(α-NH3):解离常数 193 | 范德华半径: 194 | 水中溶解度(25℃,g/L): 195 | 氨基侧链疏水性(乙醇->水,kj/mol) 196 | ''' 197 | 198 | class DataClass_BP: 199 | def __init__(self, seqPath, secPath, validSize=0.3): 200 | # Open files and load data 201 | with open(seqPath,'r') as f: 202 | seqData = [i[:-1].replace('U','X').replace('Z','X') for i in f.readlines()] 203 | with open(secPath,'r') as f: 204 | secData = [i[:-1] for i in f.readlines()] 205 | self.rawSeq,self.rawSec = seqData,secData 206 | # Get mapping variables 207 | self.seqItem2id,self.id2seqItem = {"":0, "":1},["", ""] 208 | self.secItem2id,self.id2secItem = {"":0},[""] 209 | cnt = 2 210 | for seq in seqData: 211 | for i in seq: 212 | if i not in self.seqItem2id: 213 | self.seqItem2id[i] = cnt 214 | self.id2seqItem.append(i) 215 | cnt += 1 216 | self.seqItemNum = cnt 217 | cnt = 1 218 | for sec in secData: 219 | for i in sec: 220 | if i not in self.secItem2id: 221 | self.secItem2id[i] = cnt 222 | self.id2secItem.append(i) 223 | cnt += 1 224 | self.classNum = cnt 225 | # Tokenized the seq 226 | self.seqLen,self.secLen = np.array([len(seq)+1 for seq in seqData]),np.array([len(sec)+1 for sec in secData]) 227 | self.seqMaxLen = np.max(self.seqLen) 228 | self.tokenizedSeq,self.tokenizedSec = np.array([[self.seqItem2id[i] for i in seq]+[self.seqItem2id['']]*(self.seqMaxLen-len(seq)) for seq in seqData]),np.array([[self.secItem2id[i] for i in sec]+[self.secItem2id['']]*(self.seqMaxLen-len(sec)) for sec in secData]) 229 | self.trainIdList,self.validIdList = train_test_split(range(len(seqData)), test_size=validSize) if validSize>0.0 else (list(range(seqData)),[]) 230 | self.trainSampleNum,self.validSampleNum = len(self.trainIdList),len(self.validIdList) 231 | self.totalSampleNum = self.trainSampleNum+self.validSampleNum 232 | self.vector = {} 233 | print('classNum:',self.classNum) 234 | print(f'seqItemNum:{self.seqItemNum}') 235 | print('train sample size:',len(self.trainIdList)) 236 | print('valid sample size:',len(self.validIdList)) 237 | def describe(self): 238 | trainSec,validSec = np.hstack(self.tokenizedSec[self.trainIdList]),np.hstack(self.tokenizedSec[self.validIdList]) 239 | trainPad,validPad = self.trainSampleNum*self.seqLen.max()-len(trainSec),self.validSampleNum*self.seqLen.max()-len(validSec) 240 | trainSec,validSec = np.hstack([trainSec,[0]*trainPad]),np.hstack([validSec,[0]*validPad]) 241 | print('===========DataClass Describe===========') 242 | print(f'{"CLASS":<16}{"TRAIN":<8}{"VALID":<8}') 243 | for i,c in enumerate(self.id2secItem): 244 | trainIsC = sum(trainSec==i)/self.trainSampleNum if self.trainSampleNum>0 else -1.0 245 | validIsC = sum(validSec==i)/self.validSampleNum if self.validSampleNum>0 else -1.0 246 | print(f'{c:<16}{trainIsC:<8.3f}{validIsC:<8.3f}') 247 | print('========================================') 248 | def vectorize(self, method="feaEmbedding", 249 | feaSize=128, window=13, sg=1, 250 | workers=8, loadCache=True): 251 | if method == 'feaEmbedding': 252 | vecPath = f'cache/{method}.pkl' 253 | else: 254 | vecPath = f'cache/{method}_k{self.k}_d{feaSize}.pkl' 255 | if os.path.exists(vecPath) and loadCache: 256 | with open(vecPath, 'rb') as f: 257 | self.vector['embedding'] = pickle.load(f) 258 | print(f'Loaded cache from cache/{vecPath}.') 259 | return 260 | if method == 'feaEmbedding': 261 | oh = np.eye(self.seqItemNum) 262 | feaAppend = [] 263 | for i in range(self.seqItemNum): 264 | item = self.id2seqItem[i] 265 | if item in featureDict: 266 | feaAppend.append( featureDict[item] ) 267 | else: 268 | feaAppend.append( np.random.random(14) ) 269 | emb = np.hstack([oh, np.array(feaAppend)]).astype('float32') 270 | mean,std = emb.mean(axis=0),emb.std(axis=0) 271 | self.vector['embedding'] = (emb-mean)/(std+1e-10) 272 | elif method == 'char2vec': 273 | doc = [list(i)+[''] for i in self.rawSeq] 274 | model = Word2Vec(doc, min_count=self.minCount, window=window, size=feaSize, workers=workers, sg=sg, iter=10) 275 | char2vec = np.random.random((self.seqItemNum, feaSize)) 276 | for i in range(self.seqItemNum): 277 | if self.id2seqItem[i] in model.wv: 278 | char2vec[i] = model.wv[self.id2seqItem[i]] 279 | else: 280 | print(self.id2seqItem[i],'not in training docs...') 281 | self.vector['embedding'] = char2vec 282 | 283 | with open(vecPath, 'wb') as f: 284 | pickle.dump(self.vector['embedding'], f, protocol=4) 285 | 286 | def vector_merge(self, vecList, mergeVecName='mergeVec'): 287 | self.vector[mergeVec] = np.hstack([self.vector[i] for i in vecList]) 288 | print(f'Get a new vector "{mergeVec}" with shape {self.vector[mergeVec].shape}...') 289 | 290 | def random_batch_data_stream(self, batchSize=128, type='train', device=torch.device('cpu'), augmentation=0.05): 291 | idList = [i for i in self.trainIdList] if type=='train' else [i for i in self.validIdList] 292 | X,XLen,Y = self.tokenizedSeq,self.seqLen,self.tokenizedSec 293 | while True: 294 | random.shuffle(idList) 295 | for i in range((len(idList)+batchSize-1)//batchSize): 296 | samples = idList[i*batchSize:(i+1)*batchSize] 297 | yield { 298 | "seqArr":torch.tensor([[i if random.random()>augmentation else self.seqItem2id[''] for i in seq] for seq in X[samples]], dtype=torch.long).to(device), \ 299 | "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device) 300 | }, torch.tensor([i for i in Y[samples]], dtype=torch.long).to(device) 301 | 302 | def one_epoch_batch_data_stream(self, batchSize=128, type='valid', device=torch.device('cpu')): 303 | idList = [i for i in self.trainIdList] if type=='train' else [i for i in self.validIdList] 304 | X,XLen,Y = self.tokenizedSeq,self.seqLen,self.tokenizedSec 305 | for i in range((len(idList)+batchSize-1)//batchSize): 306 | samples = idList[i*batchSize:(i+1)*batchSize] 307 | yield { 308 | "seqArr":torch.tensor([i for i in X[samples]], dtype=torch.long).to(device), \ 309 | "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device) 310 | }, torch.tensor([i for i in Y[samples]], dtype=torch.long).to(device) 311 | 312 | class DataClass_BP2: 313 | def __init__(self, seqPath, secPath, window=17, validSize=0.3): 314 | # Open files and load data 315 | with open(seqPath,'r') as f: 316 | seqData = [] 317 | for seq in f.readlines(): 318 | seq = ' '*(window//2) + seq[:-1] + ' '*(window//2) 319 | seqData += [seq[i-window//2:i+window//2+1] for i in range(window//2,len(seq)-window//2)] 320 | with open(secPath,'r') as f: 321 | secData = [] 322 | for sec in f.readlines(): 323 | secData += list(sec[:-1]) 324 | self.seqLen = window 325 | self.rawSeq,self.rawSec = seqData,secData 326 | # Get mapping variables 327 | self.seqItem2id,self.id2seqItem = {"":0},[""] 328 | self.secItem2id,self.id2secItem = {},[] 329 | cnt = 1 330 | for seq in seqData: 331 | for i in seq: 332 | if i not in self.seqItem2id: 333 | self.seqItem2id[i] = cnt 334 | self.id2seqItem.append(i) 335 | cnt += 1 336 | self.seqItemNum = cnt 337 | cnt = 0 338 | for sec in secData: 339 | for i in sec: 340 | if i not in self.secItem2id: 341 | self.secItem2id[i] = cnt 342 | self.id2secItem.append(i) 343 | cnt += 1 344 | self.classNum = cnt 345 | # Tokenized the seq 346 | self.tokenizedSeq,self.tokenizedSec = np.array([[self.seqItem2id[i] for i in seq] for seq in seqData]),np.array([self.secItem2id[sec] for sec in secData]) 347 | self.trainIdList,self.validIdList = train_test_split(range(len(seqData)), test_size=validSize, stratify=self.tokenizedSec) if validSize>0.0 else (list(range(seqData)),[]) 348 | self.trainSampleNum,self.validSampleNum = len(self.trainIdList),len(self.validIdList) 349 | self.totalSampleNum = self.trainSampleNum+self.validSampleNum 350 | self.vector = {} 351 | print('classNum:',self.classNum) 352 | print(f'seqItemNum:{self.seqItemNum}') 353 | print('train sample size:',len(self.trainIdList)) 354 | print('valid sample size:',len(self.validIdList)) 355 | def describe(self): 356 | trainSec,validSec = np.hstack(self.tokenizedSec[self.trainIdList]),np.hstack(self.tokenizedSec[self.validIdList]) 357 | trainPad,validPad = self.trainSampleNum*self.seqLen.max()-len(trainSec),self.validSampleNum*self.seqLen.max()-len(validSec) 358 | trainSec,validSec = np.hstack([trainSec,[0]*trainPad]),np.hstack([validSec,[0]*validPad]) 359 | print('===========DataClass Describe===========') 360 | print(f'{"CLASS":<16}{"TRAIN":<8}{"VALID":<8}') 361 | for i,c in enumerate(self.id2secItem): 362 | trainIsC = sum(trainSec==i)/self.trainSampleNum if self.trainSampleNum>0 else -1.0 363 | validIsC = sum(validSec==i)/self.validSampleNum if self.validSampleNum>0 else -1.0 364 | print(f'{c:<16}{trainIsC:<8.3f}{validIsC:<8.3f}') 365 | print('========================================') 366 | def vectorize(self, method="feaEmbedding", loadCache=True): 367 | vecPath = f'cache/{method}.pkl' 368 | if os.path.exists(vecPath) and loadCache: 369 | with open(vecPath, 'rb') as f: 370 | self.vector['embedding'] = pickle.load(f) 371 | print(f'Loaded cache from cache/{vecPath}.') 372 | return 373 | if method == 'feaEmbedding': 374 | oh = np.eye(self.seqItemNum) 375 | feaAppend = [] 376 | for i in range(self.seqItemNum): 377 | item = self.id2seqItem[i] 378 | if item in featureDict: 379 | feaAppend.append( featureDict[item] ) 380 | else: 381 | feaAppend.append( np.random.random(14) ) 382 | emb = np.hstack([oh, np.array(feaAppend)]).astype('float32') 383 | mean,std = emb.mean(axis=0),emb.std(axis=0) 384 | self.vector['embedding'] = (emb-mean)/(std+1e-10) 385 | self.vector['embedding'][self.seqItem2id[' ']] *= 0 386 | with open(vecPath, 'wb') as f: 387 | pickle.dump(self.vector['embedding'], f, protocol=4) 388 | 389 | def vector_merge(self, vecList, mergeVecName='mergeVec'): 390 | self.vector[mergeVec] = np.hstack([self.vector[i] for i in vecList]) 391 | print(f'Get a new vector "{mergeVec}" with shape {self.vector[mergeVec].shape}...') 392 | 393 | def random_batch_data_stream(self, batchSize=128, type='train', device=torch.device('cpu'), augmentation=0.05): 394 | idList = [i for i in self.trainIdList] if type=='train' else [i for i in self.validIdList] 395 | X,Y = self.tokenizedSeq,self.tokenizedSec 396 | while True: 397 | random.shuffle(idList) 398 | for i in range((len(idList)+batchSize-1)//batchSize): 399 | samples = idList[i*batchSize:(i+1)*batchSize] 400 | yield { 401 | "seqArr":torch.tensor([[i if random.random()>augmentation else self.seqItem2id[''] for i in seq] for seq in X[samples]], dtype=torch.long).to(device), \ 402 | }, torch.tensor([i for i in Y[samples]], dtype=torch.long).to(device) 403 | 404 | def one_epoch_batch_data_stream(self, batchSize=128, type='valid', device=torch.device('cpu')): 405 | idList = [i for i in self.trainIdList] if type=='train' else [i for i in self.validIdList] 406 | X,Y = self.tokenizedSeq,self.tokenizedSec 407 | for i in range((len(idList)+batchSize-1)//batchSize): 408 | samples = idList[i*batchSize:(i+1)*batchSize] 409 | yield { 410 | "seqArr":torch.tensor([i for i in X[samples]], dtype=torch.long).to(device), \ 411 | }, torch.tensor([i for i in Y[samples]], dtype=torch.long).to(device) --------------------------------------------------------------------------------