├── Independent_test_set
│   └── test_set.txt
├── LICENSE.txt
├── README.md
├── cache
│   └── README.txt
├── data.txt
├── model
│   ├── DL_ClassifierModel.py
│   ├── lncRNA_lib.py
│   ├── metrics.py
│   ├── nnLayer.py
│   └── utils.py
├── out
│   └── README.txt
├── requirements.txt
└── train.py

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Min Zeng & Yifan Wu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DeepLncLoc
A deep learning-based lncRNA subcellular localization predictor

# Usage
## How to train the model
You can train the model in a very simple way with the command below:
***python train.py --k 3 --d 64 --s 64 --f 128 --metrics MaF --device "cuda:0"***
>***k*** is the k-mer length used to tokenize the sequences.
>***d*** is the dimension of the k-mer embedding vectors trained with the gensim library.
>***s*** is the number of subsequences.
>***f*** is the number of filters in the CNN layer.
>***metrics*** is the evaluation metric used during training: "MaF" for macro F1, "ACC" for accuracy, "MaAUC" for macro AUC, "MiAUC" for micro AUC.
>***device*** is the device used to build and train the model: "cpu" for CPU, "cuda" for GPU, or e.g. "cuda:0" for GPU 0.

You can also use the package we provide to train your model.
First, import the package.
```python
from model.utils import *
from model.DL_ClassifierModel import *
```
Then create the data object and get the embedding features.
```python
dataClass = DataClass('data.txt', validSize=0.2, testSize=0.0, kmers=3)
dataClass.vectorize("char2vec", feaSize=64)
```
Finally, create the model object and start training.
```python
s,f,k,d = 64,128,3,64
model = TextClassifier_SPPCNN(classNum=5, embedding=dataClass.vector['embedding'], SPPSize=s, feaSize=d, filterNum=f, contextSizeList=[1,3,5], embDropout=0.3, fcDropout=0.5, useFocalLoss=True, device="cuda")
model.cv_train(dataClass, trainSize=1, batchSize=16, stopRounds=200, earlyStop=10, epoch=100, kFold=5, savePath=f"out/DeepLncLoc_s{s}_f{f}_k{k}_d{d}", report=['ACC','MaF','MiAUC','MaAUC'])
```

==Note that the model needs to be named "..._sx_fx_kx_dx" (each 'x' stands for a parameter value), so that the parameters can be recovered from the file name to initialize the model architecture correctly at prediction time.==
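For reference, the sketch below mirrors how `model/lncRNA_lib.py` recovers these parameters from a weight-file name (the file name here is a made-up example):
```python
weightPath = "out/DeepLncLoc_s64_f128_k3_d64.pkl"  # hypothetical file name
tmp = weightPath[:-4].split('_')                   # strip ".pkl", split on "_"
params = {p[0]: int(p[1:]) for p in tmp if p[0] in ['s','f','k','d']}
print(params)  # {'s': 64, 'f': 128, 'k': 3, 'd': 64}
```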
## How to do prediction
First, import the package.
```python
from model.lncRNA_lib import *
```
Then instantiate an object.
```python
model = lncRNALocalizer(weightPath="out/xxx_sx_fx_kx_dx.pkl", classNum=5, contextSizeList=[1,3,5], map_location={"cuda:0":"cpu"}, device="cpu")
```
Finally, do the prediction.
```python
res = model.predict("ATCG...")
print(res)
```
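`model/lncRNA_lib.py` also provides a simple ensemble helper, `vote_predict`, which averages the class probabilities returned by several localizers. A minimal sketch, where the weight-file names are placeholders:
```python
localizers = [
    lncRNALocalizer(weightPath="out/model1_s64_f128_k3_d64.pkl", classNum=5),
    lncRNALocalizer(weightPath="out/model2_s64_f128_k3_d64.pkl", classNum=5),
]
res = vote_predict(localizers, "ATCG...")  # {label: averaged probability}
print(res)
```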
## Independent test set
The test_set.txt file in the Independent_test_set folder was used for the comparison with lncLocator and iLoc-lncRNA. The final prediction results of the three predictors (DeepLncLoc, lncLocator, iLoc-lncRNA) can be found in the supplementary materials of the paper.

## Other details
For other details, please refer to the paper and the code.

# Citation
Min Zeng, Yifan Wu, Chengqian Lu, Fuhao Zhang, Fang-Xiang Wu, Min Li. DeepLncLoc: a deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Briefings in Bioinformatics 23 (1), 2022, bbab360.

# License
This project is licensed under the MIT License - see the LICENSE.txt file for details.

--------------------------------------------------------------------------------
/cache/README.txt:
--------------------------------------------------------------------------------
The cache files are saved in this folder.

--------------------------------------------------------------------------------
/model/DL_ClassifierModel.py:
--------------------------------------------------------------------------------
import numpy as np
import torch,time,os,pickle
from torch import nn as nn
from .nnLayer import *
from .metrics import *
from collections import Counter
from collections.abc import Iterable
from sklearn.model_selection import StratifiedKFold

class BaseClassifier:
    def __init__(self):
        pass
    def calculate_y_logit(self, X, XLen):
        pass
    def cv_train(self, dataClass, trainSize=256, batchSize=256, epoch=100, stopRounds=10, earlyStop=10, saveRounds=1,
                 optimType='Adam', lr=0.001, weightDecay=0, kFold=5, isHigherBetter=True, metrics="MaF", report=["ACC", "MaF"],
                 savePath='model'):
        skf = StratifiedKFold(n_splits=kFold)
        validRes,testRes = [],[]
        tvIdList = dataClass.trainIdList+dataClass.validIdList
        for i,(trainIndices,validIndices) in enumerate(skf.split(tvIdList, dataClass.Lab[tvIdList])):
            print(f'CV_{i+1}:')
            self.reset_parameters()
            dataClass.trainIdList,dataClass.validIdList = [tvIdList[i] for i in trainIndices],[tvIdList[i] for i in validIndices]
            res = self.train(dataClass,trainSize,batchSize,epoch,stopRounds,earlyStop,saveRounds,optimType,lr,weightDecay,
                             isHigherBetter,metrics,report,f"{savePath}_cv{i+1}")
            validRes.append(res)
            if dataClass.testSampleNum>0:
                testRes.append(self.calculate_indicator_by_iterator(dataClass.one_epoch_batch_data_stream(type='test',batchSize=trainSize,device=self.device), dataClass.classNum, report))
        Metrictor.table_show(validRes, report)
        if dataClass.testSampleNum>0:
            Metrictor.table_show(testRes, report)
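    # Note: `trainSize` is the physical (per-step) batch size and `batchSize` the
    # effective one. train() below accumulates gradients for batchSize//trainSize
    # steps before each optimizer update, which is why batchSize must be divisible
    # by trainSize (e.g. trainSize=1, batchSize=16 as used in train.py).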
    def train(self, dataClass, trainSize=256, batchSize=256, epoch=100, stopRounds=10, earlyStop=10, saveRounds=1,
              optimType='Adam', lr=0.001, weightDecay=0, isHigherBetter=True, metrics="MaF", report=["ACC", "MiF"],
              savePath='model'):
        dataClass.describe()
        assert batchSize%trainSize==0
        metrictor = Metrictor(dataClass.classNum)
        self.stepCounter = 0
        self.stepUpdate = batchSize//trainSize
        optimizer = getattr(torch.optim, optimType)(self.moduleList.parameters(), lr=lr, weight_decay=weightDecay)
        trainStream = dataClass.random_batch_data_stream(batchSize=trainSize, type='train', device=self.device)
        itersPerEpoch = (dataClass.trainSampleNum+trainSize-1)//trainSize
        mtc,bestMtc,stopSteps = 0.0,0.0,0
        if dataClass.validSampleNum>0: validStream = dataClass.random_batch_data_stream(batchSize=trainSize, type='valid', device=self.device)
        st = time.time()
        for e in range(epoch):
            for i in range(itersPerEpoch):
                self.to_train_mode()
                X,Y = next(trainStream)
                loss = self._train_step(X,Y, optimizer)
                if stopRounds>0 and (e*itersPerEpoch+i+1)%stopRounds==0:
                    self.to_eval_mode()
                    print(f"After iters {e*itersPerEpoch+i+1}: [train] loss= {loss:.3f};", end='')
                    if dataClass.validSampleNum>0:
                        X,Y = next(validStream)
                        loss = self.calculate_loss(X,Y)
                        print(f' [valid] loss= {loss:.3f};', end='')
                    restNum = ((itersPerEpoch-i-1)+(epoch-e-1)*itersPerEpoch)*trainSize
                    speed = (e*itersPerEpoch+i+1)*trainSize/(time.time()-st)
                    print(" speed: %.3lf items/s; remaining time: %.3lfs;"%(speed, restNum/speed))
            if dataClass.validSampleNum>0 and (e+1)%saveRounds==0:
                self.to_eval_mode()
                print(f'========== Epoch:{e+1:5d} ==========')
                Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='train', device=self.device))
                metrictor.set_data(Y_pre, Y)
                print(f'[Total Train]',end='')
                metrictor(report)
                print(f'[Total Valid]',end='')
                Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='valid', device=self.device))
                metrictor.set_data(Y_pre, Y)
                res = metrictor(report)
                mtc = res[metrics]
                print('=================================')
                if (mtc>bestMtc and isHigherBetter) or (mtc<bestMtc and not isHigherBetter):
                    print(f'Got a better model with val {metrics}: {mtc:.3f}.')
                    bestMtc = mtc
                    self.save("%s.pkl"%savePath, e+1, bestMtc, dataClass)
                    stopSteps = 0
                else:
                    stopSteps += 1
                    if stopSteps>=earlyStop:
                        print(f'The val {metrics} has not improved for more than {earlyStop} steps in epoch {e+1}, stop training.')
                        break
        self.load("%s.pkl"%savePath)
        os.rename("%s.pkl"%savePath, "%s_%s.pkl"%(savePath, ("%.3lf"%bestMtc)[2:]))
        print(f'============ Result ============')
        Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='train', device=self.device))
        metrictor.set_data(Y_pre, Y)
        print(f'[Total Train]',end='')
        metrictor(report)
        Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='valid', device=self.device))
        metrictor.set_data(Y_pre, Y)
        print(f'[Total Valid]',end='')
        res = metrictor(report)
        metrictor.each_class_indicator_show(dataClass.id2lab)
        print(f'================================')
        return res
    def reset_parameters(self):
        for module in self.moduleList:
            for subModule in module.modules():
                if hasattr(subModule, "reset_parameters"):
                    subModule.reset_parameters()
    def save(self, path, epochs, bestMtc=None, dataClass=None):
        stateDict = {'epochs':epochs, 'bestMtc':bestMtc}
        for module in self.moduleList:
            stateDict[module.name] = module.state_dict()
        if dataClass is not None:
            stateDict['trainIdList'],stateDict['validIdList'],stateDict['testIdList'] = dataClass.trainIdList,dataClass.validIdList,dataClass.testIdList
            stateDict['lab2id'],stateDict['id2lab'] = dataClass.lab2id,dataClass.id2lab
            stateDict['kmers2id'],stateDict['id2kmers'] = dataClass.kmers2id,dataClass.id2kmers
        torch.save(stateDict, path)
        print('Model saved in "%s".'%path)
    def load(self, path, map_location=None, dataClass=None):
        parameters = torch.load(path, map_location=map_location)
        for module in self.moduleList:
            module.load_state_dict(parameters[module.name])
        if dataClass is not None:
            if "trainIdList" in parameters:
                dataClass.trainIdList = parameters['trainIdList']
            if "validIdList" in parameters:
                dataClass.validIdList = parameters['validIdList']
            if "testIdList" in parameters:
                dataClass.testIdList = parameters['testIdList']
        print("Loaded a model trained for %d epochs with best val score %.3lf."%(parameters['epochs'], parameters['bestMtc']))
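    # For reference, the checkpoint written by save() (and consumed by load() and
    # by lncRNALocalizer in lncRNA_lib.py) is a dict of the form:
    #   {'epochs': int, 'bestMtc': float,
    #    <module.name>: state_dict() of each module in self.moduleList,
    #    'trainIdList'/'validIdList'/'testIdList': the data splits,
    #    'lab2id'/'id2lab' and 'kmers2id'/'id2kmers': the vocabulary mappings}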
    def calculate_y_prob(self, X):
        Y_pre = self.calculate_y_logit(X)
        return torch.softmax(Y_pre, dim=1)
    def calculate_y(self, X):
        Y_pre = self.calculate_y_prob(X)
        return torch.argmax(Y_pre, dim=1)
    def calculate_loss(self, X, Y):
        Y_logit = self.calculate_y_logit(X)
        return self.criterion(Y_logit, Y)
    def calculate_indicator_by_iterator(self, dataStream, classNum, report):
        metrictor = Metrictor(classNum)
        Y_prob_pre,Y = self.calculate_y_prob_by_iterator(dataStream)
        metrictor.set_data(Y_prob_pre, Y)
        return metrictor(report)
    def calculate_y_prob_by_iterator(self, dataStream):
        YArr,Y_preArr = [],[]
        while True:
            try:
                X,Y = next(dataStream)
            except StopIteration:
                break
            Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy()
            YArr.append(Y)
            Y_preArr.append(Y_pre)
        YArr,Y_preArr = np.hstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32')
        return Y_preArr, YArr
    def calculate_y_by_iterator(self, dataStream):
        Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream)
        return Y_preArr.argmax(axis=1), YArr
    def to_train_mode(self):
        for module in self.moduleList:
            module.train()
    def to_eval_mode(self):
        for module in self.moduleList:
            module.eval()
    def _train_step(self, X, Y, optimizer):
        self.stepCounter += 1
        if self.stepCounter<self.stepUpdate:
            p = False
        else:
            self.stepCounter = 0
            p = True
        loss = self.calculate_loss(X, Y)/self.stepUpdate
        loss.backward()
        if p:
            nn.utils.clip_grad_norm_(self.moduleList.parameters(), max_norm=20, norm_type=2)
            optimizer.step()
            optimizer.zero_grad()
        return loss*self.stepUpdate

class TextClassifier_SPPCNN(BaseClassifier):
    def __init__(self, classNum, embedding, SPPSize=128, feaSize=512, filterNum=128, contextSizeList=[1,3,5], hiddenList=[],
                 embDropout=0.3, fcDropout=0.5, useFocalLoss=False, weight=None, device=torch.device("cuda:0")):
        self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float), dropout=embDropout ).to(device)
        self.textSPP = TextSPP(SPPSize).to(device)
        self.textCNN = TextCNN(feaSize, contextSizeList, filterNum).to(device)
        self.fcLinear = MLP(len(contextSizeList)*filterNum, classNum, hiddenList, dropout=fcDropout).to(device)
        self.moduleList = nn.ModuleList([self.textEmbedding, self.textSPP, self.textCNN, self.fcLinear])
        self.classNum = classNum
        self.device = device
        self.criterion = FocalCrossEntropyLoss(weight=weight) if useFocalLoss else nn.CrossEntropyLoss()
    def calculate_y_logit(self, X):
        X = X['seqArr']
        X = self.textEmbedding(X) # => batchSize × seqLen × feaSize
        X = X.transpose(1,2)
        X = self.textSPP(X) # => batchSize × feaSize × sppSize
        X = self.textCNN(X) # => batchSize × scaleNum*filterNum
        return self.fcLinear(X) # => batchSize × classNum

--------------------------------------------------------------------------------
/model/lncRNA_lib.py:
--------------------------------------------------------------------------------
from .nnLayer import *
from torch.nn import functional as F
from collections import Counter

class lncRNALocalizer:
    def __init__(self, weightPath, classNum=5, contextSizeList=[1,3,5], hiddenList=[], map_location=None, device=torch.device("cpu")):
        tmp = weightPath[:-4].split('_')
        params = {p[0]:int(p[1:]) for p in tmp if p[0] in ['s','f','k','d']}
        self.k = params['k']

        stateDict = torch.load(weightPath, map_location=map_location)
        self.lab2id,self.id2lab = stateDict['lab2id'],stateDict['id2lab']
        self.kmers2id,self.id2kmers = stateDict['kmers2id'],stateDict['id2kmers']
        self.textEmbedding = TextEmbedding( torch.zeros((len(self.id2kmers),params['d']), dtype=torch.float) ).to(device)
        self.textSPP = TextSPP(params['s']).to(device)
        self.textCNN = TextCNN(params['d'], contextSizeList, params['f']).to(device)
        self.fcLinear = MLP(len(contextSizeList)*params['f'], classNum, hiddenList).to(device)
        self.moduleList = nn.ModuleList([self.textEmbedding, self.textSPP, self.textCNN, self.fcLinear])
        for module in self.moduleList:
            module.load_state_dict(stateDict[module.name])
            module.eval()

        self.device = device

    def predict(self, x):
        # x: seqLen
        x = self.__transform__(x) # => 1 × seqLen
        x = self.textEmbedding(x).transpose(1,2) # => 1 × feaSize × seqLen
        x = self.textSPP(x) # => 1 × feaSize × sppSize
        x = self.textCNN(x) # => 1 × scaleNum*filterNum
        x = self.fcLinear(x)[0] # => classNum
        return {k:v for k,v in zip(self.id2lab,F.softmax(x, dim=0).data.cpu().numpy())}
    def __transform__(self, RNA):
        RNA = ''.join([i if i in 'ATCG' else 'O' for i in RNA.replace('U', 'T')])
        kmers = [RNA[i:i+self.k] for i in range(len(RNA)-self.k+1)] + ['<EOS>']
        return torch.tensor( [self.kmers2id[i] for i in kmers if i in self.kmers2id],dtype=torch.long,device=self.device ).view(1,-1)

def vote_predict(localizers, RNA):
    num = len(localizers)
    res = Counter({})
    for localizer in localizers:
        res += Counter(localizer.predict(RNA))
    return {k:res[k]/num for k in res.keys()}
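# Example of the k-mer tokenization performed by __transform__ (k=3, made-up input):
#   "AUCGA" --(U->T, non-ATCG->O)--> "ATCGA" --> ["ATC","TCG","CGA"] + ['<EOS>']
# Each k-mer is then mapped to its id via kmers2id; k-mers unseen during training
# are silently dropped.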
--------------------------------------------------------------------------------
/model/metrics.py:
--------------------------------------------------------------------------------
import numpy as np
from sklearn import metrics as skmetrics
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

def lgb_MaF(preds, dtrain):
    Y = np.array(dtrain.get_label(), dtype=np.int32)
    preds = preds.reshape(-1,len(Y))
    Y_pre = np.argmax( preds, axis=0 )
    return 'macro_f1', float(F1(preds.shape[0], Y_pre, Y, average='macro')), True

def lgb_precision(preds, dtrain):
    Y = dtrain.get_label()
    preds = preds.reshape(-1,len(Y))
    Y_pre = np.argmax( preds, axis=0 )
    return 'precision', float(Counter(Y==Y_pre)[True]/len(Y)), True

class Metrictor:
    def __init__(self, classNum):
        self.classNum = classNum
        self._reporter_ = {"MaF":self.MaF, "MiF":self.MiF,
                           "ACC":self.ACC,
                           "MaAUC":self.MaAUC, "MiAUC":self.MiAUC,
                           "MaMCC":self.MaMCC, "MiMCC":self.MiMCC}
    def __call__(self, report, end='\n'):
        res = {}
        for mtc in report:
            v = self._reporter_[mtc]()
            print(f" {mtc}={v:6.3f}", end=';')
            res[mtc] = v
        print(end=end)
        return res
    def set_data(self, Y_prob_pre, Y):
        self.Y_prob_pre,self.Y = Y_prob_pre,Y
        self.Y_pre = Y_prob_pre.argmax(axis=1)
        self.N = len(Y)
    @staticmethod
    def table_show(resList, report, rowName='CV'):
        lineLen = len(report)*8 + 6
        print("="*(lineLen//2-6) + "FINAL RESULT" + "="*(lineLen//2-6))
        print(f"{'-':^6}" + "".join([f"{i:>8}" for i in report]))
        for i,res in enumerate(resList):
            print(f"{rowName+'_'+str(i+1):^6}" + "".join([f"{res[j]:>8.3f}" for j in report]))
        print(f"{'MEAN':^6}" + "".join([f"{np.mean([res[i] for res in resList]):>8.3f}" for i in report]))
        print("======" + "========"*len(report))
    def each_class_indicator_show(self, id2lab):
        id2lab = np.array(id2lab)
        Yarr = np.zeros((self.N, self.classNum), dtype='int32')
        Yarr[list(range(self.N)),self.Y] = 1
        TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(self.classNum, self.Y_pre, self.Y)
        MCCi = fill_inf((TPi*TNi - FPi*FNi) / np.sqrt( (TPi+FPi)*(TPi+FNi)*(TNi+FPi)*(TNi+FNi) ), np.nan)
        Pi = fill_inf(TPi/(TPi+FPi))
        Ri = fill_inf(TPi/(TPi+FNi))
        Fi = fill_inf(2*Pi*Ri/(Pi+Ri))
        sortedIndex = np.argsort(id2lab)
        classRate = Yarr.sum(axis=0)[sortedIndex] / self.N
        id2lab,MCCi,Pi,Ri,Fi = id2lab[sortedIndex],MCCi[sortedIndex],Pi[sortedIndex],Ri[sortedIndex],Fi[sortedIndex]
        print("-"*28 + "MACRO INDICATOR" + "-"*28)
        print(f"{'':30}{'rate':<8}{'MCCi':<8}{'Pi':<8}{'Ri':<8}{'Fi':<8}")
        for i,c in enumerate(id2lab):
            print(f"{c:30}{classRate[i]:<8.2f}{MCCi[i]:<8.3f}{Pi[i]:<8.3f}{Ri[i]:<8.3f}{Fi[i]:<8.3f}")
        print("-"*70)
    def MaF(self):
        return F1(self.classNum, self.Y_pre, self.Y, average='macro')
    def MiF(self, showInfo=False):
        return F1(self.classNum, self.Y_pre, self.Y, average='micro')
    def ACC(self):
        return ACC(self.classNum, self.Y_pre, self.Y)
    def MaMCC(self):
        return MCC(self.classNum, self.Y_pre, self.Y, average='macro')
    def MiMCC(self):
        return MCC(self.classNum, self.Y_pre, self.Y, average='micro')
    def MaAUC(self):
        return AUC(self.classNum, self.Y_prob_pre, self.Y, average='macro')
    def MiAUC(self):
        return AUC(self.classNum, self.Y_prob_pre, self.Y, average='micro')

def _TPiFPiTNiFNi(classNum, Y_pre, Y):
    Yarr, Yarr_pre = np.zeros((len(Y), classNum), dtype='int32'), np.zeros((len(Y), classNum), dtype='int32')
    Yarr[list(range(len(Y))),Y] = 1
    Yarr_pre[list(range(len(Y))),Y_pre] = 1
    isValid = (Yarr.sum(axis=0) + Yarr_pre.sum(axis=0))>0
    Yarr,Yarr_pre = Yarr[:,isValid],Yarr_pre[:,isValid]
    TPi = np.array([Yarr_pre[:,i][Yarr[:,i]==1].sum() for i in range(Yarr.shape[1])], dtype='float32')
    FPi = Yarr_pre.sum(axis=0) - TPi
    TNi = (1^Yarr).sum(axis=0) - FPi
    FNi = Yarr.sum(axis=0) - TPi
    return TPi,FPi,TNi,FNi

def ACC(classNum, Y_pre, Y):
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    return TPi.sum() / len(Y)

def AUC(classNum, Y_prob_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    Yarr = np.zeros((len(Y), classNum), dtype='int32')
    Yarr[list(range(len(Y))),Y] = 1
    return skmetrics.roc_auc_score(Yarr, Y_prob_pre, average=average)

def MCC(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    if average=='micro':
        TP,FP,TN,FN = TPi.sum(),FPi.sum(),TNi.sum(),FNi.sum()
        MiMCC = fill_inf((TP*TN - FP*FN) / np.sqrt( (TP+FP)*(TP+FN)*(TN+FP)*(TN+FN) ), np.nan)
        return MiMCC
    else:
        MCCi = fill_inf( (TPi*TNi - FPi*FNi) / np.sqrt((TPi+FPi)*(TPi+FNi)*(TNi+FPi)*(TNi+FNi)), np.nan )
        return MCCi.mean()

def Precision(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    if average=='micro':
        MiP = fill_inf(TPi.sum() / (TPi.sum() + FPi.sum()))
        return MiP
    else:
        Pi = fill_inf(TPi/(TPi+FPi))
        return Pi.mean()

def Recall(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    if average=='micro':
        MiR = fill_inf(TPi.sum() / (TPi.sum() + FNi.sum()))
        return MiR
    else:
        Ri = fill_inf(TPi/(TPi + FNi))
        return Ri.mean()

def F1(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    if average=='micro':
        MiP,MiR = Precision(classNum, Y_pre, Y, average='micro'),Recall(classNum, Y_pre, Y, average='micro')
        MiF = fill_inf(2*MiP*MiR/(MiP+MiR))
        return MiF
    else:
        TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
        Pi,Ri = TPi/(TPi + FPi),TPi/(TPi + FNi)
        Pi[Pi==np.inf],Ri[Ri==np.inf] = 0.0,0.0
        Fi = fill_inf(2*Pi*Ri/(Pi+Ri))
        return Fi.mean()

from collections.abc import Iterable
def fill_inf(x, v=0.0):
    if isinstance(x, Iterable):
        x[np.isinf(x)] = v
        x[np.isnan(x)] = v
    else:
        x = v if np.isinf(x) else x
        x = v if np.isnan(x) else x
    return x
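# Minimal usage sketch of Metrictor (values are hypothetical):
#   metrictor = Metrictor(classNum=5)
#   metrictor.set_data(Y_prob_pre, Y)  # Y_prob_pre: N×5 probability matrix, Y: N int labels
#   res = metrictor(['ACC','MaF'])     # prints " ACC= ...; MaF= ...;" and returns the dict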
--------------------------------------------------------------------------------
/model/nnLayer.py:
--------------------------------------------------------------------------------
from torch import nn as nn
from torch.nn import functional as F
import torch,time,os
import numpy as np

class TextSPP(nn.Module):
    def __init__(self, size=128, name='textSpp'):
        super(TextSPP, self).__init__()
        self.name = name
        self.spp = nn.AdaptiveAvgPool1d(size)
    def forward(self, x):
        return self.spp(x.cpu()).to(x.device)

class TextSPP2(nn.Module):
    def __init__(self, size=128, name='textSpp2'):
        super(TextSPP2, self).__init__()
        self.name = name
        self.spp1 = nn.AdaptiveMaxPool1d(size)
        self.spp2 = nn.AdaptiveAvgPool1d(size)
    def forward(self, x):
        x1 = self.spp1(x).unsqueeze(dim=3) # => batchSize × feaSize × size × 1
        x2 = self.spp2(x).unsqueeze(dim=3) # => batchSize × feaSize × size × 1
        x3 = -self.spp1(-x).unsqueeze(dim=3) # => batchSize × feaSize × size × 1
        return torch.cat([x1,x2,x3], dim=3) # => batchSize × feaSize × size × 3

class TextEmbedding(nn.Module):
    def __init__(self, embedding, dropout=0.3, freeze=False, name='textEmbedding'):
        super(TextEmbedding, self).__init__()
        self.name = name
        self.embedding = nn.Embedding.from_pretrained(embedding, freeze=freeze)
        self.dropout = nn.Dropout(p=dropout)
    def forward(self, x):
        # x: batchSize × seqLen
        return self.dropout(self.embedding(x))

class TextCNN(nn.Module):
    def __init__(self, feaSize, contextSizeList, filterNum, name='textCNN'):
        super(TextCNN, self).__init__()
        self.name = name
        moduleList = []
        for i in range(len(contextSizeList)):
            moduleList.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=feaSize, out_channels=filterNum, kernel_size=contextSizeList[i]),
                    nn.ReLU(),
                    nn.AdaptiveMaxPool1d(1)
                )
            )
        self.conv1dList = nn.ModuleList(moduleList)
    def forward(self, x):
        # x: batchSize × feaSize × seqLen
        x = [conv(x).squeeze(dim=2) for conv in self.conv1dList] # => scaleNum * (batchSize × filterNum)
        return torch.cat(x, dim=1) # => batchSize × scaleNum*filterNum

class TextAYNICNN(nn.Module):
    def __init__(self, featureSize, filterSize, contextSizeList=[1,3,5], name='textAYNICNN'):
        super(TextAYNICNN, self).__init__()
        moduleList = []
        for i in range(len(contextSizeList)):
            moduleList.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=featureSize, out_channels=filterSize, kernel_size=contextSizeList[i], padding=contextSizeList[i]//2),
                    nn.ReLU(),
                )
            )
        self.feaConv1dList = nn.ModuleList(moduleList)
        self.attnConv1d = nn.Sequential(
            nn.Conv1d(in_channels=featureSize+filterSize*len(contextSizeList), out_channels=1, kernel_size=5, padding=2),
            nn.Softmax(dim=2)
        )
        self.name = name
    def forward(self, x):
        # x: batchSize × feaSize × seqLen
        fea = torch.cat([conv(x) for conv in self.feaConv1dList],dim=1) # => batchSize × filterSize*contextNum × seqLen
        xfea = torch.cat([x,fea], dim=1) # => batchSize × (feaSize+filterSize*contextNum) × seqLen
        alpha = self.attnConv1d(xfea).transpose(1,2) # => batchSize × seqLen × 1
        return torch.matmul(xfea, alpha).squeeze(dim=2) # => batchSize × (feaSize+filterSize*contextNum)

class TextAttnCNN(nn.Module):
    def __init__(self, feaSize, contextSizeList, filterNum, seqMaxLen, name='textAttnCNN'):
        super(TextAttnCNN, self).__init__()
        self.name = name
        self.seqMaxLen = seqMaxLen
        moduleList = []
        for i in range(len(contextSizeList)):
            moduleList.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=feaSize, out_channels=filterNum, kernel_size=contextSizeList[i]),
                    nn.ReLU(),
                    SimpleAttention(filterNum, filterNum//4, actFunc=nn.ReLU, name=f'SimpleAttention{i}', transpose=True)
                )
            )
        self.attnConv1dList = nn.ModuleList(moduleList)
    def forward(self, x):
        x = [attnConv(x) for attnConv in self.attnConv1dList]
        return torch.cat(x, dim=1) # => batchSize × scaleNum*filterNum
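# For orientation, the full DeepLncLoc pipeline assembled from these layers in
# TextClassifier_SPPCNN (see DL_ClassifierModel.py) has the following shapes:
#   token ids (B × L) --TextEmbedding--> B × L × d --transpose--> B × d × L
#   --TextSPP--> B × d × s --TextCNN--> B × scaleNum*f --MLP--> B × classNum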
class TextBiLSTM(nn.Module):
    def __init__(self, feaSize, hiddenSize, name='textBiLSTM'):
        super(TextBiLSTM, self).__init__()
        self.name = name
        self.biLSTM = nn.LSTM(feaSize, hiddenSize, bidirectional=True, batch_first=True)
    def forward(self, x, xlen=None):
        # x: batchSize × seqLen × feaSize
        if xlen is not None:
            xlen, indices = torch.sort(xlen, descending=True)
            _, desortedIndices = torch.sort(indices, descending=False)

            x = nn.utils.rnn.pack_padded_sequence(x[indices], xlen.cpu(), batch_first=True)
        output, hn = self.biLSTM(x) # output: batchSize × seqLen × hiddenSize*2; hn: numLayers*2 × batchSize × hiddenSize
        if xlen is not None:
            output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
            return output[desortedIndices]
        return output # output: batchSize × seqLen × hiddenSize*2

class TextBiGRU(nn.Module):
    def __init__(self, feaSize, hiddenSize, num_layers=1, dropout=0.0, name='textBiGRU'):
        super(TextBiGRU, self).__init__()
        self.name = name
        self.biGRU = nn.GRU(feaSize, hiddenSize, bidirectional=True, batch_first=True, num_layers=num_layers, dropout=dropout)
    def forward(self, x, xlen=None):
        # x: batchSize × seqLen × feaSize
        if xlen is not None:
            xlen, indices = torch.sort(xlen, descending=True)
            _, desortedIndices = torch.sort(indices, descending=False)

            x = nn.utils.rnn.pack_padded_sequence(x[indices], xlen.cpu(), batch_first=True)
        output, hn = self.biGRU(x) # output: batchSize × seqLen × hiddenSize*2; hn: numLayers*2 × batchSize × hiddenSize
        if xlen is not None:
            output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
            return output[desortedIndices]
        return output # output: batchSize × seqLen × hiddenSize*2

class FastText(nn.Module):
    def __init__(self, feaSize, name='fastText'):
        super(FastText, self).__init__()
        self.name = name
    def forward(self, x, xLen):
        # x: batchSize × seqLen × feaSize; xLen: batchSize
        x = torch.sum(x, dim=1) / xLen.float().view(-1,1)
        return x

class MLP(nn.Module):
    def __init__(self, inSize, outSize, hiddenList=[], dropout=0.1, name='MLP', actFunc=nn.ReLU):
        super(MLP, self).__init__()
        self.name = name
        layers = nn.Sequential()
        for i,os in enumerate(hiddenList):
            layers.add_module(str(i*2), nn.Linear(inSize, os))
            layers.add_module(str(i*2+1), actFunc())
            inSize = os
        self.hiddenLayers = layers
        self.dropout = nn.Dropout(p=dropout)
        self.out = nn.Linear(inSize, outSize)
    def forward(self, x):
        x = self.hiddenLayers(x)
        return self.out(self.dropout(x))

class SimpleAttention(nn.Module):
    def __init__(self, inSize, outSize, actFunc=nn.Tanh, name='SimpleAttention', transpose=False):
        super(SimpleAttention, self).__init__()
        self.name = name
        self.W = nn.Parameter(torch.randn(size=(outSize,1), dtype=torch.float32))
        self.attnWeight = nn.Sequential(
            nn.Linear(inSize, outSize),
            actFunc()
        )
        self.transpose = transpose
    def forward(self, input):
        if self.transpose:
            input = input.transpose(1,2)
        # input: batchSize × seqLen × inSize
        H = self.attnWeight(input) # => batchSize × seqLen × outSize
        alpha = F.softmax(torch.matmul(H,self.W), dim=1) # => batchSize × seqLen × 1
        return torch.matmul(input.transpose(1,2), alpha).squeeze(2) # => batchSize × inSize

class ConvAttention(nn.Module):
    def __init__(self, feaSize, contextSize, transpose=True, name='convAttention'):
        super(ConvAttention, self).__init__()
        self.name = name
        self.attnConv = nn.Sequential(
            nn.Conv1d(in_channels=feaSize, out_channels=1, kernel_size=contextSize, padding=contextSize//2),
            nn.Softmax(dim=2)
        )
        self.transpose = transpose
    def forward(self, x):
        if self.transpose:
            x = x.transpose(1,2)
        # x: batchSize × feaSize × seqLen
        alpha = self.attnConv(x) # => batchSize × 1 × seqLen
        return alpha.transpose(1,2) # => batchSize × seqLen × 1

class FocalCrossEntropyLoss(nn.Module):
    def __init__(self, gamma=2, weight=None, logit=True):
        super(FocalCrossEntropyLoss, self).__init__()
        self.weight = torch.tensor(weight, dtype=torch.float32) if weight is not None else weight
        self.gamma = gamma
        self.logit = logit
    def forward(self, Y_pre, Y):
        if self.logit:
            Y_pre = F.softmax(Y_pre, dim=1)
        P = Y_pre[list(range(len(Y))), Y]
        if self.weight is not None:
            w = self.weight.to(Y_pre.device)[Y]
        else:
            w = torch.ones(len(Y), device=Y_pre.device)
        w = w/w.sum()
        return -(w * (1-P)**self.gamma * torch.log(P)).sum()

class HierarchicalSoftmax(nn.Module):
    def __init__(self, inSize, hierarchicalStructure, lab2id, hiddenList1=[], hiddenList2=[], dropout=0.1, name='HierarchicalSoftmax'):
        super(HierarchicalSoftmax, self).__init__()
        self.name = name
        self.dropout = nn.Dropout(p=dropout)
        layers = nn.Sequential()
        for i,os in enumerate(hiddenList1):
            layers.add_module(str(i*2), nn.Linear(inSize, os))
            layers.add_module(str(i*2+1), nn.ReLU())
            inSize = os
        self.hiddenLayers1 = layers
        moduleList = [nn.Linear(inSize, len(hierarchicalStructure))]

        layers = nn.Sequential()
        for i,os in enumerate(hiddenList2):
            layers.add_module(str(i*2), nn.Linear(inSize, os))
            layers.add_module(str(i*2+1), nn.ReLU())
            inSize = os
        self.hiddenLayers2 = layers

        for i in hierarchicalStructure:
            moduleList.append( nn.Linear(inSize, len(i)) )
            for j in range(len(i)):
                i[j] = lab2id[i[j]]
        self.hierarchicalNum = [len(i) for i in hierarchicalStructure]
        self.restoreIndex = np.argsort(sum(hierarchicalStructure,[]))
        self.linearList = nn.ModuleList(moduleList)
    def forward(self, x):
        # x: batchSize × feaSize
        x = self.hiddenLayers1(x)
        x = self.dropout(x)
        y = [F.softmax(linear(x), dim=1) for linear in self.linearList[:1]]
        x = self.hiddenLayers2(x)
        y += [F.softmax(linear(x), dim=1) for linear in self.linearList[1:]]
        y = torch.cat([y[0][:,i-1].unsqueeze(1)*y[i] for i in range(1,len(y))], dim=1) # => batchSize × classNum
        return y[:,self.restoreIndex]
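# FocalCrossEntropyLoss implements the focal loss FL(p_t) = -(1-p_t)^gamma * log(p_t)
# (Lin et al., 2017): gamma > 0 down-weights well-classified samples so that
# training focuses on the hard ones; with gamma = 0 it reduces to a (weighted)
# cross-entropy.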
--------------------------------------------------------------------------------
/model/utils.py:
--------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from gensim.models import Word2Vec
import numpy as np
from tqdm import tqdm
import os,logging,pickle,random,torch
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class DataClass:
    def __init__(self, dataPath, validSize, testSize, kmers=3, check=True):
        validSize *= 1.0/(1.0-testSize)
        # Open files and load data
        print('Loading the raw data...')
        with open(dataPath,'r') as f:
            data = []
            for i in tqdm(f):
                data.append(i.strip().split('\t'))
        self.kmers = kmers
        # Get labels and split sentences
        RNA,Lab = [[i[1][j:j+kmers] for j in range(len(i[1])-kmers+1)] for i in data],[i[2] for i in data]
        # Get the mapping variables for label and label_id
        print('Getting the mapping variables for label and label id......')
        self.lab2id,self.id2lab = {},[]
        cnt = 0
        for lab in tqdm(Lab):
            if lab not in self.lab2id:
                self.lab2id[lab] = cnt
                self.id2lab.append(lab)
                cnt += 1
        self.classNum = cnt
        # Get the mapping variables for kmers and kmers_id
        print('Getting the mapping variables for kmers and kmers id......')
        self.kmers2id,self.id2kmers = {"<EOS>":0},["<EOS>"]
        kmersCnt = 1
        for rna in tqdm(RNA):
            for kmers in rna:
                if kmers not in self.kmers2id:
                    self.kmers2id[kmers] = kmersCnt
                    self.id2kmers.append(kmers)
                    kmersCnt += 1
        self.kmersNum = kmersCnt
        # Tokenize the sentences and labels
        self.RNADoc = RNA
        self.Lab = np.array( [self.lab2id[i] for i in Lab],dtype='int32' )
        self.RNASeqLen = np.array([len(s)+1 for s in self.RNADoc],dtype='int32')
        self.tokenizedRNASent = np.array([[self.kmers2id[i] for i in s] for s in self.RNADoc], dtype=object)
        self.vector = {}
        print('Start train_valid_test split......')
        while True:
            restIdList,testIdList = train_test_split(range(len(data)), test_size=testSize) if testSize>0.0 else (list(range(len(data))),[])
            trainIdList,validIdList = train_test_split(restIdList, test_size=validSize) if validSize>0.0 else (restIdList,[])
            trainSampleNum,validSampleNum,testSampleNum = len(trainIdList),len(validIdList),len(testIdList)
            totalSampleNum = trainSampleNum + validSampleNum + testSampleNum
            if not check:
                break
            elif (trainSampleNum==0 or len(set(self.Lab[trainIdList]))==self.classNum) and \
                 (validSampleNum==0 or len(set(self.Lab[validIdList]))==self.classNum) and \
                 (testSampleNum==0 or len(set(self.Lab[testIdList]))==self.classNum):
                break
            else:
                continue
        self.trainIdList,self.validIdList,self.testIdList = trainIdList,validIdList,testIdList
        self.trainSampleNum,self.validSampleNum,self.testSampleNum = len(trainIdList),len(validIdList),len(testIdList)
        self.totalSampleNum = self.trainSampleNum+self.validSampleNum+self.testSampleNum
        print('train sample size:',len(self.trainIdList))
        print('valid sample size:',len(self.validIdList))
        print('test sample size:',len(self.testIdList))
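    # Expected data.txt format (inferred from the indexing above): one sample per
    # line with tab-separated fields, where field 1 (i[1]) is the RNA sequence and
    # field 2 (i[2]) is the subcellular-localization label, i.e. roughly
    #   <name>\t<sequence>\t<label>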
    def describe(self):
        print(f'===========DataClass Describe===========')
        print(f'{"CLASS":<16}{"TRAIN":<8}{"VALID":<8}{"TEST":<8}')
        for i,c in enumerate(self.id2lab):
            trainIsC = sum(self.Lab[self.trainIdList]==i)/self.trainSampleNum if self.trainSampleNum>0 else -1.0
            validIsC = sum(self.Lab[self.validIdList]==i)/self.validSampleNum if self.validSampleNum>0 else -1.0
            testIsC = sum(self.Lab[self.testIdList]==i) /self.testSampleNum if self.testSampleNum>0 else -1.0
            print(f'{c:<16}{trainIsC:<8.3f}{validIsC:<8.3f}{testIsC:<8.3f}')
        print(f'========================================')

    def vectorize(self, method="char2vec", feaSize=512, window=5, sg=1,
                  workers=8, loadCache=True):
        if os.path.exists(f'cache/{method}_k{self.kmers}_d{feaSize}.pkl') and loadCache:
            with open(f'cache/{method}_k{self.kmers}_d{feaSize}.pkl', 'rb') as f:
                if method=='kmers':
                    tmp = pickle.load(f)
                    self.vector['encoder'],self.kmersFea = tmp['encoder'],tmp['kmersFea']
                else:
                    self.vector['embedding'] = pickle.load(f)
            print(f'Loaded cache from cache/{method}_k{self.kmers}_d{feaSize}.pkl.')
            return
        if method == 'char2vec':
            doc = [i+['<EOS>'] for i in self.RNADoc]
            model = Word2Vec(doc, min_count=0, window=window, vector_size=feaSize, workers=workers, sg=sg, epochs=10)
            char2vec = np.zeros((self.kmersNum, feaSize), dtype=np.float32)
            for i in range(self.kmersNum):
                char2vec[i] = model.wv[self.id2kmers[i]]
            self.vector['embedding'] = char2vec
        elif method == 'glovechar':
            from glove import Glove,Corpus
            doc = [i+['<EOS>'] for i in self.RNADoc]
            corpus = Corpus()
            corpus.fit(doc, window=window)
            glove = Glove(no_components=feaSize)
            glove.fit(corpus.matrix, epochs=10, no_threads=workers, verbose=True)
            glove.add_dictionary(corpus.dictionary)
            gloveVec = np.zeros((self.kmersNum, feaSize), dtype=np.float32)
            for i in range(self.kmersNum):
                gloveVec[i] = glove.word_vectors[glove.dictionary[self.id2kmers[i]]]
            self.vector['embedding'] = gloveVec
        elif method == 'kmers':
            enc = OneHotEncoder(categories='auto')
            enc.fit([[i] for i in self.kmers2id.values()])
            feaSize = len(self.kmers2id)
            kmers = np.zeros((self.totalSampleNum, feaSize))
            bs = 50000
            print('Getting the kmers vector......')
            for i,t in enumerate(tqdm(self.tokenizedRNASent)):
                for k in range((len(t)+bs-1)//bs):
                    kmers[i] += enc.transform(np.array(t[k*bs:(k+1)*bs]).reshape(-1,1)).toarray().sum(axis=0)
            kmers = kmers[:,1:]
            kmers = (kmers-kmers.mean(axis=0))/kmers.std(axis=0)
            self.vector['encoder'] = enc
            self.kmersFea = kmers
        with open(f'cache/{method}_k{self.kmers}_d{feaSize}.pkl', 'wb') as f:
            if method=='kmers':
                pickle.dump({'encoder':self.vector['encoder'], 'kmersFea':self.kmersFea}, f, protocol=4)
            else:
                pickle.dump(self.vector['embedding'], f, protocol=4)

    def vector_merge(self, vecList, mergeVecName='mergeVec'):
        self.vector[mergeVecName] = np.hstack([self.vector[i] for i in vecList])
        print(f'Get a new vector "{mergeVecName}" with shape {self.vector[mergeVecName].shape}...')
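    # Both data streams below pad every batch to the longest sequence in that batch
    # with the id-0 token ('<EOS>') and yield ({'seqArr','seqLenArr'}, labels)
    # tuples already moved to `device`; random_batch_data_stream loops forever,
    # while one_epoch_batch_data_stream makes a single pass over the chosen split.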
    def random_batch_data_stream(self, batchSize=128, type='train', device=torch.device('cpu')):
        idList = self.trainIdList if type=='train' else self.validIdList
        X,XLen = self.tokenizedRNASent,self.RNASeqLen
        while True:
            random.shuffle(idList)
            for i in range((len(idList)+batchSize-1)//batchSize):
                samples = idList[i*batchSize:(i+1)*batchSize]
                RNASeqMaxLen = XLen[samples].max()
                yield {
                    "seqArr":torch.tensor([i+[0]*(RNASeqMaxLen-len(i)) for i in X[samples]], dtype=torch.long).to(device), \
                    "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device)
                }, torch.tensor(self.Lab[samples], dtype=torch.long).to(device)

    def one_epoch_batch_data_stream(self, batchSize=128, type='valid', device=torch.device('cpu')):
        if type == 'train':
            idList = self.trainIdList
        elif type == 'valid':
            idList = self.validIdList
        elif type == 'test':
            idList = self.testIdList
        X,XLen = self.tokenizedRNASent,self.RNASeqLen
        for i in range((len(idList)+batchSize-1)//batchSize):
            samples = idList[i*batchSize:(i+1)*batchSize]
            RNASeqMaxLen = XLen[samples].max()
            yield {
                "seqArr":torch.tensor([i+[0]*(RNASeqMaxLen-len(i)) for i in X[samples]], dtype=torch.long).to(device), \
                "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device)
            }, torch.tensor(self.Lab[samples], dtype=torch.long).to(device)

--------------------------------------------------------------------------------
/out/README.txt:
--------------------------------------------------------------------------------
The trained models are saved in this folder.

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.22.3
scikit-learn==1.1.1
torch==1.12.0+cu116
gensim==4.0.1

--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
from model.utils import *
from model.DL_ClassifierModel import *
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--k', default='3')
parser.add_argument('--d', default='64')
parser.add_argument('--s', default='64')
parser.add_argument('--f', default='128')
parser.add_argument('--metrics', default='MaF')
parser.add_argument('--device', default='cuda:0')
parser.add_argument('--savePath', default='out/')
args = parser.parse_args()
if __name__ == '__main__':
    k,d,s,f = int(args.k),int(args.d),int(args.s),int(args.f)
    device,path = args.device,args.savePath
    metrics = args.metrics

    report = ["ACC", "MaF", "MiAUC", "MaAUC"]

    dataClass = DataClass('data.txt', 0.2, 0.0, kmers=k)
    dataClass.vectorize("char2vec", feaSize=d, loadCache=True)

    model = TextClassifier_SPPCNN( classNum=5, embedding=dataClass.vector['embedding'], SPPSize=s, feaSize=d, filterNum=f,
                                   contextSizeList=[1,3,5], hiddenList=[],
                                   embDropout=0.3, fcDropout=0.5, useFocalLoss=True, weight=None,
                                   device=device)
    model.cv_train(dataClass, trainSize=1, batchSize=16, stopRounds=-1, earlyStop=10,
                   epoch=100, lr=0.001, kFold=5, metrics=metrics,
                   savePath=f'{path}CNN_s{s}_f{f}_k{k}_d{d}', report=report)
--------------------------------------------------------------------------------