├── Independent_test_set
│   └── test_set.txt
├── LICENSE.txt
├── README.md
├── cache
│   └── README.txt
├── data.txt
├── model
│   ├── DL_ClassifierModel.py
│   ├── lncRNA_lib.py
│   ├── metrics.py
│   ├── nnLayer.py
│   └── utils.py
├── out
│   └── README.txt
├── requirements.txt
└── train.py

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Min Zeng & Yifan Wu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DeepLncLoc
A deep learning-based lncRNA subcellular localization predictor

# Usage
## How to train the model
You can train the model in a very simple way with the command below:
***python train.py --k 3 --d 64 --s 64 --f 128 --metrics MaF --device "cuda:0"***
>***k*** is the k-mer length used to tokenize the sequences.
>***d*** is the dimension of the k-mer embedding vectors trained with the gensim library.
>***s*** is the number of subsequences.
>***f*** is the number of filters in the CNN layer.
>***metrics*** is the evaluation metric used during training: "MaF" for macro F1, "ACC" for accuracy, "MaAUC" for macro AUC, "MiAUC" for micro AUC.
>***device*** is the device used to build and train the model: "cpu" for CPU, "cuda" for GPU, or e.g. "cuda:0" for GPU 0.

You can also use the package we provide to train your model.
First, import the package.
```python
from model.utils import *
from model.DL_ClassifierModel import *
```
Then create the data object and get the embedding features.
```python
dataClass = DataClass('data.txt', validSize=0.2, testSize=0.0, kmers=3)
dataClass.vectorize("char2vec", feaSize=64)
```
Finally, create the model object and start training.
```python
s,f,k,d = 64,128,3,64
model = TextClassifier_SPPCNN(classNum=5, embedding=dataClass.vector['embedding'], SPPSize=s, feaSize=d, filterNum=f, contextSizeList=[1,3,5], embDropout=0.3, fcDropout=0.5, useFocalLoss=True, device="cuda")
model.cv_train(dataClass, trainSize=1, batchSize=16, stopRounds=200, earlyStop=10, epoch=100, kFold=5, savePath=f"out/DeepLncLoc_s{s}_f{f}_k{k}_d{d}", report=['ACC','MaF','MiAUC','MaAUC'])
```

==Note that the model needs to be named "..._sx_fx_kx_dx" (each 'x' stands for a parameter value), so that the parameters can be recovered from the file name to initialize the model architecture correctly at prediction time.==
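For reference, the sketch below mirrors how `model/lncRNA_lib.py` recovers these parameters from a weight-file name (the file name here is a made-up example):
```python
weightPath = "out/DeepLncLoc_s64_f128_k3_d64.pkl"  # hypothetical file name
tmp = weightPath[:-4].split('_')                   # strip ".pkl", split on "_"
params = {p[0]: int(p[1:]) for p in tmp if p[0] in ['s','f','k','d']}
print(params)  # {'s': 64, 'f': 128, 'k': 3, 'd': 64}
```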
## How to do prediction
First, import the package.
```python
from model.lncRNA_lib import *
```
Then instantiate an object.
```python
model = lncRNALocalizer(weightPath="out/xxx_sx_fx_kx_dx.pkl", classNum=5, contextSizeList=[1,3,5], map_location={"cuda:0":"cpu"}, device="cpu")
```
Finally, do the prediction.
```python
res = model.predict("ATCG...")
print(res)
```
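`model/lncRNA_lib.py` also provides a simple ensemble helper, `vote_predict`, which averages the class probabilities returned by several localizers. A minimal sketch, where the weight-file names are placeholders:
```python
localizers = [
    lncRNALocalizer(weightPath="out/model1_s64_f128_k3_d64.pkl", classNum=5),
    lncRNALocalizer(weightPath="out/model2_s64_f128_k3_d64.pkl", classNum=5),
]
res = vote_predict(localizers, "ATCG...")  # {label: averaged probability}
print(res)
```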
## Independent test set
The test_set.txt file in the Independent_test_set folder was used for the comparison with lncLocator and iLoc-lncRNA. The final prediction results of the three predictors (DeepLncLoc, lncLocator, iLoc-lncRNA) can be found in the supplementary materials of the paper.

## Other details
For other details, please refer to the paper and the code.

# Citation
Min Zeng, Yifan Wu, Chengqian Lu, Fuhao Zhang, Fang-Xiang Wu, Min Li. DeepLncLoc: a deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Briefings in Bioinformatics 23 (1), 2022, bbab360.

# License
This project is licensed under the MIT License - see the LICENSE.txt file for details.

--------------------------------------------------------------------------------
/cache/README.txt:
--------------------------------------------------------------------------------
The cache files are saved in this folder.

--------------------------------------------------------------------------------
/model/DL_ClassifierModel.py:
--------------------------------------------------------------------------------
import numpy as np
import torch,time,os,pickle
from torch import nn as nn
from .nnLayer import *
from .metrics import *
from collections import Counter
from collections.abc import Iterable
from sklearn.model_selection import StratifiedKFold

class BaseClassifier:
    def __init__(self):
        pass
    def calculate_y_logit(self, X, XLen):
        pass
    def cv_train(self, dataClass, trainSize=256, batchSize=256, epoch=100, stopRounds=10, earlyStop=10, saveRounds=1,
                 optimType='Adam', lr=0.001, weightDecay=0, kFold=5, isHigherBetter=True, metrics="MaF", report=["ACC", "MaF"],
                 savePath='model'):
        skf = StratifiedKFold(n_splits=kFold)
        validRes,testRes = [],[]
        tvIdList = dataClass.trainIdList+dataClass.validIdList
        for i,(trainIndices,validIndices) in enumerate(skf.split(tvIdList, dataClass.Lab[tvIdList])):
            print(f'CV_{i+1}:')
            self.reset_parameters()
            dataClass.trainIdList,dataClass.validIdList = [tvIdList[i] for i in trainIndices],[tvIdList[i] for i in validIndices]
            res = self.train(dataClass,trainSize,batchSize,epoch,stopRounds,earlyStop,saveRounds,optimType,lr,weightDecay,
                             isHigherBetter,metrics,report,f"{savePath}_cv{i+1}")
            validRes.append(res)
            if dataClass.testSampleNum>0:
                testRes.append(self.calculate_indicator_by_iterator(dataClass.one_epoch_batch_data_stream(type='test',batchSize=trainSize,device=self.device), dataClass.classNum, report))
        Metrictor.table_show(validRes, report)
        if dataClass.testSampleNum>0:
            Metrictor.table_show(testRes, report)
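    # Note: `trainSize` is the physical (per-step) batch size and `batchSize` the
    # effective one. train() below accumulates gradients for batchSize//trainSize
    # steps before each optimizer update, which is why batchSize must be divisible
    # by trainSize (e.g. trainSize=1, batchSize=16 as used in train.py).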
    def train(self, dataClass, trainSize=256, batchSize=256, epoch=100, stopRounds=10, earlyStop=10, saveRounds=1,
              optimType='Adam', lr=0.001, weightDecay=0, isHigherBetter=True, metrics="MaF", report=["ACC", "MiF"],
              savePath='model'):
        dataClass.describe()
        assert batchSize%trainSize==0
        metrictor = Metrictor(dataClass.classNum)
        self.stepCounter = 0
        self.stepUpdate = batchSize//trainSize
        optimizer = getattr(torch.optim, optimType)(self.moduleList.parameters(), lr=lr, weight_decay=weightDecay)
        trainStream = dataClass.random_batch_data_stream(batchSize=trainSize, type='train', device=self.device)
        itersPerEpoch = (dataClass.trainSampleNum+trainSize-1)//trainSize
        mtc,bestMtc,stopSteps = 0.0,0.0,0
        if dataClass.validSampleNum>0: validStream = dataClass.random_batch_data_stream(batchSize=trainSize, type='valid', device=self.device)
        st = time.time()
        for e in range(epoch):
            for i in range(itersPerEpoch):
                self.to_train_mode()
                X,Y = next(trainStream)
                loss = self._train_step(X,Y, optimizer)
                if stopRounds>0 and (e*itersPerEpoch+i+1)%stopRounds==0:
                    self.to_eval_mode()
                    print(f"After iters {e*itersPerEpoch+i+1}: [train] loss= {loss:.3f};", end='')
                    if dataClass.validSampleNum>0:
                        X,Y = next(validStream)
                        loss = self.calculate_loss(X,Y)
                        print(f' [valid] loss= {loss:.3f};', end='')
                    restNum = ((itersPerEpoch-i-1)+(epoch-e-1)*itersPerEpoch)*trainSize
                    speed = (e*itersPerEpoch+i+1)*trainSize/(time.time()-st)
                    print(" speed: %.3lf items/s; remaining time: %.3lfs;"%(speed, restNum/speed))
            if dataClass.validSampleNum>0 and (e+1)%saveRounds==0:
                self.to_eval_mode()
                print(f'========== Epoch:{e+1:5d} ==========')
                Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='train', device=self.device))
                metrictor.set_data(Y_pre, Y)
                print(f'[Total Train]',end='')
                metrictor(report)
                print(f'[Total Valid]',end='')
                Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='valid', device=self.device))
                metrictor.set_data(Y_pre, Y)
                res = metrictor(report)
                mtc = res[metrics]
                print('=================================')
                if (mtc>bestMtc and isHigherBetter) or (mtc<bestMtc and not isHigherBetter):
                    print(f'Got a better model with val {metrics}: {mtc:.3f}.')
                    bestMtc = mtc
                    self.save("%s.pkl"%savePath, e+1, bestMtc, dataClass)
                    stopSteps = 0
                else:
                    stopSteps += 1
                    if stopSteps>=earlyStop:
                        print(f'The val {metrics} has not improved for more than {earlyStop} steps in epoch {e+1}, stop training.')
                        break
        self.load("%s.pkl"%savePath)
        os.rename("%s.pkl"%savePath, "%s_%s.pkl"%(savePath, ("%.3lf"%bestMtc)[2:]))
        print(f'============ Result ============')
        Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='train', device=self.device))
        metrictor.set_data(Y_pre, Y)
        print(f'[Total Train]',end='')
        metrictor(report)
        Y_pre,Y = self.calculate_y_prob_by_iterator(dataClass.one_epoch_batch_data_stream(trainSize, type='valid', device=self.device))
        metrictor.set_data(Y_pre, Y)
        print(f'[Total Valid]',end='')
        res = metrictor(report)
        metrictor.each_class_indicator_show(dataClass.id2lab)
        print(f'================================')
        return res
    def reset_parameters(self):
        for module in self.moduleList:
            for subModule in module.modules():
                if hasattr(subModule, "reset_parameters"):
                    subModule.reset_parameters()
    def save(self, path, epochs, bestMtc=None, dataClass=None):
        stateDict = {'epochs':epochs, 'bestMtc':bestMtc}
        for module in self.moduleList:
            stateDict[module.name] = module.state_dict()
        if dataClass is not None:
            stateDict['trainIdList'],stateDict['validIdList'],stateDict['testIdList'] = dataClass.trainIdList,dataClass.validIdList,dataClass.testIdList
            stateDict['lab2id'],stateDict['id2lab'] = dataClass.lab2id,dataClass.id2lab
            stateDict['kmers2id'],stateDict['id2kmers'] = dataClass.kmers2id,dataClass.id2kmers
        torch.save(stateDict, path)
        print('Model saved in "%s".'%path)
    def load(self, path, map_location=None, dataClass=None):
        parameters = torch.load(path, map_location=map_location)
        for module in self.moduleList:
            module.load_state_dict(parameters[module.name])
        if dataClass is not None:
            if "trainIdList" in parameters:
                dataClass.trainIdList = parameters['trainIdList']
            if "validIdList" in parameters:
                dataClass.validIdList = parameters['validIdList']
            if "testIdList" in parameters:
                dataClass.testIdList = parameters['testIdList']
        print("Loaded a model trained for %d epochs with best val score %.3lf."%(parameters['epochs'], parameters['bestMtc']))
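    # For reference, the checkpoint written by save() (and consumed by load() and
    # by lncRNALocalizer in lncRNA_lib.py) is a dict of the form:
    #   {'epochs': int, 'bestMtc': float,
    #    <module.name>: state_dict() of each module in self.moduleList,
    #    'trainIdList'/'validIdList'/'testIdList': the data splits,
    #    'lab2id'/'id2lab' and 'kmers2id'/'id2kmers': the vocabulary mappings}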
    def calculate_y_prob(self, X):
        Y_pre = self.calculate_y_logit(X)
        return torch.softmax(Y_pre, dim=1)
    def calculate_y(self, X):
        Y_pre = self.calculate_y_prob(X)
        return torch.argmax(Y_pre, dim=1)
    def calculate_loss(self, X, Y):
        Y_logit = self.calculate_y_logit(X)
        return self.criterion(Y_logit, Y)
    def calculate_indicator_by_iterator(self, dataStream, classNum, report):
        metrictor = Metrictor(classNum)
        Y_prob_pre,Y = self.calculate_y_prob_by_iterator(dataStream)
        metrictor.set_data(Y_prob_pre, Y)
        return metrictor(report)
    def calculate_y_prob_by_iterator(self, dataStream):
        YArr,Y_preArr = [],[]
        while True:
            try:
                X,Y = next(dataStream)
            except StopIteration:
                break
            Y_pre,Y = self.calculate_y_prob(X).cpu().data.numpy(),Y.cpu().data.numpy()
            YArr.append(Y)
            Y_preArr.append(Y_pre)
        YArr,Y_preArr = np.hstack(YArr).astype('int32'),np.vstack(Y_preArr).astype('float32')
        return Y_preArr, YArr
    def calculate_y_by_iterator(self, dataStream):
        Y_preArr, YArr = self.calculate_y_prob_by_iterator(dataStream)
        return Y_preArr.argmax(axis=1), YArr
    def to_train_mode(self):
        for module in self.moduleList:
            module.train()
    def to_eval_mode(self):
        for module in self.moduleList:
            module.eval()
    def _train_step(self, X, Y, optimizer):
        self.stepCounter += 1
        if self.stepCounter<self.stepUpdate:
            p = False
        else:
            self.stepCounter = 0
            p = True
        loss = self.calculate_loss(X, Y)/self.stepUpdate
        loss.backward()
        if p:
            nn.utils.clip_grad_norm_(self.moduleList.parameters(), max_norm=20, norm_type=2)
            optimizer.step()
            optimizer.zero_grad()
        return loss*self.stepUpdate

class TextClassifier_SPPCNN(BaseClassifier):
    def __init__(self, classNum, embedding, SPPSize=128, feaSize=512, filterNum=128, contextSizeList=[1,3,5], hiddenList=[],
                 embDropout=0.3, fcDropout=0.5, useFocalLoss=False, weight=None, device=torch.device("cuda:0")):
        self.textEmbedding = TextEmbedding( torch.tensor(embedding, dtype=torch.float), dropout=embDropout ).to(device)
        self.textSPP = TextSPP(SPPSize).to(device)
        self.textCNN = TextCNN(feaSize, contextSizeList, filterNum).to(device)
        self.fcLinear = MLP(len(contextSizeList)*filterNum, classNum, hiddenList, dropout=fcDropout).to(device)
        self.moduleList = nn.ModuleList([self.textEmbedding, self.textSPP, self.textCNN, self.fcLinear])
        self.classNum = classNum
        self.device = device
        self.criterion = FocalCrossEntropyLoss(weight=weight) if useFocalLoss else nn.CrossEntropyLoss()
    def calculate_y_logit(self, X):
        X = X['seqArr']
        X = self.textEmbedding(X) # => batchSize × seqLen × feaSize
        X = X.transpose(1,2)
        X = self.textSPP(X) # => batchSize × feaSize × sppSize
        X = self.textCNN(X) # => batchSize × scaleNum*filterNum
        return self.fcLinear(X) # => batchSize × classNum

--------------------------------------------------------------------------------
/model/lncRNA_lib.py:
--------------------------------------------------------------------------------
from .nnLayer import *
from torch.nn import functional as F
from collections import Counter

class lncRNALocalizer:
    def __init__(self, weightPath, classNum=5, contextSizeList=[1,3,5], hiddenList=[], map_location=None, device=torch.device("cpu")):
        tmp = weightPath[:-4].split('_')
        params = {p[0]:int(p[1:]) for p in tmp if p[0] in ['s','f','k','d']}
        self.k = params['k']

        stateDict = torch.load(weightPath, map_location=map_location)
        self.lab2id,self.id2lab = stateDict['lab2id'],stateDict['id2lab']
        self.kmers2id,self.id2kmers = stateDict['kmers2id'],stateDict['id2kmers']
        self.textEmbedding = TextEmbedding( torch.zeros((len(self.id2kmers),params['d']), dtype=torch.float) ).to(device)
        self.textSPP = TextSPP(params['s']).to(device)
        self.textCNN = TextCNN(params['d'], contextSizeList, params['f']).to(device)
        self.fcLinear = MLP(len(contextSizeList)*params['f'], classNum, hiddenList).to(device)
        self.moduleList = nn.ModuleList([self.textEmbedding, self.textSPP, self.textCNN, self.fcLinear])
        for module in self.moduleList:
            module.load_state_dict(stateDict[module.name])
            module.eval()

        self.device = device

    def predict(self, x):
        # x: seqLen
        x = self.__transform__(x) # => 1 × seqLen
        x = self.textEmbedding(x).transpose(1,2) # => 1 × feaSize × seqLen
        x = self.textSPP(x) # => 1 × feaSize × sppSize
        x = self.textCNN(x) # => 1 × scaleNum*filterNum
        x = self.fcLinear(x)[0] # => classNum
        return {k:v for k,v in zip(self.id2lab,F.softmax(x, dim=0).data.cpu().numpy())}
    def __transform__(self, RNA):
        RNA = ''.join([i if i in 'ATCG' else 'O' for i in RNA.replace('U', 'T')])
        kmers = [RNA[i:i+self.k] for i in range(len(RNA)-self.k+1)] + ['<EOS>']
        return torch.tensor( [self.kmers2id[i] for i in kmers if i in self.kmers2id],dtype=torch.long,device=self.device ).view(1,-1)

def vote_predict(localizers, RNA):
    num = len(localizers)
    res = Counter({})
    for localizer in localizers:
        res += Counter(localizer.predict(RNA))
    return {k:res[k]/num for k in res.keys()}
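# Example of the k-mer tokenization performed by __transform__ (k=3, made-up input):
#   "AUCGA" --(U->T, non-ATCG->O)--> "ATCGA" --> ["ATC","TCG","CGA"] + ['<EOS>']
# Each k-mer is then mapped to its id via kmers2id; k-mers unseen during training
# are silently dropped.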
--------------------------------------------------------------------------------
/model/metrics.py:
--------------------------------------------------------------------------------
import numpy as np
from sklearn import metrics as skmetrics
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

def lgb_MaF(preds, dtrain):
    Y = np.array(dtrain.get_label(), dtype=np.int32)
    preds = preds.reshape(-1,len(Y))
    Y_pre = np.argmax( preds, axis=0 )
    return 'macro_f1', float(F1(preds.shape[0], Y_pre, Y, average='macro')), True

def lgb_precision(preds, dtrain):
    Y = dtrain.get_label()
    preds = preds.reshape(-1,len(Y))
    Y_pre = np.argmax( preds, axis=0 )
    return 'precision', float(Counter(Y==Y_pre)[True]/len(Y)), True

class Metrictor:
    def __init__(self, classNum):
        self.classNum = classNum
        self._reporter_ = {"MaF":self.MaF, "MiF":self.MiF,
                           "ACC":self.ACC,
                           "MaAUC":self.MaAUC, "MiAUC":self.MiAUC,
                           "MaMCC":self.MaMCC, "MiMCC":self.MiMCC}
    def __call__(self, report, end='\n'):
        res = {}
        for mtc in report:
            v = self._reporter_[mtc]()
            print(f" {mtc}={v:6.3f}", end=';')
            res[mtc] = v
        print(end=end)
        return res
    def set_data(self, Y_prob_pre, Y):
        self.Y_prob_pre,self.Y = Y_prob_pre,Y
        self.Y_pre = Y_prob_pre.argmax(axis=1)
        self.N = len(Y)
    @staticmethod
    def table_show(resList, report, rowName='CV'):
        lineLen = len(report)*8 + 6
        print("="*(lineLen//2-6) + "FINAL RESULT" + "="*(lineLen//2-6))
        print(f"{'-':^6}" + "".join([f"{i:>8}" for i in report]))
        for i,res in enumerate(resList):
            print(f"{rowName+'_'+str(i+1):^6}" + "".join([f"{res[j]:>8.3f}" for j in report]))
        print(f"{'MEAN':^6}" + "".join([f"{np.mean([res[i] for res in resList]):>8.3f}" for i in report]))
        print("======" + "========"*len(report))
    def each_class_indicator_show(self, id2lab):
        id2lab = np.array(id2lab)
        Yarr = np.zeros((self.N, self.classNum), dtype='int32')
        Yarr[list(range(self.N)),self.Y] = 1
        TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(self.classNum, self.Y_pre, self.Y)
        MCCi = fill_inf((TPi*TNi - FPi*FNi) / np.sqrt( (TPi+FPi)*(TPi+FNi)*(TNi+FPi)*(TNi+FNi) ), np.nan)
        Pi = fill_inf(TPi/(TPi+FPi))
        Ri = fill_inf(TPi/(TPi+FNi))
        Fi = fill_inf(2*Pi*Ri/(Pi+Ri))
        sortedIndex = np.argsort(id2lab)
        classRate = Yarr.sum(axis=0)[sortedIndex] / self.N
        id2lab,MCCi,Pi,Ri,Fi = id2lab[sortedIndex],MCCi[sortedIndex],Pi[sortedIndex],Ri[sortedIndex],Fi[sortedIndex]
        print("-"*28 + "MACRO INDICATOR" + "-"*28)
        print(f"{'':30}{'rate':<8}{'MCCi':<8}{'Pi':<8}{'Ri':<8}{'Fi':<8}")
        for i,c in enumerate(id2lab):
            print(f"{c:30}{classRate[i]:<8.2f}{MCCi[i]:<8.3f}{Pi[i]:<8.3f}{Ri[i]:<8.3f}{Fi[i]:<8.3f}")
        print("-"*70)
    def MaF(self):
        return F1(self.classNum, self.Y_pre, self.Y, average='macro')
    def MiF(self, showInfo=False):
        return F1(self.classNum, self.Y_pre, self.Y, average='micro')
    def ACC(self):
        return ACC(self.classNum, self.Y_pre, self.Y)
    def MaMCC(self):
        return MCC(self.classNum, self.Y_pre, self.Y, average='macro')
    def MiMCC(self):
        return MCC(self.classNum, self.Y_pre, self.Y, average='micro')
    def MaAUC(self):
        return AUC(self.classNum, self.Y_prob_pre, self.Y, average='macro')
    def MiAUC(self):
        return AUC(self.classNum, self.Y_prob_pre, self.Y, average='micro')

def _TPiFPiTNiFNi(classNum, Y_pre, Y):
    Yarr, Yarr_pre = np.zeros((len(Y), classNum), dtype='int32'), np.zeros((len(Y), classNum), dtype='int32')
    Yarr[list(range(len(Y))),Y] = 1
    Yarr_pre[list(range(len(Y))),Y_pre] = 1
    isValid = (Yarr.sum(axis=0) + Yarr_pre.sum(axis=0))>0
    Yarr,Yarr_pre = Yarr[:,isValid],Yarr_pre[:,isValid]
    TPi = np.array([Yarr_pre[:,i][Yarr[:,i]==1].sum() for i in range(Yarr.shape[1])], dtype='float32')
    FPi = Yarr_pre.sum(axis=0) - TPi
    TNi = (1^Yarr).sum(axis=0) - FPi
    FNi = Yarr.sum(axis=0) - TPi
    return TPi,FPi,TNi,FNi

def ACC(classNum, Y_pre, Y):
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    return TPi.sum() / len(Y)

def AUC(classNum, Y_prob_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    Yarr = np.zeros((len(Y), classNum), dtype='int32')
    Yarr[list(range(len(Y))),Y] = 1
    return skmetrics.roc_auc_score(Yarr, Y_prob_pre, average=average)

def MCC(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    if average=='micro':
        TP,FP,TN,FN = TPi.sum(),FPi.sum(),TNi.sum(),FNi.sum()
        MiMCC = fill_inf((TP*TN - FP*FN) / np.sqrt( (TP+FP)*(TP+FN)*(TN+FP)*(TN+FN) ), np.nan)
        return MiMCC
    else:
        MCCi = fill_inf( (TPi*TNi - FPi*FNi) / np.sqrt((TPi+FPi)*(TPi+FNi)*(TNi+FPi)*(TNi+FNi)), np.nan )
        return MCCi.mean()

def Precision(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    if average=='micro':
        MiP = fill_inf(TPi.sum() / (TPi.sum() + FPi.sum()))
        return MiP
    else:
        Pi = fill_inf(TPi/(TPi+FPi))
        return Pi.mean()

def Recall(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
    if average=='micro':
        MiR = fill_inf(TPi.sum() / (TPi.sum() + FNi.sum()))
        return MiR
    else:
        Ri = fill_inf(TPi/(TPi + FNi))
        return Ri.mean()

def F1(classNum, Y_pre, Y, average='micro'):
    assert average in ['micro', 'macro']
    if average=='micro':
        MiP,MiR = Precision(classNum, Y_pre, Y, average='micro'),Recall(classNum, Y_pre, Y, average='micro')
        MiF = fill_inf(2*MiP*MiR/(MiP+MiR))
        return MiF
    else:
        TPi,FPi,TNi,FNi = _TPiFPiTNiFNi(classNum, Y_pre, Y)
        Pi,Ri = TPi/(TPi + FPi),TPi/(TPi + FNi)
        Pi[Pi==np.inf],Ri[Ri==np.inf] = 0.0,0.0
        Fi = fill_inf(2*Pi*Ri/(Pi+Ri))
        return Fi.mean()

from collections.abc import Iterable
def fill_inf(x, v=0.0):
    if isinstance(x, Iterable):
        x[np.isinf(x)] = v
        x[np.isnan(x)] = v
    else:
        x = v if np.isinf(x) else x
        x = v if np.isnan(x) else x
    return x
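# Minimal usage sketch of Metrictor (values are hypothetical):
#   metrictor = Metrictor(classNum=5)
#   metrictor.set_data(Y_prob_pre, Y)  # Y_prob_pre: N×5 probability matrix, Y: N int labels
#   res = metrictor(['ACC','MaF'])     # prints " ACC= ...; MaF= ...;" and returns the dict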
--------------------------------------------------------------------------------
/model/nnLayer.py:
--------------------------------------------------------------------------------
from torch import nn as nn
from torch.nn import functional as F
import torch,time,os
import numpy as np

class TextSPP(nn.Module):
    def __init__(self, size=128, name='textSpp'):
        super(TextSPP, self).__init__()
        self.name = name
        self.spp = nn.AdaptiveAvgPool1d(size)
    def forward(self, x):
        return self.spp(x.cpu()).to(x.device)

class TextSPP2(nn.Module):
    def __init__(self, size=128, name='textSpp2'):
        super(TextSPP2, self).__init__()
        self.name = name
        self.spp1 = nn.AdaptiveMaxPool1d(size)
        self.spp2 = nn.AdaptiveAvgPool1d(size)
    def forward(self, x):
        x1 = self.spp1(x).unsqueeze(dim=3) # => batchSize × feaSize × size × 1
        x2 = self.spp2(x).unsqueeze(dim=3) # => batchSize × feaSize × size × 1
        x3 = -self.spp1(-x).unsqueeze(dim=3) # => batchSize × feaSize × size × 1
        return torch.cat([x1,x2,x3], dim=3) # => batchSize × feaSize × size × 3

class TextEmbedding(nn.Module):
    def __init__(self, embedding, dropout=0.3, freeze=False, name='textEmbedding'):
        super(TextEmbedding, self).__init__()
        self.name = name
        self.embedding = nn.Embedding.from_pretrained(embedding, freeze=freeze)
        self.dropout = nn.Dropout(p=dropout)
    def forward(self, x):
        # x: batchSize × seqLen
        return self.dropout(self.embedding(x))

class TextCNN(nn.Module):
    def __init__(self, feaSize, contextSizeList, filterNum, name='textCNN'):
        super(TextCNN, self).__init__()
        self.name = name
        moduleList = []
        for i in range(len(contextSizeList)):
            moduleList.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=feaSize, out_channels=filterNum, kernel_size=contextSizeList[i]),
                    nn.ReLU(),
                    nn.AdaptiveMaxPool1d(1)
                )
            )
        self.conv1dList = nn.ModuleList(moduleList)
    def forward(self, x):
        # x: batchSize × feaSize × seqLen
        x = [conv(x).squeeze(dim=2) for conv in self.conv1dList] # => scaleNum * (batchSize × filterNum)
        return torch.cat(x, dim=1) # => batchSize × scaleNum*filterNum

class TextAYNICNN(nn.Module):
    def __init__(self, featureSize, filterSize, contextSizeList=[1,3,5], name='textAYNICNN'):
        super(TextAYNICNN, self).__init__()
        moduleList = []
        for i in range(len(contextSizeList)):
            moduleList.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=featureSize, out_channels=filterSize, kernel_size=contextSizeList[i], padding=contextSizeList[i]//2),
                    nn.ReLU(),
                )
            )
        self.feaConv1dList = nn.ModuleList(moduleList)
        self.attnConv1d = nn.Sequential(
            nn.Conv1d(in_channels=featureSize+filterSize*len(contextSizeList), out_channels=1, kernel_size=5, padding=2),
            nn.Softmax(dim=2)
        )
        self.name = name
    def forward(self, x):
        # x: batchSize × feaSize × seqLen
        fea = torch.cat([conv(x) for conv in self.feaConv1dList],dim=1) # => batchSize × filterSize*contextNum × seqLen
        xfea = torch.cat([x,fea], dim=1) # => batchSize × (feaSize+filterSize*contextNum) × seqLen
        alpha = self.attnConv1d(xfea).transpose(1,2) # => batchSize × seqLen × 1
        return torch.matmul(xfea, alpha).squeeze(dim=2) # => batchSize × (feaSize+filterSize*contextNum)

class TextAttnCNN(nn.Module):
    def __init__(self, feaSize, contextSizeList, filterNum, seqMaxLen, name='textAttnCNN'):
        super(TextAttnCNN, self).__init__()
        self.name = name
        self.seqMaxLen = seqMaxLen
        moduleList = []
        for i in range(len(contextSizeList)):
            moduleList.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=feaSize, out_channels=filterNum, kernel_size=contextSizeList[i]),
                    nn.ReLU(),
                    SimpleAttention(filterNum, filterNum//4, actFunc=nn.ReLU, name=f'SimpleAttention{i}', transpose=True)
                )
            )
        self.attnConv1dList = nn.ModuleList(moduleList)
    def forward(self, x):
        x = [attnConv(x) for attnConv in self.attnConv1dList]
        return torch.cat(x, dim=1) # => batchSize × scaleNum*filterNum
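# For orientation, the full DeepLncLoc pipeline assembled from these layers in
# TextClassifier_SPPCNN (see DL_ClassifierModel.py) has the following shapes:
#   token ids (B × L) --TextEmbedding--> B × L × d --transpose--> B × d × L
#   --TextSPP--> B × d × s --TextCNN--> B × scaleNum*f --MLP--> B × classNum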
class TextBiLSTM(nn.Module):
    def __init__(self, feaSize, hiddenSize, name='textBiLSTM'):
        super(TextBiLSTM, self).__init__()
        self.name = name
        self.biLSTM = nn.LSTM(feaSize, hiddenSize, bidirectional=True, batch_first=True)
    def forward(self, x, xlen=None):
        # x: batchSize × seqLen × feaSize
        if xlen is not None:
            xlen, indices = torch.sort(xlen, descending=True)
            _, desortedIndices = torch.sort(indices, descending=False)

            x = nn.utils.rnn.pack_padded_sequence(x[indices], xlen.cpu(), batch_first=True)
        output, hn = self.biLSTM(x) # output: batchSize × seqLen × hiddenSize*2; hn: numLayers*2 × batchSize × hiddenSize
        if xlen is not None:
            output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
            return output[desortedIndices]
        return output # output: batchSize × seqLen × hiddenSize*2

class TextBiGRU(nn.Module):
    def __init__(self, feaSize, hiddenSize, num_layers=1, dropout=0.0, name='textBiGRU'):
        super(TextBiGRU, self).__init__()
        self.name = name
        self.biGRU = nn.GRU(feaSize, hiddenSize, bidirectional=True, batch_first=True, num_layers=num_layers, dropout=dropout)
    def forward(self, x, xlen=None):
        # x: batchSize × seqLen × feaSize
        if xlen is not None:
            xlen, indices = torch.sort(xlen, descending=True)
            _, desortedIndices = torch.sort(indices, descending=False)

            x = nn.utils.rnn.pack_padded_sequence(x[indices], xlen.cpu(), batch_first=True)
        output, hn = self.biGRU(x) # output: batchSize × seqLen × hiddenSize*2; hn: numLayers*2 × batchSize × hiddenSize
        if xlen is not None:
            output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
            return output[desortedIndices]
        return output # output: batchSize × seqLen × hiddenSize*2

class FastText(nn.Module):
    def __init__(self, feaSize, name='fastText'):
        super(FastText, self).__init__()
        self.name = name
    def forward(self, x, xLen):
        # x: batchSize × seqLen × feaSize; xLen: batchSize
        x = torch.sum(x, dim=1) / xLen.float().view(-1,1)
        return x

class MLP(nn.Module):
    def __init__(self, inSize, outSize, hiddenList=[], dropout=0.1, name='MLP', actFunc=nn.ReLU):
        super(MLP, self).__init__()
        self.name = name
        layers = nn.Sequential()
        for i,os in enumerate(hiddenList):
            layers.add_module(str(i*2), nn.Linear(inSize, os))
            layers.add_module(str(i*2+1), actFunc())
            inSize = os
        self.hiddenLayers = layers
        self.dropout = nn.Dropout(p=dropout)
        self.out = nn.Linear(inSize, outSize)
    def forward(self, x):
        x = self.hiddenLayers(x)
        return self.out(self.dropout(x))

class SimpleAttention(nn.Module):
    def __init__(self, inSize, outSize, actFunc=nn.Tanh, name='SimpleAttention', transpose=False):
        super(SimpleAttention, self).__init__()
        self.name = name
        self.W = nn.Parameter(torch.randn(size=(outSize,1), dtype=torch.float32))
        self.attnWeight = nn.Sequential(
            nn.Linear(inSize, outSize),
            actFunc()
        )
        self.transpose = transpose
    def forward(self, input):
        if self.transpose:
            input = input.transpose(1,2)
        # input: batchSize × seqLen × inSize
        H = self.attnWeight(input) # => batchSize × seqLen × outSize
        alpha = F.softmax(torch.matmul(H,self.W), dim=1) # => batchSize × seqLen × 1
        return torch.matmul(input.transpose(1,2), alpha).squeeze(2) # => batchSize × inSize

class ConvAttention(nn.Module):
    def __init__(self, feaSize, contextSize, transpose=True, name='convAttention'):
        super(ConvAttention, self).__init__()
        self.name = name
        self.attnConv = nn.Sequential(
            nn.Conv1d(in_channels=feaSize, out_channels=1, kernel_size=contextSize, padding=contextSize//2),
            nn.Softmax(dim=2)
        )
        self.transpose = transpose
    def forward(self, x):
        if self.transpose:
            x = x.transpose(1,2)
        # x: batchSize × feaSize × seqLen
        alpha = self.attnConv(x) # => batchSize × 1 × seqLen
        return alpha.transpose(1,2) # => batchSize × seqLen × 1

class FocalCrossEntropyLoss(nn.Module):
    def __init__(self, gamma=2, weight=None, logit=True):
        super(FocalCrossEntropyLoss, self).__init__()
        self.weight = torch.tensor(weight, dtype=torch.float32) if weight is not None else weight
        self.gamma = gamma
        self.logit = logit
    def forward(self, Y_pre, Y):
        if self.logit:
            Y_pre = F.softmax(Y_pre, dim=1)
        P = Y_pre[list(range(len(Y))), Y]
        if self.weight is not None:
            w = self.weight.to(Y_pre.device)[Y]
        else:
            w = torch.ones(len(Y), device=Y_pre.device)
        w = w/w.sum()
        return -(w * (1-P)**self.gamma * torch.log(P)).sum()

class HierarchicalSoftmax(nn.Module):
    def __init__(self, inSize, hierarchicalStructure, lab2id, hiddenList1=[], hiddenList2=[], dropout=0.1, name='HierarchicalSoftmax'):
        super(HierarchicalSoftmax, self).__init__()
        self.name = name
        self.dropout = nn.Dropout(p=dropout)
        layers = nn.Sequential()
        for i,os in enumerate(hiddenList1):
            layers.add_module(str(i*2), nn.Linear(inSize, os))
            layers.add_module(str(i*2+1), nn.ReLU())
            inSize = os
        self.hiddenLayers1 = layers
        moduleList = [nn.Linear(inSize, len(hierarchicalStructure))]

        layers = nn.Sequential()
        for i,os in enumerate(hiddenList2):
            layers.add_module(str(i*2), nn.Linear(inSize, os))
            layers.add_module(str(i*2+1), nn.ReLU())
            inSize = os
        self.hiddenLayers2 = layers

        for i in hierarchicalStructure:
            moduleList.append( nn.Linear(inSize, len(i)) )
            for j in range(len(i)):
                i[j] = lab2id[i[j]]
        self.hierarchicalNum = [len(i) for i in hierarchicalStructure]
        self.restoreIndex = np.argsort(sum(hierarchicalStructure,[]))
        self.linearList = nn.ModuleList(moduleList)
    def forward(self, x):
        # x: batchSize × feaSize
        x = self.hiddenLayers1(x)
        x = self.dropout(x)
        y = [F.softmax(linear(x), dim=1) for linear in self.linearList[:1]]
        x = self.hiddenLayers2(x)
        y += [F.softmax(linear(x), dim=1) for linear in self.linearList[1:]]
        y = torch.cat([y[0][:,i-1].unsqueeze(1)*y[i] for i in range(1,len(y))], dim=1) # => batchSize × classNum
        return y[:,self.restoreIndex]
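# FocalCrossEntropyLoss implements the focal loss FL(p_t) = -(1-p_t)^gamma * log(p_t)
# (Lin et al., 2017): gamma > 0 down-weights well-classified samples so that
# training focuses on the hard ones; with gamma = 0 it reduces to a (weighted)
# cross-entropy.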
--------------------------------------------------------------------------------
/model/utils.py:
--------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from gensim.models import Word2Vec
import numpy as np
from tqdm import tqdm
import os,logging,pickle,random,torch
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class DataClass:
    def __init__(self, dataPath, validSize, testSize, kmers=3, check=True):
        validSize *= 1.0/(1.0-testSize)
        # Open files and load data
        print('Loading the raw data...')
        with open(dataPath,'r') as f:
            data = []
            for i in tqdm(f):
                data.append(i.strip().split('\t'))
        self.kmers = kmers
        # Get labels and split sentences
        RNA,Lab = [[i[1][j:j+kmers] for j in range(len(i[1])-kmers+1)] for i in data],[i[2] for i in data]
        # Get the mapping variables for label and label_id
        print('Getting the mapping variables for label and label id......')
        self.lab2id,self.id2lab = {},[]
        cnt = 0
        for lab in tqdm(Lab):
            if lab not in self.lab2id:
                self.lab2id[lab] = cnt
                self.id2lab.append(lab)
                cnt += 1
        self.classNum = cnt
        # Get the mapping variables for kmers and kmers_id
        print('Getting the mapping variables for kmers and kmers id......')
        self.kmers2id,self.id2kmers = {"<EOS>":0},["<EOS>"]
        kmersCnt = 1
        for rna in tqdm(RNA):
            for kmers in rna:
                if kmers not in self.kmers2id:
                    self.kmers2id[kmers] = kmersCnt
                    self.id2kmers.append(kmers)
                    kmersCnt += 1
        self.kmersNum = kmersCnt
        # Tokenize the sentences and labels
        self.RNADoc = RNA
        self.Lab = np.array( [self.lab2id[i] for i in Lab],dtype='int32' )
        self.RNASeqLen = np.array([len(s)+1 for s in self.RNADoc],dtype='int32')
        self.tokenizedRNASent = np.array([[self.kmers2id[i] for i in s] for s in self.RNADoc], dtype=object)
        self.vector = {}
        print('Start train_valid_test split......')
        while True:
            restIdList,testIdList = train_test_split(range(len(data)), test_size=testSize) if testSize>0.0 else (list(range(len(data))),[])
            trainIdList,validIdList = train_test_split(restIdList, test_size=validSize) if validSize>0.0 else (restIdList,[])
            trainSampleNum,validSampleNum,testSampleNum = len(trainIdList),len(validIdList),len(testIdList)
            totalSampleNum = trainSampleNum + validSampleNum + testSampleNum
            if not check:
                break
            elif (trainSampleNum==0 or len(set(self.Lab[trainIdList]))==self.classNum) and \
                 (validSampleNum==0 or len(set(self.Lab[validIdList]))==self.classNum) and \
                 (testSampleNum==0 or len(set(self.Lab[testIdList]))==self.classNum):
                break
            else:
                continue
        self.trainIdList,self.validIdList,self.testIdList = trainIdList,validIdList,testIdList
        self.trainSampleNum,self.validSampleNum,self.testSampleNum = len(trainIdList),len(validIdList),len(testIdList)
        self.totalSampleNum = self.trainSampleNum+self.validSampleNum+self.testSampleNum
        print('train sample size:',len(self.trainIdList))
        print('valid sample size:',len(self.validIdList))
        print('test sample size:',len(self.testIdList))
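    # Expected data.txt format (inferred from the indexing above): one sample per
    # line with tab-separated fields, where field 1 (i[1]) is the RNA sequence and
    # field 2 (i[2]) is the subcellular-localization label, i.e. roughly
    #   <name>\t<sequence>\t<label>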
    def describe(self):
        print(f'===========DataClass Describe===========')
        print(f'{"CLASS":<16}{"TRAIN":<8}{"VALID":<8}{"TEST":<8}')
        for i,c in enumerate(self.id2lab):
            trainIsC = sum(self.Lab[self.trainIdList]==i)/self.trainSampleNum if self.trainSampleNum>0 else -1.0
            validIsC = sum(self.Lab[self.validIdList]==i)/self.validSampleNum if self.validSampleNum>0 else -1.0
            testIsC = sum(self.Lab[self.testIdList]==i) /self.testSampleNum if self.testSampleNum>0 else -1.0
            print(f'{c:<16}{trainIsC:<8.3f}{validIsC:<8.3f}{testIsC:<8.3f}')
        print(f'========================================')

    def vectorize(self, method="char2vec", feaSize=512, window=5, sg=1,
                  workers=8, loadCache=True):
        if os.path.exists(f'cache/{method}_k{self.kmers}_d{feaSize}.pkl') and loadCache:
            with open(f'cache/{method}_k{self.kmers}_d{feaSize}.pkl', 'rb') as f:
                if method=='kmers':
                    tmp = pickle.load(f)
                    self.vector['encoder'],self.kmersFea = tmp['encoder'],tmp['kmersFea']
                else:
                    self.vector['embedding'] = pickle.load(f)
            print(f'Loaded cache from cache/{method}_k{self.kmers}_d{feaSize}.pkl.')
            return
        if method == 'char2vec':
            doc = [i+['<EOS>'] for i in self.RNADoc]
            model = Word2Vec(doc, min_count=0, window=window, vector_size=feaSize, workers=workers, sg=sg, epochs=10)
            char2vec = np.zeros((self.kmersNum, feaSize), dtype=np.float32)
            for i in range(self.kmersNum):
                char2vec[i] = model.wv[self.id2kmers[i]]
            self.vector['embedding'] = char2vec
        elif method == 'glovechar':
            from glove import Glove,Corpus
            doc = [i+['<EOS>'] for i in self.RNADoc]
            corpus = Corpus()
            corpus.fit(doc, window=window)
            glove = Glove(no_components=feaSize)
            glove.fit(corpus.matrix, epochs=10, no_threads=workers, verbose=True)
            glove.add_dictionary(corpus.dictionary)
            gloveVec = np.zeros((self.kmersNum, feaSize), dtype=np.float32)
            for i in range(self.kmersNum):
                gloveVec[i] = glove.word_vectors[glove.dictionary[self.id2kmers[i]]]
            self.vector['embedding'] = gloveVec
        elif method == 'kmers':
            enc = OneHotEncoder(categories='auto')
            enc.fit([[i] for i in self.kmers2id.values()])
            feaSize = len(self.kmers2id)
            kmers = np.zeros((self.totalSampleNum, feaSize))
            bs = 50000
            print('Getting the kmers vector......')
            for i,t in enumerate(tqdm(self.tokenizedRNASent)):
                for k in range((len(t)+bs-1)//bs):
                    kmers[i] += enc.transform(np.array(t[k*bs:(k+1)*bs]).reshape(-1,1)).toarray().sum(axis=0)
            kmers = kmers[:,1:]
            kmers = (kmers-kmers.mean(axis=0))/kmers.std(axis=0)
            self.vector['encoder'] = enc
            self.kmersFea = kmers
        with open(f'cache/{method}_k{self.kmers}_d{feaSize}.pkl', 'wb') as f:
            if method=='kmers':
                pickle.dump({'encoder':self.vector['encoder'], 'kmersFea':self.kmersFea}, f, protocol=4)
            else:
                pickle.dump(self.vector['embedding'], f, protocol=4)

    def vector_merge(self, vecList, mergeVecName='mergeVec'):
        self.vector[mergeVecName] = np.hstack([self.vector[i] for i in vecList])
        print(f'Get a new vector "{mergeVecName}" with shape {self.vector[mergeVecName].shape}...')
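    # Both data streams below pad every batch to the longest sequence in that batch
    # with the id-0 token ('<EOS>') and yield ({'seqArr','seqLenArr'}, labels)
    # tuples already moved to `device`; random_batch_data_stream loops forever,
    # while one_epoch_batch_data_stream makes a single pass over the chosen split.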
    def random_batch_data_stream(self, batchSize=128, type='train', device=torch.device('cpu')):
        idList = self.trainIdList if type=='train' else self.validIdList
        X,XLen = self.tokenizedRNASent,self.RNASeqLen
        while True:
            random.shuffle(idList)
            for i in range((len(idList)+batchSize-1)//batchSize):
                samples = idList[i*batchSize:(i+1)*batchSize]
                RNASeqMaxLen = XLen[samples].max()
                yield {
                    "seqArr":torch.tensor([i+[0]*(RNASeqMaxLen-len(i)) for i in X[samples]], dtype=torch.long).to(device), \
                    "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device)
                }, torch.tensor(self.Lab[samples], dtype=torch.long).to(device)

    def one_epoch_batch_data_stream(self, batchSize=128, type='valid', device=torch.device('cpu')):
        if type == 'train':
            idList = self.trainIdList
        elif type == 'valid':
            idList = self.validIdList
        elif type == 'test':
            idList = self.testIdList
        X,XLen = self.tokenizedRNASent,self.RNASeqLen
        for i in range((len(idList)+batchSize-1)//batchSize):
            samples = idList[i*batchSize:(i+1)*batchSize]
            RNASeqMaxLen = XLen[samples].max()
            yield {
                "seqArr":torch.tensor([i+[0]*(RNASeqMaxLen-len(i)) for i in X[samples]], dtype=torch.long).to(device), \
                "seqLenArr":torch.tensor(XLen[samples], dtype=torch.int).to(device)
            }, torch.tensor(self.Lab[samples], dtype=torch.long).to(device)

--------------------------------------------------------------------------------
/out/README.txt:
--------------------------------------------------------------------------------
The trained models are saved in this folder.

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.22.3
scikit-learn==1.1.1
torch==1.12.0+cu116
gensim==4.0.1

--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
from model.utils import *
from model.DL_ClassifierModel import *
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--k', default='3')
parser.add_argument('--d', default='64')
parser.add_argument('--s', default='64')
parser.add_argument('--f', default='128')
parser.add_argument('--metrics', default='MaF')
parser.add_argument('--device', default='cuda:0')
parser.add_argument('--savePath', default='out/')
args = parser.parse_args()
if __name__ == '__main__':
    k,d,s,f = int(args.k),int(args.d),int(args.s),int(args.f)
    device,path = args.device,args.savePath
    metrics = args.metrics

    report = ["ACC", "MaF", "MiAUC", "MaAUC"]

    dataClass = DataClass('data.txt', 0.2, 0.0, kmers=k)
    dataClass.vectorize("char2vec", feaSize=d, loadCache=True)

    model = TextClassifier_SPPCNN( classNum=5, embedding=dataClass.vector['embedding'], SPPSize=s, feaSize=d, filterNum=f,
                                   contextSizeList=[1,3,5], hiddenList=[],
                                   embDropout=0.3, fcDropout=0.5, useFocalLoss=True, weight=None,
                                   device=device)
    model.cv_train(dataClass, trainSize=1, batchSize=16, stopRounds=-1, earlyStop=10,
                   epoch=100, lr=0.001, kFold=5, metrics=metrics,
                   savePath=f'{path}CNN_s{s}_f{f}_k{k}_d{d}', report=report)
--------------------------------------------------------------------------------