├── dataset1_PepCNN_train.py
├── dataset2_PepCNN_train.py
├── README.md
├── dataset1_num2_train_network.py
├── dataset2_num2_train_network.py
├── dataset1_PepCNN.py
├── dataset2_PepCNN.py
├── dataset2_num1_extraction_of_samples.py
└── dataset1_num1_extraction_of_samples.py


/dataset1_PepCNN_train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Aug 16 10:18:50 2023
4 | 
5 | @author: abelac
6 | """
7 | 
8 | import dataset1_num1_extraction_of_samples  # step 1: extract the train/test samples
9 | import dataset1_num2_train_network          # step 2: train and evaluate the CNN
10 | 
11 | # the imports above already execute both scripts in order;
12 | # the bare module names need no further calls
--------------------------------------------------------------------------------
/dataset2_PepCNN_train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Aug 16 10:18:50 2023
4 | 
5 | @author: abelac
6 | """
7 | 
8 | import dataset2_num1_extraction_of_samples  # step 1: extract the train/test samples
9 | import dataset2_num2_train_network          # step 2: train and evaluate the CNN
10 | 
11 | # the imports above already execute both scripts in order;
12 | # the bare module names need no further calls
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PepCNN
2 | Protein-peptide interaction is an important biological process: it plays a role in many cellular processes, but it is also involved in abnormal cellular behaviors that lead to diseases such as cancer. Studying these interactions is therefore vital for understanding protein function as well as for discovering drugs for disease treatment. Characterizing the interactions experimentally, however, is laborious, time-consuming, and expensive. We have therefore developed a new prediction method called PepCNN, which uses structure- and sequence-based information from primary protein sequences to predict peptide-binding residues. Combining half-sphere exposure (HSE) structural information, sequence information from position-specific scoring matrices (PSSM) and a pre-trained transformer language model, and a convolutional neural network resulted in superior performance compared to the state-of-the-art methods on the two datasets.
3 | 
4 | ![Architecture](https://github.com/abelavit/PepCNN/assets/36461816/711066e5-aac9-4e3e-afcd-7223cf544f05)
5 | 
6 | # Download and Use
7 | There are two ways to use the provided code for each dataset.
8 | ## 1. Load the trained PepCNN model
9 | The results reported in our work can be replicated by executing the dataset1_PepCNN.py script for Dataset1 and the dataset2_PepCNN.py script for Dataset2. For instance, to obtain the result of PepCNN on Dataset1, run the dataset1_PepCNN.py script after downloading the following files from this [link](https://figshare.com/projects/Load_protein-peptide_binding_PepCNN_model/176094) (caution: data size is around 1.3GB for each dataset):
10 | - model weights: dataset1_best_model_weights.h5
11 | - training set negative samples: dataset1_Train_Negatives_All.dat
12 | - training set positive samples: dataset1_Train_Positives.dat
13 | - testing set: dataset1_Test_Samples.dat
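
With these four files placed alongside the scripts (they are opened by bare filename), reproducing the reported metrics is a single script run. A minimal example for Dataset1, assuming Python and the packages listed below are available in your environment:

```
python dataset1_PepCNN.py
```

This prints the test loss and AUC, reports Sen/Spe/Pre at the chosen threshold, and plots the ROC curve.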
14 | ## 2. Train the CNN model
15 | To train the network from scratch, execute the dataset1_PepCNN_train.py script for Dataset1 or the dataset2_PepCNN_train.py script for Dataset2. For instance, to train the network on dataset1, run the dataset1_PepCNN_train.py script after downloading the following files from this [link](https://figshare.com/projects/Train_the_CNN_model/176151) (caution: data size is 1.22GB for both datasets):
16 | - testing protein sequences: Dataset1_test.tsv
17 | - protein sequences excluding testing sequences: Dataset1_train.tsv
18 | - pre-trained transformer embeddings: T5_Features.dat
19 | - PSSM features: PSSM_Features.dat
20 | - HSE features: HSE_Features.dat
21 | 
22 | Package versions:
23 | Python 3.10.12,
24 | pandas 1.5.3,
25 | pickle (protocol 4.0),
26 | NumPy 1.25.2,
27 | scikit-learn 1.2.2,
28 | Matplotlib 3.7.2,
29 | TensorFlow 2.12.0
30 | 
31 | 
--------------------------------------------------------------------------------
/dataset1_num2_train_network.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Aug 16 10:18:50 2023
4 | 
5 | @author: abelac
6 | """
7 | 
8 | import numpy as np
9 | import pandas as pd
10 | import warnings
11 | import pickle
12 | from sklearn.preprocessing import StandardScaler
13 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
14 | import tensorflow as tf
15 | import tensorflow.keras.layers as tfl
16 | 
17 | 
18 | # load data that excludes the test data
19 | file = open("dataset1_Train_Positives_rerun.dat",'rb')
20 | positive_set = pickle.load(file)
21 | file = open("dataset1_Train_Negatives_All_rerun.dat",'rb')
22 | negative_set_entire = pickle.load(file)
23 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives']
24 | # randomly pick negative samples to balance them against the positive samples (1.5x positive samples)
25 | Negative_Samples = negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42)
26 | 
27 | # combine positive and negative sets to make the final dataset
28 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0)
29 | 
30 | # collect the features and labels of train set
31 | np.set_printoptions(suppress=True)
32 | X_val = [0]*len(Train_set)
33 | for i in range(len(Train_set)):
34 |     feat = Train_set['Feature'][i]
35 |     X_val[i] = feat
36 | X_train_orig = np.asarray(X_val)
37 | y_val = Train_set['Label'].to_numpy(dtype=float)
38 | Y_train_orig = y_val.reshape(y_val.size,1)
39 | 
40 | # shuffle features and labels together by indexing both arrays with one random permutation
41 | idx = np.random.permutation(len(X_train_orig))
42 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx]
43 | scaler = StandardScaler()
44 | scaler.fit(X_train) # fit on training set only
45 | X_train = scaler.transform(X_train) # apply transform to the training set
46 | 
47 | # load test data
48 | file = open("dataset1_Test_Samples_rerun.dat",'rb')
49 | Independent_test_set = pickle.load(file)
50 | # collect the features and labels for independent set
51 | X_independent = [0]*len(Independent_test_set)
52 | for i in range(len(Independent_test_set)):
53 |     feat = Independent_test_set['Feature'][i]
54 |     X_independent[i] = feat
55 | X_test = np.asarray(X_independent)
56 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float)
57 | Y_test = y_independent.reshape(y_independent.size,1)
58 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set
59 | 
60 | def CNN_Model():
61 | 
62 |     model = tf.keras.Sequential()
63 |     model.add(tfl.Conv1D(128, 5, 
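        # note: Conv1D expects input shaped (batch, feat_shape, 1); if model.fit()
        # rejects the 2-D standardized array, add a trailing channel axis first,
        # e.g. X_train = X_train[..., np.newaxis] (and likewise for X_test)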
padding='same', activation='relu', input_shape=(feat_shape,1))) 64 | model.add(tfl.BatchNormalization()) 65 | model.add(tfl.Dropout(0.23)) 66 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 67 | model.add(tfl.BatchNormalization()) 68 | model.add(tfl.Dropout(0.21)) 69 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 70 | model.add(tfl.BatchNormalization()) 71 | model.add(tfl.Dropout(0.47)) 72 | 73 | model.add(tfl.Flatten()) 74 | 75 | model.add(tfl.Dense(128, activation='relu')) 76 | model.add(tfl.Dense(32, activation='relu')) 77 | model.add(tfl.Dense(1, activation='sigmoid')) 78 | 79 | return model 80 | 81 | feat_shape = X_train[0].size 82 | 83 | cnn_model = CNN_Model() 84 | 85 | learning_rate = 0.000001 86 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 87 | cnn_model.compile(optimizer=optimizer, 88 | loss='binary_crossentropy', 89 | metrics=['AUC']) 90 | 91 | cnn_model.summary() 92 | 93 | # Train the Model 94 | batch_size = 30 95 | epochs = 200 96 | 97 | checkpoint = tf.keras.callbacks.ModelCheckpoint("dataset1_best_model_weights_rerun.h5", save_best_only=True) 98 | early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_auc', patience=3, restore_best_weights=True) 99 | history = cnn_model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=[checkpoint, early_stopping]) 100 | 101 | df_loss_auc = pd.DataFrame(history.history) 102 | df_loss= df_loss_auc[['loss','val_loss']] 103 | df_loss.rename(columns={'loss':'train','val_loss':'validation'},inplace=True) 104 | df_auc= df_loss_auc[['auc','val_auc']] 105 | df_auc.rename(columns={'auc':'train','val_auc':'validation'},inplace=True) 106 | Model_Loss_plot_title = 'Model Loss' 107 | df_loss.plot(title=Model_Loss_plot_title,figsize=(12,8)).set(xlabel='Epoch',ylabel='Loss') 108 | Model_AUC_plot_title = 'Model AUC' 109 | df_auc.plot(title=Model_AUC_plot_title,grid=True,figsize=(12,8)).set(xlabel='Epoch',ylabel='AUC') 110 | 111 | eval_result = cnn_model.evaluate(X_test, Y_test) 112 | print(f"test loss: {round(eval_result[0],4)}, test auc: {round(eval_result[1],4)}") 113 | -------------------------------------------------------------------------------- /dataset2_num2_train_network.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | import warnings 11 | import pickle 12 | from sklearn.preprocessing import StandardScaler 13 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 14 | import tensorflow as tf 15 | import tensorflow.keras.layers as tfl 16 | 17 | 18 | # load data that excludes the test data 19 | file = open("dataset2_Train_Positives_rerun.dat",'rb') 20 | positive_set = pickle.load(file) 21 | file = open("dataset2_Train_Negatives_All_rerun.dat",'rb') 22 | negative_set_entire = pickle.load(file) 23 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 24 | # randomly pick negative samples to balance it with positve samples (1.5x positive samples) 25 | Negative_Samples = negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42) 26 | 27 | # combine positive and negative sets to make the final dataset 28 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0) 29 | 30 | # collect the features and labels of train 
set 31 | np.set_printoptions(suppress=True) 32 | X_val = [0]*len(Train_set) 33 | for i in range(len(Train_set)): 34 | feat = Train_set['Feature'][i] 35 | X_val[i] = feat 36 | X_train_orig = np.asarray(X_val) 37 | y_val = Train_set['Label'].to_numpy(dtype=float) 38 | Y_train_orig = y_val.reshape(y_val.size,1) 39 | 40 | # Generate a random order of elements with np.random.permutation and simply index into the arrays Feature and label 41 | idx = np.random.permutation(len(X_train_orig)) 42 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx] 43 | scaler = StandardScaler() 44 | scaler.fit(X_train) # fit on training set only 45 | X_train = scaler.transform(X_train) # apply transform to the training set 46 | 47 | # load test data 48 | file = open("dataset2_Test_Samples_rerun.dat",'rb') 49 | Independent_test_set = pickle.load(file) 50 | # collect the features and labels for independent set 51 | X_independent = [0]*len(Independent_test_set) 52 | for i in range(len(Independent_test_set)): 53 | feat = Independent_test_set['Feature'][i] 54 | X_independent[i] = feat 55 | X_test = np.asarray(X_independent) 56 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float) 57 | Y_test = y_independent.reshape(y_independent.size,1) 58 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set 59 | 60 | def CNN_Model(): 61 | 62 | model = tf.keras.Sequential() 63 | model.add(tfl.Conv1D(128, 5, padding='same', activation='relu', input_shape=(feat_shape,1))) 64 | model.add(tfl.BatchNormalization()) 65 | model.add(tfl.Dropout(0.38)) 66 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 67 | model.add(tfl.BatchNormalization()) 68 | model.add(tfl.Dropout(0.38)) 69 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 70 | model.add(tfl.BatchNormalization()) 71 | model.add(tfl.Dropout(0.38)) 72 | 73 | model.add(tfl.Flatten()) 74 | 75 | model.add(tfl.Dense(128, activation='relu')) 76 | model.add(tfl.Dense(32, activation='relu')) 77 | model.add(tfl.Dense(1, activation='sigmoid')) 78 | 79 | return model 80 | 81 | feat_shape = X_train[0].size 82 | 83 | cnn_model = CNN_Model() 84 | 85 | learning_rate = 0.000001 86 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 87 | cnn_model.compile(optimizer=optimizer, 88 | loss='binary_crossentropy', 89 | metrics=['AUC']) 90 | 91 | cnn_model.summary() 92 | 93 | # Train the Model 94 | batch_size = 30 95 | epochs = 200 96 | 97 | checkpoint = tf.keras.callbacks.ModelCheckpoint("dataset2_best_model_weights_rerun.h5", save_best_only=True) 98 | early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_auc', patience=3, restore_best_weights=True) 99 | history = cnn_model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=[checkpoint, early_stopping]) 100 | 101 | df_loss_auc = pd.DataFrame(history.history) 102 | df_loss= df_loss_auc[['loss','val_loss']] 103 | df_loss.rename(columns={'loss':'train','val_loss':'validation'},inplace=True) 104 | df_auc= df_loss_auc[['auc','val_auc']] 105 | df_auc.rename(columns={'auc':'train','val_auc':'validation'},inplace=True) 106 | Model_Loss_plot_title = 'Model Loss' 107 | df_loss.plot(title=Model_Loss_plot_title,figsize=(12,8)).set(xlabel='Epoch',ylabel='Loss') 108 | Model_AUC_plot_title = 'Model AUC' 109 | df_auc.plot(title=Model_AUC_plot_title,grid=True,figsize=(12,8)).set(xlabel='Epoch',ylabel='AUC') 110 | 111 | eval_result = cnn_model.evaluate(X_test, Y_test) 112 | print(f"test loss: {round(eval_result[0],4)}, test auc: 
{round(eval_result[1],4)}") 113 | -------------------------------------------------------------------------------- /dataset1_PepCNN.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | from sklearn.metrics import roc_auc_score 11 | from sklearn.metrics import precision_score 12 | import warnings 13 | import pickle 14 | import copy 15 | from matplotlib import pyplot 16 | from sklearn.metrics import roc_curve 17 | from sklearn.metrics import confusion_matrix 18 | from sklearn.preprocessing import StandardScaler 19 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 20 | import tensorflow as tf 21 | import tensorflow.keras.layers as tfl 22 | 23 | # load data that excludes the test data 24 | file = open("dataset1_Train_Positives.dat",'rb') 25 | positive_set = pickle.load(file) 26 | file = open("dataset1_Train_Negatives_All.dat",'rb') 27 | negative_set_entire = pickle.load(file) 28 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 29 | # randomly pick negative samples to balance it with positve samples (1.5x positive samples) 30 | Negative_Samples = negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42) 31 | 32 | # combine positive and negative sets to make the final dataset 33 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0) 34 | 35 | # collect the features and labels of train set 36 | np.set_printoptions(suppress=True) 37 | X_val = [0]*len(Train_set) 38 | for i in range(len(Train_set)): 39 | feat = Train_set['Feature'][i] 40 | X_val[i] = feat 41 | X_train_orig = np.asarray(X_val) 42 | y_val = Train_set['Label'].to_numpy(dtype=float) 43 | Y_train_orig = y_val.reshape(y_val.size,1) 44 | 45 | # Generate a random order of elements with np.random.permutation and simply index into the arrays Feature and label 46 | idx = np.random.permutation(len(X_train_orig)) 47 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx] 48 | scaler = StandardScaler() 49 | scaler.fit(X_train) # fit on training set only 50 | X_train = scaler.transform(X_train) # apply transform to the training set 51 | 52 | # load test data 53 | file = open("dataset1_Test_Samples.dat",'rb') 54 | Independent_test_set = pickle.load(file) 55 | # collect the features and labels for independent set 56 | X_independent = [0]*len(Independent_test_set) 57 | for i in range(len(Independent_test_set)): 58 | feat = Independent_test_set['Feature'][i] 59 | X_independent[i] = feat 60 | X_test = np.asarray(X_independent) 61 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float) 62 | Y_test = y_independent.reshape(y_independent.size,1) 63 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set 64 | 65 | def CNN_Model(): 66 | 67 | model = tf.keras.Sequential() 68 | model.add(tfl.Conv1D(128, 5, padding='same', activation='relu', input_shape=(feat_shape,1))) 69 | model.add(tfl.BatchNormalization()) 70 | model.add(tfl.Dropout(0.23)) 71 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 72 | model.add(tfl.BatchNormalization()) 73 | model.add(tfl.Dropout(0.21)) 74 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 75 | model.add(tfl.BatchNormalization()) 76 | model.add(tfl.Dropout(0.47)) 77 | 78 | model.add(tfl.Flatten()) 79 | 80 | 
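# classifier head: the flattened convolutional features pass through two fully
# connected layers and a single sigmoid unit that outputs the binding probability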
model.add(tfl.Dense(128, activation='relu')) 81 | model.add(tfl.Dense(32, activation='relu')) 82 | model.add(tfl.Dense(1, activation='sigmoid')) 83 | 84 | return model 85 | 86 | feat_shape = X_train[0].size 87 | 88 | cnn_model = CNN_Model() 89 | 90 | learning_rate = 0.000001 91 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 92 | cnn_model.compile(optimizer=optimizer, 93 | loss='binary_crossentropy', 94 | metrics=['AUC']) 95 | 96 | cnn_model.summary() 97 | 98 | # load the trained weights 99 | cnn_model.load_weights('dataset1_best_model_weights.h5') 100 | 101 | eval_result = cnn_model.evaluate(X_test, Y_test) 102 | 103 | print(f"test loss: {round(eval_result[0],4)}, test auc: {round(eval_result[1],4)}") 104 | Inde_test_prob = cnn_model.predict(X_test) 105 | 106 | 107 | def round_based_on_thres(probs_to_round, set_thres): 108 | for i in range(len(probs_to_round)): 109 | if probs_to_round[i] <= set_thres: 110 | probs_to_round[i] = 0 111 | else: 112 | probs_to_round[i] = 1 113 | return probs_to_round 114 | 115 | # calculate the metrics 116 | set_thres = 0.877 117 | copy_Probs_inde = copy.copy(Inde_test_prob) 118 | round_based_on_thres(copy_Probs_inde, set_thres) 119 | fpr, tpr, thresholds = roc_curve(Y_test, Inde_test_prob) 120 | inde_auc = round(roc_auc_score(Y_test, Inde_test_prob),4) 121 | inde_pre = round(precision_score(Y_test, copy_Probs_inde),4) 122 | cm = confusion_matrix(Y_test, copy_Probs_inde) # for acc, sen, and spe calculation 123 | total_preds=sum(sum(cm)) 124 | TN = cm[0,0] 125 | FP = cm[0,1] 126 | FN = cm[1,0] 127 | TP = cm[1,1] 128 | inde_sen = round(TP/(TP+FN),4) 129 | inde_spe = round(TN/(TN+FP),4) 130 | 131 | # display the metrics 132 | print(f'Independent Sen: {inde_sen}') 133 | print(f'Independent Spe: {inde_spe}') 134 | print(f'Independent Pre: {inde_pre}') 135 | print(f'Independent AUC: {inde_auc}') 136 | 137 | # plot ROC curve 138 | legend = 'AUC = ' + str(inde_auc) 139 | pyplot.figure(figsize=(12,8)) 140 | pyplot.plot([0,1], [0,1], linestyle='--') 141 | pyplot.plot(fpr, tpr, marker='.', label=legend) 142 | pyplot.xlabel('False Positive Rate') 143 | pyplot.ylabel('True Positive Rate') 144 | pyplot.legend() 145 | pyplot.show() 146 | -------------------------------------------------------------------------------- /dataset2_PepCNN.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | from sklearn.metrics import roc_auc_score 11 | from sklearn.metrics import precision_score 12 | import warnings 13 | import pickle 14 | import copy 15 | from matplotlib import pyplot 16 | from sklearn.metrics import roc_curve 17 | from sklearn.metrics import confusion_matrix 18 | from sklearn.preprocessing import StandardScaler 19 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 20 | import tensorflow as tf 21 | import tensorflow.keras.layers as tfl 22 | 23 | # load data that excludes the test data 24 | file = open("dataset2_Train_Positives.dat",'rb') 25 | positive_set = pickle.load(file) 26 | file = open("dataset2_Train_Negatives_All.dat",'rb') 27 | negative_set_entire = pickle.load(file) 28 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 29 | # randomly pick negative samples to balance it with positve samples (1.5x positive samples) 30 | Negative_Samples = 
negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42) 31 | 32 | # combine positive and negative sets to make the final dataset 33 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0) 34 | 35 | # collect the features and labels of train set 36 | np.set_printoptions(suppress=True) 37 | X_val = [0]*len(Train_set) 38 | for i in range(len(Train_set)): 39 | feat = Train_set['Feature'][i] 40 | X_val[i] = feat 41 | X_train_orig = np.asarray(X_val) 42 | y_val = Train_set['Label'].to_numpy(dtype=float) 43 | Y_train_orig = y_val.reshape(y_val.size,1) 44 | 45 | # Generate a random order of elements with np.random.permutation and simply index into the arrays Feature and label 46 | idx = np.random.permutation(len(X_train_orig)) 47 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx] 48 | scaler = StandardScaler() 49 | scaler.fit(X_train) # fit on training set only 50 | X_train = scaler.transform(X_train) # apply transform to the training set 51 | 52 | # load test data 53 | file = open("dataset2_Test_Samples.dat",'rb') 54 | Independent_test_set = pickle.load(file) 55 | # collect the features and labels for independent set 56 | X_independent = [0]*len(Independent_test_set) 57 | for i in range(len(Independent_test_set)): 58 | feat = Independent_test_set['Feature'][i] 59 | X_independent[i] = feat 60 | X_test = np.asarray(X_independent) 61 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float) 62 | Y_test = y_independent.reshape(y_independent.size,1) 63 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set 64 | 65 | def CNN_Model(): 66 | 67 | model = tf.keras.Sequential() 68 | model.add(tfl.Conv1D(128, 5, padding='same', activation='relu', input_shape=(feat_shape,1))) 69 | model.add(tfl.BatchNormalization()) 70 | model.add(tfl.Dropout(0.38)) 71 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 72 | model.add(tfl.BatchNormalization()) 73 | model.add(tfl.Dropout(0.38)) 74 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 75 | model.add(tfl.BatchNormalization()) 76 | model.add(tfl.Dropout(0.38)) 77 | 78 | model.add(tfl.Flatten()) 79 | 80 | model.add(tfl.Dense(128, activation='relu')) 81 | model.add(tfl.Dense(32, activation='relu')) 82 | model.add(tfl.Dense(1, activation='sigmoid')) 83 | 84 | return model 85 | 86 | feat_shape = X_train[0].size 87 | 88 | cnn_model = CNN_Model() 89 | 90 | learning_rate = 0.000001 91 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 92 | cnn_model.compile(optimizer=optimizer, 93 | loss='binary_crossentropy', 94 | metrics=['AUC']) 95 | 96 | cnn_model.summary() 97 | 98 | # load the trained weights 99 | cnn_model.load_weights('dataset2_best_model_weights.h5') 100 | 101 | eval_result = cnn_model.evaluate(X_test, Y_test) 102 | 103 | print(f"test loss: {round(eval_result[0],4)}, test auc: {round(eval_result[1],4)}") 104 | Inde_test_prob = cnn_model.predict(X_test) 105 | 106 | 107 | def round_based_on_thres(probs_to_round, set_thres): 108 | for i in range(len(probs_to_round)): 109 | if probs_to_round[i] <= set_thres: 110 | probs_to_round[i] = 0 111 | else: 112 | probs_to_round[i] = 1 113 | return probs_to_round 114 | 115 | # calculate the metrics 116 | set_thres = 0.885 117 | copy_Probs_inde = copy.copy(Inde_test_prob) 118 | round_based_on_thres(copy_Probs_inde, set_thres) 119 | fpr, tpr, thresholds = roc_curve(Y_test, Inde_test_prob) 120 | inde_auc = round(roc_auc_score(Y_test, Inde_test_prob),4) 121 | inde_pre = 
round(precision_score(Y_test, copy_Probs_inde),4) 122 | cm = confusion_matrix(Y_test, copy_Probs_inde) # for acc, sen, and spe calculation 123 | total_preds=sum(sum(cm)) 124 | TN = cm[0,0] 125 | FP = cm[0,1] 126 | FN = cm[1,0] 127 | TP = cm[1,1] 128 | inde_sen = round(TP/(TP+FN),4) 129 | inde_spe = round(TN/(TN+FP),4) 130 | 131 | # display the metrics 132 | print(f'Independent Sen: {inde_sen}') 133 | print(f'Independent Spe: {inde_spe}') 134 | print(f'Independent Pre: {inde_pre}') 135 | print(f'Independent AUC: {inde_auc}') 136 | 137 | # plot ROC curve 138 | legend = 'AUC = ' + str(inde_auc) 139 | pyplot.figure(figsize=(12,8)) 140 | pyplot.plot([0,1], [0,1], linestyle='--') 141 | pyplot.plot(fpr, tpr, marker='.', label=legend) 142 | pyplot.xlabel('False Positive Rate') 143 | pyplot.ylabel('True Positive Rate') 144 | pyplot.legend() 145 | pyplot.show() 146 | -------------------------------------------------------------------------------- /dataset2_num1_extraction_of_samples.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | import pickle 8 | import pandas as pd 9 | import numpy as np 10 | import math 11 | 12 | 13 | # function to extract samples 14 | def peptide_feat(window_size, Protein_seq, Feat, j): # funtion to extract peptide length and feature based on window size 15 | 16 | if (j - math.ceil(window_size/2)) < -1: # not enough amino acid at N terminus to form peptide 17 | peptide1 = Protein_seq[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 18 | peptide2 = Protein_seq[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 19 | peptide = peptide2[::-1] + peptide1 20 | 21 | feat1 = Feat[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 22 | feat2 = Feat[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 23 | final_feat = np.concatenate((feat2[::-1], feat1)) 24 | mirrored = 'Yes' 25 | 26 | elif ((len(Protein_seq) - (j+1)) < (math.floor(window_size/2))): # not enough amino acid at C terminus to form peptide 27 | peptide1 = Protein_seq[j-math.floor(window_size/2):j+1] 28 | peptide2 = Protein_seq[j-math.floor(window_size/2):j] 29 | peptide = peptide1 + peptide2[::-1] 30 | 31 | feat1 = Feat[j-math.floor(window_size/2):j+1] 32 | feat2 = Feat[j-math.floor(window_size/2):j] 33 | final_feat = np.concatenate((feat1, feat2[::-1])) 34 | mirrored = 'Yes' 35 | 36 | else: 37 | peptide = Protein_seq[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 38 | final_feat = Feat[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 39 | mirrored = 'No' 40 | 41 | return peptide, final_feat, mirrored 42 | 43 | 44 | 45 | 46 | # Prepare data 47 | Dataset_test_tsv = pd.read_table("Dataset2_test.tsv") 48 | Dataset_train_tsv = pd.read_table("Dataset2_train.tsv") 49 | 50 | file = open("T5_Features.dat",'rb') 51 | Proteins = pickle.load(file) 52 | file = open("HSE_Features.dat",'rb') 53 | Proteins2 = pickle.load(file) 54 | file = open("PSSM_Features.dat",'rb') 55 | Proteins3 = pickle.load(file) 56 | 57 | column_headers = list(Proteins.columns.values) 58 | DatasetTestProteins = pd.DataFrame(columns = column_headers) 59 | DatasetTestProteins2 = pd.DataFrame(columns = column_headers) 60 | DatasetTestProteins3 = pd.DataFrame(columns = column_headers) 61 | 62 | matching_index = 0 63 | for i in 
range(len(Dataset_test_tsv)): 64 | for j in range(len(Proteins)): 65 | if (Dataset_test_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 66 | DatasetTestProteins.loc[matching_index] = Proteins.loc[j] 67 | matching_index += 1 68 | break 69 | matching_index = 0 70 | for i in range(len(Dataset_test_tsv)): 71 | for j in range(len(Proteins2)): 72 | if (Dataset_test_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 73 | DatasetTestProteins2.loc[matching_index] = Proteins2.loc[j] 74 | matching_index += 1 75 | break 76 | matching_index = 0 77 | for i in range(len(Dataset_test_tsv)): 78 | for j in range(len(Proteins3)): 79 | if (Dataset_test_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 80 | DatasetTestProteins3.loc[matching_index] = Proteins3.loc[j] 81 | matching_index += 1 82 | break 83 | 84 | DatasetTrainProteins = pd.DataFrame(columns = column_headers) 85 | DatasetTrainProteins2 = pd.DataFrame(columns = column_headers) 86 | DatasetTrainProteins3 = pd.DataFrame(columns = column_headers) 87 | 88 | matching_index = 0 89 | for i in range(len(Dataset_train_tsv)): 90 | for j in range(len(Proteins)): 91 | if (Dataset_train_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 92 | DatasetTrainProteins.loc[matching_index] = Proteins.loc[j] 93 | matching_index += 1 94 | break 95 | 96 | matching_index = 0 97 | for i in range(len(Dataset_train_tsv)): 98 | for j in range(len(Proteins2)): 99 | if (Dataset_train_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 100 | DatasetTrainProteins2.loc[matching_index] = Proteins2.loc[j] 101 | matching_index += 1 102 | break 103 | matching_index = 0 104 | for i in range(len(Dataset_train_tsv)): 105 | for j in range(len(Proteins3)): 106 | if (Dataset_train_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 107 | DatasetTrainProteins3.loc[matching_index] = Proteins3.loc[j] 108 | matching_index += 1 109 | break 110 | 111 | # generate samples for Test protein sequences 112 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 113 | Test_Samples = pd.DataFrame(columns = column_names) 114 | #Test_Negatives = pd.DataFrame(columns = column_names) 115 | 116 | Pos_index = 0 117 | Neg_index = 0 118 | window_size = 1 # -0 to +0 119 | seq_num = 0 120 | 121 | # extract feature and peptide for all sites 122 | for i in range(len(DatasetTestProteins)): 123 | Protein_seq = DatasetTestProteins['Prot_seq'][i] 124 | Feat = DatasetTestProteins['Feat'][i] # transpose the feature matrix 125 | Feat2 = DatasetTestProteins2['Feat'][i] 126 | Feat3 = DatasetTestProteins3['Feat'][i] 127 | positive_counts = DatasetTestProteins['Prot_label'][i].count('1') 128 | 129 | seq_num += 1 130 | for j in range(len(Protein_seq)): # go through the protein seq 131 | 132 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 
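# one row per residue: identifiers and the binding label are stored first; the
# peptide window and the concatenated feature vector are filled in further below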
133 | A_sample.loc[0,'Code'] = DatasetTestProteins['Prot_name'][i] # store the protein name 134 | A_sample.loc[0,'Protein_len'] = DatasetTestProteins['Prot_len'][i] # store the protein length 135 | A_sample.loc[0,'Label'] = DatasetTestProteins['Prot_label'][i][j] 136 | A_sample.loc[0,'Prot_positives'] = positive_counts 137 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 138 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 139 | 140 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 141 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 142 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 143 | 144 | A_sample.loc[0,'Peptide'] = peptide 145 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 146 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 147 | A_sample.loc[0,'Seq_num'] = seq_num 148 | A_sample.loc[0,'Mirrored'] = mirrored 149 | 150 | 151 | Test_Samples = pd.concat([Test_Samples, A_sample], ignore_index=True, axis=0) 152 | 153 | 154 | print('Test Protein ' + str(i+1) + ' out of ' + str(len(DatasetTestProteins))) 155 | print('Number of Proteins in Test: ' + str(len(DatasetTestProteins))) 156 | print('Number of samples in Test: ' + str(len(Test_Samples))) 157 | 158 | pickle.dump(Test_Samples,open("dataset2_Test_Samples_rerun.dat","wb")) 159 | 160 | # generate samples for Train protein sequences 161 | Train_Positives = pd.DataFrame(columns = column_names) 162 | Train_Negatives_All = pd.DataFrame(columns = column_names) 163 | 164 | Pos_index = 0 165 | Neg_index = 0 166 | seq_num = 0 167 | 168 | # extract feature and peptide for all sites 169 | for i in range(len(DatasetTrainProteins)): 170 | Protein_seq = DatasetTrainProteins['Prot_seq'][i] 171 | Feat = DatasetTrainProteins['Feat'][i] # transpose the feature matrix 172 | Feat2 = DatasetTrainProteins2['Feat'][i] 173 | Feat3 = DatasetTrainProteins3['Feat'][i] 174 | positive_counts = DatasetTrainProteins['Prot_label'][i].count('1') 175 | 176 | seq_num += 1 177 | for j in range(len(Protein_seq)): # go through the protein seq 178 | 179 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 
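# same per-residue fields as the test samples; each finished row is later routed
# into the positive or negative pool according to its label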
180 | A_sample.loc[0,'Code'] = DatasetTrainProteins['Prot_name'][i] # store the protein name 181 | A_sample.loc[0,'Protein_len'] = DatasetTrainProteins['Prot_len'][i] # store the protein length 182 | A_sample.loc[0,'Label'] = DatasetTrainProteins['Prot_label'][i][j] 183 | A_sample.loc[0,'Prot_positives'] = positive_counts 184 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 185 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 186 | 187 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 188 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 189 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 190 | 191 | A_sample.loc[0,'Peptide'] = peptide 192 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 193 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 194 | A_sample.loc[0,'Seq_num'] = seq_num 195 | A_sample.loc[0,'Mirrored'] = mirrored 196 | 197 | if A_sample.loc[0,'Label'] == '1': 198 | Train_Positives = pd.concat([Train_Positives, A_sample], ignore_index=True, axis=0) 199 | 200 | else: 201 | Train_Negatives_All = pd.concat([Train_Negatives_All, A_sample], ignore_index=True, axis=0) 202 | 203 | 204 | print('Train Protein ' + str(i+1) + ' out of ' + str(len(DatasetTrainProteins))) 205 | print('Number of Proteins in Train: ' + str(len(DatasetTrainProteins))) 206 | print('Feature vector size: ' + str(Test_Samples['Feature'][0].shape)) 207 | print('Num of Train Positives: ' + str(len(Train_Positives))) 208 | print('Num of Train Negatives (All): ' + str(len(Train_Negatives_All))) 209 | pickle.dump(Train_Positives,open("dataset2_Train_Positives_rerun.dat","wb")) 210 | pickle.dump(Train_Negatives_All,open("dataset2_Train_Negatives_All_rerun.dat","wb")) 211 | 212 | 213 | -------------------------------------------------------------------------------- /dataset1_num1_extraction_of_samples.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | import pickle 8 | import pandas as pd 9 | import numpy as np 10 | import math 11 | 12 | 13 | # function to extract samples 14 | def peptide_feat(window_size, Protein_seq, Feat, j): # funtion to extract peptide length and feature based on window size 15 | 16 | if (j - math.ceil(window_size/2)) < -1: # not enough amino acid at N terminus to form peptide 17 | peptide1 = Protein_seq[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 18 | peptide2 = Protein_seq[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 19 | peptide = peptide2[::-1] + peptide1 20 | 21 | feat1 = Feat[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 22 | feat2 = Feat[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 23 | final_feat = np.concatenate((feat2[::-1], feat1)) 24 | mirrored = 'Yes' 25 | 26 | elif ((len(Protein_seq) - (j+1)) < (math.floor(window_size/2))): # not enough amino acid at C terminus to form peptide 27 | peptide1 = Protein_seq[j-math.floor(window_size/2):j+1] 28 | peptide2 = Protein_seq[j-math.floor(window_size/2):j] 29 | peptide = peptide1 + peptide2[::-1] 30 | 31 | feat1 = Feat[j-math.floor(window_size/2):j+1] 32 | feat2 = 
Feat[j-math.floor(window_size/2):j] 33 | final_feat = np.concatenate((feat1, feat2[::-1])) 34 | mirrored = 'Yes' 35 | 36 | else: 37 | peptide = Protein_seq[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 38 | final_feat = Feat[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 39 | mirrored = 'No' 40 | 41 | return peptide, final_feat, mirrored 42 | 43 | 44 | 45 | 46 | # Prepare data 47 | Dataset_test_tsv = pd.read_table("Dataset1_test.tsv") 48 | Dataset_train_tsv = pd.read_table("Dataset1_train.tsv") 49 | 50 | file = open("T5_Features.dat",'rb') 51 | Proteins = pickle.load(file) 52 | file = open("HSE_Features.dat",'rb') 53 | Proteins2 = pickle.load(file) 54 | file = open("PSSM_Features.dat",'rb') 55 | Proteins3 = pickle.load(file) 56 | 57 | column_headers = list(Proteins.columns.values) 58 | DatasetTestProteins = pd.DataFrame(columns = column_headers) 59 | DatasetTestProteins2 = pd.DataFrame(columns = column_headers) 60 | DatasetTestProteins3 = pd.DataFrame(columns = column_headers) 61 | 62 | matching_index = 0 63 | for i in range(len(Dataset_test_tsv)): 64 | for j in range(len(Proteins)): 65 | if (Dataset_test_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 66 | if(Proteins['Prot_len'][j] > 30) & (Proteins['Prot_label'][j].count('1') >= 3): 67 | DatasetTestProteins.loc[matching_index] = Proteins.loc[j] 68 | matching_index += 1 69 | break 70 | matching_index = 0 71 | for i in range(len(Dataset_test_tsv)): 72 | for j in range(len(Proteins2)): 73 | if (Dataset_test_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 74 | if(Proteins2['Prot_len'][j] > 30) & (Proteins2['Prot_label'][j].count('1') >= 3): 75 | DatasetTestProteins2.loc[matching_index] = Proteins2.loc[j] 76 | matching_index += 1 77 | break 78 | matching_index = 0 79 | for i in range(len(Dataset_test_tsv)): 80 | for j in range(len(Proteins3)): 81 | if (Dataset_test_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 82 | if(Proteins3['Prot_len'][j] > 30) & (Proteins3['Prot_label'][j].count('1') >= 3): 83 | DatasetTestProteins3.loc[matching_index] = Proteins3.loc[j] 84 | matching_index += 1 85 | break 86 | 87 | DatasetTrainProteins = pd.DataFrame(columns = column_headers) 88 | DatasetTrainProteins2 = pd.DataFrame(columns = column_headers) 89 | DatasetTrainProteins3 = pd.DataFrame(columns = column_headers) 90 | 91 | matching_index = 0 92 | for i in range(len(Dataset_train_tsv)): 93 | for j in range(len(Proteins)): 94 | if (Dataset_train_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 95 | if(Proteins['Prot_len'][j] > 30) & (Proteins['Prot_label'][j].count('1') >= 3): 96 | DatasetTrainProteins.loc[matching_index] = Proteins.loc[j] 97 | matching_index += 1 98 | break 99 | 100 | matching_index = 0 101 | for i in range(len(Dataset_train_tsv)): 102 | for j in range(len(Proteins2)): 103 | if (Dataset_train_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 104 | if(Proteins2['Prot_len'][j] > 30) & (Proteins2['Prot_label'][j].count('1') >= 3): 105 | DatasetTrainProteins2.loc[matching_index] = Proteins2.loc[j] 106 | matching_index += 1 107 | break 108 | matching_index = 0 109 | for i in range(len(Dataset_train_tsv)): 110 | for j in range(len(Proteins3)): 111 | if (Dataset_train_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 112 | if(Proteins3['Prot_len'][j] > 30) & (Proteins3['Prot_label'][j].count('1') >= 3): 113 | DatasetTrainProteins3.loc[matching_index] = Proteins3.loc[j] 114 | matching_index += 1 115 | break 116 | 117 | # generate samples for Test protein 
sequences 118 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 119 | Test_Samples = pd.DataFrame(columns = column_names) 120 | 121 | Pos_index = 0 122 | Neg_index = 0 123 | window_size = 1 # -0 to +0 124 | seq_num = 0 125 | 126 | # extract feature and peptide for all sites 127 | for i in range(len(DatasetTestProteins)): 128 | Protein_seq = DatasetTestProteins['Prot_seq'][i] 129 | Feat = DatasetTestProteins['Feat'][i] # transpose the feature matrix 130 | Feat2 = DatasetTestProteins2['Feat'][i] 131 | Feat3 = DatasetTestProteins3['Feat'][i] 132 | positive_counts = DatasetTestProteins['Prot_label'][i].count('1') 133 | 134 | seq_num += 1 135 | for j in range(len(Protein_seq)): # go through the protein seq 136 | 137 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 138 | A_sample.loc[0,'Code'] = DatasetTestProteins['Prot_name'][i] # store the protein name 139 | A_sample.loc[0,'Protein_len'] = DatasetTestProteins['Prot_len'][i] # store the protein length 140 | A_sample.loc[0,'Label'] = DatasetTestProteins['Prot_label'][i][j] 141 | A_sample.loc[0,'Prot_positives'] = positive_counts 142 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 143 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 144 | 145 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 146 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 147 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 148 | 149 | A_sample.loc[0,'Peptide'] = peptide 150 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 151 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 152 | A_sample.loc[0,'Seq_num'] = seq_num 153 | A_sample.loc[0,'Mirrored'] = mirrored 154 | 155 | 156 | Test_Samples = pd.concat([Test_Samples, A_sample], ignore_index=True, axis=0) 157 | 158 | 159 | print('Test Protein ' + str(i+1) + ' out of ' + str(len(DatasetTestProteins))) 160 | print('Number of Proteins in Test: ' + str(len(DatasetTestProteins))) 161 | print('Number of samples in Test: ' + str(len(Test_Samples))) 162 | 163 | pickle.dump(Test_Samples,open("dataset1_Test_Samples_rerun.dat","wb")) 164 | 165 | # generate samples for Train protein sequences 166 | Train_Positives = pd.DataFrame(columns = column_names) 167 | Train_Negatives_All = pd.DataFrame(columns = column_names) 168 | 169 | Pos_index = 0 170 | Neg_index = 0 171 | seq_num = 0 172 | 173 | # extract feature and peptide for all sites 174 | for i in range(len(DatasetTrainProteins)): 175 | Protein_seq = DatasetTrainProteins['Prot_seq'][i] 176 | Feat = DatasetTrainProteins['Feat'][i] # transpose the feature matrix 177 | Feat2 = DatasetTrainProteins2['Feat'][i] 178 | Feat3 = DatasetTrainProteins3['Feat'][i] 179 | positive_counts = DatasetTrainProteins['Prot_label'][i].count('1') 180 | 181 | seq_num += 1 182 | for j in range(len(Protein_seq)): # go through the protein seq 183 | 184 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 
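# as in the test loop, the mean-pooled T5 embedding and the flattened HSE and
# PSSM windows are concatenated into a single feature vector further below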
185 | A_sample.loc[0,'Code'] = DatasetTrainProteins['Prot_name'][i] # store the protein name 186 | A_sample.loc[0,'Protein_len'] = DatasetTrainProteins['Prot_len'][i] # store the protein length 187 | A_sample.loc[0,'Label'] = DatasetTrainProteins['Prot_label'][i][j] 188 | A_sample.loc[0,'Prot_positives'] = positive_counts 189 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 190 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 191 | 192 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 193 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 194 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 195 | 196 | A_sample.loc[0,'Peptide'] = peptide 197 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 198 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 199 | A_sample.loc[0,'Seq_num'] = seq_num 200 | A_sample.loc[0,'Mirrored'] = mirrored 201 | 202 | if A_sample.loc[0,'Label'] == '1': 203 | Train_Positives = pd.concat([Train_Positives, A_sample], ignore_index=True, axis=0) 204 | 205 | else: 206 | Train_Negatives_All = pd.concat([Train_Negatives_All, A_sample], ignore_index=True, axis=0) 207 | 208 | 209 | print('Train Protein ' + str(i+1) + ' out of ' + str(len(DatasetTrainProteins))) 210 | print('Number of Proteins in Train: ' + str(len(DatasetTrainProteins))) 211 | print('Feature vector size: ' + str(Test_Samples['Feature'][0].shape)) 212 | print('Num of Train Positives: ' + str(len(Train_Positives))) 213 | print('Num of Train Negatives (All): ' + str(len(Train_Negatives_All))) 214 | pickle.dump(Train_Positives,open("dataset1_Train_Positives_rerun.dat","wb")) 215 | pickle.dump(Train_Negatives_All,open("dataset1_Train_Negatives_All_rerun.dat","wb")) 216 | 217 | 218 | --------------------------------------------------------------------------------
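A note on the window handling in the extraction scripts: when a residue sits within half a window of either terminus, peptide_feat mirrors the available half of the window so that every sample has the same length. With window_size = 1, as set above, each peptide is just the central residue and no mirroring ever occurs; the mirroring only matters for larger windows. A minimal sketch on toy inputs (hypothetical sequence and features; it assumes peptide_feat has been copied into the session from either extraction script, since importing the module would run the full pipeline):

```python
import numpy as np

seq = 'ACDEF'                       # toy 5-residue protein
feat = np.arange(10).reshape(5, 2)  # toy per-residue feature matrix

# residue 0 lacks an N-terminal half-window, so 'CD' is mirrored in front
peptide, final_feat, mirrored = peptide_feat(5, seq, feat, 0)
print(peptide, mirrored)  # -> DCACD Yes

# residue 2 has full context on both sides, so the window is taken as-is
peptide, final_feat, mirrored = peptide_feat(5, seq, feat, 2)
print(peptide, mirrored)  # -> ACDEF No
```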