├── dataset1_PepCNN_train.py
├── dataset2_PepCNN_train.py
├── README.md
├── dataset1_num2_train_network.py
├── dataset2_num2_train_network.py
├── dataset1_PepCNN.py
├── dataset2_PepCNN.py
├── dataset2_num1_extraction_of_samples.py
└── dataset1_num1_extraction_of_samples.py


/dataset1_PepCNN_train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Aug 16 10:18:50 2023
4 | 
5 | @author: abelac
6 | """
7 | 
8 | import dataset1_num1_extraction_of_samples  # step 1: extract the train/test samples
9 | import dataset1_num2_train_network          # step 2: train and evaluate the CNN
10 | 
11 | # the imports above already execute both scripts in order;
12 | # the bare module names need no further calls
--------------------------------------------------------------------------------
/dataset2_PepCNN_train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Aug 16 10:18:50 2023
4 | 
5 | @author: abelac
6 | """
7 | 
8 | import dataset2_num1_extraction_of_samples  # step 1: extract the train/test samples
9 | import dataset2_num2_train_network          # step 2: train and evaluate the CNN
10 | 
11 | # the imports above already execute both scripts in order;
12 | # the bare module names need no further calls
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PepCNN
2 | Protein-peptide interaction is an important biological process: it plays a role in many cellular processes, but it is also involved in abnormal cellular behaviors that lead to diseases such as cancer. Studying these interactions is therefore vital for understanding protein function as well as for discovering drugs for disease treatment. Characterizing the interactions experimentally, however, is laborious, time-consuming, and expensive. We have therefore developed a new prediction method called PepCNN, which uses structure- and sequence-based information from primary protein sequences to predict peptide-binding residues. Combining half-sphere exposure (HSE) structural information, sequence information from position-specific scoring matrices (PSSM) and a pre-trained transformer language model, and a convolutional neural network resulted in superior performance compared to the state-of-the-art methods on the two datasets.
3 | 
4 | ![Architecture](https://github.com/abelavit/PepCNN/assets/36461816/711066e5-aac9-4e3e-afcd-7223cf544f05)
5 | 
6 | # Download and Use
7 | There are two ways to use the provided code for each dataset.
8 | ## 1. Load the trained PepCNN model
9 | The results reported in our work can be replicated by executing the dataset1_PepCNN.py script for Dataset1 and the dataset2_PepCNN.py script for Dataset2. For instance, to obtain the result of PepCNN on Dataset1, run the dataset1_PepCNN.py script after downloading the following files from this [link](https://figshare.com/projects/Load_protein-peptide_binding_PepCNN_model/176094) (caution: data size is around 1.3GB for each dataset):
10 | - model weights: dataset1_best_model_weights.h5
11 | - training set negative samples: dataset1_Train_Negatives_All.dat
12 | - training set positive samples: dataset1_Train_Positives.dat
13 | - testing set: dataset1_Test_Samples.dat
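
With these four files placed alongside the scripts (they are opened by bare filename), reproducing the reported metrics is a single script run. A minimal example for Dataset1, assuming Python and the packages listed below are available in your environment:

```
python dataset1_PepCNN.py
```

This prints the test loss and AUC, reports Sen/Spe/Pre at the chosen threshold, and plots the ROC curve.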
14 | ## 2. Train the CNN model
15 | To train the network from scratch, execute the dataset1_PepCNN_train.py script for Dataset1 or the dataset2_PepCNN_train.py script for Dataset2. For instance, to train the network on dataset1, run the dataset1_PepCNN_train.py script after downloading the following files from this [link](https://figshare.com/projects/Train_the_CNN_model/176151) (caution: data size is 1.22GB for both datasets):
16 | - testing protein sequences: Dataset1_test.tsv
17 | - protein sequences excluding testing sequences: Dataset1_train.tsv
18 | - pre-trained transformer embeddings: T5_Features.dat
19 | - PSSM features: PSSM_Features.dat
20 | - HSE features: HSE_Features.dat
21 | 
22 | Package versions:
23 | Python 3.10.12,
24 | pandas 1.5.3,
25 | pickle (protocol 4.0),
26 | NumPy 1.25.2,
27 | scikit-learn 1.2.2,
28 | Matplotlib 3.7.2,
29 | TensorFlow 2.12.0
30 | 
31 | 
--------------------------------------------------------------------------------
/dataset1_num2_train_network.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Aug 16 10:18:50 2023
4 | 
5 | @author: abelac
6 | """
7 | 
8 | import numpy as np
9 | import pandas as pd
10 | import warnings
11 | import pickle
12 | from sklearn.preprocessing import StandardScaler
13 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
14 | import tensorflow as tf
15 | import tensorflow.keras.layers as tfl
16 | 
17 | 
18 | # load data that excludes the test data
19 | file = open("dataset1_Train_Positives_rerun.dat",'rb')
20 | positive_set = pickle.load(file)
21 | file = open("dataset1_Train_Negatives_All_rerun.dat",'rb')
22 | negative_set_entire = pickle.load(file)
23 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives']
24 | # randomly pick negative samples to balance them against the positive samples (1.5x positive samples)
25 | Negative_Samples = negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42)
26 | 
27 | # combine positive and negative sets to make the final dataset
28 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0)
29 | 
30 | # collect the features and labels of train set
31 | np.set_printoptions(suppress=True)
32 | X_val = [0]*len(Train_set)
33 | for i in range(len(Train_set)):
34 |     feat = Train_set['Feature'][i]
35 |     X_val[i] = feat
36 | X_train_orig = np.asarray(X_val)
37 | y_val = Train_set['Label'].to_numpy(dtype=float)
38 | Y_train_orig = y_val.reshape(y_val.size,1)
39 | 
40 | # shuffle features and labels together by indexing both arrays with one random permutation
41 | idx = np.random.permutation(len(X_train_orig))
42 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx]
43 | scaler = StandardScaler()
44 | scaler.fit(X_train) # fit on training set only
45 | X_train = scaler.transform(X_train) # apply transform to the training set
46 | 
47 | # load test data
48 | file = open("dataset1_Test_Samples_rerun.dat",'rb')
49 | Independent_test_set = pickle.load(file)
50 | # collect the features and labels for independent set
51 | X_independent = [0]*len(Independent_test_set)
52 | for i in range(len(Independent_test_set)):
53 |     feat = Independent_test_set['Feature'][i]
54 |     X_independent[i] = feat
55 | X_test = np.asarray(X_independent)
56 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float)
57 | Y_test = y_independent.reshape(y_independent.size,1)
58 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set
59 | 
60 | def CNN_Model():
61 | 
62 |     model = tf.keras.Sequential()
63 |     model.add(tfl.Conv1D(128, 5, 
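        # note: Conv1D expects input shaped (batch, feat_shape, 1); if model.fit()
        # rejects the 2-D standardized array, add a trailing channel axis first,
        # e.g. X_train = X_train[..., np.newaxis] (and likewise for X_test)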
padding='same', activation='relu', input_shape=(feat_shape,1))) 64 | model.add(tfl.BatchNormalization()) 65 | model.add(tfl.Dropout(0.23)) 66 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 67 | model.add(tfl.BatchNormalization()) 68 | model.add(tfl.Dropout(0.21)) 69 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 70 | model.add(tfl.BatchNormalization()) 71 | model.add(tfl.Dropout(0.47)) 72 | 73 | model.add(tfl.Flatten()) 74 | 75 | model.add(tfl.Dense(128, activation='relu')) 76 | model.add(tfl.Dense(32, activation='relu')) 77 | model.add(tfl.Dense(1, activation='sigmoid')) 78 | 79 | return model 80 | 81 | feat_shape = X_train[0].size 82 | 83 | cnn_model = CNN_Model() 84 | 85 | learning_rate = 0.000001 86 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 87 | cnn_model.compile(optimizer=optimizer, 88 | loss='binary_crossentropy', 89 | metrics=['AUC']) 90 | 91 | cnn_model.summary() 92 | 93 | # Train the Model 94 | batch_size = 30 95 | epochs = 200 96 | 97 | checkpoint = tf.keras.callbacks.ModelCheckpoint("dataset1_best_model_weights_rerun.h5", save_best_only=True) 98 | early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_auc', patience=3, restore_best_weights=True) 99 | history = cnn_model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=[checkpoint, early_stopping]) 100 | 101 | df_loss_auc = pd.DataFrame(history.history) 102 | df_loss= df_loss_auc[['loss','val_loss']] 103 | df_loss.rename(columns={'loss':'train','val_loss':'validation'},inplace=True) 104 | df_auc= df_loss_auc[['auc','val_auc']] 105 | df_auc.rename(columns={'auc':'train','val_auc':'validation'},inplace=True) 106 | Model_Loss_plot_title = 'Model Loss' 107 | df_loss.plot(title=Model_Loss_plot_title,figsize=(12,8)).set(xlabel='Epoch',ylabel='Loss') 108 | Model_AUC_plot_title = 'Model AUC' 109 | df_auc.plot(title=Model_AUC_plot_title,grid=True,figsize=(12,8)).set(xlabel='Epoch',ylabel='AUC') 110 | 111 | eval_result = cnn_model.evaluate(X_test, Y_test) 112 | print(f"test loss: {round(eval_result[0],4)}, test auc: {round(eval_result[1],4)}") 113 | -------------------------------------------------------------------------------- /dataset2_num2_train_network.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | import warnings 11 | import pickle 12 | from sklearn.preprocessing import StandardScaler 13 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 14 | import tensorflow as tf 15 | import tensorflow.keras.layers as tfl 16 | 17 | 18 | # load data that excludes the test data 19 | file = open("dataset2_Train_Positives_rerun.dat",'rb') 20 | positive_set = pickle.load(file) 21 | file = open("dataset2_Train_Negatives_All_rerun.dat",'rb') 22 | negative_set_entire = pickle.load(file) 23 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 24 | # randomly pick negative samples to balance it with positve samples (1.5x positive samples) 25 | Negative_Samples = negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42) 26 | 27 | # combine positive and negative sets to make the final dataset 28 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0) 29 | 30 | # collect the features and labels of train 
set 31 | np.set_printoptions(suppress=True) 32 | X_val = [0]*len(Train_set) 33 | for i in range(len(Train_set)): 34 | feat = Train_set['Feature'][i] 35 | X_val[i] = feat 36 | X_train_orig = np.asarray(X_val) 37 | y_val = Train_set['Label'].to_numpy(dtype=float) 38 | Y_train_orig = y_val.reshape(y_val.size,1) 39 | 40 | # Generate a random order of elements with np.random.permutation and simply index into the arrays Feature and label 41 | idx = np.random.permutation(len(X_train_orig)) 42 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx] 43 | scaler = StandardScaler() 44 | scaler.fit(X_train) # fit on training set only 45 | X_train = scaler.transform(X_train) # apply transform to the training set 46 | 47 | # load test data 48 | file = open("dataset2_Test_Samples_rerun.dat",'rb') 49 | Independent_test_set = pickle.load(file) 50 | # collect the features and labels for independent set 51 | X_independent = [0]*len(Independent_test_set) 52 | for i in range(len(Independent_test_set)): 53 | feat = Independent_test_set['Feature'][i] 54 | X_independent[i] = feat 55 | X_test = np.asarray(X_independent) 56 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float) 57 | Y_test = y_independent.reshape(y_independent.size,1) 58 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set 59 | 60 | def CNN_Model(): 61 | 62 | model = tf.keras.Sequential() 63 | model.add(tfl.Conv1D(128, 5, padding='same', activation='relu', input_shape=(feat_shape,1))) 64 | model.add(tfl.BatchNormalization()) 65 | model.add(tfl.Dropout(0.38)) 66 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 67 | model.add(tfl.BatchNormalization()) 68 | model.add(tfl.Dropout(0.38)) 69 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 70 | model.add(tfl.BatchNormalization()) 71 | model.add(tfl.Dropout(0.38)) 72 | 73 | model.add(tfl.Flatten()) 74 | 75 | model.add(tfl.Dense(128, activation='relu')) 76 | model.add(tfl.Dense(32, activation='relu')) 77 | model.add(tfl.Dense(1, activation='sigmoid')) 78 | 79 | return model 80 | 81 | feat_shape = X_train[0].size 82 | 83 | cnn_model = CNN_Model() 84 | 85 | learning_rate = 0.000001 86 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 87 | cnn_model.compile(optimizer=optimizer, 88 | loss='binary_crossentropy', 89 | metrics=['AUC']) 90 | 91 | cnn_model.summary() 92 | 93 | # Train the Model 94 | batch_size = 30 95 | epochs = 200 96 | 97 | checkpoint = tf.keras.callbacks.ModelCheckpoint("dataset2_best_model_weights_rerun.h5", save_best_only=True) 98 | early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_auc', patience=3, restore_best_weights=True) 99 | history = cnn_model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=[checkpoint, early_stopping]) 100 | 101 | df_loss_auc = pd.DataFrame(history.history) 102 | df_loss= df_loss_auc[['loss','val_loss']] 103 | df_loss.rename(columns={'loss':'train','val_loss':'validation'},inplace=True) 104 | df_auc= df_loss_auc[['auc','val_auc']] 105 | df_auc.rename(columns={'auc':'train','val_auc':'validation'},inplace=True) 106 | Model_Loss_plot_title = 'Model Loss' 107 | df_loss.plot(title=Model_Loss_plot_title,figsize=(12,8)).set(xlabel='Epoch',ylabel='Loss') 108 | Model_AUC_plot_title = 'Model AUC' 109 | df_auc.plot(title=Model_AUC_plot_title,grid=True,figsize=(12,8)).set(xlabel='Epoch',ylabel='AUC') 110 | 111 | eval_result = cnn_model.evaluate(X_test, Y_test) 112 | print(f"test loss: {round(eval_result[0],4)}, test auc: 
{round(eval_result[1],4)}") 113 | -------------------------------------------------------------------------------- /dataset1_PepCNN.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | from sklearn.metrics import roc_auc_score 11 | from sklearn.metrics import precision_score 12 | import warnings 13 | import pickle 14 | import copy 15 | from matplotlib import pyplot 16 | from sklearn.metrics import roc_curve 17 | from sklearn.metrics import confusion_matrix 18 | from sklearn.preprocessing import StandardScaler 19 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 20 | import tensorflow as tf 21 | import tensorflow.keras.layers as tfl 22 | 23 | # load data that excludes the test data 24 | file = open("dataset1_Train_Positives.dat",'rb') 25 | positive_set = pickle.load(file) 26 | file = open("dataset1_Train_Negatives_All.dat",'rb') 27 | negative_set_entire = pickle.load(file) 28 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 29 | # randomly pick negative samples to balance it with positve samples (1.5x positive samples) 30 | Negative_Samples = negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42) 31 | 32 | # combine positive and negative sets to make the final dataset 33 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0) 34 | 35 | # collect the features and labels of train set 36 | np.set_printoptions(suppress=True) 37 | X_val = [0]*len(Train_set) 38 | for i in range(len(Train_set)): 39 | feat = Train_set['Feature'][i] 40 | X_val[i] = feat 41 | X_train_orig = np.asarray(X_val) 42 | y_val = Train_set['Label'].to_numpy(dtype=float) 43 | Y_train_orig = y_val.reshape(y_val.size,1) 44 | 45 | # Generate a random order of elements with np.random.permutation and simply index into the arrays Feature and label 46 | idx = np.random.permutation(len(X_train_orig)) 47 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx] 48 | scaler = StandardScaler() 49 | scaler.fit(X_train) # fit on training set only 50 | X_train = scaler.transform(X_train) # apply transform to the training set 51 | 52 | # load test data 53 | file = open("dataset1_Test_Samples.dat",'rb') 54 | Independent_test_set = pickle.load(file) 55 | # collect the features and labels for independent set 56 | X_independent = [0]*len(Independent_test_set) 57 | for i in range(len(Independent_test_set)): 58 | feat = Independent_test_set['Feature'][i] 59 | X_independent[i] = feat 60 | X_test = np.asarray(X_independent) 61 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float) 62 | Y_test = y_independent.reshape(y_independent.size,1) 63 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set 64 | 65 | def CNN_Model(): 66 | 67 | model = tf.keras.Sequential() 68 | model.add(tfl.Conv1D(128, 5, padding='same', activation='relu', input_shape=(feat_shape,1))) 69 | model.add(tfl.BatchNormalization()) 70 | model.add(tfl.Dropout(0.23)) 71 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 72 | model.add(tfl.BatchNormalization()) 73 | model.add(tfl.Dropout(0.21)) 74 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 75 | model.add(tfl.BatchNormalization()) 76 | model.add(tfl.Dropout(0.47)) 77 | 78 | model.add(tfl.Flatten()) 79 | 80 | 
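# classifier head: the flattened convolutional features pass through two fully
# connected layers and a single sigmoid unit that outputs the binding probability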
model.add(tfl.Dense(128, activation='relu')) 81 | model.add(tfl.Dense(32, activation='relu')) 82 | model.add(tfl.Dense(1, activation='sigmoid')) 83 | 84 | return model 85 | 86 | feat_shape = X_train[0].size 87 | 88 | cnn_model = CNN_Model() 89 | 90 | learning_rate = 0.000001 91 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 92 | cnn_model.compile(optimizer=optimizer, 93 | loss='binary_crossentropy', 94 | metrics=['AUC']) 95 | 96 | cnn_model.summary() 97 | 98 | # load the trained weights 99 | cnn_model.load_weights('dataset1_best_model_weights.h5') 100 | 101 | eval_result = cnn_model.evaluate(X_test, Y_test) 102 | 103 | print(f"test loss: {round(eval_result[0],4)}, test auc: {round(eval_result[1],4)}") 104 | Inde_test_prob = cnn_model.predict(X_test) 105 | 106 | 107 | def round_based_on_thres(probs_to_round, set_thres): 108 | for i in range(len(probs_to_round)): 109 | if probs_to_round[i] <= set_thres: 110 | probs_to_round[i] = 0 111 | else: 112 | probs_to_round[i] = 1 113 | return probs_to_round 114 | 115 | # calculate the metrics 116 | set_thres = 0.877 117 | copy_Probs_inde = copy.copy(Inde_test_prob) 118 | round_based_on_thres(copy_Probs_inde, set_thres) 119 | fpr, tpr, thresholds = roc_curve(Y_test, Inde_test_prob) 120 | inde_auc = round(roc_auc_score(Y_test, Inde_test_prob),4) 121 | inde_pre = round(precision_score(Y_test, copy_Probs_inde),4) 122 | cm = confusion_matrix(Y_test, copy_Probs_inde) # for acc, sen, and spe calculation 123 | total_preds=sum(sum(cm)) 124 | TN = cm[0,0] 125 | FP = cm[0,1] 126 | FN = cm[1,0] 127 | TP = cm[1,1] 128 | inde_sen = round(TP/(TP+FN),4) 129 | inde_spe = round(TN/(TN+FP),4) 130 | 131 | # display the metrics 132 | print(f'Independent Sen: {inde_sen}') 133 | print(f'Independent Spe: {inde_spe}') 134 | print(f'Independent Pre: {inde_pre}') 135 | print(f'Independent AUC: {inde_auc}') 136 | 137 | # plot ROC curve 138 | legend = 'AUC = ' + str(inde_auc) 139 | pyplot.figure(figsize=(12,8)) 140 | pyplot.plot([0,1], [0,1], linestyle='--') 141 | pyplot.plot(fpr, tpr, marker='.', label=legend) 142 | pyplot.xlabel('False Positive Rate') 143 | pyplot.ylabel('True Positive Rate') 144 | pyplot.legend() 145 | pyplot.show() 146 | -------------------------------------------------------------------------------- /dataset2_PepCNN.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | from sklearn.metrics import roc_auc_score 11 | from sklearn.metrics import precision_score 12 | import warnings 13 | import pickle 14 | import copy 15 | from matplotlib import pyplot 16 | from sklearn.metrics import roc_curve 17 | from sklearn.metrics import confusion_matrix 18 | from sklearn.preprocessing import StandardScaler 19 | warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 20 | import tensorflow as tf 21 | import tensorflow.keras.layers as tfl 22 | 23 | # load data that excludes the test data 24 | file = open("dataset2_Train_Positives.dat",'rb') 25 | positive_set = pickle.load(file) 26 | file = open("dataset2_Train_Negatives_All.dat",'rb') 27 | negative_set_entire = pickle.load(file) 28 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 29 | # randomly pick negative samples to balance it with positve samples (1.5x positive samples) 30 | Negative_Samples = 
negative_set_entire.sample(n=round(len(positive_set)*1.5), random_state=42) 31 | 32 | # combine positive and negative sets to make the final dataset 33 | Train_set = pd.concat([positive_set, Negative_Samples], ignore_index=True, axis=0) 34 | 35 | # collect the features and labels of train set 36 | np.set_printoptions(suppress=True) 37 | X_val = [0]*len(Train_set) 38 | for i in range(len(Train_set)): 39 | feat = Train_set['Feature'][i] 40 | X_val[i] = feat 41 | X_train_orig = np.asarray(X_val) 42 | y_val = Train_set['Label'].to_numpy(dtype=float) 43 | Y_train_orig = y_val.reshape(y_val.size,1) 44 | 45 | # Generate a random order of elements with np.random.permutation and simply index into the arrays Feature and label 46 | idx = np.random.permutation(len(X_train_orig)) 47 | X_train,Y_train = X_train_orig[idx], Y_train_orig[idx] 48 | scaler = StandardScaler() 49 | scaler.fit(X_train) # fit on training set only 50 | X_train = scaler.transform(X_train) # apply transform to the training set 51 | 52 | # load test data 53 | file = open("dataset2_Test_Samples.dat",'rb') 54 | Independent_test_set = pickle.load(file) 55 | # collect the features and labels for independent set 56 | X_independent = [0]*len(Independent_test_set) 57 | for i in range(len(Independent_test_set)): 58 | feat = Independent_test_set['Feature'][i] 59 | X_independent[i] = feat 60 | X_test = np.asarray(X_independent) 61 | y_independent = Independent_test_set['Label'].to_numpy(dtype=float) 62 | Y_test = y_independent.reshape(y_independent.size,1) 63 | X_test = scaler.transform(X_test) # apply standardization (transform) to the test set 64 | 65 | def CNN_Model(): 66 | 67 | model = tf.keras.Sequential() 68 | model.add(tfl.Conv1D(128, 5, padding='same', activation='relu', input_shape=(feat_shape,1))) 69 | model.add(tfl.BatchNormalization()) 70 | model.add(tfl.Dropout(0.38)) 71 | model.add(tfl.Conv1D(128, 3, padding='same',activation='relu')) 72 | model.add(tfl.BatchNormalization()) 73 | model.add(tfl.Dropout(0.38)) 74 | model.add(tfl.Conv1D(64, 3, padding='same',activation='relu')) 75 | model.add(tfl.BatchNormalization()) 76 | model.add(tfl.Dropout(0.38)) 77 | 78 | model.add(tfl.Flatten()) 79 | 80 | model.add(tfl.Dense(128, activation='relu')) 81 | model.add(tfl.Dense(32, activation='relu')) 82 | model.add(tfl.Dense(1, activation='sigmoid')) 83 | 84 | return model 85 | 86 | feat_shape = X_train[0].size 87 | 88 | cnn_model = CNN_Model() 89 | 90 | learning_rate = 0.000001 91 | optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate) 92 | cnn_model.compile(optimizer=optimizer, 93 | loss='binary_crossentropy', 94 | metrics=['AUC']) 95 | 96 | cnn_model.summary() 97 | 98 | # load the trained weights 99 | cnn_model.load_weights('dataset2_best_model_weights.h5') 100 | 101 | eval_result = cnn_model.evaluate(X_test, Y_test) 102 | 103 | print(f"test loss: {round(eval_result[0],4)}, test auc: {round(eval_result[1],4)}") 104 | Inde_test_prob = cnn_model.predict(X_test) 105 | 106 | 107 | def round_based_on_thres(probs_to_round, set_thres): 108 | for i in range(len(probs_to_round)): 109 | if probs_to_round[i] <= set_thres: 110 | probs_to_round[i] = 0 111 | else: 112 | probs_to_round[i] = 1 113 | return probs_to_round 114 | 115 | # calculate the metrics 116 | set_thres = 0.885 117 | copy_Probs_inde = copy.copy(Inde_test_prob) 118 | round_based_on_thres(copy_Probs_inde, set_thres) 119 | fpr, tpr, thresholds = roc_curve(Y_test, Inde_test_prob) 120 | inde_auc = round(roc_auc_score(Y_test, Inde_test_prob),4) 121 | inde_pre = 
round(precision_score(Y_test, copy_Probs_inde),4) 122 | cm = confusion_matrix(Y_test, copy_Probs_inde) # for acc, sen, and spe calculation 123 | total_preds=sum(sum(cm)) 124 | TN = cm[0,0] 125 | FP = cm[0,1] 126 | FN = cm[1,0] 127 | TP = cm[1,1] 128 | inde_sen = round(TP/(TP+FN),4) 129 | inde_spe = round(TN/(TN+FP),4) 130 | 131 | # display the metrics 132 | print(f'Independent Sen: {inde_sen}') 133 | print(f'Independent Spe: {inde_spe}') 134 | print(f'Independent Pre: {inde_pre}') 135 | print(f'Independent AUC: {inde_auc}') 136 | 137 | # plot ROC curve 138 | legend = 'AUC = ' + str(inde_auc) 139 | pyplot.figure(figsize=(12,8)) 140 | pyplot.plot([0,1], [0,1], linestyle='--') 141 | pyplot.plot(fpr, tpr, marker='.', label=legend) 142 | pyplot.xlabel('False Positive Rate') 143 | pyplot.ylabel('True Positive Rate') 144 | pyplot.legend() 145 | pyplot.show() 146 | -------------------------------------------------------------------------------- /dataset2_num1_extraction_of_samples.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | import pickle 8 | import pandas as pd 9 | import numpy as np 10 | import math 11 | 12 | 13 | # function to extract samples 14 | def peptide_feat(window_size, Protein_seq, Feat, j): # funtion to extract peptide length and feature based on window size 15 | 16 | if (j - math.ceil(window_size/2)) < -1: # not enough amino acid at N terminus to form peptide 17 | peptide1 = Protein_seq[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 18 | peptide2 = Protein_seq[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 19 | peptide = peptide2[::-1] + peptide1 20 | 21 | feat1 = Feat[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 22 | feat2 = Feat[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 23 | final_feat = np.concatenate((feat2[::-1], feat1)) 24 | mirrored = 'Yes' 25 | 26 | elif ((len(Protein_seq) - (j+1)) < (math.floor(window_size/2))): # not enough amino acid at C terminus to form peptide 27 | peptide1 = Protein_seq[j-math.floor(window_size/2):j+1] 28 | peptide2 = Protein_seq[j-math.floor(window_size/2):j] 29 | peptide = peptide1 + peptide2[::-1] 30 | 31 | feat1 = Feat[j-math.floor(window_size/2):j+1] 32 | feat2 = Feat[j-math.floor(window_size/2):j] 33 | final_feat = np.concatenate((feat1, feat2[::-1])) 34 | mirrored = 'Yes' 35 | 36 | else: 37 | peptide = Protein_seq[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 38 | final_feat = Feat[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 39 | mirrored = 'No' 40 | 41 | return peptide, final_feat, mirrored 42 | 43 | 44 | 45 | 46 | # Prepare data 47 | Dataset_test_tsv = pd.read_table("Dataset2_test.tsv") 48 | Dataset_train_tsv = pd.read_table("Dataset2_train.tsv") 49 | 50 | file = open("T5_Features.dat",'rb') 51 | Proteins = pickle.load(file) 52 | file = open("HSE_Features.dat",'rb') 53 | Proteins2 = pickle.load(file) 54 | file = open("PSSM_Features.dat",'rb') 55 | Proteins3 = pickle.load(file) 56 | 57 | column_headers = list(Proteins.columns.values) 58 | DatasetTestProteins = pd.DataFrame(columns = column_headers) 59 | DatasetTestProteins2 = pd.DataFrame(columns = column_headers) 60 | DatasetTestProteins3 = pd.DataFrame(columns = column_headers) 61 | 62 | matching_index = 0 63 | for i in 
range(len(Dataset_test_tsv)): 64 | for j in range(len(Proteins)): 65 | if (Dataset_test_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 66 | DatasetTestProteins.loc[matching_index] = Proteins.loc[j] 67 | matching_index += 1 68 | break 69 | matching_index = 0 70 | for i in range(len(Dataset_test_tsv)): 71 | for j in range(len(Proteins2)): 72 | if (Dataset_test_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 73 | DatasetTestProteins2.loc[matching_index] = Proteins2.loc[j] 74 | matching_index += 1 75 | break 76 | matching_index = 0 77 | for i in range(len(Dataset_test_tsv)): 78 | for j in range(len(Proteins3)): 79 | if (Dataset_test_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 80 | DatasetTestProteins3.loc[matching_index] = Proteins3.loc[j] 81 | matching_index += 1 82 | break 83 | 84 | DatasetTrainProteins = pd.DataFrame(columns = column_headers) 85 | DatasetTrainProteins2 = pd.DataFrame(columns = column_headers) 86 | DatasetTrainProteins3 = pd.DataFrame(columns = column_headers) 87 | 88 | matching_index = 0 89 | for i in range(len(Dataset_train_tsv)): 90 | for j in range(len(Proteins)): 91 | if (Dataset_train_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 92 | DatasetTrainProteins.loc[matching_index] = Proteins.loc[j] 93 | matching_index += 1 94 | break 95 | 96 | matching_index = 0 97 | for i in range(len(Dataset_train_tsv)): 98 | for j in range(len(Proteins2)): 99 | if (Dataset_train_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 100 | DatasetTrainProteins2.loc[matching_index] = Proteins2.loc[j] 101 | matching_index += 1 102 | break 103 | matching_index = 0 104 | for i in range(len(Dataset_train_tsv)): 105 | for j in range(len(Proteins3)): 106 | if (Dataset_train_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 107 | DatasetTrainProteins3.loc[matching_index] = Proteins3.loc[j] 108 | matching_index += 1 109 | break 110 | 111 | # generate samples for Test protein sequences 112 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 113 | Test_Samples = pd.DataFrame(columns = column_names) 114 | #Test_Negatives = pd.DataFrame(columns = column_names) 115 | 116 | Pos_index = 0 117 | Neg_index = 0 118 | window_size = 1 # -0 to +0 119 | seq_num = 0 120 | 121 | # extract feature and peptide for all sites 122 | for i in range(len(DatasetTestProteins)): 123 | Protein_seq = DatasetTestProteins['Prot_seq'][i] 124 | Feat = DatasetTestProteins['Feat'][i] # transpose the feature matrix 125 | Feat2 = DatasetTestProteins2['Feat'][i] 126 | Feat3 = DatasetTestProteins3['Feat'][i] 127 | positive_counts = DatasetTestProteins['Prot_label'][i].count('1') 128 | 129 | seq_num += 1 130 | for j in range(len(Protein_seq)): # go through the protein seq 131 | 132 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 
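# one row per residue: identifiers and the binding label are stored first; the
# peptide window and the concatenated feature vector are filled in further below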
133 | A_sample.loc[0,'Code'] = DatasetTestProteins['Prot_name'][i] # store the protein name 134 | A_sample.loc[0,'Protein_len'] = DatasetTestProteins['Prot_len'][i] # store the protein length 135 | A_sample.loc[0,'Label'] = DatasetTestProteins['Prot_label'][i][j] 136 | A_sample.loc[0,'Prot_positives'] = positive_counts 137 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 138 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 139 | 140 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 141 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 142 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 143 | 144 | A_sample.loc[0,'Peptide'] = peptide 145 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 146 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 147 | A_sample.loc[0,'Seq_num'] = seq_num 148 | A_sample.loc[0,'Mirrored'] = mirrored 149 | 150 | 151 | Test_Samples = pd.concat([Test_Samples, A_sample], ignore_index=True, axis=0) 152 | 153 | 154 | print('Test Protein ' + str(i+1) + ' out of ' + str(len(DatasetTestProteins))) 155 | print('Number of Proteins in Test: ' + str(len(DatasetTestProteins))) 156 | print('Number of samples in Test: ' + str(len(Test_Samples))) 157 | 158 | pickle.dump(Test_Samples,open("dataset2_Test_Samples_rerun.dat","wb")) 159 | 160 | # generate samples for Train protein sequences 161 | Train_Positives = pd.DataFrame(columns = column_names) 162 | Train_Negatives_All = pd.DataFrame(columns = column_names) 163 | 164 | Pos_index = 0 165 | Neg_index = 0 166 | seq_num = 0 167 | 168 | # extract feature and peptide for all sites 169 | for i in range(len(DatasetTrainProteins)): 170 | Protein_seq = DatasetTrainProteins['Prot_seq'][i] 171 | Feat = DatasetTrainProteins['Feat'][i] # transpose the feature matrix 172 | Feat2 = DatasetTrainProteins2['Feat'][i] 173 | Feat3 = DatasetTrainProteins3['Feat'][i] 174 | positive_counts = DatasetTrainProteins['Prot_label'][i].count('1') 175 | 176 | seq_num += 1 177 | for j in range(len(Protein_seq)): # go through the protein seq 178 | 179 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 
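# same per-residue fields as the test samples; each finished row is later routed
# into the positive or negative pool according to its label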
180 | A_sample.loc[0,'Code'] = DatasetTrainProteins['Prot_name'][i] # store the protein name 181 | A_sample.loc[0,'Protein_len'] = DatasetTrainProteins['Prot_len'][i] # store the protein length 182 | A_sample.loc[0,'Label'] = DatasetTrainProteins['Prot_label'][i][j] 183 | A_sample.loc[0,'Prot_positives'] = positive_counts 184 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 185 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 186 | 187 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 188 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 189 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 190 | 191 | A_sample.loc[0,'Peptide'] = peptide 192 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 193 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 194 | A_sample.loc[0,'Seq_num'] = seq_num 195 | A_sample.loc[0,'Mirrored'] = mirrored 196 | 197 | if A_sample.loc[0,'Label'] == '1': 198 | Train_Positives = pd.concat([Train_Positives, A_sample], ignore_index=True, axis=0) 199 | 200 | else: 201 | Train_Negatives_All = pd.concat([Train_Negatives_All, A_sample], ignore_index=True, axis=0) 202 | 203 | 204 | print('Train Protein ' + str(i+1) + ' out of ' + str(len(DatasetTrainProteins))) 205 | print('Number of Proteins in Train: ' + str(len(DatasetTrainProteins))) 206 | print('Feature vector size: ' + str(Test_Samples['Feature'][0].shape)) 207 | print('Num of Train Positives: ' + str(len(Train_Positives))) 208 | print('Num of Train Negatives (All): ' + str(len(Train_Negatives_All))) 209 | pickle.dump(Train_Positives,open("dataset2_Train_Positives_rerun.dat","wb")) 210 | pickle.dump(Train_Negatives_All,open("dataset2_Train_Negatives_All_rerun.dat","wb")) 211 | 212 | 213 | -------------------------------------------------------------------------------- /dataset1_num1_extraction_of_samples.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Aug 16 10:18:50 2023 4 | 5 | @author: abelac 6 | """ 7 | import pickle 8 | import pandas as pd 9 | import numpy as np 10 | import math 11 | 12 | 13 | # function to extract samples 14 | def peptide_feat(window_size, Protein_seq, Feat, j): # funtion to extract peptide length and feature based on window size 15 | 16 | if (j - math.ceil(window_size/2)) < -1: # not enough amino acid at N terminus to form peptide 17 | peptide1 = Protein_seq[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 18 | peptide2 = Protein_seq[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 19 | peptide = peptide2[::-1] + peptide1 20 | 21 | feat1 = Feat[j:math.floor(window_size/2)+j+1] # +1 since the stop value for slicing is exclusive 22 | feat2 = Feat[j+1:math.floor(window_size/2)+j+1] # other peptide half but excluding the central amino acid 23 | final_feat = np.concatenate((feat2[::-1], feat1)) 24 | mirrored = 'Yes' 25 | 26 | elif ((len(Protein_seq) - (j+1)) < (math.floor(window_size/2))): # not enough amino acid at C terminus to form peptide 27 | peptide1 = Protein_seq[j-math.floor(window_size/2):j+1] 28 | peptide2 = Protein_seq[j-math.floor(window_size/2):j] 29 | peptide = peptide1 + peptide2[::-1] 30 | 31 | feat1 = Feat[j-math.floor(window_size/2):j+1] 32 | feat2 = 
Feat[j-math.floor(window_size/2):j] 33 | final_feat = np.concatenate((feat1, feat2[::-1])) 34 | mirrored = 'Yes' 35 | 36 | else: 37 | peptide = Protein_seq[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 38 | final_feat = Feat[j-math.floor(window_size/2):math.floor(window_size/2)+j+1] 39 | mirrored = 'No' 40 | 41 | return peptide, final_feat, mirrored 42 | 43 | 44 | 45 | 46 | # Prepare data 47 | Dataset_test_tsv = pd.read_table("Dataset1_test.tsv") 48 | Dataset_train_tsv = pd.read_table("Dataset1_train.tsv") 49 | 50 | file = open("T5_Features.dat",'rb') 51 | Proteins = pickle.load(file) 52 | file = open("HSE_Features.dat",'rb') 53 | Proteins2 = pickle.load(file) 54 | file = open("PSSM_Features.dat",'rb') 55 | Proteins3 = pickle.load(file) 56 | 57 | column_headers = list(Proteins.columns.values) 58 | DatasetTestProteins = pd.DataFrame(columns = column_headers) 59 | DatasetTestProteins2 = pd.DataFrame(columns = column_headers) 60 | DatasetTestProteins3 = pd.DataFrame(columns = column_headers) 61 | 62 | matching_index = 0 63 | for i in range(len(Dataset_test_tsv)): 64 | for j in range(len(Proteins)): 65 | if (Dataset_test_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 66 | if(Proteins['Prot_len'][j] > 30) & (Proteins['Prot_label'][j].count('1') >= 3): 67 | DatasetTestProteins.loc[matching_index] = Proteins.loc[j] 68 | matching_index += 1 69 | break 70 | matching_index = 0 71 | for i in range(len(Dataset_test_tsv)): 72 | for j in range(len(Proteins2)): 73 | if (Dataset_test_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 74 | if(Proteins2['Prot_len'][j] > 30) & (Proteins2['Prot_label'][j].count('1') >= 3): 75 | DatasetTestProteins2.loc[matching_index] = Proteins2.loc[j] 76 | matching_index += 1 77 | break 78 | matching_index = 0 79 | for i in range(len(Dataset_test_tsv)): 80 | for j in range(len(Proteins3)): 81 | if (Dataset_test_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 82 | if(Proteins3['Prot_len'][j] > 30) & (Proteins3['Prot_label'][j].count('1') >= 3): 83 | DatasetTestProteins3.loc[matching_index] = Proteins3.loc[j] 84 | matching_index += 1 85 | break 86 | 87 | DatasetTrainProteins = pd.DataFrame(columns = column_headers) 88 | DatasetTrainProteins2 = pd.DataFrame(columns = column_headers) 89 | DatasetTrainProteins3 = pd.DataFrame(columns = column_headers) 90 | 91 | matching_index = 0 92 | for i in range(len(Dataset_train_tsv)): 93 | for j in range(len(Proteins)): 94 | if (Dataset_train_tsv['seq'][i].upper() == Proteins['Prot_seq'][j].upper()): 95 | if(Proteins['Prot_len'][j] > 30) & (Proteins['Prot_label'][j].count('1') >= 3): 96 | DatasetTrainProteins.loc[matching_index] = Proteins.loc[j] 97 | matching_index += 1 98 | break 99 | 100 | matching_index = 0 101 | for i in range(len(Dataset_train_tsv)): 102 | for j in range(len(Proteins2)): 103 | if (Dataset_train_tsv['seq'][i].upper() == Proteins2['Prot_seq'][j].upper()): 104 | if(Proteins2['Prot_len'][j] > 30) & (Proteins2['Prot_label'][j].count('1') >= 3): 105 | DatasetTrainProteins2.loc[matching_index] = Proteins2.loc[j] 106 | matching_index += 1 107 | break 108 | matching_index = 0 109 | for i in range(len(Dataset_train_tsv)): 110 | for j in range(len(Proteins3)): 111 | if (Dataset_train_tsv['seq'][i].upper() == Proteins3['Prot_seq'][j].upper()): 112 | if(Proteins3['Prot_len'][j] > 30) & (Proteins3['Prot_label'][j].count('1') >= 3): 113 | DatasetTrainProteins3.loc[matching_index] = Proteins3.loc[j] 114 | matching_index += 1 115 | break 116 | 117 | # generate samples for Test protein 
sequences 118 | column_names = ['Code','Protein_len','Seq_num','Amino_Acid','Position','Label','Peptide','Mirrored','Feature','Prot_positives'] 119 | Test_Samples = pd.DataFrame(columns = column_names) 120 | 121 | Pos_index = 0 122 | Neg_index = 0 123 | window_size = 1 # -0 to +0 124 | seq_num = 0 125 | 126 | # extract feature and peptide for all sites 127 | for i in range(len(DatasetTestProteins)): 128 | Protein_seq = DatasetTestProteins['Prot_seq'][i] 129 | Feat = DatasetTestProteins['Feat'][i] # transpose the feature matrix 130 | Feat2 = DatasetTestProteins2['Feat'][i] 131 | Feat3 = DatasetTestProteins3['Feat'][i] 132 | positive_counts = DatasetTestProteins['Prot_label'][i].count('1') 133 | 134 | seq_num += 1 135 | for j in range(len(Protein_seq)): # go through the protein seq 136 | 137 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 138 | A_sample.loc[0,'Code'] = DatasetTestProteins['Prot_name'][i] # store the protein name 139 | A_sample.loc[0,'Protein_len'] = DatasetTestProteins['Prot_len'][i] # store the protein length 140 | A_sample.loc[0,'Label'] = DatasetTestProteins['Prot_label'][i][j] 141 | A_sample.loc[0,'Prot_positives'] = positive_counts 142 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 143 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 144 | 145 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 146 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 147 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 148 | 149 | A_sample.loc[0,'Peptide'] = peptide 150 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 151 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 152 | A_sample.loc[0,'Seq_num'] = seq_num 153 | A_sample.loc[0,'Mirrored'] = mirrored 154 | 155 | 156 | Test_Samples = pd.concat([Test_Samples, A_sample], ignore_index=True, axis=0) 157 | 158 | 159 | print('Test Protein ' + str(i+1) + ' out of ' + str(len(DatasetTestProteins))) 160 | print('Number of Proteins in Test: ' + str(len(DatasetTestProteins))) 161 | print('Number of samples in Test: ' + str(len(Test_Samples))) 162 | 163 | pickle.dump(Test_Samples,open("dataset1_Test_Samples_rerun.dat","wb")) 164 | 165 | # generate samples for Train protein sequences 166 | Train_Positives = pd.DataFrame(columns = column_names) 167 | Train_Negatives_All = pd.DataFrame(columns = column_names) 168 | 169 | Pos_index = 0 170 | Neg_index = 0 171 | seq_num = 0 172 | 173 | # extract feature and peptide for all sites 174 | for i in range(len(DatasetTrainProteins)): 175 | Protein_seq = DatasetTrainProteins['Prot_seq'][i] 176 | Feat = DatasetTrainProteins['Feat'][i] # transpose the feature matrix 177 | Feat2 = DatasetTrainProteins2['Feat'][i] 178 | Feat3 = DatasetTrainProteins3['Feat'][i] 179 | positive_counts = DatasetTrainProteins['Prot_label'][i].count('1') 180 | 181 | seq_num += 1 182 | for j in range(len(Protein_seq)): # go through the protein seq 183 | 184 | A_sample = pd.DataFrame(columns = column_names) # create new dataframe using same column names. This dataframe will just have 1 entry. 
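# as in the test loop, the mean-pooled T5 embedding and the flattened HSE and
# PSSM windows are concatenated into a single feature vector further below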
185 | A_sample.loc[0,'Code'] = DatasetTrainProteins['Prot_name'][i] # store the protein name 186 | A_sample.loc[0,'Protein_len'] = DatasetTrainProteins['Prot_len'][i] # store the protein length 187 | A_sample.loc[0,'Label'] = DatasetTrainProteins['Prot_label'][i][j] 188 | A_sample.loc[0,'Prot_positives'] = positive_counts 189 | A_sample.loc[0,'Amino_Acid'] = Protein_seq[j] # store the amino acid 190 | A_sample.loc[0,'Position'] = j # store the position of the amino acid 191 | 192 | peptide, T5_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat, j) # call the function to extract peptide and feature based on window size 193 | peptide, HSE_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat2, j) 194 | peptide, PSSM_feat, mirrored = peptide_feat(window_size, Protein_seq, Feat3, j) 195 | 196 | A_sample.loc[0,'Peptide'] = peptide 197 | Feat_vec = np.concatenate((T5_feat.mean(0),HSE_feat.flatten(),PSSM_feat.flatten())) 198 | A_sample.loc[0,'Feature'] = np.float32(Feat_vec) 199 | A_sample.loc[0,'Seq_num'] = seq_num 200 | A_sample.loc[0,'Mirrored'] = mirrored 201 | 202 | if A_sample.loc[0,'Label'] == '1': 203 | Train_Positives = pd.concat([Train_Positives, A_sample], ignore_index=True, axis=0) 204 | 205 | else: 206 | Train_Negatives_All = pd.concat([Train_Negatives_All, A_sample], ignore_index=True, axis=0) 207 | 208 | 209 | print('Train Protein ' + str(i+1) + ' out of ' + str(len(DatasetTrainProteins))) 210 | print('Number of Proteins in Train: ' + str(len(DatasetTrainProteins))) 211 | print('Feature vector size: ' + str(Test_Samples['Feature'][0].shape)) 212 | print('Num of Train Positives: ' + str(len(Train_Positives))) 213 | print('Num of Train Negatives (All): ' + str(len(Train_Negatives_All))) 214 | pickle.dump(Train_Positives,open("dataset1_Train_Positives_rerun.dat","wb")) 215 | pickle.dump(Train_Negatives_All,open("dataset1_Train_Negatives_All_rerun.dat","wb")) 216 | 217 | 218 | --------------------------------------------------------------------------------
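A note on the window handling in the extraction scripts: when a residue sits within half a window of either terminus, peptide_feat mirrors the available half of the window so that every sample has the same length. With window_size = 1, as set above, each peptide is just the central residue and no mirroring ever occurs; the mirroring only matters for larger windows. A minimal sketch on toy inputs (hypothetical sequence and features; it assumes peptide_feat has been copied into the session from either extraction script, since importing the module would run the full pipeline):

```python
import numpy as np

seq = 'ACDEF'                       # toy 5-residue protein
feat = np.arange(10).reshape(5, 2)  # toy per-residue feature matrix

# residue 0 lacks an N-terminal half-window, so 'CD' is mirrored in front
peptide, final_feat, mirrored = peptide_feat(5, seq, feat, 0)
print(peptide, mirrored)  # -> DCACD Yes

# residue 2 has full context on both sides, so the window is taken as-is
peptide, final_feat, mirrored = peptide_feat(5, seq, feat, 2)
print(peptide, mirrored)  # -> ACDEF No
```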