├── .DS_Store
├── .gitattributes
├── LICENSE
├── README.md
├── data.zip
├── model_down.PNG
├── model_up.PNG
├── results
│   └── .DS_Store
└── src
    ├── .DS_Store
    ├── Graph2GO
    │   ├── .DS_Store
    │   ├── evaluation.py
    │   ├── gcnModel.py
    │   ├── initializations.py
    │   ├── input_data.py
    │   ├── layers.py
    │   ├── main.py
    │   ├── optimizer.py
    │   ├── preprocessing.py
    │   ├── trainGcn.py
    │   └── trainNN.py
    └── preprocessing
        ├── .DS_Store
        ├── go_anchestor.py
        └── preprocessing.py

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/.DS_Store
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Kunjie Fan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Graph2GO
2 | ## Description
3 | Graph2GO is a graph-based representation learning method for predicting protein functions. It uses both network information and node attributes to improve performance. Protein-protein interaction (PPI) networks and sequence-similarity networks are used to construct graphs, over which node attributes are propagated following the definition of graph convolutional networks.
4 |
5 | We use amino acid sequences (CT encoding), subcellular locations (bag-of-words encoding) and protein domains (bag-of-words encoding) as the node attributes (initial feature representation).
6 |
7 | The auto-encoder part of our model builds on the implementation by T. N. Kipf. You can find the source code [here](https://github.com/tkipf/gae).
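As a concrete illustration of the sequence feature, the sketch below mirrors the conjoint triad (CT) encoding implemented in `src/preprocessing/preprocessing.py`: the 20 amino acids are grouped into 7 classes, and every overlapping triad of classes is counted, giving a 343-dimensional vector per protein. The function and variable names in the sketch are illustrative only, not part of the package API.

```python
# Minimal sketch of the CT (conjoint triad) encoding, mirroring the CT() helper
# in src/preprocessing/preprocessing.py. Proteins containing ambiguous residues
# (B, O, J, U, X, Z) are filtered out before this encoding is applied.
import numpy as np

CLASS_MAP = {'G': '1', 'A': '1', 'V': '1', 'L': '2', 'I': '2', 'F': '2', 'P': '2',
             'Y': '3', 'M': '3', 'T': '3', 'S': '3', 'H': '4', 'N': '4', 'Q': '4', 'W': '4',
             'R': '5', 'K': '5', 'D': '6', 'E': '6', 'C': '7'}

def ct_encode(sequence):
    seq = ''.join(CLASS_MAP[aa] for aa in sequence)  # map each residue to its class label
    coding = np.zeros(343, dtype=int)                # 7 * 7 * 7 possible class triads
    for i in range(len(seq) - 2):                    # slide over every overlapping triad
        index = int(seq[i]) + (int(seq[i + 1]) - 1) * 7 + (int(seq[i + 2]) - 1) * 49 - 1
        coding[index] += 1
    return coding

print(ct_encode("MKTAYIAKQR"))  # toy sequence, prints the 343-dimensional count vector
```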
8 |
9 | ![VGAE model](https://github.com/yanzhanglab/Graph2GO/blob/master/model_up.PNG)
10 | ![model architecture](https://github.com/yanzhanglab/Graph2GO/blob/master/model_down.PNG)
11 |
12 | ## Citing
13 | If you find Graph2GO useful for your research, please consider citing our [work](https://academic.oup.com/gigascience/article/9/8/giaa081/5885490):
14 | ```
15 | @article{10.1093/gigascience/giaa081,
16 |     author = {Fan, Kunjie and Guan, Yuanfang and Zhang, Yan},
17 |     title = "{Graph2GO: a multi-modal attributed network embedding method for inferring protein functions}",
18 |     journal = {GigaScience},
19 |     volume = {9},
20 |     number = {8},
21 |     year = {2020},
22 |     month = {08},
23 |     issn = {2047-217X},
24 |     doi = {10.1093/gigascience/giaa081},
25 |     url = {https://doi.org/10.1093/gigascience/giaa081}
26 | }
27 | ```
28 |
29 | ## Usage
30 | ### Requirements
31 | - Python 3.6
32 | - TensorFlow
33 | - Keras
34 | - networkx
35 | - scipy
36 | - numpy
37 | - pickle
38 | - scikit-learn
39 | - pandas
40 |
41 | ### Data
42 | You can download the data of all six species from the `data.zip` archive in this repository. Please download it and put the extracted `data` folder in the same path as the `src` folder.
43 |
44 | ### Steps
45 | #### Step 1: decompress data files
46 | > unzip data.zip
47 |
48 | #### Step 2: run the model
49 | > cd src/Graph2GO
50 | > python main.py
51 | > **Note that several parameters can be tuned: --ppi_attributes, --simi_attributes, --species, --thr_ppi, --thr_evalue, etc. Please refer to the main.py file for a detailed description of all parameters.**
--------------------------------------------------------------------------------
/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/data.zip
--------------------------------------------------------------------------------
/model_down.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/model_down.PNG
--------------------------------------------------------------------------------
/model_up.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/model_up.PNG
--------------------------------------------------------------------------------
/results/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/results/.DS_Store
--------------------------------------------------------------------------------
/src/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/src/.DS_Store
--------------------------------------------------------------------------------
/src/Graph2GO/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/src/Graph2GO/.DS_Store
--------------------------------------------------------------------------------
/src/Graph2GO/evaluation.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
from collections import defaultdict 4 | from sklearn.metrics import roc_curve, auc, matthews_corrcoef 5 | from sklearn.metrics import roc_auc_score 6 | from sklearn.metrics import accuracy_score, f1_score 7 | from sklearn.metrics import precision_recall_curve 8 | from sklearn.metrics import average_precision_score 9 | 10 | def get_label_frequency(ontology): 11 | col_sums = ontology.sum(0) 12 | index_11_30 = np.where((col_sums>=11) & (col_sums<=30))[0] 13 | index_31_100 = np.where((col_sums>=31) & (col_sums<=100))[0] 14 | index_101_300 = np.where((col_sums>=101) & (col_sums<=300))[0] 15 | index_larger_300 = np.where(col_sums >= 301)[0] 16 | return index_11_30, index_31_100, index_101_300, index_larger_300 17 | 18 | def calculate_accuracy(y_test, y_score): 19 | y_score_max = np.argmax(y_score,axis=1) 20 | cnt = 0 21 | for i in range(y_score.shape[0]): 22 | if y_test[i, y_score_max[i]] == 1: 23 | cnt += 1 24 | 25 | return float(cnt)/y_score.shape[0] 26 | 27 | def calculate_fmax(preds, labels): 28 | preds = np.round(preds, 2) 29 | labels = labels.astype(np.int32) 30 | f_max = 0 31 | p_max = 0 32 | r_max = 0 33 | sp_max = 0 34 | t_max = 0 35 | for t in range(1, 100): 36 | threshold = t / 100.0 37 | predictions = (preds > threshold).astype(np.int32) 38 | tp = np.sum(predictions * labels) 39 | fp = np.sum(predictions) - tp 40 | fn = np.sum(labels) - tp 41 | sn = tp / (1.0 * np.sum(labels)) 42 | sp = np.sum((predictions ^ 1) * (labels ^ 1)) 43 | sp /= 1.0 * np.sum(labels ^ 1) 44 | fpr = 1 - sp 45 | precision = tp / (1.0 * (tp + fp)) 46 | recall = tp / (1.0 * (tp + fn)) 47 | f = 2 * precision * recall / (precision + recall) 48 | if f_max < f: 49 | f_max = f 50 | p_max = precision 51 | r_max = recall 52 | sp_max = sp 53 | t_max = threshold 54 | return f_max 55 | 56 | 57 | def plot_PRCurve(label, score): 58 | precision, recall, _ = precision_recall_curve(label.ravel(), score.ravel()) 59 | aupr = average_precision_score(label, score, average="micro") 60 | 61 | plt.figure() 62 | plt.step(recall, precision, color='b', alpha=0.2, 63 | where='post') 64 | plt.fill_between(recall, precision, alpha=0.2, color='b', 65 | **step_kwargs) 66 | 67 | plt.xlabel('Recall') 68 | plt.ylabel('Precision') 69 | plt.ylim([0.0, 1.05]) 70 | plt.xlim([0.0, 1.0]) 71 | plt.title('Average precision score, micro-averaged over all classes: AP={0:0.2f}'.format(aupr)) 72 | plt.show() 73 | 74 | def evaluate_performance(y_test, y_score): 75 | """Evaluate performance""" 76 | n_classes = y_test.shape[1] 77 | perf = dict() 78 | 79 | perf["M-aupr"] = 0.0 80 | n = 0 81 | aupr_list = [] 82 | num_pos_list = [] 83 | for i in range(n_classes): 84 | num_pos = sum(y_test[:, i]) 85 | if num_pos > 0: 86 | ap = average_precision_score(y_test[:, i], y_score[:, i]) 87 | n += 1 88 | perf["M-aupr"] += ap 89 | aupr_list.append(ap) 90 | num_pos_list.append(num_pos) 91 | perf["M-aupr"] /= n 92 | perf['aupr_list'] = aupr_list 93 | perf['num_pos_list'] = num_pos_list 94 | 95 | # Compute micro-averaged AUPR 96 | perf['m-aupr'] = average_precision_score(y_test.ravel(), y_score.ravel()) 97 | 98 | # Computes accuracy 99 | perf['accuracy'] = calculate_accuracy(y_test, y_score) 100 | 101 | # Computes F1-score 102 | alpha = 3 103 | y_new_pred = np.zeros_like(y_test) 104 | for i in range(y_test.shape[0]): 105 | top_alpha = np.argsort(y_score[i, :])[-alpha:] 106 | y_new_pred[i, top_alpha] = np.array(alpha*[1]) 107 | perf["F1-score"] = f1_score(y_test, y_new_pred, average='micro') 108 | 109 | perf['F-max'] = calculate_fmax(y_score, y_test) 110 | 111 | return perf 112 
| 113 | def get_results(ontology, Y_test, y_score): 114 | perf = defaultdict(dict) 115 | index_11_30, index_31_100, index_101_300, index_301 = get_label_frequency(ontology) 116 | 117 | perf['11-30'] = evaluate_performance(Y_test[:,index_11_30], y_score[:,index_11_30]) 118 | perf['31-100'] = evaluate_performance(Y_test[:,index_31_100], y_score[:,index_31_100]) 119 | perf['101-300'] = evaluate_performance(Y_test[:,index_101_300], y_score[:,index_101_300]) 120 | perf['301-'] = evaluate_performance(Y_test[:,index_301], y_score[:,index_301]) 121 | perf['all'] = evaluate_performance(Y_test, y_score) 122 | 123 | return perf 124 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /src/Graph2GO/gcnModel.py: -------------------------------------------------------------------------------- 1 | from layers import GraphConvolution, GraphConvolutionSparse, InnerProductDecoder 2 | import tensorflow as tf 3 | 4 | 5 | class Model(object): 6 | def __init__(self, **kwargs): 7 | allowed_kwargs = {'name', 'logging'} 8 | for kwarg in kwargs.keys(): 9 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 10 | 11 | for kwarg in kwargs.keys(): 12 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 13 | name = kwargs.get('name') 14 | if not name: 15 | name = self.__class__.__name__.lower() 16 | self.name = name 17 | 18 | logging = kwargs.get('logging', False) 19 | self.logging = logging 20 | 21 | self.vars = {} 22 | 23 | def _build(self): 24 | raise NotImplementedError 25 | 26 | def build(self): 27 | """ Wrapper for _build() """ 28 | with tf.variable_scope(self.name): 29 | self._build() 30 | variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name) 31 | self.vars = {var.name: var for var in variables} 32 | 33 | def fit(self): 34 | pass 35 | 36 | def predict(self): 37 | pass 38 | 39 | 40 | class GCNModelAE(Model): 41 | def __init__(self, placeholders, num_features, features_nonzero, hidden1, hidden2, **kwargs): 42 | super(GCNModelAE, self).__init__(**kwargs) 43 | 44 | self.inputs = placeholders['features'] 45 | self.input_dim = num_features 46 | self.features_nonzero = features_nonzero 47 | self.hidden1_dim = hidden1 48 | self.hidden2_dim = hidden2 49 | self.adj = placeholders['adj'] 50 | self.dropout = placeholders['dropout'] 51 | self.build() 52 | 53 | def _build(self): 54 | self.hidden1 = GraphConvolutionSparse(input_dim=self.input_dim, 55 | output_dim=self.hidden1_dim, 56 | adj=self.adj, 57 | features_nonzero=self.features_nonzero, 58 | act=tf.nn.relu, 59 | dropout=self.dropout, 60 | logging=self.logging)(self.inputs) 61 | 62 | self.embeddings = GraphConvolution(input_dim=self.hidden1_dim, 63 | output_dim=self.hidden2_dim, 64 | adj=self.adj, 65 | act=lambda x: x, 66 | dropout=self.dropout, 67 | logging=self.logging)(self.hidden1) 68 | 69 | self.z_mean = self.embeddings 70 | 71 | self.reconstructions = InnerProductDecoder(input_dim=self.hidden2_dim, 72 | act=lambda x: x, 73 | logging=self.logging)(self.embeddings) 74 | 75 | 76 | class GCNModelVAE(Model): 77 | def __init__(self, placeholders, num_features, num_nodes, features_nonzero, hidden1, hidden2, **kwargs): 78 | super(GCNModelVAE, self).__init__(**kwargs) 79 | 80 | self.inputs = placeholders['features'] 81 | self.input_dim = num_features 82 | self.features_nonzero = features_nonzero 83 | self.n_samples = num_nodes 84 | self.hidden1_dim = hidden1 85 | self.hidden2_dim = hidden2 86 | self.adj = placeholders['adj'] 87 | self.dropout = placeholders['dropout'] 
88 | self.build() 89 | 90 | def _build(self): 91 | self.hidden1 = GraphConvolutionSparse(input_dim=self.input_dim, 92 | output_dim=self.hidden1_dim, 93 | adj=self.adj, 94 | features_nonzero=self.features_nonzero, 95 | act=tf.nn.relu, 96 | dropout=self.dropout, 97 | logging=self.logging)(self.inputs) 98 | 99 | self.z_mean = GraphConvolution(input_dim=self.hidden1_dim, 100 | output_dim=self.hidden2_dim, 101 | adj=self.adj, 102 | act=lambda x: x, 103 | dropout=self.dropout, 104 | logging=self.logging)(self.hidden1) 105 | 106 | self.z_log_std = GraphConvolution(input_dim=self.hidden1_dim, 107 | output_dim=self.hidden2_dim, 108 | adj=self.adj, 109 | act=lambda x: x, 110 | dropout=self.dropout, 111 | logging=self.logging)(self.hidden1) 112 | 113 | self.z = self.z_mean + tf.random_normal([self.n_samples, self.hidden2_dim], dtype=tf.float64) * tf.exp(self.z_log_std) 114 | 115 | self.reconstructions = InnerProductDecoder(input_dim=self.hidden2_dim, 116 | act=lambda x: x, 117 | logging=self.logging)(self.z) 118 | -------------------------------------------------------------------------------- /src/Graph2GO/initializations.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | def weight_variable_glorot(input_dim, output_dim, name=""): 5 | """Create a weight variable with Glorot & Bengio (AISTATS 2010) 6 | initialization. 7 | """ 8 | init_range = np.sqrt(6.0 / (input_dim + output_dim)) 9 | initial = tf.random_uniform([input_dim, output_dim], minval=-init_range, 10 | maxval=init_range, dtype=tf.float64) 11 | return tf.Variable(initial, name=name) 12 | -------------------------------------------------------------------------------- /src/Graph2GO/input_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pickle as pkl 3 | import networkx as nx 4 | from networkx.readwrite import json_graph 5 | import scipy.sparse as sp 6 | from sklearn.preprocessing import scale,normalize 7 | from sklearn.feature_extraction.text import CountVectorizer 8 | import pandas as pd 9 | import json 10 | import os 11 | from tqdm import tqdm 12 | 13 | 14 | def load_ppi_network(filename, gene_num, thr): 15 | with open(filename) as f: 16 | data = f.readlines() 17 | adj = np.zeros((gene_num,gene_num)) 18 | for x in tqdm(data): 19 | temp = x.strip().split("\t") 20 | # check whether score larger than the threshold 21 | if float(temp[2]) >= thr: 22 | adj[int(temp[0]),int(temp[1])] = 1 23 | if (adj.T == adj).all(): 24 | pass 25 | else: 26 | adj = adj + adj.T 27 | 28 | return adj 29 | 30 | def load_simi_network(filename, gene_num, thr): 31 | with open(filename) as f: 32 | data = f.readlines() 33 | adj = np.zeros((gene_num,gene_num)) 34 | for x in tqdm(data): 35 | temp = x.strip().split("\t") 36 | # check whether evalue smaller than the threshold 37 | if float(temp[2]) <= thr: 38 | adj[int(temp[0]),int(temp[1])] = 1 39 | if (adj.T == adj).all(): 40 | pass 41 | else: 42 | adj = adj + adj.T 43 | 44 | return adj 45 | 46 | def load_labels(uniprot): 47 | print('loading labels...') 48 | # load labels (GO) 49 | cc = uniprot['cc_label'].values 50 | cc = np.hstack(cc).reshape((len(cc),len(cc[0]))) 51 | 52 | bp = uniprot['bp_label'].values 53 | bp = np.hstack(bp).reshape((len(bp),len(bp[0]))) 54 | 55 | mf = uniprot['mf_label'].values 56 | mf = np.hstack(mf).reshape((len(mf),len(mf[0]))) 57 | 58 | return cc,mf,bp 59 | 60 | 61 | def load_data(graph_type, uniprot, args): 62 | 63 | print('loading 
data...') 64 | 65 | def reshape(features): 66 | return np.hstack(features).reshape((len(features),len(features[0]))) 67 | 68 | # get feature representations 69 | features_seq = scale(reshape(uniprot['CT'].values)) 70 | features_loc = reshape(uniprot['Sub_cell_loc_encoding'].values) 71 | features_domain = reshape(uniprot['Pro_domain_encoding'].values) 72 | 73 | print('generating features...') 74 | 75 | if graph_type == "ppi": 76 | attribute = args.ppi_attributes 77 | elif graph_type == "sequence_similarity": 78 | attribute = args.simi_attributes 79 | 80 | if attribute == 0: 81 | features = np.identity(uniprot.shape[0]) 82 | print("Without features") 83 | elif attribute == 1: 84 | features = features_seq 85 | print("Only use sequence feature") 86 | elif attribute == 2: 87 | features = features_loc 88 | print("Only use location feature") 89 | elif attribute == 3: 90 | features = features_domain 91 | print("Only use domain feature") 92 | elif attribute == 5: 93 | features = np.concatenate((features_loc,features_domain),axis=1) 94 | print("use location and domain features") 95 | elif attribute == 6: 96 | features = np.concatenate((features_seq, features_loc,features_domain),axis=1) 97 | print("Use all the features") 98 | elif attribute == 7: 99 | features = np.concatenate((features_seq, features_loc,features_domain, np.identity(uniprot.shape[0])),axis=1) 100 | print("Use all features plus identity") 101 | 102 | features = sp.csr_matrix(features) 103 | 104 | print('loading graph...') 105 | if graph_type == "sequence_similarity": 106 | filename = os.path.join(args.data_path, args.species, "networks/sequence_similarity.txt") 107 | adj = load_simi_network(filename, uniprot.shape[0], args.thr_evalue) 108 | elif graph_type == "ppi": 109 | filename = os.path.join(args.data_path, args.species, "networks/ppi.txt") 110 | adj = load_ppi_network(filename, uniprot.shape[0], args.thr_ppi) 111 | 112 | adj = sp.csr_matrix(adj) 113 | 114 | 115 | return adj, features 116 | -------------------------------------------------------------------------------- /src/Graph2GO/layers.py: -------------------------------------------------------------------------------- 1 | from initializations import * 2 | import tensorflow as tf 3 | 4 | # global unique layer ID dictionary for layer name assignment 5 | _LAYER_UIDS = {} 6 | 7 | 8 | def get_layer_uid(layer_name=''): 9 | """Helper function, assigns unique layer IDs 10 | """ 11 | if layer_name not in _LAYER_UIDS: 12 | _LAYER_UIDS[layer_name] = 1 13 | return 1 14 | else: 15 | _LAYER_UIDS[layer_name] += 1 16 | return _LAYER_UIDS[layer_name] 17 | 18 | 19 | def dropout_sparse(x, keep_prob, num_nonzero_elems): 20 | """Dropout for sparse tensors. Currently fails for very large sparse tensors (>1M elements) 21 | """ 22 | noise_shape = [num_nonzero_elems] 23 | random_tensor = keep_prob 24 | random_tensor += tf.random_uniform(noise_shape) 25 | dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool) 26 | pre_out = tf.sparse_retain(x, dropout_mask) 27 | #return pre_out * (1./keep_prob) 28 | return x 29 | 30 | 31 | class Layer(object): 32 | """Base layer class. Defines basic API for all layer objects. 33 | 34 | # Properties 35 | name: String, defines the variable scope of the layer. 36 | 37 | # Methods 38 | _call(inputs): Defines computation graph of layer 39 | (i.e. 
takes input, returns output) 40 | __call__(inputs): Wrapper for _call() 41 | """ 42 | def __init__(self, **kwargs): 43 | allowed_kwargs = {'name', 'logging'} 44 | for kwarg in kwargs.keys(): 45 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 46 | name = kwargs.get('name') 47 | if not name: 48 | layer = self.__class__.__name__.lower() 49 | name = layer + '_' + str(get_layer_uid(layer)) 50 | self.name = name 51 | self.vars = {} 52 | logging = kwargs.get('logging', False) 53 | self.logging = logging 54 | self.issparse = False 55 | 56 | def _call(self, inputs): 57 | return inputs 58 | 59 | def __call__(self, inputs): 60 | with tf.name_scope(self.name): 61 | outputs = self._call(inputs) 62 | return outputs 63 | 64 | 65 | class GraphConvolution(Layer): 66 | """Basic graph convolution layer for undirected graph without edge labels.""" 67 | def __init__(self, input_dim, output_dim, adj, dropout=0., act=tf.nn.relu, **kwargs): 68 | super(GraphConvolution, self).__init__(**kwargs) 69 | with tf.variable_scope(self.name + '_vars'): 70 | self.vars['weights'] = weight_variable_glorot(input_dim, output_dim, name="weights") 71 | self.dropout = dropout 72 | self.adj = adj 73 | self.act = act 74 | 75 | def _call(self, inputs): 76 | x = inputs 77 | #x = tf.nn.dropout(x, 1-self.dropout) 78 | x = tf.matmul(x, self.vars['weights']) 79 | x = tf.sparse_tensor_dense_matmul(self.adj, x) 80 | outputs = self.act(x) 81 | return outputs 82 | 83 | 84 | class GraphConvolutionSparse(Layer): 85 | """Graph convolution layer for sparse inputs.""" 86 | def __init__(self, input_dim, output_dim, adj, features_nonzero, dropout=0., act=tf.nn.relu, **kwargs): 87 | super(GraphConvolutionSparse, self).__init__(**kwargs) 88 | with tf.variable_scope(self.name + '_vars'): 89 | self.vars['weights'] = weight_variable_glorot(input_dim, output_dim, name="weights") 90 | self.dropout = dropout 91 | self.adj = adj 92 | self.act = act 93 | self.issparse = True 94 | self.features_nonzero = features_nonzero 95 | 96 | def _call(self, inputs): 97 | x = inputs 98 | x = dropout_sparse(x, 1-self.dropout, self.features_nonzero) 99 | x = tf.sparse_tensor_dense_matmul(x, self.vars['weights']) 100 | x = tf.sparse_tensor_dense_matmul(self.adj, x) 101 | outputs = self.act(x) 102 | return outputs 103 | 104 | 105 | class InnerProductDecoder(Layer): 106 | """Decoder model layer for link prediction.""" 107 | def __init__(self, input_dim, dropout=0., act=tf.nn.sigmoid, **kwargs): 108 | super(InnerProductDecoder, self).__init__(**kwargs) 109 | self.dropout = dropout 110 | self.act = act 111 | 112 | def _call(self, inputs): 113 | inputs = tf.nn.dropout(inputs, 1-self.dropout) 114 | x = tf.transpose(inputs) 115 | x = tf.matmul(inputs, x) 116 | x = tf.reshape(x, [-1]) 117 | outputs = self.act(x) 118 | return outputs 119 | -------------------------------------------------------------------------------- /src/Graph2GO/main.py: -------------------------------------------------------------------------------- 1 | from input_data import load_data,load_labels 2 | import numpy as np 3 | import pandas as pd 4 | import sys,getopt 5 | import argparse 6 | import json 7 | import os 8 | from collections import defaultdict 9 | 10 | from trainGcn import train_gcn 11 | from trainNN import train_nn 12 | from trainSVM import train_svm 13 | 14 | from evaluation import get_results 15 | 16 | def reshape(features): 17 | return np.hstack(features).reshape((len(features),len(features[0]))) 18 | 19 | 20 | def train(args): 21 | # load feature dataframe 22 | 
print("loading features...") 23 | uniprot = pd.read_pickle(os.path.join(args.data_path, args.species, "features.pkl")) 24 | 25 | embeddings_list = [] 26 | for graph in args.graphs: 27 | print("#############################") 28 | print("Training",graph) 29 | adj, features = load_data(graph, uniprot, args) 30 | embeddings = train_gcn(features, adj, args, graph) 31 | embeddings_list.append(embeddings) 32 | 33 | if graph == "combined": 34 | attr = args.ppi_attributes 35 | elif graph == "similarity": 36 | attr = args.simi_attributes 37 | 38 | embeddings = np.hstack(embeddings_list) 39 | 40 | if args.only_gcn == 1: 41 | return 42 | 43 | #load labels 44 | cc,mf,bp = load_labels(uniprot) 45 | 46 | np.random.seed(5959) 47 | 48 | # split data into train and test 49 | num_test = int(np.floor(cc.shape[0] / 5.)) 50 | num_train = cc.shape[0] - num_test 51 | all_idx = list(range(cc.shape[0])) 52 | np.random.shuffle(all_idx) 53 | 54 | train_idx = all_idx[:num_train] 55 | test_idx = all_idx[num_train:(num_train + num_test)] 56 | 57 | Y_train_cc = cc[train_idx] 58 | Y_train_bp = bp[train_idx] 59 | Y_train_mf = mf[train_idx] 60 | 61 | Y_test_cc = cc[test_idx] 62 | Y_test_bp = bp[test_idx] 63 | Y_test_mf = mf[test_idx] 64 | 65 | X_train = embeddings[train_idx] 66 | X_test = embeddings[test_idx] 67 | 68 | print("Start running supervised model...") 69 | rand_str = np.random.randint(1000000) 70 | save_path = os.path.join(args.data_path, args.species, "results_new/results_graph2go_" + args.supervised + "_" + ";".join(args.graphs) + "_" + str(args.ppi_attributes) + "_" + str(args.simi_attributes) + "_" + str(args.thr_combined) + "_" + str(args.thr_evalue)) 71 | 72 | print("###################################") 73 | print('----------------------------------') 74 | print('CC') 75 | 76 | if args.supervised == "svm": 77 | y_score_cc = train_svm(X_train,Y_train_cc,X_test,Y_test_cc) 78 | elif args.supervised == "nn": 79 | y_score_cc = train_nn(X_train,Y_train_cc,X_test,Y_test_cc) 80 | 81 | perf_cc = get_results(cc, Y_test_cc, y_score_cc) 82 | if args.save_results: 83 | with open(save_path + "_cc.json", "w") as f: 84 | json.dump(perf_cc, f) 85 | 86 | 87 | print('----------------------------------') 88 | print('MF') 89 | 90 | if args.supervised == "svm": 91 | y_score_mf = train_svm(X_train,Y_train_mf,X_test,Y_test_mf) 92 | elif args.supervised == "nn": 93 | y_score_mf = train_nn(X_train,Y_train_mf,X_test,Y_test_mf) 94 | 95 | perf_mf = get_results(mf, Y_test_mf, y_score_mf) 96 | if args.save_results: 97 | with open(save_path + "_mf.json","w") as f: 98 | json.dump(perf_mf, f) 99 | 100 | 101 | print('----------------------------------') 102 | print('BP') 103 | 104 | if args.supervised == "svm": 105 | y_score_bp = train_svm(X_train,Y_train_bp,X_test,Y_test_bp) 106 | elif args.supervised == "nn": 107 | y_score_bp = train_nn(X_train,Y_train_bp,X_test,Y_test_bp) 108 | 109 | perf_bp = get_results(bp, Y_test_bp, y_score_bp) 110 | if args.save_results: 111 | with open(save_path + "_bp.json","w") as f: 112 | json.dump(perf_bp, f) 113 | 114 | 115 | 116 | if __name__ == "__main__": 117 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 118 | #global parameters 119 | parser.add_argument('--ppi_attributes', type=int, default=6, help="types of attributes used by ppi.") 120 | parser.add_argument('--simi_attributes', type=int, default=5, help="types of attributes used by simi.") 121 | parser.add_argument('--graphs', type=lambda s:[item for item in s.split(",")], default=['combined','similarity'], 
help="lists of graphs to use.") 122 | parser.add_argument('--species', type=str, default="human", help="which species to use.") 123 | parser.add_argument('--data_path', type=str, default="../../data/", help="path storing data.") 124 | parser.add_argument('--thr_combined', type=float, default=0.3, help="threshold for combiend ppi network.") 125 | parser.add_argument('--thr_evalue', type=float, default=1e-4, help="threshold for similarity network.") 126 | parser.add_argument('--supervised', type=str, default="nn", help="neural networks or svm") 127 | parser.add_argument('--only_gcn', type=int, default=0, help="0 for training all, 1 for only embeddings.") 128 | parser.add_argument('--save_results', type=int, default=1, help="whether to save the performance results") 129 | 130 | #parameters for traing GCN 131 | parser.add_argument('--lr', type=float, default=0.001, help="Initial learning rate.") 132 | parser.add_argument('--epochs_ppi', type=int, default=80, help="Number of epochs to train ppi.") 133 | parser.add_argument('--epochs_simi', type=int, default=60, help="Number of epochs to train similarity network.") 134 | parser.add_argument('--hidden1', type=int, default=800, help="Number of units in hidden layer 1.") 135 | parser.add_argument('--hidden2', type=int, default=400, help="Number of units in hidden layer 2.") 136 | parser.add_argument('--weight_decay', type=float, default=0, help="Weight for L2 loss on embedding matrix.") 137 | parser.add_argument('--dropout', type=float, default=0, help="Dropout rate (1 - keep probability).") 138 | parser.add_argument('--model', type=str, default="gcn_vae", help="Model string.") 139 | 140 | args = parser.parse_args() 141 | print(args) 142 | 143 | train(args) 144 | -------------------------------------------------------------------------------- /src/Graph2GO/optimizer.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | class OptimizerAE(object): 4 | def __init__(self, preds, labels, pos_weight, norm, lr): 5 | 6 | self.cost = norm * tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=preds, targets=labels, pos_weight=pos_weight)) 7 | self.optimizer = tf.train.AdamOptimizer(learning_rate=lr) # Adam Optimizer 8 | 9 | self.opt_op = self.optimizer.minimize(self.cost) 10 | self.grads_vars = self.optimizer.compute_gradients(self.cost) 11 | 12 | self.correct_prediction = tf.equal(tf.cast(tf.greater_equal(preds, 0.0), tf.int32), 13 | tf.cast(tf.greater_equal(labels, 0.0),tf.int32)) 14 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32)) 15 | 16 | 17 | class OptimizerVAE(object): 18 | def __init__(self, preds, labels, model, num_nodes, pos_weight, norm, lr): 19 | 20 | self.cost = norm * tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=preds, targets=labels, pos_weight=pos_weight)) 21 | self.optimizer = tf.train.AdamOptimizer(learning_rate=lr) # Adam Optimizer 22 | 23 | # Latent loss 24 | self.log_lik = self.cost 25 | self.kl = (0.5 / num_nodes) * tf.reduce_mean(tf.reduce_sum(1 + 2 * model.z_log_std - tf.square(model.z_mean) - 26 | tf.square(tf.exp(model.z_log_std)), 1)) 27 | self.cost -= self.kl 28 | 29 | self.opt_op = self.optimizer.minimize(self.cost) 30 | self.grads_vars = self.optimizer.compute_gradients(self.cost) 31 | 32 | 33 | self.correct_prediction = tf.equal(tf.cast(tf.greater_equal(preds, 0.0), tf.int32), 34 | tf.cast(tf.greater_equal(labels, 0.0), tf.int32)) 35 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, 
tf.float32)) 36 | -------------------------------------------------------------------------------- /src/Graph2GO/preprocessing.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.sparse as sp 3 | 4 | 5 | def sparse_to_tuple(sparse_mx): 6 | if not sp.isspmatrix_coo(sparse_mx): 7 | sparse_mx = sparse_mx.tocoo() 8 | coords = np.vstack((sparse_mx.row, sparse_mx.col)).transpose() 9 | values = sparse_mx.data 10 | shape = sparse_mx.shape 11 | return coords, values, shape 12 | 13 | 14 | def preprocess_graph(adj): 15 | adj = sp.coo_matrix(adj) 16 | adj_ = adj + sp.eye(adj.shape[0]) 17 | rowsum = np.array(adj_.sum(1)) 18 | degree_mat_inv_sqrt = sp.diags(np.power(rowsum, -0.5).flatten()) 19 | adj_normalized = adj_.dot(degree_mat_inv_sqrt).transpose().dot(degree_mat_inv_sqrt).tocoo() 20 | return sparse_to_tuple(adj_normalized) 21 | 22 | 23 | def construct_feed_dict(adj_normalized, adj, features, placeholders): 24 | # construct feed dictionary 25 | feed_dict = dict() 26 | feed_dict.update({placeholders['features']: features}) 27 | feed_dict.update({placeholders['adj']: adj_normalized}) 28 | feed_dict.update({placeholders['adj_orig']: adj}) 29 | return feed_dict 30 | -------------------------------------------------------------------------------- /src/Graph2GO/trainGcn.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import time 5 | import os 6 | import tensorflow as tf 7 | import numpy as np 8 | import scipy.sparse as sp 9 | from sklearn.metrics import average_precision_score 10 | from optimizer import OptimizerAE, OptimizerVAE 11 | from gcnModel import GCNModelAE, GCNModelVAE 12 | from preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple 13 | import argparse 14 | 15 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 16 | 17 | 18 | def train_gcn(features, adj_train, args, graph_type): 19 | model_str = args.model 20 | 21 | # Store original adjacency matrix (without diagonal entries) for later 22 | adj_orig = adj_train 23 | adj_orig = adj_orig - sp.dia_matrix((adj_orig.diagonal()[np.newaxis, :], [0]), shape=adj_orig.shape) 24 | adj_orig.eliminate_zeros() 25 | 26 | adj = adj_train 27 | 28 | # Some preprocessing 29 | adj_norm = preprocess_graph(adj) 30 | 31 | # Define placeholders 32 | placeholders = { 33 | 'features': tf.sparse_placeholder(tf.float64), 34 | 'adj': tf.sparse_placeholder(tf.float64), 35 | 'adj_orig': tf.sparse_placeholder(tf.float64), 36 | 'dropout': tf.placeholder_with_default(0., shape=()) 37 | } 38 | 39 | num_nodes = adj.shape[0] 40 | features = sparse_to_tuple(features.tocoo()) 41 | num_features = features[2][1] 42 | features_nonzero = features[1].shape[0] 43 | 44 | # Create model 45 | model = None 46 | if model_str == 'gcn_ae': 47 | model = GCNModelAE(placeholders, num_features, features_nonzero, args.hidden1, args.hidden2) 48 | elif model_str == 'gcn_vae': 49 | model = GCNModelVAE(placeholders, num_features, num_nodes, features_nonzero, args.hidden1, args.hidden2) 50 | 51 | # Optimizer 52 | with tf.name_scope('optimizer'): 53 | if model_str == 'gcn_ae': 54 | opt = OptimizerAE(preds=model.reconstructions, 55 | labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'], 56 | validate_indices=False), [-1]), 57 | pos_weight=1, 58 | norm=1, 59 | lr=args.lr) 60 | elif model_str == 'gcn_vae': 61 | opt = OptimizerVAE(preds=model.reconstructions, 62 | 
labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'], 63 | validate_indices=False), [-1]), 64 | model=model, num_nodes=num_nodes, 65 | pos_weight=1, 66 | norm=1, 67 | lr=args.lr) 68 | 69 | # Initialize session 70 | sess = tf.Session() 71 | sess.run(tf.global_variables_initializer()) 72 | 73 | 74 | adj_label = adj_train + sp.eye(adj_train.shape[0]) 75 | adj_label = sparse_to_tuple(adj_label) 76 | 77 | 78 | # Train model 79 | # use different epochs for ppi and similarity network 80 | if graph_type == "sequence_similarity": 81 | epochs = args.epochs_simi 82 | else: 83 | epochs = args.epochs_ppi 84 | 85 | for epoch in range(epochs): 86 | 87 | t = time.time() 88 | # Construct feed dictionary 89 | feed_dict = construct_feed_dict(adj_norm, adj_label, features, placeholders) 90 | feed_dict.update({placeholders['dropout']: args.dropout}) 91 | # Run single weight update 92 | outs = sess.run([opt.opt_op, opt.cost], feed_dict=feed_dict) 93 | 94 | if epoch % 10 == 0: 95 | print("Epoch:", '%04d' % (epoch+1), "train_loss=", "{:.5f}".format(outs[1])) 96 | 97 | 98 | print("Optimization Finished!") 99 | 100 | 101 | #return embedding for each protein 102 | emb = sess.run(model.z_mean,feed_dict=feed_dict) 103 | 104 | return emb 105 | 106 | -------------------------------------------------------------------------------- /src/Graph2GO/trainNN.py: -------------------------------------------------------------------------------- 1 | from keras.models import Sequential 2 | from keras.layers import Dense,Activation,Dropout 3 | from keras.callbacks import EarlyStopping 4 | from keras.layers.advanced_activations import LeakyReLU 5 | from keras.wrappers.scikit_learn import KerasClassifier 6 | from keras.layers.normalization import BatchNormalization 7 | import numpy as np 8 | 9 | 10 | def train_nn(X_train, Y_train, X_test, Y_test): 11 | model = Sequential() 12 | model.add(Dense(1024, input_dim=X_train.shape[1])) 13 | model.add(BatchNormalization()) 14 | model.add(LeakyReLU()) 15 | model.add(Dropout(0.3)) 16 | 17 | model.add(Dense(512)) 18 | model.add(BatchNormalization()) 19 | model.add(LeakyReLU()) 20 | model.add(Dropout(0.3)) 21 | 22 | model.add(Dense(256)) 23 | model.add(BatchNormalization()) 24 | model.add(LeakyReLU()) 25 | model.add(Dropout(0.3)) 26 | 27 | model.add(Dense(Y_train.shape[1],activation='sigmoid')) 28 | 29 | model.compile(loss='binary_crossentropy', 30 | optimizer='adam', 31 | metrics=['accuracy']) 32 | model.fit(X_train, Y_train, epochs=100, batch_size=128, verbose=0) 33 | 34 | y_prob = model.predict(X_test) 35 | return y_prob 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /src/preprocessing/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/src/preprocessing/.DS_Store -------------------------------------------------------------------------------- /src/preprocessing/go_anchestor.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | import pandas as pd 3 | from xml.etree import ElementTree as ET 4 | 5 | def get_gene_ontology(filename='../../data/go-basic.obo'): 6 | # Reading Gene Ontology from OBO Formatted file 7 | go = dict() 8 | obj = None 9 | with open(filename, 'r') as f: 10 | for line in f: 11 | line = line.strip() 12 | if not line: 13 | continue 14 | if line == '[Term]': 15 | if obj is not None: 16 
| go[obj['id']] = obj 17 | obj = dict() 18 | obj['is_a'] = list() 19 | obj['part_of'] = list() 20 | obj['regulates'] = list() 21 | obj['is_obsolete'] = False 22 | continue 23 | elif line == '[Typedef]': 24 | obj = None 25 | else: 26 | if obj is None: 27 | continue 28 | l = line.split(": ") 29 | if l[0] == 'id': 30 | obj['id'] = l[1] 31 | elif l[0] == 'is_a': 32 | obj['is_a'].append(l[1].split(' ! ')[0]) 33 | elif l[0] == 'name': 34 | obj['name'] = l[1] 35 | elif l[0] == 'is_obsolete' and l[1] == 'true': 36 | obj['is_obsolete'] = True 37 | if obj is not None: 38 | go[obj['id']] = obj 39 | for go_id in go.keys(): 40 | if go[go_id]['is_obsolete']: 41 | del go[go_id] 42 | for go_id, val in go.iteritems(): 43 | if 'children' not in val: 44 | val['children'] = set() 45 | for p_id in val['is_a']: 46 | if p_id in go: 47 | if 'children' not in go[p_id]: 48 | go[p_id]['children'] = set() 49 | go[p_id]['children'].add(go_id) 50 | return go 51 | 52 | 53 | def get_anchestors(go, go_id): 54 | go_set = set() 55 | q = deque() 56 | q.append(go_id) 57 | while(len(q) > 0): 58 | g_id = q.popleft() 59 | go_set.add(g_id) 60 | if g_id not in go: 61 | #print g_id 62 | continue 63 | for parent_id in go[g_id]['is_a']: 64 | if parent_id in go: 65 | q.append(parent_id) 66 | return go_set 67 | 68 | 69 | def get_parents(go, go_id): 70 | go_set = set() 71 | for parent_id in go[go_id]['is_a']: 72 | if parent_id in go: 73 | go_set.add(parent_id) 74 | return go_set 75 | 76 | 77 | def get_go_set(go, go_id): 78 | go_set = set() 79 | q = deque() 80 | q.append(go_id) 81 | while len(q) > 0: 82 | g_id = q.popleft() 83 | go_set.add(g_id) 84 | for ch_id in go[g_id]['children']: 85 | q.append(ch_id) 86 | return go_set 87 | -------------------------------------------------------------------------------- /src/preprocessing/preprocessing.py: -------------------------------------------------------------------------------- 1 | # This file is used to preprocess uniprot and STRING file to get input for Graph2GO model 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import json 6 | from networkx.readwrite import json_graph 7 | import networkx as nx 8 | import re 9 | from collections import defaultdict 10 | from scipy import sparse 11 | import argparse 12 | from tqdm import tqdm 13 | import os 14 | 15 | from go_anchestor import get_gene_ontology,get_anchestors 16 | 17 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 18 | parser.add_argument('--data_path', type=str, default="../../data/", help="path storing data.") 19 | parser.add_argument('--species', type=str, default="human", help="which species to use.") 20 | args = parser.parse_args() 21 | 22 | 23 | ########################################## 24 | ########## process uniprot ############### 25 | 26 | print("Start processing uniprot...") 27 | 28 | #### load file 29 | print("Loading data...") 30 | uniprot_file = os.path.join(args.data_path, args.species, "uniprot-" + args.species + ".tab") 31 | uniprot = pd.read_table(uniprot_file) 32 | print(uniprot.shape) 33 | 34 | 35 | #### filtering 36 | print("filtering...") 37 | # filter by STRING ID occurence 38 | uniprot = uniprot[~uniprot['Cross-reference (STRING)'].isna()] 39 | uniprot.index = range(uniprot.shape[0]) 40 | uniprot['Cross-reference (STRING)'] = uniprot['Cross-reference (STRING)'].apply(lambda x:x[:-1]) 41 | 42 | # filter by sequence length in order to compare with DeepGO 43 | uniprot['Length'] = uniprot['Sequence'].apply(len) 44 | uniprot = uniprot[ uniprot['Length'] <= 1000 ] 45 | 46 
| # filter by ambiguous amino acid 47 | def find_amino_acid(x): 48 | return ('B' in x) | ('O' in x) | ('J' in x) | ('U' in x) | ('X' in x) | ('Z' in x) 49 | 50 | ambiguous_index = uniprot.loc[uniprot['Sequence'].apply(find_amino_acid)].index 51 | uniprot.drop(ambiguous_index, axis=0, inplace=True) 52 | uniprot.index = range(len(uniprot)) 53 | print("after filtering:", uniprot.shape) 54 | 55 | 56 | 57 | #### obtain GO annotations 58 | print("obtain GO annotations...") 59 | uniprot['Gene ontology (biological process)'][uniprot['Gene ontology (biological process)'].isna()] = '' 60 | uniprot['Gene ontology (cellular component)'][uniprot['Gene ontology (cellular component)'].isna()] = '' 61 | uniprot['Gene ontology (molecular function)'][uniprot['Gene ontology (molecular function)'].isna()] = '' 62 | 63 | def get_GO(x): 64 | pattern = re.compile(r"GO:\d+") 65 | return pattern.findall(x) 66 | 67 | uniprot['cc'] = uniprot['Gene ontology (cellular component)'].apply(get_GO) 68 | uniprot['bp'] = uniprot['Gene ontology (biological process)'].apply(get_GO) 69 | uniprot['mf'] = uniprot['Gene ontology (molecular function)'].apply(get_GO) 70 | 71 | num_cc_before = sum(len(x) for x in uniprot['cc']) 72 | num_mf_before = sum(len(x) for x in uniprot['mf']) 73 | num_bp_before = sum(len(x) for x in uniprot['bp']) 74 | print "number of CCs, before enrich", num_cc_before 75 | print "number of MFs, before enrich", num_mf_before 76 | print "number of BPs, before enrich", num_bp_before 77 | 78 | 79 | 80 | print("start enriching go annotations...") 81 | # enrich go terms using ancestors 82 | go = get_gene_ontology(os.path.join(args.data_path, "go-basic.obo")) 83 | BIOLOGICAL_PROCESS = 'GO:0008150' 84 | MOLECULAR_FUNCTION = 'GO:0003674' 85 | CELLULAR_COMPONENT = 'GO:0005575' 86 | 87 | new_cc = [] 88 | new_mf = [] 89 | new_bp = [] 90 | 91 | for i, row in uniprot.iterrows(): 92 | labels = row['cc'] 93 | temp = set([]) 94 | for x in labels: 95 | temp = temp | get_anchestors(go, x) 96 | temp.discard(CELLULAR_COMPONENT) 97 | new_cc.append(list(temp)) 98 | 99 | labels = row['mf'] 100 | temp = set([]) 101 | for x in labels: 102 | temp = temp | get_anchestors(go, x) 103 | temp.discard(MOLECULAR_FUNCTION) 104 | new_mf.append(list(temp)) 105 | 106 | labels = row['bp'] 107 | temp = set([]) 108 | for x in labels: 109 | temp = temp | get_anchestors(go, x) 110 | temp.discard(BIOLOGICAL_PROCESS) 111 | new_bp.append(list(temp)) 112 | 113 | uniprot['cc'] = new_cc 114 | uniprot['mf'] = new_mf 115 | uniprot['bp'] = new_bp 116 | 117 | num_cc_after = sum(len(x) for x in uniprot['cc']) 118 | num_mf_after = sum(len(x) for x in uniprot['mf']) 119 | num_bp_after = sum(len(x) for x in uniprot['bp']) 120 | print "number of CCs, after enrich", num_cc_after 121 | print "number of MFs, after enrich", num_mf_after 122 | print "number of BPs, after enrich", num_bp_after 123 | 124 | 125 | 126 | #### filter GO terms by the number of occurence 127 | print("filter GO terms by the number of occurence...") 128 | # filter GO by the number of occurence 129 | mf_items = [item for sublist in uniprot['mf'] for item in sublist] 130 | mf_unique_elements, mf_counts_elements = np.unique(mf_items, return_counts=True) 131 | bp_items = [item for sublist in uniprot['bp'] for item in sublist] 132 | bp_unique_elements, bp_counts_elements = np.unique(bp_items, return_counts=True) 133 | cc_items = [item for sublist in uniprot['cc'] for item in sublist] 134 | cc_unique_elements, cc_counts_elements = np.unique(cc_items, return_counts=True) 135 | 136 | mf_list = 
mf_unique_elements[np.where(mf_counts_elements >= 10)] 137 | cc_list = cc_unique_elements[np.where(cc_counts_elements >= 10)] 138 | bp_list = bp_unique_elements[np.where(bp_counts_elements >= 10)] 139 | 140 | temp_mf = uniprot['mf'].apply(lambda x: list(set(x) & set(mf_list))) 141 | uniprot['filter_mf'] = temp_mf 142 | temp_cc = uniprot['cc'].apply(lambda x: list(set(x) & set(cc_list))) 143 | uniprot['filter_cc'] = temp_cc 144 | temp_bp = uniprot['bp'].apply(lambda x: list(set(x) & set(bp_list))) 145 | uniprot['filter_bp'] = temp_bp 146 | 147 | # write out filtered ontology lists 148 | def write_go_list(ontology,ll): 149 | filename = os.path.join(args.data_path, args.species, ontology+"_list.txt") 150 | with open(filename,'w') as f: 151 | for x in ll: 152 | f.write(x + '\n') 153 | print("writing go term list...") 154 | write_go_list('cc',cc_list) 155 | write_go_list('mf',mf_list) 156 | write_go_list('bp',bp_list) 157 | 158 | 159 | 160 | #### encode GO terms 161 | print("encoding GO terms...") 162 | mf_dict = dict(zip(list(mf_list),range(len(mf_list)))) 163 | cc_dict = dict(zip(list(cc_list),range(len(cc_list)))) 164 | bp_dict = dict(zip(list(bp_list),range(len(bp_list)))) 165 | mf_encoding = [[0]*len(mf_dict) for i in range(len(uniprot))] 166 | cc_encoding = [[0]*len(cc_dict) for i in range(len(uniprot))] 167 | bp_encoding = [[0]*len(bp_dict) for i in range(len(uniprot))] 168 | 169 | for i,row in uniprot.iterrows(): 170 | for x in row['filter_mf']: 171 | mf_encoding[i][ mf_dict[x] ] = 1 172 | for x in row['filter_cc']: 173 | cc_encoding[i][ cc_dict[x] ] = 1 174 | for x in row['filter_bp']: 175 | bp_encoding[i][ bp_dict[x] ] = 1 176 | 177 | uniprot['cc_label'] = cc_encoding 178 | uniprot['mf_label'] = mf_encoding 179 | uniprot['bp_label'] = bp_encoding 180 | 181 | uniprot.drop(columns=['mf','cc','bp','Gene ontology (biological process)', 182 | 'Gene ontology (cellular component)', 183 | 'Gene ontology (molecular function)'],inplace=True) 184 | 185 | 186 | 187 | #### encode amino acid sequence using CT 188 | print("encode amino acid sequence using CT...") 189 | def CT(sequence): 190 | classMap = {'G':'1','A':'1','V':'1','L':'2','I':'2','F':'2','P':'2', 191 | 'Y':'3','M':'3','T':'3','S':'3','H':'4','N':'4','Q':'4','W':'4', 192 | 'R':'5','K':'5','D':'6','E':'6','C':'7'} 193 | 194 | seq = ''.join([classMap[x] for x in sequence]) 195 | length = len(seq) 196 | coding = np.zeros(343,dtype=np.int) 197 | for i in range(length-2): 198 | index = int(seq[i]) + (int(seq[i+1])-1)*7 + (int(seq[i+2])-1)*49 - 1 199 | coding[index] = coding[index] + 1 200 | return coding 201 | 202 | CT_list = [] 203 | for seq in uniprot['Sequence'].values: 204 | CT_list.append(CT(seq)) 205 | uniprot['CT'] = CT_list 206 | 207 | 208 | #### encode subcellular location 209 | print("encode subcellular location...") 210 | 211 | def process_sub_loc(x): 212 | if str(x) == 'nan': 213 | return [] 214 | x = x[22:-1] 215 | # check if exists "Note=" 216 | pos = x.find("Note=") 217 | if pos != -1: 218 | x = x[:(pos-2)] 219 | temp = [t.strip() for t in x.split(".")] 220 | temp = [t.split(";")[0] for t in temp] 221 | temp = [t.split("{")[0].strip() for t in temp] 222 | temp = [x for x in temp if '}' not in x and x != ''] 223 | return temp 224 | 225 | uniprot['Sub_cell_loc'] = uniprot['Subcellular location [CC]'].apply(process_sub_loc) 226 | items = [item for sublist in uniprot['Sub_cell_loc'] for item in sublist] 227 | items = np.unique(items) 228 | sub_mapping = dict(zip(list(items),range(len(items)))) 229 | sub_encoding = [[0]*len(items) 
for i in range(len(uniprot))] 230 | for i,row in uniprot.iterrows(): 231 | for loc in row['Sub_cell_loc']: 232 | sub_encoding[i][ sub_mapping[loc] ] = 1 233 | uniprot['Sub_cell_loc_encoding'] = sub_encoding 234 | uniprot.drop(['Subcellular location [CC]'],axis=1,inplace=True) 235 | 236 | 237 | #### encode protein domains 238 | print("encode protein domains...") 239 | 240 | def process_domain(x): 241 | if str(x) == 'nan': 242 | return [] 243 | temp = [t.strip() for t in x[:-1].split(";")] 244 | return temp 245 | 246 | uniprot['protein-domain'] = uniprot['Cross-reference (Pfam)'].apply(process_domain) 247 | items = [item for sublist in uniprot['protein-domain'] for item in sublist] 248 | unique_elements, counts_elements = np.unique(items, return_counts=True) 249 | items = unique_elements[np.where(counts_elements > 5)] 250 | pro_mapping = dict(zip(list(items),range(len(items)))) 251 | pro_encoding = [[0]*len(items) for i in range(len(uniprot))] 252 | 253 | for i,row in uniprot.iterrows(): 254 | for fam in row['protein-domain']: 255 | if fam in pro_mapping: 256 | pro_encoding[i][ pro_mapping[fam] ] = 1 257 | 258 | uniprot['Pro_domain_encoding'] = pro_encoding 259 | 260 | 261 | #### wirte files 262 | print("write files...") 263 | uniprot.to_pickle(os.path.join(args.data_path, args.species, "features.pkl")) 264 | uniprot[['Entry','Gene names','Cross-reference (STRING)']].to_csv(os.path.join(args.data_path,args.species,"gene_list.csv"), 265 | index_label='ID') 266 | 267 | 268 | 269 | ################################# 270 | ######## process PPIs ########### 271 | 272 | print("Start processing PPIs...") 273 | 274 | string_file = os.path.join(args.data_path, args.species, "string-"+args.species+".txt") 275 | string = pd.read_table(string_file, delimiter=" ") 276 | gene_list = pd.read_csv(os.path.join(args.data_path,args.species,"gene_list.csv")) 277 | 278 | # filter by uniprot 279 | string = string[string['protein1'].isin(gene_list['Cross-reference (STRING)'].values)] 280 | string = string[string['protein2'].isin(gene_list['Cross-reference (STRING)'].values)] 281 | 282 | # map names to indexs 283 | id_mapping = dict(zip(list(gene_list['Cross-reference (STRING)'].values), 284 | list(gene_list['ID'].values))) 285 | string['protein1_id'] = string['protein1'].apply(lambda x:id_mapping[x]) 286 | string['protein2_id'] = string['protein2'].apply(lambda x:id_mapping[x]) 287 | 288 | subnetwork = string[['protein1_id','protein2_id','combined_score']] 289 | subnetwork['combined_score'] = subnetwork['combined_score']/1000.0 290 | subnetwork.to_csv(os.path.join(args.data_path, args.species, "networks/ppi.txt"), index=False, header=False, sep="\t") 291 | 292 | 293 | 294 | ################################### 295 | ######## process similarity ####### 296 | 297 | print("Start processing similarity...") 298 | 299 | def write_fasta(feats): 300 | filename = os.path.join(args.data_path, args.species, "blast/uniprot_seq.fas") 301 | with open(filename, "w") as f: 302 | for i,row in feats.iterrows(): 303 | f.write(">" + row['Entry'] + "\n") 304 | f.write(row['Sequence'] + "\n") 305 | 306 | write_fasta(uniprot) --------------------------------------------------------------------------------