├── .DS_Store
├── .gitattributes
├── LICENSE
├── README.md
├── data.zip
├── model_down.PNG
├── model_up.PNG
├── results
│   └── .DS_Store
└── src
    ├── .DS_Store
    ├── Graph2GO
    │   ├── .DS_Store
    │   ├── evaluation.py
    │   ├── gcnModel.py
    │   ├── initializations.py
    │   ├── input_data.py
    │   ├── layers.py
    │   ├── main.py
    │   ├── optimizer.py
    │   ├── preprocessing.py
    │   ├── trainGcn.py
    │   └── trainNN.py
    └── preprocessing
        ├── .DS_Store
        ├── go_anchestor.py
        └── preprocessing.py

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/.DS_Store
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Kunjie Fan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Graph2GO
2 | ## Description
3 | Graph2GO is a graph-based representation learning method for predicting protein functions. It uses both network information and node attributes to improve performance. Protein-protein interaction (PPI) networks and sequence-similarity networks are used to construct graphs, over which node attributes are propagated following the definition of graph convolutional networks.
4 |
5 | We use amino acid sequences (CT encoding), subcellular locations (bag-of-words encoding) and protein domains (bag-of-words encoding) as the node attributes (initial feature representation).
6 |
7 | The auto-encoder part of our model builds on the implementation by T. N. Kipf. You can find the source code [here](https://github.com/tkipf/gae).
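As a concrete illustration of the sequence feature, the sketch below mirrors the conjoint triad (CT) encoding implemented in `src/preprocessing/preprocessing.py`: the 20 amino acids are grouped into 7 classes, and every overlapping triad of classes is counted, giving a 343-dimensional vector per protein. The function and variable names in the sketch are illustrative only, not part of the package API.

```python
# Minimal sketch of the CT (conjoint triad) encoding, mirroring the CT() helper
# in src/preprocessing/preprocessing.py. Proteins containing ambiguous residues
# (B, O, J, U, X, Z) are filtered out before this encoding is applied.
import numpy as np

CLASS_MAP = {'G': '1', 'A': '1', 'V': '1', 'L': '2', 'I': '2', 'F': '2', 'P': '2',
             'Y': '3', 'M': '3', 'T': '3', 'S': '3', 'H': '4', 'N': '4', 'Q': '4', 'W': '4',
             'R': '5', 'K': '5', 'D': '6', 'E': '6', 'C': '7'}

def ct_encode(sequence):
    seq = ''.join(CLASS_MAP[aa] for aa in sequence)  # map each residue to its class label
    coding = np.zeros(343, dtype=int)                # 7 * 7 * 7 possible class triads
    for i in range(len(seq) - 2):                    # slide over every overlapping triad
        index = int(seq[i]) + (int(seq[i + 1]) - 1) * 7 + (int(seq[i + 2]) - 1) * 49 - 1
        coding[index] += 1
    return coding

print(ct_encode("MKTAYIAKQR"))  # toy sequence, prints the 343-dimensional count vector
```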
8 |
9 | ![VGAE model](https://github.com/yanzhanglab/Graph2GO/blob/master/model_up.PNG)
10 | ![model architecture](https://github.com/yanzhanglab/Graph2GO/blob/master/model_down.PNG)
11 |
12 | ## Citing
13 | If you find Graph2GO useful for your research, please consider citing our [work](https://academic.oup.com/gigascience/article/9/8/giaa081/5885490):
14 | ```
15 | @article{10.1093/gigascience/giaa081,
16 |     author = {Fan, Kunjie and Guan, Yuanfang and Zhang, Yan},
17 |     title = "{Graph2GO: a multi-modal attributed network embedding method for inferring protein functions}",
18 |     journal = {GigaScience},
19 |     volume = {9},
20 |     number = {8},
21 |     year = {2020},
22 |     month = {08},
23 |     issn = {2047-217X},
24 |     doi = {10.1093/gigascience/giaa081},
25 |     url = {https://doi.org/10.1093/gigascience/giaa081}
26 | }
27 | ```
28 |
29 | ## Usage
30 | ### Requirements
31 | - Python 3.6
32 | - TensorFlow
33 | - Keras
34 | - networkx
35 | - scipy
36 | - numpy
37 | - pickle
38 | - scikit-learn
39 | - pandas
40 |
41 | ### Data
42 | You can download the data of all six species from the `data.zip` archive in this repository. Please download it and put the extracted `data` folder in the same path as the `src` folder.
43 |
44 | ### Steps
45 | #### Step 1: decompress data files
46 | > unzip data.zip
47 |
48 | #### Step 2: run the model
49 | > cd src/Graph2GO
50 | > python main.py
51 | > **Note that several parameters can be tuned: --ppi_attributes, --simi_attributes, --species, --thr_ppi, --thr_evalue, etc. Please refer to the main.py file for a detailed description of all parameters.**
--------------------------------------------------------------------------------
/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/data.zip
--------------------------------------------------------------------------------
/model_down.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/model_down.PNG
--------------------------------------------------------------------------------
/model_up.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/model_up.PNG
--------------------------------------------------------------------------------
/results/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/results/.DS_Store
--------------------------------------------------------------------------------
/src/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/src/.DS_Store
--------------------------------------------------------------------------------
/src/Graph2GO/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/src/Graph2GO/.DS_Store
--------------------------------------------------------------------------------
/src/Graph2GO/evaluation.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
from collections import defaultdict 4 | from sklearn.metrics import roc_curve, auc, matthews_corrcoef 5 | from sklearn.metrics import roc_auc_score 6 | from sklearn.metrics import accuracy_score, f1_score 7 | from sklearn.metrics import precision_recall_curve 8 | from sklearn.metrics import average_precision_score 9 | 10 | def get_label_frequency(ontology): 11 | col_sums = ontology.sum(0) 12 | index_11_30 = np.where((col_sums>=11) & (col_sums<=30))[0] 13 | index_31_100 = np.where((col_sums>=31) & (col_sums<=100))[0] 14 | index_101_300 = np.where((col_sums>=101) & (col_sums<=300))[0] 15 | index_larger_300 = np.where(col_sums >= 301)[0] 16 | return index_11_30, index_31_100, index_101_300, index_larger_300 17 | 18 | def calculate_accuracy(y_test, y_score): 19 | y_score_max = np.argmax(y_score,axis=1) 20 | cnt = 0 21 | for i in range(y_score.shape[0]): 22 | if y_test[i, y_score_max[i]] == 1: 23 | cnt += 1 24 | 25 | return float(cnt)/y_score.shape[0] 26 | 27 | def calculate_fmax(preds, labels): 28 | preds = np.round(preds, 2) 29 | labels = labels.astype(np.int32) 30 | f_max = 0 31 | p_max = 0 32 | r_max = 0 33 | sp_max = 0 34 | t_max = 0 35 | for t in range(1, 100): 36 | threshold = t / 100.0 37 | predictions = (preds > threshold).astype(np.int32) 38 | tp = np.sum(predictions * labels) 39 | fp = np.sum(predictions) - tp 40 | fn = np.sum(labels) - tp 41 | sn = tp / (1.0 * np.sum(labels)) 42 | sp = np.sum((predictions ^ 1) * (labels ^ 1)) 43 | sp /= 1.0 * np.sum(labels ^ 1) 44 | fpr = 1 - sp 45 | precision = tp / (1.0 * (tp + fp)) 46 | recall = tp / (1.0 * (tp + fn)) 47 | f = 2 * precision * recall / (precision + recall) 48 | if f_max < f: 49 | f_max = f 50 | p_max = precision 51 | r_max = recall 52 | sp_max = sp 53 | t_max = threshold 54 | return f_max 55 | 56 | 57 | def plot_PRCurve(label, score): 58 | precision, recall, _ = precision_recall_curve(label.ravel(), score.ravel()) 59 | aupr = average_precision_score(label, score, average="micro") 60 | 61 | plt.figure() 62 | plt.step(recall, precision, color='b', alpha=0.2, 63 | where='post') 64 | plt.fill_between(recall, precision, alpha=0.2, color='b', 65 | **step_kwargs) 66 | 67 | plt.xlabel('Recall') 68 | plt.ylabel('Precision') 69 | plt.ylim([0.0, 1.05]) 70 | plt.xlim([0.0, 1.0]) 71 | plt.title('Average precision score, micro-averaged over all classes: AP={0:0.2f}'.format(aupr)) 72 | plt.show() 73 | 74 | def evaluate_performance(y_test, y_score): 75 | """Evaluate performance""" 76 | n_classes = y_test.shape[1] 77 | perf = dict() 78 | 79 | perf["M-aupr"] = 0.0 80 | n = 0 81 | aupr_list = [] 82 | num_pos_list = [] 83 | for i in range(n_classes): 84 | num_pos = sum(y_test[:, i]) 85 | if num_pos > 0: 86 | ap = average_precision_score(y_test[:, i], y_score[:, i]) 87 | n += 1 88 | perf["M-aupr"] += ap 89 | aupr_list.append(ap) 90 | num_pos_list.append(num_pos) 91 | perf["M-aupr"] /= n 92 | perf['aupr_list'] = aupr_list 93 | perf['num_pos_list'] = num_pos_list 94 | 95 | # Compute micro-averaged AUPR 96 | perf['m-aupr'] = average_precision_score(y_test.ravel(), y_score.ravel()) 97 | 98 | # Computes accuracy 99 | perf['accuracy'] = calculate_accuracy(y_test, y_score) 100 | 101 | # Computes F1-score 102 | alpha = 3 103 | y_new_pred = np.zeros_like(y_test) 104 | for i in range(y_test.shape[0]): 105 | top_alpha = np.argsort(y_score[i, :])[-alpha:] 106 | y_new_pred[i, top_alpha] = np.array(alpha*[1]) 107 | perf["F1-score"] = f1_score(y_test, y_new_pred, average='micro') 108 | 109 | perf['F-max'] = calculate_fmax(y_score, y_test) 110 | 111 | return perf 112 
| 113 | def get_results(ontology, Y_test, y_score): 114 | perf = defaultdict(dict) 115 | index_11_30, index_31_100, index_101_300, index_301 = get_label_frequency(ontology) 116 | 117 | perf['11-30'] = evaluate_performance(Y_test[:,index_11_30], y_score[:,index_11_30]) 118 | perf['31-100'] = evaluate_performance(Y_test[:,index_31_100], y_score[:,index_31_100]) 119 | perf['101-300'] = evaluate_performance(Y_test[:,index_101_300], y_score[:,index_101_300]) 120 | perf['301-'] = evaluate_performance(Y_test[:,index_301], y_score[:,index_301]) 121 | perf['all'] = evaluate_performance(Y_test, y_score) 122 | 123 | return perf 124 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /src/Graph2GO/gcnModel.py: -------------------------------------------------------------------------------- 1 | from layers import GraphConvolution, GraphConvolutionSparse, InnerProductDecoder 2 | import tensorflow as tf 3 | 4 | 5 | class Model(object): 6 | def __init__(self, **kwargs): 7 | allowed_kwargs = {'name', 'logging'} 8 | for kwarg in kwargs.keys(): 9 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 10 | 11 | for kwarg in kwargs.keys(): 12 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 13 | name = kwargs.get('name') 14 | if not name: 15 | name = self.__class__.__name__.lower() 16 | self.name = name 17 | 18 | logging = kwargs.get('logging', False) 19 | self.logging = logging 20 | 21 | self.vars = {} 22 | 23 | def _build(self): 24 | raise NotImplementedError 25 | 26 | def build(self): 27 | """ Wrapper for _build() """ 28 | with tf.variable_scope(self.name): 29 | self._build() 30 | variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name) 31 | self.vars = {var.name: var for var in variables} 32 | 33 | def fit(self): 34 | pass 35 | 36 | def predict(self): 37 | pass 38 | 39 | 40 | class GCNModelAE(Model): 41 | def __init__(self, placeholders, num_features, features_nonzero, hidden1, hidden2, **kwargs): 42 | super(GCNModelAE, self).__init__(**kwargs) 43 | 44 | self.inputs = placeholders['features'] 45 | self.input_dim = num_features 46 | self.features_nonzero = features_nonzero 47 | self.hidden1_dim = hidden1 48 | self.hidden2_dim = hidden2 49 | self.adj = placeholders['adj'] 50 | self.dropout = placeholders['dropout'] 51 | self.build() 52 | 53 | def _build(self): 54 | self.hidden1 = GraphConvolutionSparse(input_dim=self.input_dim, 55 | output_dim=self.hidden1_dim, 56 | adj=self.adj, 57 | features_nonzero=self.features_nonzero, 58 | act=tf.nn.relu, 59 | dropout=self.dropout, 60 | logging=self.logging)(self.inputs) 61 | 62 | self.embeddings = GraphConvolution(input_dim=self.hidden1_dim, 63 | output_dim=self.hidden2_dim, 64 | adj=self.adj, 65 | act=lambda x: x, 66 | dropout=self.dropout, 67 | logging=self.logging)(self.hidden1) 68 | 69 | self.z_mean = self.embeddings 70 | 71 | self.reconstructions = InnerProductDecoder(input_dim=self.hidden2_dim, 72 | act=lambda x: x, 73 | logging=self.logging)(self.embeddings) 74 | 75 | 76 | class GCNModelVAE(Model): 77 | def __init__(self, placeholders, num_features, num_nodes, features_nonzero, hidden1, hidden2, **kwargs): 78 | super(GCNModelVAE, self).__init__(**kwargs) 79 | 80 | self.inputs = placeholders['features'] 81 | self.input_dim = num_features 82 | self.features_nonzero = features_nonzero 83 | self.n_samples = num_nodes 84 | self.hidden1_dim = hidden1 85 | self.hidden2_dim = hidden2 86 | self.adj = placeholders['adj'] 87 | self.dropout = placeholders['dropout'] 
88 | self.build() 89 | 90 | def _build(self): 91 | self.hidden1 = GraphConvolutionSparse(input_dim=self.input_dim, 92 | output_dim=self.hidden1_dim, 93 | adj=self.adj, 94 | features_nonzero=self.features_nonzero, 95 | act=tf.nn.relu, 96 | dropout=self.dropout, 97 | logging=self.logging)(self.inputs) 98 | 99 | self.z_mean = GraphConvolution(input_dim=self.hidden1_dim, 100 | output_dim=self.hidden2_dim, 101 | adj=self.adj, 102 | act=lambda x: x, 103 | dropout=self.dropout, 104 | logging=self.logging)(self.hidden1) 105 | 106 | self.z_log_std = GraphConvolution(input_dim=self.hidden1_dim, 107 | output_dim=self.hidden2_dim, 108 | adj=self.adj, 109 | act=lambda x: x, 110 | dropout=self.dropout, 111 | logging=self.logging)(self.hidden1) 112 | 113 | self.z = self.z_mean + tf.random_normal([self.n_samples, self.hidden2_dim], dtype=tf.float64) * tf.exp(self.z_log_std) 114 | 115 | self.reconstructions = InnerProductDecoder(input_dim=self.hidden2_dim, 116 | act=lambda x: x, 117 | logging=self.logging)(self.z) 118 | -------------------------------------------------------------------------------- /src/Graph2GO/initializations.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | def weight_variable_glorot(input_dim, output_dim, name=""): 5 | """Create a weight variable with Glorot & Bengio (AISTATS 2010) 6 | initialization. 7 | """ 8 | init_range = np.sqrt(6.0 / (input_dim + output_dim)) 9 | initial = tf.random_uniform([input_dim, output_dim], minval=-init_range, 10 | maxval=init_range, dtype=tf.float64) 11 | return tf.Variable(initial, name=name) 12 | -------------------------------------------------------------------------------- /src/Graph2GO/input_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pickle as pkl 3 | import networkx as nx 4 | from networkx.readwrite import json_graph 5 | import scipy.sparse as sp 6 | from sklearn.preprocessing import scale,normalize 7 | from sklearn.feature_extraction.text import CountVectorizer 8 | import pandas as pd 9 | import json 10 | import os 11 | from tqdm import tqdm 12 | 13 | 14 | def load_ppi_network(filename, gene_num, thr): 15 | with open(filename) as f: 16 | data = f.readlines() 17 | adj = np.zeros((gene_num,gene_num)) 18 | for x in tqdm(data): 19 | temp = x.strip().split("\t") 20 | # check whether score larger than the threshold 21 | if float(temp[2]) >= thr: 22 | adj[int(temp[0]),int(temp[1])] = 1 23 | if (adj.T == adj).all(): 24 | pass 25 | else: 26 | adj = adj + adj.T 27 | 28 | return adj 29 | 30 | def load_simi_network(filename, gene_num, thr): 31 | with open(filename) as f: 32 | data = f.readlines() 33 | adj = np.zeros((gene_num,gene_num)) 34 | for x in tqdm(data): 35 | temp = x.strip().split("\t") 36 | # check whether evalue smaller than the threshold 37 | if float(temp[2]) <= thr: 38 | adj[int(temp[0]),int(temp[1])] = 1 39 | if (adj.T == adj).all(): 40 | pass 41 | else: 42 | adj = adj + adj.T 43 | 44 | return adj 45 | 46 | def load_labels(uniprot): 47 | print('loading labels...') 48 | # load labels (GO) 49 | cc = uniprot['cc_label'].values 50 | cc = np.hstack(cc).reshape((len(cc),len(cc[0]))) 51 | 52 | bp = uniprot['bp_label'].values 53 | bp = np.hstack(bp).reshape((len(bp),len(bp[0]))) 54 | 55 | mf = uniprot['mf_label'].values 56 | mf = np.hstack(mf).reshape((len(mf),len(mf[0]))) 57 | 58 | return cc,mf,bp 59 | 60 | 61 | def load_data(graph_type, uniprot, args): 62 | 63 | print('loading 
data...') 64 | 65 | def reshape(features): 66 | return np.hstack(features).reshape((len(features),len(features[0]))) 67 | 68 | # get feature representations 69 | features_seq = scale(reshape(uniprot['CT'].values)) 70 | features_loc = reshape(uniprot['Sub_cell_loc_encoding'].values) 71 | features_domain = reshape(uniprot['Pro_domain_encoding'].values) 72 | 73 | print('generating features...') 74 | 75 | if graph_type == "ppi": 76 | attribute = args.ppi_attributes 77 | elif graph_type == "sequence_similarity": 78 | attribute = args.simi_attributes 79 | 80 | if attribute == 0: 81 | features = np.identity(uniprot.shape[0]) 82 | print("Without features") 83 | elif attribute == 1: 84 | features = features_seq 85 | print("Only use sequence feature") 86 | elif attribute == 2: 87 | features = features_loc 88 | print("Only use location feature") 89 | elif attribute == 3: 90 | features = features_domain 91 | print("Only use domain feature") 92 | elif attribute == 5: 93 | features = np.concatenate((features_loc,features_domain),axis=1) 94 | print("use location and domain features") 95 | elif attribute == 6: 96 | features = np.concatenate((features_seq, features_loc,features_domain),axis=1) 97 | print("Use all the features") 98 | elif attribute == 7: 99 | features = np.concatenate((features_seq, features_loc,features_domain, np.identity(uniprot.shape[0])),axis=1) 100 | print("Use all features plus identity") 101 | 102 | features = sp.csr_matrix(features) 103 | 104 | print('loading graph...') 105 | if graph_type == "sequence_similarity": 106 | filename = os.path.join(args.data_path, args.species, "networks/sequence_similarity.txt") 107 | adj = load_simi_network(filename, uniprot.shape[0], args.thr_evalue) 108 | elif graph_type == "ppi": 109 | filename = os.path.join(args.data_path, args.species, "networks/ppi.txt") 110 | adj = load_ppi_network(filename, uniprot.shape[0], args.thr_ppi) 111 | 112 | adj = sp.csr_matrix(adj) 113 | 114 | 115 | return adj, features 116 | -------------------------------------------------------------------------------- /src/Graph2GO/layers.py: -------------------------------------------------------------------------------- 1 | from initializations import * 2 | import tensorflow as tf 3 | 4 | # global unique layer ID dictionary for layer name assignment 5 | _LAYER_UIDS = {} 6 | 7 | 8 | def get_layer_uid(layer_name=''): 9 | """Helper function, assigns unique layer IDs 10 | """ 11 | if layer_name not in _LAYER_UIDS: 12 | _LAYER_UIDS[layer_name] = 1 13 | return 1 14 | else: 15 | _LAYER_UIDS[layer_name] += 1 16 | return _LAYER_UIDS[layer_name] 17 | 18 | 19 | def dropout_sparse(x, keep_prob, num_nonzero_elems): 20 | """Dropout for sparse tensors. Currently fails for very large sparse tensors (>1M elements) 21 | """ 22 | noise_shape = [num_nonzero_elems] 23 | random_tensor = keep_prob 24 | random_tensor += tf.random_uniform(noise_shape) 25 | dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool) 26 | pre_out = tf.sparse_retain(x, dropout_mask) 27 | #return pre_out * (1./keep_prob) 28 | return x 29 | 30 | 31 | class Layer(object): 32 | """Base layer class. Defines basic API for all layer objects. 33 | 34 | # Properties 35 | name: String, defines the variable scope of the layer. 36 | 37 | # Methods 38 | _call(inputs): Defines computation graph of layer 39 | (i.e. 
takes input, returns output) 40 | __call__(inputs): Wrapper for _call() 41 | """ 42 | def __init__(self, **kwargs): 43 | allowed_kwargs = {'name', 'logging'} 44 | for kwarg in kwargs.keys(): 45 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 46 | name = kwargs.get('name') 47 | if not name: 48 | layer = self.__class__.__name__.lower() 49 | name = layer + '_' + str(get_layer_uid(layer)) 50 | self.name = name 51 | self.vars = {} 52 | logging = kwargs.get('logging', False) 53 | self.logging = logging 54 | self.issparse = False 55 | 56 | def _call(self, inputs): 57 | return inputs 58 | 59 | def __call__(self, inputs): 60 | with tf.name_scope(self.name): 61 | outputs = self._call(inputs) 62 | return outputs 63 | 64 | 65 | class GraphConvolution(Layer): 66 | """Basic graph convolution layer for undirected graph without edge labels.""" 67 | def __init__(self, input_dim, output_dim, adj, dropout=0., act=tf.nn.relu, **kwargs): 68 | super(GraphConvolution, self).__init__(**kwargs) 69 | with tf.variable_scope(self.name + '_vars'): 70 | self.vars['weights'] = weight_variable_glorot(input_dim, output_dim, name="weights") 71 | self.dropout = dropout 72 | self.adj = adj 73 | self.act = act 74 | 75 | def _call(self, inputs): 76 | x = inputs 77 | #x = tf.nn.dropout(x, 1-self.dropout) 78 | x = tf.matmul(x, self.vars['weights']) 79 | x = tf.sparse_tensor_dense_matmul(self.adj, x) 80 | outputs = self.act(x) 81 | return outputs 82 | 83 | 84 | class GraphConvolutionSparse(Layer): 85 | """Graph convolution layer for sparse inputs.""" 86 | def __init__(self, input_dim, output_dim, adj, features_nonzero, dropout=0., act=tf.nn.relu, **kwargs): 87 | super(GraphConvolutionSparse, self).__init__(**kwargs) 88 | with tf.variable_scope(self.name + '_vars'): 89 | self.vars['weights'] = weight_variable_glorot(input_dim, output_dim, name="weights") 90 | self.dropout = dropout 91 | self.adj = adj 92 | self.act = act 93 | self.issparse = True 94 | self.features_nonzero = features_nonzero 95 | 96 | def _call(self, inputs): 97 | x = inputs 98 | x = dropout_sparse(x, 1-self.dropout, self.features_nonzero) 99 | x = tf.sparse_tensor_dense_matmul(x, self.vars['weights']) 100 | x = tf.sparse_tensor_dense_matmul(self.adj, x) 101 | outputs = self.act(x) 102 | return outputs 103 | 104 | 105 | class InnerProductDecoder(Layer): 106 | """Decoder model layer for link prediction.""" 107 | def __init__(self, input_dim, dropout=0., act=tf.nn.sigmoid, **kwargs): 108 | super(InnerProductDecoder, self).__init__(**kwargs) 109 | self.dropout = dropout 110 | self.act = act 111 | 112 | def _call(self, inputs): 113 | inputs = tf.nn.dropout(inputs, 1-self.dropout) 114 | x = tf.transpose(inputs) 115 | x = tf.matmul(inputs, x) 116 | x = tf.reshape(x, [-1]) 117 | outputs = self.act(x) 118 | return outputs 119 | -------------------------------------------------------------------------------- /src/Graph2GO/main.py: -------------------------------------------------------------------------------- 1 | from input_data import load_data,load_labels 2 | import numpy as np 3 | import pandas as pd 4 | import sys,getopt 5 | import argparse 6 | import json 7 | import os 8 | from collections import defaultdict 9 | 10 | from trainGcn import train_gcn 11 | from trainNN import train_nn 12 | from trainSVM import train_svm 13 | 14 | from evaluation import get_results 15 | 16 | def reshape(features): 17 | return np.hstack(features).reshape((len(features),len(features[0]))) 18 | 19 | 20 | def train(args): 21 | # load feature dataframe 22 | 
print("loading features...") 23 | uniprot = pd.read_pickle(os.path.join(args.data_path, args.species, "features.pkl")) 24 | 25 | embeddings_list = [] 26 | for graph in args.graphs: 27 | print("#############################") 28 | print("Training",graph) 29 | adj, features = load_data(graph, uniprot, args) 30 | embeddings = train_gcn(features, adj, args, graph) 31 | embeddings_list.append(embeddings) 32 | 33 | if graph == "combined": 34 | attr = args.ppi_attributes 35 | elif graph == "similarity": 36 | attr = args.simi_attributes 37 | 38 | embeddings = np.hstack(embeddings_list) 39 | 40 | if args.only_gcn == 1: 41 | return 42 | 43 | #load labels 44 | cc,mf,bp = load_labels(uniprot) 45 | 46 | np.random.seed(5959) 47 | 48 | # split data into train and test 49 | num_test = int(np.floor(cc.shape[0] / 5.)) 50 | num_train = cc.shape[0] - num_test 51 | all_idx = list(range(cc.shape[0])) 52 | np.random.shuffle(all_idx) 53 | 54 | train_idx = all_idx[:num_train] 55 | test_idx = all_idx[num_train:(num_train + num_test)] 56 | 57 | Y_train_cc = cc[train_idx] 58 | Y_train_bp = bp[train_idx] 59 | Y_train_mf = mf[train_idx] 60 | 61 | Y_test_cc = cc[test_idx] 62 | Y_test_bp = bp[test_idx] 63 | Y_test_mf = mf[test_idx] 64 | 65 | X_train = embeddings[train_idx] 66 | X_test = embeddings[test_idx] 67 | 68 | print("Start running supervised model...") 69 | rand_str = np.random.randint(1000000) 70 | save_path = os.path.join(args.data_path, args.species, "results_new/results_graph2go_" + args.supervised + "_" + ";".join(args.graphs) + "_" + str(args.ppi_attributes) + "_" + str(args.simi_attributes) + "_" + str(args.thr_combined) + "_" + str(args.thr_evalue)) 71 | 72 | print("###################################") 73 | print('----------------------------------') 74 | print('CC') 75 | 76 | if args.supervised == "svm": 77 | y_score_cc = train_svm(X_train,Y_train_cc,X_test,Y_test_cc) 78 | elif args.supervised == "nn": 79 | y_score_cc = train_nn(X_train,Y_train_cc,X_test,Y_test_cc) 80 | 81 | perf_cc = get_results(cc, Y_test_cc, y_score_cc) 82 | if args.save_results: 83 | with open(save_path + "_cc.json", "w") as f: 84 | json.dump(perf_cc, f) 85 | 86 | 87 | print('----------------------------------') 88 | print('MF') 89 | 90 | if args.supervised == "svm": 91 | y_score_mf = train_svm(X_train,Y_train_mf,X_test,Y_test_mf) 92 | elif args.supervised == "nn": 93 | y_score_mf = train_nn(X_train,Y_train_mf,X_test,Y_test_mf) 94 | 95 | perf_mf = get_results(mf, Y_test_mf, y_score_mf) 96 | if args.save_results: 97 | with open(save_path + "_mf.json","w") as f: 98 | json.dump(perf_mf, f) 99 | 100 | 101 | print('----------------------------------') 102 | print('BP') 103 | 104 | if args.supervised == "svm": 105 | y_score_bp = train_svm(X_train,Y_train_bp,X_test,Y_test_bp) 106 | elif args.supervised == "nn": 107 | y_score_bp = train_nn(X_train,Y_train_bp,X_test,Y_test_bp) 108 | 109 | perf_bp = get_results(bp, Y_test_bp, y_score_bp) 110 | if args.save_results: 111 | with open(save_path + "_bp.json","w") as f: 112 | json.dump(perf_bp, f) 113 | 114 | 115 | 116 | if __name__ == "__main__": 117 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 118 | #global parameters 119 | parser.add_argument('--ppi_attributes', type=int, default=6, help="types of attributes used by ppi.") 120 | parser.add_argument('--simi_attributes', type=int, default=5, help="types of attributes used by simi.") 121 | parser.add_argument('--graphs', type=lambda s:[item for item in s.split(",")], default=['combined','similarity'], 
help="lists of graphs to use.") 122 | parser.add_argument('--species', type=str, default="human", help="which species to use.") 123 | parser.add_argument('--data_path', type=str, default="../../data/", help="path storing data.") 124 | parser.add_argument('--thr_combined', type=float, default=0.3, help="threshold for combiend ppi network.") 125 | parser.add_argument('--thr_evalue', type=float, default=1e-4, help="threshold for similarity network.") 126 | parser.add_argument('--supervised', type=str, default="nn", help="neural networks or svm") 127 | parser.add_argument('--only_gcn', type=int, default=0, help="0 for training all, 1 for only embeddings.") 128 | parser.add_argument('--save_results', type=int, default=1, help="whether to save the performance results") 129 | 130 | #parameters for traing GCN 131 | parser.add_argument('--lr', type=float, default=0.001, help="Initial learning rate.") 132 | parser.add_argument('--epochs_ppi', type=int, default=80, help="Number of epochs to train ppi.") 133 | parser.add_argument('--epochs_simi', type=int, default=60, help="Number of epochs to train similarity network.") 134 | parser.add_argument('--hidden1', type=int, default=800, help="Number of units in hidden layer 1.") 135 | parser.add_argument('--hidden2', type=int, default=400, help="Number of units in hidden layer 2.") 136 | parser.add_argument('--weight_decay', type=float, default=0, help="Weight for L2 loss on embedding matrix.") 137 | parser.add_argument('--dropout', type=float, default=0, help="Dropout rate (1 - keep probability).") 138 | parser.add_argument('--model', type=str, default="gcn_vae", help="Model string.") 139 | 140 | args = parser.parse_args() 141 | print(args) 142 | 143 | train(args) 144 | -------------------------------------------------------------------------------- /src/Graph2GO/optimizer.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | class OptimizerAE(object): 4 | def __init__(self, preds, labels, pos_weight, norm, lr): 5 | 6 | self.cost = norm * tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=preds, targets=labels, pos_weight=pos_weight)) 7 | self.optimizer = tf.train.AdamOptimizer(learning_rate=lr) # Adam Optimizer 8 | 9 | self.opt_op = self.optimizer.minimize(self.cost) 10 | self.grads_vars = self.optimizer.compute_gradients(self.cost) 11 | 12 | self.correct_prediction = tf.equal(tf.cast(tf.greater_equal(preds, 0.0), tf.int32), 13 | tf.cast(tf.greater_equal(labels, 0.0),tf.int32)) 14 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32)) 15 | 16 | 17 | class OptimizerVAE(object): 18 | def __init__(self, preds, labels, model, num_nodes, pos_weight, norm, lr): 19 | 20 | self.cost = norm * tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=preds, targets=labels, pos_weight=pos_weight)) 21 | self.optimizer = tf.train.AdamOptimizer(learning_rate=lr) # Adam Optimizer 22 | 23 | # Latent loss 24 | self.log_lik = self.cost 25 | self.kl = (0.5 / num_nodes) * tf.reduce_mean(tf.reduce_sum(1 + 2 * model.z_log_std - tf.square(model.z_mean) - 26 | tf.square(tf.exp(model.z_log_std)), 1)) 27 | self.cost -= self.kl 28 | 29 | self.opt_op = self.optimizer.minimize(self.cost) 30 | self.grads_vars = self.optimizer.compute_gradients(self.cost) 31 | 32 | 33 | self.correct_prediction = tf.equal(tf.cast(tf.greater_equal(preds, 0.0), tf.int32), 34 | tf.cast(tf.greater_equal(labels, 0.0), tf.int32)) 35 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, 
tf.float32)) 36 | -------------------------------------------------------------------------------- /src/Graph2GO/preprocessing.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.sparse as sp 3 | 4 | 5 | def sparse_to_tuple(sparse_mx): 6 | if not sp.isspmatrix_coo(sparse_mx): 7 | sparse_mx = sparse_mx.tocoo() 8 | coords = np.vstack((sparse_mx.row, sparse_mx.col)).transpose() 9 | values = sparse_mx.data 10 | shape = sparse_mx.shape 11 | return coords, values, shape 12 | 13 | 14 | def preprocess_graph(adj): 15 | adj = sp.coo_matrix(adj) 16 | adj_ = adj + sp.eye(adj.shape[0]) 17 | rowsum = np.array(adj_.sum(1)) 18 | degree_mat_inv_sqrt = sp.diags(np.power(rowsum, -0.5).flatten()) 19 | adj_normalized = adj_.dot(degree_mat_inv_sqrt).transpose().dot(degree_mat_inv_sqrt).tocoo() 20 | return sparse_to_tuple(adj_normalized) 21 | 22 | 23 | def construct_feed_dict(adj_normalized, adj, features, placeholders): 24 | # construct feed dictionary 25 | feed_dict = dict() 26 | feed_dict.update({placeholders['features']: features}) 27 | feed_dict.update({placeholders['adj']: adj_normalized}) 28 | feed_dict.update({placeholders['adj_orig']: adj}) 29 | return feed_dict 30 | -------------------------------------------------------------------------------- /src/Graph2GO/trainGcn.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import time 5 | import os 6 | import tensorflow as tf 7 | import numpy as np 8 | import scipy.sparse as sp 9 | from sklearn.metrics import average_precision_score 10 | from optimizer import OptimizerAE, OptimizerVAE 11 | from gcnModel import GCNModelAE, GCNModelVAE 12 | from preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple 13 | import argparse 14 | 15 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 16 | 17 | 18 | def train_gcn(features, adj_train, args, graph_type): 19 | model_str = args.model 20 | 21 | # Store original adjacency matrix (without diagonal entries) for later 22 | adj_orig = adj_train 23 | adj_orig = adj_orig - sp.dia_matrix((adj_orig.diagonal()[np.newaxis, :], [0]), shape=adj_orig.shape) 24 | adj_orig.eliminate_zeros() 25 | 26 | adj = adj_train 27 | 28 | # Some preprocessing 29 | adj_norm = preprocess_graph(adj) 30 | 31 | # Define placeholders 32 | placeholders = { 33 | 'features': tf.sparse_placeholder(tf.float64), 34 | 'adj': tf.sparse_placeholder(tf.float64), 35 | 'adj_orig': tf.sparse_placeholder(tf.float64), 36 | 'dropout': tf.placeholder_with_default(0., shape=()) 37 | } 38 | 39 | num_nodes = adj.shape[0] 40 | features = sparse_to_tuple(features.tocoo()) 41 | num_features = features[2][1] 42 | features_nonzero = features[1].shape[0] 43 | 44 | # Create model 45 | model = None 46 | if model_str == 'gcn_ae': 47 | model = GCNModelAE(placeholders, num_features, features_nonzero, args.hidden1, args.hidden2) 48 | elif model_str == 'gcn_vae': 49 | model = GCNModelVAE(placeholders, num_features, num_nodes, features_nonzero, args.hidden1, args.hidden2) 50 | 51 | # Optimizer 52 | with tf.name_scope('optimizer'): 53 | if model_str == 'gcn_ae': 54 | opt = OptimizerAE(preds=model.reconstructions, 55 | labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'], 56 | validate_indices=False), [-1]), 57 | pos_weight=1, 58 | norm=1, 59 | lr=args.lr) 60 | elif model_str == 'gcn_vae': 61 | opt = OptimizerVAE(preds=model.reconstructions, 62 | 
labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'], 63 | validate_indices=False), [-1]), 64 | model=model, num_nodes=num_nodes, 65 | pos_weight=1, 66 | norm=1, 67 | lr=args.lr) 68 | 69 | # Initialize session 70 | sess = tf.Session() 71 | sess.run(tf.global_variables_initializer()) 72 | 73 | 74 | adj_label = adj_train + sp.eye(adj_train.shape[0]) 75 | adj_label = sparse_to_tuple(adj_label) 76 | 77 | 78 | # Train model 79 | # use different epochs for ppi and similarity network 80 | if graph_type == "sequence_similarity": 81 | epochs = args.epochs_simi 82 | else: 83 | epochs = args.epochs_ppi 84 | 85 | for epoch in range(epochs): 86 | 87 | t = time.time() 88 | # Construct feed dictionary 89 | feed_dict = construct_feed_dict(adj_norm, adj_label, features, placeholders) 90 | feed_dict.update({placeholders['dropout']: args.dropout}) 91 | # Run single weight update 92 | outs = sess.run([opt.opt_op, opt.cost], feed_dict=feed_dict) 93 | 94 | if epoch % 10 == 0: 95 | print("Epoch:", '%04d' % (epoch+1), "train_loss=", "{:.5f}".format(outs[1])) 96 | 97 | 98 | print("Optimization Finished!") 99 | 100 | 101 | #return embedding for each protein 102 | emb = sess.run(model.z_mean,feed_dict=feed_dict) 103 | 104 | return emb 105 | 106 | -------------------------------------------------------------------------------- /src/Graph2GO/trainNN.py: -------------------------------------------------------------------------------- 1 | from keras.models import Sequential 2 | from keras.layers import Dense,Activation,Dropout 3 | from keras.callbacks import EarlyStopping 4 | from keras.layers.advanced_activations import LeakyReLU 5 | from keras.wrappers.scikit_learn import KerasClassifier 6 | from keras.layers.normalization import BatchNormalization 7 | import numpy as np 8 | 9 | 10 | def train_nn(X_train, Y_train, X_test, Y_test): 11 | model = Sequential() 12 | model.add(Dense(1024, input_dim=X_train.shape[1])) 13 | model.add(BatchNormalization()) 14 | model.add(LeakyReLU()) 15 | model.add(Dropout(0.3)) 16 | 17 | model.add(Dense(512)) 18 | model.add(BatchNormalization()) 19 | model.add(LeakyReLU()) 20 | model.add(Dropout(0.3)) 21 | 22 | model.add(Dense(256)) 23 | model.add(BatchNormalization()) 24 | model.add(LeakyReLU()) 25 | model.add(Dropout(0.3)) 26 | 27 | model.add(Dense(Y_train.shape[1],activation='sigmoid')) 28 | 29 | model.compile(loss='binary_crossentropy', 30 | optimizer='adam', 31 | metrics=['accuracy']) 32 | model.fit(X_train, Y_train, epochs=100, batch_size=128, verbose=0) 33 | 34 | y_prob = model.predict(X_test) 35 | return y_prob 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /src/preprocessing/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yanzhanglab/Graph2GO/eab1856de0510d258f7f79d28f4edaefb5698f6d/src/preprocessing/.DS_Store -------------------------------------------------------------------------------- /src/preprocessing/go_anchestor.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | import pandas as pd 3 | from xml.etree import ElementTree as ET 4 | 5 | def get_gene_ontology(filename='../../data/go-basic.obo'): 6 | # Reading Gene Ontology from OBO Formatted file 7 | go = dict() 8 | obj = None 9 | with open(filename, 'r') as f: 10 | for line in f: 11 | line = line.strip() 12 | if not line: 13 | continue 14 | if line == '[Term]': 15 | if obj is not None: 16 
| go[obj['id']] = obj 17 | obj = dict() 18 | obj['is_a'] = list() 19 | obj['part_of'] = list() 20 | obj['regulates'] = list() 21 | obj['is_obsolete'] = False 22 | continue 23 | elif line == '[Typedef]': 24 | obj = None 25 | else: 26 | if obj is None: 27 | continue 28 | l = line.split(": ") 29 | if l[0] == 'id': 30 | obj['id'] = l[1] 31 | elif l[0] == 'is_a': 32 | obj['is_a'].append(l[1].split(' ! ')[0]) 33 | elif l[0] == 'name': 34 | obj['name'] = l[1] 35 | elif l[0] == 'is_obsolete' and l[1] == 'true': 36 | obj['is_obsolete'] = True 37 | if obj is not None: 38 | go[obj['id']] = obj 39 | for go_id in go.keys(): 40 | if go[go_id]['is_obsolete']: 41 | del go[go_id] 42 | for go_id, val in go.iteritems(): 43 | if 'children' not in val: 44 | val['children'] = set() 45 | for p_id in val['is_a']: 46 | if p_id in go: 47 | if 'children' not in go[p_id]: 48 | go[p_id]['children'] = set() 49 | go[p_id]['children'].add(go_id) 50 | return go 51 | 52 | 53 | def get_anchestors(go, go_id): 54 | go_set = set() 55 | q = deque() 56 | q.append(go_id) 57 | while(len(q) > 0): 58 | g_id = q.popleft() 59 | go_set.add(g_id) 60 | if g_id not in go: 61 | #print g_id 62 | continue 63 | for parent_id in go[g_id]['is_a']: 64 | if parent_id in go: 65 | q.append(parent_id) 66 | return go_set 67 | 68 | 69 | def get_parents(go, go_id): 70 | go_set = set() 71 | for parent_id in go[go_id]['is_a']: 72 | if parent_id in go: 73 | go_set.add(parent_id) 74 | return go_set 75 | 76 | 77 | def get_go_set(go, go_id): 78 | go_set = set() 79 | q = deque() 80 | q.append(go_id) 81 | while len(q) > 0: 82 | g_id = q.popleft() 83 | go_set.add(g_id) 84 | for ch_id in go[g_id]['children']: 85 | q.append(ch_id) 86 | return go_set 87 | -------------------------------------------------------------------------------- /src/preprocessing/preprocessing.py: -------------------------------------------------------------------------------- 1 | # This file is used to preprocess uniprot and STRING file to get input for Graph2GO model 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import json 6 | from networkx.readwrite import json_graph 7 | import networkx as nx 8 | import re 9 | from collections import defaultdict 10 | from scipy import sparse 11 | import argparse 12 | from tqdm import tqdm 13 | import os 14 | 15 | from go_anchestor import get_gene_ontology,get_anchestors 16 | 17 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 18 | parser.add_argument('--data_path', type=str, default="../../data/", help="path storing data.") 19 | parser.add_argument('--species', type=str, default="human", help="which species to use.") 20 | args = parser.parse_args() 21 | 22 | 23 | ########################################## 24 | ########## process uniprot ############### 25 | 26 | print("Start processing uniprot...") 27 | 28 | #### load file 29 | print("Loading data...") 30 | uniprot_file = os.path.join(args.data_path, args.species, "uniprot-" + args.species + ".tab") 31 | uniprot = pd.read_table(uniprot_file) 32 | print(uniprot.shape) 33 | 34 | 35 | #### filtering 36 | print("filtering...") 37 | # filter by STRING ID occurence 38 | uniprot = uniprot[~uniprot['Cross-reference (STRING)'].isna()] 39 | uniprot.index = range(uniprot.shape[0]) 40 | uniprot['Cross-reference (STRING)'] = uniprot['Cross-reference (STRING)'].apply(lambda x:x[:-1]) 41 | 42 | # filter by sequence length in order to compare with DeepGO 43 | uniprot['Length'] = uniprot['Sequence'].apply(len) 44 | uniprot = uniprot[ uniprot['Length'] <= 1000 ] 45 | 46 
| # filter by ambiguous amino acid 47 | def find_amino_acid(x): 48 | return ('B' in x) | ('O' in x) | ('J' in x) | ('U' in x) | ('X' in x) | ('Z' in x) 49 | 50 | ambiguous_index = uniprot.loc[uniprot['Sequence'].apply(find_amino_acid)].index 51 | uniprot.drop(ambiguous_index, axis=0, inplace=True) 52 | uniprot.index = range(len(uniprot)) 53 | print("after filtering:", uniprot.shape) 54 | 55 | 56 | 57 | #### obtain GO annotations 58 | print("obtain GO annotations...") 59 | uniprot['Gene ontology (biological process)'][uniprot['Gene ontology (biological process)'].isna()] = '' 60 | uniprot['Gene ontology (cellular component)'][uniprot['Gene ontology (cellular component)'].isna()] = '' 61 | uniprot['Gene ontology (molecular function)'][uniprot['Gene ontology (molecular function)'].isna()] = '' 62 | 63 | def get_GO(x): 64 | pattern = re.compile(r"GO:\d+") 65 | return pattern.findall(x) 66 | 67 | uniprot['cc'] = uniprot['Gene ontology (cellular component)'].apply(get_GO) 68 | uniprot['bp'] = uniprot['Gene ontology (biological process)'].apply(get_GO) 69 | uniprot['mf'] = uniprot['Gene ontology (molecular function)'].apply(get_GO) 70 | 71 | num_cc_before = sum(len(x) for x in uniprot['cc']) 72 | num_mf_before = sum(len(x) for x in uniprot['mf']) 73 | num_bp_before = sum(len(x) for x in uniprot['bp']) 74 | print "number of CCs, before enrich", num_cc_before 75 | print "number of MFs, before enrich", num_mf_before 76 | print "number of BPs, before enrich", num_bp_before 77 | 78 | 79 | 80 | print("start enriching go annotations...") 81 | # enrich go terms using ancestors 82 | go = get_gene_ontology(os.path.join(args.data_path, "go-basic.obo")) 83 | BIOLOGICAL_PROCESS = 'GO:0008150' 84 | MOLECULAR_FUNCTION = 'GO:0003674' 85 | CELLULAR_COMPONENT = 'GO:0005575' 86 | 87 | new_cc = [] 88 | new_mf = [] 89 | new_bp = [] 90 | 91 | for i, row in uniprot.iterrows(): 92 | labels = row['cc'] 93 | temp = set([]) 94 | for x in labels: 95 | temp = temp | get_anchestors(go, x) 96 | temp.discard(CELLULAR_COMPONENT) 97 | new_cc.append(list(temp)) 98 | 99 | labels = row['mf'] 100 | temp = set([]) 101 | for x in labels: 102 | temp = temp | get_anchestors(go, x) 103 | temp.discard(MOLECULAR_FUNCTION) 104 | new_mf.append(list(temp)) 105 | 106 | labels = row['bp'] 107 | temp = set([]) 108 | for x in labels: 109 | temp = temp | get_anchestors(go, x) 110 | temp.discard(BIOLOGICAL_PROCESS) 111 | new_bp.append(list(temp)) 112 | 113 | uniprot['cc'] = new_cc 114 | uniprot['mf'] = new_mf 115 | uniprot['bp'] = new_bp 116 | 117 | num_cc_after = sum(len(x) for x in uniprot['cc']) 118 | num_mf_after = sum(len(x) for x in uniprot['mf']) 119 | num_bp_after = sum(len(x) for x in uniprot['bp']) 120 | print "number of CCs, after enrich", num_cc_after 121 | print "number of MFs, after enrich", num_mf_after 122 | print "number of BPs, after enrich", num_bp_after 123 | 124 | 125 | 126 | #### filter GO terms by the number of occurence 127 | print("filter GO terms by the number of occurence...") 128 | # filter GO by the number of occurence 129 | mf_items = [item for sublist in uniprot['mf'] for item in sublist] 130 | mf_unique_elements, mf_counts_elements = np.unique(mf_items, return_counts=True) 131 | bp_items = [item for sublist in uniprot['bp'] for item in sublist] 132 | bp_unique_elements, bp_counts_elements = np.unique(bp_items, return_counts=True) 133 | cc_items = [item for sublist in uniprot['cc'] for item in sublist] 134 | cc_unique_elements, cc_counts_elements = np.unique(cc_items, return_counts=True) 135 | 136 | mf_list = 
mf_unique_elements[np.where(mf_counts_elements >= 10)] 137 | cc_list = cc_unique_elements[np.where(cc_counts_elements >= 10)] 138 | bp_list = bp_unique_elements[np.where(bp_counts_elements >= 10)] 139 | 140 | temp_mf = uniprot['mf'].apply(lambda x: list(set(x) & set(mf_list))) 141 | uniprot['filter_mf'] = temp_mf 142 | temp_cc = uniprot['cc'].apply(lambda x: list(set(x) & set(cc_list))) 143 | uniprot['filter_cc'] = temp_cc 144 | temp_bp = uniprot['bp'].apply(lambda x: list(set(x) & set(bp_list))) 145 | uniprot['filter_bp'] = temp_bp 146 | 147 | # write out filtered ontology lists 148 | def write_go_list(ontology,ll): 149 | filename = os.path.join(args.data_path, args.species, ontology+"_list.txt") 150 | with open(filename,'w') as f: 151 | for x in ll: 152 | f.write(x + '\n') 153 | print("writing go term list...") 154 | write_go_list('cc',cc_list) 155 | write_go_list('mf',mf_list) 156 | write_go_list('bp',bp_list) 157 | 158 | 159 | 160 | #### encode GO terms 161 | print("encoding GO terms...") 162 | mf_dict = dict(zip(list(mf_list),range(len(mf_list)))) 163 | cc_dict = dict(zip(list(cc_list),range(len(cc_list)))) 164 | bp_dict = dict(zip(list(bp_list),range(len(bp_list)))) 165 | mf_encoding = [[0]*len(mf_dict) for i in range(len(uniprot))] 166 | cc_encoding = [[0]*len(cc_dict) for i in range(len(uniprot))] 167 | bp_encoding = [[0]*len(bp_dict) for i in range(len(uniprot))] 168 | 169 | for i,row in uniprot.iterrows(): 170 | for x in row['filter_mf']: 171 | mf_encoding[i][ mf_dict[x] ] = 1 172 | for x in row['filter_cc']: 173 | cc_encoding[i][ cc_dict[x] ] = 1 174 | for x in row['filter_bp']: 175 | bp_encoding[i][ bp_dict[x] ] = 1 176 | 177 | uniprot['cc_label'] = cc_encoding 178 | uniprot['mf_label'] = mf_encoding 179 | uniprot['bp_label'] = bp_encoding 180 | 181 | uniprot.drop(columns=['mf','cc','bp','Gene ontology (biological process)', 182 | 'Gene ontology (cellular component)', 183 | 'Gene ontology (molecular function)'],inplace=True) 184 | 185 | 186 | 187 | #### encode amino acid sequence using CT 188 | print("encode amino acid sequence using CT...") 189 | def CT(sequence): 190 | classMap = {'G':'1','A':'1','V':'1','L':'2','I':'2','F':'2','P':'2', 191 | 'Y':'3','M':'3','T':'3','S':'3','H':'4','N':'4','Q':'4','W':'4', 192 | 'R':'5','K':'5','D':'6','E':'6','C':'7'} 193 | 194 | seq = ''.join([classMap[x] for x in sequence]) 195 | length = len(seq) 196 | coding = np.zeros(343,dtype=np.int) 197 | for i in range(length-2): 198 | index = int(seq[i]) + (int(seq[i+1])-1)*7 + (int(seq[i+2])-1)*49 - 1 199 | coding[index] = coding[index] + 1 200 | return coding 201 | 202 | CT_list = [] 203 | for seq in uniprot['Sequence'].values: 204 | CT_list.append(CT(seq)) 205 | uniprot['CT'] = CT_list 206 | 207 | 208 | #### encode subcellular location 209 | print("encode subcellular location...") 210 | 211 | def process_sub_loc(x): 212 | if str(x) == 'nan': 213 | return [] 214 | x = x[22:-1] 215 | # check if exists "Note=" 216 | pos = x.find("Note=") 217 | if pos != -1: 218 | x = x[:(pos-2)] 219 | temp = [t.strip() for t in x.split(".")] 220 | temp = [t.split(";")[0] for t in temp] 221 | temp = [t.split("{")[0].strip() for t in temp] 222 | temp = [x for x in temp if '}' not in x and x != ''] 223 | return temp 224 | 225 | uniprot['Sub_cell_loc'] = uniprot['Subcellular location [CC]'].apply(process_sub_loc) 226 | items = [item for sublist in uniprot['Sub_cell_loc'] for item in sublist] 227 | items = np.unique(items) 228 | sub_mapping = dict(zip(list(items),range(len(items)))) 229 | sub_encoding = [[0]*len(items) 
for i in range(len(uniprot))] 230 | for i,row in uniprot.iterrows(): 231 | for loc in row['Sub_cell_loc']: 232 | sub_encoding[i][ sub_mapping[loc] ] = 1 233 | uniprot['Sub_cell_loc_encoding'] = sub_encoding 234 | uniprot.drop(['Subcellular location [CC]'],axis=1,inplace=True) 235 | 236 | 237 | #### encode protein domains 238 | print("encode protein domains...") 239 | 240 | def process_domain(x): 241 | if str(x) == 'nan': 242 | return [] 243 | temp = [t.strip() for t in x[:-1].split(";")] 244 | return temp 245 | 246 | uniprot['protein-domain'] = uniprot['Cross-reference (Pfam)'].apply(process_domain) 247 | items = [item for sublist in uniprot['protein-domain'] for item in sublist] 248 | unique_elements, counts_elements = np.unique(items, return_counts=True) 249 | items = unique_elements[np.where(counts_elements > 5)] 250 | pro_mapping = dict(zip(list(items),range(len(items)))) 251 | pro_encoding = [[0]*len(items) for i in range(len(uniprot))] 252 | 253 | for i,row in uniprot.iterrows(): 254 | for fam in row['protein-domain']: 255 | if fam in pro_mapping: 256 | pro_encoding[i][ pro_mapping[fam] ] = 1 257 | 258 | uniprot['Pro_domain_encoding'] = pro_encoding 259 | 260 | 261 | #### wirte files 262 | print("write files...") 263 | uniprot.to_pickle(os.path.join(args.data_path, args.species, "features.pkl")) 264 | uniprot[['Entry','Gene names','Cross-reference (STRING)']].to_csv(os.path.join(args.data_path,args.species,"gene_list.csv"), 265 | index_label='ID') 266 | 267 | 268 | 269 | ################################# 270 | ######## process PPIs ########### 271 | 272 | print("Start processing PPIs...") 273 | 274 | string_file = os.path.join(args.data_path, args.species, "string-"+args.species+".txt") 275 | string = pd.read_table(string_file, delimiter=" ") 276 | gene_list = pd.read_csv(os.path.join(args.data_path,args.species,"gene_list.csv")) 277 | 278 | # filter by uniprot 279 | string = string[string['protein1'].isin(gene_list['Cross-reference (STRING)'].values)] 280 | string = string[string['protein2'].isin(gene_list['Cross-reference (STRING)'].values)] 281 | 282 | # map names to indexs 283 | id_mapping = dict(zip(list(gene_list['Cross-reference (STRING)'].values), 284 | list(gene_list['ID'].values))) 285 | string['protein1_id'] = string['protein1'].apply(lambda x:id_mapping[x]) 286 | string['protein2_id'] = string['protein2'].apply(lambda x:id_mapping[x]) 287 | 288 | subnetwork = string[['protein1_id','protein2_id','combined_score']] 289 | subnetwork['combined_score'] = subnetwork['combined_score']/1000.0 290 | subnetwork.to_csv(os.path.join(args.data_path, args.species, "networks/ppi.txt"), index=False, header=False, sep="\t") 291 | 292 | 293 | 294 | ################################### 295 | ######## process similarity ####### 296 | 297 | print("Start processing similarity...") 298 | 299 | def write_fasta(feats): 300 | filename = os.path.join(args.data_path, args.species, "blast/uniprot_seq.fas") 301 | with open(filename, "w") as f: 302 | for i,row in feats.iterrows(): 303 | f.write(">" + row['Entry'] + "\n") 304 | f.write(row['Sequence'] + "\n") 305 | 306 | write_fasta(uniprot) --------------------------------------------------------------------------------