├── .gitignore
├── LICENSE
├── README.md
├── annotation_transfer.py
├── architectures.py
├── assess_performance.py
├── bindEmbed21DL.py
├── bindPredictML17
    ├── README.md
    ├── bindPredict.py
    ├── data
    │   └── blosum_62.txt
    ├── scripts
    │   ├── bindpredict_predictor.py
    │   ├── file_manager.py
    │   ├── helper.py
    │   ├── parser.py
    │   ├── protein.py
    │   └── select_model.py
    └── trained_model
    │   └── mm3_cons_solv_more.pkl
├── config.py
├── data
    ├── development_set
    │   ├── all.fasta
    │   ├── binding_residues_2.5_metal.txt
    │   ├── binding_residues_2.5_nuclear.txt
    │   ├── binding_residues_2.5_small.txt
    │   ├── ids_split1.txt
    │   ├── ids_split2.txt
    │   ├── ids_split3.txt
    │   ├── ids_split4.txt
    │   ├── ids_split5.txt
    │   └── uniprot_test.txt
    └── independent_set
    │   ├── binding_residues_metal.txt
    │   ├── binding_residues_nuclear.txt
    │   ├── binding_residues_small.txt
    │   ├── indep_set.fasta
    │   └── indep_set.txt
├── data_preparation.py
├── develop_bindEmbed21DL.py
├── homology_based_inference.py
├── human_proteome
    ├── hbi_predictions.tar.gz
    └── ml_predictions.tar.gz
├── ml_predictor.py
├── ml_trainer.py
├── mmseqs_wrapper.py
├── pytorchtools.py
├── run_bindEmbed21.py
├── run_bindEmbed21DL.py
├── run_bindEmbed21HBI.py
├── trained_models
    ├── checkpoint1.pt
    ├── checkpoint2.pt
    ├── checkpoint3.pt
    ├── checkpoint4.pt
    └── checkpoint5.pt
└── utils
    └── convert_embeddings.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # C extensions
 7 | *.so
 8 | 
 9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | 
27 | # PyInstaller
28 | #  Usually these files are written by a python script from a template
29 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 | 
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 | 
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 | 
48 | # Translations
49 | *.mo
50 | *.pot
51 | 
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 | 
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 | 
60 | # Scrapy stuff:
61 | .scrapy
62 | 
63 | # Sphinx documentation
64 | docs/_build/
65 | 
66 | # PyBuilder
67 | target/
68 | 
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 | 
72 | # pyenv
73 | .python-version
74 | 
75 | # celery beat schedule file
76 | celerybeat-schedule
77 | 
78 | # dotenv
79 | .env
80 | 
81 | # virtualenv
82 | venv/
83 | ENV/
84 | 
85 | # Spyder project settings
86 | .spyderproject
87 | 
88 | # Rope project settings
89 | .ropeproject
90 | 
91 | .idea
92 | .DS_Store


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Rostlab
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # bindEmbed21
 2 | 
 3 | bindEmbed21 is a method to predict whether a residue in a protein is binding to metal ions, nucleic acids (DNA or RNA), or small molecules. Towards this end, bindEmbed21 combines homology-based inference and Machine Learning. Homology-based inference is executed using MMseqs2 [1]. For the Machine Learning method, bindEmbed21DL uses ProtT5 embeddings [2] as input to a 2-layer CNN. Since bindEmbed21 is based on single sequences, it can easily be applied to any protein sequence.
 4 | 
 5 | ## Usage
 6 | 
 7 | `run_bindEmbed21DL.py` shows an example of how to generate binding residue predictions using the Machine Learning part of bindEmbed21 (bindEmbed21DL).
 8 | 
 9 | `run_bindEmbed21HBI.py` shows an example of how to generate binding residue predictions using the homology-based inference part of bindEmbed21 (bindEmbed21HBI).
10 | 
11 | `run_bindEmbed21.py` combines ML and HBI into the final method bindEmbed21.
12 | 
13 | `develop_bindEmbed21DL.py` provides the code to reproduce the bindEmbed21DL development (hyperparameter optimization, training, performance assessment on the test set).
14 | 
15 | All needed files and paths can be set in `config.py` (marked as TODOs).
16 | 
17 | ## Data
18 | 
19 | ### Development Set
20 | 
21 | The data set used for training and testing was extracted from BioLip [3]. The UniProt identifiers for the 5 splits used during cross-validation (DevSet1014), the test set (TestSet300), and the independent set of proteins added to BioLip after November 2019 (TestSetNew46) as well as the corresponding FASTA sequences and used binding annotations are made available in the `data` folder.
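A minimal sketch of how five predefined splits can drive cross-validation: each split is held out once for validation while the other four are used for training. The identifiers below are hypothetical placeholders (the real ones are in `data/development_set/ids_split1.txt` to `ids_split5.txt`), and `iter_folds` is an illustrative helper, not a function from this repository, which relies on scikit-learn's `PredefinedSplit` in `bindEmbed21DL.py`.

```python
# Hypothetical protein identifiers grouped by split number (1..5).
splits = {
    1: ["P00001", "P00002"],
    2: ["P00003"],
    3: ["P00004", "P00005"],
    4: ["P00006"],
    5: ["P00007"],
}


def iter_folds(splits):
    """Yield (held_out_split, train_ids, validation_ids) for every split."""
    for held_out in sorted(splits):
        # The held-out split is used for validation only.
        validation_ids = list(splits[held_out])
        # All remaining splits together form the training set.
        train_ids = [pid for s in sorted(splits) if s != held_out for pid in splits[s]]
        yield held_out, train_ids, validation_ids


folds = list(iter_folds(splits))
for held_out, train_ids, validation_ids in folds:
    print(held_out, len(train_ids), len(validation_ids))
```

Because the splits are fixed in files rather than drawn at random, the same proteins end up in the same fold on every run, which keeps results comparable across hyperparameter settings.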
22 | 
23 | The trained models are available in the `trained_models` folder.
24 | 
25 | ProtT5 embeddings can be generated using [the bio_embeddings pipeline](https://github.com/sacdallago/bio_embeddings) [4]. To use them with `bindEmbed21`, they need to be converted to use the correct keys. A script for the conversion can be found in the folder `utils`.
26 | 
27 | ### Sets for homology-based inference
28 | For the homology-based inference (bindEmbed21HBI), query proteins will be aligned against big80 to generate profiles. Those profiles are then searched against a lookup set of proteins with known binding residues. The pre-computed MMseqs2 database files and the FASTA file for the lookup database can be downloaded here:
29 | 
30 | * Pre-computed big80 DB: [ftp://rostlab.org/bindEmbed21/profile_db.tar.gz](ftp://rostlab.org/bindEmbed21/profile_db.tar.gz)
31 | * Pre-computed lookup DB: [ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz](ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz)
32 | * FASTA for lookup DB: [ftp://rostlab.org/bindEmbed21/lookup.fasta](ftp://rostlab.org/bindEmbed21/lookup.fasta)
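Once a hit in the lookup set is found, its binding annotations are transferred to the query along the local alignment. The following is a simplified sketch of that transfer, modelled on `_get_indices_seq` and `_infer_binding_local_alignment` in `annotation_transfer.py`. It assumes one label per residue instead of the three ligand classes used in the repository, and the toy alignment and the function names `aligned_indices` and `transfer_annotations` are hypothetical.

```python
def aligned_indices(start, aligned_seq):
    """Map each alignment column to a 1-based sequence position (0 for a gap)."""
    indices = []
    pos = start
    for char in aligned_seq:
        if char == "-":
            indices.append(0)  # gap: no residue at this column
        else:
            indices.append(pos)
            pos += 1
    return indices


def transfer_annotations(query_len, qstart, qaln, tstart, taln, target_labels):
    """Copy labels from the target onto aligned query positions.

    Query positions outside the alignment or aligned to a gap stay at -1.
    """
    inferred = [-1] * query_len
    q_idx = aligned_indices(qstart, qaln)
    t_idx = aligned_indices(tstart, taln)
    for q_pos, t_pos in zip(q_idx, t_idx):
        if q_pos >= 1 and t_pos >= 1:  # both columns hold a residue
            inferred[q_pos - 1] = target_labels[t_pos - 1]
    return inferred


# Toy example: the aligned region starts at query position 2 and target position 1.
inferred = transfer_annotations(
    query_len=6, qstart=2, qaln="AC-DE", tstart=1, taln="ACGD-",
    target_labels=[0, 1, 0, 1],
)
print(inferred)  # [-1, 0, 1, 1, -1, -1]
```

Positions that remain at -1 are the ones for which no annotation could be inferred.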
33 | 
34 | ### Human proteome predictions
35 | 
36 | We applied bindEmbed21DL as well as homology-based inference to the entire human proteome. While annotations were only available for 15% of the human proteins, homology-based inference allowed transferring annotations for 48% (9,694) and bindEmbed21DL provided binding predictions for 92% (18,663) of the human proteome. Both sets of predictions are available in the folder `human_proteome`. For predictions made using homology-based inference, values of -1.0 refer to positions that were not inferred and were therefore considered non-binding.
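As a small illustration of the -1.0 convention, the sketch below turns per-residue values into binary calls and counts how many positions were actually inferred. The list of values is made up, the layout of the prediction files is not assumed here, and the cutoff of 0.5 as well as the helper `to_binary` are assumptions for this example.

```python
def to_binary(values, cutoff=0.5):
    """Turn per-residue values into 0/1 calls; -1.0 (not inferred) becomes 0."""
    return [1 if v >= cutoff else 0 for v in values]


# Hypothetical values for one ligand class of a five-residue protein.
hbi_values = [-1.0, 0.0, 1.0, 1.0, -1.0]

binary = to_binary(hbi_values)
n_inferred = sum(1 for v in hbi_values if v != -1.0)

print(binary)      # [0, 0, 1, 1, 0]
print(n_inferred)  # 3
```

Keeping -1.0 in the stored files makes it possible to tell "inferred as non-binding" apart from "not covered by the alignment", even though both count as non-binding in the final call.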
37 | 
38 | ## Availability
39 | 
40 | bindEmbed21 is also part of [the bio_embeddings pipeline](https://github.com/sacdallago/bio_embeddings) [4]. In addition, predictions of bindEmbed21DL can be run and visualized on a predicted 3D structure using [LambdaPP](https://embed.predictprotein.org/) [5].
41 | 
42 | ## Requirements
43 | 
44 | bindEmbed21 is written in Python 3, which has to be installed locally to run it. Additionally, the following Python packages have to be installed (`assess_performance.py` also imports `scipy`):
45 | - numpy
46 | - scikit-learn
47 | - torch
48 | - pandas
49 | - h5py
50 | 
51 | To be able to run homology-based inference, MMseqs2 has to be locally installed. Otherwise, it is also possible to only run the Machine Learning part of bindEmbed21 (bindEmbed21DL).
52 | 
53 | ## Cite
54 | 
55 | If you use this method and find it helpful, we would appreciate it if you could cite the following publication:
56 | 
57 | Littmann M, Heinzinger M, Dallago C, Weissenow K, Rost B. Protein embeddings and deep learning predict binding residues for various ligand classes. *Sci Rep* **11**, 23916 (2021). https://doi.org/10.1038/s41598-021-03431-4
58 | 
59 | 
60 | ## References
61 | [1] Steinegger M, Söding J (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35.
62 | 
63 | [2] Elnaggar A, Heinzinger M, Dallago C, Rihawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Bhowmik D, Rost B (2021). ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. bioRxiv.
64 | 
65 | [3] Yang J, Roy A, Zhang Y (2013). BioLip: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41.
66 | 
67 | [4] Dallago C, Schütze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, Yang KK, Min S, Yoon S, Morton JT, & Rost B (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113
68 | 
69 | [5] Olenyi T, Marquet C, Heinzinger M, Kröger B, Nikolova T, Bernhofer M, Sändig P, Schütze K, Littmann M, Mirdita M, Steinegger M, Dallago C, & Rost B (2022). LambdaPP: Fast and accessible protein-specific phenotype predictions. bioRxiv
70 | 
71 | 
72 | ## bindPredictML17
73 | If you are interested in the predecessor of bindEmbed21, bindPredictML17, you can find all relevant information in the subfolder `bindPredictML17`.
74 | 


--------------------------------------------------------------------------------
/annotation_transfer.py:
--------------------------------------------------------------------------------
  1 | from homology_based_inference import MMseqsPredictor
  2 | from config import FileManager
  3 | 
  4 | import numpy
  5 | import random
  6 | 
  7 | 
  8 | class Inference(object):
  9 | 
 10 |     def __init__(self, prot_ids, labels, sequences, rand_pred=False):
 11 |         self.ids = prot_ids
 12 |         self.labels = labels
 13 |         self.sequences = sequences
 14 | 
 15 |         self.rand_pred = rand_pred
 16 |         self.rnd_probas = {'metal': 0.147, 'nuclear': 0.030, 'small': 0.090}
 17 | 
 18 |     def infer_binding_annotations_seq(self, e_val, criterion, set_name, exclude_self_hits=True):
 19 | 
 20 |         # get hits from MMseqs2 alignments
 21 |         mmseqs_pred = MMseqsPredictor(exclude_self_hits)
 22 |         hits, mmseqs_file = mmseqs_pred.get_mmseqs_hits(self.ids, e_val, criterion, set_name)
 23 | 
 24 |         inferred_labels, proteins_with_hit = self._infer_binding_local_alignment(hits, mmseqs_file)
 25 | 
 26 |         return inferred_labels, proteins_with_hit
 27 | 
 28 |     def _infer_binding_local_alignment(self, hits, mmseqs_file):
 29 |         """
 30 |         Get binding annotations from local alignment
 31 |         :param hits: Found hits
 32 |         :param mmseqs_file: File with local alignments from MMseqs2
 33 |         :return: Inferred labels per protein and the ligand classes inferred for each protein with a hit
 34 |         """
 35 | 
 36 |         inferred_labels = dict()
 37 |         proteins_with_hit = dict()
 38 | 
 39 |         mmseqs = FileManager.read_mmseqs_alignments(mmseqs_file)
 40 | 
 41 |         for i in hits.keys():
 42 |             hit = list(hits[i].keys())[0]
 43 | 
 44 |             alignment = mmseqs[i][hit]
 45 |             indices1, indices2 = Inference._get_index_mapping_mmseqs(alignment)
 46 | 
 47 |             binding_annotation = self.labels[hit]
 48 |             # set all positions to -1; replace inferred positions with 0 (non-binding) or 1 (binding)
 49 |             inferred_annotation = numpy.zeros([len(self.sequences[i]), 3], dtype=numpy.float32) - 1
 50 | 
 51 |             for idx, pos2 in enumerate(indices2):
 52 |                 pos1 = indices1[idx] - 1
 53 | 
 54 |                 if pos1 >= 0 and pos2 >= 1:  # both positions are aligned
 55 |                     anno = binding_annotation[pos2 - 1]
 56 |                     inferred_annotation[pos1] = anno
 57 | 
 58 |             if 1 in inferred_annotation:
 59 |                 # any binding annotation in the aligned region
 60 |                 inferred_labels[i] = inferred_annotation
 61 |                 ligands = {'metal': 0, 'nuclear': 0, 'small': 0}
 62 | 
 63 |                 if 1 in inferred_annotation[:, 0]:
 64 |                     ligands['metal'] = 1
 65 |                 if 1 in inferred_annotation[:, 1]:
 66 |                     ligands['nuclear'] = 1
 67 |                 if 1 in inferred_annotation[:, 2]:
 68 |                     ligands['small'] = 1
 69 | 
 70 |                 proteins_with_hit[i] = ligands
 71 | 
 72 |         for i in self.ids:
 73 |             if i not in inferred_labels.keys():  # no hit generated
 74 |                 inferred_annotation = numpy.zeros([len(self.sequences[i]), 3], dtype=numpy.float32) - 1
 75 | 
 76 |                 # generate random prediction for proteins without a hit
 77 |                 if self.rand_pred:
 78 | 
 79 |                     metal_indices = random.choices([0,  1],
 80 |                                                    weights=[1-self.rnd_probas['metal'], self.rnd_probas['metal']],
 81 |                                                    k=len(self.sequences[i]))
 82 |                     nuclear_indices = random.choices([0, 1],
 83 |                                                      weights=[1-self.rnd_probas['nuclear'], self.rnd_probas['nuclear']],
 84 |                                                      k=len(self.sequences[i]))
 85 |                     small_indices = random.choices([0, 1],
 86 |                                                    weights=[1-self.rnd_probas['small'], self.rnd_probas['small']],
 87 |                                                    k=len(self.sequences[i]))
 88 | 
 89 |                     inferred_annotation[:, 0] = metal_indices
 90 |                     inferred_annotation[:, 1] = nuclear_indices
 91 |                     inferred_annotation[:, 2] = small_indices
 92 | 
 93 |                 inferred_labels[i] = inferred_annotation
 94 | 
 95 |         return inferred_labels, proteins_with_hit
 96 | 
 97 |     @staticmethod
 98 |     def _get_index_mapping_mmseqs(alignment):
 99 |         """
100 |         Get indices of alignment position in actual sequence
101 |         :param alignment: Local alignment of 2 sequences
102 |         :return: Indices for 2 sequences
103 |         """
104 | 
105 |         start1 = alignment['qstart']
106 |         start2 = alignment['tstart']
107 | 
108 |         seq1 = alignment['qaln']
109 |         seq2 = alignment['taln']
110 | 
111 |         indices1 = Inference._get_indices_seq(start1, seq1)
112 |         indices2 = Inference._get_indices_seq(start2, seq2)
113 | 
114 |         return indices1, indices2
115 | 
116 |     @staticmethod
117 |     def _get_indices_seq(start, seq):
118 |         """
119 |         Get sequence indices of aligned sequence
120 |         :param start: Start index of aligned sequence
121 |         :param seq: Aligned sequence (incl. gaps)
122 |         :return: Sequence indices for position in aligned sequence
123 |         """
124 | 
125 |         indices = []
126 | 
127 |         for i in range(0, len(seq)):
128 |             if seq[i] == '-':  # position is gap
129 |                 indices.append(0)
130 |             else:
131 |                 indices.append(start)
132 |                 start += 1
133 | 
134 |         return indices
135 | 


--------------------------------------------------------------------------------
/architectures.py:
--------------------------------------------------------------------------------
 1 | import torch.nn
 2 | 
 3 | 
 4 | class CNN2Layers(torch.nn.Module):
 5 | 
 6 |     def __init__(self, in_channels, feature_channels, kernel_size, stride, padding, dropout):
 7 |         super(CNN2Layers, self).__init__()
 8 |         self.conv1 = torch.nn.Sequential(
 9 |             torch.nn.Conv1d(in_channels=in_channels, out_channels=feature_channels, kernel_size=kernel_size,
10 |                             stride=stride, padding=padding),
11 |             torch.nn.ELU(),
12 |             torch.nn.Dropout(dropout),
13 | 
14 |             torch.nn.Conv1d(in_channels=feature_channels, out_channels=3, kernel_size=kernel_size,
15 |                             stride=stride, padding=padding),
16 |         )
17 | 
18 |     def forward(self, x):
19 |         x = self.conv1(x)
20 |         return torch.squeeze(x)
21 | 


--------------------------------------------------------------------------------
/assess_performance.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import math
  3 | import numpy as np
  4 | from scipy.stats import t
  5 | 
  6 | 
  7 | class ModelPerformance(object):
  8 | 
  9 |     """Wrapper to save performance values per model"""
 10 | 
 11 |     def __init__(self):
 12 | 
 13 |         self.losses = []
 14 | 
 15 |         self.precisions = []
 16 |         self.recalls = []
 17 |         self.f1s = []
 18 | 
 19 |         self.neg_precisions = []
 20 |         self.neg_recalls = []
 21 |         self.neg_f1s = []
 22 | 
 23 |         self.accs = []
 24 |         self.mccs = []
 25 | 
 26 |         self.confusion_matrix = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
 27 |         self.cross_prediction = [0, 0, 0, 0]
 28 | 
 29 |         self.coverage = 0
 30 |         self.predictions = {'bound_predicted': 0, 'bound_not_predicted': 0, 'not_bound_predicted': 0,
 31 |                             'not_bound_not_predicted': 0}
 32 | 
 33 |     def add_single_performance(self, loss, acc, prec, recall, f1, mcc):
 34 |         """Add performance for one protein"""
 35 | 
 36 |         self.losses.append(loss)
 37 |         self.accs.append(acc)
 38 |         self.precisions.append(prec)
 39 |         self.recalls.append(recall)
 40 |         self.f1s.append(f1)
 41 |         self.mccs.append(mcc)
 42 | 
 43 |     def add_single_performance_negatives(self, loss, acc, prec, recall, f1, neg_prec, neg_recall, neg_f1, mcc):
 44 |         """Add performance for one protein (incl. negative performance values)"""
 45 | 
 46 |         self.add_single_performance(loss, acc, prec, recall, f1, mcc)
 47 |         self.neg_precisions.append(neg_prec)
 48 |         self.neg_recalls.append(neg_recall)
 49 |         self.neg_f1s.append(neg_f1)
 50 | 
 51 |     def add_confusion_matrix(self, tp, fp, tn, fn):
 52 |         """Add TP, FP, TN, FN to confusion matrix"""
 53 | 
 54 |         self.confusion_matrix['tp'] += tp
 55 |         self.confusion_matrix['fp'] += fp
 56 |         self.confusion_matrix['tn'] += tn
 57 |         self.confusion_matrix['fn'] += fn
 58 | 
 59 |         if (tp + fp) > 0:
 60 |             self.coverage += 1
 61 |             if (tp + fn) > 0:
 62 |                 self.predictions['bound_predicted'] += 1
 63 |             else:
 64 |                 self.predictions['not_bound_predicted'] += 1
 65 |         else:
 66 |             if fn > 0:
 67 |                 self.predictions['bound_not_predicted'] += 1
 68 |             else:
 69 |                 self.predictions['not_bound_not_predicted'] += 1
 70 | 
 71 |     def add_cross_prediction(self, cross_prediction):
 72 |         """Add cross prediction"""
 73 |         for i in range(0, len(cross_prediction)):
 74 |             self.cross_prediction[i] += cross_prediction[i]
 75 | 
 76 |     def get_mean_performance(self):
 77 |         """Calculate average performance values"""
 78 |         loss = np.average(self.losses)
 79 |         acc = np.average(self.accs)
 80 |         precision = np.average(self.precisions)
 81 |         recall = np.average(self.recalls)
 82 |         f1 = np.average(self.f1s)
 83 |         mcc = np.average(self.mccs)
 84 | 
 85 |         return loss, acc, precision, recall, f1, mcc
 86 | 
 87 |     def get_mean_ci_performance(self):
 88 |         """Calculate average performance values and 95% CIs"""
 89 |         acc, acc_ci = ModelPerformance._get_mean_ci(self.accs)
 90 |         recall, recall_ci = ModelPerformance._get_mean_ci(self.recalls)
 91 |         prec, prec_ci = ModelPerformance._get_mean_ci(self.precisions)
 92 |         f1, f1_ci = ModelPerformance._get_mean_ci(self.f1s)
 93 |         mcc, mcc_ci = ModelPerformance._get_mean_ci(self.mccs)
 94 | 
 95 |         return acc, prec, recall, f1, mcc, acc_ci, prec_ci, recall_ci, f1_ci, mcc_ci
 96 | 
 97 |     def get_mean_ci_performance_negatives(self):
 98 |         """Calculate average performance values for negative class and 95% CIs"""
 99 | 
100 |         neg_recall, neg_recall_ci = ModelPerformance._get_mean_ci(self.neg_recalls)
101 |         neg_prec, neg_prec_ci = ModelPerformance._get_mean_ci(self.neg_precisions)
102 |         neg_f1, neg_f1_ci = ModelPerformance._get_mean_ci(self.neg_f1s)
103 | 
104 |         return neg_prec, neg_recall, neg_f1, neg_prec_ci, neg_recall_ci, neg_f1_ci
105 | 
106 |     @staticmethod
107 |     def _get_mean_ci(vec):
108 |         """
109 |         Calculate mean and 95% CI for a given vector
110 |         :param vec: vector
111 |         :return: mean and ci
112 |         """
113 |         mean = round(np.average(vec), 3)
114 |         if len(vec) > 1:
115 |             ci = round(np.std(vec)/math.sqrt(len(vec)) * t.ppf((1 + 0.95) / 2, len(vec)), 3)
116 |         else:
117 |             ci = 0
118 | 
119 |         return mean, ci
120 | 
121 | 
122 | class PerformanceEpochs(object):
123 |     """
124 |     Wrapper to save performance values per epoch
125 |     """
126 | 
127 |     def __init__(self):
128 | 
129 |         self.loss_epochs = []
130 |         self.mcc_epochs = []
131 |         self.prec_epochs = []
132 |         self.recall_epochs = []
133 |         self.f1_epochs = []
134 |         self.acc_epochs = []
135 | 
136 |     def get_performance_last_epoch(self):
137 |         """Get performance for last epoch"""
138 |         loss = self.loss_epochs[-1]
139 |         mcc = self.mcc_epochs[-1]
140 |         prec = self.prec_epochs[-1]
141 |         recall = self.recall_epochs[-1]
142 |         f1 = self.f1_epochs[-1]
143 |         acc = self.acc_epochs[-1]
144 | 
145 |         return loss, acc, prec, recall, f1, mcc
146 | 
147 |     def add_performance_epoch(self, loss, mcc, prec, recall, f1, acc):
148 |         """Add performance for one epoch"""
149 |         self.loss_epochs.append(loss)
150 |         self.acc_epochs.append(acc)
151 |         self.mcc_epochs.append(mcc)
152 |         self.f1_epochs.append(f1)
153 |         self.prec_epochs.append(prec)
154 |         self.recall_epochs.append(recall)
155 | 
156 |     @staticmethod
157 |     def get_performance_batch(pred, target):
158 |         """Calculate performance for one batch"""
159 | 
160 |         tp, fp, tn, fn = PerformanceAssessment.evaluate_per_residue_torch(pred, target)
161 |         acc, prec, rec, f1, mcc = PerformanceAssessment.calc_performance_measurements(tp, fp, tn, fn)
162 | 
163 |         return tp, fp, tn, fn, acc, prec, rec, f1, mcc
164 | 
165 | 
166 | class PerformanceAssessment(object):
167 | 
168 |     @staticmethod
169 |     def evaluate_per_residue_torch(prediction, target):
170 |         """Calculate tp, fp, tn, fn for tensor"""
171 |         # reduce prediction & target to one dimension
172 |         prediction = prediction.t()
173 |         target = target.t()
174 |         prediction = torch.sum(torch.ge(prediction, 0.5), 1)
175 |         target = torch.sum(torch.ge(target, 0.5), 1)
176 | 
177 |         # get confusion matrix
178 |         tp = torch.sum(torch.ge(prediction, 0.5) * torch.ge(target, 0.5))
179 |         tn = torch.sum(torch.lt(prediction, 0.5) * torch.lt(target, 0.5))
180 |         fp = torch.sum(torch.ge(prediction, 0.5) * torch.lt(target, 0.5))
181 |         fn = torch.sum(torch.lt(prediction, 0.5) * torch.ge(target, 0.5))
182 | 
183 |         return tp, fp, tn, fn
184 | 
185 |     @staticmethod
186 |     def calc_performance_measurements(tp, fp, tn, fn):
187 |         """Calculate precision, recall, f1, mcc, and accuracy"""
188 | 
189 |         tp = float(tp)
190 |         fp = float(fp)
191 |         fn = float(fn)
192 |         tn = float(tn)
193 | 
194 |         recall = prec = f1 = mcc = 0
195 |         acc = round((tp + tn) / (tp + tn + fn + fp), 3)
196 | 
197 |         if tp > 0 or fn > 0:
198 |             recall = round(tp / (tp + fn), 3)
199 |         if tp > 0 or fp > 0:
200 |             prec = round(tp / (tp + fp), 3)
201 |         if recall > 0 or prec > 0:
202 |             f1 = round(2 * recall * prec / (recall + prec), 3)
203 |         if (tp > 0 or fp > 0) and (tp > 0 or fn > 0) and (tn > 0 or fp > 0) and (tn > 0 or fn > 0):
204 |             mcc = round((tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)), 3)
205 | 
206 |         return acc, prec, recall, f1, mcc
207 | 
208 |     @staticmethod
209 |     def combine_protein_performance(proteins, cutoff, labels):
210 |         """
211 |         Calculate final model performance from per-protein performances
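212 |         :param proteins: Dictionary mapping protein ids to protein objects (providing set_labels and calc_performance_measurements)
213 |         :param cutoff: Cutoff passed to each protein's calc_performance_measurements
214 |         :param labels: Dictionary mapping protein ids to their binding labels
215 |         :return: Dictionary of ModelPerformance objects with keys 'overall', 'metal', 'nucleic', 'small'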
216 |         """
217 |         model_performances = {'overall': ModelPerformance(), 'metal': ModelPerformance(),
218 |                               'nucleic': ModelPerformance(), 'small': ModelPerformance()}
219 | 
220 |         for p in proteins.keys():
221 |             prot = proteins[p]
222 |             prot.set_labels(labels[p])
223 | 
224 |             # calculate performance for protein
225 |             performance = prot.calc_performance_measurements(cutoff)
226 | 
227 |             for k in performance.keys():
228 |                 model_performance = model_performances[k]
229 |                 if 'tp' in performance[k].keys():
230 |                     tp = performance[k]['tp']
231 |                     fp = performance[k]['fp']
232 |                     fn = performance[k]['fn']
233 |                     tn = performance[k]['tn']
234 |                     if k == 'overall' and (tp + fn) == 0:
235 |                         print('No residues annotated as binding for {}'.format(p))
236 |                     if (tp + fp + fn) > 0:
237 |                         # only add performance if this protein binds or if one residue was predicted to bind
238 |                         model_performance.add_single_performance(0, performance[k]['acc'], performance[k]['prec'],
239 |                                                                  performance[k]['recall'], performance[k]['f1'],
240 |                                                                  performance[k]['mcc'])
241 |                         model_performance.add_confusion_matrix(tp, fp, tn, fn)
242 |                     else:
243 |                         model_performance.predictions['not_bound_not_predicted'] += 1
244 | 
245 |         return model_performances
246 | 
247 |     @staticmethod
248 |     def write_performance_results(model_performances, out_file):
249 |         """Write average performance"""
250 |         with open(out_file, 'w') as out:
251 |             out.write('Type\ttp\tfp\ttn\tfn\tprec\tprec.ci\trecall\trecall.ci\tf1\tf1.ci\tmcc\tmcc.ci\tacc\tacc.ci\n')
252 |             for k in model_performances.keys():
253 |                 model_performance = model_performances[k]
254 |                 acc, pr, rec, f1, mcc, acc_ci, pr_ci, rec_ci, f1_ci, mcc_ci = \
255 |                     model_performance.get_mean_ci_performance()
256 | 
257 |                 confusion_matrix = model_performance.confusion_matrix
258 |                 tp = confusion_matrix['tp']
259 |                 fp = confusion_matrix['fp']
260 |                 tn = confusion_matrix['tn']
261 |                 fn = confusion_matrix['fn']
262 | 
263 |                 out.write('{}\t{}\t{}\t{}\t{}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}'
264 |                           '\t{:.3f}\n'.format(k, tp, fp, tn, fn, pr, pr_ci, rec, rec_ci, f1, f1_ci, mcc, mcc_ci, acc,
265 |                                               acc_ci))
266 | 
267 |     @staticmethod
268 |     def print_performance_results(model_performances):
269 |         """Print average performance"""
270 | 
271 |         for k in model_performances.keys():
272 |             print(k)
273 |             model_performance = model_performances[k]
274 |             acc, pr, rec, f1, mcc, acc_ci, pr_ci, rec_ci, f1_ci, mcc_ci = \
275 |                 model_performance.get_mean_ci_performance()
276 | 
277 |             cov_proteins = model_performance.coverage
278 |             if len(model_performance.accs) > 0:
279 |                 cov_percentage = cov_proteins / len(model_performance.accs)
280 |             else:
281 |                 cov_percentage = 0.0
282 | 
283 |             confusion_matrix = model_performance.confusion_matrix
284 |             predictions = model_performance.predictions
285 | 
286 |             print('CovOneBind: {} ({:.3f})'.format(cov_proteins, cov_percentage))
287 |             print('Bound: With predictions: {}, Without predictions: {}\nNot Bound: With predictions: {}, '
288 |                   'Without predictions: {}'.format(predictions['bound_predicted'], predictions['bound_not_predicted'],
289 |                                                    predictions['not_bound_predicted'],
290 |                                                    predictions['not_bound_not_predicted']))
291 |             print('TP: {}, FP: {}, TN: {}, FN: {}'.format(confusion_matrix['tp'], confusion_matrix['fp'],
292 |                                                           confusion_matrix['tn'], confusion_matrix['fn']))
293 |             print("Prec: {:.3f} +/- {:.3f}, Recall: {:.3f} +/- {:.3f}, F1: {:.3f} +/- {:.3f}, "
294 |                   "MCC: {:.3f} +/- {:.3f}, Acc: {:.3f} +/- {:.3f}".format(pr, pr_ci, rec, rec_ci, f1, f1_ci, mcc,
295 |                                                                           mcc_ci, acc, acc_ci))
296 | 
297 |     @staticmethod
298 |     def print_cross_prediction_results(model_performances):
299 |         """Print cross-predictions"""
300 | 
301 |         for k in model_performances.keys():
302 |             print(k)
303 |             cross_predictions = model_performances[k].cross_prediction
304 |             print('Metal: {}, Nucleic: {}, Small: {}, Non-Binding: {}'.format(cross_predictions[0],
305 |                                                                               cross_predictions[1],
306 |                                                                               cross_predictions[2],
307 |                                                                               cross_predictions[3]))
308 | 


--------------------------------------------------------------------------------
/bindEmbed21DL.py:
--------------------------------------------------------------------------------
  1 | from data_preparation import ProteinInformation
  2 | from ml_trainer import MLTrainer
  3 | from ml_predictor import MLPredictor
  4 | from config import FileSetter, FileManager
  5 | from architectures import CNN2Layers
  6 | 
  7 | import numpy as np
  8 | from sklearn.model_selection import PredefinedSplit
  9 | 
 10 | 
 11 | class BindEmbed21DL(object):
 12 | 
 13 |     @staticmethod
 14 |     def cross_train_pipeline(params, model_output, predictions_output, ri):
 15 |         """
 16 |         Run cross-training pipeline for a specific set of parameters
 17 |         :param params:
 18 |         :param model_output: If None, trained model is not written
 19 |         :param predictions_output: If None, predictions are not written
 20 |         :param ri: Should RI or raw probabilities be written?
 21 |         :return:
 22 |         """
 23 | 
 24 |         print("Prepare data")
 25 |         ids = []
 26 |         fold_array = []
 27 |         for s in range(1, 6):
 28 |             ids_in = '{}{}.txt'.format(FileSetter.split_ids_in(), s)
 29 |             split_ids = FileManager.read_ids(ids_in)
 30 | 
 31 |             ids += split_ids
 32 |             fold_array += [s] * len(split_ids)
 33 | 
 34 |         ps = PredefinedSplit(fold_array)
 35 | 
 36 |         # get sequences + maximum length + labels
 37 |         sequences, max_length, labels = ProteinInformation.get_data(ids)
 38 |         embeddings = FileManager.read_embeddings(FileSetter.embeddings_input())
 39 | 
 40 |         proteins = dict()
 41 |         trainer = MLTrainer(pos_weights=params['weights'])
 42 | 
 43 |         for train_index, test_index in ps.split():
 44 | 
 45 |             split_counter = fold_array[test_index[0]]
 46 |             train_ids = [ids[train_idx] for train_idx in train_index]
 47 |             validation_ids = [ids[test_idx] for test_idx in test_index]
 48 | 
 49 |             print("Train model")
 50 |             model_split = trainer.train_validate(params, train_ids, validation_ids, sequences, embeddings, labels,
 51 |                                                  max_length, verbose=False)
 52 | 
 53 |             if model_output is not None:
 54 |                 model_path = '{}{}.pt'.format(model_output, split_counter)
 55 |                 FileManager.save_classifier_torch(model_split, model_path)
 56 | 
 57 |             print("Calculate predictions per protein")
 58 |             ml_predictor = MLPredictor(model_split)
 59 |             curr_proteins = ml_predictor.predict_per_protein(validation_ids, sequences, embeddings, labels, max_length)
 60 | 
 61 |             proteins = {**proteins, **curr_proteins}
 62 | 
 63 |             if predictions_output is not None:
 64 |                 FileManager.write_predictions(proteins, predictions_output, 0.5, ri)
 65 | 
 66 |         return proteins
 67 | 
 68 |     @staticmethod
 69 |     def hyperparameter_optimization_pipeline(params, num_splits, result_file):
 70 |         """
 71 |         Development pipeline used to optimize hyperparameters
 72 |         :param params:
 73 |         :param num_splits:
 74 |         :param result_file:
 75 |         :return:
 76 |         """
 77 | 
 78 |         print("Prepare data")
 79 |         ids = []
 80 |         fold_array = []
 81 |         for s in range(1, num_splits + 1):
 82 |             ids_in = '{}{}.txt'.format(FileSetter.split_ids_in(), s)
 83 |             split_ids = FileManager.read_ids(ids_in)
 84 | 
 85 |             ids += split_ids
 86 |             fold_array += [s] * len(split_ids)
 87 | 
 88 |         ids = np.array(ids)
 89 | 
 90 |         # get sequences + maximum length + labels
 91 |         sequences, max_length, labels = ProteinInformation.get_data(ids)
 92 |         embeddings = FileManager.read_embeddings(FileSetter.embeddings_input())
 93 | 
 94 |         print("Perform hyperparameter optimization")
 95 |         trainer = MLTrainer(pos_weights=params['weights'])
 96 |         del params['weights']  # remove weights to not consider as parameter for optimization
 97 | 
 98 |         model = trainer.cross_validate(params, ids, fold_array, sequences, embeddings, labels, max_length, result_file)
 99 | 
100 |         return model
101 | 
102 |     @staticmethod
103 |     def prediction_pipeline(model_prefix, cutoff, result_folder, ids, fasta_file, ri):
104 |         """
105 |         Run predictions with bindEmbed21DL for a given list of proteins
106 |         :param model_prefix:
107 |         :param cutoff: Cutoff to use to define prediction as binding (default: 0.5)
108 |         :param result_folder:
109 |         :param ids:
110 |         :param fasta_file:
111 |         :param ri: Should RI or raw probabilities be written?
112 |         :return:
113 |         """
114 | 
115 |         print("Prepare data")
116 |         sequences, max_length, labels = ProteinInformation.get_data_predictions(ids, fasta_file)
117 |         embeddings = FileManager.read_embeddings(FileSetter.embeddings_input())
118 | 
119 |         proteins = dict()
120 |         for i in range(0, 5):
121 |             print("Load model")
122 |             model_path = '{}{}.pt'.format(model_prefix, i + 1)
123 |             model = FileManager.load_classifier_torch(model_path)
124 | 
125 |             print("Calculate predictions")
126 |             ml_predictor = MLPredictor(model)
127 |             curr_proteins = ml_predictor.predict_per_protein(ids, sequences, embeddings, labels, max_length)
128 | 
129 |             for k in curr_proteins.keys():
130 |                 if k in proteins.keys():
131 |                     prot = proteins[k]
132 |                     prot.add_predictions(curr_proteins[k].predictions)
133 |                 else:
134 |                     proteins[k] = curr_proteins[k]
135 | 
136 |         for k in proteins.keys():
137 |             proteins[k].normalize_predictions(5)
138 | 
139 |         if result_folder is not None:
140 |             FileManager.write_predictions(proteins, result_folder, cutoff, ri)
141 | 
142 |         return proteins
143 | 
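
Note on `prediction_pipeline`: the method loads the five cross-validation models, accumulates their per-residue predictions via `add_predictions`, and divides by the number of models via `normalize_predictions(5)` before the cutoff is applied. The sketch below illustrates that averaging idea on plain lists. It is a simplification under stated assumptions: the helper names `average_predictions` and `call_binding` are hypothetical, only one output class is shown, and whether the repository compares with `>=` or `>` at the cutoff is not visible in this excerpt.

```python
def average_predictions(per_model_probs):
    """Average per-residue probabilities over several models.

    per_model_probs: one list of floats per model, all of equal length.
    """
    num_models = len(per_model_probs)
    summed = [0.0] * len(per_model_probs[0])
    for probs in per_model_probs:
        for i, p in enumerate(probs):
            summed[i] += p  # accumulate, analogous to adding predictions per model
    # divide by the number of models, analogous to normalizing by 5
    return [s / num_models for s in summed]


def call_binding(avg_probs, cutoff=0.5):
    """Label a residue as binding if its averaged probability reaches the cutoff (assumed >=)."""
    return [p >= cutoff for p in avg_probs]


# Hypothetical probabilities from three models for three residues
model_outputs = [[0.9, 0.2, 0.4],
                 [0.7, 0.1, 0.6],
                 [0.8, 0.3, 0.2]]
averaged = average_predictions(model_outputs)
print(call_binding(averaged, cutoff=0.5))
```

Averaging before thresholding means that a residue is called binding only if the models agree on average, not if a single model is confident.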


--------------------------------------------------------------------------------
/bindPredictML17/README.md:
--------------------------------------------------------------------------------
 1 | # bindPredictML17
 2 | 
 3 | bindPredictML17 is a sequence-based method to predict binding site residues. 
 4 | It uses an Artificial Neural Network (ANN) and was trained on enzymes and DNA-binding proteins
 5 | (i.e., it is best suited to predicting binding sites in these types of proteins).
 6 | It uses 10 input features calculated for each residue:
 7 | - cumulative coupling scores
 8 | - cumulative coupling scores filtered by distance
 9 | - cumulative coupling scores filtered by solvent accessibility
10 | - clustering coefficients
11 | - clustering coefficients filtered by distance
12 | - clustering coefficients filtered by solvent accessibility
13 | - BLOSUM62-SNAP2 comparison
14 | - BLOSUM62-EVmutation comparison
15 | - relative solvent accessibility
16 | - conservation
17 | 
18 | 
19 | ## Implementation
20 | bindPredict is written in Python3. The trained model using the 10 input features named above is included in this repository.
21 | bindPredict can be executed from the command line by running "bindPredict.py".
22 | To calculate the different scores used for the prediction, bindPredict needs pre-calculated results from external programs.
23 | As input parameters, the method needs:
24 | - path to a folder containing EVcouplings results 
25 |   (including *.di_scores, *_CouplingScoresCompared_all.csv, *.epistatic_effect, *_frequencies.csv, *_alignment_statistics.csv)
26 | - path to a folder containing SNAP2 results
27 | - path to a folder containing PROFphd results
28 | 
29 | The method then makes a prediction for every protein present in the SNAP2 results folder. Therefore, all files belonging to one protein must share the same name prefix (e.g. P0A0I7.snap2, P0A0I7.profphd, ...).
30 | If any files are not provided, the corresponding scores are not calculated and are set to 0.
31 | 
32 | ## External programs
33 | The pre-calculated results from external programs can be calculated as follows:
34 | 
35 | - **SNAP2** results: SNAP2 can be run using [predictprotein.org](https://www.predictprotein.org/) [1]. After the job has finished, the results can be downloaded under "Effect of Point Mutations" using the export button.
36 | - **PROFphd** results: PROFphd can also be run using [predictprotein.org](https://www.predictprotein.org/). After the job has finished, the view of the results has to be changed to "text" (upper right corner) and the text part under "Secondary Structure" has to be copied to a text file.
37 | - **EVcouplings** results (\*.di_scores, \*_CouplingScoresCompared_all.csv, \*_frequencies.csv, \*_alignment_statistics.csv): [EVcouplings](http://evfold.org/evfold-web/newmarkec.do) [2,3] is available as a GitHub repository. A detailed description of how to run EVcouplings can be found [here](https://github.com/debbiemarkslab/EVcouplings). EVcouplings only provides EC scores inferred by plmDCA. To calculate DI scores, one can use
38 |   - **FreeContact** [4] which can be downloaded as a [debian package](https://packages.debian.org/search?keywords=freecontact). Using the alignment generated by EVcouplings, DI scores can be calculated using the option "evfold".
39 |   - the **EVcouplings** [webserver](http://evfold.org/evfold-web/newmarkec.do). DI scores can be calculated by entering the UniProt identifier or the sequence in FASTA format and by choosing "DI" as coupling scoring.
40 | - **EVmutation** results (\*.epistatic_effect): 
41 | [EVmutation](https://marks.hms.harvard.edu/evmutation/index.html) [5] is only available as a GitHub repository. A detailed description of how to run EVmutation can be found [here](https://htmlpreview.github.io/?https://github.com/debbiemarkslab/EVmutation/blob/master/EVmutation.html).
42 | 
43 | If results from EVcouplings or EVmutation cannot be obtained due to problems with local installation, a minimal version of bindPredict can be run by using the pre-trained model, but only providing SNAP2 and PROFphd results as input for the prediction. All other input scores will be set to 0. However, this will limit performance of the method.
44 | 
45 | ## Cite
46 | If you are using this method and find it helpful, we would appreciate it if you could cite the following publication:
47 | Schelling, M., Hopf, T.A., Rost, B. (2018). [Evolutionary couplings and sequence variation effect predict protein binding sites](https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.25585). Proteins **86**: 1064-1074
48 | 
49 | ## References
50 | [1] Yachdav, G., Kloppmann, E., Laszlo, K., Hecht, M., Goldberg, T., Hamp, T., Hönigschmid, P., Schafferhans, A., Roos, M., Bernhofer, M. et al. (2014). [PredictProtein - an open resource for online prediction of protein structural and functional features](https://academic.oup.com/nar/article/42/W1/W337/2435518). Nucleic acids research **42**, pages W337–W343.
51 | 
52 | [2] Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., Sander, C. (2011). [Protein 3D Structure Computed from Evolutionary Sequence Variation](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028766). PLoS ONE **6**(12): e28766.
53 | 
54 | [3] Hopf, T. A., Schärfe, C. P. I., Rodrigues, J. P. G. L. M., Green, A. G., Kohlbacher, O., Sander, C., Bonvin, A. M. J. J., Marks, D. S. (2014). [Sequence co-evolution gives 3D contacts and structures of protein complexes](https://elifesciences.org/articles/03430). eLife **3**: e03430
55 | 
56 | [4] Kaján, L., Hopf, T. A., Kalaš, M., Marks, D. S., Rost, B. (2014). [FreeContact: fast and free software for protein contact prediction from residue co-evolution](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-85). BMC Bioinformatics **15**:85
57 | 
58 | [5] Hopf, T. A., Ingraham, J. B., Poelwijk, F.J., Schärfe, C.P.I., Springer, M., Sander, C., & Marks, D. S. (2017). 
59 | [Mutation effects predicted from sequence co-variation](https://www.nature.com/articles/nbt.3769). Nature Biotechnology **35**, pages 128–135.
60 | 
61 | 
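
Note on the expected input layout: the README and `scripts/file_manager.py` together imply one sub-folder per protein inside the EVcouplings folder, plus one SNAP2 file and one solvent accessibility file per protein, all sharing the protein identifier as prefix. The following is a minimal sketch of a pre-flight check that lists which of these files are absent for a given identifier. The function name `missing_inputs` is hypothetical and not part of bindPredict; the default suffixes `.snap2` and `.reprof` mirror the command-line defaults in `bindPredict.py`.

```python
from pathlib import Path

# Suffixes as listed in the README and returned by scripts/file_manager.py
EVC_SUFFIXES = ['.di_scores', '_CouplingScoresCompared_all.csv', '.epistatic_effect',
                '_frequencies.csv', '_alignment_statistics.csv']


def missing_inputs(protein_id, evc_folder, snap_folder, solv_folder,
                   snap_suffix='.snap2', solv_suffix='.reprof'):
    """Return the expected input files for one protein that do not exist."""
    # EVcouplings results live in <evc_folder>/<protein_id>/<protein_id><suffix>
    expected = [Path(evc_folder) / protein_id / (protein_id + s) for s in EVC_SUFFIXES]
    expected.append(Path(snap_folder) / (protein_id + snap_suffix))
    expected.append(Path(solv_folder) / (protein_id + solv_suffix))
    return [str(p) for p in expected if not p.is_file()]
```

For example, `missing_inputs('P0A0I7', 'evcouplings_res/', 'snap_res/', 'solv_acc/')` would return the paths that still need to be generated before running the prediction.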


--------------------------------------------------------------------------------
/bindPredictML17/bindPredict.py:
--------------------------------------------------------------------------------
 1 | """
 2 | DESCRIPTION:
 3 | 
 4 | Development and training of bindPredict to predict binding site residues
 5 | """
 6 | 
 7 | import argparse
 8 | import sys
 9 | 
10 | from scripts.helper import Helper
11 | from scripts.bindpredict_predictor import Predictor
12 | 
13 | 
14 | def main():
15 |     usage_string = 'python bindPredict.py evcouplings_res/ snap_res/ solv_acc/ results/'
16 | 
17 |     # parse command line options
18 |     parser = argparse.ArgumentParser(description=__doc__, usage=usage_string)
19 |     parser.add_argument('evc_folder', help='Folder with protein folders containing EVcouplings results')
20 |     parser.add_argument('snap_folder', help='Folder with SNAP2 predictions. '
21 |                                             'SNAP-Files need to have the same name in front of the suffix as '
22 |                                             'the EVcouplings folders '
23 |                                             '(e.g. Q9XLZ3/ <-> Q9XLZ3.snap2)')
24 |     parser.add_argument('solv_acc_folder', help='Folder with PROFacc predictions. '
25 |                                                 'PROFacc-Files need to have the same name in front of the suffix as '
26 |                                                 'the EVcouplings folders '
27 |                                                 '(e.g. Q9XLZ3/ <-> Q9XLZ3.reprof)')
28 |     parser.add_argument('output_folder', help='Folder to write binding site predictions to (one file for each protein)')
29 |     parser.add_argument('-m', '--model', help='Trained model to use for prediction (default: mm3_cons_solv_more)',
30 |                         default='mm3_cons_solv_more')
31 |     parser.add_argument('--snap_suffix', help='Suffix of files in given SNAP-folder (default: "*.snap2")',
32 |                         default='*.snap2')
33 |     parser.add_argument('--profacc_suffix', help='Suffix of files in given PROFacc-folder (default: "*.reprof")',
34 |                         default='*.reprof')
35 |     args = parser.parse_args()
36 |     helper = Helper()
37 |     if helper.folder_existence_check(args.evc_folder):  # check if evc folder exists and is reachable
38 |         if helper.folder_existence_check(args.snap_folder):  # check if snap folder exists and is reachable
39 |             if helper.folder_existence_check(
40 |                     args.solv_acc_folder):  # check if solvent accessibility folder exists and is reachable
41 |                 predictor = Predictor()
42 |                 predictor.run_prediction(args)
43 | 
44 | 
45 | def error(*objs):
46 |     print("ERROR: ", *objs, file=sys.stderr)
47 | 
48 | 
49 | if __name__ == "__main__":
50 |     main()
51 | 


--------------------------------------------------------------------------------
/bindPredictML17/data/blosum_62.txt:
--------------------------------------------------------------------------------
 1 |    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
 2 | A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
 3 | R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
 4 | N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
 5 | D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
 6 | C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
 7 | Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
 8 | E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
 9 | G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
10 | H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
11 | I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
12 | L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
13 | K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
14 | M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
15 | F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
16 | P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
17 | S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
18 | T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
19 | W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
20 | Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
21 | V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
22 | B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
23 | J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
24 | Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
25 | X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
26 | * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
27 | 
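
Note on the matrix format: `blosum_62.txt` is a whitespace-separated table with a header row of residue symbols followed by one row per residue. How `scripts/protein.py` reads this file is not shown in this excerpt; the sketch below is one possible way to load such a table into a dictionary keyed by residue pairs. The function name `parse_blosum` is hypothetical, and the three-residue example reuses the A/R/N values from the file above.

```python
def parse_blosum(lines):
    """Parse a whitespace-separated substitution matrix into {(row, col): score}."""
    rows = [line.split() for line in lines if line.strip() and not line.startswith('#')]
    columns = rows[0]  # header row: one residue symbol per column
    matrix = {}
    for row in rows[1:]:
        row_symbol, values = row[0], row[1:]
        for col_symbol, value in zip(columns, values):
            matrix[(row_symbol, col_symbol)] = int(value)
    return matrix


# Small excerpt of the matrix, for illustration
example = """   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
""".splitlines()

blosum = parse_blosum(example)
print(blosum[('A', 'R')], blosum[('N', 'N')])
```

For the full file, the same function could be called as `parse_blosum(open('data/blosum_62.txt'))`.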


--------------------------------------------------------------------------------
/bindPredictML17/scripts/bindpredict_predictor.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import sys
  3 | import glob
  4 | import numpy as np
  5 | import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
  6 | from scripts.protein import Protein
  7 | from scripts.file_manager import FileManager
  8 | from scripts.select_model import SelectModel
  9 | 
 10 | 
 11 | class Predictor(object):
 12 |     def __init__(self):
 13 |         self.query_proteins = dict()
 14 |         self.fm = FileManager()
 15 | 
 16 |     def run_prediction(self, args):
 17 |         # 1) Read files
 18 |         print('1) Read files...')
 19 |         self.get_snap_files(args.snap_folder, args.snap_suffix)
 20 |         self.get_evc_files(args.evc_folder)
 21 |         self.get_solv_files(args.solv_acc_folder, args.profacc_suffix)
 22 |         # 2) Load saved model
 23 |         print('2) Load saved model...')
 24 |         cols_to_remove, ranges, words = SelectModel.define_model(args.model)
 25 |         classifier = joblib.load("trained_model/" + args.model + ".pkl")
 26 |         # 3) Prepare predictions
 27 |         print('3) Calculate scores for query...')
 28 |         self.calculate_scores(self.query_proteins)
 29 |         print('4) Run predictions...')
 30 |         for p in self.query_proteins:
 31 |             protein_scores = self.prepare_scores_prediction(p, cols_to_remove)
 32 |             protein_scores = np.array(protein_scores)
 33 |             # 4) Run predictions per protein
 34 |             predictions = []
 35 |             for index in range(0, len(protein_scores)):
 36 |                 el = protein_scores[index]
 37 |                 to_test = el[:len(el) - 1]
 38 |                 pos = el[len(el) - 1]
 39 |                 to_test = np.reshape(to_test, (1, -1))
 40 |                 to_test = to_test.astype(float)
 41 |                 proba = classifier.predict_proba(to_test)
 42 |                 prediction = proba[0][1]
 43 |                 row = [pos, prediction]
 44 |                 predictions = predictions + [row]
 45 |             # 5) Write output
 46 |             self.write_output_predictions(args.output_folder, predictions, p)
 47 | 
 48 |     def get_snap_files(self, folder, suffix):
 49 |         """ Read all ids into one big dictionary.
 50 |         :param folder: Folder where the snap2 predictions are stored
 51 |         :param suffix: suffix of the files (default: snap2)
 52 |         :return:
 53 |         """
 54 | 
 55 |         snap_pattern = os.path.join(folder, suffix)
 56 |         for f in glob.iglob(snap_pattern):
 57 |             name = os.path.basename(f)
 58 |             suffix_start = name.find('.')
 59 |             name = name[:suffix_start]  # reduce the file name to the prefix only
 60 |             self.query_proteins[name] = Protein()
 61 |             self.query_proteins[name].snap_file = f
 62 |             self.query_proteins[name].blosum_file = self.fm.blosum_file
 63 | 
 64 |     def get_evc_files(self, folder):
 65 |         """ For each id, determine the different evc files.
 66 |         :param folder: Folder where the folders with EVcouplings predictions are stored
 67 |         :return:
 68 |         """
 69 | 
 70 |         for p in self.query_proteins:
 71 |             evc_folder = os.path.join(folder, p)
 72 |             evc_file = os.path.join(evc_folder, (p + self.fm.ec_suffix))
 73 |             dist_file = os.path.join(evc_folder, (p + self.fm.dist_suffix))
 74 |             evmut_file = os.path.join(evc_folder, (p + self.fm.evmut_suffix))
 75 |             cons_file = os.path.join(evc_folder, (p + self.fm.cons_suffix))
 76 |             align_file = os.path.join(evc_folder, (p + self.fm.align_suffix))
 77 | 
 78 |             self.query_proteins[p].evc_file = evc_file
 79 |             self.query_proteins[p].dist_file = dist_file
 80 |             self.query_proteins[p].evmut_file = evmut_file
 81 |             self.query_proteins[p].cons_file = cons_file
 82 | 
 83 |             counter = 1
 84 |             with open(align_file) as af:
 85 |                 for line in af:
 86 |                     if counter == 2:
 87 |                         splitted = line.split(",")
 88 |                         length = int(splitted[3])
 89 |                         length_cov = int(splitted[4])
 90 |                         self.query_proteins[p].length = length
 91 |                         self.query_proteins[p].length_cov = length_cov
 92 |                     counter += 1
 93 | 
 94 |     def get_solv_files(self, folder, suffix):
 95 |         """ For each id, save the solv file.
 96 |         :param folder: Folder where PROFacc/Reprof/PROFphd predictions are stored
 97 |         :param suffix: suffix of the files (default: reprof)
 98 |         :return:
 99 |         """
100 | 
101 |         solv_pattern = os.path.join(folder, suffix)
102 |         for f in glob.iglob(solv_pattern):
103 |             name = os.path.basename(f)
104 |             suffix_start = name.find('.')
105 |             name = name[:suffix_start]
106 |             self.query_proteins[name].solv_file = f
107 | 
108 |     def calculate_scores(self, id_list):
109 |         """ Calculate cum scores, clustering coefficients, SNAP2/BLOSUM, EVmut/BLOSUM, Solv and Cons.
110 |         :param id_list: list of ids to calculate scores for
111 |         :return:
112 |         """
113 | 
114 |         for i in id_list:
115 |             self.query_proteins[i].calc_scores()
116 | 
117 |     def prepare_scores_prediction(self, p, cols_to_remove):
118 |         data = []
119 |         scores = self.query_proteins[p].scores
120 |         for res in scores.keys():
121 |             resi = str(res)
122 |             scores_res = scores[res]
123 |             row = [float(i) for j, i in enumerate(scores_res) if j not in cols_to_remove]
124 |             row = row + [resi]
125 |             data.append(row)
126 | 
127 |         return data
128 | 
129 |     @staticmethod
130 |     def write_output_predictions(folder, predictions, p):
131 |         """ Write predictions results into output file.
132 |         :param folder: Folder to write the prediction results to
133 |         :param p: Protein id to use as filename
134 |         :param predictions: Per residue predictions
135 |         :return:
136 |         """
137 | 
138 |         out_file = os.path.join(folder, (p + ".bindPredict_out"))
139 |         with open(out_file, 'w') as out:
140 |             # format: res, proba_pos
141 |             for pred in predictions:
142 |                 label = "nb"
143 |                 if pred[1] >= 0.6:
144 |                     label = "b"
145 |                 out.write(str(pred[0]) + "\t" + str(pred[1]) + "\t" + label + "\n")
146 | 
147 | 
148 | def error(*objs):
149 |     print("ERROR: ", *objs, file=sys.stderr)
150 | 
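
Note on the output format: `write_output_predictions` writes one tab-separated line per residue containing the residue position, the predicted binding probability, and the label `b` (probability >= 0.6) or `nb`. The following is a minimal sketch for reading such a file back in; the function names `read_bindpredict_output` and `binding_positions` are hypothetical and not part of bindPredict.

```python
import csv


def read_bindpredict_output(path):
    """Read one *.bindPredict_out file into a list of (position, probability, label)."""
    results = []
    with open(path, newline='') as handle:
        for pos, proba, label in csv.reader(handle, delimiter='\t'):
            # positions are kept as strings, as they are written by the predictor
            results.append((pos, float(proba), label))
    return results


def binding_positions(results):
    """Return the positions of all residues labelled as binding ('b')."""
    return [pos for pos, _, label in results if label == 'b']
```

For example, `binding_positions(read_bindpredict_output('results/P0A0I7.bindPredict_out'))` would list the residues predicted as binding for that protein.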


--------------------------------------------------------------------------------
/bindPredictML17/scripts/file_manager.py:
--------------------------------------------------------------------------------
 1 | """ Class specifying file endings and locations.
 2 | Before usage, values should be adjusted to local settings"""
 3 | 
 4 | 
 5 | class FileManager(object):
 6 | 
 7 |     @property
 8 |     def ec_suffix(self):
 9 |         return ".di_scores"
10 | 
11 |     @property
12 |     def dist_suffix(self):
13 |         return "_CouplingScoresCompared_all.csv"
14 | 
15 |     @property
16 |     def evmut_suffix(self):
17 |         return ".epistatic_effect"
18 | 
19 |     @property
20 |     def cons_suffix(self):
21 |         return "_frequencies.csv"
22 | 
23 |     @property
24 |     def align_suffix(self):
25 |         return "_alignment_statistics.csv"
26 | 
27 |     @property
28 |     def ids_file(self):
29 |         file_name = "ids_split"
30 |         return file_name
31 | 
32 |     @property
33 |     def ids_folder(self):
34 |         return "data/id_lists/"
35 | 
36 |     @property
37 |     def binding_file(self):
38 |         return "data/binding_residues_whole_swissprot_all.txt"
39 | 
40 |     @property
41 |     def blosum_file(self):
42 |         return "data/blosum_62.txt"
43 | 


--------------------------------------------------------------------------------
/bindPredictML17/scripts/helper.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | 
 4 | class Helper(object):
 5 |     @staticmethod
 6 |     def folder_existence_check(folder_name=None):
 7 |         """
 8 |         Check if folder exists and is reachable, if not exit the program.
 9 |         :param folder_name: Folder to check, if it exists (default: None)
10 |         :type folder_name: String
11 |         :return: True if folder exists and is reachable, False otherwise
12 |         """
13 |         if os.path.exists(folder_name):
14 |             folder_is_there = True
15 |         else:
16 |             print('Folder {fl} does not exist or is not reachable - exit!'.format(fl=folder_name))
17 |             folder_is_there = False
18 |             exit(404)
19 | 
20 |         return folder_is_there
21 | 
22 |     @staticmethod
23 |     def read_more_data(ids_folder, run):
24 |         """
25 |         Read ids for more data for the training set of the specified run
26 |         :param ids_folder: Folder where id files are located
27 |         :param run: Current run of cross-validation to read data for
28 |         :return: ids for more data
29 |         """
30 | 
31 |         ids = []
32 |         with open(os.path.join(ids_folder, ("ids_run" + str(run) + "_more.txt"))) as f:
33 |             for line in f:
34 |                 ids.append(line.strip())
35 | 
36 |         return ids
37 | 
38 |     @staticmethod
39 |     def read_id_lists(ids_folder, ids_file):
40 |         """
41 |         Read the id lists from a given folder
42 |         :param ids_folder: Folder where id files are located
43 |         :param ids_file: prefix of id files
44 |         :return: ids of split 1-5
45 |         """
46 | 
47 |         split1_ids = []
48 |         split2_ids = []
49 |         split3_ids = []
50 |         split4_ids = []
51 |         split5_ids = []
52 | 
53 |         with open(os.path.join(ids_folder, (ids_file + "1.txt"))) as f:
54 |             for line in f:
55 |                 split1_ids.append(line.strip())
56 | 
57 |         with open(os.path.join(ids_folder, (ids_file + "2.txt"))) as f:
58 |             for line in f:
59 |                 split2_ids.append(line.strip())
60 | 
61 |         with open(os.path.join(ids_folder, (ids_file + "3.txt"))) as f:
62 |             for line in f:
63 |                 split3_ids.append(line.strip())
64 | 
65 |         with open(os.path.join(ids_folder, (ids_file + "4.txt"))) as f:
66 |             for line in f:
67 |                 split4_ids.append(line.strip())
68 | 
69 |         with open(os.path.join(ids_folder, (ids_file + "5.txt"))) as f:
70 |             for line in f:
71 |                 split5_ids.append(line.strip())
72 | 
73 |         return split1_ids, split2_ids, split3_ids, split4_ids, split5_ids
74 | 
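
Note on `read_id_lists`: the method above repeats the same read for each of the five split files. The following is a sketch of an equivalent loop-based variant, assuming the same file naming (`<ids_file>1.txt` to `<ids_file>5.txt`). The `num_splits` parameter is an addition for illustration and does not exist in the original method.

```python
import os


def read_id_lists(ids_folder, ids_file, num_splits=5):
    """Read <ids_file>1.txt ... <ids_file>N.txt and return one list of ids per split."""
    splits = []
    for split in range(1, num_splits + 1):
        path = os.path.join(ids_folder, '{}{}.txt'.format(ids_file, split))
        with open(path) as f:
            splits.append([line.strip() for line in f])
    # a tuple keeps the original call pattern: s1, s2, s3, s4, s5 = read_id_lists(...)
    return tuple(splits)
```

With the default `num_splits=5`, the result can be unpacked exactly like the return value of the original method.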


--------------------------------------------------------------------------------
/bindPredictML17/scripts/parser.py:
--------------------------------------------------------------------------------
  1 | """ Provides methods to parse output
  2 | from secondary structure and solvent accessibility predictions like ProfPHD and Reprof.
  3 | """
  4 | 
  5 | import sys
  6 | from Bio.Seq import Seq
  7 | 
  8 | 
  9 | class ProfParser(object):
 10 |     """ Parse and save information from ProfPHD or Reprof output.
 11 |     """
 12 | 
 13 |     # general
 14 |     seq = Seq("")
 15 | 
 16 |     # sec structure
 17 |     o_sec_struc = []
 18 |     p_sec_struc = []
 19 |     ri_sec_struc = []
 20 |     p_h = []
 21 |     p_e = []
 22 |     p_l = []
 23 |     o_t_h = []
 24 |     o_t_e = []
 25 |     o_t_l = []
 26 | 
 27 |     # solv access
 28 |     o_solv_acc = []
 29 |     p_solv_acc = []
 30 |     o_rel_solv_10 = []
 31 |     p_rel_solv_10 = []
 32 |     p_rel_solv = []
 33 |     p_10 = []
 34 |     ri_solv_acc = []
 35 |     o_be = []
 36 |     p_be = []
 37 |     o_bie = []
 38 |     p_bie = []
 39 |     o_t_0 = []
 40 |     o_t_1 = []
 41 |     o_t_2 = []
 42 |     o_t_3 = []
 43 |     o_t_4 = []
 44 |     o_t_5 = []
 45 |     o_t_6 = []
 46 |     o_t_7 = []
 47 |     o_t_8 = []
 48 |     o_t_9 = []
 49 | 
 50 |     # tmh
 51 |     omn = []
 52 |     pmn = []
 53 |     prmn = []
 54 |     ri_m = []
 55 |     p_m = []
 56 |     p_n = []
 57 |     o_t_m = []
 58 |     o_t_n = []
 59 | 
 60 |     def __init__(self, file_in):
 61 |         self.seq = Seq("")
 62 | 
 63 |         self.o_sec_struc = []
 64 |         self.p_sec_struc = []
 65 |         self.ri_sec_struc = []
 66 |         self.p_h = []
 67 |         self.p_e = []
 68 |         self.p_l = []
 69 |         self.o_t_h = []
 70 |         self.o_t_e = []
 71 |         self.o_t_l = []
 72 | 
 73 |         self.o_solv_acc = []
 74 |         self.p_solv_acc = []
 75 |         self.o_rel_solv_10 = []
 76 |         self.p_rel_solv_10 = []
 77 |         self.p_rel_solv = []
 78 |         self.p_10 = []
 79 |         self.ri_solv_acc = []
 80 |         self.o_be = []
 81 |         self.p_be = []
 82 |         self.o_bie = []
 83 |         self.p_bie = []
 84 |         self.o_t_0 = []
 85 |         self.o_t_1 = []
 86 |         self.o_t_2 = []
 87 |         self.o_t_3 = []
 88 |         self.o_t_4 = []
 89 |         self.o_t_5 = []
 90 |         self.o_t_6 = []
 91 |         self.o_t_7 = []
 92 |         self.o_t_8 = []
 93 |         self.o_t_9 = []
 94 | 
 95 |         self.omn = []
 96 |         self.pmn = []
 97 |         self.prmn = []
 98 |         self.ri_m = []
 99 |         self.p_m = []
100 |         self.p_n = []
101 |         self.o_t_m = []
102 |         self.o_t_n = []
103 | 
104 |         self._parse_file(file_in)
105 | 
106 |     def _parse_file(self, file_in):
107 |         rows = []
108 | 
109 |         with open(file_in) as f:
110 |             for line in f:
111 |                 if '#' not in line:
112 |                     row = line.strip().split("\t")
113 |                     rows.append(row)
114 | 
115 |         tmp = rows.pop(0)
116 | 
117 |         col_map = {}
118 |         for i in range(0, len(tmp)):
119 |             col = tmp[i]
120 |             col_map[col] = i
121 | 
122 |         self._parse_profphd(rows, col_map)
123 | 
124 |     def _parse_profphd(self, data, col_map):
125 |         seq_str = ''
126 | 
127 |         for i in range(0, len(data)):
128 |             el = data[i]
129 |             aa = el[1]
130 |             seq_str = seq_str + aa
131 | 
132 |             # get column indices
133 |             o_sec_i = -1
134 |             if 'OHEL' in col_map:
135 |                 o_sec_i = col_map['OHEL']
136 |             p_sec_i = -1
137 |             if 'PHEL' in col_map:
138 |                 p_sec_i = col_map['PHEL']
139 |             ri_sec_i = -1
140 |             if 'RI_S' in col_map:
141 |                 ri_sec_i = col_map['RI_S']
142 |             p_h_i = -1
143 |             if 'pH' in col_map:
144 |                 p_h_i = col_map['pH']
145 |             p_e_i = -1
146 |             if 'pE' in col_map:
147 |                 p_e_i = col_map['pE']
148 |             p_l_i = -1
149 |             if 'pL' in col_map:
150 |                 p_l_i = col_map['pL']
151 |             o_t_h_i = -1
152 |             if 'OtH' in col_map:
153 |                 o_t_h_i = col_map['OtH']
154 |             o_t_e_i = -1
155 |             if 'OtE' in col_map:
156 |                 o_t_e_i = col_map['OtE']
157 |             o_t_l_i = -1
158 |             if 'OtL' in col_map:
159 |                 o_t_l_i = col_map['OtL']
160 | 
161 |             o_solv_acc_i = -1
162 |             if 'OACC' in col_map:
163 |                 o_solv_acc_i = col_map['OACC']
164 |             p_solv_acc_i = -1
165 |             if 'PACC' in col_map:
166 |                 p_solv_acc_i = col_map['PACC']
167 |             o_rel_solv_10_i = -1
168 |             if 'OREL' in col_map:
169 |                 o_rel_solv_10_i = col_map['OREL']
170 |             p_rel_solv_10_i = -1
171 |             if 'PREL' in col_map:
172 |                 p_rel_solv_10_i = col_map['PREL']
173 |             p_10_i = -1
174 |             if 'P10' in col_map:
175 |                 p_10_i = col_map['P10']
176 |             ri_solv_i = -1
177 |             if 'RI_A' in col_map:
178 |                 ri_solv_i = col_map['RI_A']
179 |             o_be_i = -1
180 |             if 'Obe' in col_map:
181 |                 o_be_i = col_map['Obe']
182 |             p_be_i = -1
183 |             if 'Pbe' in col_map:
184 |                 p_be_i = col_map['Pbe']
185 |             o_bie_i = -1
186 |             if 'Obie' in col_map:
187 |                 o_bie_i = col_map['Obie']
188 |             p_bie_i = -1
189 |             if 'Pbie' in col_map:
190 |                 p_bie_i = col_map['Pbie']
191 |             o_t_0_i = -1
192 |             if 'Ot0' in col_map:
193 |                 o_t_0_i = col_map['Ot0']
194 |             o_t_1_i = -1
195 |             if 'Ot1' in col_map:
196 |                 o_t_1_i = col_map['Ot1']
197 |             o_t_2_i = -1
198 |             if 'Ot2' in col_map:
199 |                 o_t_2_i = col_map['Ot2']
200 |             o_t_3_i = -1
201 |             if 'Ot3' in col_map:
202 |                 o_t_3_i = col_map['Ot3']
203 |             o_t_4_i = -1
204 |             if 'Ot4' in col_map:
205 |                 o_t_4_i = col_map['Ot4']
206 |             o_t_5_i = -1
207 |             if 'Ot5' in col_map:
208 |                 o_t_5_i = col_map['Ot5']
209 |             o_t_6_i = -1
210 |             if 'Ot6' in col_map:
211 |                 o_t_6_i = col_map['Ot6']
212 |             o_t_7_i = -1
213 |             if 'Ot7' in col_map:
214 |                 o_t_7_i = col_map['Ot7']
215 |             o_t_8_i = -1
216 |             if 'Ot8' in col_map:
217 |                 o_t_8_i = col_map['Ot8']
218 |             o_t_9_i = -1
219 |             if 'Ot9' in col_map:
220 |                 o_t_9_i = col_map['Ot9']
221 | 
222 |             omn_i = -1
223 |             if 'OMN' in col_map:
224 |                 omn_i = col_map['OMN']
225 |             pmn_i = -1
226 |             if 'PMN' in col_map:
227 |                 pmn_i = col_map['PMN']
228 |             prmn_i = -1
229 |             if 'PRMN' in col_map:
230 |                 prmn_i = col_map['PRMN']
231 |             ri_m_i = -1
232 |             if 'RI_M' in col_map:
233 |                 ri_m_i = col_map['RI_M']
234 |             p_m_i = -1
235 |             if 'pM' in col_map:
236 |                 p_m_i = col_map['pM']
237 |             p_n_i = -1
238 |             if 'pN' in col_map:
239 |                 p_n_i = col_map['pN']
240 |             o_t_m_i = -1
241 |             if 'OtM' in col_map:
242 |                 o_t_m_i = col_map['OtM']
243 |             o_t_n_i = -1
244 |             if 'OtN' in col_map:
245 |                 o_t_n_i = col_map['OtN']
246 | 
247 |             if o_sec_i > -1:
248 |                 self.o_sec_struc.append(el[o_sec_i])
249 |             if p_sec_i > -1:
250 |                 self.p_sec_struc.append(el[p_sec_i])
251 |             if ri_sec_i > -1:
252 |                 self.ri_sec_struc.append(int(el[ri_sec_i]))
253 |             if p_h_i > -1:
254 |                 self.p_h.append(int(el[p_h_i]))
255 |             if p_e_i > -1:
256 |                 self.p_e.append(int(el[p_e_i]))
257 |             if p_l_i > -1:
258 |                 self.p_l.append(int(el[p_l_i]))
259 |             if o_t_h_i > -1:
260 |                 self.o_t_h.append(int(el[o_t_h_i]))
261 |             if o_t_e_i > -1:
262 |                 self.o_t_e.append(int(el[o_t_e_i]))
263 |             if o_t_l_i > -1:
264 |                 self.o_t_l.append(int(el[o_t_l_i]))
265 | 
266 |             if o_solv_acc_i > -1:
267 |                 self.o_solv_acc.append(int(el[o_solv_acc_i]))
268 |             if p_solv_acc_i > -1:
269 |                 self.p_solv_acc.append(int(el[p_solv_acc_i]))
270 |             if o_rel_solv_10_i > -1:
271 |                 self.o_rel_solv_10.append(int(el[o_rel_solv_10_i]))
272 |             if p_rel_solv_10_i > -1:
273 |                 self.p_rel_solv_10.append(int(el[p_rel_solv_10_i]))
274 |             if p_10_i > -1:
275 |                 self.p_10.append(int(el[p_10_i]))
276 |             if ri_solv_i > -1:
277 |                 self.ri_solv_acc.append(int(el[ri_solv_i]))
278 |             if o_be_i > -1:
279 |                 self.o_be.append(el[o_be_i])
280 |             if p_be_i > -1:
281 |                 self.p_be.append(el[p_be_i])
282 |             if o_bie_i > -1:
283 |                 self.o_bie.append(el[o_bie_i])
284 |             if p_bie_i > -1:
285 |                 self.p_bie.append(el[p_bie_i])
286 |             if o_t_0_i > -1:
287 |                 self.o_t_0.append(int(el[o_t_0_i]))
288 |             if o_t_1_i > -1:
289 |                 self.o_t_1.append(int(el[o_t_1_i]))
290 |             if o_t_2_i > -1:
291 |                 self.o_t_2.append(int(el[o_t_2_i]))
292 |             if o_t_3_i > -1:
293 |                 self.o_t_3.append(int(el[o_t_3_i]))
294 |             if o_t_4_i > -1:
295 |                 self.o_t_4.append(int(el[o_t_4_i]))
296 |             if o_t_5_i > -1:
297 |                 self.o_t_5.append(int(el[o_t_5_i]))
298 |             if o_t_6_i > -1:
299 |                 self.o_t_6.append(int(el[o_t_6_i]))
300 |             if o_t_7_i > -1:
301 |                 self.o_t_7.append(int(el[o_t_7_i]))
302 |             if o_t_8_i > -1:
303 |                 self.o_t_8.append(int(el[o_t_8_i]))
304 |             if o_t_9_i > -1:
305 |                 self.o_t_9.append(int(el[o_t_9_i]))
306 | 
307 |             if omn_i > -1:
308 |                 self.omn.append(el[omn_i])
309 |             if pmn_i > -1:
310 |                 self.pmn.append(el[pmn_i])
311 |             if prmn_i > -1:
312 |                 self.prmn.append(el[prmn_i])
313 |             if ri_m_i > -1:
314 |                 self.ri_m.append(el[ri_m_i])
315 |             if p_m_i > -1:
316 |                 self.p_m.append(el[p_m_i])
317 |             if p_n_i > -1:
318 |                 self.p_n.append(el[p_n_i])
319 |             if o_t_m_i > -1:
320 |                 self.o_t_m.append(el[o_t_m_i])
321 |             if o_t_n_i > -1:
322 |                 self.o_t_n.append(el[o_t_n_i])
323 | 
324 |         self.seq = Seq(seq_str)
325 | 
326 | 
327 | def error(*objs):
328 |     print("ERROR: ", *objs, file=sys.stderr)
329 | 
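
The per-column `if 'X' in col_map` blocks in `_parse_profphd` all follow the same pattern: look up a header, then append the (optionally `int`-converted) cell to a list. A minimal table-driven sketch of that idea is shown below. `COLUMN_SPECS`, `parse_columns`, and the sample input are hypothetical names and data introduced for illustration only; they are not part of the repository, and only a small subset of the columns is covered.

```python
import io

# Hypothetical subset of the columns handled by ProfParser:
# header name -> (attribute name, converter applied to each cell).
COLUMN_SPECS = {
    "PHEL": ("p_sec_struc", str),
    "RI_S": ("ri_sec_struc", int),
    "PACC": ("p_solv_acc", int),
    "PREL": ("p_rel_solv_10", int),
}


def parse_columns(handle, specs=COLUMN_SPECS):
    """Parse tab-separated rows into per-column lists, skipping '#' lines."""
    rows = [line.strip().split("\t")
            for line in handle
            if line.strip() and "#" not in line]
    header, data = rows[0], rows[1:]
    col_map = {name: idx for idx, name in enumerate(header)}

    # Resolve the column indices once instead of once per row.
    present = {attr: (col_map[name], conv)
               for name, (attr, conv) in specs.items()
               if name in col_map}

    # Columns missing from the file keep an empty list, as in ProfParser.
    parsed = {attr: [] for attr, _ in specs.values()}
    for row in data:
        for attr, (idx, conv) in present.items():
            parsed[attr].append(conv(row[idx]))

    # The amino acid is read from the second column, as in _parse_profphd.
    seq = "".join(row[1] for row in data)
    return seq, parsed


# Made-up sample input, not real PROFphd output.
sample = io.StringIO(
    "# sample header comment\n"
    "No\tAA\tPHEL\tRI_S\tPACC\n"
    "1\tM\tL\t9\t150\n"
    "2\tK\tH\t5\t98\n"
)
seq, parsed = parse_columns(sample)
```

Resolving the indices once outside the row loop avoids repeating the same dictionary lookups for every residue, which is the main difference from the loop above.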


--------------------------------------------------------------------------------
/bindPredictML17/scripts/protein.py:
--------------------------------------------------------------------------------
  1 | from collections import defaultdict
  2 | import re
  3 | import os.path
  4 | import sys
  5 | import numpy
  6 | from scripts.parser import ProfParser
  7 | 
  8 | 
  9 | class Protein(object):
 10 |     """ A protein in the context of binding site prediction """
 11 | 
 12 |     def __init__(self):
 13 |         self.snap_file = None
 14 |         self.evc_file = None
 15 |         self.dist_file = None
 16 |         self.evmut_file = None
 17 |         self.cons_file = None
 18 |         self.solv_file = None
 19 |         self.length = 0
 20 |         self.length_cov = 0
 21 |         self.predictions = dict()
 22 |         self.performance = {'TP': 0, 'FP': 0, 'TN': 0, 'FN': 0, 'Cov': 0, 'Prec': 0, 'F1': 0, 'Acc': 0}
 23 |         self.scores = defaultdict(list)
 24 |         self.binding_res = []
 25 |         self.seq_dist = 5
 26 |         self.threshold = 0.0
 27 |         self.average = 0.0
 28 |         self.factor = 2
 29 |         self.cs_dist_thresh = 8.0
 30 |         self.cc_dist_thresh = 20.0
 31 |         self.blosum_file = None
 32 |         self.blosum_thresh = 0.0
 33 |         self.blosum_mat = defaultdict(dict)
 34 |         self.snap_percentage = 0.3
 35 |         self.evc_percentage = 0.4
 36 |         self.abs_vals = {'A': 106, 'C': 135, 'F': 197, 'I': 169, 'M': 188, 'Q': 198, 'T': 142, 'X': 180, 'Y': 222,
 37 |                          'B': 160, 'D': 163, 'G': 84, 'K': 205, 'N': 157, 'R': 248, 'V': 142, 'Z': 196, 'E': 194,
 38 |                          'H': 184, 'L': 164, 'P': 136, 'S': 130, 'W': 227, 'U': 135}
 39 | 
 40 |     def calc_scores(self):
 41 |         """ Calculate cumulative coupling scores, clustering coefficients, SNAP2/BLOSUM, EVmut/BLOSUM, Solv and
 42 |         Cons for a protein. If any of the needed files does not exist, the corresponding scores keep their default of 0.
 43 |         The per-residue feature vectors are stored in self.scores.
 44 |         :return: None
 45 |         """
 46 | 
 47 |         residue_map = defaultdict(dict)
 48 |         max_len = self.length + 3
 49 |         for r in range(-1, max_len):
 50 |             r = str(r)
 51 |             residue_map[r] = {'Cum': 0, 'Cum_Dist': 0, 'Cum_Solv': 0,
 52 |                               'Cluster': 0, 'Cluster_Dist': 0, 'Cluster_Solv': 0,
 53 |                               'SNAP': 0, 'EVmut': 0, 'Cons': 0, 'Solv': 0}
 54 | 
 55 |         # calculate BLOSUM comparison
 56 |         self.read_blosum_matrix()
 57 |         # for SNAP2
 58 |         if os.path.isfile(self.snap_file):
 59 |             snap_effect = self.get_snap_effect()
 60 |             snap_thresh = self.determine_snap_thresh(snap_effect)
 61 |             snap_blosum_diff = self.get_blosum_diff(snap_effect, snap_thresh, 'snap')
 62 |             for r in snap_blosum_diff.keys():
 63 |                 diff = snap_blosum_diff[r]
 64 |                 r = str(r)
 65 |                 residue_map[r]['SNAP'] = diff
 66 |         # for EVmutation
 67 |         if os.path.isfile(self.evmut_file):
 68 |             evc_effect = self.get_evc_effect()
 69 |             evc_thresh = self.determine_evc_thresh(evc_effect)
 70 |             evc_blosum_diff = self.get_blosum_diff(evc_effect, evc_thresh, 'evmut')
 71 |             for r in evc_blosum_diff.keys():
 72 |                 diff = evc_blosum_diff[r]
 73 |                 r = str(r)
 74 |                 residue_map[r]['EVmut'] = diff
 75 | 
 76 |         # get per residue conservation
 77 |         line_num = 1
 78 |         if os.path.isfile(self.cons_file):
 79 |             with open(self.cons_file) as cons:
 80 |                 for line in cons:
 81 |                     if line_num > 1:
 82 |                         splitted = line.split(",")
 83 |                         pos = splitted[0]
 84 |                         conservation = splitted[2]
 85 |                         residue_map[pos]['Cons'] = conservation
 86 |                     line_num += 1
 87 | 
 88 |         # get per residue solvent accessibility
 89 |         solv_acc = dict()
 90 |         if os.path.isfile(self.solv_file):
 91 |             pp = ProfParser(self.solv_file)
 92 |             seq = pp.seq
 93 |             solv = pp.p_solv_acc
 94 | 
 95 |             for i in range(0, len(solv)):
 96 |                 val = solv[i]
 97 |                 letter = seq[i]
 98 |                 max_val = self.abs_vals[letter]
 99 |                 rel_val = val / max_val
100 |                 rel_val = round(rel_val, 3)
101 |                 pos = i + 1
102 |                 residue_map[str(pos)]['Solv'] = rel_val
103 |                 solv_acc[pos] = rel_val
104 | 
105 |         if os.path.isfile(self.evc_file):
106 |             # calculate cumulative coupling scores
107 |             distances = self.get_distances()
108 |             ec_scores = self.get_ec_scores()
109 |             self.calc_thresh_avg(ec_scores)
110 |             # don't filter
111 |             filtered_scores = self.filter_scores(ec_scores, distances, solv_acc, 'none', 0)
112 |             cumulative_scores = self.calc_cum_scores(filtered_scores)
113 |             # filter by distance
114 |             filtered_dist_scores = self.filter_scores(ec_scores, distances, solv_acc, 'dist', self.cs_dist_thresh)
115 |             cumulative_dist_scores = self.calc_cum_scores(filtered_dist_scores)
116 |             # filter by solvent accessibility
117 |             filtered_solv_scores = self.filter_scores(ec_scores, distances, solv_acc, 'solv', 0)
118 |             cumulative_solv_scores = self.calc_cum_scores(filtered_solv_scores)
119 | 
120 |             # calculate clustering coefficients
121 |             # don't filter
122 |             clustering_coefficients = self.calc_clustering_coefficients(filtered_scores)
123 |             # filter by distance
124 |             filtered_dist_scores = self.filter_scores(ec_scores, distances, solv_acc, 'dist', self.cc_dist_thresh)
125 |             clustering_coefficients_dist = self.calc_clustering_coefficients(filtered_dist_scores)
126 |             # filter by solvent accessibility
127 |             clustering_coefficients_solv = self.calc_clustering_coefficients(filtered_solv_scores)
128 | 
129 |             for r in range(1, self.length + 1):
130 |                 cum_score = cumulative_scores[r]
131 |                 cum_dist_score = cumulative_dist_scores[r]
132 |                 cum_solv_score = cumulative_solv_scores[r]
133 |                 cluster_score = clustering_coefficients[r]
134 |                 cluster_dist_score = clustering_coefficients_dist[r]
135 |                 cluster_solv_score = clustering_coefficients_solv[r]
136 |                 r = str(r)
137 |                 residue_map[r]['Cum'] = cum_score
138 |                 residue_map[r]['Cum_Dist'] = cum_dist_score
139 |                 residue_map[r]['Cum_Solv'] = cum_solv_score
140 |                 residue_map[r]['Cluster'] = cluster_score
141 |                 residue_map[r]['Cluster_Dist'] = cluster_dist_score
142 |                 residue_map[r]['Cluster_Solv'] = cluster_solv_score
143 | 
144 |         # convert to right format
145 |         for r in residue_map.keys():
146 |             r = int(r)
147 |             if 0 < r <= self.length:
148 |                 r1 = r - 2
149 |                 r2 = r - 1
150 |                 r3 = r + 1
151 |                 r4 = r + 2
152 |                 r_map = residue_map[str(r)]
153 |                 r1_map = residue_map[str(r1)]
154 |                 r2_map = residue_map[str(r2)]
155 |                 r3_map = residue_map[str(r3)]
156 |                 r4_map = residue_map[str(r4)]
157 | 
158 |                 scores = [r_map['Cum'], r2_map['Cum'], r3_map['Cum'], r1_map['Cum'], r4_map['Cum'],
159 |                           r_map['Cum_Solv'], r2_map['Cum_Solv'], r3_map['Cum_Solv'], r1_map['Cum_Solv'],
160 |                           r4_map['Cum_Solv'],
161 |                           r_map['Cum_Dist'], r2_map['Cum_Dist'], r3_map['Cum_Dist'], r1_map['Cum_Dist'],
162 |                           r4_map['Cum_Dist'],
163 |                           r_map['Cluster'], r2_map['Cluster'], r3_map['Cluster'], r1_map['Cluster'], r4_map['Cluster'],
164 |                           r_map['Cluster_Solv'], r2_map['Cluster_Solv'], r3_map['Cluster_Solv'], r1_map['Cluster_Solv'],
165 |                           r4_map['Cluster_Solv'],
166 |                           r_map['Cluster_Dist'], r2_map['Cluster_Dist'], r3_map['Cluster_Dist'], r1_map['Cluster_Dist'],
167 |                           r4_map['Cluster_Dist'],
168 |                           r_map['EVmut'], r2_map['EVmut'], r3_map['EVmut'], r1_map['EVmut'], r4_map['EVmut'],
169 |                           r_map['SNAP'], r2_map['SNAP'], r3_map['SNAP'], r1_map['SNAP'], r4_map['SNAP'],
170 |                           r_map['Solv'], r2_map['Solv'], r3_map['Solv'], r1_map['Solv'], r4_map['Solv'],
171 |                           r_map['Cons'], r2_map['Cons'], r3_map['Cons'], r1_map['Cons'], r4_map['Cons']]
172 |                 self.scores[r] = scores
173 | 
174 |     def get_distances(self):
175 |         """
176 |         read pairwise distances from file
177 |         :return:
178 |         """
179 |         line_num = 1
180 |         dist_dict = dict()
181 |         with open(self.dist_file) as f:
182 |             for line in f:
183 |                 if line_num > 1:
184 |                     splitted = line.split(",")
185 |                     pos1 = splitted[0]
186 |                     pos2 = splitted[1]
187 |                     comb_pos = pos1 + "/" + pos2
188 |                     dist = 0.0
189 |                     if splitted[len(splitted) - 2] != "" and re.search("[a-zA-Z]+",
190 |                                                                        splitted[len(splitted) - 2]) is None:
191 |                         dist = float(splitted[len(splitted) - 2])
192 |                     dist_dict[comb_pos] = dist
193 |                 line_num += 1
194 |         return dist_dict
195 | 
196 |     def get_ec_scores(self):
197 |         """
198 |         read pairwise EC scores from file
199 |         :return:
200 |         """
201 |         ec_dict = dict()
202 |         with open(self.evc_file) as f:
203 |             for line in f:
204 |                 splitted = line.split(" ")
205 |                 comb_pos = splitted[0].strip() + "/" + splitted[2].strip()
206 |                 score = float(splitted[4])
207 |                 ec_dict[comb_pos] = score
208 |         return ec_dict
209 | 
210 |     def calc_thresh_avg(self, ec_scores):
211 |         """
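212 |         Determine the score threshold and the average score of the highest-ranking long-range coupling pairs
213 |         (top length_cov * factor pairs); both are stored on the instance and used to filter and normalize EC scores
214 |         :param ec_scores: Dict of coupling scores keyed by 'pos1/pos2'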
215 |         :return:
216 |         """
217 |         num = self.length_cov * self.factor
218 |         high_scores = list()
219 |         counter = 0
220 |         for key in ec_scores.keys():
221 |             splitted = key.split("/")
222 |             pos1 = int(splitted[0])
223 |             pos2 = int(splitted[1])
224 |             score = ec_scores[key]
225 |             if abs(pos1 - pos2) > self.seq_dist:
226 |                 if len(high_scores) >= num:
227 |                     el = high_scores[0]
228 |                     if el < score:
229 |                         del high_scores[0]
230 |                         high_scores.append(score)
231 |                 else:
232 |                     high_scores.append(score)
233 |                 high_scores.sort()
234 |             else:
235 |                 counter += 1
236 |         thresh = high_scores[0]
237 |         avg = float(numpy.mean(high_scores))
238 |         average = round(avg, 2)
239 |         self.threshold = thresh
240 |         self.average = average
241 | 
242 |     def filter_scores(self, ec_scores, distances, solv_acc, identifier, dist_thresh):
243 |         """
244 |         Filter coupling scores
245 |         :param ec_scores: List of coupling scores
246 |         :param distances: Pairwise distances
247 |         :param solv_acc: Per-residue solvent accessibility
248 |         :param identifier: To identify which filter should be applied (none/dist/solv)
249 |         :param dist_thresh: Threshold to use as distance cutoff
250 |         :return: filtered scores
251 |         """
252 |         filtered_scores = dict()
253 | 
254 |         for key in ec_scores.keys():
255 |             score = ec_scores[key]
256 |             key_parts = key.split("/")
257 |             pos1 = int(key_parts[0])
258 |             pos2 = int(key_parts[1])
259 |             curr_seq_dist = abs(pos1 - pos2)
260 |             if curr_seq_dist > self.seq_dist and score >= self.threshold:
261 |                 if identifier == 'none':
262 |                     filtered_scores[key] = score
263 |                 elif identifier == 'solv':
264 |                     core = False
265 |                     if pos1 in solv_acc.keys() and pos2 in solv_acc.keys():
266 |                         solv_acc1 = solv_acc[pos1]
267 |                         solv_acc2 = solv_acc[pos2]
268 |                         if solv_acc1 <= 0.1 and solv_acc2 <= 0.1:
269 |                             core = True
270 |                     if not core:
271 |                         filtered_scores[key] = score
272 |                 elif identifier == 'dist':
273 |                     pair_dist = float('Inf')
274 |                     if key in distances:
275 |                         pair_dist = distances[key]
276 |                     if pair_dist <= dist_thresh:
277 |                         filtered_scores[key] = score
278 |                 else:
279 |                     print("Invalid filter applied!")
280 |         return filtered_scores
281 | 
282 |     def calc_cum_scores(self, scores):
283 |         """ Calc cumulative coupling scores from a given set of EC scores
284 |         :param scores: list of ec scores
285 |         :return: cumulative coupling scores
286 |         """
287 |         cum_scores = dict()
288 |         for i in range(1, self.length + 1):
289 |             sum_ec = 0.0
290 |             num = 0
291 |             for j in range(1, self.length + 1):
292 |                 if i <= j:
293 |                     key = str(i) + "/" + str(j)
294 |                 else:
295 |                     key = str(j) + "/" + str(i)
296 |                 if key in scores.keys():
297 |                     score = scores[key]
298 |                     sum_ec += score
299 |                     num += 1
300 |             ec_strength = 0.0
301 |             if num > 0:
302 |                 ec_strength = sum_ec / self.average
303 |                 ec_strength = round(ec_strength, 3)
304 |             cum_scores[i] = ec_strength
305 | 
306 |         return cum_scores
307 | 
308 |     def calc_clustering_coefficients(self, scores):
309 |         """ Calc clustering coefficients from a given set of EC scores
310 |         :param scores: list of ec scores
311 |         :return: clustering coefficients
312 |         """
313 |         coefficients = dict()
314 |         for i in range(1, self.length + 1):
315 |             neighbourhood = set()
316 |             for j in range(1, self.length + 1):
317 |                 if i <= j:
318 |                     key = str(i) + "/" + str(j)
319 |                 else:
320 |                     key = str(j) + "/" + str(i)
321 |                 # if key == '228/234':
322 |                 #    print(key + "\t" + str(key in scores.keys())) 
323 |                 if key in scores.keys():
324 |                     neighbourhood.add(j)
325 | 
326 |             num_edges = 0
327 |             for n1 in neighbourhood:
328 |                 for n2 in neighbourhood:
329 |                     if n1 <= n2:
330 |                         key = str(n1) + "/" + str(n2)
331 |                     else:
332 |                         key = str(n2) + "/" + str(n1)
333 |                     if key in scores.keys():
334 |                         num_edges += 1
335 |             coeff = 0
336 | 
337 |             set_size = len(neighbourhood)
338 |             if set_size > 1:
339 |                 coeff = num_edges / (set_size * (set_size - 1))
340 |                 coeff = round(coeff, 3)
341 |             coefficients[i] = coeff
342 | 
343 |         return coefficients
344 | 
345 |     def get_snap_effect(self):
346 |         """ Read in pre-calculated snap results
347 |         :return: snap scores per residue and mutation
348 |         """
349 |         snap_scores = dict()
350 |         with open(self.snap_file) as f:
351 |             for line in f:
352 |                 if "=>" in line:
353 |                     splitted = line.split("=>")
354 |                     identifier = splitted[0].strip()
355 |                     scores = splitted[1].split("|")
356 |                     sum_tmp = scores[len(scores) - 1].strip()
357 |                     sum_parts = sum_tmp.split("=")
358 |                     sum_final = int(sum_parts[1].strip())
359 |                     snap_scores[identifier] = sum_final
360 |                     # print(identifier + "\t" + str(sum_final))
361 |         return snap_scores
362 | 
363 |     def determine_snap_thresh(self, scores):
364 |         """Determine the smallest value for SNAP2 still to be classified as having an effect
365 |         :param scores
366 |         :return smallest value
367 |         """
368 |         values = list()
369 |         for key in scores.keys():
370 |             values.append(scores[key])
371 |         sorted_values = sorted(values)
372 |         index = int(len(values) * (1 - self.snap_percentage))
373 |         thresh = sorted_values[index]
374 | 
375 |         return thresh
376 | 
377 |     def get_evc_effect(self):
378 |         """ Read in pre-calculated EVmutation results
379 |         :return EVmutation scores per residue and mutation
380 |         """
381 |         evc_scores = dict()
382 |         with open(self.evmut_file) as f:
383 |             for line in f:
384 |                 splitted = line.strip().split()
385 |                 # print(splitted)
386 |                 key = splitted[0]
387 |                 # print(key)
388 |                 value = float(splitted[1])
389 |                 evc_scores[key] = value
390 |         return evc_scores
391 | 
392 |     def determine_evc_thresh(self, scores):
393 |         """Determine the smallest value for EVmutation still to be classified as having an effect
394 |         :param scores
395 |         :return smallest value
396 |         """
397 |         positive_scores = list()
398 |         for key in scores.keys():
399 |             val = scores[key]
400 |             val = abs(val)
401 |             positive_scores.append(val)
402 |         sorted_scores = sorted(positive_scores)
403 |         size = len(sorted_scores)
404 |         num_scores = int(size * self.evc_percentage)
405 |         index = size - num_scores
406 |         thresh = sorted_scores[index]
407 |         return thresh
408 | 
409 |     def get_blosum_diff(self, scores, thresh, method):
410 |         """ Calculate difference between BLOSUM and effect predictions
411 |         :param scores: effect prediction scores
412 |         :param thresh: smallest value to still classify as having an effect
413 |         :param method: method used to predict effects
414 |         :return BLOSUM difference for each residue
415 |         """
416 |         blosum_diff = dict()
417 |         for i in range(1, self.length + 1):
418 |             prog = re.compile("[A-Z]" + str(i) + "[A-Z]")
419 |             num_accepted = 0
420 |             num_mutations = 0
421 |             for mut in scores.keys():
422 |                 # print(mut)
423 |                 res = prog.search(mut)
424 |                 # print(res)
425 |                 if res is not None:
426 |                     aa1 = mut[:1]
427 |                     aa2 = mut[len(mut) - 1:]
428 |                     blosum_score = self.get_blosum_score(aa1, aa2)
429 |                     # print(mut + "\t" + aa1 + "\t" + aa2 + "\t" + str(blosum_score))
430 | 
431 |                     if blosum_score >= self.blosum_thresh:
432 |                         num_accepted += 1
433 |                         mut_score = scores[mut]
434 |                         if method == 'evmut':
435 |                             mut_score = abs(mut_score)
436 |                         if mut_score >= thresh:
437 |                             num_mutations += 1
438 |             diff = 0.0
439 |             if num_accepted > 0:
440 |                 diff = num_mutations / num_accepted
441 |                 diff = round(diff, 3)
442 |             blosum_diff[i] = diff
443 |             # print(str(i) + "\t" + str(diff))
444 |         return blosum_diff
445 | 
446 |     def read_blosum_matrix(self):
447 |         """Read the specified BLOSUM matrix"""
448 |         line_num = 1
449 |         pos_mapping = dict()
450 |         with open(self.blosum_file) as f:
451 |             for line in f:
452 |                 splitted = line.strip().split()
453 |                 if line_num == 1:
454 |                     for i in range(0, len(splitted)):
455 |                         pos = i + 1
456 |                         pos_mapping[pos] = splitted[i]
457 |                 else:
458 |                     pos1 = splitted[0]
459 |                     for j in range(1, len(splitted)):
460 |                         pos2 = pos_mapping[j]
461 |                         self.blosum_mat[pos1][pos2] = int(splitted[j])
462 |                 line_num += 1
463 | 
464 |     def get_blosum_score(self, aa1, aa2):
465 |         return self.blosum_mat[aa1][aa2]
466 | 
467 | 
468 | def error(*objs):
469 |     print("ERROR: ", *objs, file=sys.stderr)
470 | 
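
To make the arithmetic in `calc_clustering_coefficients` concrete, here is a standalone sketch of the same computation on a toy set of couplings. The function name `clustering_coefficients` and the toy scores are hypothetical and exist only for this illustration. Because the double loop visits both `(n1, n2)` and `(n2, n1)`, every edge between two neighbours is counted twice, so dividing by `k * (k - 1)` gives the usual local clustering coefficient `2E / (k * (k - 1))`.

```python
def clustering_coefficients(scores, length):
    """Local clustering coefficient per residue.

    scores: dict keyed by 'i/j' with i <= j (same key layout as the EC scores).
    length: number of residues, numbered from 1.
    """
    def key(a, b):
        return f"{min(a, b)}/{max(a, b)}"

    coefficients = {}
    for i in range(1, length + 1):
        # Residues that share a retained coupling with residue i.
        neighbours = {j for j in range(1, length + 1) if key(i, j) in scores}

        # Ordered pairs of neighbours that are themselves coupled;
        # each undirected edge is counted twice.
        ordered_pairs = sum(1
                            for n1 in neighbours
                            for n2 in neighbours
                            if key(n1, n2) in scores)

        k = len(neighbours)
        coefficients[i] = round(ordered_pairs / (k * (k - 1)), 3) if k > 1 else 0
    return coefficients


# Toy couplings: residues 1, 2, 3 form a triangle; residue 4 only touches 1.
toy_scores = {"1/2": 0.9, "1/3": 0.8, "2/3": 0.7, "1/4": 0.6}
coeffs = clustering_coefficients(toy_scores, 4)
```

In the toy example, residue 1 has three neighbours (2, 3, 4) of which only the pair 2/3 is coupled, giving 2 / (3 * 2) = 0.333, while residues 2 and 3 each have two neighbours that are coupled to each other, giving 1.0.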


--------------------------------------------------------------------------------
/bindPredictML17/scripts/select_model.py:
--------------------------------------------------------------------------------
 1 | class SelectModel(object):
 2 |     @staticmethod
 3 |     def define_model(model):
 4 |         cols_to_remove = []
 5 |         ranges = []
 6 |         words = []
 7 | 
 8 |         if model == "mm3_cons_solv_more":
 9 |             cols_to_remove = [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14,
10 |                               16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 36, 37, 38, 39,
11 |                               41, 42, 43, 44, 46, 47, 48, 49]
12 |             ranges = []
13 |             words = ["no", "yes"]
14 |         else:
15 |             print("Unknown model specified")
16 | 
17 |         return cols_to_remove, ranges, words
18 | 
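
The `cols_to_remove` list for `mm3_cons_solv_more` contains every index from 1 to 49 that is not a multiple of 5. The sketch below relates that list to the 50-element vector built in `Protein.calc_scores` (10 features, each with the window order r, r-1, r+1, r-2, r+2). It assumes that the indices are 0-based positions in that vector; how the calling code actually applies them is not shown here, so treat the interpretation as an assumption. `FEATURES`, `WINDOW_OFFSETS`, `labels`, and `kept` are hypothetical names used only for this illustration.

```python
# Feature order as assembled in Protein.calc_scores.
FEATURES = ["Cum", "Cum_Solv", "Cum_Dist", "Cluster", "Cluster_Solv",
            "Cluster_Dist", "EVmut", "SNAP", "Solv", "Cons"]
# Window order within each feature: r, r-1, r+1, r-2, r+2.
WINDOW_OFFSETS = [0, -1, 1, -2, 2]

# One label per position of the 50-element score vector.
labels = [f"{feat}[{off:+d}]" for feat in FEATURES for off in WINDOW_OFFSETS]

# Same values as the literal list in define_model.
cols_to_remove = [c for c in range(50) if c % 5 != 0]

# Assumption: indices are 0-based positions in the score vector.
kept = [labels[c] for c in range(50) if c not in cols_to_remove]
```

Under that assumption, the columns that remain are the ten values for the central residue itself (offset 0), one per feature.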


--------------------------------------------------------------------------------
/bindPredictML17/trained_model/mm3_cons_solv_more.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/bindPredictML17/trained_model/mm3_cons_solv_more.pkl


--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
  1 | from pandas import DataFrame
  2 | import torch
  3 | import os
  4 | import numpy
  5 | import h5py
  6 | import random
  7 | from collections import defaultdict
  8 | 
  9 | 
 10 | class FileSetter(object):
 11 | 
 12 |     @staticmethod
 13 |     def embeddings_input():
 14 |         return ''
 15 |         # TODO set path to embeddings; this should be an .h5 file containing per-residue embeddings for all
 16 |         #  proteins with key: UniProt-ID, value: embeddings
 17 | 
 18 |     @staticmethod
 19 |     def predictions_folder():
 20 |         return ''  # TODO set path to where predictions should be written
 21 | 
 22 |     @staticmethod
 23 |     def profile_db():
 24 |         # TODO set path to pre-computed big_80 database
 25 |         # can be downloaded from ftp://rostlab.org/bindEmbed21/profile_db.tar.gz
 26 |         return ''
 27 | 
 28 |     @staticmethod
 29 |     def lookup_fasta():
 30 |         # TODO set path to FASTA file of lookup set
 31 |         # can be downloaded from ftp://rostlab.org/bindEmbed21/lookup.fasta
 32 |         return ''
 33 | 
 34 |     @staticmethod
 35 |     def lookup_db():
 36 |         # TODO set path to pre-computed lookup database
 37 |         # can be downloaded from ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz
 38 |         return ''
 39 | 
 40 |     @staticmethod
 41 |     def mmseqs_output():
 42 |         # TODO set path to where MMseqs2 output folder should be written.
 43 |         # tmp files will be stored in this folder in a sub-directory tmp/
 44 |         # predictions will be stored in this folder in a sub-directory hbi_predictions/
 45 |         return ''
 46 | 
 47 |     @staticmethod
 48 |     def query_set():
 49 |         return ''  # TODO set path to FASTA set of query sequences to generate predictions for
 50 | 
 51 |     @staticmethod
 52 |     def mmseqs_path():
 53 |         return ''  # TODO set path to MMseqs2 installation
 54 | 
 55 |     @staticmethod
 56 |     def split_ids_in():
 57 |         return 'data/development_set/ids_split'  # cv splits used during development; available on GitHub
 58 | 
 59 |     @staticmethod
 60 |     def test_ids_in():
 61 |         return 'data/development_set/uniprot_test.txt'  # test ids used during development; available on GitHub
 62 | 
 63 |     @staticmethod
 64 |     def fasta_file():
 65 |         return 'data/development_set/all.fasta'  # path to development set; available on GitHub
 66 | 
 67 |     @staticmethod
 68 |     def binding_residues_by_ligand(ligand):
 69 |         return 'data/development_set/binding_residues_2.5_{}.txt'.format(ligand)
 70 |         # files with binding labels used during development; available on GitHub
 71 | 
 72 | 
 73 | class FileManager(object):
 74 | 
 75 |     @staticmethod
 76 |     def read_ids(file_in):
 77 |         """
 78 |         Read ids from a text file into a list
 79 |         :param file_in: path to a file with one id per line
 80 |         :return: list of ids in file order
 81 |         """
 82 |         ids = []
 83 |         with open(file_in) as read_in:
 84 |             for line in read_in:
 85 |                 ids.append(line.strip())
 86 | 
 87 |         return ids
 88 | 
 89 |     @staticmethod
 90 |     def read_fasta(file_in):
 91 |         """
 92 |         Read sequences from FASTA file
 93 |         :param file_in:
 94 |         :return: dict with key: ID, value: sequence
 95 |         """
 96 |         sequences = dict()
 97 |         current_id = None
 98 | 
 99 |         with open(file_in) as read_in:
100 |             for line in read_in:
101 |                 line = line.strip()
102 |                 if line.startswith(">"):
103 |                     current_id = line[1:]
104 |                     sequences[current_id] = ''
105 |                 else:
106 |                     sequences[current_id] += line
107 | 
108 |         return sequences
109 | 
110 |     @staticmethod
111 |     def read_embeddings(file_in):
112 |         """
113 |         Read embeddings from .h5-file
114 |         :param file_in:
115 |         :return: dict with key: ID, value: per-residue embeddings
116 |         """
117 | 
118 |         embeddings = dict()
119 |         with h5py.File(file_in, 'r') as f:
120 |             for key, embedding in f.items():
121 |                 embeddings[key] = numpy.array(embedding, dtype=numpy.float32)
122 | 
123 |         return embeddings
124 | 
125 |     @staticmethod
126 |     def read_binding_residues(file_in):
127 |         """
128 |         Read binding residues from file
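129 |         :param file_in: file with one line per protein: ID, then comma-separated residue positions
130 |         :return: dict with key: ID, value: list of residue positions (int)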
131 |         """
132 |         binding = dict()
133 | 
134 |         with open(file_in) as read_in:
135 |             for line in read_in:
136 |                 splitted_line = line.strip().split()
137 |                 if len(splitted_line) > 1:
138 |                     identifier = splitted_line[0]
139 |                     residues = splitted_line[1].split(',')
140 |                     residues_int = [int(r) for r in residues]
141 | 
142 |                     binding[identifier] = residues_int
143 | 
144 |         return binding
145 | 
146 |     @staticmethod
147 |     def read_mmseqs_alignments(file_in):
148 |         """Read MMseqs2 alignments into a nested dict: query ID -> target ID -> alignment info"""
149 | 
150 |         mmseqs = defaultdict(dict)
151 | 
152 |         with open(file_in) as read_in:
153 |             for line in read_in:
154 |                 splitted_line = line.strip().split()
155 |                 query_id = splitted_line[0]
156 |                 target_id = splitted_line[1]
157 |                 qstart = int(splitted_line[5])
158 |                 tstart = int(splitted_line[6])
159 |                 qaln = splitted_line[7]
160 |                 taln = splitted_line[8]
161 | 
162 |                 mmseqs[query_id][target_id] = {'qstart': qstart, 'tstart': tstart, 'qaln': qaln, 'taln': taln}
163 | 
164 |         return mmseqs
165 | 
166 |     @staticmethod
167 |     def save_cv_results(cv_results, file):
168 |         """
169 |         Save CV results to csv file
170 |         :param cv_results: dict of CV results, convertible via DataFrame.from_dict
171 |         :param file: path of the csv file to write
172 |         :return:
173 |         """
174 | 
175 |         cv_dataframe = DataFrame.from_dict(cv_results)
176 |         cv_dataframe.to_csv(path_or_buf=file)
177 | 
178 |     @staticmethod
179 |     def save_classifier_torch(classifier, model_path):
180 |         """Save trained model to model_path"""
181 |         torch.save(classifier, model_path)
182 | 
183 |     @staticmethod
184 |     def load_classifier_torch(model_path):
185 |         """Load saved model onto the GPU if available, otherwise onto the CPU"""
186 |         if torch.cuda.is_available():
187 |             device = 'cuda:0'
188 |         else:
189 |             device = 'cpu'
190 |         classifier = torch.load(model_path, map_location=device)
191 |         return classifier
192 | 
193 |     @staticmethod
194 |     def write_predictions(proteins, out_folder, cutoff, ri):
195 |         """
196 |         Write predictions for a set of proteins
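197 |         :param proteins: dict with key: ID, value: object holding the per-residue predictions
198 |         :param out_folder: folder to write one <ID>.bindPredict_out file per protein to
199 |         :param cutoff: Cutoff to define whether a residue is binding or not
200 |         :param ri: If True, write RI; otherwise write raw probabilities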
201 |         :return:
202 |         """
203 |         for k in proteins.keys():
204 |             p = proteins[k]
205 |             predictions = p.predictions
206 |             predictions = predictions.squeeze()
207 |             out_file = os.path.join(out_folder, (k + '.bindPredict_out'))
208 | 
209 |             FileManager.write_predictions_single_protein(out_file, predictions, cutoff, ri)
210 | 
211 |     @staticmethod
212 |     def write_predictions_single_protein(out_file, predictions, cutoff, ri):
213 |         """ Write predictions for a specific protein """
214 |         with open(out_file, 'w') as out:
215 |             if ri:
216 |                 out.write("Position\tMetal.RI\tMetal.Class\tNuc.RI\tNuc.Class\tSmall.RI\tSmall.Class\tAny.Class\n")
217 |             else:
218 |                 out.write("Position\tMetal.Proba\tMetal.Class\tNuclear.Proba\tNuclear.Class\tSmall.Proba\tSmall.Class"
219 |                           "\tAny.Class\n")
220 |             for idx, p in enumerate(predictions):
221 |                 pos = idx + 1
222 | 
223 |                 metal_proba = p[0]
224 |                 nuc_proba = p[1]
225 |                 small_proba = p[2]
226 | 
227 |                 metal_ri = GeneralInformation.convert_proba_to_ri(metal_proba)
228 |                 nuc_ri = GeneralInformation.convert_proba_to_ri(nuc_proba)
229 |                 small_ri = GeneralInformation.convert_proba_to_ri(small_proba)
230 | 
231 |                 metal_label = GeneralInformation.get_predicted_label(metal_proba, cutoff)
232 |                 nuc_label = GeneralInformation.get_predicted_label(nuc_proba, cutoff)
233 |                 small_label = GeneralInformation.get_predicted_label(small_proba, cutoff)
234 | 
235 |                 overall_label = 'nb'
236 |                 if metal_label == 'b' or nuc_label == 'b' or small_label == 'b':
237 |                     overall_label = 'b'
238 | 
239 |                 if ri:
240 |                     out.write('{}\t{:.3f}\t{}\t{:.3f}\t{}\t{:.3f}\t{}\t{}\n'.format(pos, metal_ri, metal_label, nuc_ri,
241 |                                                                                     nuc_label, small_ri, small_label,
242 |                                                                                     overall_label))
243 |                 else:
244 |                     out.write('{}\t{:.3f}\t{}\t{:.3f}\t{}\t{:.3f}\t{}\t{}\n'.format(pos, metal_proba, metal_label,
245 |                                                                                     nuc_proba, nuc_label, small_proba,
246 |                                                                                     small_label, overall_label))
247 | 
248 | 
249 | class GeneralInformation(object):
250 | 
251 |     @staticmethod
252 |     def get_predicted_label(proba, cutoff):
253 |         if proba >= cutoff:
254 |             return 'b'
255 |         else:
256 |             return 'nb'
257 | 
258 |     @staticmethod
259 |     def convert_proba_to_ri(proba):
260 |         """Convert probability to RI ranging from 0 to 9"""
261 | 
262 |         if proba < 0.5:
263 |             ri = round((0.5 - proba) * 9 / 0.5)
264 |         else:
265 |             ri = round((proba - 0.5) * 9 / 0.5)
266 | 
267 |         return ri
268 | 
269 |     @staticmethod
270 |     def seed_worker(worker_id):
271 |         worker_seed = torch.initial_seed() % 2 ** 32
272 |         numpy.random.seed(worker_seed)
273 |         random.seed(worker_seed)
274 | 
275 |     @staticmethod
276 |     def seed_all(seed):
277 |         if seed is None:
278 |             seed = 10
279 | 
280 |         # print("[ Using Seed : ", seed, " ]")
281 | 
282 |         torch.manual_seed(seed)
283 |         torch.cuda.manual_seed_all(seed)
284 |         torch.cuda.manual_seed(seed)
285 |         numpy.random.seed(seed)
286 |         random.seed(seed)
287 |         torch.backends.cudnn.deterministic = True
288 |         torch.backends.cudnn.benchmark = False
289 | 
290 |     @staticmethod
291 |     def remove_padded_positions(pred, target, i):
292 |         indices = (i[i.shape[0] - 1, :] != 0).nonzero()
293 | 
294 |         pred_i = pred[:, indices].squeeze()
295 |         target_i = target[:, indices].squeeze()
296 | 
297 |         return pred_i, target_i
298 | 
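
The probability-to-RI mapping in `GeneralInformation.convert_proba_to_ri` can be checked in isolation. The sketch below copies the formula and the cutoff rule into standalone functions (`proba_to_ri` and `predicted_label` are hypothetical names, not part of the repo). RI measures the distance from 0.5, so probabilities of 0.0 and 1.0 both map to 9. Python 3's `round` rounds halves to the nearest even integer, so a probability of 0.75 (scaled value 4.5) yields 4, not 5.

```python
def proba_to_ri(proba):
    """Standalone copy of the formula in convert_proba_to_ri (hypothetical helper name)."""
    if proba < 0.5:
        return round((0.5 - proba) * 9 / 0.5)
    return round((proba - 0.5) * 9 / 0.5)


def predicted_label(proba, cutoff):
    """Standalone copy of get_predicted_label: 'b' (binding) at or above the cutoff."""
    return 'b' if proba >= cutoff else 'nb'


examples = [0.0, 0.5, 0.75, 0.9, 1.0]
ris = [proba_to_ri(p) for p in examples]
print(ris)  # [9, 0, 4, 7, 9]

# The cutoff comparison is inclusive, so a probability equal to the cutoff is 'b'.
print(predicted_label(0.5, 0.5), predicted_label(0.49, 0.5))  # b nb
```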


--------------------------------------------------------------------------------
/data/development_set/ids_split1.txt:
--------------------------------------------------------------------------------
  1 | Q5LL55
  2 | H9L4N9
  3 | O34738
  4 | P39579
  5 | P01887
  6 | O32221
  7 | P0CL67
  8 | A7VAB4
  9 | O60895
 10 | P86179
 11 | P58568
 12 | Q9Y3B4
 13 | Q9NWV4
 14 | Q9KQN0
 15 | P85511
 16 | Q2VE61
 17 | P18138
 18 | Q9H5X1
 19 | Q7Z4H3
 20 | B2FQ63
 21 | Q05097
 22 | P20116
 23 | F6KMV5
 24 | A9CID9
 25 | P46926
 26 | D0VWR5
 27 | O34918
 28 | Q8TLY9
 29 | Q81HL8
 30 | Q9RJC1
 31 | P15570
 32 | Q9RY97
 33 | B3Y002
 34 | P07386
 35 | P77754
 36 | P94690
 37 | C9K1X5
 38 | P04382
 39 | O32108
 40 | P32021
 41 | P17900
 42 | P43933
 43 | P0AE05
 44 | D0VWX2
 45 | P49789
 46 | A0A384LKY8
 47 | P67700
 48 | Q58830
 49 | O87496
 50 | Q52424
 51 | C3SZN7
 52 | P50049
 53 | P16108
 54 | P05161
 55 | Q9RN60
 56 | P0CX80
 57 | G0Z026
 58 | P67809
 59 | P83467
 60 | P10175
 61 | F1NZ18
 62 | Q47765
 63 | Q82Z21
 64 | H9NAL3
 65 | Q9KCV1
 66 | Q07341
 67 | E7E815
 68 | A7UQX3
 69 | P04038
 70 | O68396
 71 | Q8Z4D7
 72 | Q77DJ5
 73 | P48061
 74 | Q8XMB9
 75 | P27999
 76 | I0BZV0
 77 | D1CIZ5
 78 | Q0P9D1
 79 | Q74G82
 80 | P07737
 81 | Q00277
 82 | Q96FJ2
 83 | P02876
 84 | Q8GGH0
 85 | P04418
 86 | P29256
 87 | Q1Q7P3
 88 | P03606
 89 | Q05776
 90 | P23907
 91 | P14135
 92 | P0A6I0
 93 | Q65NU7
 94 | Q5LAA6
 95 | Q38HX3
 96 | Q6XBH1
 97 | Q96YV5
 98 | Q8RQE7
 99 | O58552
100 | P08190
101 | Q8NI22
102 | P67825
103 | L7P7R7
104 | P06762
105 | P76077
106 | P81180
107 | Q65163
108 | Q5NHD0
109 | Q9KTK0
110 | G2QAB5
111 | P22483
112 | P12238
113 | Q82SE2
114 | Q6SJ71
115 | O14960
116 | Q8ZVX8
117 | P12240
118 | D6ES11
119 | P0ABE5
120 | Q6ZYH1
121 | P24321
122 | Q6MHT0
123 | P08773
124 | P17413
125 | Q966X9
126 | A7J283
127 | I3VE74
128 | P16458
129 | P0ABS1
130 | Q89IH6
131 | Q9HD34
132 | O13881
133 | Q56320
134 | B3CET1
135 | Q8AA93
136 | P05109
137 | O26253
138 | P04918
139 | O26413
140 | Q9WZF7
141 | Q91EV7
142 | A6ZWK1
143 | P37975
144 | Q04416
145 | Q7A260
146 | Q7VYU0
147 | O86327
148 | P21149
149 | Q8MTC1
150 | A7V213
151 | P00974
152 | Q8RM02
153 | O31562
154 | O25671
155 | O51419
156 | Q9UG22
157 | O28028
158 | Q9RFC8
159 | P10971
160 | P0ACI6
161 | P09132
162 | Q97TX9
163 | P12733
164 | P14121
165 | P09353
166 | P13183
167 | P00147
168 | P0A6X7
169 | Q8TL28
170 | P24473
171 | A4CYJ6
172 | O29089
173 | Q9HLN2
174 | P03051
175 | Q5HHM4
176 | P12312
177 | Q27415
178 | Q8ZPA3
179 | P10282
180 | Q8WWA0
181 | P03882
182 | A0QRC4
183 | O27725
184 | Q5SJH3
185 | O66665
186 | Q7PXT9
187 | P60618
188 | Q8I3W2
189 | Q0PB48
190 | P39135
191 | P02883
192 | P62669
193 | O43924
194 | B5Z8Y5
195 | P74917
196 | P58566
197 | P80312
198 | P59087
199 | Q983T0
200 | Q6NDF6
201 | Q9Z1R3
202 | P32761
203 | B2J933
204 | 


--------------------------------------------------------------------------------
/data/development_set/ids_split2.txt:
--------------------------------------------------------------------------------
  1 | Q4QRE8
  2 | Q88R27
  3 | A4GRE3
  4 | Q9X0H1
  5 | P04531
  6 | Q4J9K9
  7 | Q8EDL6
  8 | P80377
  9 | P33678
 10 | A3D092
 11 | P0C794
 12 | Q5JIH3
 13 | Q9X1W7
 14 | Q0KFV0
 15 | P32411
 16 | Q91IE3
 17 | Q8IK57
 18 | Q2PYN0
 19 | A7XUK7
 20 | Q748S4
 21 | Q8E9W8
 22 | A6T925
 23 | Q93SX1
 24 | B9HM14
 25 | Q32WH4
 26 | Q9LA15
 27 | Q8U440
 28 | Q53W28
 29 | Q9BTP7
 30 | P80176
 31 | A4FTK7
 32 | Q96CS7
 33 | Q71AW2
 34 | P0A7D1
 35 | Q99584
 36 | P13988
 37 | A9A2G4
 38 | Q00457
 39 | Q48EL2
 40 | P80721
 41 | Q8VZS8
 42 | Q7NSA6
 43 | P72583
 44 | C1IFD2
 45 | Q805N7
 46 | Q55670
 47 | Q47UY7
 48 | Q55330
 49 | Q5KYT1
 50 | O43255
 51 | B4F366
 52 | P32321
 53 | P64488
 54 | Q5T4W7
 55 | Q9HQM9
 56 | P08525
 57 | Q8DM36
 58 | Q9ALS3
 59 | P02244
 60 | Q54X65
 61 | A0A0F7RDM3
 62 | P26789
 63 | Q472T4
 64 | Q57829
 65 | P15369
 66 | P0CI74
 67 | P54173
 68 | P09598
 69 | Q0B0G9
 70 | Q9UUI1
 71 | Q9QZ88
 72 | P22139
 73 | Q5Y812
 74 | O95639
 75 | Q9RMI1
 76 | P68919
 77 | Q5KVJ9
 78 | O43598
 79 | Q8YJY7
 80 | Q9TVF0
 81 | S0BAP9
 82 | P12737
 83 | P13271
 84 | Q0P891
 85 | Q8E372
 86 | P0AC25
 87 | Q8RFH4
 88 | Q79ZR9
 89 | E0TW95
 90 | P35530
 91 | Q3SFD8
 92 | P74903
 93 | Q8KFZ1
 94 | Q1HEA2
 95 | P05019
 96 | P53738
 97 | Q9HL57
 98 | P00273
 99 | Q7X0D9
100 | Q8YVQ2
101 | Q96D96
102 | Q9DEF8
103 | P14119
104 | Q2FZE9
105 | P52391
106 | P08203
107 | P65121
108 | P81605
109 | P11456
110 | Q64288
111 | Q9KKU4
112 | A9CHM9
113 | P01543
114 | P04168
115 | P26788
116 | Q4K423
117 | Q9A1G2
118 | O43583
119 | P02210
120 | P00423
121 | B9A5C1
122 | O25841
123 | Q8IU26
124 | Q58667
125 | Q9RSM4
126 | Q72QS5
127 | P58246
128 | Q9S508
129 | P44887
130 | Q9LUV2
131 | P03042
132 | Q97G05
133 | Q8VL32
134 | P37001
135 | Q5SJ76
136 | P03206
137 | P06465
138 | B9A0T7
139 | Q9BUL8
140 | Q8ZC05
141 | Q6L6Q4
142 | Q96HA8
143 | Q5YUF0
144 | Q92M25
145 | P60022
146 | Q9NAV8
147 | Q8YQN0
148 | D5MNX7
149 | D8NA05
150 | P22452
151 | P14116
152 | Q9Y657
153 | Q5LUH0
154 | Q16873
155 | D2Z0P1
156 | O25423
157 | Q9RIM2
158 | D5X329
159 | P41261
160 | A2SL15
161 | C6KFA4
162 | Q9WYJ5
163 | O28416
164 | O58307
165 | P10153
166 | Q9NRN9
167 | C6AAT5
168 | O58836
169 | Q8IWS0
170 | P19753
171 | P08877
172 | P39639
173 | Q67XG0
174 | P07471
175 | Q6XQ58
176 | P9WLS2
177 | Q2UP89
178 | Q97WD2
179 | Q5F9K6
180 | P21163
181 | P0A805
182 | P76632
183 | Q82EE4
184 | P82291
185 | P60033
186 | Q0RYE0
187 | Q7NSZ5
188 | P76014
189 | O02372
190 | P71650
191 | O13610
192 | P32055
193 | S3TFW2
194 | P0A410
195 | B0R5M0
196 | P75430
197 | Q5SK02
198 | Q9FBN7
199 | P02966
200 | Q96A44
201 | Q936H5
202 | A0A0H2W6Y8
203 | P38636
204 | 


--------------------------------------------------------------------------------
/data/development_set/ids_split3.txt:
--------------------------------------------------------------------------------
  1 | P23827
  2 | Q2G4M1
  3 | Q8TVS2
  4 | P43777
  5 | Q6N4M9
  6 | P46952
  7 | C1D7P6
  8 | P32173
  9 | Q8TQ93
 10 | A1TY68
 11 | Q504Y3
 12 | Q9I3C8
 13 | Q5SHN3
 14 | P69506
 15 | B8FYU2
 16 | O25064
 17 | Q8VK10
 18 | P06107
 19 | Q9KAX6
 20 | P12732
 21 | Q9HRE7
 22 | Q65LG7
 23 | Q46582
 24 | Q0SDB1
 25 | Q8K2P6
 26 | Q8QL27
 27 | Q6JXI5
 28 | A8ALU5
 29 | P03712
 30 | A0QRS5
 31 | Q5UQA5
 32 | Q5L811
 33 | P04608
 34 | Q9V535
 35 | Q9ZAG3
 36 | P39186
 37 | A7Z2A9
 38 | Q0PBW2
 39 | A8MT69
 40 | P39301
 41 | Q01469
 42 | Q06549
 43 | Q2F9Z1
 44 | Q9BX68
 45 | P72761
 46 | Q57DY1
 47 | Q15036
 48 | Q9K0G4
 49 | O82040
 50 | O43809
 51 | Q9REI7
 52 | Q9JYL1
 53 | Q8T893
 54 | O66511
 55 | B6F143
 56 | A7UNK4
 57 | P19267
 58 | P63272
 59 | Q10589
 60 | Q64368
 61 | P04355
 62 | Q37964
 63 | D5CN26
 64 | Q2RRQ9
 65 | Q71TT1
 66 | P18317
 67 | Q97WQ4
 68 | Q9NG96
 69 | C4LSE7
 70 | Q74AE4
 71 | Q5ZYB7
 72 | Q91LD0
 73 | P03047
 74 | P00430
 75 | Q4CQE2
 76 | Q4PP54
 77 | P12306
 78 | P0ACT8
 79 | O24984
 80 | D0VWV4
 81 | F2Z288
 82 | Q8GPH6
 83 | Q6NEL2
 84 | Q99WU4
 85 | O30176
 86 | Q9HSF4
 87 | Q3AB29
 88 | P0A8J2
 89 | P12946
 90 | A0A2B6C3P9
 91 | P84887
 92 | V9P0A9
 93 | F9UMW3
 94 | Q9LUJ3
 95 | P01574
 96 | Q1CAM3
 97 | Q7RTV0
 98 | P69783
 99 | Q2FZ56
100 | Q9RQB9
101 | Q9NA73
102 | Q5KXY4
103 | Q3ZDQ9
104 | B5THI3
105 | P82242
106 | Q5ZT91
107 | P0A6D0
108 | Q70LE8
109 | Q9UHP7
110 | Q95VR4
111 | P07013
112 | Q82ZC8
113 | P35804
114 | Q9RHJ6
115 | Q7M4F9
116 | P00641
117 | O41094
118 | Q8SUP0
119 | P15927
120 | Q7DB61
121 | D0A8L1
122 | Q8RCC3
123 | P08246
124 | A2KD59
125 | P07470
126 | O41136
127 | A7N805
128 | Q8GTM0
129 | Q89VT6
130 | P13513
131 | Q8DP99
132 | O18426
133 | Q9SN73
134 | Q8A8A4
135 | Q8RLX2
136 | P0CY08
137 | O96048
138 | Q8NQA4
139 | P0A731
140 | Q8DLM0
141 | Q739M9
142 | Q00433
143 | D3E4S5
144 | Q8EII5
145 | Q7CPA2
146 | Q58AD3
147 | O34911
148 | Q6N0K4
149 | Q9HV14
150 | P80382
151 | Q9YEQ6
152 | P9WPG7
153 | P21816
154 | P19508
155 | Q99616
156 | Q7NSS5
157 | Q5QUJ8
158 | P22450
159 | P76364
160 | Q9KXR9
161 | P18434
162 | Q5SHP2
163 | P02775
164 | E0TXX3
165 | L7R9I1
166 | E0TY72
167 | Q9VAK8
168 | Q45488
169 | P0ADF8
170 | Q05315
171 | P50618
172 | P59082
173 | O59893
174 | P29602
175 | Q5LD59
176 | P62010
177 | P11690
178 | P17728
179 | P23873
180 | P0A3Y1
181 | O77421
182 | Q9F1R6
183 | P03126
184 | O74017
185 | Q8YYB7
186 | P10970
187 | F4JL28
188 | D6E269
189 | P79085
190 | Q9JKE2
191 | P0AAM2
192 | Q14493
193 | O34994
194 | J7MFT5
195 | Q5SHE1
196 | Q59646
197 | Q72LF4
198 | Q7SID0
199 | Q74ML9
200 | P74564
201 | P50616
202 | M1MR49
203 | Q8NT71
204 | 


--------------------------------------------------------------------------------
/data/development_set/ids_split4.txt:
--------------------------------------------------------------------------------
  1 | P65401
  2 | D7GNB4
  3 | P01175
  4 | P49458
  5 | O95931
  6 | P60619
  7 | Q5SLL2
  8 | Q64438
  9 | A9CKT4
 10 | Q9KJ88
 11 | Q7MVF6
 12 | O53623
 13 | P20047
 14 | Q9KKG7
 15 | Q3M4Y5
 16 | P06148
 17 | Q44635
 18 | Q8WQK3
 19 | P44096
 20 | A0A0F7RHX8
 21 | P04570
 22 | P47224
 23 | Q38242
 24 | P23657
 25 | Q9UK53
 26 | P69202
 27 | P12376
 28 | P00210
 29 | P03496
 30 | Q7DDR9
 31 | Q81JI2
 32 | O67789
 33 | Q6YRW8
 34 | Q8PJZ4
 35 | P61459
 36 | Q3JFV4
 37 | P0A8U6
 38 | Q93CG9
 39 | P0AFX7
 40 | P01552
 41 | A2RI47
 42 | Q46R41
 43 | Q1W640
 44 | P74174
 45 | Q9NPF0
 46 | Q01468
 47 | Q5A0X8
 48 | P28074
 49 | Q9IPS9
 50 | Q9A5I0
 51 | Q03148
 52 | P01555
 53 | Q8AAW0
 54 | P26790
 55 | Q9HTR2
 56 | Q9HY80
 57 | D7GXG5
 58 | Q6CK86
 59 | P84801
 60 | Q800Y1
 61 | Q53908
 62 | P61969
 63 | Q2MHR1
 64 | P56221
 65 | Q9ZKJ5
 66 | D6WJ77
 67 | P00217
 68 | Q8GAV3
 69 | Q7AKQ8
 70 | P07465
 71 | Q86SG5
 72 | Q9HJY3
 73 | Q2U8Y3
 74 | O52512
 75 | Q7PGA3
 76 | Q8XVB9
 77 | Q9X7H4
 78 | Q8N6N7
 79 | F3ZXK6
 80 | Q6IMW5
 81 | Q07524
 82 | P44558
 83 | Q04I02
 84 | Q58584
 85 | G7J032
 86 | Q6RJQ3
 87 | P10972
 88 | Q206Z5
 89 | Q9HWW1
 90 | Q08826
 91 | Q4FPZ7
 92 | Q84AQ1
 93 | P58560
 94 | L7IX95
 95 | Q27084
 96 | Q97ZF4
 97 | A5WF35
 98 | P0A333
 99 | P27838
100 | P73048
101 | Q6N5V5
102 | P56178
103 | O58368
104 | E8WYN5
105 | Q2F1K8
106 | Q5F5Y8
107 | P12239
108 | Q55169
109 | Q9ZND7
110 | Q8DQG2
111 | Q47764
112 | P77214
113 | Q1EMV2
114 | Q8NS29
115 | W8VZW3
116 | Q17091
117 | P04043
118 | P84138
119 | Q8YY42
120 | P0AC44
121 | Q4K8M0
122 | P14682
123 | O77093
124 | Q08840
125 | P69776
126 | P13689
127 | G2RUZ1
128 | P22365
129 | O59172
130 | P48060
131 | Q2ITY5
132 | P0A459
133 | B9W5G6
134 | C3W947
135 | Q58164
136 | Q04D30
137 | Q82134
138 | O26745
139 | P0AEK0
140 | O67778
141 | P27797
142 | P03536
143 | P23308
144 | Q7BSV8
145 | P39234
146 | Q8WRW5
147 | O83795
148 | Q09176
149 | I6X7F9
150 | A4DA31
151 | P52102
152 | P0AEV9
153 | P00816
154 | P40422
155 | Q8WQM6
156 | Q2FVD0
157 | Q97ZE3
158 | O76242
159 | P24005
160 | Q2G1A5
161 | P19399
162 | P03052
163 | P60568
164 | Q9FA38
165 | O32080
166 | Q2FQ04
167 | A7ATL3
168 | Q0I165
169 | P04390
170 | Q7LFL8
171 | A6TB83
172 | Q59DX8
173 | Q8YNB0
174 | Q9SUQ8
175 | O88273
176 | P27707
177 | Q1HRL7
178 | Q9ZLI1
179 | Q5SLP8
180 | O25323
181 | P38424
182 | Q973T5
183 | A5GZW8
184 | P39146
185 | Q8U1Y4
186 | A4GRC7
187 | P12241
188 | Q46864
189 | Q44057
190 | Q9X113
191 | P00383
192 | P81303
193 | K1QRB6
194 | Q0VSW0
195 | Q716G6
196 | Q8WTQ1
197 | O88188
198 | Q8Y3Y7
199 | Q6SVB6
200 | Q1LKZ5
201 | Q8A7Y0
202 | A3F715
203 | Q3Y316
204 | 


--------------------------------------------------------------------------------
/data/development_set/ids_split5.txt:
--------------------------------------------------------------------------------
  1 | P72321
  2 | A6WG04
  3 | Q9P126
  4 | Q9Y547
  5 | P48329
  6 | Q04837
  7 | Q9I3B4
  8 | P00592
  9 | A1JSS7
 10 | Q28TU0
 11 | Q9BXU0
 12 | Q928A2
 13 | P00649
 14 | P00428
 15 | P35926
 16 | Q9I4L4
 17 | Q02169
 18 | P0AFC1
 19 | Q8DKM3
 20 | P11746
 21 | P24297
 22 | P68168
 23 | P46797
 24 | C4MBE2
 25 | D5SZ58
 26 | Q7SIB3
 27 | Q9NZV6
 28 | Q39TV9
 29 | E9AND8
 30 | Q9LW31
 31 | Q9KY22
 32 | Q9H777
 33 | Q4R0L3
 34 | P01127
 35 | Q8GC87
 36 | Q1Q7P4
 37 | Q9DHS8
 38 | Q934G3
 39 | C8BD48
 40 | Q12178
 41 | Q9NPJ3
 42 | Q976J8
 43 | P67870
 44 | Q9NV35
 45 | P27198
 46 | P24224
 47 | P23154
 48 | P45718
 49 | P29717
 50 | D0E9M1
 51 | Q53396
 52 | P80734
 53 | Q8ISI9
 54 | Q5SID6
 55 | P09391
 56 | P13123
 57 | P50187
 58 | P43410
 59 | B3GA02
 60 | P25052
 61 | P49949
 62 | P62661
 63 | O29432
 64 | O42978
 65 | P35244
 66 | Q93S40
 67 | P16113
 68 | Q57468
 69 | Q9BYN0
 70 | Q92730
 71 | Q83WS1
 72 | O36005
 73 | G2SLH8
 74 | R9RY64
 75 | Q3ZTX8
 76 | Q2T8S1
 77 | Q8ZVF7
 78 | P76045
 79 | P02763
 80 | Q5KY38
 81 | P22027
 82 | Q8C6P8
 83 | P49771
 84 | P73954
 85 | P25604
 86 | P0A7F3
 87 | D0E8I5
 88 | O28604
 89 | Q9X1H4
 90 | Q8EF49
 91 | Q8W453
 92 | Q9NYG5
 93 | P32410
 94 | Q8PBM3
 95 | Q5SJ82
 96 | Q87TZ9
 97 | P02753
 98 | P03612
 99 | B3LQW9
100 | P68265
101 | Q13PU4
102 | P06010
103 | O30133
104 | P15692
105 | Q9KFA8
106 | Q9LPZ1
107 | Q9KU27
108 | P04862
109 | B3QER1
110 | Q9KCC6
111 | P27505
112 | E6Z0R4
113 | C8V0B5
114 | P37108
115 | P0AC69
116 | B3FQS5
117 | Q9WZ50
118 | Q4UNB3
119 | Q8NRS3
120 | P00099
121 | P63228
122 | Q2G8A3
123 | Q9F1K9
124 | D0E7M2
125 | O07517
126 | Q52SA8
127 | C6F3U5
128 | O28085
129 | Q9RBY6
130 | E2EKP5
131 | D1BQI7
132 | P81860
133 | Q9XG81
134 | P18000
135 | Q9LSX0
136 | V6F235
137 | Q8N5K1
138 | P13298
139 | Q6KD95
140 | Q9EV84
141 | O28417
142 | P08854
143 | Q6UV28
144 | Q9RHG4
145 | P29772
146 | M1QWM5
147 | Q00458
148 | O15263
149 | P56106
150 | P0A182
151 | P33788
152 | D3HJY4
153 | Q9SPD4
154 | A2SQK8
155 | Q86YI8
156 | Q9ES57
157 | Q2RVS1
158 | O76096
159 | Q716G8
160 | Q63P16
161 | P0CI76
162 | P95673
163 | A1U1C4
164 | L0N3Y0
165 | Q2W014
166 | Q74NK7
167 | G8XHD5
168 | Q8EAP9
169 | O31667
170 | P52754
171 | Q57851
172 | B4EDC1
173 | P41500
174 | O51615
175 | Q5FBS0
176 | P97253
177 | Q46604
178 | B6HWK0
179 | P59665
180 | Q5TA50
181 | Q9UT12
182 | Q8NJY3
183 | P0ADV7
184 | P12962
185 | O87963
186 | O66186
187 | Q6XK79
188 | P60616
189 | Q8ZPC0
190 | Q9UGC6
191 | P10599
192 | P08337
193 | P73213
194 | P08037
195 | B1YKD7
196 | P43457
197 | P0A790
198 | Q46982
199 | D1Z0H7
200 | Q5SLE7
201 | Q9D967
202 | Q2VZ87
203 | 
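
The five `ids_splitN.txt` files above hold one UniProt ID per line. `FileSetter.split_ids_in()` returns the shared prefix `data/development_set/ids_split`, which suggests that the file name is completed with the split number and `.txt`. Below is a minimal sketch of turning such files into cross-validation folds under that assumption. `load_splits` and `cv_folds` are hypothetical helpers, holding out one split at a time is the usual scheme rather than something this section confirms, and the demo writes tiny stand-in files to a temporary directory instead of reading the repo data.

```python
import os
import tempfile


def load_splits(prefix, n_splits=5):
    """Read `<prefix><n>.txt` for n = 1..n_splits, skipping blank lines."""
    splits = []
    for n in range(1, n_splits + 1):
        with open('{}{}.txt'.format(prefix, n)) as read_in:
            splits.append([line.strip() for line in read_in if line.strip()])
    return splits


def cv_folds(splits):
    """Yield (train_ids, validation_ids), holding out one split at a time."""
    for i, validation_ids in enumerate(splits):
        train_ids = [pid for j, s in enumerate(splits) if j != i for pid in s]
        yield train_ids, validation_ids


# Demo with stand-in files; each holds the first ID of the corresponding real split file.
demo_ids = ['Q5LL55', 'Q4QRE8', 'P23827', 'P65401', 'P72321']
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, 'ids_split')
    for n, pid in enumerate(demo_ids, start=1):
        with open('{}{}.txt'.format(prefix, n), 'w') as out:
            out.write(pid + '\n')
    folds = list(cv_folds(load_splits(prefix)))

print(len(folds))   # 5
print(folds[0][1])  # ['Q5LL55']
print(folds[0][0])  # ['Q4QRE8', 'P23827', 'P65401', 'P72321']
```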


--------------------------------------------------------------------------------
/data/development_set/uniprot_test.txt:
--------------------------------------------------------------------------------
  1 | D0VX23
  2 | B7J6R7
  3 | Q3IDI7
  4 | E9JSA3
  5 | P12734
  6 | P80547
  7 | P39621
  8 | Q8KRV3
  9 | Q747J2
 10 | P10245
 11 | P60848
 12 | Q9EV85
 13 | P07603
 14 | Q03243
 15 | P10868
 16 | A6L7X2
 17 | Q5SK07
 18 | P40065
 19 | Q7A1N5
 20 | Q44501
 21 | Q5NV90
 22 | Q6D2K4
 23 | P63165
 24 | Q9HC16
 25 | Q8A8Q1
 26 | P68661
 27 | Q7VWF8
 28 | Q01747
 29 | Q8GCY3
 30 | Q8ZJW4
 31 | P56930
 32 | Q72EF4
 33 | O29338
 34 | B7IE18
 35 | P00426
 36 | Q9HVI1
 37 | Q46898
 38 | Q8DRZ8
 39 | Q81BL7
 40 | P15328
 41 | Q8T6U0
 42 | P19656
 43 | Q8LGG8
 44 | O92323
 45 | A2I2W2
 46 | Q13M28
 47 | Q15369
 48 | Q9SE33
 49 | D2Z0P2
 50 | Q96GG9
 51 | Q8P4Q6
 52 | P45696
 53 | P69687
 54 | Q57587
 55 | P0CB20
 56 | Q929T5
 57 | B9DL91
 58 | Q8IJK2
 59 | G2EA45
 60 | P29460
 61 | P07059
 62 | P80882
 63 | Q8IKH2
 64 | C8WS74
 65 | Q9KFV3
 66 | Q9BVM4
 67 | Q9NVS9
 68 | A0L5S6
 69 | B7TYB2
 70 | G0RUC2
 71 | Q06151
 72 | Q96YD0
 73 | Q8DJ43
 74 | Q3E840
 75 | Q9A585
 76 | P00698
 77 | P00648
 78 | P00766
 79 | P05373
 80 | P00651
 81 | P22636
 82 | P03050
 83 | P0ACH5
 84 | P19793
 85 | P56406
 86 | Q05599
 87 | P16184
 88 | P63159
 89 | P19080
 90 | P07445
 91 | P06956
 92 | P09184
 93 | P40347
 94 | Q55389
 95 | O15527
 96 | Q5SJ80
 97 | P50861
 98 | P02263
 99 | P84229
100 | P62801
101 | P0A0Z8
102 | P45850
103 | P03772
104 | P31570
105 | P30197
106 | P05798
107 | P0AFJ5
108 | P31992
109 | P03013
110 | O29428
111 | P22061
112 | P0A7A9
113 | P08692
114 | O28597
115 | P03004
116 | Q9X1X6
117 | P44007
118 | P0AF28
119 | P0A006
120 | P0ACP7
121 | P0A881
122 | Q8A1E9
123 | P50465
124 | P25524
125 | P46859
126 | Q9HAN9
127 | P84233
128 | P62799
129 | P06897
130 | P02281
131 | Q9HLQ2
132 | P33905
133 | P00806
134 | O15496
135 | P03034
136 | P56653
137 | P06534
138 | P56839
139 | P75914
140 | P17802
141 | P10828
142 | P19921
143 | Q93PP9
144 | P13393
145 | P42938
146 | P0ACT4
147 | Q7MHK3
148 | P00630
149 | P04392
150 | P11350
151 | Q56148
152 | Q05783
153 | Q06592
154 | P0A6C1
155 | P84131
156 | Q06241
157 | P39075
158 | P03699
159 | O29975
160 | Q9SF23
161 | P34257
162 | O34598
163 | P0ADU2
164 | P06134
165 | P76502
166 | P46072
167 | O33833
168 | O59248
169 | O05510
170 | Q56408
171 | P82543
172 | Q9CAQ2
173 | Q9SE00
174 | O06553
175 | P44886
176 | P56073
177 | Q9WZC6
178 | P44688
179 | P21673
180 | P0A7I7
181 | Q48255
182 | O41156
183 | P95855
184 | O59188
185 | Q5SJ65
186 | O66188
187 | Q46389
188 | P00894
189 | O52646
190 | O34522
191 | O95989
192 | P0A6G7
193 | O25928
194 | P22069
195 | P0AGB6
196 | Q13907
197 | Q96EY8
198 | O66615
199 | Q51948
200 | P03018
201 | P33643
202 | Q9I0M3
203 | Q96EK6
204 | A1L259
205 | Q9GZZ1
206 | Q8Z2A5
207 | O31465
208 | O50580
209 | Q6T1W8
210 | P76491
211 | Q6D820
212 | Q7BSH1
213 | A9CKY2
214 | P39058
215 | P08038
216 | P0A6L4
217 | Q8TDX5
218 | Q8SXK5
219 | O31801
220 | O29537
221 | Q8NMG3
222 | P0A8P1
223 | P02958
224 | Q7A827
225 | P86383
226 | P77544
227 | P96692
228 | P0A988
229 | P00447
230 | P0ACY1
231 | P0A0I7
232 | Q6NS38
233 | P03036
234 | Q00459
235 | B8FW11
236 | P53615
237 | P32643
238 | P76143
239 | Q979C2
240 | P49914
241 | O87008
242 | O75936
243 | P0ACE7
244 | P37146
245 | Q74Y93
246 | P96693
247 | Q9I0Q8
248 | Q06S87
249 | P0CL19
250 | Q6NAM1
251 | Q51507
252 | Q7CRQ0
253 | Q66GT5
254 | P03697
255 | O74859
256 | Q9HX08
257 | P36639
258 | P40363
259 | P0C960
260 | Q9Y530
261 | B5XYG3
262 | Q9C9G4
263 | O35013
264 | P0ABN1
265 | P31580
266 | Q81UB2
267 | A5F8G9
268 | Q8G907
269 | P0A6Q3
270 | Q96GX9
271 | P27431
272 | Q9UZK4
273 | C1I202
274 | O07051
275 | P39840
276 | O32085
277 | Q6PHW0
278 | P09528
279 | Q86UY6
280 | P0A776
281 | O25094
282 | P84907
283 | Q9I3A4
284 | Q96FX7
285 | P60955
286 | I6WU39
287 | Q9KKS5
288 | O06662
289 | A0QWG5
290 | P09601
291 | P0A998
292 | Q5VWZ2
293 | P0ADA1
294 | A1KVH5
295 | O32163
296 | P0A951
297 | O59413
298 | O69787
299 | P9WKH3
300 | P9WFF7
301 | 
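
The binding-residue files that follow use one line per protein: a UniProt ID, whitespace (a tab in the data shown), and a comma-separated list of residue positions. This is the layout that `FileManager.read_binding_residues` splits on. Below is a minimal sketch of that parsing, applied to two lines copied from `binding_residues_metal.txt`. `parse_binding_line` is a hypothetical helper, `Q00000` is a made-up ID that stands in for a line without residues, and sorting the positions is an addition for readability (the files list them unsorted).

```python
def parse_binding_line(line):
    """Return (identifier, sorted positions), or None if the line has no residue field."""
    fields = line.strip().split()
    if len(fields) < 2:
        return None
    identifier = fields[0]
    positions = sorted(int(r) for r in fields[1].split(','))
    return identifier, positions


lines = [
    "P46873\t204,94",  # copied from binding_residues_metal.txt
    "P62826\t24,42",   # copied from binding_residues_metal.txt
    "Q00000",          # made-up ID without residues; skipped, as in read_binding_residues
]

parsed = (parse_binding_line(line) for line in lines)
binding = dict(entry for entry in parsed if entry is not None)
print(binding)  # {'P46873': [94, 204], 'P62826': [24, 42]}
```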


--------------------------------------------------------------------------------
/data/independent_set/binding_residues_metal.txt:
--------------------------------------------------------------------------------
  1 | B3TFG2	25,139,60,63
  2 | Q57727	67,70,79,80,63
  3 | M1GMS4	401,396,399
  4 | P0A2K1	256,66,69,231,232,71,268,270,304,306,308,309
  5 | P32021	268,189,191
  6 | C6H0Y9	134,41,107,108,80,119,120
  7 | P0CS93	70,73,75,268,269,537,622,78,182,183,536,313
  8 | F8UNJ8	755,766,751
  9 | P00918	96,64,129,34,4,132,198,36,106,15,19,119,94
 10 | Q236S9	481,610,533
 11 | P21347	353,355,260,325,297,361,363,364,461,463,401,342,377,381,319
 12 | P46873	204,94
 13 | A8E7C6	285,164,250,299,363,216,218,251,252,253,254,287
 14 | W0TJ64	81,76,77,78
 15 | A0A0H2YN38	321,323,326,204,205,1431,1433,1435,93,1439,96,98,101,426,427,1388,318
 16 | P22897	743,745,751,689,691,695,765,766
 17 | P06746	192,256,288,145,252,121,124,285,190
 18 | P29317	757,646,663
 19 | F1QCV2	176,267,174
 20 | Q84II6	197,183,253,187,333,254
 21 | P00800	398,409,415,416,417,289,419,291,293,422,423,425,426,427,429,432,370,374,378
 22 | O28951	224,159,358,359,360,361,362,427,220,221,223
 23 | O74036	141,143,144,145,146
 24 | Q709H6	64,67,140,109,145,149,92,95
 25 | A9A1Y2	449,453,295,299
 26 | P62826	24,42
 27 | D1A7C3	264,265,202,204,269,274,276,222,223,224,225,227,228,229,188,189
 28 | P00760	75,77,78,80,81,82,85
 29 | P19573	256,257,130,259,324,323,326,267,494,270,433,273,382
 30 | P02791	97,35,46,145,49,51,148,50,53,118,119,123,127
 31 | Q9Y233	515,582,519,553,554,623,664
 32 | E8MGH8	417,418,338,340
 33 | Q08499	453,455,456,620,466,502,503
 34 | Q1D0B6	64,66,69
 35 | D3DJ42	274,287
 36 | Q9X0L2	96,129,132,19,52,55
 37 | Q2XVP4	166,39,71,41,11,44,45,49,55,249,158,254
 38 | P02554	11,69
 39 | E1BQ43	74,331,333
 40 | P21589	36,38,298,299,235,237,208,243,85,117,213,214,220
 41 | H0SLX7	163,168,112,117,152,153
 42 | Q9K2N0	198,201,202,205,91,40,169,41,42,43,240,114,179,116,118,58,59,188,189,190
 43 | C7C422	208,122,120,250,124,189
 44 | P52699	176,97,99,215,157,95
 45 | Q9H816	35,36,120
 46 | W8QPS6	194,229,6,231,230,41
 47 | A0A0H3MRW3	146,195,150
 48 | P61769	77,37,78,103,104,105,107,93,94
 49 | Q8WWQ0	1393,1394,1396
 50 | E2GIN1	161,212,159
 51 | Q17TM8	64,393,396,334,338,344,347,348
 52 | K7RJ88	98,171,172
 53 | A0A411MR89	448,225,407,409,311,281,215,315,189,318
 54 | P01116	17,34,35,57,58,92,95
 55 | P04049	165,103,168,139,173,176,125,184,155,152
 56 | P79800	17,35
 57 | P42592	454,572,456,458,460,462
 58 | P54818	676,493,527,495
 59 | P07306	197,265,266,240,242,243,278,216,253,254,191
 60 | Q9H7B4	65,261,263,71,266,75,208,49,83,52,87,62
 61 | P31725	68,70,72,74,79,21,24,27,28,29,31,32,37,103,105,107
 62 | P0DTD1	5379,4879,3345,4370,5011,5396,4373,5399,3484,4381,3486,4383,5152,5034,5037,5038,3526,5192,3529,4687,5329,5203,5332,4693,1752,1753,1754,1755,5340,4698,4702,5343,5350,4327,5353,4330,5357,4336,5363,1142,4343,3321,1787,1789,5374
 63 | Q16769	202,107,330,76,276,148,280,218,222,159
 64 | Q7XSK0	82,85
 65 | P47205	241,277,279,237,78
 66 | P35270	257,260
 67 | Q83V25	294,9,11,177
 68 | A0A402C2V4	328,201,26,24,255
 69 | A0A402C2Q3	265,337,211,340,25,27
 70 | Q9Y253	231,13,14,113,115,116
 71 | A0A3B6UEU3	132,134,295,310
 72 | Q9D4J1	163,104,108,140,110,142,144,146,115,151,157
 73 | Q59FX0	962,965,970,973
 74 | P04406	289,291
 75 | P26664	1123,1125,1171,1175
 76 | F2NWD3	294,79,83,308,309,310
 77 | A0A0N9JNY6	832,865,866,650,654,883,884,694,599,600,602,603,892,861,862
 78 | Q05097	101,37,105,108,109
 79 | Q9GZT9	374,313,315
 80 | A0A1M6Y2K1	137,136,73,138,76
 81 | O43426	591,543
 82 | A0A5P5Z9C8	132,37
 83 | A0A3B6UEP6	128,34,26,31,29,127
 84 | P80025	227,675,301,303,305,307,339
 85 | Q92769	191,194,198,199,200,265,175,176,177,179,188,223
 86 | Q5CYN0	629,426,477
 87 | B2FTM1	181,246,105,107,109,110
 88 | Q70MM3	139,108,25,60,63
 89 | Q8W4D0	481,486,599,600,601,506,509,511
 90 | Q16790	226,228,251
 91 | G3FFN6	263,41,42,43,269,14
 92 | P01112	17,35
 93 | G3I8R9	257,252
 94 | P11142	232,227
 95 | P54132	848,1050,1047,1063,1048,1066,684,1036,1055
 96 | Q88FY1	320,265,105,302,318,319
 97 | A0A1Y1BQV9	289,329,332,336,337,340,341,286
 98 | Q9KDJ7	150,152,91
 99 | L8ICE9	226,227,675,301,303,305,307,339,372,247,222,543
100 | Q9I5W4	700,696,716
101 | Q9BY41	178,180,267
102 | C3W5S0	80,134,119,120,41,107,108
103 | Q6JC41	288,424,361,426,362,526,561,531,563,734,286
104 | Q4VA93	132,102,135,140,143,115,118,151
105 | Q3JRA0	10,12,44
106 | P61586	19,36
107 | P30793	144,210,212,141
108 | P06873	330,365,174,305,177,178,280,121,282
109 | Q06338	283
110 | Q9NRD9	832,864,834,866,836,867,868,839,869,870,875,828,830
111 | A0A1D1UCW7	64,67,68,70,135,136,137,104,142,63
112 | P20051	98,258,137,14,16,180
113 | Q9IGQ6	384,386,324,376,294,298,342,343,344,345,378
114 | Q9RZN1	224,225,162,163,100,101,230,197,200,201,112,217,219,223
115 | Q8YY76	160,162,163,165,166,39,42,52,53,159
116 | B2RHG5	224,297,298,301,302,303,336,150,152,154,252,253
117 | P20292	58,61,62
118 | Q9NY97	376,247
119 | Q8PET2	32,288,151
120 | P31947	192,196,133,83,84,86,87,215,89,219,31,35,228,231,110,112,56,60
121 | P0C0Y8	231,191
122 | P0C0Y9	267,220,235
123 | A9CJ36	192,137,141,214
124 | P04746	173,115,182,216
125 | C9K1X5	224,228,220
126 | P23360	201,235,237,240
127 | P9WNV1	32,72,44,28
128 | Q2CEE3	388,391,392
129 | Q0TR53	73,74,108,111
130 | Q96HY7	368,333,366
131 | T2B7E1	137,24,107,138,59,104,141,62
132 | Q2UWK0	225,167,169
133 | B0R5R2	16,11,60,15
134 | A0QWT2	291,207
135 | Q06187	165,143,152,154,155
136 | P04067	70,211,86,87,311,225,289,227,228,290,230,291,104,235,236,242,252,244,247,248,60
137 | Q4QBL1	98,250,102,170,251
138 | P50405	296,292
139 | P22297	289,108,222
140 | A0A452CSW8	234,171,84,86,88,89,152
141 | P11234	28,46
142 | A0A0S4TLR1	69,47
143 | Q9LXT4	98,164,100
144 | Q9SJI9	98,164,100
145 | Q8LGJ5	136,197,134
146 | P21327	153,156,317,79
147 | Q92794	289,259,292,230,262,233,265,268,238,209,241,307,212,310,281,284
148 | P08312	256,308,101,102,103,310,268
149 | P07395	416,160,33,161,454,264,459,460,463,687,627,691,629,503,504,253,254,159
150 | A0R5T1	103,231,105,106,234,140,109,142,182
151 | P00811	33,202,37,245
152 | A0A0F7VBC6	305,308
153 | Q9I0B9	64,66
154 | P61224	17,35
155 | Q13490	326,333,306,309
156 | Q9R1E6	739,741,358,359,743,745,746,171,747,209,311,474,315
157 | A0A059ZPP5	208,323,213,202
158 | P62136	64,96,66,173,92,272,248,124,125
159 | Q8N6T7	166,141,144,177
160 | P62575	398,399,400,401,375,412
161 | A0A418LHZ8	145,55,90
162 | A0R5R2	51,86,141
163 | A0A140HJ20	643,291,647,267,269,367,592,593,371,372,376,286,287
164 | P24300	257,287,181,245,247,217,220,255
165 | A0A2P1GNT2	324,293,294,295,297,344,345
166 | A0A369R1N0	325,327,186,235,237,188,179,122,124
167 | G8T6H8	165,166,139
168 | F2TZN0	456,491,492,605
169 | P02794	128,129,132,133,135,10,11,12,13,14,15,16,142,18,147,19,145,22,23,151,26,28,30,34,41,170,172,173,174,51,61,62,63,64,65,66,72,75,76,80,82,84,88,90,92,93,108,110,111,113,127
170 | Q9XI75	176,178,235
171 | P21802	644,630,631
172 | P02866	197,91,171,173,175,177,243,245,182,250,187,94
173 | Q16659	171,157
174 | P29068	389,390,199,202,148,149,154,155,157,159,287,167,103,105,170,107,302,112
175 | Q96KQ7	1027,1115,974,976,1168,1170,980,1175,1017,987,985,1021,1023
176 | O95639	162,132,166,138,142,156,148,124
177 | E1AQY1	81,2,83,4,67
178 | O00408	808,660,727,696,697
179 | P33590	516,517,392,297,330,393,394,363,312,313,187
180 | Q50KB2	329,331,443,348,351
181 | P0DTC1	1752,1755,1789,1787
182 | A0A286S0G7	176,97,99,215,157,95
183 | Q9NXA8	166,169,207,212
184 | B9LW38	153,107,156,151
185 | P02872	160,144,146,148,150,155
186 | Q9L069	467,443,445,446,463
187 | P71447	10,8,169,170
188 | P12821	992,418,390,394,460,1016,988,383
189 | P09528	66,170,28,174,63
190 | Q9HCE5	336,332,157
191 | Q96SD1	256,33,35,228,37,38,136,272,115,254
192 | P28720	320,323,349,318
193 | Q9KLZ6	68,69,55,121
194 | A0A218QMM7	118,120,185
195 | A0A0S4TKQ5	273,36
196 | Q93P54	64,65,67,137,334,335,84,278,280,287,31,292,484,296,504,505,122
197 | A0A0F7UZ05	17,19,21,15
198 | B9PZ33	16,18,20,22
199 | A0A0A7HR51	113,265
200 | P04278	81,189,79
201 | C5A3Z3	356,357,408,222
202 | P9WKK7	276,308,279
203 | Q86SQ9	34,38
204 | Q8IFW4	88,105
205 | P08799	655,656,179,180,181,183,184,185,186,125,126,127
206 | Q15181	116,119,102,118,120,121
207 | P93836	226,308,394
208 | A0QUZ2	65,101,85,127
209 | Q5S007	1409,1441,1410,1442,1413,1415,1338
210 | Q09LZ6	57,11,59
211 | P0A746	98,46,49,95
212 | O43570	145,119,121
213 | I3P686	161,194,104,200,202,235,204,181,182,183
214 | P05164	334,335,336,338,340
215 | O59952	66,67,69,103,107,79,84,55
216 | Q86WT6	64,41,44,78,81,56,58,61
217 | P39230	96,97,94
218 | Q4WQ18	178,119,175
219 | Q9KM24	251,238,254
220 | Q04430	101,102
221 | Q9NP87	418,330,332
222 | P14618	296,75,77,270,272,113,114,243,244,83,86,56
223 | P27988	417,328,331,412,414
224 | Q9BYC5	388,389,491
225 | A4F980	209,242,132,101,135,205,239
226 | V7II86	314,491,365
227 | Q9CPU0	34,100,173,127
228 | S8G5K8	919,835,839,858
229 | A0A0K8P8E7	304,307,309,311,313
230 | P61006	64,22,38,40,63
231 | Q6NZB1	167
232 | U2NM08	50,147,155
233 | W0FBY3	140,132,30
234 | Q9I5E2	60,87
235 | P00533	745,842,721,722,723,724,855,858
236 | P33763	64,1,66,3,4,5,70,71,74,11,15,20,23,24,25,28,33,41,59,60,61,62
237 | P12689	548,549,550,551,518,553,362,363,554,525,465,467,468
238 | Q9H082	65,63,47
239 | Q86Y01	449,452,446,468,471,411,444,414
240 | P25321	185,172
241 | A0A0F6NGI2	500,501,503,506,508
242 | Q32904	83,85
243 | P27353	140,144,209,114,243,147,246
244 | B8QIQ9	240,114,179,116,198,118
245 | B2UR60	205,209,308,309,311,215,316,317,319
246 | B1KYY8	111,115,85,182,118,216,186,219
247 | Q941L3	388,421,390,392
248 | A0A0R6L508	465,466,246,285
249 | A0A173N065	174,118,183,217
250 | P0DKX7	1539,1540,1413,1414,1415,1541,1417,1543,1547,1548,1549,1550,1422,1552,1423,1424,1426,1556,1557,1430,1431,1432,1433,1558,1435,1559,1565,1566,1567,1568,1561,1570,1440,1441,1444,1574,1575,1448,1449,1450,1451,1576,1453,1577,1583,1584,1585,1586,1579,1588,1457,1458,1459,1462,1460,1596,1470,1599,1473,1603,1605,1606,1477,1439,1609,1478,1479,1482,1480,1442,1523,1652,1525,1654,1530,1531,1532,1534
251 | I4E596	270,273,250,253
252 | Q9H165	802,772,805,775,744,747,783,792,818,786,788,823,760,764
253 | P0ABE7	34,99,38,41,43,47,23,24,26,27,61,30,95
254 | A0A223GEC9	101,23,191
255 | A0A0S2GKZ1	97,20,183
256 | D0CD18	27,33,11,14
257 | Q04111	437,438,412,477
258 | Q8K299	416,417,482,419,454,455,422,456,457,458,477,478,415
259 | Q6ZMJ2	481,482,419,420,421,486,423,426,458,460,461,459
260 | P81947	39,71,41,11,44,45,47,49,50,55,61
261 | A0A068C8U8	19,37
262 | Q96L73	1920,1925,2023,1895,1897,1931,1905,2003,2070,1911,2072,1978,2077
263 | H9JH18	19,21,23,25,30
264 | Q4J989	20,205,207
265 | O76074	617,682,764,653,654
266 | D2Z025	198,200,107,235,77,109,111,110,113,86
267 | B7UCZ5	225,66,258,255,261,204,205,206,209,210,243,244,53,55,254,63
268 | Q9Y3Z3	164,167,233,206,207,210,311
269 | P62993	152,79
270 | F8W4B7	705,323,612,230,614,232
271 | Q86UW9	450,453,415,469,472,412,445,447
272 | P29166	40,42
273 | A0A179QS89	69,38,154,72,8,9,11,151,122
274 | Q8C6L5	384,385,288,291,392,378
275 | P07528	64,376,357,360,393,363,396,334,344,347,348,318
276 | P77990	37,39,150,59,60
277 | A0A2E2GL08	32,24,26,92
278 | Q9NZK7	48,49,66,45,47
279 | Q86V24	352,202,348
280 | A0A024B7W1	3234,3369,2959,3250,2963,2968,2971
281 | A0A160T9D2	205,207,208,179,181
282 | D0VWU9	404,311,315,141,542,143
283 | P00644	103,117,122,123,125
284 | B4EUK6	117,72,74
285 | P01111	35,17,18,21,22,27,28,29
286 | P00327	98,68,101,104,47,112,175
287 | E7E2M2	97,98,259,260,138,140,301,303,304,305,143,144
288 | Q04M32	195,191
289 | A0A0C2W6A5	84,54,87
290 | O08600	164,136,169,137,138,141,177
291 | Q72HW2	97,450,133,135,455,393,137,396,398,444,445,446
292 | P16952	648,650,664
293 | G0S3F2	130,391,135,79
294 | P43215	114,126
295 | P36871	288,290,292,293,117
296 | P84095	17,35
297 | G7JFU5	266,268,181,183,185,157
298 | P46637	161,270,272,185,186,187,189
299 | A4Q9E8	359,264,361,241,178,179,346,125
300 | Q9H7Z6	226,230,210,213
301 | A0A3Q0L1E1	360,393,395
302 | Q5NQE9	116,212,214,118,120,92
303 | G2EBB4	252,254
304 | P11444	195,247,297,221
305 | E9QYP0	136,138
306 | G2R014	288,545,289,392,492,493,339,565,341,285
307 | Q8IM71	305,307,293
308 | O15178	186,167
309 | A0A135VHY8	66,2,68,137,138,144,147,19,26,27,95,96,160,172,175,176,49,177,179,116,61
310 | A0A135VDL7	96,160,66,67,68,137,138,172,175,49,177,179,116,26,27,61,95
311 | X6Q997	248,184,265,136
312 | B6K6D6	128,130,131,199,201,202,206,146,147,148,149,151,153,218,219,220,156,93,94,98,99,102,123,183,184,186,187,125
313 | Q8IUN9	267,269,270,206,210,215,280,281,218,216,220,224,292,293,305,243,251
314 | Q9NV35	67,63
315 | Q9H993	291,253,254
316 | Q1LCS4	128,162,165,110,51,57,125,95
317 | P47811	354,356,359,168,174,150,313,155
318 | P02766	86,119
319 | Q05397	937,956
320 | Q96DA2	40,22
321 | P57729	41,23
322 | P96787	48,240,46
323 | A9H324	275,276,277,300,279,280,204
324 | P03366	1042,1077,1097,1148
325 | Q3V872	160,260
326 | P60624	68,69
327 | Q6B856	177,11
328 | G0SC54	280,185,283,188
329 | F5SWR8	112,113,97,99,94,111
330 | Q6ZJK7	408,411
331 | Q9BPX6	432,421,423,425,427
332 | Q8IYU8	196,185,187,189,191
333 | Q972K4	114,86,90
334 | B0VCI2	100,103
335 | P0AC84	165,110,53,55,57,58,127
336 | P9WIC7	15
337 | P12955	289,452,241,370,276,410,412,287
338 | A5U6B6	34,36,38,90
339 | P9WNX1	36,40,92,110,159
340 | A8WBX8	323,325,296,232,234,339,221,190
341 | P68400	161,175
342 | P9WFX1	297,434
343 | Q88JH0	305,306,188,190
344 | O43809	149,88,171,76,93
345 | H9UEF1	64,120,117,62
346 | Q97KK2	482,515,484,517,486,519,488,521,493,526
347 | P0A8W0	176,172,244,222
348 | Q24451	267,1073,534,153,155,1085
349 | Q9NWT6	201,279,199
350 | B7GTV0	288,258,386,356,331,207,272,209,210,276,181,281,124,382
351 | Q6L732	132,134,219
352 | F0NDX2	512,515,650,651,206,207,464,210,467,885
353 | P27302	185,155,187
354 | P21816	86,88,140,93
355 | D3ZKP6	409,426,405,430
356 | A0A128A3G4	289,172,467,126,222
357 | I1RDR8	224,225,98,226,227,165,209,213,216,217,94
358 | K7CID1	647,210,666,475,669,798,799,736,801,738,672,804,740,742,359,360,743,744,568,172,312,316
359 | Q9L2C2	134,7,84,60
360 | D0MGT2	50,54,23,22
361 | A2ENQ8	96,99,37
362 | A0A024B5J2	241,54,313,238
363 | A0A0E3ZCF3	208,193,211,39,282,191
364 | P22234	97,137
365 | A0A142J6I6	161,293,294,282,79
366 | Q75I93	96,93
367 | P40993	64,109,61,63
368 | P0AEG4	20,23
369 | Q8CIB9	408,386,404,389
370 | L0I969	88,85
371 | Q12791	1018,1012,1020,1015
372 | P00722	417,162,419,194,164,462,16,19,22
373 | Q96DE0	169,99,173
374 | Q4LAV3	49,50,51,53
375 | Q9SS90	258,133,205,207,209,338,339,211,347,156,349,350,286,288,164,166,378
376 | Q97ZJ8	615,618,602,605
377 | A0A249T061	63,61,285,95
378 | Q9RVZ5	322,326,345
379 | Q8N884	390,396,397,404
380 | P07328	32,37,38,39,40,204,45,205,429,433,29,30
381 | P07329	353,258,259,357,108,109
382 | Q95V93	265,188
383 | Q9I6A3	290,234,235,237,273,279,251,286
384 | Q9WZX9	288,234,250,294
385 | B2FKF0	114,115,111
386 | A0A0H3AJ04	128,45,179,95,215,158,159
387 | M5AAG8	80,115,69,72,74,142,78
388 | P61889	71,142,112,113,242
389 | Q9NR30	339,340
390 | O67206	161,514,451,454,519,456,118,120,508,509,510,159
391 | P27695	96,68
392 | C3RVQ0	164,280,169
393 | C3RVP5	164,168,169,276,280,158
394 | P29600	2,40,73,75,77,78,79
395 | A6N5T2	339,291,292,454,457,459
396 | N0DKS8	67,70,71,9,10,11,75,83,86,87,25,160,106,109,112,113,243,116,115,251,255
397 | Q15382	20,38
398 | A0A2Z6G7U6	64,66,70,139,140,141,119,153
399 | A0A2Z6G7U3	65,67,122,71,139,140,118,120,154
400 | Q81258	1129,1177,1131
401 | O39929	1123,1171,1125
402 | O91936	1176,1172,1124,1126
403 | Q04609	387,453,425,553,266,269,272,273,433,436,377
404 | P43166	96,121,98
405 | B7GUI0	179,142
406 | N0B9M5	258,163,261,263,298,299,301,307,115,116,312,153,158
407 | Q9X1X0	180,216,58,92,93
408 | P0A7F3	114,109,141,138
409 | Q4E2I1	210,42,44
410 | Q6NQ79	677,680,557,559,691,724,694,727,699,669,702,575
411 | P02753	193,188
412 | A0A1C3NEV1	464,283,244,463
413 | Q6IQ55	146,163
414 | Q15818	358,359,360,367,370,280
415 | Q8TF76	563,467,477
416 | P0C9C6	231,233,142,271,78,179,115,182,218
417 | L7T0L4	41,74,76,46
418 | P42212	209,132,56,25,58,57,143
419 | L0FY79	240,180,117,119,121,199
420 | A0A0U3H0V9	161,166,172,205,206,213,88,153,90,92
421 | Q9KN86	297,378,301
422 | O00311	353,196,360,363,182,351
423 | Q9UBU7	296,299,309,315
424 | A0A125W693	345,322,318,359
425 | Q8YYB7	161,162,147,163,170,123
426 | Q5LUB5	144,145,58,61
427 | G0FQ07	114,118
428 | A9GMG4	305,262,264
429 | Q9NVW2	610,588,590,593,596,570,573,607
430 | O13836	240,201,150,239
431 | P02792	128,131
432 | A0A1Z1F9L9	200,203,204,205,86,120,122,123
433 | O32163	128,66,41,43
434 | Q92832	453,454,456,457,434,435,437
435 | Q4Q5S8	175,176,177,178,179,180
436 | Q00987	452,457,461,464,438,441,475,478
437 | Q7YRZ8	452,457,461,464,438,441,475,478
438 | Q5XI77	256,257,261,211,212,213,214,215,409,411,412,413,417,366,369,370,371,253,255
439 | Q5SHZ3	228,270,113,114,116,87,281,282,283,284,157
440 | P05057	74,76
441 | P05089	128,101,232,234,246,122,124,125,126
442 | Q6DDT4	9,34
443 | Q8WS26	161,165,230,313,317
444 | Q9WZL1	210,181
445 | P50213	257,261
446 | P51553	117,311
447 | P46881	592,433,431
448 | A0A1E1GJG5	224,204,82,85,278,215,24,214,27,220,221,216
449 | E5RPG3	2002,2004,2013
450 | Q9BV73	2209,2212
451 | O95989	65,50,51,66,52,70
452 | P29476	331,326
453 | P29475	336,331
454 | P29474	369,99,365,94
455 | P80366	352,194,162,228,229,313,350,191
456 | P0DP23	65,130,68,132,134,136,141,21,23,25,27,29,94,32,96,98,100,105,57,59,61,63
457 | Q14191	563,935,936,939,908,558
458 | Q7LBC6	1560,1689,1562
459 | Q9WZD0	49,42,45,39
460 | A0A024FRL9	208,250,120,122,124,189
461 | P31153	265,31
462 | P41977	98,50,179,146,183
463 | D9QA55	136,137,111
464 | Q80J94	438,440,410,366,447
465 | P65473	354,355,195
466 | Q96EP0	986,930,898,901,998,871,969,874,1001,925,911,916,890,923,893,926
467 | A0A1M2CSI6	208,250,120,122,189
468 | Q9I311	692,694,743,744
469 | B1MD73	166,168,169,170,74,75,77,111,112,113,48,149,153,90
470 | P96202	2125,2126
471 | B3EYM9	201,203,205,207,209
472 | Q9I188	428,271,432,375,216,379,220
473 | P75960	174,176,177,155,157,158
474 | H6QM92	134,41,106,107,108,80,119,120
475 | H6QM91	305,306,445,446
476 | Q81TB4	96,161,195,165,198,93,62
477 | P25098	368,360,361,363,348,366
478 | Q86T24	512,544,516,552,555,524,527,496,499,568,540,573
479 | Q9A180	1190,1192,1194,1196,1197,1198,908,910,912,914,1042,916,1043,1110,1017,1114,1019
480 | Q08467	232,246
481 | P0CT50	8,169,298
482 | Q5KZU5	266,206,145,178,23,25
483 | A0A0U1MWF9	386,579,644,581,580,647,360,652,655,532,380
484 | A0A1S4CVP6	258,175,116,118
485 | D2JIV0	135,105,138,139,108,112,25,60,63
486 | Q9D4V7	66,20,38
487 | A0A386KZ50	352,355,359,366,351
488 | A0A161XU12	360,362,365,699,700
489 | O43583	192,193,195,142,145,146,149,150,153,160,163,171,172,175,176,188,179,183,184,121,122,187,124,190,191
490 | D2W2Z5	227,326,170,172,173,146,147,53,149,190
491 | E1BF86	293,296,301,304
492 | A0A109PTH9	228,230,329,330,332,334,218
493 | P22963	34,37,38,137,172,174,175,23,22,183,218
494 | Q8KPT4	167,170,138,83,86,55
495 | Q9H2K2	1089,1081,1084,1092
496 | Q9X1H0	52,54,58,92
497 | K4PWX3	176,97,99,215,157,95
498 | P02879	137,279,281,141
499 | Q8EAC7	231,201,202,250,251,252
500 | G8B4G1	176,97,99,215,157,95
501 | Q7WYA8	176,97,99,215,157,95
502 | B2VCC3	52,54,56,140,141
503 | Q8NB16	33,19,71,104,15
504 | P00396	240,368,290,291,369
505 | P00428	113,116,91,93
506 | A0A3S7X3C0	180,243,244,143
507 | E9BLM4	226,227,180,124,125,143
508 | Q9D8Q7	135,137,141,157,158
509 | L8AXY8	98,99,100,101,8,72,200,43,12,44,14,146,116,186,94,95
510 | Q99828	161,163,165,167,172,116,118,120,122,127
511 | A0Q7V5	57,11,12,55
512 | P0ACD8	528,582,57
513 | Q8IEK1	496,500,519
514 | Q8YVV8	52,54
515 | Q96AE4	304,337,341,334
516 | O60232	70,73,53,56
517 | Q5AR53	242,244,319
518 | Q9BYF1	402,374,378
519 | P37659	378,243,278,442,443,381,383
520 | D0L035	116,88,92
521 | Q9F4D5	194,675,676,707,678,424,425,426,111,113,638
522 | Q56307	513,584,586,511,575
523 | Q9UGM3	128,130,1026,135,1058,1059,1060,1061,169,1081,1082,1019,1020,1021,1086,1023
524 | A3DC31	203,205,274,277,278,282,283
525 | P00441	64,72,47,81,49,84,121
526 | Q9NZ08	353,357,376
527 | Q2U8Y3	129,66,137,171,270,143,47,241,82,175,84,148,150,279,120,61,127
528 | A0A062V290	35,71
529 | P31327	362,335,337,315
530 | D9N168	356,533,357
531 | C4LXK0	9,169,7
532 | Q9BYW2	1533,1539,1516,1678,1520,1680,1685,1529,1499,1501,1631
533 | Q8R0F8	105,74,76,21,22,26,126
534 | Q05086	800,799
535 | P17562	17,273,279,253
536 | P76038	141,206,181,183,157
537 | P43860	250,108,253,254
538 | Q939N5	490,427,428,494,399,401
539 | F0FS71	257,259,357,361,284,285
540 | P68135	288,290,68,70,39,265,267
541 | A0QR77	9,10,107,108,152,159
542 | E5KIY2	122,208,124,152,250,120,189,223
543 | P9WK11	469,471,440
544 | Q5KW80	129,130,98,69,102,167,105,202,205
545 | P29082	114,37,86,88,90
546 | Q63SB3	37,39,216
547 | P39900	192,194,196,198,199,201,218,158,222,228,168,170,175,176,178,180,183,124,190,191
548 | Q9WYT9	53,55,9,10
549 | P0A031	200,107,108,109,110,203,204,208,209,20,21
550 | Q63870	1168,1061,1065,1131,1132
551 | P04637	238,176,242,179
552 | Q13936	1659
553 | A0A094ZWQ2	185,189
554 | P50477	362,363,396,364,399,400,401,115,308,309,310,311,116,150,157,155,156,152
555 | G0SFF0	57,77
556 | Q58M74	97,66,2,68,72,41,78,48,20,27,31
557 | Q0QC76	160,74,93,95
558 | Q84EX5	38,39,40,168,43,16,240,123,124
559 | P00431	127
560 | Q93HL2	194,54,38,216,190,63
561 | A0A098SKQ8	145,149,55,103
562 | P68826	65,154,158,111
563 | Q25704	99,648,42,46,47,208,689,210,692,185,188
564 | P43873	288,290,276
565 | M1GSK9	65,97,257,100,290,287
566 | A0A2K0XYW4	166,168,173,156
567 | J9XNG9	675,868,677,678,903,871,869,650,904,684,653,686,909,693,663,696,668,671
568 | P56547	45,46,112,117,121
569 | Q8WQX9	673,709,710,822
570 | Q14980	120,123,13,119
571 | P04632	193,137,152,154,247,156,158,225,163,227,228,109,112,114,115,116,182,119,184,249,186,251,188,253
572 | P16218	742,814,752,689,786,690,788,691,790,663,792,665
573 | G0SEG4	418,419,327
574 | Q8A2H2	83,332,43,44,333
575 | Q8TZW1	32,161,64,200,303,24,280,281,26,286,63
576 | P0A776	53,37,38,120,57,39
577 | P48735	314,318
578 | Q10471	224,226,359
579 | Q6B0I6	238,244,310,312
580 | Q13185	130,131,132,133,135,137
581 | Q8ZM86	98,101,7,8,9,120
582 | P02787	225,139,536,411,604,445
583 | P38108	32,35
584 | Q5JHF1	40,94
585 | Q9H9S5	416,289,296,360,362,364,345,346,317,318
586 | A7NNM5	76,141,77,146,151
587 | Q8PGN7	241,243,239
588 | P05058	50,52,74,76
589 | A0A2N7KUA1	35,291,296,488,300,141,338,339,88
590 | M2XAQ9	16,50,15
591 | Q0I9X8	33,66,69,40,110,143,113
592 | Q9QY93	66,98,63,95
593 | P84138	142,80,242,245,182,183,125
594 | Q9P8C9	297,233,299,269,272,340,308,310
595 | Q9UL01	481,452,470,471
596 | Q8GXJ1	404,313,315
597 | P05361	212,199,201
598 | P19812	160,136,139,175,177,148,189,118,151,123,157
599 | A0A452E9Y6	305,227,307,301,303
600 | F4KCC2	134,137,170,172,155,146,149,184,121,187,158
601 | W8QLX4	1810,1811,1836,1838
602 | P68871	97,43
603 | Q8IXJ6	224,195,200,221
604 | P43408	17,140
605 | V5VCK8	160,162
606 | O50271	448,481,53,501,519,40,395,446
607 | Q44384	67,313,37,295,40,296,330,331,332,235,238,237,240,241,314,216,217,218,221
608 | Q48484	386,388,540
609 | Q15848	193,194,195,187,188
610 | Q8WWA0	97,98,260,133,262,274,86,87,89,282,92
611 | G0S1N3	456,378,451,459
612 | Q9FD68	106,191
613 | C6XF58	58,266,125,191
614 | Q8YSJ5	65,35,68,163,133,166,199,201,56,60
615 | X2D812	64,68,107,62
616 | Q9FG77	326,328,298,303
617 | A0A1J6PWI8	168,249,172
618 | P9WPQ5	16,49,37,108,15
619 | Q57514	50,52,74,76
620 | P35828	257,259,771,264,265,266,268,783,785,280,793,282,795,285,797,799,288,289,290,291,292,294,297,298,559,562,314,315,317,320,322,586,864,865,866,868,873,876,878,882,883,884,885,887,634,636,893,638,892,896,900,901,902,903,905,651,652,653,654,908,656,910,928,673,674,675,676,933,678,931,689,690,691,692,695,697,698,699,966,967,968,716,717,990,992,993,994,484,485,491,757,758,759,762,255
621 | Q8IXL7	135,138,86,89
622 | Q06210	473,474,476
623 | Q6GHN9	2
624 | A0A022MRT4	261,294,263,281,238,78,240,81,79,147,308,148,310,87,248,149,282,283,120,254
625 | A0A022MQ12	75,77,205,245,246,345,253
626 | Q7N4I5	115,26,188
627 | K7SA42	195,112,176,114
628 | Q2N9N3	256,259,164,167,200,112,114,242,116,181,88,91
629 | C6KFA3	89,97,134,136,137,138
630 | A0QTY3	152,121,123
631 | P42981	113,12,15
632 | L0N6M2	486,441,443
633 | A0A024L327	256,247
634 | P37793	97,104,172,175,80,82,84,153,156
635 | Q2PAJ1	547,548,485,549,550,488,553,490,138,140,558,93,95
636 | Q94707	192,193,331,267,368,371,308,309,215,218,283,221
637 | Q3UPF5	96,162,168,73,106,172,110,78,174,82,86,150,88,182,187,191
638 | P06396	304,354,329,330
639 | A5K1A2	165,169,170,171,172,173,174
640 | P00959	162,146,149,159
641 | Q92841	385,132,263,398,399,208,465,209,210,351,223,484,485,231,234,235,179,372,373,381,383
642 | Q9UKK9	96,97,166,111,112,82,115,116
643 | Q9SXZ2	81,141
644 | Q5I3C1	3202,3204,3337,2927,3218,2931,2936,2939
645 | V5TER4	224,103,106,45,47
646 | P33755	137,139,204,208,145,148,216,219
647 | A0A068PS59	737,739,882,764,767
648 | O95251	384,388,368,371
649 | P15498	546,516,549,554,557,529,532,564
650 | D7Y2H5	104,73,105
651 | P11802	48,145,47,42,138,156,157,158,159
652 | H0ZAB5	275,277,413
653 | A0A0S4IKF4	580,644,647,584,620,623,628,660,631,664,601,604
654 | D7Y2H2	72,74
655 | C6XID6	199,241,180,116,118,120
656 | A0A380UY17	65,15,58,60,61,62,63
657 | Q6P179	370,374,393
658 | A6T6Z0	355,389,358,328,362,239,336,342
659 | D0VMS8	384,385,386,131,388,267,268,397,269,271,404,410,414,299,301,302,304,186,190,66,68,196,70,72,74,341,343,348,349,350,352,357,358,359,361,365,366,367,368,370,374,375,376,377,379,383
660 | Q8ZXN7	33,264,138,170,266,173,268,211,252,86,156,30
661 | A0A0D5YK08	17,209,207
662 | Q8DLC7	497,498,435,581,439,584,543
663 | P05555	160,225,156,158
664 | P84308	227,190
665 | A0A1Y2HEJ3	96,136,92,93
666 | A5H660	186,188,285
667 | I3DBY6	233,151,152,153
668 | P35816	481,418,516,485,392,361,362,494,144,145
669 | D5VCE0	248,249
670 | O07431	192,193,194,195,165,166,198,139
671 | P18858	568,621,720
672 | P08709	66,67,74,76,270,79,80,272,275,276,85,86,280,89,106,107,109,123,124,61,62
673 | P0DP24	96,32,130,98,132,100,134,68,136,105,27,141,25,21,23,57,59,61,94,63
674 | Q8N371	400,321,323
675 | P54922	56,304,54,55
676 | Q52ZA0	116,118,281
677 | Q9HVA8	224,129,87,39,223
678 | Q9K943	99,102,103,8,11,77
679 | A0A1K2FZ20	179,243,215,253,218,285,255
680 | Q9PPB4	200,230
681 | E9AE57	468,471,216,219
682 | 


--------------------------------------------------------------------------------
/data/independent_set/binding_residues_nuclear.txt:
--------------------------------------------------------------------------------
  1 | Q6SPF0	43,45,46,92,48,49,87,56,57,88,60,94
  2 | C6H0Y9	130,194,133,84,88,24,34,102,38,41,106,121,122,123,124,125
  3 | Q8IUH3	192,193,66,68,70,152,27,155,29,31,163,100,165,122,126,104,105,106,107,112,114,53,55,186,124,125,190,191
  4 | P06746	256,258,133,135,271,272,273,274,276,279,280,283,27,284,287,288,34,35,292,293,38,39,294,295,296,36,40,183,62,190,192,64,66,65,68,69,63,67,229,230,231,232,233,234,103,104,105,107,108,109,110,106,236,254
  5 | P18858	768,769,770,771,776,794,795,796,797,798,799,800,801,802,546,549,557,303,304,305,568,571,572,573,588,589,590,592,594,850,851,852,853,855,346,347,348,349,350,351,352,353,869,870,871,872,873,874,635,636,639,640,641,642,643,644,415,416,417,418,419,420,449,451,452,453,454,455,456,457,458,720,738,744,746,749,504,767
  6 | O28951	385,383,137,138,139,140,143,147,151,152,154,155,156,26,27,159,30,163,427,329,330,331,332,117,118,119,120,123,380,127
  7 | O95243	466,467,468,469,470,536,473,538,539,540,541,560,562,504,505,506,507,508,509,510,511
  8 | P00734	433,434,435,372,437,438,441,477
  9 | K7RJ88	129,131,132,135,138,396,397,398,424,301,302,303,430,432,433,448,451,332,335,338,216,217,218,219,355,356,364,117,118,119
 10 | Q9HVC1	26,27,28,32,36,37,38,39,40,42,46,49,50,52,55
 11 | Q9Y253	256,257,258,259,260,261,262,384,293,38,39,42,46,47,48,311,313,316,61,62,317,64,320,321,322,323,324,326,318,319,328,329,327,86,89,93,351,224,113,115,116,377,378,381,382,383
 12 | P0DTD1	6784,5250,5253,6792,5257,6794,6795,6796,6797,6928,6930,6932,6935,4892,4893,5151,5152,5153,4899,6822,6823,6824,6825,6696,6828,6701,6831,6832,6834,6840,6968,6841,6970,6844,6971,6972,5307,5076,4933,4935,5077,4950,4951,5074,5075,6996,6741,5205,6999,7000,6873,6874,7001,6872,6743,6745,5206,5081,4961,4969,5228,4972,5232,4982,5239,4984,4986,4987,5246
 13 | F2NWD3	10,11,12,13,15,16,340,98,100,101,102,39,104,41,42,122,123,124,125
 14 | A9CJ36	67,68,8,9,10,31,33,34,35,44,45,46,47,49
 15 | A0R5T1	17,20,23,25,35,63,320,65,321,322,323,324,71,73,74,75,76,77,330,331,336,81,219,220,93,221,95,224,97,94,231,238,239,115,116
 16 | P55316	227,228,230,231,234,204,272,241,274,185,186,253,255
 17 | Q9E006	449,450,382,388,453,452,456,475,479,428,429,431,381,380,441,443,444,445,446,383
 18 | Q1PST0	2,35,34,5,38,37,42,45,46,49,30,31
 19 | Q9NP87	385,386,387,389,275,416,418,174,175,177,434,178,179,433,435,436,438,442,445,446,449,450,323,456,457,329,459,332,204,206,207,208,209,330,364,365,245,246,247,249,250
 20 | P26368	260,262,264,265,146,150,152,155,289,291,294,300,302,304,192,195,196,197,199,328,329,331,333,335,338,339,340,225,227,228,229,230,231,252,254
 21 | Q9LVJ7	206,207,208,210,211,212,214,215,216,218,220,222,223,224,225,226,227,228,229
 22 | D9N168	256,257,258,259,260,261,531,532,533,534,281,282,285,288,289,290,293,313,314,315,316,318,320,321,322,323,325,327,328,329,330,331,332,340,410,413,414,417,419,420,421,422,423,424,427,432,474,484,485,230,232,233,489,234,236,492,493,235,496,500,253,254,255
 23 | P12689	518,551,552,553,554,555,556,557,307,319,320,322,323,324,325,326,328,329,588,589,590,332,606,362,619,620,621,622,623,624,625,626,627,628,629,366,367,393,394,395,396,397,398,399,401,402,663,667,417,681,682,684,685,686,691,692,693,694,696,464,465,467,468,734
 24 | I4E596	2,3,131,259,264,265,34,35,167,174,175,176,58,59,72,74,75,94,95,113,114,243,116,115
 25 | Q9H165	777,791,749,781,814,784,782,783,787,756,788,753,759,760,755,763
 26 | P84233	64,65,66,67,70,73,84,85,86,87,57,40,41,42,43,44,45,46,47,48,50,117,118,119,121
 27 | P62799	33,36,37,46,47,80,49,81,79
 28 | P02281	87,88,89,30,31,32,33,34,35,36,37,40,41,43,54,55,56,57
 29 | D0CD09	64,66,67,90,24,26,27,92,63
 30 | D0CD04	18,19,39
 31 | D0CD13	2,66,4,5,67,16,20,26,28,31,32,33,34,35,97,37,99,39,100,41,101,45,46,47,48,51,60
 32 | D0C9L7	16,84,21,22,23,87,26,27,28,30,31,97,101,38,40,41,46
 33 | D0CDQ7	72,73,74
 34 | D0CAL0	20,21
 35 | D0CD15	51,19,52
 36 | D0CBZ8	1,2
 37 | Q2G285	32,5,6,7,10,11,14
 38 | Q9Y3Z3	451,453,456,136,137,142,145,155,156,165,116,117,118,372,376,378,125
 39 | P28147	147,148,149,152,154,26,27,29,31,161,163,165,167,169,171,173,56,57,58,61,63,68,70,72,74,76,78,80,82,83,116,117,119
 40 | D0VWU9	384,385,386,389,269,664,665,667,540,668,542,673,674,675,676,677,678,679,709,710,711,590,591,592,593,594,348,349,350,735,606,607,611,612,613,739,743,495,498,499,245,501,383
 41 | B8H4R9	34,38,42,49,55,56,59,60,63
 42 | P37524	146,147,213,214,156,157,158,159,160,162,163,166,169,184,185,186,187,189
 43 | G2R014	326,327,394,397,399,400,403,404,341,559,372,437,438,375,377,379,445
 44 | P52952	192,194,142,143,144,145,162,165,168,181,183,185,187,188,190,191
 45 | P62399	64,66,69,90,73,74,75,80,24,89,26,27,92,63
 46 | P0ADY7	16,17,18,19,50,55,38,54,59
 47 | P0C018	64,3,67,68,15,25,27,29,30,31,32,33,34,98,36,100,38,101,102,44,45,46,47,54,55,56,63
 48 | P68919	9,75,12,13,14,78,17,18,19,21,88,90,29,31,32,37
 49 | P0A7L8	73,74,76
 50 | P0AG51	17,20,53
 51 | Q6CJ70	521,522,538,554,555,570,571,572,573,574,576,578,579,580,583,586,591,593,595,597,598,608,610,611,612,619,620,622,628,637,639,641,644,645,646,648,650,684,687,688,689,690,691,694,700,702,461,465
 52 | B8GW30	199,200,201,202,204,208,155,161,162,227,225,226,230,234,171,172,173,174,175,177,178,181
 53 | Q96MU7	404,405,407,428,431,432,433,434,438,439,455,472,473,474,475,476,361,362,363,367,377,378,379,380,381
 54 | F0NDX5	145,146,83,149,87,154,155,31,32,33,35,39,56
 55 | F0NDX6	265,266,267,268,269,270,271,20,21,22,23,26,29,30,31,32,47,48,50,51,52,54,55,59,72,73,74,75,78,79,80,81,210,82,211,212,213,214,216,219,220,221,222,224,225,226,227
 56 | F0NDX1	256,257,138,139,140,141,142,15,16,17,147,20,149,150,151,152,153,154,148,155,40,41,44,45,47,48,51,187,60,189,190,191,64,192,248,249,250,251
 57 | F0NDX3	128,129,264,265,266,268,155,156,158,159,160,162,178,179,180,181,183,184,185,74,77,207,208,209,210,211,212,221,222,223,224,225,127
 58 | F0NDX7	98,99,36,101
 59 | F0NDX2	960,625,626,458,459,1037,478
 60 | A0A0M4DML1	73,78,79,82,20,21,22,85,23,29,31,32,100,101,102,103,104,105
 61 | Q97ZJ8	15,16,168,169,172,173,176,177,180,181,308,184,58,59,60,318,320,68,196,198,199,200,201,202,203,80,89,90,493,494,496,497,116,500,508
 62 | Q9NR30	268,269,270,294,295,298,315,317,318,446,448,321,447,468,469,476,349,352,494,495,496
 63 | O94408	64,65,35,36,37,63
 64 | Q9Y7M4	42,44,81,82,83,27
 65 | O14352	64,35,36,37,62,63
 66 | Q9UUI1	65,63
 67 | O74483	32,33,63,62,94,61,30,31
 68 | P87173	73,74,75,43,44,46
 69 | O42978	66,68,39
 70 | C0VHC9	449,259,324,325,390,391,454,424,425,426,299,300,301,302,363,427,369,428,298
 71 | P27695	128,266,268,269,270,271,276,278,280,282,156,171,174,177,309,181,70,71,73,74,78,210,212,222,96,224,226,98,228,229,230,103,231,125,126,127
 72 | Q14527	97,68,69,72,73,110,111,112,113,142,91,93,94
 73 | Q04609	514,545,546,183,186,698,316,317,318,191,320,321,322,700,701,702,703,709,207,605,609,610,612,613,614,616,617,619,620,623,504,633,511
 74 | Q6NQ79	646,648,651,652,716,717,655,600,601,602,603,604,626,628,629,630,631
 75 | P0C9C6	5,7,8,9,142,271,14,273,146,147,148,149,45,48,49,50,179,52,51,182,55,78,81,82,84,87,89,92,229,231,232,233,115
 76 | Q4VGL6	193,259,262,263,264,265,219,220,238,239,240,241,247,184,185,250,251,188,253,190
 77 | Q4X049	288,289,268,269,271,272,273,275,276,277,279,281,283,284,285,286,287
 78 | Q4WDM9	104,105,106,88,89,90,91,93,94
 79 | G5EAZ0	268,269,270,272,273,274,276,277,278,280,282,284,285,286,287,288,289
 80 | Q5B5Z6	66,67,68,101,69,100,72,44,49,50,51,52,55,58
 81 | Q5AYY8	104,105,106,137,138,139,83,85,87,88,89,90,91,93,94
 82 | Q8YYB7	68,73,75,141,78,144,80,145,147,161,163,164,165,167,168,169,170,43,171,44,46,112,115,116,119
 83 | E5RPG3	1930,2058,2060,2059,2061,1941,1942,1943,1944,1839,1969,2147,2020,1845,1846,1844,2106,2107,2108,1852,1858,1864,2124,2132,1751,2137,1755,2141,1888,2144,1890,2145,2148,2021,2022,2023,2024,2151,2026,2152,2025,2029,1905,1907,1909,1910,1919
 84 | Q9SB92	82,83,86,87,89,90,92,93,94,96,100,38,39,40,41,60
 85 | P35716	64,68,76,77,80,81,84,87,88,120,121,106,48,50,51,117,54,118,56,57,53,61
 86 | P68431	38,40,41,42,43,44,45,46,47,48,50,57,64,65,66,67,70,73,84,85,86,87,117,118,119,121
 87 | P62805	79,80,81,18,20,24,31,33,36,37,46,47,48,49
 88 | P04908	75,76,77,78,15,16,17,18,14,21,29,30,33,36,43,44,45,46
 89 | O60814	87,88,89,32,33,34,35,37,40,41,43,54,55,56,57
 90 | H6QM92	512,514,388,389,390,391,392,393,534,279,281,300,301,302,303,561,562,563,564,691,569,577,326,328,458,459,462,466,467,470,471,473,476,351,352,483,485,491,363,366,367,368,369,370,371,372,373,503,505,507,508,509,511
 91 | H6QM91	260,524,525,272,528,274,275,24,30,32,33,34,668,37,38,669,553,557,560,562,566,567,568,569,310,571,350,352,353,355,356,365,633,635,124,125,126,127,123,135,136,409,410,411,667,413,414,670,671,672,673,674,675,676,678,679,680,683,687,690,694,697,186,444,445,446,191,704,193,706,196,198,200,202,203,225,227,229,237,493,239,494,241,242,243,249,507,510,511
 92 | H6QM90	201,205,81,82,83,84,86,88,216,33,38,41,43,52,53,54,55,58
 93 | Q86T24	578,515,579,520,584,522,586,595,596,533,597,535,536,598,599,539,534,600,538,549,550,562,563,564,503,504,505,507,508,511
 94 | Q9Y6K1	790,792,831,832,834,835,836,708,838,710,711,709,714,715,716,717,718,841,720,846,762,881,882,883,756,757,758,761,890,891,763
 95 | A0A1S4CVP6	258,144,145,254,149,213,215,119,218,223,224,225,229,186,187,176,178,179,180,181,182,183,184,185,250,251,252,118,190
 96 | Q01826	419,421,389,390,391,425,400,401,402,403,404,406,407,410
 97 | P0C0S5	15,79,81,18,19,20,80,31,32,35,38,43,46,47,48,49
 98 | P06899	87,88,89,30,31,32,33,34,35,36,37,40,41,43,54,55,56,57
 99 | Q9V2B6	7,25,26,27,156,29,28,30,45,46,47,48,184,185
100 | P11938	546,523,399,401,402,405,406,408,409,410
101 | A0QR77	152,287,297,42,41,300,44,302,303,46,305,304,183,184,185,186,187,60,189,190,62,188,192,193,64,196,197,68,65,69,334,335,339,215,216,217,218,219,220,221,104,107,108,242,244,245,246,247,248,249,250,251,252
102 | Q53W14	384,386,11,12,13,15,16,143,402,147,403,405,550,42,299,300,301,46,303,304,322,324,325,581,582,86,87,88,89,90,222,223,113,380,382,383
103 | Q9UBX2	64,65,66,68,69,71,72,73,137,139,75,140,141,143,79,144,146,20,21,148,23,24,88,26,95,96,97,98,99,43,49,118,124,62
104 | P67809	64,65,67,69,70,72,74,83,85,87,118,120,121
105 | Q9SNB4	193,194,196,170,171,172,146,147,148,149,185,186,187,190
106 | P56981	352,289,484,356,454,455,361,366,527,144,530,532,502,503,504,345,349,477
107 | Q9FG77	293,295,299,306,308,279,280,281,282,283,284,286,287
108 | P16104	76,77,78,12,16,17,18,15,21,29,30,33,36,43,44,45,46,125
109 | O08600	116,117,119
110 | Q9H0Z9	64,66,69,76,77,79,35,37,102,39,40,41,42,43,105,107,109,112,113,114,62
111 | Q3UPF5	144,148,149,150,87,88,89,90,151,96,97,98,105,106,107,108,170
112 | Q92841	280,281,284,426,427,428,301,302,303,304,307,449,450,457,329,335,338,475,476,477,497,252,253,254
113 | D7Y2H5	226,227,228,229,134,136,137,141,81,53
114 | Q9BTM1	12,13,14,15,16,78,18,77,76,17,29,30,33,36,43,44,45,46
115 | C5A3Z3	162,78,16,81,82,19,83,53,54,55,147,121,154,148,61,157,151,159
116 | I3DBY6	2,3,7,28,29,155,27,157,30,159,32,160,162,161,163,165,158,167,34,166,46,48,49,51,52,53,55,56,57,58,71,200,73,74,75,209,211,220,222,223,225,227,101,102,120
117 | P40947	3,7,13,14,15,16,19,21,33,35,37,50,52,54,56,62,66,70,73,74,80,86,88,91,97,98,105,106,107,108,109,110,111
118 | 
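
A minimal sketch of how one line of the binding-residue files above could be parsed, assuming each line holds a protein id, whitespace, and a comma-separated list of 1-based residue positions. The function name `parse_binding_residues` is hypothetical; the repository's own reader is `FileManager.read_binding_residues` in config.py, which is not shown here and may behave differently (for example, it may not sort the positions).

```python
from typing import Dict, Iterable, List


def parse_binding_residues(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each protein id to its sorted, 1-based binding residue positions."""
    residues = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines such as the trailing one
        # id and position list are separated by whitespace (a tab in the files)
        prot_id, positions = line.split(maxsplit=1)
        residues[prot_id] = sorted(int(p) for p in positions.split(','))
    return residues


example = ["O42978\t66,68,39", "Q9UUI1\t65,63", ""]
parsed = parse_binding_residues(example)
print(parsed["O42978"])  # [39, 66, 68]
```

Positions stay 1-based in this sketch; data_preparation.py shifts them to 0-based indices only when it fills the label array.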


--------------------------------------------------------------------------------
/data/independent_set/indep_set.fasta:
--------------------------------------------------------------------------------
 1 | >A0A0B0QJR1
 2 | MTNIEPVIIETRLELIGRYLDHLKKFENISLDDYLSSFEQQLITERLLQLITQAAIDINDHILSKLKSGKSYTNFEAFIELGKYQILTPELAKQIAPSSGLRNRLVHEYDDIDPNQVFMAISFALQQYPLYVRQINSYLITLEEEND
 3 | >A0A0C2W6A5
 4 | MDDFNNILDLLINESKKAIKHNDIPVSCCIIDSNNNILSLAINSRYKNKDISQHAEINVINDLISKLNSFNLSKYKLITTLEPCMMCYSAIKQVKINTIYYLVDSYKFGIKNNYSINDQNLNLIQIKNQKKQSEYIKLLNIFFINKR
 5 | >A0A1Y1BWQ0
 6 | MTTVAPERLSRIREIIAENIDVDLDGLSDTALFIDELGADSLKLIDVLSALEMEYSIVIDMNELPKMTNVEATYQVTAAAAGW
 7 | >A0QP43
 8 | MAGAQDFVPHTADLAELAAAAGECRGCGLYRDATQAVFGAGGRSARIMMIGEQPGDKEDLAGLPFVGPAGRLLDRALEAADIDRDALYVTNAVKHFKFTRAAGGKRRIHKTPSRTEVVACRPWLIAEMTSVEPDVVVLLGATAAKALLGNDFRVTQHRGEVLHVDDVPGDPALVATVHPSSLLRGPKEERESAFAGLVDDLRVAADVRP
 9 | >B8H4R9
10 | MADDAIPHTDVLNSTAQGQLKSIIERVERLEVEKAEIMEQIKEVYAEAKGNGFDVKVLKKVVRIRKQDRAKRQEEDAILDLYLSAIGEI
11 | >D0CAL0
12 | MSKVCQVTGKRPVVGNNVSHANNKTKRRFEPNLHHHRFWLESEKRFVRLRLTTKGMRIIDKLGIEKVVADLRAQGQKI
13 | >D0CBZ8
14 | MRADIHPKYEKLVATCSCGNVIETRSALGKETIYLDVCSACHPFYTGKQKNVDTGGRIDKFKQRFAGMSRSIKR
15 | >D0CD18
16 | MKVQASVKKICGSCKVIRRNGVIRVICSAEPRHKQRQG
17 | >D0CDQ7
18 | MATKKAGGSTKNGRDSNPKMLGVKVYGGQTVTAGNIIVRQRGTEFHAGANVGMGRDHTLFATADGVVKFEVKGQFGRRYVKVETV
19 | >E1C9L3
20 | DFIGSPTNLIMVTSTSLMLFAGRFGLAPSANRKATAGLKLEVRDSGLQTGDPAGFTLADTLACGVVGHIIGVGVVLGLKNIGAL
21 | >F0NDX5
22 | MLYEEDYKLALEAFKKVFNALTHYGAKQAFRSRARDLVEEIYNSGFIPTFFYIISKAELNSDSLDSLISLFSSDNAILRGSDENVSYSAYLFIILYYLIKRGIIEQKFLIQALRCEKTRLDLIDKLYNLAPIISAKIRTYLLAIKRLSEALIEAR
23 | >F0Q4R9
24 | MQRMGMVIGIKPEHIDEYKRLHAAVWPAVLARLAEAHVRNYSIFLREPENLLFGYWEYHGTDYAADMEAIAQDPETRRWWTFCGPCQEPLASRQPGEHWAHMEEVFHVD
25 | >K5BJ73
26 | MSTPQDNANTVHRYLEFVAKGQPDEIAALYADDATVEDPVGSEVHIGRQAIRGFYGNLENVQSRTEVKTLRALGHEVAFYWTLSIGGDEGGMTMDIISVMTFNDDGRIKSMKAYWTPENITQR
27 | >L7T0L4
28 | MPGIAVCNMDSAGGVILPGPNVKCFYKGQPFAVIGCAVAGHGRTPHDSARMIQGSVKMAIAGIPVCLQGSMASCGHTATGRPNLTCGS
29 | >O94408
30 | MLFYSFFKTLIDTEVTVELKNDMSIRGILKSVDQFLNVKLENISVVDASKYPHMAAVKDLFIRGSVVRYVHMSSAYVDTILLADACRRDLANNKRQ
31 | >P00441
32 | MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
33 | >P04632
34 | MFLVNSFLKGGGGGGGGGGGLGGGLGNVLGGLISGAGGGGGGGGGGGGGGGGGGGGTAMRILGGVISAISEAAAQYNPEPPPPRTHYSNIEANESEEVRQFRRLFAQLAGDDMEVSATELMNILNKVVTRHPDLKTDGFGIDTCRSMVAVMDSDTTGKLGFEEFKYLWNNIKRWQAIYKQFDTDRSGTICSSELPGAFEAAGFHLNEHLYNMIIRRYSDESGNMDFDNFISCLVRLDAMFRAFKSLDKDGTGQIQVNIQEWLQLTMYS
35 | >P07014
36 | MRLEFSIYRYNPDVDDAPRMQDYTLEADEGRDMMLLDALIQLKEKDPSLSFRRSCREGVCGSDGLNMNGKNGLACITPISALNQPGKKIVIRPLPGLPVIRDLVVDMGQFYAQYEKIKPYLLNNGQNPPAREHLQMPEQREKLDGLYECILCACCSTSCPSFWWNPDKFIGPAGLLAAYRFLIDSRDTETDSRLDGLSDAFSVFRCHSIMNCVSVCPKGLNPTRAIGHIKSMLLQRNA
37 | >P0AG51
38 | MAKTIKITQTRSAIGRLPKHKATLLGLGLRRIGHTVEREDTPAIRGMINAVSFMVKVEE
39 | >P17227
40 | MINLPSLFVPLVGLLFPAVAMASLFLHVEKRLLFSTKKIN
41 | >P39230
42 | MKKFIFATIFALASCAAQPAMAGYDKDLCEWSMTADQTEVETQIEADIMNIVKRDRPEMKAEVQKQLKSGGVMQYNYVLYCDKNFNNKNIIAEVVGE
43 | >P43215
44 | MVAMFLAVAVVLGLATSPTAEGGKATTEEQKLIEDVNASFRAAMATTANVPPADKYKTFEAAFTVSSKRNLADAVSKAPQLVPKLDEVYNAAYNAADHAAPEDKYEAFVLHFSEALRIIAGTPEVHAVKPGA
45 | >P60624
46 | MAAKIRRDDEVIVLTGKDKGKRGKVKNVLSSGKVIVEGINLVKKHQKPVPALNQPGGIVEKEAAIQVSNVAIFNAATGKADRVGFRFEDGKKVRFFKSNSETIK
47 | >P62805
48 | MSGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKVFLENVIRDAVTYTEHAKRKTVTAMDVVYALKRQGRTLYGFGG
49 | >P62937
50 | MVNPTVFFDIAVDGEPLGRVSFELFADKVPKTAENFRALSTGEKGFGYKGSCFHRIIPGFMCQGGDFTRHNGTGGKSIYGEKFEDENFILKHTGPGILSMANAGPNTNGSQFFICTAKTEWLDGKHVVFGKVKEGMNIVEAMERFGSRNGKTSKKITIADCGQLE
51 | >P63073
52 | MATVEPETTPTTNPPPAEEEKTESNQEVANPEHYIKHPLQNRWALWFFKNDKSKTWQANLRLISKFDTVEDFWALYNHIQLSSNLMPGCDYSLFKDGIEPMWEDEKNKRGGRWLITLNKQQRRSDLDRFWLETLLCLIGESFDDYSDDVCGAVVNVRAKGDKIAIWTTECENRDAVTHIGRVYKERLGLPPKIVIGYQSHADTATKSGSTTKNRFVV
53 | >P9WPA4
54 | MTGAVCPGSFDPVTLGHVDIFERAAAQFDEVVVAILVNPAKTGMFDLDERIAMVKESTTHLPNLRVQVGHGLVVDFVRSCGMTAIVKGLRTGTDFEYELQMAQMNKHIAGVDTFFVATAPRYSFVSSSLAKEVAMLGGDVSELLPEPVNRRLRDRLNTERT
55 | >Q01782
56 | MTAPTVPVALVTGAAKRLGRSIAEGLHAEGYAVCLHYHRSAAEANALSATLNARRPNSAITVQADLSNVATAPVSGADGSAPVTLFTRCAELVAACYTHWGRCDVLVNNASSFYPTPLLRNDEDGHEPCVGDREAMETATADLFGSNAIAPYFLIKAFAHRFAGTPAKHRGTNYSIINMVDAMTNQPLLGYTIYTMAKGALEGLTRSAALELAPLQIRVNGVGPGLSVLVDDMPPAVWEGHRSKVPLYQRDSSAAEVSDVVIFLCSSKAKYITGTCVKVDGGYSLTRA
57 | >Q03503
58 | MEIVYKPLDIRNEEQFASIKKLIDADLSEPYSIYVYRYFLNQWPELTYIAVDNKSGTPNIPIGCIVCKMDPHRNVRLRGYIGMLAVESTYRGHGIAKKLVEIAIDKMQREHCDEIMLETEVENSAALNLYEGMGFIRMKRMFRYYLNEGDAFKLILPLTEKSCTRSTFLMHGRLAT
59 | >Q07654
60 | MKRVLSCVPEPTVVMAARALCMLGLVLALLSSSSAEEYVGLSANQCAVPAKDRVDCGYPHVTPKECNNRGCCFDSRIPGVPWCFKPLQEAECTF
61 | >Q2G285
62 | MIIKNYSYARQNLKALMTKVNDDSDMVTVTSTDDKNVVIMSESDYNSMMETLYLQQNPNNAEHLAQSIADLERGKTITKDIDV
63 | >Q32904
64 | MATQALVSSSSLTFAAEAVRQSFRARSLPSSVGCSRKGLVRAAATPPVKQGGVDRPLWFASKQSLSYLDGSLPGDYGFDPLGLSDPEGTGGFIEPRWLAYGEVINGRFAMLGAVGAIAPEYLGKVGLIPQETALAWFQTGVIPPAGTYNYWADNYTLFVLEMALMGFAEHRRFQDWAKPGSMGKQYFLGLEKGFGGSGNPAYPGGPFFNPLGFGKDEKSLKELKLKEVKNGRLAMLAILGYFIQGLVTGVGPYQNLLDHVADPVNNNVLTSLKFH
65 | >Q4KCZ1
66 | MDGEEVKEKIRRYIMEDLIGPSAKEDELDDQTPLLEWGILNSMNIVKLMVYIRDEMGVSIPSTHITGKYFKDLNAISRTVEQLKAECA
67 | >Q58380
68 | MEALVLVGHGSRLPYSKELLVKLAEKVKERNLFPIVEIGLMEFSEPTIPQAVKKAIEQGAKRIIVVPVFLAHGIHTTRDIPRLLGLIEDNHEHHHEHSHHHHHHHHHEHEKLEIPEDVEIIYREPIGADDRIVDIIIDRAFGR
69 | >Q709H6
70 | MTRLEVLIRPTEQTAAKANAVGYTHALTWVWHSQTWDVDSVRDPSLRADFNPEKVGWVSVSFACTQCTAHYYTSEQVKYFTNIPPVHFDVVCADCERSVQLDDEIDREHQERNAEISACNARALSEGRPASLVYLSRDACDIPEHSGRCRFVKYLNF
71 | >Q8A5J2
72 | MKENKLDYIPEPMDLSLVDLPESLIQLSERIAENVHEVWAKARIDEGWTYGEKRDDIHKKHPCLVPYDELPEEEKEYDRNTAMNTIKMVKKLGFRIEKED
73 | >Q8LGJ5
74 | MGTDTVMSGRVRKDLSKTNPNGNIPENRSNSRKKIQRRSKKTLICPVQKLFDTCKKVFADGKSGTVPSQENIEMLRAVLDEIKPEDVGVNPKMSYFRSTVTGRSPLVTYLHIYACHRFSICIFCLPPSGVIPLHNHPEMTVFSKLLFGTMHIKSYDWVPDSPQPSSDTRLAKVKVDSDFTAPCDTSILYPADGGNMHCFTAKTACAVLDVIGPPYSDPAGRHCTYYFDYPFSSFSVDGVVVAEEEKEGYAWLKEREEKPEDLTVTALMYSGPTIKE
75 | >Q9BTM1
76 | MSGRGKQGGKVRAKAKSRSSRAGLQFPVGRVHRLLRKGNYAERVGAGAPVYLAAVLEYLTAEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGKVTIAQGGVLPNIQAVLLPKKTESQKTKSK
77 | >Q9H492
78 | MPSDRPFKQRRSFADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF
79 | >Q9I0B9
80 | MDELFEEHLEIAKALFAQRLPYWCDVFLRPADQAFNAYLNARGQASTYLVLEGFDPVYVPRGCDLDAVRATARARARLREAGLGEDALPVLL
81 | >Q9K943
82 | MPTPSMEDYLERIYLLIEEKGYARVSDIAEALEVHPSSVTKMVQKLDKSDYLVYERYRGLILTAKGKKIGKRLVYRHDLLEDFLKMIGVDSDHIYEDVEGIEHHLSWDAIDRIGDLVQYFQEDPSRLNDLREVQKKNEE
83 | >Q9KDJ7
84 | MSDEKKILGEERRSLLIKWLKASDTPLTGAELAKRTNVSRQVIVQDVSLLKAKNHPILATAQGYIYMKEANTVQAQRVVACQHGPADMKDELLTLVDHGVLIKDVTVDHPVYGDITASLHLKSRKDVALFCKRMEESNGTLLSTLTKGVHMHTLEAESEAILDEAIRALEEKGYLLNSF
85 | >Q9LFM3
86 | MGAGREVSVSLDGVRDKNLMQLKILNTVLFPVRYNDKYYADAIAAGEFTKLAYYNDICVGAIACRLEKKESGAMRVYIMTLGVLAPYRGIGIGSNLLNHVLDMCSKQNMCEIYLHVQTNNEDAIKFYKKFGFEITDTIQNYYINIEPRDCYVVSKSFAQSEANK
87 | >Q9SJ89
88 | MNLQAVSCSFGFLSSPLGVTPRTSFRRFVIRAKTEPSEKSVEIMRKFSEQYARRSGTYFCVDKGVTSVVIKGLAEHKDSYGAPLCPCRHYDDKAAEVGQGFWNCPCVPMRERKECHCMLFLTPDNDFAGKDQTITSDEIKETTANM
89 | >U2EQ00
90 | MAWLILIIAGIFEVVWAIALKYSNGFTRLIPSMITLIGMLISFYLLSQATKTLPIGTAYAIWTGIGALGAVICGIIFFKEPLTALRIVFMILLLTGIIGLKATSS
91 | >X2D812
92 | MTSERRPADPEIVEGLPIPLAVAGHHQPAPFYLTADMFGGLPVQLAGGELSTLVGKPVAAPHTHPVDELYLLVSPNKGGARIEVQLDGRRHELLSPAVMRIPAGSEHCFLTLEAEVGSYCFGILLGDRL
93 | 


--------------------------------------------------------------------------------
/data/independent_set/indep_set.txt:
--------------------------------------------------------------------------------
 1 | A0A0B0QJR1
 2 | A0A0C2W6A5
 3 | A0A1Y1BWQ0
 4 | A0QP43
 5 | B8H4R9
 6 | D0CAL0
 7 | D0CBZ8
 8 | D0CD18
 9 | D0CDQ7
10 | E1C9L3
11 | F0NDX5
12 | F0Q4R9
13 | K5BJ73
14 | L7T0L4
15 | O94408
16 | P00441
17 | P04632
18 | P07014
19 | P0AG51
20 | P17227
21 | P39230
22 | P43215
23 | P60624
24 | P62805
25 | P62937
26 | P63073
27 | P9WPA4
28 | Q01782
29 | Q03503
30 | Q07654
31 | Q2G285
32 | Q32904
33 | Q4KCZ1
34 | Q58380
35 | Q709H6
36 | Q8A5J2
37 | Q8LGJ5
38 | Q9BTM1
39 | Q9H492
40 | Q9I0B9
41 | Q9K943
42 | Q9KDJ7
43 | Q9LFM3
44 | Q9SJ89
45 | U2EQ00
46 | X2D812
47 | 


--------------------------------------------------------------------------------
/data_preparation.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import numpy as np
  3 | import sys
  4 | from collections import defaultdict
  5 | 
  6 | from config import FileManager, FileSetter
  7 | from assess_performance import PerformanceAssessment
  8 | 
  9 | 
 10 | class MyDataset(torch.utils.data.Dataset):
 11 | 
 12 |     """Dataset for bindEmbed21DL"""
 13 | 
 14 |     def __init__(self, samples, embeddings, seqs, labels, max_length, protein_prediction=False):
 15 |         self.protein_prediction = protein_prediction
 16 |         self.samples = samples
 17 |         self.embeddings = embeddings
 18 |         self.seqs = seqs
 19 | 
 20 |         self.labels = labels
 21 |         self.max_length = max_length
 22 |         self.n_features = self.get_input_dimensions()
 23 | 
 24 |         print('Number of input features: {}'.format(self.n_features))
 25 | 
 26 |     def __len__(self):
 27 |         return len(self.samples)
 28 | 
 29 |     def __getitem__(self, item):
 30 |         prot_id = self.samples[item]
 31 |         prot_length = len(self.seqs[prot_id])
 32 | 
 33 |         embedding = self.embeddings[prot_id]
 34 | 
 35 |         # pad all inputs to the maximum length & add another feature to encode whether the element is a position
 36 |         # in the sequence or padded
 37 |         features = np.zeros((self.n_features + 1, self.max_length), dtype=np.float32)
 38 |         features[:self.n_features, :prot_length] = np.transpose(embedding)  # set feature maps to embedding values
 39 |         features[self.n_features, :prot_length] = 1  # set last element to 1 because positions are not padded
 40 | 
 41 |         target = np.zeros((3, self.max_length), dtype=np.float32)
 42 |         target[:3, :prot_length] = np.transpose(self.labels[prot_id])
 43 |         loss_mask = np.zeros((3, self.max_length), dtype=np.float32)
 44 |         loss_mask[:3, :prot_length] = 1 * prot_length
 45 | 
 46 |         if self.protein_prediction:
 47 |             return features, target, loss_mask, prot_id
 48 |         else:
 49 |             return features, target, loss_mask
 50 | 
 51 |     def get_input_dimensions(self):
 52 |         first_key = list(self.embeddings.keys())[0]
 53 |         first_embedding = self.embeddings[first_key]
 54 | 
 55 |         return np.shape(first_embedding)[1]
 56 | 
 57 | 
 58 | class ProteinInformation(object):
 59 | 
 60 |     @staticmethod
 61 |     def get_data(ids):
 62 |         """
 63 |         Get sequences, labels, and maximum length for a set of ids
 64 |         :param ids: Protein ids to load
 65 |         :return: sequences, max. length, labels
 66 |         """
 67 | 
 68 |         sequences = FileManager.read_fasta(FileSetter.fasta_file())
 69 |         max_length = ProteinInformation.determine_max_length(sequences, ids)
 70 |         labels = ProteinInformation.get_labels(ids, sequences)
 71 | 
 72 |         return sequences, max_length, labels
 73 | 
 74 |     @staticmethod
 75 |     def get_data_predictions(ids, fasta_file):
 76 |         """
 77 |         Generate dummy labels for test proteins without annotations to allow re-use of general DataLoader
 78 |         :param ids: Ids of the proteins to predict
 79 |         :param fasta_file: FASTA file containing the sequences for these ids
 80 |         :return: sequences, max. length, dummy labels
 81 |         """
 82 |         sequences = FileManager.read_fasta(fasta_file)
 83 | 
 84 |         max_length = ProteinInformation.determine_max_length(sequences, ids)
 85 |         labels = dict()
 86 |         for i in ids:
 87 |             prot_length = len(sequences[i])
 88 |             binding_tensor = np.zeros([prot_length, 3], dtype=np.float32)
 89 |             labels[i] = binding_tensor
 90 | 
 91 |         return sequences, max_length, labels
 92 | 
 93 |     @staticmethod
 94 |     def determine_max_length(sequences, ids):
 95 |         """Get maximum length in set of sequences"""
 96 |         max_len = 0
 97 |         for i in ids:
 98 |             if len(sequences[i]) > max_len:
 99 |                 max_len = len(sequences[i])
100 | 
101 |         return max_len
102 | 
103 |     @staticmethod
104 |     def get_labels(ids, sequences, file_prefix=None):
105 |         """
106 |         Read binding residues for metal, nucleic acids, and small molecule binding
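107 |         :param ids: Protein ids to generate labels for
108 |         :param sequences: Sequences indexed by protein id; used to determine protein length
109 |         :param file_prefix: If None, files set in FileSetter will be used
110 |         :return: Dict mapping protein id to a (length x 3) array with columns metal, nucleic, small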
111 |         """
112 |         labels = dict()
113 | 
114 |         if file_prefix is None:
115 |             metal_residues = FileManager.read_binding_residues(FileSetter.binding_residues_by_ligand('metal'))
116 |             nuclear_residues = FileManager.read_binding_residues(FileSetter.binding_residues_by_ligand('nuclear'))
117 |             small_residues = FileManager.read_binding_residues(FileSetter.binding_residues_by_ligand('small'))
118 |         else:
119 |             metal_residues = FileManager.read_binding_residues('{}_metal.txt'.format(file_prefix))
120 |             nuclear_residues = FileManager.read_binding_residues('{}_nuclear.txt'.format(file_prefix))
121 |             small_residues = FileManager.read_binding_residues('{}_small.txt'.format(file_prefix))
122 | 
123 |         for prot_id in ids:
124 |             prot_length = len(sequences[prot_id])
125 |             binding_tensor = np.zeros([prot_length, 3], dtype=np.float32)
126 | 
127 |             metal_res = nuc_res = small_res = []
128 | 
129 |             if prot_id in metal_residues.keys():
130 |                 metal_res = metal_residues[prot_id]
131 |             if prot_id in nuclear_residues.keys():
132 |                 nuc_res = nuclear_residues[prot_id]
133 |             if prot_id in small_residues.keys():
134 |                 small_res = small_residues[prot_id]
135 | 
136 |             metal_residues_0_ind = ProteinInformation._get_zero_based_residues(metal_res)
137 |             nuc_residues_0_ind = ProteinInformation._get_zero_based_residues(nuc_res)
138 |             small_residues_0_ind = ProteinInformation._get_zero_based_residues(small_res)
139 | 
140 |             binding_tensor[metal_residues_0_ind, 0] = 1
141 |             binding_tensor[nuc_residues_0_ind, 1] = 1
142 |             binding_tensor[small_residues_0_ind, 2] = 1
143 | 
144 |             labels[prot_id] = binding_tensor
145 | 
146 |         return labels
147 | 
148 |     @staticmethod
149 |     def _get_zero_based_residues(residues):
150 |         residues_0_ind = []
151 |         for r in residues:
152 |             residues_0_ind.append(int(r) - 1)
153 | 
154 |         return residues_0_ind
155 | 
156 | 
157 | class ProteinResults(object):
158 | 
159 |     def __init__(self, name, bind_cutoff=0.5):
160 |         self.name = name
161 |         self.labels = np.array([])
162 |         self.predictions = np.array([])
163 | 
164 |         self.bind_cutoff = bind_cutoff
165 |         # cutoff to define if label is binding/non-binding; default: 0: non-binding, 1:binding
166 | 
167 |     def set_labels(self, labels):
168 |         self.labels = np.array(labels)
169 | 
170 |     def set_predictions(self, predictions):
171 |         self.predictions = np.around(np.array(np.transpose(predictions)), 3)
172 | 
173 |     def add_predictions(self, predictions):
174 |         self.predictions = np.add(self.predictions, np.around(predictions, 3))
175 | 
176 |     def normalize_predictions(self, norm_factor):
177 |         self.predictions = np.around(self.predictions / norm_factor, 3)
178 | 
179 |     def calc_num_predictions(self, cutoff):
180 |         num_predictions = np.count_nonzero(self.predictions >= cutoff, axis=0)
181 | 
182 |         return num_predictions[0], num_predictions[1], num_predictions[2]
183 | 
184 |     def get_bound_ligand(self, cutoff):
185 |         num_labels = np.count_nonzero(self.labels >= cutoff, axis=0)
186 | 
187 |         metal = nuclear = small = False
188 |         if num_labels[0] > 0:
189 |             metal = True
190 |         if num_labels[1] > 0:
191 |             nuclear = True
192 |         if num_labels[2] > 0:
193 |             small = True
194 | 
195 |         return metal, nuclear, small
196 | 
197 |     def get_predictions_ligand(self, ligand):
198 | 
199 |         if ligand == 'metal':
200 |             return self.predictions[:, 0]
201 |         elif ligand == 'nucleic':
202 |             return self.predictions[:, 1]
203 |         elif ligand == 'small':
204 |             return self.predictions[:, 2]
205 |         elif ligand == 'overall':
206 |             return np.amax(self.predictions, axis=1)
207 |         else:
208 |             sys.exit('{} is not a valid ligand type'.format(ligand))
209 | 
210 |     def calc_performance_measurements(self, cutoff):
211 |         performance = self.calc_performance_given_labels(cutoff, self.labels)
212 | 
213 |         return performance
214 | 
215 |     def calc_performance_given_labels(self, cutoff, ligand_labels):
216 |         """Calculate performance values for this protein"""
217 |         performance = defaultdict(dict)
218 |         num_ligands = np.shape(ligand_labels)[1]
219 | 
220 |         if num_ligands > 1:  # ligand-type assessment
221 |             # calc per-ligand assessment for multi-label prediction
222 |             for i in range(0, num_ligands):
223 |                 tp = fp = tn = fn = 0
224 |                 cross_pred = [0, 0, 0, 0]
225 |                 for idx, lig in enumerate(ligand_labels):
226 |                     if self.predictions[idx, i] >= cutoff:  # predicted as binding to this ligand
227 |                         cross_prediction = False
228 |                         true_prediction = False
229 | 
230 |                         for j in range(0, num_ligands):
231 |                             if i == j:  # same as predicted ligand
232 |                                 if lig[j] >= self.bind_cutoff:  # also annotated to this ligand
233 |                                     tp += 1
234 |                                     cross_pred[i] += 1
235 |                                     true_prediction = True
236 |                                 else:
237 |                                     fp += 1
238 |                             else:
239 |                                 if lig[j] >= self.bind_cutoff and not true_prediction:
240 |                                     cross_pred[j] += 1
241 |                                     cross_prediction = True
242 | 
243 |                         if not true_prediction and not cross_prediction:
244 |                             # residue is not annotated to bind any of the ligands
245 |                             cross_pred[3] += 1
246 |                     else:
247 |                         if lig[i] >= self.bind_cutoff:
248 |                             fn += 1
249 |                         else:
250 |                             tn += 1
251 | 
252 |                 if i == 0:
253 |                     ligand = 'metal'
254 |                 elif i == 1:
255 |                     ligand = 'nucleic'
256 |                 else:
257 |                     ligand = 'small'
258 | 
259 |                 bound = False
260 |                 if (tp + fn) > 0:
261 |                     bound = True
262 |                 acc, prec, recall, f1, mcc = PerformanceAssessment.calc_performance_measurements(tp, fp, tn, fn)
263 |                 # calculate performance measurements for negatives
264 |                 _, neg_p, neg_r, neg_f1, _ = PerformanceAssessment.calc_performance_measurements(tn, fn, tp, fp)
265 | 
266 |                 performance[ligand] = {'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'acc': acc, 'prec': prec,
267 |                                        'recall': recall, 'f1': f1, 'neg_prec': neg_p, 'neg_recall': neg_r,
268 |                                        'neg_f1': neg_f1, 'mcc': mcc, 'bound': bound,
269 |                                        'cross_prediction': cross_pred}
270 | 
271 |         # get overall performance
272 |         reduced_labels = np.sum(ligand_labels >= self.bind_cutoff, axis=1)
273 |         if len(self.predictions.shape) == 1:
274 |             reduced_predictions = (self.predictions >= cutoff)
275 |         else:
276 |             reduced_predictions = np.sum(self.predictions >= cutoff, axis=1)
277 | 
278 |         tp = np.sum(np.logical_and(reduced_labels > 0, reduced_predictions > 0))
279 |         fp = np.sum(np.logical_and(reduced_labels == 0, reduced_predictions > 0))
280 |         tn = np.sum(np.logical_and(reduced_labels == 0, reduced_predictions == 0))
281 |         fn = np.sum(np.logical_and(reduced_labels > 0, reduced_predictions == 0))
282 | 
283 |         acc, prec, recall, f1, mcc = PerformanceAssessment.calc_performance_measurements(tp, fp, tn, fn)
284 |         _, neg_p, neg_r, neg_f1, _ = PerformanceAssessment.calc_performance_measurements(tn, fn, tp, fp)
285 |         performance['overall'] = {'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'acc': acc, 'prec': prec, 'recall': recall,
286 |                                   'f1': f1, 'neg_prec': neg_p, 'neg_recall': neg_r, 'neg_f1': neg_f1, 'mcc': mcc,
287 |                                   'bound': True, 'cross_prediction': [0, 0, 0, 0]}
288 | 
289 |         return performance
290 | 
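
The padding done in `MyDataset.__getitem__` can be illustrated in isolation. The sketch below mirrors that logic for a single protein, assuming an embedding of shape (length, n_features) and labels of shape (length, 3). The helper name `pad_protein` is hypothetical and not part of the repository.

```python
import numpy as np


def pad_protein(embedding, labels, max_length):
    """Pad one protein's (L, F) embedding and (L, 3) labels to max_length."""
    length, n_features = embedding.shape

    # one extra feature row flags real sequence positions (1) vs. padding (0)
    features = np.zeros((n_features + 1, max_length), dtype=np.float32)
    features[:n_features, :length] = embedding.T
    features[n_features, :length] = 1

    target = np.zeros((3, max_length), dtype=np.float32)
    target[:, :length] = labels.T

    # as in __getitem__, real positions carry the protein length, padding stays 0
    loss_mask = np.zeros((3, max_length), dtype=np.float32)
    loss_mask[:, :length] = length

    return features, target, loss_mask


emb = np.ones((4, 8), dtype=np.float32)   # toy protein: 4 residues, 8 features
lab = np.zeros((4, 3), dtype=np.float32)
lab[1, 0] = 1                             # residue 2 binds metal
features, target, loss_mask = pad_protein(emb, lab, max_length=6)
print(features.shape, target.shape, loss_mask.shape)  # (9, 6) (3, 6) (3, 6)
```

Padded positions have a zero in the loss mask, so a loss multiplied by this mask ignores them. How the non-zero mask values are used during training is defined elsewhere in the repository and is not shown in this file.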


--------------------------------------------------------------------------------
/develop_bindEmbed21DL.py:
--------------------------------------------------------------------------------
 1 | from bindEmbed21DL import BindEmbed21DL
 2 | from assess_performance import PerformanceAssessment
 3 | from config import FileSetter, FileManager, GeneralInformation
 4 | from data_preparation import ProteinInformation
 5 | 
 6 | import sys
 7 | from pathlib import Path
 8 | 
 9 | 
10 | def main():
11 |     GeneralInformation.seed_all(42)
12 | 
13 |     keyword = sys.argv[1]
14 | 
15 |     path = ''  # TODO set path to working directory
16 |     Path(path).mkdir(parents=True, exist_ok=True)
17 | 
18 |     cutoff = 0.5
19 |     ri = False  # Should RI or raw probabilities be written?
20 | 
21 |     sequences = FileManager.read_fasta(FileSetter.fasta_file())
22 | 
23 |     if keyword == 'optimize-architecture':
24 |         params = {'lr': [0.1, 0.05, 0.01], 'betas': [(0.9, 0.999)], 'eps': [1e-8], 'weight_decay': [0.0],
25 |                   'features': [512, 256, 128, 64, 32], 'kernel': [3, 5, 7], 'stride': [1],
26 |                   'dropout': [0.7, 0.5, 0.4, 0.3, 0.2], 'epochs': [200], 'early_stopping': [True],
27 |                   'weights': [[8.9, 7.7, 4.4]]}
28 | 
29 |         result_file = '{}/cross_validation_results.txt'.format(path)
30 | 
31 |         BindEmbed21DL.hyperparameter_optimization_pipeline(params, 5, result_file)
32 | 
33 |     elif keyword == 'best-training':
34 |         params = {'lr': 0.01, 'betas': (0.9, 0.999), 'eps': 1e-8, 'weight_decay': 0.0, 'features': 128, 'kernel': 5,
35 |                   'stride': 1, 'dropout': 0.7, 'epochs': 200, 'early_stopping': True, 'weights': [8.9, 7.7, 4.4]}
36 | 
37 |         model_prefix = '{}/trained_model'.format(path)
38 |         prediction_folder = '{}/predictions'.format(path)
39 |         Path(prediction_folder).mkdir(parents=True, exist_ok=True)
40 | 
41 |         proteins = BindEmbed21DL.cross_train_pipeline(params, model_prefix, prediction_folder, ri)
42 | 
43 |         # assess performance
44 |         labels = ProteinInformation.get_labels(proteins.keys(), sequences)
45 |         model_performances = PerformanceAssessment.combine_protein_performance(proteins, cutoff, labels)
46 |         PerformanceAssessment.print_performance_results(model_performances)
47 | 
48 |     elif keyword == 'testing':
49 |         model_prefix = '{}/trained_model'.format(path)
50 |         prediction_folder = '{}/predictions_testset/'.format(path)
51 |         Path(prediction_folder).mkdir(parents=True, exist_ok=True)
52 | 
53 |         ids_in = FileSetter.test_ids_in()
54 |         fasta_file = FileSetter.fasta_file()
55 |         proteins = BindEmbed21DL.prediction_pipeline(model_prefix, cutoff, prediction_folder, ids_in, fasta_file, ri)
56 | 
57 |         # assess performance
58 |         labels = ProteinInformation.get_labels(proteins.keys(), sequences)
59 |         model_performances = PerformanceAssessment.combine_protein_performance(proteins, cutoff, labels)
60 |         PerformanceAssessment.print_performance_results(model_performances)
61 | 
62 | 
 63 | if __name__ == '__main__':
 64 |     main()
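
The `optimize-architecture` branch passes a dictionary of candidate lists to `BindEmbed21DL.hyperparameter_optimization_pipeline`. How that method walks the grid is not shown in this file. As a rough sketch, assuming an exhaustive search over the Cartesian product of all lists, the candidate settings could be enumerated as follows (`expand_grid` is a hypothetical helper, and the grid is a subset of the keys used above):

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the candidate values."""
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


grid = {'lr': [0.1, 0.05, 0.01],
        'features': [512, 256, 128, 64, 32],
        'kernel': [3, 5, 7],
        'dropout': [0.7, 0.5, 0.4, 0.3, 0.2],
        'weights': [[8.9, 7.7, 4.4]]}

combos = list(expand_grid(grid))
print(len(combos))  # 3 * 5 * 3 * 5 * 1 = 225
print(combos[0])
```

Under this assumption, the keys that hold a single candidate (such as `betas`, `eps`, or `stride` in the script) do not increase the number of combinations.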


--------------------------------------------------------------------------------
/homology_based_inference.py:
--------------------------------------------------------------------------------
  1 | from config import FileSetter
  2 | from mmseqs_wrapper import MMSeqs2Wrapper
  3 | 
  4 | import math
  5 | from collections import defaultdict
  6 | import os
  7 | 
  8 | 
  9 | class MMseqsPredictor(object):
 10 | 
 11 |     def __init__(self, exclude_self_hits):
 12 |         self.exclude_self_hits = exclude_self_hits
 13 | 
 14 |     def get_mmseqs_hits(self, ids, evalue, criterion, set_name, hval=0):
 15 |         """
 16 |         Get MMseqs2 hits using local alignments at a specific E-value (H-value)
 17 |         :param ids: Ids to get hits for
 18 |         :param evalue: E-value cutoff to use for (1) MMseqs2 search and (2) filtering of hits
 19 |         :param criterion: eval or hval (should E-value or H-value be used to filter hits)
 20 |         :param set_name: Name for the query set; used for output files
 21 |         :param hval: H-value cutoff to use for filtering hits
 22 |         :return: Best hit per query id (hit id mapped to its E-value or H-value) and the MMseqs2 output file
 23 |         """
 24 |         mmseqs_hits = defaultdict(dict)
 25 | 
 26 |         set_ids = set(ids)
 27 |         # calc MMseqs2 alignments
 28 |         mmseqs_out = MMseqsPredictor.calc_mmseqs_results(evalue, set_name)
 29 | 
 30 |         if os.path.exists(mmseqs_out):
 31 |             # parse MMseqs2 output
 32 |             with open(mmseqs_out) as read_mmseqs:
 33 |                 for line in read_mmseqs:
 34 |                     splitted_line = line.strip().split()
 35 |                     i = splitted_line[0]
 36 |                     if i in set_ids:
 37 |                         hit = splitted_line[1]
 38 |                         if i != hit or not self.exclude_self_hits:
 39 |                             e_val = float(splitted_line[2])
 40 | 
 41 |                             nident = int(splitted_line[3])  # number of identical positions
 42 |                             nmismatch = int(splitted_line[4])  # number of mismatches
 43 | 
 44 |                             lali = nident + nmismatch
 45 |                             pide = nident / lali
 46 | 
 47 |                             h_val = MMseqsPredictor._calc_hssp(pide, lali)
 48 | 
 49 |                             if criterion == 'eval' and e_val <= float(evalue):
 50 |                                 # filter hits by E-value
 51 |                                 mmseqs_hits[i][hit] = {'pide': pide, 'lali': lali, 'eval': e_val, 'hval': h_val}
 52 |                             elif criterion == 'hval' and h_val >= float(hval):
 53 |                                 # filter hits by H-value
 54 |                                 mmseqs_hits[i][hit] = {'pide': pide, 'lali': lali, 'eval': e_val, 'hval': h_val}
 55 | 
 56 |         # determine hit with minimal E-value/maximum H-value
 57 |         best_hits = defaultdict(dict)
 58 |         for i in mmseqs_hits.keys():
 59 |             max_hit = ''
 60 |             min_eval = 1e5
 61 |             max_pide = 0
 62 |             max_hval = -100
 63 | 
 64 |             for h in mmseqs_hits[i].keys():
 65 |                 e_val = mmseqs_hits[i][h]['eval']
 66 |                 pide = mmseqs_hits[i][h]['pide']
 67 |                 h_val = mmseqs_hits[i][h]['hval']
 68 | 
 69 |                 if criterion == 'eval' and (e_val < min_eval or (e_val <= min_eval and pide > max_pide)):
 70 |                     max_hit = h
 71 |                     min_eval = e_val
 72 |                     max_pide = pide
 73 |                 if criterion == 'hval' and (h_val > max_hval or (h_val >= max_hval and pide > max_pide)):
 74 |                     max_hit = h
 75 |                     max_hval = h_val
 76 |                     max_pide = pide
 77 | 
 78 |             if criterion == 'eval':
 79 |                 best_hits[i][max_hit] = min_eval
 80 |             else:
 81 |                 best_hits[i][max_hit] = max_hval
 82 | 
 83 |         return best_hits, mmseqs_out
 84 | 
 85 |     @staticmethod
 86 |     def calc_mmseqs_results(evalue, set_name):
 87 |         """
 88 |         Calculate MMseqs2 alignments using profiles against big80
 89 |         :param evalue: e-value threshold for final alignments
 90 |         :param set_name: Meaningful set name to ensure that previous results are not overwritten/re-used
 91 |         :return: File to which MMseqs2 results were written
 92 |         """
 93 | 
 94 |         fasta_in = FileSetter.query_set()
 95 |         aln_out = '{}/mmseqs_big80_results_{}.m8'.format(FileSetter.mmseqs_output(), set_name)
 96 | 
 97 |         # mmseqs profile search against big80
 98 |         profile_db = FileSetter.profile_db()  # use pre-computed database from big80
 99 |         lookup_db = FileSetter.lookup_db()  # use pre-computed database of lookup set
100 | 
101 |         mmseqs = MMSeqs2Wrapper(FileSetter.mmseqs_path(), FileSetter.mmseqs_output(), set_name)
102 | 
103 |         # create database of the query set
104 |         query_db = mmseqs.create_db(fasta_in, set_name)
105 | 
106 |         # create profiles against big80
107 |         aln_res = mmseqs.seq2seq_search(query_db, profile_db, num_iterations=2)
108 |         query_profiles = mmseqs.alnres2prof(query_db, profile_db, aln_res)
109 | 
110 |         # search profiles against lookup db
111 |         profile_hits_raw = mmseqs.prof2seq_search(query_profiles, lookup_db, 0, evalue)
112 |         mmseqs.convertali(query_db, lookup_db, profile_hits_raw, aln_out)
113 | 
114 |         return aln_out
115 | 
116 |     @staticmethod
117 |     def _calc_hssp(pide, lali):
118 |         """
119 |         Calculate H-value
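120 |         :param pide: Pairwise sequence identity as a fraction in [0, 1] (converted to percent below)
121 |         :param lali: Alignment length (excl. gaps)
122 |         :return: H-value, i.e. distance of pide (in percentage points) from the HSSP curve at length lali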
123 |         """
124 | 
125 |         pide = pide * 100
126 | 
127 |         if lali <= 11:
128 |             return pide - 100
129 |         elif lali > 450:
130 |             return pide - 19.5
131 |         else:
132 |             exp = -0.32 * (1 + math.exp(-lali / 1000))
133 |             return pide - (480 * (lali ** exp))
134 | 
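
The H-value computed by `_calc_hssp` above can be checked in isolation. The following is a minimal standalone sketch of the same piecewise formula; the function name `hssp_distance` and the sample inputs are illustrative assumptions and are not part of the repository.

```python
import math


def hssp_distance(pide_fraction, lali):
    """Distance (in percentage points) of an alignment from the HSSP curve.

    pide_fraction: sequence identity as a fraction in [0, 1]
    lali: alignment length without gaps
    """
    pide = pide_fraction * 100  # fraction -> percent
    if lali <= 11:
        # very short alignments: only 100% identity reaches the curve
        return pide - 100
    if lali > 450:
        # long alignments: the curve saturates at 19.5% identity
        return pide - 19.5
    # intermediate lengths: length-dependent threshold
    exponent = -0.32 * (1 + math.exp(-lali / 1000))
    return pide - 480 * lali ** exponent


if __name__ == "__main__":
    # hypothetical (identity, length) pairs, chosen to hit all three branches
    for pide_fraction, lali in [(1.0, 10), (0.4, 100), (0.3, 500)]:
        print(lali, round(hssp_distance(pide_fraction, lali), 2))
```

A positive value means the alignment lies above the curve, which is what the `h_val >= float(hval)` filter in the hit parsing relies on.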


--------------------------------------------------------------------------------
/human_proteome/hbi_predictions.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/human_proteome/hbi_predictions.tar.gz


--------------------------------------------------------------------------------
/human_proteome/ml_predictions.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/human_proteome/ml_predictions.tar.gz


--------------------------------------------------------------------------------
/ml_predictor.py:
--------------------------------------------------------------------------------
 1 | from data_preparation import MyDataset, ProteinResults
 2 | from config import GeneralInformation
 3 | 
 4 | import torch
 5 | 
 6 | 
 7 | class MLPredictor(object):
 8 | 
 9 |     def __init__(self, model):
10 |         if torch.cuda.is_available():
11 |             self.device = 'cuda:0'
12 |         else:
13 |             self.device = 'cpu'
14 | 
15 |         self.model = model.to(self.device)
16 | 
17 |     def predict_per_protein(self, ids, sequences, embeddings, labels, max_length):
18 |         """
19 |         Generate predictions for each protein from a list
20 |         :param ids:
21 |         :param sequences:
22 |         :param embeddings:
23 |         :param labels:
24 |         :param max_length:
25 |         :return: dict mapping each protein ID to a ProteinResults object holding its predictions
26 |         """
27 | 
28 |         validation_set = MyDataset(ids, embeddings, sequences, labels, max_length, protein_prediction=True)
29 |         validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=1, shuffle=True, pin_memory=True)
30 |         sigm = torch.nn.Sigmoid()
31 | 
32 |         proteins = dict()
33 |         for features, target, mask, prot_id in validation_loader:
34 |             prot_id = prot_id[0]
35 | 
36 |             self.model.eval()
37 |             with torch.no_grad():
38 |                 features = features.to(self.device)
39 |                 target = target.to(self.device)
40 | 
41 |                 features_1024 = features[..., :-1, :]
42 |                 pred = self.model.forward(features_1024)
43 |                 pred = sigm(pred)
44 | 
45 |                 pred = pred.squeeze()
46 |                 target = target.squeeze()
47 |                 features = features.squeeze()
48 | 
49 |                 pred_i, target_i = GeneralInformation.remove_padded_positions(pred, target, features)
50 |                 pred_i = pred_i.detach().cpu()
51 | 
52 |                 prot = ProteinResults(prot_id)
53 |                 prot.set_predictions(pred_i)
54 |                 proteins[prot_id] = prot
55 | 
56 |         return proteins
57 | 


--------------------------------------------------------------------------------
/ml_trainer.py:
--------------------------------------------------------------------------------
  1 | from sklearn.model_selection import PredefinedSplit
  2 | from collections import defaultdict
  3 | import itertools as it
  4 | 
  5 | import torch
  6 | 
  7 | from architectures import CNN2Layers
  8 | from data_preparation import MyDataset
  9 | from assess_performance import ModelPerformance, PerformanceEpochs
 10 | from config import FileManager, GeneralInformation
 11 | from pytorchtools import EarlyStopping
 12 | 
 13 | 
 14 | class MLTrainer(object):
 15 | 
 16 |     def __init__(self, pos_weights, batch_size=406):
 17 |         self.batch_size = batch_size
 18 |         if torch.cuda.is_available():
 19 |             self.device = 'cuda:0'
 20 |         else:
 21 |             self.device = 'cpu'
 22 | 
 23 |         self.pos_weights = torch.tensor(pos_weights).to(self.device)
 24 | 
 25 |     def train_validate(self, params, train_ids, validation_ids, sequences, embeddings, labels, max_length,
 26 |                        verbose=True):
 27 |         """
 28 |         Train & validate predictor for one set of parameters and ids
 29 |         :param params:
 30 |         :param train_ids:
 31 |         :param validation_ids:
 32 |         :param sequences:
 33 |         :param embeddings:
 34 |         :param labels:
 35 |         :param max_length:
 36 |         :param verbose:
 37 |         :return:
 38 |         """
 39 | 
 40 |         model, train_performance, val_performance = self._train_validate(params, train_ids, validation_ids, sequences,
 41 |                                                                          embeddings, labels, max_length,
 42 |                                                                          verbose=verbose)
 43 | 
 44 |         train_loss, train_acc, train_prec, train_recall, train_f1, train_mcc = \
 45 |             train_performance.get_performance_last_epoch()
 46 |         val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc = val_performance.get_performance_last_epoch()
 47 | 
 48 |         print("Train loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(train_loss,
 49 |                                                                                                  train_prec,
 50 |                                                                                                  train_recall,
 51 |                                                                                                  train_f1,
 52 |                                                                                                  train_mcc))
 53 |         print("Val loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(val_loss,
 54 |                                                                                                val_prec,
 55 |                                                                                                val_recall, val_f1,
 56 |                                                                                                val_mcc))
 57 | 
 58 |         return model
 59 | 
 60 |     def cross_validate(self, params, ids, fold_array, sequences, embeddings, labels, max_length, result_file):
 61 |         """
 62 |         Perform cross-validation to optimize hyperparameters
 63 |         :param params:
 64 |         :param ids:
 65 |         :param fold_array:
 66 |         :param sequences:
 67 |         :param embeddings:
 68 |         :param labels:
 69 |         :param max_length:
 70 |         :param result_file:
 71 |         :return:
 72 |         """
 73 |         ps = PredefinedSplit(fold_array)
 74 | 
 75 |         # create parameter grid
 76 |         param_sets = defaultdict(dict)
 77 |         sorted_keys = sorted(params.keys())
 78 |         param_combos = it.product(*(params[s] for s in sorted_keys))
 79 |         counter = 0
 80 |         for p in list(param_combos):
 81 |             curr_params = list(p)
 82 |             param_dict = dict(zip(sorted_keys, curr_params))
 83 |             param_sets[counter] = param_dict
 84 | 
 85 |             counter += 1
 86 | 
 87 |         best_score = 0
 88 |         best_params = dict()  # save best parameter set
 89 |         best_classifier = None  # save best classifier
 90 |         performance = defaultdict(list)  # save performance for each parameter combination
 91 | 
 92 |         params_counter = 1
 93 | 
 94 |         for p in param_sets.keys():
 95 |             curr_params = param_sets[p]
 96 |             print('{}\t{}'.format(params_counter, curr_params))
 97 | 
 98 |             model = None
 99 | 
100 |             train_model_performance = ModelPerformance()
101 |             val_model_performance = ModelPerformance()
102 | 
103 |             for train_index, test_index in ps.split():
104 |                 train_ids, validation_ids = ids[train_index], ids[test_index]
105 | 
106 |                 model, train_performance, val_performance = self._train_validate(curr_params, train_ids, validation_ids,
107 |                                                                                  sequences, embeddings, labels,
108 |                                                                                  max_length)
109 | 
110 |                 train_loss, train_acc, train_prec, train_recall, train_f1, train_mcc = \
111 |                     train_performance.get_performance_last_epoch()
112 |                 val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc = val_performance.get_performance_last_epoch()
113 | 
114 |                 train_model_performance.add_single_performance(train_loss, train_acc, train_prec, train_recall,
115 |                                                                train_f1, train_mcc)
116 | 
117 |                 val_model_performance.add_single_performance(val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc)
118 | 
119 |             # take average over all splits
120 |             train_loss, train_acc, train_prec, train_recall, train_f1, train_mcc = \
121 |                 train_model_performance.get_mean_performance()
122 |             val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc = val_model_performance.get_mean_performance()
123 | 
124 |             performance['train_precision'].append(train_prec)
125 |             performance['train_recall'].append(train_recall)
126 |             performance['train_f1'].append(train_f1)
127 |             performance['train_mcc'].append(train_mcc)
128 |             performance['train_acc'].append(train_acc)
129 |             performance['train_loss'].append(train_loss)
130 | 
131 |             performance['val_precision'].append(val_prec)
132 |             performance['val_recall'].append(val_recall)
133 |             performance['val_f1'].append(val_f1)
134 |             performance['val_mcc'].append(val_mcc)
135 |             performance['val_acc'].append(val_acc)
136 |             performance['val_loss'].append(val_loss)
137 | 
138 |             for param in curr_params.keys():
139 |                 performance[param].append(curr_params[param])
140 | 
141 |             if val_f1 > best_score:
142 |                 best_score = val_f1
143 |                 best_params = curr_params
144 |                 best_classifier = model
145 | 
146 |             params_counter += 1
147 | 
148 |         FileManager.save_cv_results(performance, result_file)
149 | 
150 |         print(best_score)
151 |         print(best_params)
152 | 
153 |         return best_classifier
154 | 
155 |     def _train_validate(self, params, train_ids, validation_ids, sequences, embeddings, labels, max_length,
156 |                         verbose=True):
157 |         """
158 |         Train and validate bindEmbed21DL model
159 |         :param params:
160 |         :param train_ids:
161 |         :param validation_ids:
162 |         :param sequences:
163 |         :param labels:
164 |         :param max_length:
165 |         :param verbose:
166 |         :return:
167 |         """
168 | 
169 |         # define data sets
170 |         train_set = MyDataset(train_ids, embeddings, sequences, labels, max_length)
171 |         validation_set = MyDataset(validation_ids, embeddings, sequences, labels, max_length)
172 | 
173 |         train_loader = torch.utils.data.DataLoader(train_set, batch_size=self.batch_size, shuffle=True, pin_memory=True,
174 |                                                    worker_init_fn=GeneralInformation.seed_worker)
175 |         validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=self.batch_size, shuffle=True,
176 |                                                         pin_memory=True, worker_init_fn=GeneralInformation.seed_worker)
177 | 
178 |         pos_weights = self.pos_weights.expand(max_length, 3)
179 |         pos_weights = pos_weights.t()
180 | 
181 |         loss_fun = torch.nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weights)
182 |         sigm = torch.nn.Sigmoid()
183 | 
184 |         padding = int((params['kernel'] - 1) / 2)
185 |         model = CNN2Layers(train_set.get_input_dimensions(), params['features'], params['kernel'], params['stride'],
186 |                            padding, params['dropout'])
187 |         model.to(self.device)
188 | 
189 |         optim_args = {'lr': params['lr'], 'betas': params['betas'], 'eps': params['eps'],
190 |                       'weight_decay': params['weight_decay']}
191 |         optimizer = torch.optim.Adamax(model.parameters(), **optim_args)
192 | 
193 |         checkpoint_file = 'checkpoint_early_stopping.pt'
194 |         early_stopping = EarlyStopping(patience=10, delta=0.01, checkpoint_file=checkpoint_file)
195 | 
196 |         train_performance = PerformanceEpochs()
197 |         validation_performance = PerformanceEpochs()
198 | 
199 |         num_epochs = 0
200 | 
201 |         for epoch in range(params['epochs']):
202 |             if verbose:
203 |                 print("Epoch {}".format(epoch))
204 | 
205 |             train_loss = val_loss = 0
206 |             train_loss_count = val_loss_count = 0
207 |             train_tp = train_tn = train_fn = train_fp = 0
208 |             val_tp = val_tn = val_fn = val_fp = 0
209 | 
210 |             train_acc = train_prec = train_rec = train_f1 = train_mcc = 0
211 |             val_acc = val_prec = val_rec = val_f1 = val_mcc = 0
212 | 
213 |             # training
214 |             model.train()
215 |             for in_feature, target, loss_mask in train_loader:
216 |                 optimizer.zero_grad()
217 |                 in_feature = in_feature.to(self.device)
218 |                 in_feature_1024 = in_feature[:, :-1, :]
219 | 
220 |                 target = target.to(self.device)
221 |                 loss_mask = loss_mask.to(self.device)
222 |                 pred = model.forward(in_feature_1024)
223 | 
224 |                 # don't consider padded positions for loss calculation
225 |                 loss_el = loss_fun(pred, target)
226 |                 loss_el_masked = loss_el * loss_mask
227 | 
228 |                 loss_norm = torch.sum(loss_el_masked)
229 | 
230 |                 train_loss += loss_norm.item()
231 |                 train_loss_count += in_feature.shape[0]
232 | 
233 |                 for idx, i in enumerate(in_feature):  # remove padded positions to calculate tp, fp, tn, fn
234 |                     pred_i, target_i = GeneralInformation.remove_padded_positions(pred[idx], target[idx], i)
235 | 
236 |                     pred_i = sigm(pred_i)
237 |                     tp, fp, tn, fn, acc, prec, rec, f1, mcc = \
238 |                         train_performance.get_performance_batch(pred_i.detach().cpu(), target_i.detach().cpu())
239 |                     train_tp += tp
240 |                     train_fp += fp
241 |                     train_tn += tn
242 |                     train_fn += fn
243 | 
244 |                     train_acc += acc
245 |                     train_prec += prec
246 |                     train_rec += rec
247 |                     train_f1 += f1
248 |                     train_mcc += mcc
249 | 
250 |                 loss_norm.backward()
251 |                 optimizer.step()
252 | 
253 |             # validation
254 |             model.eval()
255 |             with torch.no_grad():
256 |                 for in_feature, target, loss_mask in validation_loader:
257 |                     in_feature = in_feature.to(self.device)
258 |                     in_feature_1024 = in_feature[:, :-1, :]
259 | 
260 |                     target = target.to(self.device)
261 |                     loss_mask = loss_mask.to(self.device)
262 | 
263 |                     pred = model.forward(in_feature_1024)
264 | 
265 |                     # don't consider padded position for loss calculation
266 |                     loss_el = loss_fun(pred, target)
267 |                     loss_el_masked = loss_el * loss_mask
268 |                     val_loss += torch.sum(loss_el_masked).item()
269 |                     val_loss_count += in_feature.shape[0]
270 | 
271 |                     for idx, i in enumerate(in_feature):  # remove padded positions to calculate tp, fp, tn, fn
272 |                         pred_i, target_i = GeneralInformation.remove_padded_positions(pred[idx], target[idx], i)
273 | 
274 |                         pred_i = sigm(pred_i)
275 | 
276 |                         tp, fp, tn, fn, acc, prec, rec, f1, mcc = \
277 |                             validation_performance.get_performance_batch(pred_i.detach().cpu(), target_i.detach().cpu())
278 |                         val_tp += tp
279 |                         val_fp += fp
280 |                         val_tn += tn
281 |                         val_fn += fn
282 | 
283 |                         val_acc += acc
284 |                         val_prec += prec
285 |                         val_rec += rec
286 |                         val_f1 += f1
287 |                         val_mcc += mcc
288 | 
289 |             train_loss = train_loss / (train_loss_count * 3)
290 |             val_loss = val_loss / (val_loss_count * 3)
291 | 
292 |             train_acc = train_acc / train_loss_count
293 |             train_prec = train_prec / train_loss_count
294 |             train_rec = train_rec / train_loss_count
295 |             train_f1 = train_f1 / train_loss_count
296 |             train_mcc = train_mcc / train_loss_count
297 | 
298 |             val_acc = val_acc / val_loss_count
299 |             val_prec = val_prec / val_loss_count
300 |             val_rec = val_rec / val_loss_count
301 |             val_f1 = val_f1 / val_loss_count
302 |             val_mcc = val_mcc / val_loss_count
303 | 
304 |             if verbose:
305 |                 print("Train loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(train_loss,
306 |                                                                                                          train_prec,
307 |                                                                                                          train_rec,
308 |                                                                                                          train_f1,
309 |                                                                                                          train_mcc))
310 |                 print('TP: {}, FP: {}, TN: {}, FN: {}'.format(train_tp, train_fp, train_tn, train_fn))
311 |                 print("Val loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(val_loss,
312 |                                                                                                        val_prec,
313 |                                                                                                        val_rec, val_f1,
314 |                                                                                                        val_mcc))
315 |                 print('TP: {}, FP: {}, TN: {}, FN: {}'.format(val_tp, val_fp, val_tn, val_fn))
316 | 
317 |             # append average performance for this epoch
318 |             train_performance.add_performance_epoch(train_loss, train_mcc, train_prec, train_rec, train_f1, train_acc)
319 |             validation_performance.add_performance_epoch(val_loss, val_mcc, val_prec, val_rec, val_f1, val_acc)
320 | 
321 |             num_epochs += 1
322 | 
323 |             # stop training if F1 score doesn't improve anymore
324 |             if 'early_stopping' in params.keys() and params['early_stopping']:
325 |                 eval_val = val_f1 * (-1)
326 |                 # eval_val = val_loss
327 |                 early_stopping(eval_val, model, verbose)
328 |                 if early_stopping.early_stop:
329 |                     break
330 | 
331 |         if 'early_stopping' in params.keys() and params['early_stopping']:   # load best model
332 |             model = torch.load(checkpoint_file)
333 | 
334 |         return model, train_performance, validation_performance
335 | 
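
The parameter grid built at the start of `cross_validate` is the Cartesian product of all value lists, keyed by sorted parameter name. A minimal sketch of that expansion is shown below; `expand_param_grid` and the example values are hypothetical and only illustrate the ordering.

```python
import itertools as it


def expand_param_grid(params):
    """Expand {name: [values, ...]} into a list of {name: value} dicts.

    Keys are sorted first, so the order of combinations is deterministic.
    """
    sorted_keys = sorted(params.keys())
    combos = it.product(*(params[k] for k in sorted_keys))
    return [dict(zip(sorted_keys, combo)) for combo in combos]


if __name__ == "__main__":
    # hypothetical search space with two hyperparameters
    grid = expand_param_grid({'lr': [0.01, 0.001], 'kernel': [3, 5]})
    for idx, param_set in enumerate(grid, start=1):
        print('{}\t{}'.format(idx, param_set))
```

Because `itertools.product` varies the last key fastest, the grid size grows multiplicatively with each added hyperparameter, and every combination is trained once per split.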


--------------------------------------------------------------------------------
/mmseqs_wrapper.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | # -*- coding: utf-8 -*-
 3 | """
 4 | Created on Tue Jun  8 11:34:00 2021
 5 | 
 6 | @author: mheinzinger
 7 | """
 8 | 
 9 | import subprocess
10 | import os
11 | 
12 | """
13 | Overview of pipeline steps:
14 |     1. Create DBs for queries and lookup
15 |         mmseqs createdb test_set.fasta test_set
16 |         mmseqs createdb lookup_set.fasta lookup_set
18 |         mmseqs search test_set lookup_set alnResults tmp --num-iterations 3 -a
19 |     3. Compute profiles from hits
20 |         mmseqs result2profile test_set lookup_set alnResults test_set_profile 
21 |     4. Search test set profiles against lookup sequences to detect remote hits
22 |         mmseqs search test_set_profile lookup_set alnResults tmp --min-seq-id 0.2 -s 7.5 --max-seqs 10000
23 | """
24 | 
25 | 
26 | class MMSeqs2Wrapper(object):
27 |     
28 |     def __init__(self, mmseqs_path, work_dir, query_set):
29 |         self.mmseqs = mmseqs_path
30 |         self.work_dir = work_dir  # Directory for storing final results
31 |         self.tmp_dir = '{}/tmp'.format(self.work_dir)  # Directory for storing intermediate results
32 |         self.query_set = query_set
33 |         
34 |     def create_db(self, fasta_p, set_name):
35 |         db_res = '{}/{}'.format(self.tmp_dir, set_name)
36 |         if not os.path.exists(db_res):
37 |             subprocess.call([str(self.mmseqs), "createdb", fasta_p, db_res])
38 |         return db_res
39 |     
40 |     def seq2seq_search(self, query_db, lookup_db, num_iterations=3):
41 |         aln_res = '{}/seq2seq_alnRes_{}'.format(self.tmp_dir, self.query_set)
42 | 
43 |         subprocess.call([str(self.mmseqs), "search", query_db, lookup_db, aln_res, str(self.tmp_dir),
44 |                         "--num-iterations", str(num_iterations), "-a"])
45 |         return aln_res
46 |     
47 |     def alnres2prof(self, query_db, lookup_db, aln_res):
48 |         query_profiles = '{}/test_set_profiles_{}'.format(self.tmp_dir, self.query_set)
49 | 
50 |         subprocess.call([str(self.mmseqs), "result2profile", query_db, lookup_db, aln_res, query_profiles])
51 |         return query_profiles
52 |     
53 |     def prof2seq_search(self, query_profiles, lookup_db, min_seq_id, e_val, sensitivity=7.5, max_seqs=100000):
54 |         aln_res = '{}/prof2seq_alnRes_{}'.format(self.tmp_dir, self.query_set)
55 | 
56 |         subprocess.call([str(self.mmseqs), "search", query_profiles, lookup_db, aln_res, str(self.tmp_dir),
57 |                          "--min-seq-id", str(min_seq_id), "-s", str(sensitivity), "--max-seqs", str(max_seqs),
58 |                          "-e", str(e_val), "-a"])
59 |         return aln_res
60 | 
61 |     def convertali(self, query_db, lookup_db, profile_hits_raw, output_file):
62 |         subprocess.call([str(self.mmseqs), "convertalis", query_db, lookup_db, profile_hits_raw, output_file,
63 |                          "--format-output",  "query,target,evalue,nident,mismatch,qstart,tstart,qaln,taln"])
64 |         return output_file
65 | 
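
`subprocess.call` expects every element of the argument list to be a string, which is why the wrapper converts numeric options with `str(...)`. The sketch below assembles the profile-vs-sequence search command without executing it, so the argument list can be inspected or logged first. The helper name `build_prof2seq_command` and the paths are assumptions for illustration; the flags are the ones already used in `prof2seq_search`.

```python
def build_prof2seq_command(mmseqs, query_profiles, lookup_db, aln_res, tmp_dir,
                           min_seq_id, e_val, sensitivity=7.5, max_seqs=100000):
    """Return the argv list for the profile search, with every element as str."""
    args = [mmseqs, "search", query_profiles, lookup_db, aln_res, tmp_dir,
            "--min-seq-id", min_seq_id, "-s", sensitivity, "--max-seqs", max_seqs,
            "-e", e_val, "-a"]
    # convert numbers (and Path objects) so subprocess accepts the list
    return [str(a) for a in args]


if __name__ == "__main__":
    # hypothetical paths; nothing is executed here
    cmd = build_prof2seq_command("mmseqs", "tmp/profiles", "tmp/lookup",
                                 "tmp/aln", "tmp", 0, 1e-3)
    print(" ".join(cmd))
```

Converting in one place means the E-value can be passed either as a string such as `'1e-3'` or as a float.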


--------------------------------------------------------------------------------
/pytorchtools.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import torch
 3 | 
 4 | 
 5 | class EarlyStopping:
 6 |     """Early stops the training if validation loss doesn't improve after a given patience."""
 7 |     def __init__(self, patience=7, verbose=False, delta=0, checkpoint_file='checkpoint.pt'):
 8 |         """
 9 |         Args:
10 |             patience (int): How long to wait after last time validation loss improved.
11 |                             Default: 7
12 |             verbose (bool): If True, prints a message for each validation loss improvement.
13 |                             Default: False
14 |             delta (float): Minimum change in the monitored quantity to qualify as an improvement.
15 |                             Default: 0
16 |         """
17 |         self.patience = patience
18 |         self.verbose = verbose
19 |         self.counter = 0
20 |         self.best_score = None
21 |         self.early_stop = False
22 |         self.val_loss_min = np.inf
23 |         self.delta = delta
24 | 
25 |         self.checkpoint_file = checkpoint_file
26 | 
27 |     def __call__(self, val_loss, model, verbose=True):
28 | 
29 |         score = -val_loss
30 | 
31 |         if self.best_score is None:
32 |             self.best_score = score
33 |             self.save_checkpoint(val_loss, model)
34 |         elif score < self.best_score + self.delta:
35 |             self.counter += 1
36 |             if verbose:
37 |                 print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
38 |             if self.counter >= self.patience:
39 |                 self.early_stop = True
40 |         else:
41 |             self.best_score = score
42 |             self.save_checkpoint(val_loss, model)
43 |             self.counter = 0
44 | 
45 |     def save_checkpoint(self, val_loss, model):
46 |         """Saves model when validation loss decreases."""
47 | 
48 |         if self.verbose:
49 |             print(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}).  Saving model ...')
50 |         torch.save(model, self.checkpoint_file)
51 |         self.val_loss_min = val_loss
52 | 
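
The patience logic of `EarlyStopping` can be traced without PyTorch. The sketch below reproduces only the counter and threshold behaviour; checkpoint saving is left out. `PatienceCounter`, `epochs_until_stop`, and the F1 values are hypothetical. As in `ml_trainer.py`, the monitored quantity is the negated F1 score, so a higher F1 counts as an improvement.

```python
class PatienceCounter:
    """Torch-free stand-in for the patience logic (no checkpointing)."""

    def __init__(self, patience=7, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best_score = None
        self.counter = 0
        self.early_stop = False

    def update(self, monitored_value):
        score = -monitored_value  # lower monitored value = higher score
        if self.best_score is None:
            self.best_score = score
        elif score < self.best_score + self.delta:
            # not better than the best score by at least delta
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.counter = 0
        return self.early_stop


def epochs_until_stop(values, patience, delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    counter = PatienceCounter(patience, delta)
    for epoch, value in enumerate(values, start=1):
        if counter.update(value):
            return epoch
    return None


if __name__ == "__main__":
    f1_per_epoch = [0.30, 0.35, 0.355, 0.352, 0.358]  # made-up values
    print(epochs_until_stop([-f1 for f1 in f1_per_epoch], patience=3, delta=0.01))
```

With `delta=0.01`, small gains such as 0.35 to 0.355 do not reset the counter, so training stops once `patience` consecutive epochs fail to beat the best score by at least `delta`.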


--------------------------------------------------------------------------------
/run_bindEmbed21.py:
--------------------------------------------------------------------------------
 1 | import numpy
 2 | from pathlib import Path
 3 | 
 4 | from config import FileManager, FileSetter
 5 | from data_preparation import ProteinInformation, ProteinResults
 6 | 
 7 | from annotation_transfer import Inference
 8 | from bindEmbed21DL import BindEmbed21DL
 9 | 
10 | 
11 | def main():
12 | 
13 |     set_name = 'example'  # TODO set a meaningful set name used for output files
14 |     query_fasta = FileSetter.query_set()
15 |     query_sequences = FileManager.read_fasta(query_fasta)
16 |     query_ids = list(query_sequences.keys())
17 | 
18 |     print('Run predictions using homology-based inference (bindEmbed21HBI)')
19 |     result_folder = FileSetter.mmseqs_output()
20 |     Path(result_folder).mkdir(parents=True, exist_ok=True)
21 | 
22 |     e_val = '1e-3'  # E-value used for bindEmbed21HBI, can be changed
23 |     proteins_inferred = dict()
24 | 
25 |     # inference set with annotations
26 |     hbi_sequences = FileManager.read_fasta(FileSetter.lookup_fasta())
27 |     labels_hbi = ProteinInformation.get_labels(list(hbi_sequences.keys()), hbi_sequences)
28 | 
29 |     inference = Inference(query_ids, labels_hbi, query_sequences)
30 |     # run HBI and don't allow self-hits
31 |     inferred_labels, inferred_proteins = inference.infer_binding_annotations_seq(e_val, 'eval', set_name)
32 | 
33 |     missing_proteins = []
34 |     for p in query_ids:
35 |         if p in inferred_proteins:
36 |             prot = ProteinResults(p, 3)
37 |             prot.set_predictions(numpy.transpose(inferred_labels[p]))
38 |             proteins_inferred[p] = prot
39 |         else:
40 |             missing_proteins.append(p)
41 | 
42 |     print('Number of proteins with hit: {}'.format(len(proteins_inferred.keys())))
43 | 
44 |     print('Run predictions using Machine Learning for remaining proteins (bindEmbed21DL)')
45 |     model_prefix = 'trained_models/checkpoint'
46 |     ri = True  # Whether to write RI or Probabilities
47 | 
48 |     predicted_proteins = BindEmbed21DL.prediction_pipeline(model_prefix, 0.5, None, missing_proteins, query_fasta,
49 |                                                            ri)
50 | 
51 |     print("Write predictions")
52 |     proteins = {**proteins_inferred, **predicted_proteins}
53 |     prediction_folder = '{}/predictions'.format(result_folder)
54 |     Path(prediction_folder).mkdir(parents=True, exist_ok=True)
55 |     FileManager.write_predictions(proteins, prediction_folder, 0.5, ri)
56 | 
57 | 
58 | main()
59 | 


--------------------------------------------------------------------------------
/run_bindEmbed21DL.py:
--------------------------------------------------------------------------------
 1 | from config import FileSetter, FileManager
 2 | from bindEmbed21DL import BindEmbed21DL
 3 | 
 4 | from pathlib import Path
 5 | 
 6 | 
 7 | def main():
 8 | 
 9 |     prediction_folder = FileSetter.predictions_folder()
10 |     Path(prediction_folder).mkdir(parents=True, exist_ok=True)
11 | 
12 |     model_prefix = 'trained_models/checkpoint'
13 | 
14 |     query_fasta = FileSetter.query_set()
15 |     query_sequences = FileManager.read_fasta(query_fasta)
16 |     query_ids = list(query_sequences.keys())
17 | 
18 |     ri = True  # Whether to write RI or Probabilities
19 | 
20 |     BindEmbed21DL.prediction_pipeline(model_prefix, 0.5, prediction_folder, query_ids, query_fasta, ri)
21 | 
22 | 
23 | main()
24 | 


--------------------------------------------------------------------------------
/run_bindEmbed21HBI.py:
--------------------------------------------------------------------------------
 1 | import numpy
 2 | from pathlib import Path
 3 | 
 4 | from config import FileManager, FileSetter
 5 | from data_preparation import ProteinInformation, ProteinResults
 6 | 
 7 | from annotation_transfer import Inference
 8 | 
 9 | 
10 | def main():
11 | 
12 |     set_name = 'example'  # TODO set a meaningful set name used for output files
13 |     query_fasta = FileSetter.query_set()
14 |     query_sequences = FileManager.read_fasta(query_fasta)
15 |     query_ids = list(query_sequences.keys())
16 | 
17 |     result_folder = FileSetter.mmseqs_output()
18 |     Path(result_folder).mkdir(parents=True, exist_ok=True)
19 | 
20 |     e_val = '1e-3'  # E-value used for bindEmbed21HBI, can be changed
21 |     proteins_inferred = dict()
22 | 
23 |     # inference set with annotations
24 |     hbi_sequences = FileManager.read_fasta(FileSetter.lookup_fasta())
25 |     labels_hbi = ProteinInformation.get_labels(list(hbi_sequences.keys()), hbi_sequences)
26 | 
27 |     inference = Inference(query_ids, labels_hbi, query_sequences)
28 |     # run HBI and don't allow self-hits
29 |     inferred_labels, inferred_proteins = inference.infer_binding_annotations_seq(e_val, 'eval', set_name)
30 | 
31 |     for p in inferred_proteins:
32 |         prot = ProteinResults(p, 3)
33 |         prot.set_predictions(numpy.transpose(inferred_labels[p]))
34 |         proteins_inferred[p] = prot
35 | 
36 |     print('Number of proteins with hit: {}'.format(len(proteins_inferred)))
37 | 
38 |     # write predictions
39 |     predictions_folder = '{}/hbi_predictions/'.format(result_folder)
40 |     Path(predictions_folder).mkdir(parents=True, exist_ok=True)
41 |     FileManager.write_predictions(proteins_inferred, predictions_folder, 0.5, True)
42 | 
43 | 
44 | if __name__ == '__main__':
45 |     main()
46 | 


--------------------------------------------------------------------------------
/trained_models/checkpoint1.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint1.pt


--------------------------------------------------------------------------------
/trained_models/checkpoint2.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint2.pt


--------------------------------------------------------------------------------
/trained_models/checkpoint3.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint3.pt


--------------------------------------------------------------------------------
/trained_models/checkpoint4.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint4.pt


--------------------------------------------------------------------------------
/trained_models/checkpoint5.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint5.pt


--------------------------------------------------------------------------------
/utils/convert_embeddings.py:
--------------------------------------------------------------------------------
 1 | """
 2 | Script to convert embeddings generated with the bio_embeddings pipeline into the expected format (datasets keyed by original ID)
 3 | """
 4 | 
 5 | import h5py
 6 | import numpy as np
 7 | import sys
 8 | 
 9 | 
10 | def main():
11 |     embeddings_in = sys.argv[1]
12 |     embeddings_out = sys.argv[2]
13 | 
14 |     with h5py.File(embeddings_in, 'r') as f_in:
15 |         with h5py.File(embeddings_out, 'w') as f_out:
16 |             for embedding in f_in.values():  # dataset keys are not needed, only the 'original_id' attribute
17 |                 original_id = embedding.attrs['original_id']
18 |                 embedding = np.array(embedding)
19 | 
20 |                 f_out.create_dataset(original_id, data=embedding)
21 | 
22 | 
23 | if __name__ == '__main__':
24 |     main()
25 | 


--------------------------------------------------------------------------------