├── .gitignore ├── LICENSE ├── README.md ├── annotation_transfer.py ├── architectures.py ├── assess_performance.py ├── bindEmbed21DL.py ├── bindPredictML17 ├── README.md ├── bindPredict.py ├── data │ └── blosum_62.txt ├── scripts │ ├── bindpredict_predictor.py │ ├── file_manager.py │ ├── helper.py │ ├── parser.py │ ├── protein.py │ └── select_model.py └── trained_model │ └── mm3_cons_solv_more.pkl ├── config.py ├── data ├── development_set │ ├── all.fasta │ ├── binding_residues_2.5_metal.txt │ ├── binding_residues_2.5_nuclear.txt │ ├── binding_residues_2.5_small.txt │ ├── ids_split1.txt │ ├── ids_split2.txt │ ├── ids_split3.txt │ ├── ids_split4.txt │ ├── ids_split5.txt │ └── uniprot_test.txt └── independent_set │ ├── binding_residues_metal.txt │ ├── binding_residues_nuclear.txt │ ├── binding_residues_small.txt │ ├── indep_set.fasta │ └── indep_set.txt ├── data_preparation.py ├── develop_bindEmbed21DL.py ├── homology_based_inference.py ├── human_proteome ├── hbi_predictions.tar.gz └── ml_predictions.tar.gz ├── ml_predictor.py ├── ml_trainer.py ├── mmseqs_wrapper.py ├── pytorchtools.py ├── run_bindEmbed21.py ├── run_bindEmbed21DL.py ├── run_bindEmbed21HBI.py ├── trained_models ├── checkpoint1.pt ├── checkpoint2.pt ├── checkpoint3.pt ├── checkpoint4.pt └── checkpoint5.pt └── utils └── convert_embeddings.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | .idea 92 | .DS_Store -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Rostlab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # bindEmbed21 2 | 3 | bindEmbed21 is a method to predict whether a residue in a protein binds to metal ions, nucleic acids (DNA or RNA), or small molecules. To this end, bindEmbed21 combines homology-based inference and Machine Learning. Homology-based inference is executed using MMseqs2 [1]. For the Machine Learning method, bindEmbed21DL uses ProtT5 embeddings [2] as input to a 2-layer CNN. Since bindEmbed21 is based on single sequences, it can easily be applied to any protein sequence. 4 | 5 | ## Usage 6 | 7 | `run_bindEmbed21DL.py` shows an example of how to generate binding residue predictions using the Machine Learning part of bindEmbed21 (bindEmbed21DL). 8 | 9 | `run_bindEmbed21HBI.py` shows an example of how to generate binding residue predictions using the homology-based inference part of bindEmbed21 (bindEmbed21HBI). 10 | 11 | `run_bindEmbed21.py` combines ML and HBI into the final method bindEmbed21. 12 | 13 | `develop_bindEmbed21DL.py` provides the code to reproduce the bindEmbed21DL development (hyperparameter optimization, training, performance assessment on the test set). 14 | 15 | All needed files and paths can be set in `config.py` (marked as TODOs). 16 | 17 | ## Data 18 | 19 | ### Development Set 20 | 21 | The data set used for training and testing was extracted from BioLip [3].
The UniProt identifiers for the 5 splits used during cross-validation (DevSet1014), the test set (TestSet300), and the independent set of proteins added to BioLip after November 2019 (TestSetNew46) as well as the corresponding FASTA sequences and used binding annotations are made available in the `data` folder. 22 | 23 | The trained models are available in the `trained_models` folder. 24 | 25 | ProtT5 embeddings can be generated using [the bio_embeddings pipeline](https://github.com/sacdallago/bio_embeddings) [4]. To use them with `bindEmbed21`, they need to be converted to use the correct keys. A script for the conversion can be found in the folder `utils`. 26 | 27 | ### Sets for homology-based inference 28 | For the homology-based inference (bindEmbed21HBI), query proteins will be aligned against big80 to generate profiles. Those profiles are then searched against a lookup set of proteins with known binding residues. The pre-computed MMseqs2 database files and the FASTA file for the lookup database can be downloaded here: 29 | 30 | * Pre-computed big80 DB: [ftp://rostlab.org/bindEmbed21/profile_db.tar.gz](ftp://rostlab.org/bindEmbed21/profile_db.tar.gz) 31 | * Pre-computed lookup DB: [ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz](ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz) 32 | * FASTA for lookup DB: [ftp://rostlab.org/bindEmbed21/lookup.fasta](ftp://rostlab.org/bindEmbed21/lookup.fasta) 33 | 34 | ### Human proteome predictions 35 | 36 | We applied bindEmbed21DL as well as homology-based inference to the entire human proteome. While annotations were only available for 15% of the human proteins, homology-based inference allowed transferring annotations for 48% (9,694) and bindEmbed21DL provided binding predictions for 92% (18,663) of the human proteome. Both predictions are available in the folder `human_proteome`. 
For predictions made using homology-based inference, values of -1.0 refer to positions that were not inferred and were therefore considered non-binding. 37 | 38 | ## Availability 39 | 40 | bindEmbed21 is also part of [the bio_embeddings pipeline](https://github.com/sacdallago/bio_embeddings) [4]. In addition, predictions of bindEmbed21DL can be run and visualized on a predicted 3D structure using [LambdaPP](https://embed.predictprotein.org/) [5]. 41 | 42 | ## Requirements 43 | 44 | bindEmbed21 is written in Python3. In order to execute bindEmbed21, Python3 has to be installed locally. Additionally, the following Python packages have to be installed: 45 | - numpy 46 | - scikit-learn 47 | - torch 48 | - pandas 49 | - h5py 50 | 51 | To run homology-based inference, MMseqs2 has to be installed locally. Without MMseqs2, it is still possible to run only the Machine Learning part of bindEmbed21 (bindEmbed21DL). 52 | 53 | ## Cite 54 | 55 | If you are using this method and find it helpful, we would appreciate it if you could cite the following publication: 56 | 57 | Littmann M, Heinzinger M, Dallago C, Weissenow K, Rost B. Protein embeddings and deep learning predict binding residues for various ligand classes. *Sci Rep* **11**, 23916 (2021). https://doi.org/10.1038/s41598-021-03431-4 58 | 59 | 60 | ## References 61 | [1] Steinegger M, Söding J (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35. 62 | 63 | [2] Elnaggar A, Heinzinger M, Dallago C, Rihawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Bhowmik D, Rost B (2021). ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. bioRxiv. 64 | 65 | [3] Yang J, Roy A, Zhang Y (2013). BioLip: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41.
66 | 67 | [4] Dallago C, Schütze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, Yang KK, Min S, Yoon S, Morton JT, & Rost B (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113 68 | 69 | [5] Olenyi T, Marquet C, Heinzinger M, Kröger B, Nikolova T, Bernhofer M, Sändig P, Schütze K, Littmann M, Mirdita M, Steinegger M, Dallago C, & Rost B (2022). LambdaPP: Fast and accessible protein-specific phenotype predictions. bioRxiv 70 | 71 | 72 | ## bindPredictML17 73 | If you are interested in the predecessor of bindEmbed21, bindPredictML17, you can find all relevant information in the subfolder `bindPredictML17`. 74 | -------------------------------------------------------------------------------- /annotation_transfer.py: -------------------------------------------------------------------------------- 1 | from homology_based_inference import MMseqsPredictor 2 | from config import FileManager 3 | 4 | import numpy 5 | import random 6 | 7 | 8 | class Inference(object): 9 | 10 | def __init__(self, prot_ids, labels, sequences, rand_pred=False): 11 | self.ids = prot_ids 12 | self.labels = labels 13 | self.sequences = sequences 14 | 15 | self.rand_pred = rand_pred 16 | self.rnd_probas = {'metal': 0.147, 'nuclear': 0.030, 'small': 0.090} 17 | 18 | def infer_binding_annotations_seq(self, e_val, criterion, set_name, exclude_self_hits=True): 19 | 20 | # get hits from MMseqs2 alignments 21 | mmseqs_pred = MMseqsPredictor(exclude_self_hits) 22 | hits, mmseqs_file = mmseqs_pred.get_mmseqs_hits(self.ids, e_val, criterion, set_name) 23 | 24 | inferred_labels, proteins_with_hit = self._infer_binding_local_alignment(hits, mmseqs_file) 25 | 26 | return inferred_labels, proteins_with_hit 27 | 28 | def _infer_binding_local_alignment(self, hits, mmseqs_file): 29 | """ 30 | Get binding annotations from local alignment 31 | :param hits: Found hits 32 | :param mmseqs_file: File with local alignments from MMseqs2 33 | 
:return: 34 | """ 35 | 36 | inferred_labels = dict() 37 | proteins_with_hit = dict() 38 | 39 | mmseqs = FileManager.read_mmseqs_alignments(mmseqs_file) 40 | 41 | for i in hits.keys(): 42 | hit = list(hits[i].keys())[0] 43 | 44 | alignment = mmseqs[i][hit] 45 | indices1, indices2 = Inference._get_index_mapping_mmseqs(alignment) 46 | 47 | binding_annotation = self.labels[hit] 48 | # set all position to -1; replace inferred positions with 0 (non-binding) or 1 (binding) 49 | inferred_annotation = numpy.zeros([len(self.sequences[i]), 3], dtype=numpy.float32) - 1 50 | 51 | for idx, pos2 in enumerate(indices2): 52 | pos1 = indices1[idx] - 1 53 | 54 | if pos1 >= 0 and pos2 >= 1: # both positions are aligned 55 | anno = binding_annotation[pos2 - 1] 56 | inferred_annotation[pos1] = anno 57 | 58 | if 1 in inferred_annotation: 59 | # any binding annotation in the aligned region 60 | inferred_labels[i] = inferred_annotation 61 | ligands = {'metal': 0, 'nuclear': 0, 'small': 0} 62 | 63 | if 1 in inferred_annotation[:, 0]: 64 | ligands['metal'] = 1 65 | if 1 in inferred_annotation[:, 1]: 66 | ligands['nuclear'] = 1 67 | if 1 in inferred_annotation[:, 2]: 68 | ligands['small'] = 1 69 | 70 | proteins_with_hit[i] = ligands 71 | 72 | for i in self.ids: 73 | if i not in inferred_labels.keys(): # no hit generated 74 | inferred_annotation = numpy.zeros([len(self.sequences[i]), 3], dtype=numpy.float32) - 1 75 | 76 | # generate random prediction for proteins without a hit 77 | if self.rand_pred: 78 | 79 | metal_indices = random.choices([0, 1], 80 | weights=[1-self.rnd_probas['metal'], self.rnd_probas['metal']], 81 | k=len(self.sequences[i])) 82 | nuclear_indices = random.choices([0, 1], 83 | weights=[1-self.rnd_probas['nuclear'], self.rnd_probas['nuclear']], 84 | k=len(self.sequences[i])) 85 | small_indices = random.choices([0, 1], 86 | weights=[1-self.rnd_probas['small'], self.rnd_probas['small']], 87 | k=len(self.sequences[i])) 88 | 89 | inferred_annotation[:, 0] = metal_indices 90 | 
inferred_annotation[:, 1] = nuclear_indices 91 | inferred_annotation[:, 2] = small_indices 92 | 93 | inferred_labels[i] = inferred_annotation 94 | 95 | return inferred_labels, proteins_with_hit 96 | 97 | @staticmethod 98 | def _get_index_mapping_mmseqs(alignment): 99 | """ 100 | Get indices of alignment position in actual sequence 101 | :param alignment: Local alignment of 2 sequences 102 | :return: Indices for 2 sequences 103 | """ 104 | 105 | start1 = alignment['qstart'] 106 | start2 = alignment['tstart'] 107 | 108 | seq1 = alignment['qaln'] 109 | seq2 = alignment['taln'] 110 | 111 | indices1 = Inference._get_indices_seq(start1, seq1) 112 | indices2 = Inference._get_indices_seq(start2, seq2) 113 | 114 | return indices1, indices2 115 | 116 | @staticmethod 117 | def _get_indices_seq(start, seq): 118 | """ 119 | Get sequence indices of aligned sequence 120 | :param start: Start index of aligned sequence 121 | :param seq: Aligned sequence (incl. gaps) 122 | :return: Sequence indices for position in aligned sequence 123 | """ 124 | 125 | indices = [] 126 | 127 | for i in range(0, len(seq)): 128 | if seq[i] == '-': # position is gap 129 | indices.append(0) 130 | else: 131 | indices.append(start) 132 | start += 1 133 | 134 | return indices 135 | -------------------------------------------------------------------------------- /architectures.py: -------------------------------------------------------------------------------- 1 | import torch.nn 2 | 3 | 4 | class CNN2Layers(torch.nn.Module): 5 | 6 | def __init__(self, in_channels, feature_channels, kernel_size, stride, padding, dropout): 7 | super(CNN2Layers, self).__init__() 8 | self.conv1 = torch.nn.Sequential( 9 | torch.nn.Conv1d(in_channels=in_channels, out_channels=feature_channels, kernel_size=kernel_size, 10 | stride=stride, padding=padding), 11 | torch.nn.ELU(), 12 | torch.nn.Dropout(dropout), 13 | 14 | torch.nn.Conv1d(in_channels=feature_channels, out_channels=3, kernel_size=kernel_size, 15 | stride=stride, 
padding=padding), 16 | ) 17 | 18 | def forward(self, x): 19 | x = self.conv1(x) 20 | return torch.squeeze(x) 21 | -------------------------------------------------------------------------------- /assess_performance.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import math 3 | import numpy as np 4 | from scipy.stats import t 5 | 6 | 7 | class ModelPerformance(object): 8 | 9 | """Wrapper to save performances values per model""" 10 | 11 | def __init__(self): 12 | 13 | self.losses = [] 14 | 15 | self.precisions = [] 16 | self.recalls = [] 17 | self.f1s = [] 18 | 19 | self.neg_precisions = [] 20 | self.neg_recalls = [] 21 | self.neg_f1s = [] 22 | 23 | self.accs = [] 24 | self.mccs = [] 25 | 26 | self.confusion_matrix = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0} 27 | self.cross_prediction = [0, 0, 0, 0] 28 | 29 | self.coverage = 0 30 | self.predictions = {'bound_predicted': 0, 'bound_not_predicted': 0, 'not_bound_predicted': 0, 31 | 'not_bound_not_predicted': 0} 32 | 33 | def add_single_performance(self, loss, acc, prec, recall, f1, mcc): 34 | """Add performance for one protein""" 35 | 36 | self.losses.append(loss) 37 | self.accs.append(acc) 38 | self.precisions.append(prec) 39 | self.recalls.append(recall) 40 | self.f1s.append(f1) 41 | self.mccs.append(mcc) 42 | 43 | def add_single_performance_negatives(self, loss, acc, prec, recall, f1, neg_prec, neg_recall, neg_f1, mcc): 44 | """Add performance for one protein (incl. 
negative performance values)""" 45 | 46 | self.add_single_performance(loss, acc, prec, recall, f1, mcc) 47 | self.neg_precisions.append(neg_prec) 48 | self.neg_recalls.append(neg_recall) 49 | self.neg_f1s.append(neg_f1) 50 | 51 | def add_confusion_matrix(self, tp, fp, tn, fn): 52 | """Add TP, FP, TN,FN to confusion matrix""" 53 | 54 | self.confusion_matrix['tp'] += tp 55 | self.confusion_matrix['fp'] += fp 56 | self.confusion_matrix['tn'] += tn 57 | self.confusion_matrix['fn'] += fn 58 | 59 | if (tp + fp) > 0: 60 | self.coverage += 1 61 | if (tp + fn) > 0: 62 | self.predictions['bound_predicted'] += 1 63 | else: 64 | self.predictions['not_bound_predicted'] += 1 65 | else: 66 | if fn > 0: 67 | self.predictions['bound_not_predicted'] += 1 68 | else: 69 | self.predictions['not_bound_not_predicted'] += 1 70 | 71 | def add_cross_prediction(self, cross_prediction): 72 | """Add cross prediction""" 73 | for i in range(0, len(cross_prediction)): 74 | self.cross_prediction[i] += cross_prediction[i] 75 | 76 | def get_mean_performance(self): 77 | """Calculate average performance values""" 78 | loss = np.average(self.losses) 79 | acc = np.average(self.accs) 80 | precision = np.average(self.precisions) 81 | recall = np.average(self.recalls) 82 | f1 = np.average(self.f1s) 83 | mcc = np.average(self.mccs) 84 | 85 | return loss, acc, precision, recall, f1, mcc 86 | 87 | def get_mean_ci_performance(self): 88 | """Calculate average performance values and 95% CIs""" 89 | acc, acc_ci = ModelPerformance._get_mean_ci(self.accs) 90 | recall, recall_ci = ModelPerformance._get_mean_ci(self.recalls) 91 | prec, prec_ci = ModelPerformance._get_mean_ci(self.precisions) 92 | f1, f1_ci = ModelPerformance._get_mean_ci(self.f1s) 93 | mcc, mcc_ci = ModelPerformance._get_mean_ci(self.mccs) 94 | 95 | return acc, prec, recall, f1, mcc, acc_ci, prec_ci, recall_ci, f1_ci, mcc_ci 96 | 97 | def get_mean_ci_performance_negatives(self): 98 | """Calculate average performance values for negative class and 95% 
CIs""" 99 | 100 | neg_recall, neg_recall_ci = ModelPerformance._get_mean_ci(self.neg_recalls) 101 | neg_prec, neg_prec_ci = ModelPerformance._get_mean_ci(self.neg_precisions) 102 | neg_f1, neg_f1_ci = ModelPerformance._get_mean_ci(self.neg_f1s) 103 | 104 | return neg_prec, neg_recall, neg_f1, neg_prec_ci, neg_recall_ci, neg_f1_ci 105 | 106 | @staticmethod 107 | def _get_mean_ci(vec): 108 | """ 109 | Calculate mean and 95% CI for a given vector 110 | :param vec: vector 111 | :return: mean and ci 112 | """ 113 | mean = round(np.average(vec), 3) 114 | if len(vec) > 1: 115 | ci = round(np.std(vec)/math.sqrt(len(vec)) * t.ppf((1 + 0.95) / 2, len(vec)), 3) 116 | else: 117 | ci = 0 118 | 119 | return mean, ci 120 | 121 | 122 | class PerformanceEpochs(object): 123 | """ 124 | Wrapper to save performance values per epoch 125 | """ 126 | 127 | def __init__(self): 128 | 129 | self.loss_epochs = [] 130 | self.mcc_epochs = [] 131 | self.prec_epochs = [] 132 | self.recall_epochs = [] 133 | self.f1_epochs = [] 134 | self.acc_epochs = [] 135 | 136 | def get_performance_last_epoch(self): 137 | """Get performance for last epoch""" 138 | loss = self.loss_epochs[-1] 139 | mcc = self.mcc_epochs[-1] 140 | prec = self.prec_epochs[-1] 141 | recall = self.recall_epochs[-1] 142 | f1 = self.f1_epochs[-1] 143 | acc = self.acc_epochs[-1] 144 | 145 | return loss, acc, prec, recall, f1, mcc 146 | 147 | def add_performance_epoch(self, loss, mcc, prec, recall, f1, acc): 148 | """Add performance for one epoch""" 149 | self.loss_epochs.append(loss) 150 | self.acc_epochs.append(acc) 151 | self.mcc_epochs.append(mcc) 152 | self.f1_epochs.append(f1) 153 | self.prec_epochs.append(prec) 154 | self.recall_epochs.append(recall) 155 | 156 | @staticmethod 157 | def get_performance_batch(pred, target): 158 | """Calculate performance for one batch""" 159 | 160 | tp, fp, tn, fn = PerformanceAssessment.evaluate_per_residue_torch(pred, target) 161 | acc, prec, rec, f1, mcc = 
PerformanceAssessment.calc_performance_measurements(tp, fp, tn, fn) 162 | 163 | return tp, fp, tn, fn, acc, prec, rec, f1, mcc 164 | 165 | 166 | class PerformanceAssessment(object): 167 | 168 | @staticmethod 169 | def evaluate_per_residue_torch(prediction, target): 170 | """Calculate tp, fp, tn, fn for tensor""" 171 | # reduce prediction & target to one dimension 172 | prediction = prediction.t() 173 | target = target.t() 174 | prediction = torch.sum(torch.ge(prediction, 0.5), 1) 175 | target = torch.sum(torch.ge(target, 0.5), 1) 176 | 177 | # get confusion matrix 178 | tp = torch.sum(torch.ge(prediction, 0.5) * torch.ge(target, 0.5)) 179 | tn = torch.sum(torch.lt(prediction, 0.5) * torch.lt(target, 0.5)) 180 | fp = torch.sum(torch.ge(prediction, 0.5) * torch.lt(target, 0.5)) 181 | fn = torch.sum(torch.lt(prediction, 0.5) * torch.ge(target, 0.5)) 182 | 183 | return tp, fp, tn, fn 184 | 185 | @staticmethod 186 | def calc_performance_measurements(tp, fp, tn, fn): 187 | """Calculate precision, recall, f1, mcc, and accuracy""" 188 | 189 | tp = float(tp) 190 | fp = float(fp) 191 | fn = float(fn) 192 | tn = float(tn) 193 | 194 | recall = prec = f1 = mcc = 0 195 | acc = round((tp + tn) / (tp + tn + fn + fp), 3) 196 | 197 | if tp > 0 or fn > 0: 198 | recall = round(tp / (tp + fn), 3) 199 | if tp > 0 or fp > 0: 200 | prec = round(tp / (tp + fp), 3) 201 | if recall > 0 or prec > 0: 202 | f1 = round(2 * recall * prec / (recall + prec), 3) 203 | if (tp > 0 or fp > 0) and (tp > 0 or fn > 0) and (tn > 0 or fp > 0) and (tn > 0 or fn > 0): 204 | mcc = round((tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)), 3) 205 | 206 | return acc, prec, recall, f1, mcc 207 | 208 | @staticmethod 209 | def combine_protein_performance(proteins, cutoff, labels): 210 | """ 211 | Calculate final model performance from per-protein performances 212 | :param proteins: 213 | :param cutoff: 214 | :param labels: 215 | :return: 216 | """ 217 | model_performances = {'overall': 
ModelPerformance(), 'metal': ModelPerformance(), 218 | 'nucleic': ModelPerformance(), 'small': ModelPerformance()} 219 | 220 | for p in proteins.keys(): 221 | prot = proteins[p] 222 | prot.set_labels(labels[p]) 223 | 224 | # calculate performance for protein 225 | performance = prot.calc_performance_measurements(cutoff) 226 | 227 | for k in performance.keys(): 228 | model_performance = model_performances[k] 229 | if 'tp' in performance[k].keys(): 230 | tp = performance[k]['tp'] 231 | fp = performance[k]['fp'] 232 | fn = performance[k]['fn'] 233 | tn = performance[k]['tn'] 234 | if k == 'overall' and (tp + fn) == 0: 235 | print('No residues annotated as binding for {}'.format(p)) 236 | if (tp + fp + fn) > 0: 237 | # only add performance if this protein binds or if one residue was predicted to bind 238 | model_performance.add_single_performance(0, performance[k]['acc'], performance[k]['prec'], 239 | performance[k]['recall'], performance[k]['f1'], 240 | performance[k]['mcc']) 241 | model_performance.add_confusion_matrix(tp, fp, tn, fn) 242 | else: 243 | model_performance.predictions['not_bound_not_predicted'] += 1 244 | 245 | return model_performances 246 | 247 | @staticmethod 248 | def write_performance_results(model_performances, out_file): 249 | """Write average performance""" 250 | with open(out_file, 'w') as out: 251 | out.write('Type\ttp\tfp\ttn\tfn\tprec\tprec.ci\trecall\trecall.ci\tf1\tf1.ci\tmcc\tmcc.ci\tacc\tacc.ci\n') 252 | for k in model_performances.keys(): 253 | model_performance = model_performances[k] 254 | acc, pr, rec, f1, mcc, acc_ci, pr_ci, rec_ci, f1_ci, mcc_ci = \ 255 | model_performance.get_mean_ci_performance() 256 | 257 | confusion_matrix = model_performance.confusion_matrix 258 | tp = confusion_matrix['tp'] 259 | fp = confusion_matrix['fp'] 260 | tn = confusion_matrix['tn'] 261 | fn = confusion_matrix['fn'] 262 | 263 | out.write('{}\t{}\t{}\t{}\t{}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}' 264 | 
'\t{:.3f}\n'.format(k, tp, fp, tn, fn, pr, pr_ci, rec, rec_ci, f1, f1_ci, mcc, mcc_ci, acc, 265 | acc_ci)) 266 | 267 | @staticmethod 268 | def print_performance_results(model_performances): 269 | """Print average performance""" 270 | 271 | for k in model_performances.keys(): 272 | print(k) 273 | model_performance = model_performances[k] 274 | acc, pr, rec, f1, mcc, acc_ci, pr_ci, rec_ci, f1_ci, mcc_ci = \ 275 | model_performance.get_mean_ci_performance() 276 | 277 | cov_proteins = model_performance.coverage 278 | if len(model_performance.accs) > 0: 279 | cov_percentage = cov_proteins / len(model_performance.accs) 280 | else: 281 | cov_percentage = 0.0 282 | 283 | confusion_matrix = model_performance.confusion_matrix 284 | predictions = model_performance.predictions 285 | 286 | print('CovOneBind: {} ({:.3f})'.format(cov_proteins, cov_percentage)) 287 | print('Bound: With predictions: {}, Without predictions: {}\nNot Bound: With predictions: {}, ' 288 | 'Without predictions: {}'.format(predictions['bound_predicted'], predictions['bound_not_predicted'], 289 | predictions['not_bound_predicted'], 290 | predictions['not_bound_not_predicted'])) 291 | print('TP: {}, FP: {}, TN: {}, FN: {}'.format(confusion_matrix['tp'], confusion_matrix['fp'], 292 | confusion_matrix['tn'], confusion_matrix['fn'])) 293 | print("Prec: {:.3f} +/- {:.3f}, Recall: {:.3f} +/- {:.3f}, F1: {:.3f} +/- {:.3f}, " 294 | "MCC: {:.3f} +/- {:.3f}, Acc: {:.3f} +/- {:.3f}".format(pr, pr_ci, rec, rec_ci, f1, f1_ci, mcc, 295 | mcc_ci, acc, acc_ci)) 296 | 297 | @staticmethod 298 | def print_cross_prediction_results(model_performances): 299 | """Print cross-predictions""" 300 | 301 | for k in model_performances.keys(): 302 | print(k) 303 | cross_predictions = model_performances[k].cross_prediction 304 | print('Metal: {}, Nucleic: {}, Small: {}, Non-Binding: {}'.format(cross_predictions[0], 305 | cross_predictions[1], 306 | cross_predictions[2], 307 | cross_predictions[3])) 308 | 
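The arithmetic in `calc_performance_measurements` above can be tried without torch or the rest of the repository. The following is a minimal sketch, not part of the repository: `performance_from_counts` is a hypothetical name, and the counts in the usage line are made-up illustration values. It mirrors the listed logic, including the convention that a metric stays at 0 when its denominator would be 0.

```python
import math


def performance_from_counts(tp, fp, tn, fn):
    """Hypothetical standalone mirror of calc_performance_measurements.

    Takes confusion-matrix counts and returns (acc, prec, recall, f1, mcc),
    each rounded to 3 decimals. Undefined metrics are reported as 0.
    """
    tp, fp, tn, fn = float(tp), float(fp), float(tn), float(fn)

    recall = prec = f1 = mcc = 0.0
    acc = round((tp + tn) / (tp + tn + fn + fp), 3)

    if tp + fn > 0:  # at least one residue annotated as binding
        recall = round(tp / (tp + fn), 3)
    if tp + fp > 0:  # at least one residue predicted as binding
        prec = round(tp / (tp + fp), 3)
    if recall + prec > 0:
        f1 = round(2 * recall * prec / (recall + prec), 3)

    # MCC is only defined if none of the four marginal sums is zero
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom > 0:
        mcc = round((tp * tn - fp * fn) / math.sqrt(denom), 3)

    return acc, prec, recall, f1, mcc


# Illustration values only: 100 residues, 13 of them binding, 10 predicted as binding
acc, prec, recall, f1, mcc = performance_from_counts(tp=8, fp=2, tn=85, fn=5)
print(acc, prec, recall, f1, mcc)
```

Because binding residues are rare, accuracy stays high even for a predictor that never predicts binding, which is one reason to look at F1 and MCC alongside it.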
-------------------------------------------------------------------------------- /bindEmbed21DL.py: -------------------------------------------------------------------------------- 1 | from data_preparation import ProteinInformation 2 | from ml_trainer import MLTrainer 3 | from ml_predictor import MLPredictor 4 | from config import FileSetter, FileManager 5 | from architectures import CNN2Layers 6 | 7 | import numpy as np 8 | from sklearn.model_selection import PredefinedSplit 9 | 10 | 11 | class BindEmbed21DL(object): 12 | 13 | @staticmethod 14 | def cross_train_pipeline(params, model_output, predictions_output, ri): 15 | """ 16 | Run cross-training pipeline for a specific set of parameters 17 | :param params: 18 | :param model_output: If None, trained model is not written 19 | :param predictions_output: If None, predictions are not written 20 | :param ri: Should RI or raw probabilities be written? 21 | :return: 22 | """ 23 | 24 | print("Prepare data") 25 | ids = [] 26 | fold_array = [] 27 | for s in range(1, 6): 28 | ids_in = '{}{}.txt'.format(FileSetter.split_ids_in(), s) 29 | split_ids = FileManager.read_ids(ids_in) 30 | 31 | ids += split_ids 32 | fold_array += [s] * len(split_ids) 33 | 34 | ps = PredefinedSplit(fold_array) 35 | 36 | # get sequences + maximum length + labels 37 | sequences, max_length, labels = ProteinInformation.get_data(ids) 38 | embeddings = FileManager.read_embeddings(FileSetter.embeddings_input()) 39 | 40 | proteins = dict() 41 | trainer = MLTrainer(pos_weights=params['weights']) 42 | 43 | for train_index, test_index in ps.split(): 44 | 45 | split_counter = fold_array[test_index[0]] 46 | train_ids = [ids[train_idx] for train_idx in train_index] 47 | validation_ids = [ids[test_idx] for test_idx in test_index] 48 | 49 | print("Train model") 50 | model_split = trainer.train_validate(params, train_ids, validation_ids, sequences, embeddings, labels, 51 | max_length, verbose=False) 52 | 53 | if model_output is not None: 54 | model_path = 
'{}{}.pt'.format(model_output, split_counter) 55 | FileManager.save_classifier_torch(model_split, model_path) 56 | 57 | print("Calculate predictions per protein") 58 | ml_predictor = MLPredictor(model_split) 59 | curr_proteins = ml_predictor.predict_per_protein(validation_ids, sequences, embeddings, labels, max_length) 60 | 61 | proteins = {**proteins, **curr_proteins} 62 | 63 | if predictions_output is not None: 64 | FileManager.write_predictions(proteins, predictions_output, 0.5, ri) 65 | 66 | return proteins 67 | 68 | @staticmethod 69 | def hyperparameter_optimization_pipeline(params, num_splits, result_file): 70 | """ 71 | Development pipeline used to optimize hyperparameters 72 | :param params: 73 | :param num_splits: 74 | :param result_file: 75 | :return: 76 | """ 77 | 78 | print("Prepare data") 79 | ids = [] 80 | fold_array = [] 81 | for s in range(1, num_splits + 1): 82 | ids_in = '{}{}.txt'.format(FileSetter.split_ids_in(), s) 83 | split_ids = FileManager.read_ids(ids_in) 84 | 85 | ids += split_ids 86 | fold_array += [s] * len(split_ids) 87 | 88 | ids = np.array(ids) 89 | 90 | # get sequences + maximum length + labels 91 | sequences, max_length, labels = ProteinInformation.get_data(ids) 92 | embeddings = FileManager.read_embeddings(FileSetter.embeddings_input()) 93 | 94 | print("Perform hyperparameter optimization") 95 | trainer = MLTrainer(pos_weights=params['weights']) 96 | del params['weights'] # remove weights to not consider as parameter for optimization 97 | 98 | model = trainer.cross_validate(params, ids, fold_array, sequences, embeddings, labels, max_length, result_file) 99 | 100 | return model 101 | 102 | @staticmethod 103 | def prediction_pipeline(model_prefix, cutoff, result_folder, ids, fasta_file, ri): 104 | """ 105 | Run predictions with bindEmbed21DL for a given list of proteins 106 | :param model_prefix: 107 | :param cutoff: Cutoff to use to define prediction as binding (default: 0.5) 108 | :param result_folder: 109 | :param ids: 110 | 
:param fasta_file: 111 | :param ri: Should RI or raw probabilities be written? 112 | :return: 113 | """ 114 | 115 | print("Prepare data") 116 | sequences, max_length, labels = ProteinInformation.get_data_predictions(ids, fasta_file) 117 | embeddings = FileManager.read_embeddings(FileSetter.embeddings_input()) 118 | 119 | proteins = dict() 120 | for i in range(0, 5): 121 | print("Load model") 122 | model_path = '{}{}.pt'.format(model_prefix, i + 1) 123 | model = FileManager.load_classifier_torch(model_path) 124 | 125 | print("Calculate predictions") 126 | ml_predictor = MLPredictor(model) 127 | curr_proteins = ml_predictor.predict_per_protein(ids, sequences, embeddings, labels, max_length) 128 | 129 | for k in curr_proteins.keys(): 130 | if k in proteins.keys(): 131 | prot = proteins[k] 132 | prot.add_predictions(curr_proteins[k].predictions) 133 | else: 134 | proteins[k] = curr_proteins[k] 135 | 136 | for k in proteins.keys(): 137 | proteins[k].normalize_predictions(5) 138 | 139 | if result_folder is not None: 140 | FileManager.write_predictions(proteins, result_folder, cutoff, ri) 141 | 142 | return proteins 143 | -------------------------------------------------------------------------------- /bindPredictML17/README.md: -------------------------------------------------------------------------------- 1 | # bindPredictML17 2 | 3 | bindPredictML17 is a sequence-based method to predict binding site residues. 4 | It uses an Artificial Neural Network (ANN) and was trained on enzymes and DNA-binding proteins 5 | (i.e. it is most suited to predict binding sites in these types of proteins).
6 | It uses 10 input features calculated for each residue: 7 | - cumulative coupling scores 8 | - cumulative coupling scores filtered by distance 9 | - cumulative coupling scores filtered by solvent accessibility 10 | - clustering coefficients 11 | - clustering coefficients filtered by distance 12 | - clustering coefficients filtered by solvent accessibility 13 | - BLOSUM62-SNAP2 comparison 14 | - BLOSUM62-EVmutation comparison 15 | - relative solvent accessibility 16 | - conservation 17 | 18 | 19 | ## Implementation 20 | bindPredict is written in Python3. The trained model using the 10 input features named above is included in this repository. 21 | bindPredict can be executed from the command line by running "bindPredict.py". 22 | To calculate the different scores used for the prediction, bindPredict needs pre-calculated results from external programs. 23 | As input parameters, the method needs: 24 | - path to a folder containing EVcouplings results 25 | (including \*.di_scores, \*_CouplingScoresCompared_all.csv, \*.epistatic_effect, \*_frequencies.csv, \*_alignment_statistics.csv) 26 | - path to a folder containing SNAP2 results 27 | - path to a folder containing PROFphd results 28 | 29 | The method then makes a prediction for every protein present in the SNAP2 results folder. Therefore, all files belonging to one protein should share the same base name (e.g. P0A0I7.snap2, P0A0I7.profphd,...) 30 | If any files are not provided, the corresponding scores are not calculated and are set to 0. 31 | 32 | ## External programs 33 | The pre-calculated results from external programs can be obtained as follows: 34 | 35 | - **SNAP2** results: SNAP2 can be run using [predictprotein.org](https://www.predictprotein.org/) [1]. After the job has finished, the results can be downloaded under "Effect of Point Mutations" using the export button. 36 | - **PROFphd** results: PROFphd can also be run using [predictprotein.org](https://www.predictprotein.org/). 
After the job has finished, the results view has to be changed to "text" (upper right corner) and the text part under "Secondary Structure" has to be copied to a text file. 37 | - **EVcouplings** results (\*.di_scores, \*_CouplingScoresCompared_all.csv, \*_frequencies.csv, \*_alignment_statistics.csv): [EVcouplings](http://evfold.org/evfold-web/newmarkec.do) [2,3] is available as a GitHub repository. A detailed description of how to run EVcouplings can be found [here](https://github.com/debbiemarkslab/EVcouplings). EVcouplings only provides EC scores inferred by plmDCA. To calculate DI scores, one can use 38 | - **FreeContact** [4], which can be downloaded as a [Debian package](https://packages.debian.org/search?keywords=freecontact). Using the alignment generated by EVcouplings, DI scores can be calculated using the option "evfold". 39 | - the **EVcouplings** [webserver](http://evfold.org/evfold-web/newmarkec.do). DI scores can be calculated by entering the UniProt identifier or the sequence in FASTA format and by choosing "DI" as coupling scoring. 40 | - **EVmutation** results (\*.epistatic_effect): 41 | [EVmutation](https://marks.hms.harvard.edu/evmutation/index.html) [5] is only available as a GitHub repository. A detailed description of how to run EVmutation can be found [here](https://htmlpreview.github.io/?https://github.com/debbiemarkslab/EVmutation/blob/master/EVmutation.html) 42 | 43 | If results from EVcouplings or EVmutation cannot be obtained due to problems with the local installation, a minimal version of bindPredict can be run by using the pre-trained model and providing only SNAP2 and PROFphd results as input for the prediction. All other input scores will be set to 0. However, this will limit the performance of the method. 44 | 45 | ## Cite 46 | If you are using this method and find it helpful, we would appreciate it if you could cite the following publication: 47 | Schelling, M., Hopf, T. A., Rost, B. (2018). 
[Evolutionary couplings and sequence variation effect predict protein binding sites](https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.25585). Proteins **86**: 1064-1074 48 | 49 | ## References 50 | [1] Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., Hönigschmid, P., Schafferhans, A., Roos, M., Bernhofer, M. et al. (2014). [PredictProtein - an open resource for online prediction of protein structural and functional features](https://academic.oup.com/nar/article/42/W1/W337/2435518). Nucleic Acids Research **42**, pages W337–W343. 51 | 52 | [2] Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., Sander, C. (2011). [Protein 3D Structure Computed from Evolutionary Sequence Variation](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028766). PLoS ONE **6**(12): e28766. 53 | 54 | [3] Hopf, T. A., Schärfe, C. P. I., Rodrigues, J. P. G. L. M., Green, A. G., Kohlbacher, O., Sander, C., Bonvin, A. M. J. J., Marks, D. S. (2014). [Sequence co-evolution gives 3D contacts and structures of protein complexes](https://elifesciences.org/articles/03430). eLife **3**: e03430. 55 | 56 | [4] Kaján, L., Hopf, T. A., Kalaš, M., Marks, D. S., Rost, B. (2014). [FreeContact: fast and free software for protein contact prediction from residue co-evolution](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-85). BMC Bioinformatics **15**: 85. 57 | 58 | [5] Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P. I., Springer, M., Sander, C., & Marks, D. S. (2017). 59 | [Mutation effects predicted from sequence co-variation](https://www.nature.com/articles/nbt.3769). Nature Biotechnology **35**, pages 128–135. 
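
## Example: checking available input files
The naming convention described under "Implementation" can be checked before running a prediction. The following is only a minimal sketch, not part of bindPredict: the function `list_available_inputs` and its parameters are hypothetical, the default suffixes mirror the defaults of `bindPredict.py` (`*.snap2`, `*.reprof`), and the layout with one EVcouplings sub-folder per protein is an assumption taken from the help text of `bindPredict.py`.

```python
from pathlib import Path

# Suffixes of the EVcouplings/EVmutation result files named in this README.
EVC_SUFFIXES = (
    ".di_scores",
    "_CouplingScoresCompared_all.csv",
    ".epistatic_effect",
    "_frequencies.csv",
    "_alignment_statistics.csv",
)


def list_available_inputs(evc_folder, snap_folder, solv_folder,
                          snap_suffix=".snap2", solv_suffix=".reprof"):
    """Hypothetical helper: map each protein id to the input files that exist.

    Protein ids are taken from the SNAP2 folder, because bindPredict makes
    a prediction for every protein that has a SNAP2 result.
    """
    available = {}
    for snap_file in sorted(Path(snap_folder).glob("*" + snap_suffix)):
        protein_id = snap_file.name[:-len(snap_suffix)]
        found = [snap_file.name]

        # Solvent accessibility file: same base name, different suffix.
        solv_file = Path(solv_folder) / (protein_id + solv_suffix)
        if solv_file.is_file():
            found.append(solv_file.name)

        # EVcouplings files (assumed layout): <evc_folder>/<id>/<id><suffix>
        for suffix in EVC_SUFFIXES:
            evc_file = Path(evc_folder) / protein_id / (protein_id + suffix)
            if evc_file.is_file():
                found.append(evc_file.name)

        available[protein_id] = found
    return available
```

Calling `list_available_inputs("evcouplings_res/", "snap_res/", "solv_acc/")` returns a dictionary keyed by protein id. Any file missing from a protein's list corresponds to a score that, according to the description above, would be set to 0.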
60 | 61 | -------------------------------------------------------------------------------- /bindPredictML17/bindPredict.py: -------------------------------------------------------------------------------- 1 | """ 2 | DESCRIPTION: 3 | 4 | Development and training of bindPredict to predict binding site residues 5 | """ 6 | 7 | import argparse 8 | import sys 9 | 10 | from scripts.helper import Helper 11 | from scripts.bindpredict_predictor import Predictor 12 | 13 | 14 | def main(): 15 | usage_string = 'python bindPredict.py evcouplings_res/ snap_res/ solv_acc/ results/' 16 | 17 | # parse command line options 18 | parser = argparse.ArgumentParser(description=__doc__, usage=usage_string) 19 | parser.add_argument('evc_folder', help='Folder with protein folders containing EVcouplings results') 20 | parser.add_argument('snap_folder', help='Folder with SNAP2 predictions. ' 21 | 'SNAP-Files need to have the same name in front of the suffix as ' 22 | 'the EVcouplings folders ' 23 | '(e.g. Q9XLZ3/ <-> Q9XLZ3.snap2)') 24 | parser.add_argument('solv_acc_folder', help='Folder with PROFacc predictions. ' 25 | 'PROFacc-Files need to have the same name in front of the suffix as ' 26 | 'the EVcouplings folders ' 27 | '(e.g. 
Q9XLZ3/ <-> Q9XLZ3.reprof)') 28 | parser.add_argument('output_folder', help='Folder to write binding site predictions to (one file for each protein)') 29 | parser.add_argument('-m', '--model', help='Input model to use for predictor training (default: mm3_cons_solv_more)', 30 | default='mm3_cons_solv_more') 31 | parser.add_argument('--snap_suffix', help='Suffix of files in given SNAP-folder (default: "*.snap2")', 32 | default='*.snap2') 33 | parser.add_argument('--profacc_suffix', help='Suffix of files in given PROFacc-folder (default: "*.reprof")', 34 | default='*.reprof') 35 | args = parser.parse_args() 36 | helper = Helper() 37 | if helper.folder_existence_check(args.evc_folder): # check if evc folder exists and is reachable 38 | if helper.folder_existence_check(args.snap_folder): # check if snap folder exists and is reachable 39 | if helper.folder_existence_check( 40 | args.solv_acc_folder): # check if solvent accessibility folder exists and is reachable 41 | predictor = Predictor() 42 | predictor.run_prediction(args) 43 | 44 | 45 | def error(*objs): 46 | print("ERROR: ", *objs, file=sys.stderr) 47 | 48 | 49 | if __name__ == "__main__": 50 | main() 51 | -------------------------------------------------------------------------------- /bindPredictML17/data/blosum_62.txt: -------------------------------------------------------------------------------- 1 | A R N D C Q E G H I L K M F P S T W Y V B J Z X * 2 | A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 -1 -1 -4 3 | R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2 0 -1 -4 4 | N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 -3 0 -1 -4 5 | D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 -3 1 -1 -4 6 | C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4 7 | Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 -2 4 -1 -4 8 | E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 -3 4 -1 -4 9 | G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 
-3 -1 -4 -2 -1 -4 10 | H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 -3 0 -1 -4 11 | I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 3 -3 -1 -4 12 | L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 -4 13 | K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 -3 1 -1 -4 14 | M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 2 -1 -1 -4 15 | F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 0 -3 -1 -4 16 | P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4 17 | S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 -2 0 -1 -4 18 | T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 -1 -1 -4 19 | W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -2 -2 -1 -4 20 | Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -1 -2 -1 -4 21 | V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 2 -2 -1 -4 22 | B -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 -4 23 | J -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 -4 24 | Z -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 -4 25 | X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4 26 | * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 27 | -------------------------------------------------------------------------------- /bindPredictML17/scripts/bindpredict_predictor.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import glob 4 | import numpy as np 5 | import joblib  # sklearn.externals.joblib is no longer available in recent scikit-learn releases 6 | from scripts.protein import Protein 7 | from scripts.file_manager import FileManager 8 | from scripts.select_model import SelectModel 9 | 10 | 11 | class Predictor(object): 12 | def __init__(self): 13 | self.query_proteins = dict() 14 | self.fm = FileManager() 15 | 16 | def run_prediction(self, args): 17 | # 1) Read files 18 | print('1) Read files...') 
19 | self.get_snap_files(args.snap_folder, args.snap_suffix) 20 | self.get_evc_files(args.evc_folder) 21 | self.get_solv_files(args.solv_acc_folder, args.profacc_suffix) 22 | # 2) Load saved model 23 | print('2) Load saved model...') 24 | cols_to_remove, ranges, words = SelectModel.define_model(args.model) 25 | classifier = joblib.load("trained_model/" + args.model + ".pkl") 26 | # 3) Prepare predictions 27 | print('3) Calculate scores for query...') 28 | self.calculate_scores(self.query_proteins) 29 | print('4) Run predictions...') 30 | for p in self.query_proteins: 31 | protein_scores = self.prepare_scores_prediction(p, cols_to_remove) 32 | protein_scores = np.array(protein_scores) 33 | # 4) Run predictions per protein 34 | predictions = [] 35 | for index in range(0, len(protein_scores)): 36 | el = protein_scores[index] 37 | to_test = el[:len(el) - 1]  # all columns except the last one are features 38 | pos = el[len(el) - 1]  # the last column holds the residue position 39 | to_test = np.reshape(to_test, (1, -1)) 40 | to_test = to_test.astype(float) 41 | proba = classifier.predict_proba(to_test) 42 | prediction = proba[0][1]  # probability of the positive (binding) class 43 | row = [pos, prediction] 44 | predictions.append(row) 45 | # 5) Write output 46 | self.write_output_predictions(args.output_folder, predictions, p) 47 | 48 | def get_snap_files(self, folder, suffix): 49 | """ Read all ids into one big dictionary. 50 | :param folder: Folder where the snap2 predictions are stored 51 | :param suffix: glob pattern matching the files (default: *.snap2) 52 | :return: 53 | """ 54 | 55 | snap_pattern = os.path.join(folder, suffix) 56 | for f in glob.iglob(snap_pattern): 57 | name = os.path.basename(f) 58 | suffix_start = name.find('.') 59 | name = name[:suffix_start] # reduce the file name to the prefix only 60 | self.query_proteins[name] = Protein() 61 | self.query_proteins[name].snap_file = f 62 | self.query_proteins[name].blosum_file = self.fm.blosum_file 63 | 64 | def get_evc_files(self, folder): 65 | """ For each id, determine the different evc files. 
66 | :param folder: Folder where the folders with EVcouplings predictions are stored 67 | :return: 68 | """ 69 | 70 | for p in self.query_proteins: 71 | evc_folder = os.path.join(folder, p) 72 | evc_file = os.path.join(evc_folder, (p + self.fm.ec_suffix)) 73 | dist_file = os.path.join(evc_folder, (p + self.fm.dist_suffix)) 74 | evmut_file = os.path.join(evc_folder, (p + self.fm.evmut_suffix)) 75 | cons_file = os.path.join(evc_folder, (p + self.fm.cons_suffix)) 76 | align_file = os.path.join(evc_folder, (p + self.fm.align_suffix)) 77 | 78 | self.query_proteins[p].evc_file = evc_file 79 | self.query_proteins[p].dist_file = dist_file 80 | self.query_proteins[p].evmut_file = evmut_file 81 | self.query_proteins[p].cons_file = cons_file 82 | 83 | counter = 1 84 | with open(align_file) as af: 85 | for line in af: 86 | if counter == 2: 87 | splitted = line.split(",") 88 | length = int(splitted[3]) 89 | length_cov = int(splitted[4]) 90 | self.query_proteins[p].length = length 91 | self.query_proteins[p].length_cov = length_cov 92 | counter += 1 93 | 94 | def get_solv_files(self, folder, suffix): 95 | """ For each id, save the solv file. 96 | :param folder: Folder where PROFacc/Reprof/PROFphd predictions are stored 97 | :param suffix: glob pattern matching the files (default: *.reprof) 98 | :return: 99 | """ 100 | 101 | solv_pattern = os.path.join(folder, suffix) 102 | for f in glob.iglob(solv_pattern): 103 | name = os.path.basename(f) 104 | suffix_start = name.find('.') 105 | name = name[:suffix_start] 106 | self.query_proteins[name].solv_file = f 107 | 108 | def calculate_scores(self, id_list): 109 | """ Calculate cum scores, clustering coefficients, SNAP2/BLOSUM, EVmut/BLOSUM, Solv and Cons. 
110 | :param id_list: list of ids to calculate scores for 111 | :return 112 | """ 113 | 114 | for i in id_list: 115 | self.query_proteins[i].calc_scores() 116 | 117 | def prepare_scores_prediction(self, p, cols_to_remove): 118 | data = [] 119 | scores = self.query_proteins[p].scores 120 | for res in scores.keys(): 121 | resi = str(res) 122 | scores_res = scores[res] 123 | row = [float(i) for j, i in enumerate(scores_res) if j not in cols_to_remove] 124 | row = row + [resi] 125 | data.append(row) 126 | 127 | return data 128 | 129 | @staticmethod 130 | def write_output_predictions(folder, predictions, p): 131 | """ Write predictions results into output file. 132 | :param folder: Folder to write the prediction results to 133 | :param p: Protein id to use as filename 134 | :param predictions: Per residue predictions 135 | :return: 136 | """ 137 | 138 | out_file = os.path.join(folder, (p + ".bindPredict_out")) 139 | with open(out_file, 'w') as out: 140 | # format: res, proba_pos 141 | for pred in predictions: 142 | label = "nb" 143 | if pred[1] >= 0.6: 144 | label = "b" 145 | out.write(str(pred[0]) + "\t" + str(pred[1]) + "\t" + label + "\n") 146 | 147 | 148 | def error(*objs): 149 | print("ERROR: ", *objs, file=sys.stderr) 150 | -------------------------------------------------------------------------------- /bindPredictML17/scripts/file_manager.py: -------------------------------------------------------------------------------- 1 | """ Class specifying file endings and locations. 
2 | Before usage, values should be adjusted to local settings""" 3 | 4 | 5 | class FileManager(object): 6 | 7 | @property 8 | def ec_suffix(self): 9 | return ".di_scores" 10 | 11 | @property 12 | def dist_suffix(self): 13 | return "_CouplingScoresCompared_all.csv" 14 | 15 | @property 16 | def evmut_suffix(self): 17 | return ".epistatic_effect" 18 | 19 | @property 20 | def cons_suffix(self): 21 | return "_frequencies.csv" 22 | 23 | @property 24 | def align_suffix(self): 25 | return "_alignment_statistics.csv" 26 | 27 | @property 28 | def ids_file(self): 29 | file_name = "ids_split" 30 | return file_name 31 | 32 | @property 33 | def ids_folder(self): 34 | return "data/id_lists/" 35 | 36 | @property 37 | def binding_file(self): 38 | return "data/binding_residues_whole_swissprot_all.txt" 39 | 40 | @property 41 | def blosum_file(self): 42 | return "data/blosum_62.txt" 43 | -------------------------------------------------------------------------------- /bindPredictML17/scripts/helper.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | 4 | class Helper(object): 5 | @staticmethod 6 | def folder_existence_check(folder_name=None): 7 | """ 8 | Check if folder exists and is reachable, if not exit the program. 
9 | :param folder_name: Folder to check, if it exists (default: None) 10 | :type folder_name: String 11 | :return: True if folder exists and is reachable; otherwise the program exits 12 | """ 13 | if os.path.exists(folder_name): 14 | folder_is_there = True 15 | else: 16 | # os.error is an alias of OSError; calling it only built an exception object without reporting it 17 | folder_is_there = False 18 | raise SystemExit('Folder {fl} does not exist or is not reachable - exit!'.format(fl=folder_name)) 19 | 20 | return folder_is_there 21 | 22 | @staticmethod 23 | def read_more_data(ids_folder, run): 24 | """ 25 | Read ids for more data for the training set of the specified run 26 | :param ids_folder: Folder where id files are located 27 | :param run: Current run of cross-validation to read data for 28 | :return: ids for more data 29 | """ 30 | 31 | ids = [] 32 | with open(os.path.join(ids_folder, ("ids_run" + str(run) + "_more.txt"))) as f: 33 | for line in f: 34 | ids.append(line.strip()) 35 | 36 | return ids 37 | 38 | @staticmethod 39 | def read_id_lists(ids_folder, ids_file): 40 | """ 41 | Read the id lists from a given folder 42 | :param ids_folder: Folder where id files are located 43 | :param ids_file: prefix of id files 44 | :return: ids of split 1-5 45 | """ 46 | 47 | split1_ids = [] 48 | split2_ids = [] 49 | split3_ids = [] 50 | split4_ids = [] 51 | split5_ids = [] 52 | 53 | with open(os.path.join(ids_folder, (ids_file + "1.txt"))) as f: 54 | for line in f: 55 | split1_ids.append(line.strip()) 56 | 57 | with open(os.path.join(ids_folder, (ids_file + "2.txt"))) as f: 58 | for line in f: 59 | split2_ids.append(line.strip()) 60 | 61 | with open(os.path.join(ids_folder, (ids_file + "3.txt"))) as f: 62 | for line in f: 63 | split3_ids.append(line.strip()) 64 | 65 | with open(os.path.join(ids_folder, (ids_file + "4.txt"))) as f: 66 | for line in f: 67 | split4_ids.append(line.strip()) 68 | 69 | with open(os.path.join(ids_folder, (ids_file + "5.txt"))) as f: 70 | for line in f: 71 | split5_ids.append(line.strip()) 72 | 73 | return split1_ids, split2_ids, split3_ids, 
split4_ids, split5_ids 74 | -------------------------------------------------------------------------------- /bindPredictML17/scripts/parser.py: -------------------------------------------------------------------------------- 1 | """ Provides methods to parse output 2 | from secondary structure and solvent accessibility predictions like ProfPHD and Reprof. 3 | """ 4 | 5 | import sys 6 | from Bio.Seq import Seq 7 | 8 | 9 | class ProfParser(object): 10 | """ Parse and save information from ProfPHD or Reprof output. 11 | """ 12 | 13 | # general 14 | seq = Seq("") 15 | 16 | # sec structure 17 | o_sec_struc = [] 18 | p_sec_struc = [] 19 | ri_sec_struc = [] 20 | p_h = [] 21 | p_e = [] 22 | p_l = [] 23 | o_t_h = [] 24 | o_t_e = [] 25 | o_t_l = [] 26 | 27 | # solv access 28 | o_solv_acc = [] 29 | p_solv_acc = [] 30 | o_rel_solv_10 = [] 31 | p_rel_solv_10 = [] 32 | p_rel_solv = [] 33 | p_10 = [] 34 | ri_solv_acc = [] 35 | o_be = [] 36 | p_be = [] 37 | o_bie = [] 38 | p_bie = [] 39 | o_t_0 = [] 40 | o_t_1 = [] 41 | o_t_2 = [] 42 | o_t_3 = [] 43 | o_t_4 = [] 44 | o_t_5 = [] 45 | o_t_6 = [] 46 | o_t_7 = [] 47 | o_t_8 = [] 48 | o_t_9 = [] 49 | 50 | # tmh 51 | omn = [] 52 | pmn = [] 53 | prmn = [] 54 | ri_m = [] 55 | p_m = [] 56 | p_n = [] 57 | o_t_m = [] 58 | o_t_n = [] 59 | 60 | def __init__(self, file_in): 61 | self.seq = Seq("") 62 | 63 | self.o_sec_struc = [] 64 | self.p_sec_struc = [] 65 | self.ri_sec_struc = [] 66 | self.p_h = [] 67 | self.p_e = [] 68 | self.p_l = [] 69 | self.o_t_h = [] 70 | self.o_t_e = [] 71 | self.o_t_l = [] 72 | 73 | self.o_solv_acc = [] 74 | self.p_solv_acc = [] 75 | self.o_rel_solv_10 = [] 76 | self.p_rel_solv_10 = [] 77 | self.p_rel_solv = [] 78 | self.p_10 = [] 79 | self.ri_solv_acc = [] 80 | self.o_be = [] 81 | self.p_be = [] 82 | self.o_bie = [] 83 | self.p_bie = [] 84 | self.o_t_0 = [] 85 | self.o_t_1 = [] 86 | self.o_t_2 = [] 87 | self.o_t_3 = [] 88 | self.o_t_4 = [] 89 | self.o_t_5 = [] 90 | self.o_t_6 = [] 91 | self.o_t_7 = [] 92 | 
self.o_t_8 = [] 93 | self.o_t_9 = [] 94 | 95 | self.omn = [] 96 | self.pmn = [] 97 | self.prmn = [] 98 | self.ri_m = [] 99 | self.p_m = [] 100 | self.p_n = [] 101 | self.o_t_m = [] 102 | self.o_t_n = [] 103 | 104 | self._parse_file(file_in) 105 | 106 | def _parse_file(self, file_in): 107 | rows = [] 108 | 109 | with open(file_in) as f: 110 | for line in f: 111 | if '#' not in line: 112 | row = line.strip().split("\t") 113 | rows.append(row) 114 | 115 | tmp = rows.pop(0) 116 | 117 | col_map = {} 118 | for i in range(0, len(tmp)): 119 | col = tmp[i] 120 | col_map[col] = i 121 | 122 | self._parse_profphd(rows, col_map) 123 | 124 | def _parse_profphd(self, data, col_map): 125 | seq_str = '' 126 | 127 | for i in range(0, len(data)): 128 | el = data[i] 129 | aa = el[1] 130 | seq_str = seq_str + aa 131 | 132 | # get column indices 133 | o_sec_i = -1 134 | if 'OHEL' in col_map: 135 | o_sec_i = col_map['OHEL'] 136 | p_sec_i = -1 137 | if 'PHEL' in col_map: 138 | p_sec_i = col_map['PHEL'] 139 | ri_sec_i = -1 140 | if 'RI_S' in col_map: 141 | ri_sec_i = col_map['RI_S'] 142 | p_h_i = -1 143 | if 'pH' in col_map: 144 | p_h_i = col_map['pH'] 145 | p_e_i = -1 146 | if 'pE' in col_map: 147 | p_e_i = col_map['pE'] 148 | p_l_i = -1 149 | if 'pL' in col_map: 150 | p_l_i = col_map['pL'] 151 | o_t_h_i = -1 152 | if 'OtH' in col_map: 153 | o_t_h_i = col_map['OtH'] 154 | o_t_e_i = -1 155 | if 'OtE' in col_map: 156 | o_t_e_i = col_map['OtE'] 157 | o_t_l_i = -1 158 | if 'OtL' in col_map: 159 | o_t_l_i = col_map['OtL'] 160 | 161 | o_solv_acc_i = -1 162 | if 'OACC' in col_map: 163 | o_solv_acc_i = col_map['OACC'] 164 | p_solv_acc_i = -1 165 | if 'PACC' in col_map: 166 | p_solv_acc_i = col_map['PACC'] 167 | o_rel_solv_10_i = -1 168 | if 'OREL' in col_map: 169 | o_rel_solv_10_i = col_map['OREL'] 170 | p_rel_solv_10_i = -1 171 | if 'PREL' in col_map: 172 | p_rel_solv_10_i = col_map['PREL'] 173 | p_10_i = -1 174 | if 'P10' in col_map: 175 | p_10_i = col_map['P10'] 176 | ri_solv_i = -1 177 | if 
'RI_A' in col_map: 178 | ri_solv_i = col_map['RI_A'] 179 | o_be_i = -1 180 | if 'Obe' in col_map: 181 | o_be_i = col_map['Obe'] 182 | p_be_i = -1 183 | if 'Pbe' in col_map: 184 | p_be_i = col_map['Pbe'] 185 | o_bie_i = -1 186 | if 'Obie' in col_map: 187 | o_bie_i = col_map['Obie'] 188 | p_bie_i = -1 189 | if 'Pbie' in col_map: 190 | p_bie_i = col_map['Pbie'] 191 | o_t_0_i = -1 192 | if 'Ot0' in col_map: 193 | o_t_0_i = col_map['Ot0'] 194 | o_t_1_i = -1 195 | if 'Ot1' in col_map: 196 | o_t_1_i = col_map['Ot1'] 197 | o_t_2_i = -1 198 | if 'Ot2' in col_map: 199 | o_t_2_i = col_map['Ot2'] 200 | o_t_3_i = -1 201 | if 'Ot3' in col_map: 202 | o_t_3_i = col_map['Ot3'] 203 | o_t_4_i = -1 204 | if 'Ot4' in col_map: 205 | o_t_4_i = col_map['Ot4'] 206 | o_t_5_i = -1 207 | if 'Ot5' in col_map: 208 | o_t_5_i = col_map['Ot5'] 209 | o_t_6_i = -1 210 | if 'Ot6' in col_map: 211 | o_t_6_i = col_map['Ot6'] 212 | o_t_7_i = -1 213 | if 'Ot7' in col_map: 214 | o_t_7_i = col_map['Ot7'] 215 | o_t_8_i = -1 216 | if 'Ot8' in col_map: 217 | o_t_8_i = col_map['Ot8'] 218 | o_t_9_i = -1 219 | if 'Ot9' in col_map: 220 | o_t_9_i = col_map['Ot9'] 221 | 222 | omn_i = -1 223 | if 'OMN' in col_map: 224 | omn_i = col_map['OMN'] 225 | pmn_i = -1 226 | if 'PMN' in col_map: 227 | pmn_i = col_map['PMN'] 228 | prmn_i = -1 229 | if 'PRMN' in col_map: 230 | prmn_i = col_map['PRMN'] 231 | ri_m_i = -1 232 | if 'RI_M' in col_map: 233 | ri_m_i = col_map['RI_M'] 234 | p_m_i = -1 235 | if 'pM' in col_map: 236 | p_m_i = col_map['pM'] 237 | p_n_i = -1 238 | if 'pN' in col_map: 239 | p_n_i = col_map['pN'] 240 | o_t_m_i = -1 241 | if 'OtM' in col_map: 242 | o_t_m_i = col_map['OtM'] 243 | o_t_n_i = -1 244 | if 'OtN' in col_map: 245 | o_t_n_i = col_map['OtN'] 246 | 247 | if o_sec_i > -1: 248 | self.o_sec_struc.append(el[o_sec_i]) 249 | if p_sec_i > -1: 250 | self.p_sec_struc.append(el[p_sec_i]) 251 | if ri_sec_i > -1: 252 | self.ri_sec_struc.append(int(el[ri_sec_i])) 253 | if p_h_i > -1: 254 | 
self.p_h.append(int(el[p_h_i])) 255 | if p_e_i > -1: 256 | self.p_e.append(int(el[p_e_i])) 257 | if p_l_i > -1: 258 | self.p_l.append(int(el[p_l_i])) 259 | if o_t_h_i > -1: 260 | self.o_t_h.append(int(el[o_t_h_i])) 261 | if o_t_e_i > -1: 262 | self.o_t_e.append(int(el[o_t_e_i])) 263 | if o_t_l_i > -1: 264 | self.o_t_l.append(int(el[o_t_l_i])) 265 | 266 | if o_solv_acc_i > -1: 267 | self.o_solv_acc.append(int(el[o_solv_acc_i])) 268 | if p_solv_acc_i > -1: 269 | self.p_solv_acc.append(int(el[p_solv_acc_i])) 270 | if o_rel_solv_10_i > -1: 271 | self.o_rel_solv_10.append(int(el[o_rel_solv_10_i])) 272 | if p_rel_solv_10_i > -1: 273 | self.p_rel_solv_10.append(int(el[p_rel_solv_10_i])) 274 | if p_10_i > -1: 275 | self.p_10.append(int(el[p_10_i])) 276 | if ri_solv_i > -1: 277 | self.ri_solv_acc.append(int(el[ri_solv_i])) 278 | if o_be_i > -1: 279 | self.o_be.append(el[o_be_i]) 280 | if p_be_i > -1: 281 | self.p_be.append(el[p_be_i]) 282 | if o_bie_i > -1: 283 | self.o_bie.append(el[o_bie_i]) 284 | if p_bie_i > -1: 285 | self.p_bie.append(el[p_bie_i]) 286 | if o_t_0_i > -1: 287 | self.o_t_0.append(int(el[o_t_0_i])) 288 | if o_t_1_i > -1: 289 | self.o_t_1.append(int(el[o_t_1_i])) 290 | if o_t_2_i > -1: 291 | self.o_t_2.append(int(el[o_t_2_i])) 292 | if o_t_3_i > -1: 293 | self.o_t_3.append(int(el[o_t_3_i])) 294 | if o_t_4_i > -1: 295 | self.o_t_4.append(int(el[o_t_4_i])) 296 | if o_t_5_i > -1: 297 | self.o_t_5.append(int(el[o_t_5_i])) 298 | if o_t_6_i > -1: 299 | self.o_t_6.append(int(el[o_t_6_i])) 300 | if o_t_7_i > -1: 301 | self.o_t_7.append(int(el[o_t_7_i])) 302 | if o_t_8_i > -1: 303 | self.o_t_8.append(int(el[o_t_8_i])) 304 | if o_t_9_i > -1: 305 | self.o_t_9.append(int(el[o_t_9_i])) 306 | 307 | if omn_i > -1: 308 | self.omn.append(el[omn_i]) 309 | if pmn_i > -1: 310 | self.pmn.append(el[pmn_i]) 311 | if prmn_i > -1: 312 | self.prmn.append(el[prmn_i]) 313 | if ri_m_i > -1: 314 | self.ri_m.append(el[ri_m_i]) 315 | if p_m_i > -1: 316 | self.p_m.append(el[p_m_i]) 317 | 
if p_n_i > -1: 318 | self.p_n.append(el[p_n_i]) 319 | if o_t_m_i > -1: 320 | self.o_t_m.append(el[o_t_m_i]) 321 | if o_t_n_i > -1: 322 | self.o_t_n.append(el[o_t_n_i]) 323 | 324 | self.seq = Seq(seq_str) 325 | 326 | 327 | def error(*objs): 328 | print("ERROR: ", *objs, file=sys.stderr) 329 | -------------------------------------------------------------------------------- /bindPredictML17/scripts/protein.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import re 3 | import os.path 4 | import sys 5 | import numpy 6 | from scripts.parser import ProfParser 7 | 8 | 9 | class Protein(object): 10 | """ A protein in the context of binding site prediction """ 11 | 12 | def __init__(self): 13 | self.snap_file = None 14 | self.evc_file = None 15 | self.dist_file = None 16 | self.evmut_file = None 17 | self.cons_file = None 18 | self.solv_file = None 19 | self.length = 0 20 | self.length_cov = 0 21 | self.predictions = dict() 22 | self.performance = {'TP': 0, 'FP': 0, 'TN': 0, 'FN': 0, 'Cov': 0, 'Prec': 0, 'F1': 0, 'Acc': 0} 23 | self.scores = defaultdict(list) 24 | self.binding_res = [] 25 | self.seq_dist = 5 26 | self.threshold = 0.0 27 | self.average = 0.0 28 | self.factor = 2 29 | self.cs_dist_thresh = 8.0 30 | self.cc_dist_thresh = 20.0 31 | self.blosum_file = None 32 | self.blosum_thresh = 0.0 33 | self.blosum_mat = defaultdict(dict) 34 | self.snap_percentage = 0.3 35 | self.evc_percentage = 0.4 36 | self.abs_vals = {'A': 106, 'C': 135, 'F': 197, 'I': 169, 'M': 188, 'Q': 198, 'T': 142, 'X': 180, 'Y': 222, 37 | 'B': 160, 'D': 163, 'G': 84, 'K': 205, 'N': 157, 'R': 248, 'V': 142, 'Z': 196, 'E': 194, 38 | 'H': 184, 'L': 164, 'P': 136, 'S': 130, 'W': 227, 'U': 135} 39 | 40 | def calc_scores(self): 41 | """ Calculate cum scores, clustering coefficients, SNAP2/BLOSUM, EVmut/BLOSUM, Solv and Cons for a protein. 
42 | If any of the needed files do not exist, the corresponding are not calculated for this protein 43 | :param: 44 | :return 45 | """ 46 | 47 | residue_map = defaultdict(dict) 48 | max_len = self.length + 3 49 | for r in range(-1, max_len): 50 | r = str(r) 51 | residue_map[r] = {'Cum': 0, 'Cum_Dist': 0, 'Cum_Solv': 0, 52 | 'Cluster': 0, 'Cluster_Dist': 0, 'Cluster_Solv': 0, 53 | 'SNAP': 0, 'EVmut': 0, 'Cons': 0, 'Solv': 0} 54 | 55 | # calculate BLOSUM comparison 56 | self.read_blosum_matrix() 57 | # for SNAP2 58 | if os.path.isfile(self.snap_file): 59 | snap_effect = self.get_snap_effect() 60 | snap_thresh = self.determine_snap_thresh(snap_effect) 61 | snap_blosum_diff = self.get_blosum_diff(snap_effect, snap_thresh, 'snap') 62 | for r in snap_blosum_diff.keys(): 63 | diff = snap_blosum_diff[r] 64 | r = str(r) 65 | residue_map[r]['SNAP'] = diff 66 | # for EVmutation 67 | if os.path.isfile(self.evmut_file): 68 | evc_effect = self.get_evc_effect() 69 | evc_thresh = self.determine_evc_thresh(evc_effect) 70 | evc_blosum_diff = self.get_blosum_diff(evc_effect, evc_thresh, 'evmut') 71 | for r in evc_blosum_diff.keys(): 72 | diff = evc_blosum_diff[r] 73 | r = str(r) 74 | residue_map[r]['EVmut'] = diff 75 | 76 | # get per residue conservation 77 | line_num = 1 78 | if os.path.isfile(self.cons_file): 79 | with open(self.cons_file) as cons: 80 | for line in cons: 81 | if line_num > 1: 82 | splitted = line.split(",") 83 | pos = splitted[0] 84 | conservation = splitted[2] 85 | residue_map[pos]['Cons'] = conservation 86 | line_num += 1 87 | 88 | # get per residue solvent accessibility 89 | solv_acc = dict() 90 | if os.path.isfile(self.solv_file): 91 | pp = ProfParser(self.solv_file) 92 | seq = pp.seq 93 | solv = pp.p_solv_acc 94 | 95 | for i in range(0, len(solv)): 96 | val = solv[i] 97 | letter = seq[i] 98 | max_val = self.abs_vals[letter] 99 | rel_val = val / max_val 100 | rel_val = round(rel_val, 3) 101 | pos = i + 1 102 | residue_map[str(pos)]['Solv'] = rel_val 103 | 
solv_acc[pos] = rel_val 104 | 105 | if os.path.isfile(self.evc_file): 106 | # calculate cumulative coupling scores 107 | distances = self.get_distances() 108 | ec_scores = self.get_ec_scores() 109 | self.calc_thresh_avg(ec_scores) 110 | # don't filter 111 | filtered_scores = self.filter_scores(ec_scores, distances, solv_acc, 'none', 0) 112 | cumulative_scores = self.calc_cum_scores(filtered_scores) 113 | # filter by distance 114 | filtered_dist_scores = self.filter_scores(ec_scores, distances, solv_acc, 'dist', self.cs_dist_thresh) 115 | cumulative_dist_scores = self.calc_cum_scores(filtered_dist_scores) 116 | # filter by solvent accessibility 117 | filtered_solv_scores = self.filter_scores(ec_scores, distances, solv_acc, 'solv', 0) 118 | cumulative_solv_scores = self.calc_cum_scores(filtered_solv_scores) 119 | 120 | # calculate clustering coefficients 121 | # don't filter 122 | clustering_coefficients = self.calc_clustering_coefficients(filtered_scores) 123 | # filter by distance 124 | filtered_dist_scores = self.filter_scores(ec_scores, distances, solv_acc, 'dist', self.cc_dist_thresh) 125 | clustering_coefficients_dist = self.calc_clustering_coefficients(filtered_dist_scores) 126 | # filter by solvent accessibility 127 | clustering_coefficients_solv = self.calc_clustering_coefficients(filtered_solv_scores) 128 | 129 | for r in range(1, self.length + 1): 130 | cum_score = cumulative_scores[r] 131 | cum_dist_score = cumulative_dist_scores[r] 132 | cum_solv_score = cumulative_solv_scores[r] 133 | cluster_score = clustering_coefficients[r] 134 | cluster_dist_score = clustering_coefficients_dist[r] 135 | cluster_solv_score = clustering_coefficients_solv[r] 136 | r = str(r) 137 | residue_map[r]['Cum'] = cum_score 138 | residue_map[r]['Cum_Dist'] = cum_dist_score 139 | residue_map[r]['Cum_Solv'] = cum_solv_score 140 | residue_map[r]['Cluster'] = cluster_score 141 | residue_map[r]['Cluster_Dist'] = cluster_dist_score 142 | residue_map[r]['Cluster_Solv'] = 
cluster_solv_score 143 | 144 | # convert to right format 145 | for r in residue_map.keys(): 146 | r = int(r) 147 | if 0 < r <= self.length: 148 | r1 = r - 2 149 | r2 = r - 1 150 | r3 = r + 1 151 | r4 = r + 2 152 | r_map = residue_map[str(r)] 153 | r1_map = residue_map[str(r1)] 154 | r2_map = residue_map[str(r2)] 155 | r3_map = residue_map[str(r3)] 156 | r4_map = residue_map[str(r4)] 157 | 158 | scores = [r_map['Cum'], r2_map['Cum'], r3_map['Cum'], r1_map['Cum'], r4_map['Cum'], 159 | r_map['Cum_Solv'], r2_map['Cum_Solv'], r3_map['Cum_Solv'], r1_map['Cum_Solv'], 160 | r4_map['Cum_Solv'], 161 | r_map['Cum_Dist'], r2_map['Cum_Dist'], r3_map['Cum_Dist'], r1_map['Cum_Dist'], 162 | r4_map['Cum_Dist'], 163 | r_map['Cluster'], r2_map['Cluster'], r3_map['Cluster'], r1_map['Cluster'], r4_map['Cluster'], 164 | r_map['Cluster_Solv'], r2_map['Cluster_Solv'], r3_map['Cluster_Solv'], r1_map['Cluster_Solv'], 165 | r4_map['Cluster_Solv'], 166 | r_map['Cluster_Dist'], r2_map['Cluster_Dist'], r3_map['Cluster_Dist'], r1_map['Cluster_Dist'], 167 | r4_map['Cluster_Dist'], 168 | r_map['EVmut'], r2_map['EVmut'], r3_map['EVmut'], r1_map['EVmut'], r4_map['EVmut'], 169 | r_map['SNAP'], r2_map['SNAP'], r3_map['SNAP'], r1_map['SNAP'], r4_map['SNAP'], 170 | r_map['Solv'], r2_map['Solv'], r3_map['Solv'], r1_map['Solv'], r4_map['Solv'], 171 | r_map['Cons'], r2_map['Cons'], r3_map['Cons'], r1_map['Cons'], r4_map['Cons']] 172 | self.scores[r] = scores 173 | 174 | def get_distances(self): 175 | """ 176 | read pairwise distances from file 177 | :return: 178 | """ 179 | line_num = 1 180 | dist_dict = dict() 181 | with open(self.dist_file) as f: 182 | for line in f: 183 | if line_num > 1: 184 | splitted = line.split(",") 185 | pos1 = splitted[0] 186 | pos2 = splitted[1] 187 | comb_pos = pos1 + "/" + pos2 188 | dist = 0.0 189 | if splitted[len(splitted) - 2] != "" and re.search("[a-zA-Z]+", 190 | splitted[len(splitted) - 2]) is None: 191 | dist = float(splitted[len(splitted) - 2]) 192 | 
dist_dict[comb_pos] = dist 193 | line_num += 1 194 | return dist_dict 195 | 196 | def get_ec_scores(self): 197 | """ 198 | read pairwise EC scores from file 199 | :return: 200 | """ 201 | ec_dict = dict() 202 | with open(self.evc_file) as f: 203 | for line in f: 204 | splitted = line.split(" ") 205 | comb_pos = splitted[0].strip() + "/" + splitted[2].strip() 206 | score = float(splitted[4]) 207 | ec_dict[comb_pos] = score 208 | return ec_dict 209 | 210 | def calc_thresh_avg(self, ec_scores): 211 | """ 212 | Determine the score cutoff (self.threshold) and the average score (self.average, used for normalization) 213 | from the high-ranking coupling pairs in a set of DI scores 214 | :param ec_scores: Dict of coupling scores, keyed by "pos1/pos2" 215 | :return: 216 | """ 217 | num = self.length_cov * self.factor 218 | high_scores = list() 219 | counter = 0 220 | for key in ec_scores.keys(): 221 | splitted = key.split("/") 222 | pos1 = int(splitted[0]) 223 | pos2 = int(splitted[1]) 224 | score = ec_scores[key] 225 | if abs(pos1 - pos2) > self.seq_dist: 226 | if len(high_scores) >= num: 227 | el = high_scores[0] 228 | if el < score: 229 | del high_scores[0] 230 | high_scores.append(score) 231 | else: 232 | high_scores.append(score) 233 | high_scores.sort() 234 | else: 235 | counter += 1 236 | thresh = high_scores[0] 237 | avg = float(numpy.mean(high_scores)) 238 | average = round(avg, 2) 239 | self.threshold = thresh 240 | self.average = average 241 | 242 | def filter_scores(self, ec_scores, distances, solv_acc, identifier, dist_thresh): 243 | """ 244 | Filter coupling scores 245 | :param ec_scores: Dict of coupling scores, keyed by "pos1/pos2" 246 | :param distances: Pairwise distances 247 | :param solv_acc: Per-residue solvent accessibility 248 | :param identifier: To identify which filter should be applied (none/dist/solv) 249 | :param dist_thresh: Threshold to use as distance cutoff 250 | :return: filtered scores 251 | """ 252 | filtered_scores = dict() 253 | 254 | for key in ec_scores.keys(): 255 | score = ec_scores[key] 256 | key_parts = 
key.split("/") 257 | pos1 = int(key_parts[0]) 258 | pos2 = int(key_parts[1]) 259 | curr_seq_dist = abs(pos1 - pos2) 260 | if curr_seq_dist > self.seq_dist and score >= self.threshold: 261 | if identifier == 'none': 262 | filtered_scores[key] = score 263 | elif identifier == 'solv': 264 | core = False 265 | if pos1 in solv_acc.keys() and pos2 in solv_acc.keys(): 266 | solv_acc1 = solv_acc[pos1] 267 | solv_acc2 = solv_acc[pos2] 268 | if solv_acc1 <= 0.1 and solv_acc2 <= 0.1: 269 | core = True 270 | if not core: 271 | filtered_scores[key] = score 272 | elif identifier == 'dist': 273 | pair_dist = float('Inf') 274 | if key in distances: 275 | pair_dist = distances[key] 276 | if pair_dist <= dist_thresh: 277 | filtered_scores[key] = score 278 | else: 279 | print("Invalid filter applied!") 280 | return filtered_scores 281 | 282 | def calc_cum_scores(self, scores): 283 | """ Calc cumulative coupling scores from a given set of EC scores 284 | :param scores: list of ec scores 285 | :return: cumulative coupling scores 286 | """ 287 | cum_scores = dict() 288 | for i in range(1, self.length + 1): 289 | sum_ec = 0.0 290 | num = 0 291 | for j in range(1, self.length + 1): 292 | if i <= j: 293 | key = str(i) + "/" + str(j) 294 | else: 295 | key = str(j) + "/" + str(i) 296 | if key in scores.keys(): 297 | score = scores[key] 298 | sum_ec += score 299 | num += 1 300 | ec_strength = 0.0 301 | if num > 0: 302 | ec_strength = sum_ec / self.average 303 | ec_strength = round(ec_strength, 3) 304 | cum_scores[i] = ec_strength 305 | 306 | return cum_scores 307 | 308 | def calc_clustering_coefficients(self, scores): 309 | """ Calc clustering coefficients from a given set of EC scores 310 | :param scores: list of ec scores 311 | :return: clustering coefficients 312 | """ 313 | coefficients = dict() 314 | for i in range(1, self.length + 1): 315 | neighbourhood = set() 316 | for j in range(1, self.length + 1): 317 | if i <= j: 318 | key = str(i) + "/" + str(j) 319 | else: 320 | key = str(j) + 
"/" + str(i) 321 | # if key == '228/234': 322 | # print(key + "\t" + str(key in scores.keys())) 323 | if key in scores.keys(): 324 | neighbourhood.add(j) 325 | 326 | num_edges = 0 327 | for n1 in neighbourhood: 328 | for n2 in neighbourhood: 329 | if n1 <= n2: 330 | key = str(n1) + "/" + str(n2) 331 | else: 332 | key = str(n2) + "/" + str(n1) 333 | if key in scores.keys(): 334 | num_edges += 1 335 | coeff = 0 336 | 337 | set_size = len(neighbourhood) 338 | if set_size > 1: 339 | coeff = num_edges / (set_size * (set_size - 1)) 340 | coeff = round(coeff, 3) 341 | coefficients[i] = coeff 342 | 343 | return coefficients 344 | 345 | def get_snap_effect(self): 346 | """ Read in pre-calculated snap results 347 | :return: snap scores per residue and mutation 348 | """ 349 | snap_scores = dict() 350 | with open(self.snap_file) as f: 351 | for line in f: 352 | if "=>" in line: 353 | splitted = line.split("=>") 354 | identifier = splitted[0].strip() 355 | scores = splitted[1].split("\\|") 356 | sum_tmp = scores[len(scores) - 1].strip() 357 | sum_parts = sum_tmp.split("=") 358 | sum_final = int(sum_parts[1].strip()) 359 | snap_scores[identifier] = sum_final 360 | # print(identifier + "\t" + str(sum_final)) 361 | return snap_scores 362 | 363 | def determine_snap_thresh(self, scores): 364 | """Determine the smallest value for SNAP2 still to be classified as having an effect 365 | :param scores 366 | :return smallest value 367 | """ 368 | values = list() 369 | for key in scores.keys(): 370 | values.append(scores[key]) 371 | sorted_values = sorted(values) 372 | index = int(len(values) * (1 - self.snap_percentage)) 373 | thresh = sorted_values[index] 374 | 375 | return thresh 376 | 377 | def get_evc_effect(self): 378 | """ Read in pre-calculated EVmutation results 379 | :return EVmutation scores per residue and mutation 380 | """ 381 | evc_scores = dict() 382 | with open(self.evmut_file) as f: 383 | for line in f: 384 | splitted = line.strip().split() 385 | # print(splitted) 386 | 
key = splitted[0] 387 | # print(key) 388 | value = float(splitted[1]) 389 | evc_scores[key] = value 390 | return evc_scores 391 | 392 | def determine_evc_thresh(self, scores): 393 | """Determine the smallest value for EVmutation still to be classified as having an effect 394 | :param scores 395 | :return smallest value 396 | """ 397 | positive_scores = list() 398 | for key in scores.keys(): 399 | val = scores[key] 400 | val = abs(val) 401 | positive_scores.append(val) 402 | sorted_scores = sorted(positive_scores) 403 | size = len(sorted_scores) 404 | num_scores = int(size * self.evc_percentage) 405 | index = size - num_scores 406 | thresh = sorted_scores[index] 407 | return thresh 408 | 409 | def get_blosum_diff(self, scores, thresh, method): 410 | """ Calculate difference between BLOSUM and effect predictions 411 | :param scores: effect prediction scores 412 | :param thresh: smallest value to still classify as having an effect 413 | :param method: method used to predict effects 414 | :return BLOSUM difference for each residue 415 | """ 416 | blosum_diff = dict() 417 | for i in range(1, self.length + 1): 418 | prog = re.compile("[A-Z]" + str(i) + "[A-Z]") 419 | num_accepted = 0 420 | num_mutations = 0 421 | for mut in scores.keys(): 422 | # print(mut) 423 | res = prog.search(mut) 424 | # print(res) 425 | if res is not None: 426 | aa1 = mut[:1] 427 | aa2 = mut[len(mut) - 1:] 428 | blosum_score = self.get_blosum_score(aa1, aa2) 429 | # print(mut + "\t" + aa1 + "\t" + aa2 + "\t" + str(blosum_score)) 430 | 431 | if blosum_score >= self.blosum_thresh: 432 | num_accepted += 1 433 | mut_score = scores[mut] 434 | if method == 'evmut': 435 | mut_score = abs(mut_score) 436 | if mut_score >= thresh: 437 | num_mutations += 1 438 | diff = 0.0 439 | if num_accepted > 0: 440 | diff = num_mutations / num_accepted 441 | diff = round(diff, 3) 442 | blosum_diff[i] = diff 443 | # print(str(i) + "\t" + str(diff)) 444 | return blosum_diff 445 | 446 | def read_blosum_matrix(self): 447 | 
"""Read the specificed blosum matrix""" 448 | line_num = 1 449 | pos_mapping = dict() 450 | with open(self.blosum_file) as f: 451 | for line in f: 452 | splitted = line.strip().split() 453 | if line_num == 1: 454 | for i in range(0, len(splitted)): 455 | pos = i + 1 456 | pos_mapping[pos] = splitted[i] 457 | else: 458 | pos1 = splitted[0] 459 | for j in range(1, len(splitted)): 460 | pos2 = pos_mapping[j] 461 | self.blosum_mat[pos1][pos2] = int(splitted[j]) 462 | line_num += 1 463 | 464 | def get_blosum_score(self, aa1, aa2): 465 | return self.blosum_mat[aa1][aa2] 466 | 467 | 468 | def error(*objs): 469 | print("ERROR: ", *objs, file=sys.stderr) 470 | -------------------------------------------------------------------------------- /bindPredictML17/scripts/select_model.py: -------------------------------------------------------------------------------- 1 | class SelectModel(object): 2 | @staticmethod 3 | def define_model(model): 4 | cols_to_remove = [] 5 | ranges = [] 6 | words = [] 7 | 8 | if model == "mm3_cons_solv_more": 9 | cols_to_remove = [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14, 10 | 16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 36, 37, 38, 39, 11 | 41, 42, 43, 44, 46, 47, 48, 49] 12 | ranges = [] 13 | words = ["no", "yes"] 14 | else: 15 | print("Unknown model specified") 16 | 17 | return cols_to_remove, ranges, words 18 | -------------------------------------------------------------------------------- /bindPredictML17/trained_model/mm3_cons_solv_more.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/bindPredictML17/trained_model/mm3_cons_solv_more.pkl -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | from pandas import DataFrame 2 | import torch 3 | import os 4 | import numpy 5 | import 
h5py 6 | import random 7 | from collections import defaultdict 8 | 9 | 10 | class FileSetter(object): 11 | 12 | @staticmethod 13 | def embeddings_input(): 14 | return '' 15 | # TODO set path to embeddings, this should be a .h5-file generated containing per-residue embeddings for all 16 | # proteins with key: UniProt-ID, value: embeddings 17 | 18 | @staticmethod 19 | def predictions_folder(): 20 | return '' # TODO set path to where predictions should be written 21 | 22 | @staticmethod 23 | def profile_db(): 24 | # TODO set path to pre-computed big_80 database 25 | # can be downloaded from ftp://rostlab.org/bindEmbed21/profile_db.tar.gz 26 | return '' 27 | 28 | @staticmethod 29 | def lookup_fasta(): 30 | # TODO set path to FASTA file of lookup set 31 | # can be downloaded from ftp://rostlab.org/bindEmbed21/lookup.fasta 32 | return '' 33 | 34 | @staticmethod 35 | def lookup_db(): 36 | # TODO set path to pre-computed lookup database 37 | # can be downloaded from ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz 38 | return '' 39 | 40 | @staticmethod 41 | def mmseqs_output(): 42 | # TODO set path to where MMseqs2 output folder should be written. 
43 | # tmp files will be stored in this folder in a sub-directory tmp/ 44 | # predictions will be stored in this folder in a sub-directory hbi_predictions/ 45 | return '' 46 | 47 | @staticmethod 48 | def query_set(): 49 | return '' # TODO set path to FASTA set of query sequences to generate predictions for 50 | 51 | @staticmethod 52 | def mmseqs_path(): 53 | return '' # TODO set path to MMseqs2 installation 54 | 55 | @staticmethod 56 | def split_ids_in(): 57 | return 'data/development_set/ids_split' # cv splits used during development; available on GitHub 58 | 59 | @staticmethod 60 | def test_ids_in(): 61 | return 'data/development_set/uniprot_test.txt' # test ids used during development; available on GitHub 62 | 63 | @staticmethod 64 | def fasta_file(): 65 | return 'data/development_set/all.fasta' # path to development set; available on GitHub 66 | 67 | @staticmethod 68 | def binding_residues_by_ligand(ligand): 69 | return 'data/development_set/binding_residues_2.5_{}.txt'.format(ligand) 70 | # files with binding labels used during development; available on GitHub 71 | 72 | 73 | class FileManager(object): 74 | 75 | @staticmethod 76 | def read_ids(file_in): 77 | """ 78 | Read list of ids into list 79 | :param file_in: 80 | :return: 81 | """ 82 | ids = [] 83 | with open(file_in) as read_in: 84 | for line in read_in: 85 | ids.append(line.strip()) 86 | 87 | return ids 88 | 89 | @staticmethod 90 | def read_fasta(file_in): 91 | """ 92 | Read sequences from FASTA file 93 | :param file_in: 94 | :return: dict with key: ID, value: sequence 95 | """ 96 | sequences = dict() 97 | current_id = None 98 | 99 | with open(file_in) as read_in: 100 | for line in read_in: 101 | line = line.strip() 102 | if line.startswith(">"): 103 | current_id = line[1:] 104 | sequences[current_id] = '' 105 | else: 106 | sequences[current_id] += line 107 | 108 | return sequences 109 | 110 | @staticmethod 111 | def read_embeddings(file_in): 112 | """ 113 | Read embeddings from .h5-file 114 | :param 
file_in: 115 | :return: dict with key: ID, value: per-residue embeddings 116 | """ 117 | 118 | embeddings = dict() 119 | with h5py.File(file_in, 'r') as f: 120 | for key, embedding in f.items(): 121 | embeddings[key] = numpy.array(embedding, dtype=numpy.float32) 122 | 123 | return embeddings 124 | 125 | @staticmethod 126 | def read_binding_residues(file_in): 127 | """ 128 | Read binding residues from file 129 | :param file_in: 130 | :return: 131 | """ 132 | binding = dict() 133 | 134 | with open(file_in) as read_in: 135 | for line in read_in: 136 | splitted_line = line.strip().split() 137 | if len(splitted_line) > 1: 138 | identifier = splitted_line[0] 139 | residues = splitted_line[1].split(',') 140 | residues_int = [int(r) for r in residues] 141 | 142 | binding[identifier] = residues_int 143 | 144 | return binding 145 | 146 | @staticmethod 147 | def read_mmseqs_alignments(file_in): 148 | """Read MMseqs2 alignments""" 149 | 150 | mmseqs = defaultdict(defaultdict) 151 | 152 | with open(file_in) as read_in: 153 | for line in read_in: 154 | splitted_line = line.strip().split() 155 | query_id = splitted_line[0] 156 | target_id = splitted_line[1] 157 | qstart = int(splitted_line[5]) 158 | tstart = int(splitted_line[6]) 159 | qaln = splitted_line[7] 160 | taln = splitted_line[8] 161 | 162 | mmseqs[query_id][target_id] = {'qstart': qstart, 'tstart': tstart, 'qaln': qaln, 'taln': taln} 163 | 164 | return mmseqs 165 | 166 | @staticmethod 167 | def save_cv_results(cv_results, file): 168 | """ 169 | Save CV results to csv file 170 | :param cv_results: 171 | :param file: 172 | :return: 173 | """ 174 | 175 | cv_dataframe = DataFrame.from_dict(cv_results) 176 | cv_dataframe.to_csv(path_or_buf=file) 177 | 178 | @staticmethod 179 | def save_classifier_torch(classifier, model_path): 180 | """Save pre-trained model""" 181 | torch.save(classifier, model_path) 182 | 183 | @staticmethod 184 | def load_classifier_torch(model_path): 185 | """ Load pre-saved model """ 186 | if 
torch.cuda.is_available(): 187 | device = 'cuda:0' 188 | else: 189 | device = 'cpu' 190 | classifier = torch.load(model_path, map_location=device) 191 | return classifier 192 | 193 | @staticmethod 194 | def write_predictions(proteins, out_folder, cutoff, ri): 195 | """ 196 | Write predictions for a set of proteins 197 | :param proteins: 198 | :param out_folder: 199 | :param cutoff: Cutoff to define whether a residue is binding or not 200 | :param ri: Should raw probabilities or RI be written to file? 201 | :return: 202 | """ 203 | for k in proteins.keys(): 204 | p = proteins[k] 205 | predictions = p.predictions 206 | predictions = predictions.squeeze() 207 | out_file = os.path.join(out_folder, (k + '.bindPredict_out')) 208 | 209 | FileManager.write_predictions_single_protein(out_file, predictions, cutoff, ri) 210 | 211 | @staticmethod 212 | def write_predictions_single_protein(out_file, predictions, cutoff, ri): 213 | """ Write predictions for a specific protein """ 214 | with open(out_file, 'w') as out: 215 | if ri: 216 | out.write("Position\tMetal.RI\tMetal.Class\tNuc.RI\tNuc.Class\tSmall.RI\tSmall.Class\tAny.Class\n") 217 | else: 218 | out.write("Position\tMetal.Proba\tMetal.Class\tNuclear.Proba\tNuclear.Class\tSmall.Proba\tSmall.Class" 219 | "\tAny.Class\n") 220 | for idx, p in enumerate(predictions): 221 | pos = idx + 1 222 | 223 | metal_proba = p[0] 224 | nuc_proba = p[1] 225 | small_proba = p[2] 226 | 227 | metal_ri = GeneralInformation.convert_proba_to_ri(metal_proba) 228 | nuc_ri = GeneralInformation.convert_proba_to_ri(nuc_proba) 229 | small_ri = GeneralInformation.convert_proba_to_ri(small_proba) 230 | 231 | metal_label = GeneralInformation.get_predicted_label(metal_proba, cutoff) 232 | nuc_label = GeneralInformation.get_predicted_label(nuc_proba, cutoff) 233 | small_label = GeneralInformation.get_predicted_label(small_proba, cutoff) 234 | 235 | overall_label = 'nb' 236 | if metal_label == 'b' or nuc_label == 'b' or small_label == 'b': 237 | 
overall_label = 'b' 238 | 239 | if ri: 240 | out.write('{}\t{:.3f}\t{}\t{:.3f}\t{}\t{:.3f}\t{}\t{}\n'.format(pos, metal_ri, metal_label, nuc_ri, 241 | nuc_label, small_ri, small_label, 242 | overall_label)) 243 | else: 244 | out.write('{}\t{:.3f}\t{}\t{:.3f}\t{}\t{:.3f}\t{}\t{}\n'.format(pos, metal_proba, metal_label, 245 | nuc_proba, nuc_label, small_proba, 246 | small_label, overall_label)) 247 | 248 | 249 | class GeneralInformation(object): 250 | 251 | @staticmethod 252 | def get_predicted_label(proba, cutoff): 253 | if proba >= cutoff: 254 | return 'b' 255 | else: 256 | return 'nb' 257 | 258 | @staticmethod 259 | def convert_proba_to_ri(proba): 260 | """Convert probability to RI ranging from 0 to 9""" 261 | 262 | if proba < 0.5: 263 | ri = round((0.5 - proba) * 9 / 0.5) 264 | else: 265 | ri = round((proba - 0.5) * 9 / 0.5) 266 | 267 | return ri 268 | 269 | @staticmethod 270 | def seed_worker(worker_id): 271 | worker_seed = torch.initial_seed() % 2 ** 32 272 | numpy.random.seed(worker_seed) 273 | random.seed(worker_seed) 274 | 275 | @staticmethod 276 | def seed_all(seed): 277 | if not seed: 278 | seed = 10 279 | 280 | # print("[ Using Seed : ", seed, " ]") 281 | 282 | torch.manual_seed(seed) 283 | torch.cuda.manual_seed_all(seed) 284 | torch.cuda.manual_seed(seed) 285 | numpy.random.seed(seed) 286 | random.seed(seed) 287 | torch.backends.cudnn.deterministic = True 288 | torch.backends.cudnn.benchmark = False 289 | 290 | @staticmethod 291 | def remove_padded_positions(pred, target, i): 292 | indices = (i[i.shape[0] - 1, :] != 0).nonzero() 293 | 294 | pred_i = pred[:, indices].squeeze() 295 | target_i = target[:, indices].squeeze() 296 | 297 | return pred_i, target_i 298 | -------------------------------------------------------------------------------- /data/development_set/ids_split1.txt: -------------------------------------------------------------------------------- 1 | Q5LL55 2 | H9L4N9 3 | O34738 4 | P39579 5 | P01887 6 | O32221 7 | P0CL67 8 | A7VAB4 9 | 
O60895 10 | P86179 11 | P58568 12 | Q9Y3B4 13 | Q9NWV4 14 | Q9KQN0 15 | P85511 16 | Q2VE61 17 | P18138 18 | Q9H5X1 19 | Q7Z4H3 20 | B2FQ63 21 | Q05097 22 | P20116 23 | F6KMV5 24 | A9CID9 25 | P46926 26 | D0VWR5 27 | O34918 28 | Q8TLY9 29 | Q81HL8 30 | Q9RJC1 31 | P15570 32 | Q9RY97 33 | B3Y002 34 | P07386 35 | P77754 36 | P94690 37 | C9K1X5 38 | P04382 39 | O32108 40 | P32021 41 | P17900 42 | P43933 43 | P0AE05 44 | D0VWX2 45 | P49789 46 | A0A384LKY8 47 | P67700 48 | Q58830 49 | O87496 50 | Q52424 51 | C3SZN7 52 | P50049 53 | P16108 54 | P05161 55 | Q9RN60 56 | P0CX80 57 | G0Z026 58 | P67809 59 | P83467 60 | P10175 61 | F1NZ18 62 | Q47765 63 | Q82Z21 64 | H9NAL3 65 | Q9KCV1 66 | Q07341 67 | E7E815 68 | A7UQX3 69 | P04038 70 | O68396 71 | Q8Z4D7 72 | Q77DJ5 73 | P48061 74 | Q8XMB9 75 | P27999 76 | I0BZV0 77 | D1CIZ5 78 | Q0P9D1 79 | Q74G82 80 | P07737 81 | Q00277 82 | Q96FJ2 83 | P02876 84 | Q8GGH0 85 | P04418 86 | P29256 87 | Q1Q7P3 88 | P03606 89 | Q05776 90 | P23907 91 | P14135 92 | P0A6I0 93 | Q65NU7 94 | Q5LAA6 95 | Q38HX3 96 | Q6XBH1 97 | Q96YV5 98 | Q8RQE7 99 | O58552 100 | P08190 101 | Q8NI22 102 | P67825 103 | L7P7R7 104 | P06762 105 | P76077 106 | P81180 107 | Q65163 108 | Q5NHD0 109 | Q9KTK0 110 | G2QAB5 111 | P22483 112 | P12238 113 | Q82SE2 114 | Q6SJ71 115 | O14960 116 | Q8ZVX8 117 | P12240 118 | D6ES11 119 | P0ABE5 120 | Q6ZYH1 121 | P24321 122 | Q6MHT0 123 | P08773 124 | P17413 125 | Q966X9 126 | A7J283 127 | I3VE74 128 | P16458 129 | P0ABS1 130 | Q89IH6 131 | Q9HD34 132 | O13881 133 | Q56320 134 | B3CET1 135 | Q8AA93 136 | P05109 137 | O26253 138 | P04918 139 | O26413 140 | Q9WZF7 141 | Q91EV7 142 | A6ZWK1 143 | P37975 144 | Q04416 145 | Q7A260 146 | Q7VYU0 147 | O86327 148 | P21149 149 | Q8MTC1 150 | A7V213 151 | P00974 152 | Q8RM02 153 | O31562 154 | O25671 155 | O51419 156 | Q9UG22 157 | O28028 158 | Q9RFC8 159 | P10971 160 | P0ACI6 161 | P09132 162 | Q97TX9 163 | P12733 164 | P14121 165 | P09353 166 | P13183 167 | P00147 168 | P0A6X7 169 | 
Q8TL28 170 | P24473 171 | A4CYJ6 172 | O29089 173 | Q9HLN2 174 | P03051 175 | Q5HHM4 176 | P12312 177 | Q27415 178 | Q8ZPA3 179 | P10282 180 | Q8WWA0 181 | P03882 182 | A0QRC4 183 | O27725 184 | Q5SJH3 185 | O66665 186 | Q7PXT9 187 | P60618 188 | Q8I3W2 189 | Q0PB48 190 | P39135 191 | P02883 192 | P62669 193 | O43924 194 | B5Z8Y5 195 | P74917 196 | P58566 197 | P80312 198 | P59087 199 | Q983T0 200 | Q6NDF6 201 | Q9Z1R3 202 | P32761 203 | B2J933 204 | -------------------------------------------------------------------------------- /data/development_set/ids_split2.txt: -------------------------------------------------------------------------------- 1 | Q4QRE8 2 | Q88R27 3 | A4GRE3 4 | Q9X0H1 5 | P04531 6 | Q4J9K9 7 | Q8EDL6 8 | P80377 9 | P33678 10 | A3D092 11 | P0C794 12 | Q5JIH3 13 | Q9X1W7 14 | Q0KFV0 15 | P32411 16 | Q91IE3 17 | Q8IK57 18 | Q2PYN0 19 | A7XUK7 20 | Q748S4 21 | Q8E9W8 22 | A6T925 23 | Q93SX1 24 | B9HM14 25 | Q32WH4 26 | Q9LA15 27 | Q8U440 28 | Q53W28 29 | Q9BTP7 30 | P80176 31 | A4FTK7 32 | Q96CS7 33 | Q71AW2 34 | P0A7D1 35 | Q99584 36 | P13988 37 | A9A2G4 38 | Q00457 39 | Q48EL2 40 | P80721 41 | Q8VZS8 42 | Q7NSA6 43 | P72583 44 | C1IFD2 45 | Q805N7 46 | Q55670 47 | Q47UY7 48 | Q55330 49 | Q5KYT1 50 | O43255 51 | B4F366 52 | P32321 53 | P64488 54 | Q5T4W7 55 | Q9HQM9 56 | P08525 57 | Q8DM36 58 | Q9ALS3 59 | P02244 60 | Q54X65 61 | A0A0F7RDM3 62 | P26789 63 | Q472T4 64 | Q57829 65 | P15369 66 | P0CI74 67 | P54173 68 | P09598 69 | Q0B0G9 70 | Q9UUI1 71 | Q9QZ88 72 | P22139 73 | Q5Y812 74 | O95639 75 | Q9RMI1 76 | P68919 77 | Q5KVJ9 78 | O43598 79 | Q8YJY7 80 | Q9TVF0 81 | S0BAP9 82 | P12737 83 | P13271 84 | Q0P891 85 | Q8E372 86 | P0AC25 87 | Q8RFH4 88 | Q79ZR9 89 | E0TW95 90 | P35530 91 | Q3SFD8 92 | P74903 93 | Q8KFZ1 94 | Q1HEA2 95 | P05019 96 | P53738 97 | Q9HL57 98 | P00273 99 | Q7X0D9 100 | Q8YVQ2 101 | Q96D96 102 | Q9DEF8 103 | P14119 104 | Q2FZE9 105 | P52391 106 | P08203 107 | P65121 108 | P81605 109 | P11456 110 | Q64288 111 | Q9KKU4 112 | 
A9CHM9 113 | P01543 114 | P04168 115 | P26788 116 | Q4K423 117 | Q9A1G2 118 | O43583 119 | P02210 120 | P00423 121 | B9A5C1 122 | O25841 123 | Q8IU26 124 | Q58667 125 | Q9RSM4 126 | Q72QS5 127 | P58246 128 | Q9S508 129 | P44887 130 | Q9LUV2 131 | P03042 132 | Q97G05 133 | Q8VL32 134 | P37001 135 | Q5SJ76 136 | P03206 137 | P06465 138 | B9A0T7 139 | Q9BUL8 140 | Q8ZC05 141 | Q6L6Q4 142 | Q96HA8 143 | Q5YUF0 144 | Q92M25 145 | P60022 146 | Q9NAV8 147 | Q8YQN0 148 | D5MNX7 149 | D8NA05 150 | P22452 151 | P14116 152 | Q9Y657 153 | Q5LUH0 154 | Q16873 155 | D2Z0P1 156 | O25423 157 | Q9RIM2 158 | D5X329 159 | P41261 160 | A2SL15 161 | C6KFA4 162 | Q9WYJ5 163 | O28416 164 | O58307 165 | P10153 166 | Q9NRN9 167 | C6AAT5 168 | O58836 169 | Q8IWS0 170 | P19753 171 | P08877 172 | P39639 173 | Q67XG0 174 | P07471 175 | Q6XQ58 176 | P9WLS2 177 | Q2UP89 178 | Q97WD2 179 | Q5F9K6 180 | P21163 181 | P0A805 182 | P76632 183 | Q82EE4 184 | P82291 185 | P60033 186 | Q0RYE0 187 | Q7NSZ5 188 | P76014 189 | O02372 190 | P71650 191 | O13610 192 | P32055 193 | S3TFW2 194 | P0A410 195 | B0R5M0 196 | P75430 197 | Q5SK02 198 | Q9FBN7 199 | P02966 200 | Q96A44 201 | Q936H5 202 | A0A0H2W6Y8 203 | P38636 204 | -------------------------------------------------------------------------------- /data/development_set/ids_split3.txt: -------------------------------------------------------------------------------- 1 | P23827 2 | Q2G4M1 3 | Q8TVS2 4 | P43777 5 | Q6N4M9 6 | P46952 7 | C1D7P6 8 | P32173 9 | Q8TQ93 10 | A1TY68 11 | Q504Y3 12 | Q9I3C8 13 | Q5SHN3 14 | P69506 15 | B8FYU2 16 | O25064 17 | Q8VK10 18 | P06107 19 | Q9KAX6 20 | P12732 21 | Q9HRE7 22 | Q65LG7 23 | Q46582 24 | Q0SDB1 25 | Q8K2P6 26 | Q8QL27 27 | Q6JXI5 28 | A8ALU5 29 | P03712 30 | A0QRS5 31 | Q5UQA5 32 | Q5L811 33 | P04608 34 | Q9V535 35 | Q9ZAG3 36 | P39186 37 | A7Z2A9 38 | Q0PBW2 39 | A8MT69 40 | P39301 41 | Q01469 42 | Q06549 43 | Q2F9Z1 44 | Q9BX68 45 | P72761 46 | Q57DY1 47 | Q15036 48 | Q9K0G4 49 | O82040 50 | O43809 51 | 
Q9REI7 52 | Q9JYL1 53 | Q8T893 54 | O66511 55 | B6F143 56 | A7UNK4 57 | P19267 58 | P63272 59 | Q10589 60 | Q64368 61 | P04355 62 | Q37964 63 | D5CN26 64 | Q2RRQ9 65 | Q71TT1 66 | P18317 67 | Q97WQ4 68 | Q9NG96 69 | C4LSE7 70 | Q74AE4 71 | Q5ZYB7 72 | Q91LD0 73 | P03047 74 | P00430 75 | Q4CQE2 76 | Q4PP54 77 | P12306 78 | P0ACT8 79 | O24984 80 | D0VWV4 81 | F2Z288 82 | Q8GPH6 83 | Q6NEL2 84 | Q99WU4 85 | O30176 86 | Q9HSF4 87 | Q3AB29 88 | P0A8J2 89 | P12946 90 | A0A2B6C3P9 91 | P84887 92 | V9P0A9 93 | F9UMW3 94 | Q9LUJ3 95 | P01574 96 | Q1CAM3 97 | Q7RTV0 98 | P69783 99 | Q2FZ56 100 | Q9RQB9 101 | Q9NA73 102 | Q5KXY4 103 | Q3ZDQ9 104 | B5THI3 105 | P82242 106 | Q5ZT91 107 | P0A6D0 108 | Q70LE8 109 | Q9UHP7 110 | Q95VR4 111 | P07013 112 | Q82ZC8 113 | P35804 114 | Q9RHJ6 115 | Q7M4F9 116 | P00641 117 | O41094 118 | Q8SUP0 119 | P15927 120 | Q7DB61 121 | D0A8L1 122 | Q8RCC3 123 | P08246 124 | A2KD59 125 | P07470 126 | O41136 127 | A7N805 128 | Q8GTM0 129 | Q89VT6 130 | P13513 131 | Q8DP99 132 | O18426 133 | Q9SN73 134 | Q8A8A4 135 | Q8RLX2 136 | P0CY08 137 | O96048 138 | Q8NQA4 139 | P0A731 140 | Q8DLM0 141 | Q739M9 142 | Q00433 143 | D3E4S5 144 | Q8EII5 145 | Q7CPA2 146 | Q58AD3 147 | O34911 148 | Q6N0K4 149 | Q9HV14 150 | P80382 151 | Q9YEQ6 152 | P9WPG7 153 | P21816 154 | P19508 155 | Q99616 156 | Q7NSS5 157 | Q5QUJ8 158 | P22450 159 | P76364 160 | Q9KXR9 161 | P18434 162 | Q5SHP2 163 | P02775 164 | E0TXX3 165 | L7R9I1 166 | E0TY72 167 | Q9VAK8 168 | Q45488 169 | P0ADF8 170 | Q05315 171 | P50618 172 | P59082 173 | O59893 174 | P29602 175 | Q5LD59 176 | P62010 177 | P11690 178 | P17728 179 | P23873 180 | P0A3Y1 181 | O77421 182 | Q9F1R6 183 | P03126 184 | O74017 185 | Q8YYB7 186 | P10970 187 | F4JL28 188 | D6E269 189 | P79085 190 | Q9JKE2 191 | P0AAM2 192 | Q14493 193 | O34994 194 | J7MFT5 195 | Q5SHE1 196 | Q59646 197 | Q72LF4 198 | Q7SID0 199 | Q74ML9 200 | P74564 201 | P50616 202 | M1MR49 203 | Q8NT71 204 | 
-------------------------------------------------------------------------------- /data/development_set/ids_split4.txt: -------------------------------------------------------------------------------- 1 | P65401 2 | D7GNB4 3 | P01175 4 | P49458 5 | O95931 6 | P60619 7 | Q5SLL2 8 | Q64438 9 | A9CKT4 10 | Q9KJ88 11 | Q7MVF6 12 | O53623 13 | P20047 14 | Q9KKG7 15 | Q3M4Y5 16 | P06148 17 | Q44635 18 | Q8WQK3 19 | P44096 20 | A0A0F7RHX8 21 | P04570 22 | P47224 23 | Q38242 24 | P23657 25 | Q9UK53 26 | P69202 27 | P12376 28 | P00210 29 | P03496 30 | Q7DDR9 31 | Q81JI2 32 | O67789 33 | Q6YRW8 34 | Q8PJZ4 35 | P61459 36 | Q3JFV4 37 | P0A8U6 38 | Q93CG9 39 | P0AFX7 40 | P01552 41 | A2RI47 42 | Q46R41 43 | Q1W640 44 | P74174 45 | Q9NPF0 46 | Q01468 47 | Q5A0X8 48 | P28074 49 | Q9IPS9 50 | Q9A5I0 51 | Q03148 52 | P01555 53 | Q8AAW0 54 | P26790 55 | Q9HTR2 56 | Q9HY80 57 | D7GXG5 58 | Q6CK86 59 | P84801 60 | Q800Y1 61 | Q53908 62 | P61969 63 | Q2MHR1 64 | P56221 65 | Q9ZKJ5 66 | D6WJ77 67 | P00217 68 | Q8GAV3 69 | Q7AKQ8 70 | P07465 71 | Q86SG5 72 | Q9HJY3 73 | Q2U8Y3 74 | O52512 75 | Q7PGA3 76 | Q8XVB9 77 | Q9X7H4 78 | Q8N6N7 79 | F3ZXK6 80 | Q6IMW5 81 | Q07524 82 | P44558 83 | Q04I02 84 | Q58584 85 | G7J032 86 | Q6RJQ3 87 | P10972 88 | Q206Z5 89 | Q9HWW1 90 | Q08826 91 | Q4FPZ7 92 | Q84AQ1 93 | P58560 94 | L7IX95 95 | Q27084 96 | Q97ZF4 97 | A5WF35 98 | P0A333 99 | P27838 100 | P73048 101 | Q6N5V5 102 | P56178 103 | O58368 104 | E8WYN5 105 | Q2F1K8 106 | Q5F5Y8 107 | P12239 108 | Q55169 109 | Q9ZND7 110 | Q8DQG2 111 | Q47764 112 | P77214 113 | Q1EMV2 114 | Q8NS29 115 | W8VZW3 116 | Q17091 117 | P04043 118 | P84138 119 | Q8YY42 120 | P0AC44 121 | Q4K8M0 122 | P14682 123 | O77093 124 | Q08840 125 | P69776 126 | P13689 127 | G2RUZ1 128 | P22365 129 | O59172 130 | P48060 131 | Q2ITY5 132 | P0A459 133 | B9W5G6 134 | C3W947 135 | Q58164 136 | Q04D30 137 | Q82134 138 | O26745 139 | P0AEK0 140 | O67778 141 | P27797 142 | P03536 143 | P23308 144 | Q7BSV8 145 | P39234 146 | Q8WRW5 147 | 
O83795 148 | Q09176 149 | I6X7F9 150 | A4DA31 151 | P52102 152 | P0AEV9 153 | P00816 154 | P40422 155 | Q8WQM6 156 | Q2FVD0 157 | Q97ZE3 158 | O76242 159 | P24005 160 | Q2G1A5 161 | P19399 162 | P03052 163 | P60568 164 | Q9FA38 165 | O32080 166 | Q2FQ04 167 | A7ATL3 168 | Q0I165 169 | P04390 170 | Q7LFL8 171 | A6TB83 172 | Q59DX8 173 | Q8YNB0 174 | Q9SUQ8 175 | O88273 176 | P27707 177 | Q1HRL7 178 | Q9ZLI1 179 | Q5SLP8 180 | O25323 181 | P38424 182 | Q973T5 183 | A5GZW8 184 | P39146 185 | Q8U1Y4 186 | A4GRC7 187 | P12241 188 | Q46864 189 | Q44057 190 | Q9X113 191 | P00383 192 | P81303 193 | K1QRB6 194 | Q0VSW0 195 | Q716G6 196 | Q8WTQ1 197 | O88188 198 | Q8Y3Y7 199 | Q6SVB6 200 | Q1LKZ5 201 | Q8A7Y0 202 | A3F715 203 | Q3Y316 204 | -------------------------------------------------------------------------------- /data/development_set/ids_split5.txt: -------------------------------------------------------------------------------- 1 | P72321 2 | A6WG04 3 | Q9P126 4 | Q9Y547 5 | P48329 6 | Q04837 7 | Q9I3B4 8 | P00592 9 | A1JSS7 10 | Q28TU0 11 | Q9BXU0 12 | Q928A2 13 | P00649 14 | P00428 15 | P35926 16 | Q9I4L4 17 | Q02169 18 | P0AFC1 19 | Q8DKM3 20 | P11746 21 | P24297 22 | P68168 23 | P46797 24 | C4MBE2 25 | D5SZ58 26 | Q7SIB3 27 | Q9NZV6 28 | Q39TV9 29 | E9AND8 30 | Q9LW31 31 | Q9KY22 32 | Q9H777 33 | Q4R0L3 34 | P01127 35 | Q8GC87 36 | Q1Q7P4 37 | Q9DHS8 38 | Q934G3 39 | C8BD48 40 | Q12178 41 | Q9NPJ3 42 | Q976J8 43 | P67870 44 | Q9NV35 45 | P27198 46 | P24224 47 | P23154 48 | P45718 49 | P29717 50 | D0E9M1 51 | Q53396 52 | P80734 53 | Q8ISI9 54 | Q5SID6 55 | P09391 56 | P13123 57 | P50187 58 | P43410 59 | B3GA02 60 | P25052 61 | P49949 62 | P62661 63 | O29432 64 | O42978 65 | P35244 66 | Q93S40 67 | P16113 68 | Q57468 69 | Q9BYN0 70 | Q92730 71 | Q83WS1 72 | O36005 73 | G2SLH8 74 | R9RY64 75 | Q3ZTX8 76 | Q2T8S1 77 | Q8ZVF7 78 | P76045 79 | P02763 80 | Q5KY38 81 | P22027 82 | Q8C6P8 83 | P49771 84 | P73954 85 | P25604 86 | P0A7F3 87 | D0E8I5 88 | O28604 89 | Q9X1H4 
90 | Q8EF49 91 | Q8W453 92 | Q9NYG5 93 | P32410 94 | Q8PBM3 95 | Q5SJ82 96 | Q87TZ9 97 | P02753 98 | P03612 99 | B3LQW9 100 | P68265 101 | Q13PU4 102 | P06010 103 | O30133 104 | P15692 105 | Q9KFA8 106 | Q9LPZ1 107 | Q9KU27 108 | P04862 109 | B3QER1 110 | Q9KCC6 111 | P27505 112 | E6Z0R4 113 | C8V0B5 114 | P37108 115 | P0AC69 116 | B3FQS5 117 | Q9WZ50 118 | Q4UNB3 119 | Q8NRS3 120 | P00099 121 | P63228 122 | Q2G8A3 123 | Q9F1K9 124 | D0E7M2 125 | O07517 126 | Q52SA8 127 | C6F3U5 128 | O28085 129 | Q9RBY6 130 | E2EKP5 131 | D1BQI7 132 | P81860 133 | Q9XG81 134 | P18000 135 | Q9LSX0 136 | V6F235 137 | Q8N5K1 138 | P13298 139 | Q6KD95 140 | Q9EV84 141 | O28417 142 | P08854 143 | Q6UV28 144 | Q9RHG4 145 | P29772 146 | M1QWM5 147 | Q00458 148 | O15263 149 | P56106 150 | P0A182 151 | P33788 152 | D3HJY4 153 | Q9SPD4 154 | A2SQK8 155 | Q86YI8 156 | Q9ES57 157 | Q2RVS1 158 | O76096 159 | Q716G8 160 | Q63P16 161 | P0CI76 162 | P95673 163 | A1U1C4 164 | L0N3Y0 165 | Q2W014 166 | Q74NK7 167 | G8XHD5 168 | Q8EAP9 169 | O31667 170 | P52754 171 | Q57851 172 | B4EDC1 173 | P41500 174 | O51615 175 | Q5FBS0 176 | P97253 177 | Q46604 178 | B6HWK0 179 | P59665 180 | Q5TA50 181 | Q9UT12 182 | Q8NJY3 183 | P0ADV7 184 | P12962 185 | O87963 186 | O66186 187 | Q6XK79 188 | P60616 189 | Q8ZPC0 190 | Q9UGC6 191 | P10599 192 | P08337 193 | P73213 194 | P08037 195 | B1YKD7 196 | P43457 197 | P0A790 198 | Q46982 199 | D1Z0H7 200 | Q5SLE7 201 | Q9D967 202 | Q2VZ87 203 | -------------------------------------------------------------------------------- /data/development_set/uniprot_test.txt: -------------------------------------------------------------------------------- 1 | D0VX23 2 | B7J6R7 3 | Q3IDI7 4 | E9JSA3 5 | P12734 6 | P80547 7 | P39621 8 | Q8KRV3 9 | Q747J2 10 | P10245 11 | P60848 12 | Q9EV85 13 | P07603 14 | Q03243 15 | P10868 16 | A6L7X2 17 | Q5SK07 18 | P40065 19 | Q7A1N5 20 | Q44501 21 | Q5NV90 22 | Q6D2K4 23 | P63165 24 | Q9HC16 25 | Q8A8Q1 26 | P68661 27 | Q7VWF8 28 | Q01747 29 | 
Q8GCY3 30 | Q8ZJW4 31 | P56930 32 | Q72EF4 33 | O29338 34 | B7IE18 35 | P00426 36 | Q9HVI1 37 | Q46898 38 | Q8DRZ8 39 | Q81BL7 40 | P15328 41 | Q8T6U0 42 | P19656 43 | Q8LGG8 44 | O92323 45 | A2I2W2 46 | Q13M28 47 | Q15369 48 | Q9SE33 49 | D2Z0P2 50 | Q96GG9 51 | Q8P4Q6 52 | P45696 53 | P69687 54 | Q57587 55 | P0CB20 56 | Q929T5 57 | B9DL91 58 | Q8IJK2 59 | G2EA45 60 | P29460 61 | P07059 62 | P80882 63 | Q8IKH2 64 | C8WS74 65 | Q9KFV3 66 | Q9BVM4 67 | Q9NVS9 68 | A0L5S6 69 | B7TYB2 70 | G0RUC2 71 | Q06151 72 | Q96YD0 73 | Q8DJ43 74 | Q3E840 75 | Q9A585 76 | P00698 77 | P00648 78 | P00766 79 | P05373 80 | P00651 81 | P22636 82 | P03050 83 | P0ACH5 84 | P19793 85 | P56406 86 | Q05599 87 | P16184 88 | P63159 89 | P19080 90 | P07445 91 | P06956 92 | P09184 93 | P40347 94 | Q55389 95 | O15527 96 | Q5SJ80 97 | P50861 98 | P02263 99 | P84229 100 | P62801 101 | P0A0Z8 102 | P45850 103 | P03772 104 | P31570 105 | P30197 106 | P05798 107 | P0AFJ5 108 | P31992 109 | P03013 110 | O29428 111 | P22061 112 | P0A7A9 113 | P08692 114 | O28597 115 | P03004 116 | Q9X1X6 117 | P44007 118 | P0AF28 119 | P0A006 120 | P0ACP7 121 | P0A881 122 | Q8A1E9 123 | P50465 124 | P25524 125 | P46859 126 | Q9HAN9 127 | P84233 128 | P62799 129 | P06897 130 | P02281 131 | Q9HLQ2 132 | P33905 133 | P00806 134 | O15496 135 | P03034 136 | P56653 137 | P06534 138 | P56839 139 | P75914 140 | P17802 141 | P10828 142 | P19921 143 | Q93PP9 144 | P13393 145 | P42938 146 | P0ACT4 147 | Q7MHK3 148 | P00630 149 | P04392 150 | P11350 151 | Q56148 152 | Q05783 153 | Q06592 154 | P0A6C1 155 | P84131 156 | Q06241 157 | P39075 158 | P03699 159 | O29975 160 | Q9SF23 161 | P34257 162 | O34598 163 | P0ADU2 164 | P06134 165 | P76502 166 | P46072 167 | O33833 168 | O59248 169 | O05510 170 | Q56408 171 | P82543 172 | Q9CAQ2 173 | Q9SE00 174 | O06553 175 | P44886 176 | P56073 177 | Q9WZC6 178 | P44688 179 | P21673 180 | P0A7I7 181 | Q48255 182 | O41156 183 | P95855 184 | O59188 185 | Q5SJ65 186 | O66188 187 | Q46389 188 | 
P00894 189 | O52646 190 | O34522 191 | O95989 192 | P0A6G7 193 | O25928 194 | P22069 195 | P0AGB6 196 | Q13907 197 | Q96EY8 198 | O66615 199 | Q51948 200 | P03018 201 | P33643 202 | Q9I0M3 203 | Q96EK6 204 | A1L259 205 | Q9GZZ1 206 | Q8Z2A5 207 | O31465 208 | O50580 209 | Q6T1W8 210 | P76491 211 | Q6D820 212 | Q7BSH1 213 | A9CKY2 214 | P39058 215 | P08038 216 | P0A6L4 217 | Q8TDX5 218 | Q8SXK5 219 | O31801 220 | O29537 221 | Q8NMG3 222 | P0A8P1 223 | P02958 224 | Q7A827 225 | P86383 226 | P77544 227 | P96692 228 | P0A988 229 | P00447 230 | P0ACY1 231 | P0A0I7 232 | Q6NS38 233 | P03036 234 | Q00459 235 | B8FW11 236 | P53615 237 | P32643 238 | P76143 239 | Q979C2 240 | P49914 241 | O87008 242 | O75936 243 | P0ACE7 244 | P37146 245 | Q74Y93 246 | P96693 247 | Q9I0Q8 248 | Q06S87 249 | P0CL19 250 | Q6NAM1 251 | Q51507 252 | Q7CRQ0 253 | Q66GT5 254 | P03697 255 | O74859 256 | Q9HX08 257 | P36639 258 | P40363 259 | P0C960 260 | Q9Y530 261 | B5XYG3 262 | Q9C9G4 263 | O35013 264 | P0ABN1 265 | P31580 266 | Q81UB2 267 | A5F8G9 268 | Q8G907 269 | P0A6Q3 270 | Q96GX9 271 | P27431 272 | Q9UZK4 273 | C1I202 274 | O07051 275 | P39840 276 | O32085 277 | Q6PHW0 278 | P09528 279 | Q86UY6 280 | P0A776 281 | O25094 282 | P84907 283 | Q9I3A4 284 | Q96FX7 285 | P60955 286 | I6WU39 287 | Q9KKS5 288 | O06662 289 | A0QWG5 290 | P09601 291 | P0A998 292 | Q5VWZ2 293 | P0ADA1 294 | A1KVH5 295 | O32163 296 | P0A951 297 | O59413 298 | O69787 299 | P9WKH3 300 | P9WFF7 301 | -------------------------------------------------------------------------------- /data/independent_set/binding_residues_metal.txt: -------------------------------------------------------------------------------- 1 | B3TFG2 25,139,60,63 2 | Q57727 67,70,79,80,63 3 | M1GMS4 401,396,399 4 | P0A2K1 256,66,69,231,232,71,268,270,304,306,308,309 5 | P32021 268,189,191 6 | C6H0Y9 134,41,107,108,80,119,120 7 | P0CS93 70,73,75,268,269,537,622,78,182,183,536,313 8 | F8UNJ8 755,766,751 9 | P00918 
96,64,129,34,4,132,198,36,106,15,19,119,94 10 | Q236S9 481,610,533 11 | P21347 353,355,260,325,297,361,363,364,461,463,401,342,377,381,319 12 | P46873 204,94 13 | A8E7C6 285,164,250,299,363,216,218,251,252,253,254,287 14 | W0TJ64 81,76,77,78 15 | A0A0H2YN38 321,323,326,204,205,1431,1433,1435,93,1439,96,98,101,426,427,1388,318 16 | P22897 743,745,751,689,691,695,765,766 17 | P06746 192,256,288,145,252,121,124,285,190 18 | P29317 757,646,663 19 | F1QCV2 176,267,174 20 | Q84II6 197,183,253,187,333,254 21 | P00800 398,409,415,416,417,289,419,291,293,422,423,425,426,427,429,432,370,374,378 22 | O28951 224,159,358,359,360,361,362,427,220,221,223 23 | O74036 141,143,144,145,146 24 | Q709H6 64,67,140,109,145,149,92,95 25 | A9A1Y2 449,453,295,299 26 | P62826 24,42 27 | D1A7C3 264,265,202,204,269,274,276,222,223,224,225,227,228,229,188,189 28 | P00760 75,77,78,80,81,82,85 29 | P19573 256,257,130,259,324,323,326,267,494,270,433,273,382 30 | P02791 97,35,46,145,49,51,148,50,53,118,119,123,127 31 | Q9Y233 515,582,519,553,554,623,664 32 | E8MGH8 417,418,338,340 33 | Q08499 453,455,456,620,466,502,503 34 | Q1D0B6 64,66,69 35 | D3DJ42 274,287 36 | Q9X0L2 96,129,132,19,52,55 37 | Q2XVP4 166,39,71,41,11,44,45,49,55,249,158,254 38 | P02554 11,69 39 | E1BQ43 74,331,333 40 | P21589 36,38,298,299,235,237,208,243,85,117,213,214,220 41 | H0SLX7 163,168,112,117,152,153 42 | Q9K2N0 198,201,202,205,91,40,169,41,42,43,240,114,179,116,118,58,59,188,189,190 43 | C7C422 208,122,120,250,124,189 44 | P52699 176,97,99,215,157,95 45 | Q9H816 35,36,120 46 | W8QPS6 194,229,6,231,230,41 47 | A0A0H3MRW3 146,195,150 48 | P61769 77,37,78,103,104,105,107,93,94 49 | Q8WWQ0 1393,1394,1396 50 | E2GIN1 161,212,159 51 | Q17TM8 64,393,396,334,338,344,347,348 52 | K7RJ88 98,171,172 53 | A0A411MR89 448,225,407,409,311,281,215,315,189,318 54 | P01116 17,34,35,57,58,92,95 55 | P04049 165,103,168,139,173,176,125,184,155,152 56 | P79800 17,35 57 | P42592 454,572,456,458,460,462 58 | P54818 676,493,527,495 59 | P07306 
197,265,266,240,242,243,278,216,253,254,191 60 | Q9H7B4 65,261,263,71,266,75,208,49,83,52,87,62 61 | P31725 68,70,72,74,79,21,24,27,28,29,31,32,37,103,105,107 62 | P0DTD1 5379,4879,3345,4370,5011,5396,4373,5399,3484,4381,3486,4383,5152,5034,5037,5038,3526,5192,3529,4687,5329,5203,5332,4693,1752,1753,1754,1755,5340,4698,4702,5343,5350,4327,5353,4330,5357,4336,5363,1142,4343,3321,1787,1789,5374 63 | Q16769 202,107,330,76,276,148,280,218,222,159 64 | Q7XSK0 82,85 65 | P47205 241,277,279,237,78 66 | P35270 257,260 67 | Q83V25 294,9,11,177 68 | A0A402C2V4 328,201,26,24,255 69 | A0A402C2Q3 265,337,211,340,25,27 70 | Q9Y253 231,13,14,113,115,116 71 | A0A3B6UEU3 132,134,295,310 72 | Q9D4J1 163,104,108,140,110,142,144,146,115,151,157 73 | Q59FX0 962,965,970,973 74 | P04406 289,291 75 | P26664 1123,1125,1171,1175 76 | F2NWD3 294,79,83,308,309,310 77 | A0A0N9JNY6 832,865,866,650,654,883,884,694,599,600,602,603,892,861,862 78 | Q05097 101,37,105,108,109 79 | Q9GZT9 374,313,315 80 | A0A1M6Y2K1 137,136,73,138,76 81 | O43426 591,543 82 | A0A5P5Z9C8 132,37 83 | A0A3B6UEP6 128,34,26,31,29,127 84 | P80025 227,675,301,303,305,307,339 85 | Q92769 191,194,198,199,200,265,175,176,177,179,188,223 86 | Q5CYN0 629,426,477 87 | B2FTM1 181,246,105,107,109,110 88 | Q70MM3 139,108,25,60,63 89 | Q8W4D0 481,486,599,600,601,506,509,511 90 | Q16790 226,228,251 91 | G3FFN6 263,41,42,43,269,14 92 | P01112 17,35 93 | G3I8R9 257,252 94 | P11142 232,227 95 | P54132 848,1050,1047,1063,1048,1066,684,1036,1055 96 | Q88FY1 320,265,105,302,318,319 97 | A0A1Y1BQV9 289,329,332,336,337,340,341,286 98 | Q9KDJ7 150,152,91 99 | L8ICE9 226,227,675,301,303,305,307,339,372,247,222,543 100 | Q9I5W4 700,696,716 101 | Q9BY41 178,180,267 102 | C3W5S0 80,134,119,120,41,107,108 103 | Q6JC41 288,424,361,426,362,526,561,531,563,734,286 104 | Q4VA93 132,102,135,140,143,115,118,151 105 | Q3JRA0 10,12,44 106 | P61586 19,36 107 | P30793 144,210,212,141 108 | P06873 330,365,174,305,177,178,280,121,282 109 | Q06338 283 110 | 
Q9NRD9 832,864,834,866,836,867,868,839,869,870,875,828,830 111 | A0A1D1UCW7 64,67,68,70,135,136,137,104,142,63 112 | P20051 98,258,137,14,16,180 113 | Q9IGQ6 384,386,324,376,294,298,342,343,344,345,378 114 | Q9RZN1 224,225,162,163,100,101,230,197,200,201,112,217,219,223 115 | Q8YY76 160,162,163,165,166,39,42,52,53,159 116 | B2RHG5 224,297,298,301,302,303,336,150,152,154,252,253 117 | P20292 58,61,62 118 | Q9NY97 376,247 119 | Q8PET2 32,288,151 120 | P31947 192,196,133,83,84,86,87,215,89,219,31,35,228,231,110,112,56,60 121 | P0C0Y8 231,191 122 | P0C0Y9 267,220,235 123 | A9CJ36 192,137,141,214 124 | P04746 173,115,182,216 125 | C9K1X5 224,228,220 126 | P23360 201,235,237,240 127 | P9WNV1 32,72,44,28 128 | Q2CEE3 388,391,392 129 | Q0TR53 73,74,108,111 130 | Q96HY7 368,333,366 131 | T2B7E1 137,24,107,138,59,104,141,62 132 | Q2UWK0 225,167,169 133 | B0R5R2 16,11,60,15 134 | A0QWT2 291,207 135 | Q06187 165,143,152,154,155 136 | P04067 70,211,86,87,311,225,289,227,228,290,230,291,104,235,236,242,252,244,247,248,60 137 | Q4QBL1 98,250,102,170,251 138 | P50405 296,292 139 | P22297 289,108,222 140 | A0A452CSW8 234,171,84,86,88,89,152 141 | P11234 28,46 142 | A0A0S4TLR1 69,47 143 | Q9LXT4 98,164,100 144 | Q9SJI9 98,164,100 145 | Q8LGJ5 136,197,134 146 | P21327 153,156,317,79 147 | Q92794 289,259,292,230,262,233,265,268,238,209,241,307,212,310,281,284 148 | P08312 256,308,101,102,103,310,268 149 | P07395 416,160,33,161,454,264,459,460,463,687,627,691,629,503,504,253,254,159 150 | A0R5T1 103,231,105,106,234,140,109,142,182 151 | P00811 33,202,37,245 152 | A0A0F7VBC6 305,308 153 | Q9I0B9 64,66 154 | P61224 17,35 155 | Q13490 326,333,306,309 156 | Q9R1E6 739,741,358,359,743,745,746,171,747,209,311,474,315 157 | A0A059ZPP5 208,323,213,202 158 | P62136 64,96,66,173,92,272,248,124,125 159 | Q8N6T7 166,141,144,177 160 | P62575 398,399,400,401,375,412 161 | A0A418LHZ8 145,55,90 162 | A0R5R2 51,86,141 163 | A0A140HJ20 643,291,647,267,269,367,592,593,371,372,376,286,287 164 | P24300 
257,287,181,245,247,217,220,255 165 | A0A2P1GNT2 324,293,294,295,297,344,345 166 | A0A369R1N0 325,327,186,235,237,188,179,122,124 167 | G8T6H8 165,166,139 168 | F2TZN0 456,491,492,605 169 | P02794 128,129,132,133,135,10,11,12,13,14,15,16,142,18,147,19,145,22,23,151,26,28,30,34,41,170,172,173,174,51,61,62,63,64,65,66,72,75,76,80,82,84,88,90,92,93,108,110,111,113,127 170 | Q9XI75 176,178,235 171 | P21802 644,630,631 172 | P02866 197,91,171,173,175,177,243,245,182,250,187,94 173 | Q16659 171,157 174 | P29068 389,390,199,202,148,149,154,155,157,159,287,167,103,105,170,107,302,112 175 | Q96KQ7 1027,1115,974,976,1168,1170,980,1175,1017,987,985,1021,1023 176 | O95639 162,132,166,138,142,156,148,124 177 | E1AQY1 81,2,83,4,67 178 | O00408 808,660,727,696,697 179 | P33590 516,517,392,297,330,393,394,363,312,313,187 180 | Q50KB2 329,331,443,348,351 181 | P0DTC1 1752,1755,1789,1787 182 | A0A286S0G7 176,97,99,215,157,95 183 | Q9NXA8 166,169,207,212 184 | B9LW38 153,107,156,151 185 | P02872 160,144,146,148,150,155 186 | Q9L069 467,443,445,446,463 187 | P71447 10,8,169,170 188 | P12821 992,418,390,394,460,1016,988,383 189 | P09528 66,170,28,174,63 190 | Q9HCE5 336,332,157 191 | Q96SD1 256,33,35,228,37,38,136,272,115,254 192 | P28720 320,323,349,318 193 | Q9KLZ6 68,69,55,121 194 | A0A218QMM7 118,120,185 195 | A0A0S4TKQ5 273,36 196 | Q93P54 64,65,67,137,334,335,84,278,280,287,31,292,484,296,504,505,122 197 | A0A0F7UZ05 17,19,21,15 198 | B9PZ33 16,18,20,22 199 | A0A0A7HR51 113,265 200 | P04278 81,189,79 201 | C5A3Z3 356,357,408,222 202 | P9WKK7 276,308,279 203 | Q86SQ9 34,38 204 | Q8IFW4 88,105 205 | P08799 655,656,179,180,181,183,184,185,186,125,126,127 206 | Q15181 116,119,102,118,120,121 207 | P93836 226,308,394 208 | A0QUZ2 65,101,85,127 209 | Q5S007 1409,1441,1410,1442,1413,1415,1338 210 | Q09LZ6 57,11,59 211 | P0A746 98,46,49,95 212 | O43570 145,119,121 213 | I3P686 161,194,104,200,202,235,204,181,182,183 214 | P05164 334,335,336,338,340 215 | O59952 66,67,69,103,107,79,84,55 
216 | Q86WT6 64,41,44,78,81,56,58,61 217 | P39230 96,97,94 218 | Q4WQ18 178,119,175 219 | Q9KM24 251,238,254 220 | Q04430 101,102 221 | Q9NP87 418,330,332 222 | P14618 296,75,77,270,272,113,114,243,244,83,86,56 223 | P27988 417,328,331,412,414 224 | Q9BYC5 388,389,491 225 | A4F980 209,242,132,101,135,205,239 226 | V7II86 314,491,365 227 | Q9CPU0 34,100,173,127 228 | S8G5K8 919,835,839,858 229 | A0A0K8P8E7 304,307,309,311,313 230 | P61006 64,22,38,40,63 231 | Q6NZB1 167 232 | U2NM08 50,147,155 233 | W0FBY3 140,132,30 234 | Q9I5E2 60,87 235 | P00533 745,842,721,722,723,724,855,858 236 | P33763 64,1,66,3,4,5,70,71,74,11,15,20,23,24,25,28,33,41,59,60,61,62 237 | P12689 548,549,550,551,518,553,362,363,554,525,465,467,468 238 | Q9H082 65,63,47 239 | Q86Y01 449,452,446,468,471,411,444,414 240 | P25321 185,172 241 | A0A0F6NGI2 500,501,503,506,508 242 | Q32904 83,85 243 | P27353 140,144,209,114,243,147,246 244 | B8QIQ9 240,114,179,116,198,118 245 | B2UR60 205,209,308,309,311,215,316,317,319 246 | B1KYY8 111,115,85,182,118,216,186,219 247 | Q941L3 388,421,390,392 248 | A0A0R6L508 465,466,246,285 249 | A0A173N065 174,118,183,217 250 | P0DKX7 1539,1540,1413,1414,1415,1541,1417,1543,1547,1548,1549,1550,1422,1552,1423,1424,1426,1556,1557,1430,1431,1432,1433,1558,1435,1559,1565,1566,1567,1568,1561,1570,1440,1441,1444,1574,1575,1448,1449,1450,1451,1576,1453,1577,1583,1584,1585,1586,1579,1588,1457,1458,1459,1462,1460,1596,1470,1599,1473,1603,1605,1606,1477,1439,1609,1478,1479,1482,1480,1442,1523,1652,1525,1654,1530,1531,1532,1534 251 | I4E596 270,273,250,253 252 | Q9H165 802,772,805,775,744,747,783,792,818,786,788,823,760,764 253 | P0ABE7 34,99,38,41,43,47,23,24,26,27,61,30,95 254 | A0A223GEC9 101,23,191 255 | A0A0S2GKZ1 97,20,183 256 | D0CD18 27,33,11,14 257 | Q04111 437,438,412,477 258 | Q8K299 416,417,482,419,454,455,422,456,457,458,477,478,415 259 | Q6ZMJ2 481,482,419,420,421,486,423,426,458,460,461,459 260 | P81947 39,71,41,11,44,45,47,49,50,55,61 261 | A0A068C8U8 19,37 262 | 
Q96L73 1920,1925,2023,1895,1897,1931,1905,2003,2070,1911,2072,1978,2077 263 | H9JH18 19,21,23,25,30 264 | Q4J989 20,205,207 265 | O76074 617,682,764,653,654 266 | D2Z025 198,200,107,235,77,109,111,110,113,86 267 | B7UCZ5 225,66,258,255,261,204,205,206,209,210,243,244,53,55,254,63 268 | Q9Y3Z3 164,167,233,206,207,210,311 269 | P62993 152,79 270 | F8W4B7 705,323,612,230,614,232 271 | Q86UW9 450,453,415,469,472,412,445,447 272 | P29166 40,42 273 | A0A179QS89 69,38,154,72,8,9,11,151,122 274 | Q8C6L5 384,385,288,291,392,378 275 | P07528 64,376,357,360,393,363,396,334,344,347,348,318 276 | P77990 37,39,150,59,60 277 | A0A2E2GL08 32,24,26,92 278 | Q9NZK7 48,49,66,45,47 279 | Q86V24 352,202,348 280 | A0A024B7W1 3234,3369,2959,3250,2963,2968,2971 281 | A0A160T9D2 205,207,208,179,181 282 | D0VWU9 404,311,315,141,542,143 283 | P00644 103,117,122,123,125 284 | B4EUK6 117,72,74 285 | P01111 35,17,18,21,22,27,28,29 286 | P00327 98,68,101,104,47,112,175 287 | E7E2M2 97,98,259,260,138,140,301,303,304,305,143,144 288 | Q04M32 195,191 289 | A0A0C2W6A5 84,54,87 290 | O08600 164,136,169,137,138,141,177 291 | Q72HW2 97,450,133,135,455,393,137,396,398,444,445,446 292 | P16952 648,650,664 293 | G0S3F2 130,391,135,79 294 | P43215 114,126 295 | P36871 288,290,292,293,117 296 | P84095 17,35 297 | G7JFU5 266,268,181,183,185,157 298 | P46637 161,270,272,185,186,187,189 299 | A4Q9E8 359,264,361,241,178,179,346,125 300 | Q9H7Z6 226,230,210,213 301 | A0A3Q0L1E1 360,393,395 302 | Q5NQE9 116,212,214,118,120,92 303 | G2EBB4 252,254 304 | P11444 195,247,297,221 305 | E9QYP0 136,138 306 | G2R014 288,545,289,392,492,493,339,565,341,285 307 | Q8IM71 305,307,293 308 | O15178 186,167 309 | A0A135VHY8 66,2,68,137,138,144,147,19,26,27,95,96,160,172,175,176,49,177,179,116,61 310 | A0A135VDL7 96,160,66,67,68,137,138,172,175,49,177,179,116,26,27,61,95 311 | X6Q997 248,184,265,136 312 | B6K6D6 128,130,131,199,201,202,206,146,147,148,149,151,153,218,219,220,156,93,94,98,99,102,123,183,184,186,187,125 313 | 
Q8IUN9 267,269,270,206,210,215,280,281,218,216,220,224,292,293,305,243,251 314 | Q9NV35 67,63 315 | Q9H993 291,253,254 316 | Q1LCS4 128,162,165,110,51,57,125,95 317 | P47811 354,356,359,168,174,150,313,155 318 | P02766 86,119 319 | Q05397 937,956 320 | Q96DA2 40,22 321 | P57729 41,23 322 | P96787 48,240,46 323 | A9H324 275,276,277,300,279,280,204 324 | P03366 1042,1077,1097,1148 325 | Q3V872 160,260 326 | P60624 68,69 327 | Q6B856 177,11 328 | G0SC54 280,185,283,188 329 | F5SWR8 112,113,97,99,94,111 330 | Q6ZJK7 408,411 331 | Q9BPX6 432,421,423,425,427 332 | Q8IYU8 196,185,187,189,191 333 | Q972K4 114,86,90 334 | B0VCI2 100,103 335 | P0AC84 165,110,53,55,57,58,127 336 | P9WIC7 15 337 | P12955 289,452,241,370,276,410,412,287 338 | A5U6B6 34,36,38,90 339 | P9WNX1 36,40,92,110,159 340 | A8WBX8 323,325,296,232,234,339,221,190 341 | P68400 161,175 342 | P9WFX1 297,434 343 | Q88JH0 305,306,188,190 344 | O43809 149,88,171,76,93 345 | H9UEF1 64,120,117,62 346 | Q97KK2 482,515,484,517,486,519,488,521,493,526 347 | P0A8W0 176,172,244,222 348 | Q24451 267,1073,534,153,155,1085 349 | Q9NWT6 201,279,199 350 | B7GTV0 288,258,386,356,331,207,272,209,210,276,181,281,124,382 351 | Q6L732 132,134,219 352 | F0NDX2 512,515,650,651,206,207,464,210,467,885 353 | P27302 185,155,187 354 | P21816 86,88,140,93 355 | D3ZKP6 409,426,405,430 356 | A0A128A3G4 289,172,467,126,222 357 | I1RDR8 224,225,98,226,227,165,209,213,216,217,94 358 | K7CID1 647,210,666,475,669,798,799,736,801,738,672,804,740,742,359,360,743,744,568,172,312,316 359 | Q9L2C2 134,7,84,60 360 | D0MGT2 50,54,23,22 361 | A2ENQ8 96,99,37 362 | A0A024B5J2 241,54,313,238 363 | A0A0E3ZCF3 208,193,211,39,282,191 364 | P22234 97,137 365 | A0A142J6I6 161,293,294,282,79 366 | Q75I93 96,93 367 | P40993 64,109,61,63 368 | P0AEG4 20,23 369 | Q8CIB9 408,386,404,389 370 | L0I969 88,85 371 | Q12791 1018,1012,1020,1015 372 | P00722 417,162,419,194,164,462,16,19,22 373 | Q96DE0 169,99,173 374 | Q4LAV3 49,50,51,53 375 | Q9SS90 
258,133,205,207,209,338,339,211,347,156,349,350,286,288,164,166,378 376 | Q97ZJ8 615,618,602,605 377 | A0A249T061 63,61,285,95 378 | Q9RVZ5 322,326,345 379 | Q8N884 390,396,397,404 380 | P07328 32,37,38,39,40,204,45,205,429,433,29,30 381 | P07329 353,258,259,357,108,109 382 | Q95V93 265,188 383 | Q9I6A3 290,234,235,237,273,279,251,286 384 | Q9WZX9 288,234,250,294 385 | B2FKF0 114,115,111 386 | A0A0H3AJ04 128,45,179,95,215,158,159 387 | M5AAG8 80,115,69,72,74,142,78 388 | P61889 71,142,112,113,242 389 | Q9NR30 339,340 390 | O67206 161,514,451,454,519,456,118,120,508,509,510,159 391 | P27695 96,68 392 | C3RVQ0 164,280,169 393 | C3RVP5 164,168,169,276,280,158 394 | P29600 2,40,73,75,77,78,79 395 | A6N5T2 339,291,292,454,457,459 396 | N0DKS8 67,70,71,9,10,11,75,83,86,87,25,160,106,109,112,113,243,116,115,251,255 397 | Q15382 20,38 398 | A0A2Z6G7U6 64,66,70,139,140,141,119,153 399 | A0A2Z6G7U3 65,67,122,71,139,140,118,120,154 400 | Q81258 1129,1177,1131 401 | O39929 1123,1171,1125 402 | O91936 1176,1172,1124,1126 403 | Q04609 387,453,425,553,266,269,272,273,433,436,377 404 | P43166 96,121,98 405 | B7GUI0 179,142 406 | N0B9M5 258,163,261,263,298,299,301,307,115,116,312,153,158 407 | Q9X1X0 180,216,58,92,93 408 | P0A7F3 114,109,141,138 409 | Q4E2I1 210,42,44 410 | Q6NQ79 677,680,557,559,691,724,694,727,699,669,702,575 411 | P02753 193,188 412 | A0A1C3NEV1 464,283,244,463 413 | Q6IQ55 146,163 414 | Q15818 358,359,360,367,370,280 415 | Q8TF76 563,467,477 416 | P0C9C6 231,233,142,271,78,179,115,182,218 417 | L7T0L4 41,74,76,46 418 | P42212 209,132,56,25,58,57,143 419 | L0FY79 240,180,117,119,121,199 420 | A0A0U3H0V9 161,166,172,205,206,213,88,153,90,92 421 | Q9KN86 297,378,301 422 | O00311 353,196,360,363,182,351 423 | Q9UBU7 296,299,309,315 424 | A0A125W693 345,322,318,359 425 | Q8YYB7 161,162,147,163,170,123 426 | Q5LUB5 144,145,58,61 427 | G0FQ07 114,118 428 | A9GMG4 305,262,264 429 | Q9NVW2 610,588,590,593,596,570,573,607 430 | O13836 240,201,150,239 431 | P02792 128,131 
432 | A0A1Z1F9L9 200,203,204,205,86,120,122,123 433 | O32163 128,66,41,43 434 | Q92832 453,454,456,457,434,435,437 435 | Q4Q5S8 175,176,177,178,179,180 436 | Q00987 452,457,461,464,438,441,475,478 437 | Q7YRZ8 452,457,461,464,438,441,475,478 438 | Q5XI77 256,257,261,211,212,213,214,215,409,411,412,413,417,366,369,370,371,253,255 439 | Q5SHZ3 228,270,113,114,116,87,281,282,283,284,157 440 | P05057 74,76 441 | P05089 128,101,232,234,246,122,124,125,126 442 | Q6DDT4 9,34 443 | Q8WS26 161,165,230,313,317 444 | Q9WZL1 210,181 445 | P50213 257,261 446 | P51553 117,311 447 | P46881 592,433,431 448 | A0A1E1GJG5 224,204,82,85,278,215,24,214,27,220,221,216 449 | E5RPG3 2002,2004,2013 450 | Q9BV73 2209,2212 451 | O95989 65,50,51,66,52,70 452 | P29476 331,326 453 | P29475 336,331 454 | P29474 369,99,365,94 455 | P80366 352,194,162,228,229,313,350,191 456 | P0DP23 65,130,68,132,134,136,141,21,23,25,27,29,94,32,96,98,100,105,57,59,61,63 457 | Q14191 563,935,936,939,908,558 458 | Q7LBC6 1560,1689,1562 459 | Q9WZD0 49,42,45,39 460 | A0A024FRL9 208,250,120,122,124,189 461 | P31153 265,31 462 | P41977 98,50,179,146,183 463 | D9QA55 136,137,111 464 | Q80J94 438,440,410,366,447 465 | P65473 354,355,195 466 | Q96EP0 986,930,898,901,998,871,969,874,1001,925,911,916,890,923,893,926 467 | A0A1M2CSI6 208,250,120,122,189 468 | Q9I311 692,694,743,744 469 | B1MD73 166,168,169,170,74,75,77,111,112,113,48,149,153,90 470 | P96202 2125,2126 471 | B3EYM9 201,203,205,207,209 472 | Q9I188 428,271,432,375,216,379,220 473 | P75960 174,176,177,155,157,158 474 | H6QM92 134,41,106,107,108,80,119,120 475 | H6QM91 305,306,445,446 476 | Q81TB4 96,161,195,165,198,93,62 477 | P25098 368,360,361,363,348,366 478 | Q86T24 512,544,516,552,555,524,527,496,499,568,540,573 479 | Q9A180 1190,1192,1194,1196,1197,1198,908,910,912,914,1042,916,1043,1110,1017,1114,1019 480 | Q08467 232,246 481 | P0CT50 8,169,298 482 | Q5KZU5 266,206,145,178,23,25 483 | A0A0U1MWF9 386,579,644,581,580,647,360,652,655,532,380 484 | 
A0A1S4CVP6 258,175,116,118 485 | D2JIV0 135,105,138,139,108,112,25,60,63 486 | Q9D4V7 66,20,38 487 | A0A386KZ50 352,355,359,366,351 488 | A0A161XU12 360,362,365,699,700 489 | O43583 192,193,195,142,145,146,149,150,153,160,163,171,172,175,176,188,179,183,184,121,122,187,124,190,191 490 | D2W2Z5 227,326,170,172,173,146,147,53,149,190 491 | E1BF86 293,296,301,304 492 | A0A109PTH9 228,230,329,330,332,334,218 493 | P22963 34,37,38,137,172,174,175,23,22,183,218 494 | Q8KPT4 167,170,138,83,86,55 495 | Q9H2K2 1089,1081,1084,1092 496 | Q9X1H0 52,54,58,92 497 | K4PWX3 176,97,99,215,157,95 498 | P02879 137,279,281,141 499 | Q8EAC7 231,201,202,250,251,252 500 | G8B4G1 176,97,99,215,157,95 501 | Q7WYA8 176,97,99,215,157,95 502 | B2VCC3 52,54,56,140,141 503 | Q8NB16 33,19,71,104,15 504 | P00396 240,368,290,291,369 505 | P00428 113,116,91,93 506 | A0A3S7X3C0 180,243,244,143 507 | E9BLM4 226,227,180,124,125,143 508 | Q9D8Q7 135,137,141,157,158 509 | L8AXY8 98,99,100,101,8,72,200,43,12,44,14,146,116,186,94,95 510 | Q99828 161,163,165,167,172,116,118,120,122,127 511 | A0Q7V5 57,11,12,55 512 | P0ACD8 528,582,57 513 | Q8IEK1 496,500,519 514 | Q8YVV8 52,54 515 | Q96AE4 304,337,341,334 516 | O60232 70,73,53,56 517 | Q5AR53 242,244,319 518 | Q9BYF1 402,374,378 519 | P37659 378,243,278,442,443,381,383 520 | D0L035 116,88,92 521 | Q9F4D5 194,675,676,707,678,424,425,426,111,113,638 522 | Q56307 513,584,586,511,575 523 | Q9UGM3 128,130,1026,135,1058,1059,1060,1061,169,1081,1082,1019,1020,1021,1086,1023 524 | A3DC31 203,205,274,277,278,282,283 525 | P00441 64,72,47,81,49,84,121 526 | Q9NZ08 353,357,376 527 | Q2U8Y3 129,66,137,171,270,143,47,241,82,175,84,148,150,279,120,61,127 528 | A0A062V290 35,71 529 | P31327 362,335,337,315 530 | D9N168 356,533,357 531 | C4LXK0 9,169,7 532 | Q9BYW2 1533,1539,1516,1678,1520,1680,1685,1529,1499,1501,1631 533 | Q8R0F8 105,74,76,21,22,26,126 534 | Q05086 800,799 535 | P17562 17,273,279,253 536 | P76038 141,206,181,183,157 537 | P43860 250,108,253,254 538 | 
Q939N5 490,427,428,494,399,401 539 | F0FS71 257,259,357,361,284,285 540 | P68135 288,290,68,70,39,265,267 541 | A0QR77 9,10,107,108,152,159 542 | E5KIY2 122,208,124,152,250,120,189,223 543 | P9WK11 469,471,440 544 | Q5KW80 129,130,98,69,102,167,105,202,205 545 | P29082 114,37,86,88,90 546 | Q63SB3 37,39,216 547 | P39900 192,194,196,198,199,201,218,158,222,228,168,170,175,176,178,180,183,124,190,191 548 | Q9WYT9 53,55,9,10 549 | P0A031 200,107,108,109,110,203,204,208,209,20,21 550 | Q63870 1168,1061,1065,1131,1132 551 | P04637 238,176,242,179 552 | Q13936 1659 553 | A0A094ZWQ2 185,189 554 | P50477 362,363,396,364,399,400,401,115,308,309,310,311,116,150,157,155,156,152 555 | G0SFF0 57,77 556 | Q58M74 97,66,2,68,72,41,78,48,20,27,31 557 | Q0QC76 160,74,93,95 558 | Q84EX5 38,39,40,168,43,16,240,123,124 559 | P00431 127 560 | Q93HL2 194,54,38,216,190,63 561 | A0A098SKQ8 145,149,55,103 562 | P68826 65,154,158,111 563 | Q25704 99,648,42,46,47,208,689,210,692,185,188 564 | P43873 288,290,276 565 | M1GSK9 65,97,257,100,290,287 566 | A0A2K0XYW4 166,168,173,156 567 | J9XNG9 675,868,677,678,903,871,869,650,904,684,653,686,909,693,663,696,668,671 568 | P56547 45,46,112,117,121 569 | Q8WQX9 673,709,710,822 570 | Q14980 120,123,13,119 571 | P04632 193,137,152,154,247,156,158,225,163,227,228,109,112,114,115,116,182,119,184,249,186,251,188,253 572 | P16218 742,814,752,689,786,690,788,691,790,663,792,665 573 | G0SEG4 418,419,327 574 | Q8A2H2 83,332,43,44,333 575 | Q8TZW1 32,161,64,200,303,24,280,281,26,286,63 576 | P0A776 53,37,38,120,57,39 577 | P48735 314,318 578 | Q10471 224,226,359 579 | Q6B0I6 238,244,310,312 580 | Q13185 130,131,132,133,135,137 581 | Q8ZM86 98,101,7,8,9,120 582 | P02787 225,139,536,411,604,445 583 | P38108 32,35 584 | Q5JHF1 40,94 585 | Q9H9S5 416,289,296,360,362,364,345,346,317,318 586 | A7NNM5 76,141,77,146,151 587 | Q8PGN7 241,243,239 588 | P05058 50,52,74,76 589 | A0A2N7KUA1 35,291,296,488,300,141,338,339,88 590 | M2XAQ9 16,50,15 591 | Q0I9X8 
33,66,69,40,110,143,113 592 | Q9QY93 66,98,63,95 593 | P84138 142,80,242,245,182,183,125 594 | Q9P8C9 297,233,299,269,272,340,308,310 595 | Q9UL01 481,452,470,471 596 | Q8GXJ1 404,313,315 597 | P05361 212,199,201 598 | P19812 160,136,139,175,177,148,189,118,151,123,157 599 | A0A452E9Y6 305,227,307,301,303 600 | F4KCC2 134,137,170,172,155,146,149,184,121,187,158 601 | W8QLX4 1810,1811,1836,1838 602 | P68871 97,43 603 | Q8IXJ6 224,195,200,221 604 | P43408 17,140 605 | V5VCK8 160,162 606 | O50271 448,481,53,501,519,40,395,446 607 | Q44384 67,313,37,295,40,296,330,331,332,235,238,237,240,241,314,216,217,218,221 608 | Q48484 386,388,540 609 | Q15848 193,194,195,187,188 610 | Q8WWA0 97,98,260,133,262,274,86,87,89,282,92 611 | G0S1N3 456,378,451,459 612 | Q9FD68 106,191 613 | C6XF58 58,266,125,191 614 | Q8YSJ5 65,35,68,163,133,166,199,201,56,60 615 | X2D812 64,68,107,62 616 | Q9FG77 326,328,298,303 617 | A0A1J6PWI8 168,249,172 618 | P9WPQ5 16,49,37,108,15 619 | Q57514 50,52,74,76 620 | P35828 257,259,771,264,265,266,268,783,785,280,793,282,795,285,797,799,288,289,290,291,292,294,297,298,559,562,314,315,317,320,322,586,864,865,866,868,873,876,878,882,883,884,885,887,634,636,893,638,892,896,900,901,902,903,905,651,652,653,654,908,656,910,928,673,674,675,676,933,678,931,689,690,691,692,695,697,698,699,966,967,968,716,717,990,992,993,994,484,485,491,757,758,759,762,255 621 | Q8IXL7 135,138,86,89 622 | Q06210 473,474,476 623 | Q6GHN9 2 624 | A0A022MRT4 261,294,263,281,238,78,240,81,79,147,308,148,310,87,248,149,282,283,120,254 625 | A0A022MQ12 75,77,205,245,246,345,253 626 | Q7N4I5 115,26,188 627 | K7SA42 195,112,176,114 628 | Q2N9N3 256,259,164,167,200,112,114,242,116,181,88,91 629 | C6KFA3 89,97,134,136,137,138 630 | A0QTY3 152,121,123 631 | P42981 113,12,15 632 | L0N6M2 486,441,443 633 | A0A024L327 256,247 634 | P37793 97,104,172,175,80,82,84,153,156 635 | Q2PAJ1 547,548,485,549,550,488,553,490,138,140,558,93,95 636 | Q94707 192,193,331,267,368,371,308,309,215,218,283,221 
637 | Q3UPF5 96,162,168,73,106,172,110,78,174,82,86,150,88,182,187,191 638 | P06396 304,354,329,330 639 | A5K1A2 165,169,170,171,172,173,174 640 | P00959 162,146,149,159 641 | Q92841 385,132,263,398,399,208,465,209,210,351,223,484,485,231,234,235,179,372,373,381,383 642 | Q9UKK9 96,97,166,111,112,82,115,116 643 | Q9SXZ2 81,141 644 | Q5I3C1 3202,3204,3337,2927,3218,2931,2936,2939 645 | V5TER4 224,103,106,45,47 646 | P33755 137,139,204,208,145,148,216,219 647 | A0A068PS59 737,739,882,764,767 648 | O95251 384,388,368,371 649 | P15498 546,516,549,554,557,529,532,564 650 | D7Y2H5 104,73,105 651 | P11802 48,145,47,42,138,156,157,158,159 652 | H0ZAB5 275,277,413 653 | A0A0S4IKF4 580,644,647,584,620,623,628,660,631,664,601,604 654 | D7Y2H2 72,74 655 | C6XID6 199,241,180,116,118,120 656 | A0A380UY17 65,15,58,60,61,62,63 657 | Q6P179 370,374,393 658 | A6T6Z0 355,389,358,328,362,239,336,342 659 | D0VMS8 384,385,386,131,388,267,268,397,269,271,404,410,414,299,301,302,304,186,190,66,68,196,70,72,74,341,343,348,349,350,352,357,358,359,361,365,366,367,368,370,374,375,376,377,379,383 660 | Q8ZXN7 33,264,138,170,266,173,268,211,252,86,156,30 661 | A0A0D5YK08 17,209,207 662 | Q8DLC7 497,498,435,581,439,584,543 663 | P05555 160,225,156,158 664 | P84308 227,190 665 | A0A1Y2HEJ3 96,136,92,93 666 | A5H660 186,188,285 667 | I3DBY6 233,151,152,153 668 | P35816 481,418,516,485,392,361,362,494,144,145 669 | D5VCE0 248,249 670 | O07431 192,193,194,195,165,166,198,139 671 | P18858 568,621,720 672 | P08709 66,67,74,76,270,79,80,272,275,276,85,86,280,89,106,107,109,123,124,61,62 673 | P0DP24 96,32,130,98,132,100,134,68,136,105,27,141,25,21,23,57,59,61,94,63 674 | Q8N371 400,321,323 675 | P54922 56,304,54,55 676 | Q52ZA0 116,118,281 677 | Q9HVA8 224,129,87,39,223 678 | Q9K943 99,102,103,8,11,77 679 | A0A1K2FZ20 179,243,215,253,218,285,255 680 | Q9PPB4 200,230 681 | E9AE57 468,471,216,219 682 | -------------------------------------------------------------------------------- 
/data/independent_set/binding_residues_nuclear.txt: -------------------------------------------------------------------------------- 1 | Q6SPF0 43,45,46,92,48,49,87,56,57,88,60,94 2 | C6H0Y9 130,194,133,84,88,24,34,102,38,41,106,121,122,123,124,125 3 | Q8IUH3 192,193,66,68,70,152,27,155,29,31,163,100,165,122,126,104,105,106,107,112,114,53,55,186,124,125,190,191 4 | P06746 256,258,133,135,271,272,273,274,276,279,280,283,27,284,287,288,34,35,292,293,38,39,294,295,296,36,40,183,62,190,192,64,66,65,68,69,63,67,229,230,231,232,233,234,103,104,105,107,108,109,110,106,236,254 5 | P18858 768,769,770,771,776,794,795,796,797,798,799,800,801,802,546,549,557,303,304,305,568,571,572,573,588,589,590,592,594,850,851,852,853,855,346,347,348,349,350,351,352,353,869,870,871,872,873,874,635,636,639,640,641,642,643,644,415,416,417,418,419,420,449,451,452,453,454,455,456,457,458,720,738,744,746,749,504,767 6 | O28951 385,383,137,138,139,140,143,147,151,152,154,155,156,26,27,159,30,163,427,329,330,331,332,117,118,119,120,123,380,127 7 | O95243 466,467,468,469,470,536,473,538,539,540,541,560,562,504,505,506,507,508,509,510,511 8 | P00734 433,434,435,372,437,438,441,477 9 | K7RJ88 129,131,132,135,138,396,397,398,424,301,302,303,430,432,433,448,451,332,335,338,216,217,218,219,355,356,364,117,118,119 10 | Q9HVC1 26,27,28,32,36,37,38,39,40,42,46,49,50,52,55 11 | Q9Y253 256,257,258,259,260,261,262,384,293,38,39,42,46,47,48,311,313,316,61,62,317,64,320,321,322,323,324,326,318,319,328,329,327,86,89,93,351,224,113,115,116,377,378,381,382,383 12 | P0DTD1 6784,5250,5253,6792,5257,6794,6795,6796,6797,6928,6930,6932,6935,4892,4893,5151,5152,5153,4899,6822,6823,6824,6825,6696,6828,6701,6831,6832,6834,6840,6968,6841,6970,6844,6971,6972,5307,5076,4933,4935,5077,4950,4951,5074,5075,6996,6741,5205,6999,7000,6873,6874,7001,6872,6743,6745,5206,5081,4961,4969,5228,4972,5232,4982,5239,4984,4986,4987,5246 13 | F2NWD3 10,11,12,13,15,16,340,98,100,101,102,39,104,41,42,122,123,124,125 14 | A9CJ36 
67,68,8,9,10,31,33,34,35,44,45,46,47,49 15 | A0R5T1 17,20,23,25,35,63,320,65,321,322,323,324,71,73,74,75,76,77,330,331,336,81,219,220,93,221,95,224,97,94,231,238,239,115,116 16 | P55316 227,228,230,231,234,204,272,241,274,185,186,253,255 17 | Q9E006 449,450,382,388,453,452,456,475,479,428,429,431,381,380,441,443,444,445,446,383 18 | Q1PST0 2,35,34,5,38,37,42,45,46,49,30,31 19 | Q9NP87 385,386,387,389,275,416,418,174,175,177,434,178,179,433,435,436,438,442,445,446,449,450,323,456,457,329,459,332,204,206,207,208,209,330,364,365,245,246,247,249,250 20 | P26368 260,262,264,265,146,150,152,155,289,291,294,300,302,304,192,195,196,197,199,328,329,331,333,335,338,339,340,225,227,228,229,230,231,252,254 21 | Q9LVJ7 206,207,208,210,211,212,214,215,216,218,220,222,223,224,225,226,227,228,229 22 | D9N168 256,257,258,259,260,261,531,532,533,534,281,282,285,288,289,290,293,313,314,315,316,318,320,321,322,323,325,327,328,329,330,331,332,340,410,413,414,417,419,420,421,422,423,424,427,432,474,484,485,230,232,233,489,234,236,492,493,235,496,500,253,254,255 23 | P12689 518,551,552,553,554,555,556,557,307,319,320,322,323,324,325,326,328,329,588,589,590,332,606,362,619,620,621,622,623,624,625,626,627,628,629,366,367,393,394,395,396,397,398,399,401,402,663,667,417,681,682,684,685,686,691,692,693,694,696,464,465,467,468,734 24 | I4E596 2,3,131,259,264,265,34,35,167,174,175,176,58,59,72,74,75,94,95,113,114,243,116,115 25 | Q9H165 777,791,749,781,814,784,782,783,787,756,788,753,759,760,755,763 26 | P84233 64,65,66,67,70,73,84,85,86,87,57,40,41,42,43,44,45,46,47,48,50,117,118,119,121 27 | P62799 33,36,37,46,47,80,49,81,79 28 | P02281 87,88,89,30,31,32,33,34,35,36,37,40,41,43,54,55,56,57 29 | D0CD09 64,66,67,90,24,26,27,92,63 30 | D0CD04 18,19,39 31 | D0CD13 2,66,4,5,67,16,20,26,28,31,32,33,34,35,97,37,99,39,100,41,101,45,46,47,48,51,60 32 | D0C9L7 16,84,21,22,23,87,26,27,28,30,31,97,101,38,40,41,46 33 | D0CDQ7 72,73,74 34 | D0CAL0 20,21 35 | D0CD15 51,19,52 36 | D0CBZ8 1,2 37 | Q2G285 
32,5,6,7,10,11,14 38 | Q9Y3Z3 451,453,456,136,137,142,145,155,156,165,116,117,118,372,376,378,125 39 | P28147 147,148,149,152,154,26,27,29,31,161,163,165,167,169,171,173,56,57,58,61,63,68,70,72,74,76,78,80,82,83,116,117,119 40 | D0VWU9 384,385,386,389,269,664,665,667,540,668,542,673,674,675,676,677,678,679,709,710,711,590,591,592,593,594,348,349,350,735,606,607,611,612,613,739,743,495,498,499,245,501,383 41 | B8H4R9 34,38,42,49,55,56,59,60,63 42 | P37524 146,147,213,214,156,157,158,159,160,162,163,166,169,184,185,186,187,189 43 | G2R014 326,327,394,397,399,400,403,404,341,559,372,437,438,375,377,379,445 44 | P52952 192,194,142,143,144,145,162,165,168,181,183,185,187,188,190,191 45 | P62399 64,66,69,90,73,74,75,80,24,89,26,27,92,63 46 | P0ADY7 16,17,18,19,50,55,38,54,59 47 | P0C018 64,3,67,68,15,25,27,29,30,31,32,33,34,98,36,100,38,101,102,44,45,46,47,54,55,56,63 48 | P68919 9,75,12,13,14,78,17,18,19,21,88,90,29,31,32,37 49 | P0A7L8 73,74,76 50 | P0AG51 17,20,53 51 | Q6CJ70 521,522,538,554,555,570,571,572,573,574,576,578,579,580,583,586,591,593,595,597,598,608,610,611,612,619,620,622,628,637,639,641,644,645,646,648,650,684,687,688,689,690,691,694,700,702,461,465 52 | B8GW30 199,200,201,202,204,208,155,161,162,227,225,226,230,234,171,172,173,174,175,177,178,181 53 | Q96MU7 404,405,407,428,431,432,433,434,438,439,455,472,473,474,475,476,361,362,363,367,377,378,379,380,381 54 | F0NDX5 145,146,83,149,87,154,155,31,32,33,35,39,56 55 | F0NDX6 265,266,267,268,269,270,271,20,21,22,23,26,29,30,31,32,47,48,50,51,52,54,55,59,72,73,74,75,78,79,80,81,210,82,211,212,213,214,216,219,220,221,222,224,225,226,227 56 | F0NDX1 256,257,138,139,140,141,142,15,16,17,147,20,149,150,151,152,153,154,148,155,40,41,44,45,47,48,51,187,60,189,190,191,64,192,248,249,250,251 57 | F0NDX3 128,129,264,265,266,268,155,156,158,159,160,162,178,179,180,181,183,184,185,74,77,207,208,209,210,211,212,221,222,223,224,225,127 58 | F0NDX7 98,99,36,101 59 | F0NDX2 960,625,626,458,459,1037,478 60 | A0A0M4DML1 
73,78,79,82,20,21,22,85,23,29,31,32,100,101,102,103,104,105 61 | Q97ZJ8 15,16,168,169,172,173,176,177,180,181,308,184,58,59,60,318,320,68,196,198,199,200,201,202,203,80,89,90,493,494,496,497,116,500,508 62 | Q9NR30 268,269,270,294,295,298,315,317,318,446,448,321,447,468,469,476,349,352,494,495,496 63 | O94408 64,65,35,36,37,63 64 | Q9Y7M4 42,44,81,82,83,27 65 | O14352 64,35,36,37,62,63 66 | Q9UUI1 65,63 67 | O74483 32,33,63,62,94,61,30,31 68 | P87173 73,74,75,43,44,46 69 | O42978 66,68,39 70 | C0VHC9 449,259,324,325,390,391,454,424,425,426,299,300,301,302,363,427,369,428,298 71 | P27695 128,266,268,269,270,271,276,278,280,282,156,171,174,177,309,181,70,71,73,74,78,210,212,222,96,224,226,98,228,229,230,103,231,125,126,127 72 | Q14527 97,68,69,72,73,110,111,112,113,142,91,93,94 73 | Q04609 514,545,546,183,186,698,316,317,318,191,320,321,322,700,701,702,703,709,207,605,609,610,612,613,614,616,617,619,620,623,504,633,511 74 | Q6NQ79 646,648,651,652,716,717,655,600,601,602,603,604,626,628,629,630,631 75 | P0C9C6 5,7,8,9,142,271,14,273,146,147,148,149,45,48,49,50,179,52,51,182,55,78,81,82,84,87,89,92,229,231,232,233,115 76 | Q4VGL6 193,259,262,263,264,265,219,220,238,239,240,241,247,184,185,250,251,188,253,190 77 | Q4X049 288,289,268,269,271,272,273,275,276,277,279,281,283,284,285,286,287 78 | Q4WDM9 104,105,106,88,89,90,91,93,94 79 | G5EAZ0 268,269,270,272,273,274,276,277,278,280,282,284,285,286,287,288,289 80 | Q5B5Z6 66,67,68,101,69,100,72,44,49,50,51,52,55,58 81 | Q5AYY8 104,105,106,137,138,139,83,85,87,88,89,90,91,93,94 82 | Q8YYB7 68,73,75,141,78,144,80,145,147,161,163,164,165,167,168,169,170,43,171,44,46,112,115,116,119 83 | E5RPG3 1930,2058,2060,2059,2061,1941,1942,1943,1944,1839,1969,2147,2020,1845,1846,1844,2106,2107,2108,1852,1858,1864,2124,2132,1751,2137,1755,2141,1888,2144,1890,2145,2148,2021,2022,2023,2024,2151,2026,2152,2025,2029,1905,1907,1909,1910,1919 84 | Q9SB92 82,83,86,87,89,90,92,93,94,96,100,38,39,40,41,60 85 | P35716 
64,68,76,77,80,81,84,87,88,120,121,106,48,50,51,117,54,118,56,57,53,61 86 | P68431 38,40,41,42,43,44,45,46,47,48,50,57,64,65,66,67,70,73,84,85,86,87,117,118,119,121 87 | P62805 79,80,81,18,20,24,31,33,36,37,46,47,48,49 88 | P04908 75,76,77,78,15,16,17,18,14,21,29,30,33,36,43,44,45,46 89 | O60814 87,88,89,32,33,34,35,37,40,41,43,54,55,56,57 90 | H6QM92 512,514,388,389,390,391,392,393,534,279,281,300,301,302,303,561,562,563,564,691,569,577,326,328,458,459,462,466,467,470,471,473,476,351,352,483,485,491,363,366,367,368,369,370,371,372,373,503,505,507,508,509,511 91 | H6QM91 260,524,525,272,528,274,275,24,30,32,33,34,668,37,38,669,553,557,560,562,566,567,568,569,310,571,350,352,353,355,356,365,633,635,124,125,126,127,123,135,136,409,410,411,667,413,414,670,671,672,673,674,675,676,678,679,680,683,687,690,694,697,186,444,445,446,191,704,193,706,196,198,200,202,203,225,227,229,237,493,239,494,241,242,243,249,507,510,511 92 | H6QM90 201,205,81,82,83,84,86,88,216,33,38,41,43,52,53,54,55,58 93 | Q86T24 578,515,579,520,584,522,586,595,596,533,597,535,536,598,599,539,534,600,538,549,550,562,563,564,503,504,505,507,508,511 94 | Q9Y6K1 790,792,831,832,834,835,836,708,838,710,711,709,714,715,716,717,718,841,720,846,762,881,882,883,756,757,758,761,890,891,763 95 | A0A1S4CVP6 258,144,145,254,149,213,215,119,218,223,224,225,229,186,187,176,178,179,180,181,182,183,184,185,250,251,252,118,190 96 | Q01826 419,421,389,390,391,425,400,401,402,403,404,406,407,410 97 | P0C0S5 15,79,81,18,19,20,80,31,32,35,38,43,46,47,48,49 98 | P06899 87,88,89,30,31,32,33,34,35,36,37,40,41,43,54,55,56,57 99 | Q9V2B6 7,25,26,27,156,29,28,30,45,46,47,48,184,185 100 | P11938 546,523,399,401,402,405,406,408,409,410 101 | A0QR77 152,287,297,42,41,300,44,302,303,46,305,304,183,184,185,186,187,60,189,190,62,188,192,193,64,196,197,68,65,69,334,335,339,215,216,217,218,219,220,221,104,107,108,242,244,245,246,247,248,249,250,251,252 102 | Q53W14 
384,386,11,12,13,15,16,143,402,147,403,405,550,42,299,300,301,46,303,304,322,324,325,581,582,86,87,88,89,90,222,223,113,380,382,383 103 | Q9UBX2 64,65,66,68,69,71,72,73,137,139,75,140,141,143,79,144,146,20,21,148,23,24,88,26,95,96,97,98,99,43,49,118,124,62 104 | P67809 64,65,67,69,70,72,74,83,85,87,118,120,121 105 | Q9SNB4 193,194,196,170,171,172,146,147,148,149,185,186,187,190 106 | P56981 352,289,484,356,454,455,361,366,527,144,530,532,502,503,504,345,349,477 107 | Q9FG77 293,295,299,306,308,279,280,281,282,283,284,286,287 108 | P16104 76,77,78,12,16,17,18,15,21,29,30,33,36,43,44,45,46,125 109 | O08600 116,117,119 110 | Q9H0Z9 64,66,69,76,77,79,35,37,102,39,40,41,42,43,105,107,109,112,113,114,62 111 | Q3UPF5 144,148,149,150,87,88,89,90,151,96,97,98,105,106,107,108,170 112 | Q92841 280,281,284,426,427,428,301,302,303,304,307,449,450,457,329,335,338,475,476,477,497,252,253,254 113 | D7Y2H5 226,227,228,229,134,136,137,141,81,53 114 | Q9BTM1 12,13,14,15,16,78,18,77,76,17,29,30,33,36,43,44,45,46 115 | C5A3Z3 162,78,16,81,82,19,83,53,54,55,147,121,154,148,61,157,151,159 116 | I3DBY6 2,3,7,28,29,155,27,157,30,159,32,160,162,161,163,165,158,167,34,166,46,48,49,51,52,53,55,56,57,58,71,200,73,74,75,209,211,220,222,223,225,227,101,102,120 117 | P40947 3,7,13,14,15,16,19,21,33,35,37,50,52,54,56,62,66,70,73,74,80,86,88,91,97,98,105,106,107,108,109,110,111 118 | -------------------------------------------------------------------------------- /data/independent_set/indep_set.fasta: -------------------------------------------------------------------------------- 1 | >A0A0B0QJR1 2 | MTNIEPVIIETRLELIGRYLDHLKKFENISLDDYLSSFEQQLITERLLQLITQAAIDINDHILSKLKSGKSYTNFEAFIELGKYQILTPELAKQIAPSSGLRNRLVHEYDDIDPNQVFMAISFALQQYPLYVRQINSYLITLEEEND 3 | >A0A0C2W6A5 4 | MDDFNNILDLLINESKKAIKHNDIPVSCCIIDSNNNILSLAINSRYKNKDISQHAEINVINDLISKLNSFNLSKYKLITTLEPCMMCYSAIKQVKINTIYYLVDSYKFGIKNNYSINDQNLNLIQIKNQKKQSEYIKLLNIFFINKR 5 | >A0A1Y1BWQ0 6 | 
MTTVAPERLSRIREIIAENIDVDLDGLSDTALFIDELGADSLKLIDVLSALEMEYSIVIDMNELPKMTNVEATYQVTAAAAGW 7 | >A0QP43 8 | MAGAQDFVPHTADLAELAAAAGECRGCGLYRDATQAVFGAGGRSARIMMIGEQPGDKEDLAGLPFVGPAGRLLDRALEAADIDRDALYVTNAVKHFKFTRAAGGKRRIHKTPSRTEVVACRPWLIAEMTSVEPDVVVLLGATAAKALLGNDFRVTQHRGEVLHVDDVPGDPALVATVHPSSLLRGPKEERESAFAGLVDDLRVAADVRP 9 | >B8H4R9 10 | MADDAIPHTDVLNSTAQGQLKSIIERVERLEVEKAEIMEQIKEVYAEAKGNGFDVKVLKKVVRIRKQDRAKRQEEDAILDLYLSAIGEI 11 | >D0CAL0 12 | MSKVCQVTGKRPVVGNNVSHANNKTKRRFEPNLHHHRFWLESEKRFVRLRLTTKGMRIIDKLGIEKVVADLRAQGQKI 13 | >D0CBZ8 14 | MRADIHPKYEKLVATCSCGNVIETRSALGKETIYLDVCSACHPFYTGKQKNVDTGGRIDKFKQRFAGMSRSIKR 15 | >D0CD18 16 | MKVQASVKKICGSCKVIRRNGVIRVICSAEPRHKQRQG 17 | >D0CDQ7 18 | MATKKAGGSTKNGRDSNPKMLGVKVYGGQTVTAGNIIVRQRGTEFHAGANVGMGRDHTLFATADGVVKFEVKGQFGRRYVKVETV 19 | >E1C9L3 20 | DFIGSPTNLIMVTSTSLMLFAGRFGLAPSANRKATAGLKLEVRDSGLQTGDPAGFTLADTLACGVVGHIIGVGVVLGLKNIGAL 21 | >F0NDX5 22 | MLYEEDYKLALEAFKKVFNALTHYGAKQAFRSRARDLVEEIYNSGFIPTFFYIISKAELNSDSLDSLISLFSSDNAILRGSDENVSYSAYLFIILYYLIKRGIIEQKFLIQALRCEKTRLDLIDKLYNLAPIISAKIRTYLLAIKRLSEALIEAR 23 | >F0Q4R9 24 | MQRMGMVIGIKPEHIDEYKRLHAAVWPAVLARLAEAHVRNYSIFLREPENLLFGYWEYHGTDYAADMEAIAQDPETRRWWTFCGPCQEPLASRQPGEHWAHMEEVFHVD 25 | >K5BJ73 26 | MSTPQDNANTVHRYLEFVAKGQPDEIAALYADDATVEDPVGSEVHIGRQAIRGFYGNLENVQSRTEVKTLRALGHEVAFYWTLSIGGDEGGMTMDIISVMTFNDDGRIKSMKAYWTPENITQR 27 | >L7T0L4 28 | MPGIAVCNMDSAGGVILPGPNVKCFYKGQPFAVIGCAVAGHGRTPHDSARMIQGSVKMAIAGIPVCLQGSMASCGHTATGRPNLTCGS 29 | >O94408 30 | MLFYSFFKTLIDTEVTVELKNDMSIRGILKSVDQFLNVKLENISVVDASKYPHMAAVKDLFIRGSVVRYVHMSSAYVDTILLADACRRDLANNKRQ 31 | >P00441 32 | MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ 33 | >P04632 34 | MFLVNSFLKGGGGGGGGGGGLGGGLGNVLGGLISGAGGGGGGGGGGGGGGGGGGGGTAMRILGGVISAISEAAAQYNPEPPPPRTHYSNIEANESEEVRQFRRLFAQLAGDDMEVSATELMNILNKVVTRHPDLKTDGFGIDTCRSMVAVMDSDTTGKLGFEEFKYLWNNIKRWQAIYKQFDTDRSGTICSSELPGAFEAAGFHLNEHLYNMIIRRYSDESGNMDFDNFISCLVRLDAMFRAFKSLDKDGTGQIQVNIQEWLQLTMYS 35 
| >P07014 36 | MRLEFSIYRYNPDVDDAPRMQDYTLEADEGRDMMLLDALIQLKEKDPSLSFRRSCREGVCGSDGLNMNGKNGLACITPISALNQPGKKIVIRPLPGLPVIRDLVVDMGQFYAQYEKIKPYLLNNGQNPPAREHLQMPEQREKLDGLYECILCACCSTSCPSFWWNPDKFIGPAGLLAAYRFLIDSRDTETDSRLDGLSDAFSVFRCHSIMNCVSVCPKGLNPTRAIGHIKSMLLQRNA 37 | >P0AG51 38 | MAKTIKITQTRSAIGRLPKHKATLLGLGLRRIGHTVEREDTPAIRGMINAVSFMVKVEE 39 | >P17227 40 | MINLPSLFVPLVGLLFPAVAMASLFLHVEKRLLFSTKKIN 41 | >P39230 42 | MKKFIFATIFALASCAAQPAMAGYDKDLCEWSMTADQTEVETQIEADIMNIVKRDRPEMKAEVQKQLKSGGVMQYNYVLYCDKNFNNKNIIAEVVGE 43 | >P43215 44 | MVAMFLAVAVVLGLATSPTAEGGKATTEEQKLIEDVNASFRAAMATTANVPPADKYKTFEAAFTVSSKRNLADAVSKAPQLVPKLDEVYNAAYNAADHAAPEDKYEAFVLHFSEALRIIAGTPEVHAVKPGA 45 | >P60624 46 | MAAKIRRDDEVIVLTGKDKGKRGKVKNVLSSGKVIVEGINLVKKHQKPVPALNQPGGIVEKEAAIQVSNVAIFNAATGKADRVGFRFEDGKKVRFFKSNSETIK 47 | >P62805 48 | MSGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKVFLENVIRDAVTYTEHAKRKTVTAMDVVYALKRQGRTLYGFGG 49 | >P62937 50 | MVNPTVFFDIAVDGEPLGRVSFELFADKVPKTAENFRALSTGEKGFGYKGSCFHRIIPGFMCQGGDFTRHNGTGGKSIYGEKFEDENFILKHTGPGILSMANAGPNTNGSQFFICTAKTEWLDGKHVVFGKVKEGMNIVEAMERFGSRNGKTSKKITIADCGQLE 51 | >P63073 52 | MATVEPETTPTTNPPPAEEEKTESNQEVANPEHYIKHPLQNRWALWFFKNDKSKTWQANLRLISKFDTVEDFWALYNHIQLSSNLMPGCDYSLFKDGIEPMWEDEKNKRGGRWLITLNKQQRRSDLDRFWLETLLCLIGESFDDYSDDVCGAVVNVRAKGDKIAIWTTECENRDAVTHIGRVYKERLGLPPKIVIGYQSHADTATKSGSTTKNRFVV 53 | >P9WPA4 54 | MTGAVCPGSFDPVTLGHVDIFERAAAQFDEVVVAILVNPAKTGMFDLDERIAMVKESTTHLPNLRVQVGHGLVVDFVRSCGMTAIVKGLRTGTDFEYELQMAQMNKHIAGVDTFFVATAPRYSFVSSSLAKEVAMLGGDVSELLPEPVNRRLRDRLNTERT 55 | >Q01782 56 | MTAPTVPVALVTGAAKRLGRSIAEGLHAEGYAVCLHYHRSAAEANALSATLNARRPNSAITVQADLSNVATAPVSGADGSAPVTLFTRCAELVAACYTHWGRCDVLVNNASSFYPTPLLRNDEDGHEPCVGDREAMETATADLFGSNAIAPYFLIKAFAHRFAGTPAKHRGTNYSIINMVDAMTNQPLLGYTIYTMAKGALEGLTRSAALELAPLQIRVNGVGPGLSVLVDDMPPAVWEGHRSKVPLYQRDSSAAEVSDVVIFLCSSKAKYITGTCVKVDGGYSLTRA 57 | >Q03503 58 | 
MEIVYKPLDIRNEEQFASIKKLIDADLSEPYSIYVYRYFLNQWPELTYIAVDNKSGTPNIPIGCIVCKMDPHRNVRLRGYIGMLAVESTYRGHGIAKKLVEIAIDKMQREHCDEIMLETEVENSAALNLYEGMGFIRMKRMFRYYLNEGDAFKLILPLTEKSCTRSTFLMHGRLAT 59 | >Q07654 60 | MKRVLSCVPEPTVVMAARALCMLGLVLALLSSSSAEEYVGLSANQCAVPAKDRVDCGYPHVTPKECNNRGCCFDSRIPGVPWCFKPLQEAECTF 61 | >Q2G285 62 | MIIKNYSYARQNLKALMTKVNDDSDMVTVTSTDDKNVVIMSESDYNSMMETLYLQQNPNNAEHLAQSIADLERGKTITKDIDV 63 | >Q32904 64 | MATQALVSSSSLTFAAEAVRQSFRARSLPSSVGCSRKGLVRAAATPPVKQGGVDRPLWFASKQSLSYLDGSLPGDYGFDPLGLSDPEGTGGFIEPRWLAYGEVINGRFAMLGAVGAIAPEYLGKVGLIPQETALAWFQTGVIPPAGTYNYWADNYTLFVLEMALMGFAEHRRFQDWAKPGSMGKQYFLGLEKGFGGSGNPAYPGGPFFNPLGFGKDEKSLKELKLKEVKNGRLAMLAILGYFIQGLVTGVGPYQNLLDHVADPVNNNVLTSLKFH 65 | >Q4KCZ1 66 | MDGEEVKEKIRRYIMEDLIGPSAKEDELDDQTPLLEWGILNSMNIVKLMVYIRDEMGVSIPSTHITGKYFKDLNAISRTVEQLKAECA 67 | >Q58380 68 | MEALVLVGHGSRLPYSKELLVKLAEKVKERNLFPIVEIGLMEFSEPTIPQAVKKAIEQGAKRIIVVPVFLAHGIHTTRDIPRLLGLIEDNHEHHHEHSHHHHHHHHHEHEKLEIPEDVEIIYREPIGADDRIVDIIIDRAFGR 69 | >Q709H6 70 | MTRLEVLIRPTEQTAAKANAVGYTHALTWVWHSQTWDVDSVRDPSLRADFNPEKVGWVSVSFACTQCTAHYYTSEQVKYFTNIPPVHFDVVCADCERSVQLDDEIDREHQERNAEISACNARALSEGRPASLVYLSRDACDIPEHSGRCRFVKYLNF 71 | >Q8A5J2 72 | MKENKLDYIPEPMDLSLVDLPESLIQLSERIAENVHEVWAKARIDEGWTYGEKRDDIHKKHPCLVPYDELPEEEKEYDRNTAMNTIKMVKKLGFRIEKED 73 | >Q8LGJ5 74 | MGTDTVMSGRVRKDLSKTNPNGNIPENRSNSRKKIQRRSKKTLICPVQKLFDTCKKVFADGKSGTVPSQENIEMLRAVLDEIKPEDVGVNPKMSYFRSTVTGRSPLVTYLHIYACHRFSICIFCLPPSGVIPLHNHPEMTVFSKLLFGTMHIKSYDWVPDSPQPSSDTRLAKVKVDSDFTAPCDTSILYPADGGNMHCFTAKTACAVLDVIGPPYSDPAGRHCTYYFDYPFSSFSVDGVVVAEEEKEGYAWLKEREEKPEDLTVTALMYSGPTIKE 75 | >Q9BTM1 76 | MSGRGKQGGKVRAKAKSRSSRAGLQFPVGRVHRLLRKGNYAERVGAGAPVYLAAVLEYLTAEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGKVTIAQGGVLPNIQAVLLPKKTESQKTKSK 77 | >Q9H492 78 | MPSDRPFKQRRSFADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF 79 | >Q9I0B9 80 | MDELFEEHLEIAKALFAQRLPYWCDVFLRPADQAFNAYLNARGQASTYLVLEGFDPVYVPRGCDLDAVRATARARARLREAGLGEDALPVLL 81 | >Q9K943 82 | 
MPTPSMEDYLERIYLLIEEKGYARVSDIAEALEVHPSSVTKMVQKLDKSDYLVYERYRGLILTAKGKKIGKRLVYRHDLLEDFLKMIGVDSDHIYEDVEGIEHHLSWDAIDRIGDLVQYFQEDPSRLNDLREVQKKNEE 83 | >Q9KDJ7 84 | MSDEKKILGEERRSLLIKWLKASDTPLTGAELAKRTNVSRQVIVQDVSLLKAKNHPILATAQGYIYMKEANTVQAQRVVACQHGPADMKDELLTLVDHGVLIKDVTVDHPVYGDITASLHLKSRKDVALFCKRMEESNGTLLSTLTKGVHMHTLEAESEAILDEAIRALEEKGYLLNSF 85 | >Q9LFM3 86 | MGAGREVSVSLDGVRDKNLMQLKILNTVLFPVRYNDKYYADAIAAGEFTKLAYYNDICVGAIACRLEKKESGAMRVYIMTLGVLAPYRGIGIGSNLLNHVLDMCSKQNMCEIYLHVQTNNEDAIKFYKKFGFEITDTIQNYYINIEPRDCYVVSKSFAQSEANK 87 | >Q9SJ89 88 | MNLQAVSCSFGFLSSPLGVTPRTSFRRFVIRAKTEPSEKSVEIMRKFSEQYARRSGTYFCVDKGVTSVVIKGLAEHKDSYGAPLCPCRHYDDKAAEVGQGFWNCPCVPMRERKECHCMLFLTPDNDFAGKDQTITSDEIKETTANM 89 | >U2EQ00 90 | MAWLILIIAGIFEVVWAIALKYSNGFTRLIPSMITLIGMLISFYLLSQATKTLPIGTAYAIWTGIGALGAVICGIIFFKEPLTALRIVFMILLLTGIIGLKATSS 91 | >X2D812 92 | MTSERRPADPEIVEGLPIPLAVAGHHQPAPFYLTADMFGGLPVQLAGGELSTLVGKPVAAPHTHPVDELYLLVSPNKGGARIEVQLDGRRHELLSPAVMRIPAGSEHCFLTLEAEVGSYCFGILLGDRL 93 | -------------------------------------------------------------------------------- /data/independent_set/indep_set.txt: -------------------------------------------------------------------------------- 1 | A0A0B0QJR1 2 | A0A0C2W6A5 3 | A0A1Y1BWQ0 4 | A0QP43 5 | B8H4R9 6 | D0CAL0 7 | D0CBZ8 8 | D0CD18 9 | D0CDQ7 10 | E1C9L3 11 | F0NDX5 12 | F0Q4R9 13 | K5BJ73 14 | L7T0L4 15 | O94408 16 | P00441 17 | P04632 18 | P07014 19 | P0AG51 20 | P17227 21 | P39230 22 | P43215 23 | P60624 24 | P62805 25 | P62937 26 | P63073 27 | P9WPA4 28 | Q01782 29 | Q03503 30 | Q07654 31 | Q2G285 32 | Q32904 33 | Q4KCZ1 34 | Q58380 35 | Q709H6 36 | Q8A5J2 37 | Q8LGJ5 38 | Q9BTM1 39 | Q9H492 40 | Q9I0B9 41 | Q9K943 42 | Q9KDJ7 43 | Q9LFM3 44 | Q9SJ89 45 | U2EQ00 46 | X2D812 47 | -------------------------------------------------------------------------------- /data_preparation.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import sys 4 | from collections import 
defaultdict 5 | 6 | from config import FileManager, FileSetter 7 | from assess_performance import PerformanceAssessment 8 | 9 | 10 | class MyDataset(torch.utils.data.Dataset): 11 | 12 | """Dataset for bindEmbed21DL""" 13 | 14 | def __init__(self, samples, embeddings, seqs, labels, max_length, protein_prediction=False): 15 | self.protein_prediction = protein_prediction 16 | self.samples = samples 17 | self.embeddings = embeddings 18 | self.seqs = seqs 19 | 20 | self.labels = labels 21 | self.max_length = max_length 22 | self.n_features = self.get_input_dimensions() 23 | 24 | print('Number of input features: {}'.format(self.n_features)) 25 | 26 | def __len__(self): 27 | return len(self.samples) 28 | 29 | def __getitem__(self, item): 30 | prot_id = self.samples[item] 31 | prot_length = len(self.seqs[prot_id]) 32 | 33 | embedding = self.embeddings[prot_id] 34 | 35 | # pad all inputs to the maximum length & add another feature to encode whether the element is a position 36 | # in the sequence or padded 37 | features = np.zeros((self.n_features + 1, self.max_length), dtype=np.float32) 38 | features[:self.n_features, :prot_length] = np.transpose(embedding) # set feature maps to embedding values 39 | features[self.n_features, :prot_length] = 1 # set last element to 1 because positions are not padded 40 | 41 | target = np.zeros((3, self.max_length), dtype=np.float32) 42 | target[:3, :prot_length] = np.transpose(self.labels[prot_id]) 43 | loss_mask = np.zeros((3, self.max_length), dtype=np.float32) 44 | loss_mask[:3, :prot_length] = 1 * prot_length 45 | 46 | if self.protein_prediction: 47 | return features, target, loss_mask, prot_id 48 | else: 49 | return features, target, loss_mask 50 | 51 | def get_input_dimensions(self): 52 | first_key = list(self.embeddings.keys())[0] 53 | first_embedding = self.embeddings[first_key] 54 | 55 | return np.shape(first_embedding)[1] 56 | 57 | 58 | class ProteinInformation(object): 59 | 60 | @staticmethod 61 | def get_data(ids): 62 | """ 63 
| Get sequences, labels, and maximum length for a set of ids 64 | :param ids: 65 | :return: 66 | """ 67 | 68 | sequences = FileManager.read_fasta(FileSetter.fasta_file()) 69 | max_length = ProteinInformation.determine_max_length(sequences, ids) 70 | labels = ProteinInformation.get_labels(ids, sequences) 71 | 72 | return sequences, max_length, labels 73 | 74 | @staticmethod 75 | def get_data_predictions(ids, fasta_file): 76 | """ 77 | Generate dummy labels for test proteins without annotations to allow re-use of general DataLoader 78 | :param ids: 79 | :param fasta_file: 80 | :return: sequences, max. length, dummy labels 81 | """ 82 | sequences = FileManager.read_fasta(fasta_file) 83 | 84 | max_length = ProteinInformation.determine_max_length(sequences, ids) 85 | labels = dict() 86 | for i in ids: 87 | prot_length = len(sequences[i]) 88 | binding_tensor = np.zeros([prot_length, 3], dtype=np.float32) 89 | labels[i] = binding_tensor 90 | 91 | return sequences, max_length, labels 92 | 93 | @staticmethod 94 | def determine_max_length(sequences, ids): 95 | """Get maximum length in set of sequences""" 96 | max_len = 0 97 | for i in ids: 98 | if len(sequences[i]) > max_len: 99 | max_len = len(sequences[i]) 100 | 101 | return max_len 102 | 103 | @staticmethod 104 | def get_labels(ids, sequences, file_prefix=None): 105 | """ 106 | Read binding residues for metal, nucleic acids, and small molecule binding 107 | :param ids: 108 | :param sequences: 109 | :param file_prefix: If None, files set in FileSetter will be used 110 | :return: 111 | """ 112 | labels = dict() 113 | 114 | if file_prefix is None: 115 | metal_residues = FileManager.read_binding_residues(FileSetter.binding_residues_by_ligand('metal')) 116 | nuclear_residues = FileManager.read_binding_residues(FileSetter.binding_residues_by_ligand('nuclear')) 117 | small_residues = FileManager.read_binding_residues(FileSetter.binding_residues_by_ligand('small')) 118 | else: 119 | metal_residues = 
FileManager.read_binding_residues('{}_metal.txt'.format(file_prefix)) 120 | nuclear_residues = FileManager.read_binding_residues('{}_nuclear.txt'.format(file_prefix)) 121 | small_residues = FileManager.read_binding_residues('{}_small.txt'.format(file_prefix)) 122 | 123 | for prot_id in ids: 124 | prot_length = len(sequences[prot_id]) 125 | binding_tensor = np.zeros([prot_length, 3], dtype=np.float32) 126 | 127 | metal_res = nuc_res = small_res = [] 128 | 129 | if prot_id in metal_residues.keys(): 130 | metal_res = metal_residues[prot_id] 131 | if prot_id in nuclear_residues.keys(): 132 | nuc_res = nuclear_residues[prot_id] 133 | if prot_id in small_residues.keys(): 134 | small_res = small_residues[prot_id] 135 | 136 | metal_residues_0_ind = ProteinInformation._get_zero_based_residues(metal_res) 137 | nuc_residues_0_ind = ProteinInformation._get_zero_based_residues(nuc_res) 138 | small_residues_0_ind = ProteinInformation._get_zero_based_residues(small_res) 139 | 140 | binding_tensor[metal_residues_0_ind, 0] = 1 141 | binding_tensor[nuc_residues_0_ind, 1] = 1 142 | binding_tensor[small_residues_0_ind, 2] = 1 143 | 144 | labels[prot_id] = binding_tensor 145 | 146 | return labels 147 | 148 | @staticmethod 149 | def _get_zero_based_residues(residues): 150 | residues_0_ind = [] 151 | for r in residues: 152 | residues_0_ind.append(int(r) - 1) 153 | 154 | return residues_0_ind 155 | 156 | 157 | class ProteinResults(object): 158 | 159 | def __init__(self, name, bind_cutoff=0.5): 160 | self.name = name 161 | self.labels = np.array([]) 162 | self.predictions = np.array([]) 163 | 164 | self.bind_cutoff = bind_cutoff 165 | # cutoff to define if label is binding/non-binding; default: 0: non-binding, 1:binding 166 | 167 | def set_labels(self, labels): 168 | self.labels = np.array(labels) 169 | 170 | def set_predictions(self, predictions): 171 | self.predictions = np.around(np.array(np.transpose(predictions)), 3) 172 | 173 | def add_predictions(self, predictions): 174 | 
self.predictions = np.add(self.predictions, np.around(predictions, 3)) 175 | 176 | def normalize_predictions(self, norm_factor): 177 | self.predictions = np.around(self.predictions / norm_factor, 3) 178 | 179 | def calc_num_predictions(self, cutoff): 180 | num_predictions = np.count_nonzero(self.predictions >= cutoff, axis=0) 181 | 182 | return num_predictions[0], num_predictions[1], num_predictions[2] 183 | 184 | def get_bound_ligand(self, cutoff): 185 | num_labels = np.count_nonzero(self.labels >= cutoff, axis=0) 186 | 187 | metal = nuclear = small = False 188 | if num_labels[0] > 0: 189 | metal = True 190 | if num_labels[1] > 0: 191 | nuclear = True 192 | if num_labels[2] > 0: 193 | small = True 194 | 195 | return metal, nuclear, small 196 | 197 | def get_predictions_ligand(self, ligand): 198 | 199 | if ligand == 'metal': 200 | return self.predictions[:, 0] 201 | elif ligand == 'nucleic': 202 | return self.predictions[:, 1] 203 | elif ligand == 'small': 204 | return self.predictions[:, 2] 205 | elif ligand == 'overall': 206 | return np.amax(self.predictions, axis=1) 207 | else: 208 | sys.exit('{} is not a valid ligand type'.format(ligand)) 209 | 210 | def calc_performance_measurements(self, cutoff): 211 | performance = self.calc_performance_given_labels(cutoff, self.labels) 212 | 213 | return performance 214 | 215 | def calc_performance_given_labels(self, cutoff, ligand_labels): 216 | """Calculate performance values for this protein""" 217 | performance = defaultdict(dict) 218 | num_ligands = np.shape(ligand_labels)[1] 219 | 220 | if num_ligands > 1: # ligand-type assessment 221 | # calc per-ligand assessment for multi-label prediction 222 | for i in range(0, num_ligands): 223 | tp = fp = tn = fn = 0 224 | cross_pred = [0, 0, 0, 0] 225 | for idx, lig in enumerate(ligand_labels): 226 | if self.predictions[idx, i] >= cutoff: # predicted as binding to this ligand 227 | cross_prediction = False 228 | true_prediction = False 229 | 230 | for j in range(0, 
num_ligands): 231 | if i == j: # same as predicted ligand 232 | if lig[j] >= self.bind_cutoff: # also annotated to this ligand 233 | tp += 1 234 | cross_pred[i] += 1 235 | true_prediction = True 236 | else: 237 | fp += 1 238 | else: 239 | if lig[j] >= self.bind_cutoff and not true_prediction: 240 | cross_pred[j] += 1 241 | cross_prediction = True 242 | 243 | if not true_prediction and not cross_prediction: 244 | # residue is not annotated to bind any of the ligands 245 | cross_pred[3] += 1 246 | else: 247 | if lig[i] >= self.bind_cutoff: # labels are thresholded with bind_cutoff, not the prediction cutoff 248 | fn += 1 249 | else: 250 | tn += 1 251 | 252 | if i == 0: 253 | ligand = 'metal' 254 | elif i == 1: 255 | ligand = 'nucleic' 256 | else: 257 | ligand = 'small' 258 | 259 | bound = False 260 | if (tp + fn) > 0: 261 | bound = True 262 | acc, prec, recall, f1, mcc = PerformanceAssessment.calc_performance_measurements(tp, fp, tn, fn) 263 | # calculate performance measurements for negatives 264 | _, neg_p, neg_r, neg_f1, _ = PerformanceAssessment.calc_performance_measurements(tn, fn, tp, fp) 265 | 266 | performance[ligand] = {'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'acc': acc, 'prec': prec, 267 | 'recall': recall, 'f1': f1, 'neg_prec': neg_p, 'neg_recall': neg_r, 268 | 'neg_f1': neg_f1, 'mcc': mcc, 'bound': bound, 269 | 'cross_prediction': cross_pred} 270 | 271 | # get overall performance 272 | reduced_labels = np.sum(ligand_labels >= self.bind_cutoff, axis=1) 273 | if len(self.predictions.shape) == 1: 274 | reduced_predictions = (self.predictions >= cutoff) 275 | else: 276 | reduced_predictions = np.sum(self.predictions >= cutoff, axis=1) 277 | 278 | tp = np.sum(np.logical_and(reduced_labels > 0, reduced_predictions > 0)) 279 | fp = np.sum(np.logical_and(reduced_labels == 0, reduced_predictions > 0)) 280 | tn = np.sum(np.logical_and(reduced_labels == 0, reduced_predictions == 0)) 281 | fn = np.sum(np.logical_and(reduced_labels > 0, reduced_predictions == 0)) 282 | 283 | acc, prec, recall, f1, mcc =
PerformanceAssessment.calc_performance_measurements(tp, fp, tn, fn) 284 | _, neg_p, neg_r, neg_f1, _ = PerformanceAssessment.calc_performance_measurements(tn, fn, tp, fp) 285 | performance['overall'] = {'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'acc': acc, 'prec': prec, 'recall': recall, 286 | 'f1': f1, 'neg_prec': neg_p, 'neg_recall': neg_r, 'neg_f1': neg_f1, 'mcc': mcc, 287 | 'bound': True, 'cross_prediction': [0, 0, 0, 0]} 288 | 289 | return performance 290 | -------------------------------------------------------------------------------- /develop_bindEmbed21DL.py: -------------------------------------------------------------------------------- 1 | from bindEmbed21DL import BindEmbed21DL 2 | from assess_performance import PerformanceAssessment 3 | from config import FileSetter, FileManager, GeneralInformation 4 | from data_preparation import ProteinInformation 5 | 6 | import sys 7 | from pathlib import Path 8 | 9 | 10 | def main(): 11 | GeneralInformation.seed_all(42) 12 | 13 | keyword = sys.argv[1] 14 | 15 | path = '' # TODO set path to working directory 16 | Path(path).mkdir(parents=True, exist_ok=True) 17 | 18 | cutoff = 0.5 19 | ri = False # Should RI or raw probabilities be written? 
20 | 21 | sequences = FileManager.read_fasta(FileSetter.fasta_file()) 22 | 23 | if keyword == 'optimize-architecture': 24 | params = {'lr': [0.1, 0.05, 0.01], 'betas': [(0.9, 0.999)], 'eps': [1e-8], 'weight_decay': [0.0], 25 | 'features': [512, 256, 128, 64, 32], 'kernel': [3, 5, 7], 'stride': [1], 26 | 'dropout': [0.7, 0.5, 0.4, 0.3, 0.2], 'epochs': [200], 'early_stopping': [True], 27 | 'weights': [[8.9, 7.7, 4.4]]} 28 | 29 | result_file = '{}/cross_validation_results.txt'.format(path) 30 | 31 | BindEmbed21DL.hyperparameter_optimization_pipeline(params, 5, result_file) 32 | 33 | elif keyword == 'best-training': 34 | params = {'lr': 0.01, 'betas': (0.9, 0.999), 'eps': 1e-8, 'weight_decay': 0.0, 'features': 128, 'kernel': 5, 35 | 'stride': 1, 'dropout': 0.7, 'epochs': 200, 'early_stopping': True, 'weights': [8.9, 7.7, 4.4]} 36 | 37 | model_prefix = '{}/trained_model'.format(path) 38 | prediction_folder = '{}/predictions'.format(path) 39 | Path(prediction_folder).mkdir(parents=True, exist_ok=True) 40 | 41 | proteins = BindEmbed21DL.cross_train_pipeline(params, model_prefix, prediction_folder, ri) 42 | 43 | # assess performance 44 | labels = ProteinInformation.get_labels(proteins.keys(), sequences) 45 | model_performances = PerformanceAssessment.combine_protein_performance(proteins, cutoff, labels) 46 | PerformanceAssessment.print_performance_results(model_performances) 47 | 48 | elif keyword == 'testing': 49 | model_prefix = '{}/trained_model'.format(path) 50 | prediction_folder = '{}/predictions_testset/'.format(path) 51 | Path(prediction_folder).mkdir(parents=True, exist_ok=True) 52 | 53 | ids_in = FileSetter.test_ids_in() 54 | fasta_file = FileSetter.fasta_file() 55 | proteins = BindEmbed21DL.prediction_pipeline(model_prefix, cutoff, prediction_folder, ids_in, fasta_file, ri) 56 | 57 | # assess performance 58 | labels = ProteinInformation.get_labels(proteins.keys(), sequences) 59 | model_performances = PerformanceAssessment.combine_protein_performance(proteins, 
cutoff, labels) 60 | PerformanceAssessment.print_performance_results(model_performances) 61 | 62 | 63 | main() 64 | -------------------------------------------------------------------------------- /homology_based_inference.py: -------------------------------------------------------------------------------- 1 | from config import FileSetter 2 | from mmseqs_wrapper import MMSeqs2Wrapper 3 | 4 | import math 5 | from collections import defaultdict 6 | import os 7 | 8 | 9 | class MMseqsPredictor(object): 10 | 11 | def __init__(self, exclude_self_hits): 12 | self.exclude_self_hits = exclude_self_hits 13 | 14 | def get_mmseqs_hits(self, ids, evalue, criterion, set_name, hval=0): 15 | """ 16 | Get MMseqs2 hits using local alignments at a specific E-value (H-value) 17 | :param ids: Ids to get hits for 18 | :param evalue: E-value cutoff to use for (1) MMseqs2 search and (2) filtering of hits 19 | :param criterion: eval or hval (should E-value or H-value be used to filter hits) 20 | :param set_name: Name for the query set; used for output files 21 | :param hval: H-value cutoff to use for filtering hits 22 | :return: 23 | """ 24 | mmseqs_hits = defaultdict(defaultdict) 25 | 26 | set_ids = set(ids) 27 | # calc MMseqs2 alignments 28 | mmseqs_out = MMseqsPredictor.calc_mmseqs_results(evalue, set_name) 29 | 30 | if os.path.exists(mmseqs_out): 31 | # parse MMseqs2 output 32 | with open(mmseqs_out) as read_mmseqs: 33 | for line in read_mmseqs: 34 | splitted_line = line.strip().split() 35 | i = splitted_line[0] 36 | if i in set_ids: 37 | hit = splitted_line[1] 38 | if i != hit or not self.exclude_self_hits: 39 | e_val = float(splitted_line[2]) 40 | 41 | nident = int(splitted_line[3]) # number of identical positions 42 | nmismatch = int(splitted_line[4]) # number of mismatches 43 | 44 | lali = nident + nmismatch 45 | pide = nident / lali 46 | 47 | h_val = MMseqsPredictor._calc_hssp(pide, lali) 48 | 49 | if criterion == 'eval' and e_val <= float(evalue): 50 | # filter hits by E-value 
51 | mmseqs_hits[i][hit] = {'pide': pide, 'lali': lali, 'eval': e_val, 'hval': h_val} 52 | elif criterion == 'hval' and h_val >= float(hval): 53 | # filter hits by H-value 54 | mmseqs_hits[i][hit] = {'pide': pide, 'lali': lali, 'eval': e_val, 'hval': h_val} 55 | 56 | # determine hit with minimal E-value/maximum H-value 57 | best_hits = defaultdict(dict) 58 | for i in mmseqs_hits.keys(): 59 | max_hit = '' 60 | min_eval = 1e5 61 | max_pide = 0 62 | max_hval = -100 63 | 64 | for h in mmseqs_hits[i].keys(): 65 | e_val = mmseqs_hits[i][h]['eval'] 66 | pide = mmseqs_hits[i][h]['pide'] 67 | h_val = mmseqs_hits[i][h]['hval'] 68 | 69 | if criterion == 'eval' and (e_val < min_eval or (e_val <= min_eval and pide > max_pide)): 70 | max_hit = h 71 | min_eval = e_val 72 | max_pide = pide 73 | if criterion == 'hval' and (h_val > max_hval or (h_val >= max_hval and pide > max_pide)): 74 | max_hit = h 75 | max_hval = h_val 76 | max_pide = pide 77 | 78 | if criterion == 'eval': 79 | best_hits[i][max_hit] = min_eval 80 | else: 81 | best_hits[i][max_hit] = max_hval 82 | 83 | return best_hits, mmseqs_out 84 | 85 | @staticmethod 86 | def calc_mmseqs_results(evalue, set_name): 87 | """ 88 | Calculate MMseqs2 alignments using profiles against big80 89 | :param evalue: e-value threshold for final alignments 90 | :param set_name: Meaningful set name to ensure that previous results are not overwritten/re-used 91 | :return: File to which MMseqs2 results were written 92 | """ 93 | 94 | fasta_in = FileSetter.query_set() 95 | aln_out = '{}/mmseqs_big80_results_{}.m8'.format(FileSetter.mmseqs_output(), set_name) 96 | 97 | # mmseqs profile search against big80 98 | profile_db = FileSetter.profile_db() # use pre-computed database from big80 99 | lookup_db = FileSetter.lookup_db() # use pre-computed database of lookup set 100 | 101 | mmseqs = MMSeqs2Wrapper(FileSetter.mmseqs_path(), FileSetter.mmseqs_output(), set_name) 102 | 103 | # create database of the query set 104 | query_db = 
mmseqs.create_db(fasta_in, set_name) 105 | 106 | # create profiles against big80 107 | aln_res = mmseqs.seq2seq_search(query_db, profile_db, num_iterations=2) 108 | query_profiles = mmseqs.alnres2prof(query_db, profile_db, aln_res) 109 | 110 | # search profiles against lookup db 111 | profile_hits_raw = mmseqs.prof2seq_search(query_profiles, lookup_db, 0, evalue) 112 | mmseqs.convertali(query_db, lookup_db, profile_hits_raw, aln_out) 113 | 114 | return aln_out 115 | 116 | @staticmethod 117 | def _calc_hssp(pide, lali): 118 | """ 119 | Calculate H-value 120 | :param pide: Percentage pairwise sequence identity 121 | :param lali: Alignment length (excl. gaps) 122 | :return: 123 | """ 124 | 125 | pide = pide * 100 126 | 127 | if lali <= 11: 128 | return pide - 100 129 | elif lali > 450: 130 | return pide - 19.5 131 | else: 132 | exp = -0.32 * (1 + math.exp(-lali / 1000)) 133 | return pide - (480 * (lali ** exp)) 134 | -------------------------------------------------------------------------------- /human_proteome/hbi_predictions.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/human_proteome/hbi_predictions.tar.gz -------------------------------------------------------------------------------- /human_proteome/ml_predictions.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/human_proteome/ml_predictions.tar.gz -------------------------------------------------------------------------------- /ml_predictor.py: -------------------------------------------------------------------------------- 1 | from data_preparation import MyDataset, ProteinResults 2 | from config import GeneralInformation 3 | 4 | import torch 5 | 6 | 7 | class MLPredictor(object): 8 | 9 | def __init__(self, model): 10 | if 
torch.cuda.is_available(): 11 | self.device = 'cuda:0' 12 | else: 13 | self.device = 'cpu' 14 | 15 | self.model = model.to(self.device) 16 | 17 | def predict_per_protein(self, ids, sequences, embeddings, labels, max_length): 18 | """ 19 | Generate predictions for each protein from a list 20 | :param ids: 21 | :param sequences: 22 | :param embeddings: 23 | :param labels: 24 | :param max_length: 25 | :return: 26 | """ 27 | 28 | validation_set = MyDataset(ids, embeddings, sequences, labels, max_length, protein_prediction=True) 29 | validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=1, shuffle=True, pin_memory=True) 30 | sigm = torch.nn.Sigmoid() 31 | 32 | proteins = dict() 33 | for features, target, mask, prot_id in validation_loader: 34 | prot_id = prot_id[0] 35 | 36 | self.model.eval() 37 | with torch.no_grad(): 38 | features = features.to(self.device) 39 | target = target.to(self.device) 40 | 41 | features_1024 = features[..., :-1, :] 42 | pred = self.model.forward(features_1024) 43 | pred = sigm(pred) 44 | 45 | pred = pred.squeeze() 46 | target = target.squeeze() 47 | features = features.squeeze() 48 | 49 | pred_i, target_i = GeneralInformation.remove_padded_positions(pred, target, features) 50 | pred_i = pred_i.detach().cpu() 51 | 52 | prot = ProteinResults(prot_id) 53 | prot.set_predictions(pred_i) 54 | proteins[prot_id] = prot 55 | 56 | return proteins 57 | -------------------------------------------------------------------------------- /ml_trainer.py: -------------------------------------------------------------------------------- 1 | from sklearn.model_selection import PredefinedSplit 2 | from collections import defaultdict 3 | import itertools as it 4 | 5 | import torch 6 | 7 | from architectures import CNN2Layers 8 | from data_preparation import MyDataset 9 | from assess_performance import ModelPerformance, PerformanceEpochs 10 | from config import FileManager, GeneralInformation 11 | from pytorchtools import EarlyStopping 12 | 13 | 
14 | class MLTrainer(object): 15 | 16 | def __init__(self, pos_weights, batch_size=406): 17 | self.batch_size = batch_size 18 | if torch.cuda.is_available(): 19 | self.device = 'cuda:0' 20 | else: 21 | self.device = 'cpu' 22 | 23 | self.pos_weights = torch.tensor(pos_weights).to(self.device) 24 | 25 | def train_validate(self, params, train_ids, validation_ids, sequences, embeddings, labels, max_length, 26 | verbose=True): 27 | """ 28 | Train & validate predictor for one set of parameters and ids 29 | :param params: 30 | :param train_ids: 31 | :param validation_ids: 32 | :param sequences: 33 | :param embeddings: 34 | :param labels: 35 | :param max_length: 36 | :param verbose: 37 | :return: 38 | """ 39 | 40 | model, train_performance, val_performance = self._train_validate(params, train_ids, validation_ids, sequences, 41 | embeddings, labels, max_length, 42 | verbose=verbose) 43 | 44 | train_loss, train_acc, train_prec, train_recall, train_f1, train_mcc = \ 45 | train_performance.get_performance_last_epoch() 46 | val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc = val_performance.get_performance_last_epoch() 47 | 48 | print("Train loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(train_loss, 49 | train_prec, 50 | train_recall, 51 | train_f1, 52 | train_mcc)) 53 | print("Val loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(val_loss, 54 | val_prec, 55 | val_recall, val_f1, 56 | val_mcc)) 57 | 58 | return model 59 | 60 | def cross_validate(self, params, ids, fold_array, sequences, embeddings, labels, max_length, result_file): 61 | """ 62 | Perform cross-validation to optimize hyperparameters 63 | :param params: 64 | :param ids: 65 | :param fold_array: 66 | :param sequences: 67 | :param embeddings: 68 | :param labels: 69 | :param max_length: 70 | :param result_file: 71 | :return: 72 | """ 73 | ps = PredefinedSplit(fold_array) 74 | 75 | # create parameter grid 76 | param_sets = defaultdict(dict) 77 | sorted_keys 
= sorted(params.keys()) 78 | param_combos = it.product(*(params[s] for s in sorted_keys)) 79 | counter = 0 80 | for p in list(param_combos): 81 | curr_params = list(p) 82 | param_dict = dict(zip(sorted_keys, curr_params)) 83 | param_sets[counter] = param_dict 84 | 85 | counter += 1 86 | 87 | best_score = 0 88 | best_params = dict() # save best parameter set 89 | best_classifier = None # save best classifier 90 | performance = defaultdict(list) # save performance for each parameter combination 91 | 92 | params_counter = 1 93 | 94 | for p in param_sets.keys(): 95 | curr_params = param_sets[p] 96 | print('{}\t{}'.format(params_counter, curr_params)) 97 | 98 | model = None 99 | 100 | train_model_performance = ModelPerformance() 101 | val_model_performance = ModelPerformance() 102 | 103 | for train_index, test_index in ps.split(): 104 | train_ids, validation_ids = ids[train_index], ids[test_index] 105 | 106 | model, train_performance, val_performance = self._train_validate(curr_params, train_ids, validation_ids, 107 | sequences, embeddings, labels, 108 | max_length) 109 | 110 | train_loss, train_acc, train_prec, train_recall, train_f1, train_mcc = \ 111 | train_performance.get_performance_last_epoch() 112 | val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc = val_performance.get_performance_last_epoch() 113 | 114 | train_model_performance.add_single_performance(train_loss, train_acc, train_prec, train_recall, 115 | train_f1, train_mcc) 116 | 117 | val_model_performance.add_single_performance(val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc) 118 | 119 | # take average over all splits 120 | train_loss, train_acc, train_prec, train_recall, train_f1, train_mcc = \ 121 | train_model_performance.get_mean_performance() 122 | val_loss, val_acc, val_prec, val_recall, val_f1, val_mcc = val_model_performance.get_mean_performance() 123 | 124 | performance['train_precision'].append(train_prec) 125 | performance['train_recall'].append(train_recall) 126 | 
performance['train_f1'].append(train_f1) 127 | performance['train_mcc'].append(train_mcc) 128 | performance['train_acc'].append(train_acc) 129 | performance['train_loss'].append(train_loss) 130 | 131 | performance['val_precision'].append(val_prec) 132 | performance['val_recall'].append(val_recall) 133 | performance['val_f1'].append(val_f1) 134 | performance['val_mcc'].append(val_mcc) 135 | performance['val_acc'].append(val_acc) 136 | performance['val_loss'].append(val_loss) 137 | 138 | for param in curr_params.keys(): 139 | performance[param].append(curr_params[param]) 140 | 141 | if val_f1 > best_score: 142 | best_score = val_f1 143 | best_params = curr_params 144 | best_classifier = model 145 | 146 | params_counter += 1 147 | 148 | FileManager.save_cv_results(performance, result_file) 149 | 150 | print(best_score) 151 | print(best_params) 152 | 153 | return best_classifier 154 | 155 | def _train_validate(self, params, train_ids, validation_ids, sequences, embeddings, labels, max_length, 156 | verbose=True): 157 | """ 158 | Train and validate bindEmbed21DL model 159 | :param params: 160 | :param train_ids: 161 | :param validation_ids: 162 | :param sequences: 163 | :param labels: 164 | :param max_length: 165 | :param verbose: 166 | :return: 167 | """ 168 | 169 | # define data sets 170 | train_set = MyDataset(train_ids, embeddings, sequences, labels, max_length) 171 | validation_set = MyDataset(validation_ids, embeddings, sequences, labels, max_length) 172 | 173 | train_loader = torch.utils.data.DataLoader(train_set, batch_size=self.batch_size, shuffle=True, pin_memory=True, 174 | worker_init_fn=GeneralInformation.seed_worker) 175 | validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=self.batch_size, shuffle=True, 176 | pin_memory=True, worker_init_fn=GeneralInformation.seed_worker) 177 | 178 | pos_weights = self.pos_weights.expand(max_length, 3) 179 | pos_weights = pos_weights.t() 180 | 181 | loss_fun = 
torch.nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weights) 182 | sigm = torch.nn.Sigmoid() 183 | 184 | padding = int((params['kernel'] - 1) / 2) 185 | model = CNN2Layers(train_set.get_input_dimensions(), params['features'], params['kernel'], params['stride'], 186 | padding, params['dropout']) 187 | model.to(self.device) 188 | 189 | optim_args = {'lr': params['lr'], 'betas': params['betas'], 'eps': params['eps'], 190 | 'weight_decay': params['weight_decay']} 191 | optimizer = torch.optim.Adamax(model.parameters(), **optim_args) 192 | 193 | checkpoint_file = 'checkpoint_early_stopping.pt' 194 | early_stopping = EarlyStopping(patience=10, delta=0.01, checkpoint_file=checkpoint_file) 195 | 196 | train_performance = PerformanceEpochs() 197 | validation_performance = PerformanceEpochs() 198 | 199 | num_epochs = 0 200 | 201 | for epoch in range(params['epochs']): 202 | if verbose: 203 | print("Epoch {}".format(epoch)) 204 | 205 | train_loss = val_loss = 0 206 | train_loss_count = val_loss_count = 0 207 | train_tp = train_tn = train_fn = train_fp = 0 208 | val_tp = val_tn = val_fn = val_fp = 0 209 | 210 | train_acc = train_prec = train_rec = train_f1 = train_mcc = 0 211 | val_acc = val_prec = val_rec = val_f1 = val_mcc = 0 212 | 213 | # training 214 | model.train() 215 | for in_feature, target, loss_mask in train_loader: 216 | optimizer.zero_grad() 217 | in_feature = in_feature.to(self.device) 218 | in_feature_1024 = in_feature[:, :-1, :] 219 | 220 | target = target.to(self.device) 221 | loss_mask = loss_mask.to(self.device) 222 | pred = model.forward(in_feature_1024) 223 | 224 | # don't consider padded positions for loss calculation 225 | loss_el = loss_fun(pred, target) 226 | loss_el_masked = loss_el * loss_mask 227 | 228 | loss_norm = torch.sum(loss_el_masked) 229 | 230 | train_loss += loss_norm.item() 231 | train_loss_count += in_feature.shape[0] 232 | 233 | for idx, i in enumerate(in_feature): # remove padded positions to calculate tp, fp, tn, fn 234 | 
pred_i, target_i = GeneralInformation.remove_padded_positions(pred[idx], target[idx], i) 235 | 236 | pred_i = sigm(pred_i) 237 | tp, fp, tn, fn, acc, prec, rec, f1, mcc = \ 238 | train_performance.get_performance_batch(pred_i.detach().cpu(), target_i.detach().cpu()) 239 | train_tp += tp 240 | train_fp += fp 241 | train_tn += tn 242 | train_fn += fn 243 | 244 | train_acc += acc 245 | train_prec += prec 246 | train_rec += rec 247 | train_f1 += f1 248 | train_mcc += mcc 249 | 250 | loss_norm.backward() 251 | optimizer.step() 252 | 253 | # validation 254 | model.eval() 255 | with torch.no_grad(): 256 | for in_feature, target, loss_mask in validation_loader: 257 | in_feature = in_feature.to(self.device) 258 | in_feature_1024 = in_feature[:, :-1, :] 259 | 260 | target = target.to(self.device) 261 | loss_mask = loss_mask.to(self.device) 262 | 263 | pred = model.forward(in_feature_1024) 264 | 265 | # don't consider padded position for loss calculation 266 | loss_el = loss_fun(pred, target) 267 | loss_el_masked = loss_el * loss_mask 268 | val_loss += torch.sum(loss_el_masked).item() 269 | val_loss_count += in_feature.shape[0] 270 | 271 | for idx, i in enumerate(in_feature): # remove padded positions to calculate tp, fp, tn, fn 272 | pred_i, target_i = GeneralInformation.remove_padded_positions(pred[idx], target[idx], i) 273 | 274 | pred_i = sigm(pred_i) 275 | 276 | tp, fp, tn, fn, acc, prec, rec, f1, mcc = \ 277 | train_performance.get_performance_batch(pred_i.detach().cpu(), target_i.detach().cpu()) 278 | val_tp += tp 279 | val_fp += fp 280 | val_tn += tn 281 | val_fn += fn 282 | 283 | val_acc += acc 284 | val_prec += prec 285 | val_rec += rec 286 | val_f1 += f1 287 | val_mcc += mcc 288 | 289 | train_loss = train_loss / (train_loss_count * 3) 290 | val_loss = val_loss / (val_loss_count * 3) 291 | 292 | train_acc = train_acc / train_loss_count 293 | train_prec = train_prec / train_loss_count 294 | train_rec = train_rec / train_loss_count 295 | train_f1 = train_f1 / 
train_loss_count 296 | train_mcc = train_mcc / train_loss_count 297 | 298 | val_acc = val_acc / val_loss_count 299 | val_prec = val_prec / val_loss_count 300 | val_rec = val_rec / val_loss_count 301 | val_f1 = val_f1 / val_loss_count 302 | val_mcc = val_mcc / val_loss_count 303 | 304 | if verbose: 305 | print("Train loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(train_loss, 306 | train_prec, 307 | train_rec, 308 | train_f1, 309 | train_mcc)) 310 | print('TP: {}, FP: {}, TN: {}, FN: {}'.format(train_tp, train_fp, train_tn, train_fn)) 311 | print("Val loss: {:.3f}, Prec: {:.3f}, Recall: {:.3f}, F1: {:.3f}, MCC: {:.3f}".format(val_loss, 312 | val_prec, 313 | val_rec, val_f1, 314 | val_mcc)) 315 | print('TP: {}, FP: {}, TN: {}, FN: {}'.format(val_tp, val_fp, val_tn, val_fn)) 316 | 317 | # append average performance for this epoch 318 | train_performance.add_performance_epoch(train_loss, train_mcc, train_prec, train_rec, train_f1, train_acc) 319 | validation_performance.add_performance_epoch(val_loss, val_mcc, val_prec, val_rec, val_f1, val_acc) 320 | 321 | num_epochs += 1 322 | 323 | # stop training if F1 score doesn't improve anymore 324 | if 'early_stopping' in params.keys() and params['early_stopping']: 325 | eval_val = val_f1 * (-1) 326 | # eval_val = val_loss 327 | early_stopping(eval_val, model, verbose) 328 | if early_stopping.early_stop: 329 | break 330 | 331 | if 'early_stopping' in params.keys() and params['early_stopping']: # load best model 332 | model = torch.load(checkpoint_file) 333 | 334 | return model, train_performance, validation_performance 335 | -------------------------------------------------------------------------------- /mmseqs_wrapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Tue Jun 8 11:34:00 2021 5 | 6 | @author: mheinzinger 7 | """ 8 | 9 | import subprocess 10 | import os 11 | 12 | """ 13 | 
Overview of pipeline steps: 14 | 1. Create DBs for queries and lookup 15 | mmseqs createdb test_set.fasta test_set 16 | mmseqs createdb lookup_set.fasta lookup_set 17 | 2. Use sequence-sequence search to search test against lookup 18 | mmseqs search test_set lookup_set alnResults tmp --num-iterations 3 -a 19 | 3. Compute profiles from hits 20 | mmseqs result2profile test_set lookup_set alnResults test_set_profile 21 | 4. Search test set profiles against lookup sequences to detect remote hits 22 | mmseqs search test_set_profile lookup_set alnResults tmp --min-seq-id 0.2 -s 7.5 --max-seqs 10000 23 | """ 24 | 25 | 26 | class MMSeqs2Wrapper(object): 27 | 28 | def __init__(self, mmseqs_path, work_dir, query_set): 29 | self.mmseqs = mmseqs_path 30 | self.work_dir = work_dir # Directory for storing final results 31 | self.tmp_dir = '{}/tmp'.format(self.work_dir) # Directory for storing intermediate results 32 | self.query_set = query_set 33 | 34 | def create_db(self, fasta_p, set_name): 35 | db_res = '{}/{}'.format(self.tmp_dir, set_name) 36 | if not os.path.exists(db_res): 37 | subprocess.call([str(self.mmseqs), "createdb", fasta_p, db_res]) 38 | return db_res 39 | 40 | def seq2seq_search(self, query_db, lookup_db, num_iterations=3): 41 | aln_res = '{}/seq2seq_alnRes_{}'.format(self.tmp_dir, self.query_set) 42 | 43 | subprocess.call([str(self.mmseqs), "search", query_db, lookup_db, aln_res, str(self.tmp_dir), 44 | "--num-iterations", str(num_iterations), "-a"]) 45 | return aln_res 46 | 47 | def alnres2prof(self, query_db, lookup_db, aln_res): 48 | query_profiles = '{}/test_set_profiles_{}'.format(self.tmp_dir, self.query_set) 49 | 50 | subprocess.call([str(self.mmseqs), "result2profile", query_db, lookup_db, aln_res, query_profiles]) 51 | return query_profiles 52 | 53 | def prof2seq_search(self, query_profiles, lookup_db, min_seq_id, e_val, sensitivity=7.5, max_seqs=100000): 54 | aln_res = '{}/prof2seq_alnRes_{}'.format(self.tmp_dir, self.query_set) 55 | 56 | 
subprocess.call([str(self.mmseqs), "search", query_profiles, lookup_db, aln_res, str(self.tmp_dir), 57 | "--min-seq-id", str(min_seq_id), "-s", str(sensitivity), "--max-seqs", str(max_seqs), 58 | "-e", str(e_val), "-a"]) 59 | return aln_res 60 | 61 | def convertali(self, query_db, lookup_db, profile_hits_raw, output_file): 62 | subprocess.call([str(self.mmseqs), "convertalis", query_db, lookup_db, profile_hits_raw, output_file, 63 | "--format-output", "query,target,evalue,nident,mismatch,qstart,tstart,qaln,taln"]) 64 | return output_file 65 | -------------------------------------------------------------------------------- /pytorchtools.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | class EarlyStopping: 6 | """Stops training early if the validation loss doesn't improve within the given patience.""" 7 | def __init__(self, patience=7, verbose=False, delta=0, checkpoint_file='checkpoint.pt'): 8 | """ 9 | Args: 10 | patience (int): How long to wait after the last time the validation loss improved. 11 | Default: 7 12 | verbose (bool): If True, prints a message for each validation loss improvement. 13 | Default: False 14 | delta (float): Minimum change in the monitored quantity to qualify as an improvement. 
15 | Default: 0 16 | """ 17 | self.patience = patience 18 | self.verbose = verbose 19 | self.counter = 0 20 | self.best_score = None 21 | self.early_stop = False 22 | self.val_loss_min = np.inf 23 | self.delta = delta 24 | 25 | self.checkpoint_file = checkpoint_file 26 | 27 | def __call__(self, val_loss, model, verbose=True): 28 | 29 | score = -val_loss 30 | 31 | if self.best_score is None: 32 | self.best_score = score 33 | self.save_checkpoint(val_loss, model) 34 | elif score < self.best_score + self.delta: 35 | self.counter += 1 36 | if verbose: 37 | print(f'EarlyStopping counter: {self.counter} out of {self.patience}') 38 | if self.counter >= self.patience: 39 | self.early_stop = True 40 | else: 41 | self.best_score = score 42 | self.save_checkpoint(val_loss, model) 43 | self.counter = 0 44 | 45 | def save_checkpoint(self, val_loss, model): 46 | """Saves model when validation loss decreases.""" 47 | 48 | if self.verbose: 49 | print(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). 
Saving model ...') 50 | torch.save(model, self.checkpoint_file) 51 | self.val_loss_min = val_loss 52 | -------------------------------------------------------------------------------- /run_bindEmbed21.py: -------------------------------------------------------------------------------- 1 | import numpy 2 | from pathlib import Path 3 | 4 | from config import FileManager, FileSetter 5 | from data_preparation import ProteinInformation, ProteinResults 6 | 7 | from annotation_transfer import Inference 8 | from bindEmbed21DL import BindEmbed21DL 9 | 10 | 11 | def main(): 12 | 13 | set_name = 'example' # TODO set a meaningful set name used for output files 14 | query_fasta = FileSetter.query_set() 15 | query_sequences = FileManager.read_fasta(query_fasta) 16 | query_ids = list(query_sequences.keys()) 17 | 18 | print('Run predictions using homology-based inference (bindEmbed21HBI)') 19 | result_folder = FileSetter.mmseqs_output() 20 | Path(result_folder).mkdir(parents=True, exist_ok=True) 21 | 22 | e_val = '1e-3' # E-value used for bindEmbed21HBI, can be changed 23 | proteins_inferred = dict() 24 | 25 | # inference set with annotations 26 | hbi_sequences = FileManager.read_fasta(FileSetter.lookup_fasta()) 27 | labels_hbi = ProteinInformation.get_labels(list(hbi_sequences.keys()), hbi_sequences) 28 | 29 | inference = Inference(query_ids, labels_hbi, query_sequences) 30 | # run HBI and don't allow self-hits 31 | inferred_labels, inferred_proteins = inference.infer_binding_annotations_seq(e_val, 'eval', set_name) 32 | 33 | missing_proteins = [] 34 | for p in query_ids: 35 | if p in inferred_proteins: 36 | prot = ProteinResults(p, 3) 37 | prot.set_predictions(numpy.transpose(inferred_labels[p])) 38 | proteins_inferred[p] = prot 39 | else: 40 | missing_proteins.append(p) 41 | 42 | print('Number of proteins with hit: {}'.format(len(proteins_inferred.keys()))) 43 | 44 | print('Run predictions using Machine Learning for remaining proteins (bindEmbed21DL)') 45 | model_prefix = 
'trained_models/checkpoint' 46 | ri = True # Whether to write RI or Probabilities 47 | 48 | predicted_proteins = BindEmbed21DL.prediction_pipeline(model_prefix, 0.5, None, missing_proteins, query_fasta, 49 | ri) 50 | 51 | print("Write predictions") 52 | proteins = {**proteins_inferred, **predicted_proteins} 53 | prediction_folder = '{}/predictions'.format(result_folder) 54 | Path(prediction_folder).mkdir(parents=True, exist_ok=True) 55 | FileManager.write_predictions(proteins, prediction_folder, 0.5, ri) 56 | 57 | 58 | main() 59 | -------------------------------------------------------------------------------- /run_bindEmbed21DL.py: -------------------------------------------------------------------------------- 1 | from config import FileSetter, FileManager 2 | from bindEmbed21DL import BindEmbed21DL 3 | 4 | from pathlib import Path 5 | 6 | 7 | def main(): 8 | 9 | prediction_folder = FileSetter.predictions_folder() 10 | Path(prediction_folder).mkdir(parents=True, exist_ok=True) 11 | 12 | model_prefix = 'trained_models/checkpoint' 13 | 14 | query_fasta = FileSetter.query_set() 15 | query_sequences = FileManager.read_fasta(query_fasta) 16 | query_ids = list(query_sequences.keys()) 17 | 18 | ri = True # Whether to write RI or Probabilities 19 | 20 | BindEmbed21DL.prediction_pipeline(model_prefix, 0.5, prediction_folder, query_ids, query_fasta, ri) 21 | 22 | 23 | main() 24 | -------------------------------------------------------------------------------- /run_bindEmbed21HBI.py: -------------------------------------------------------------------------------- 1 | import numpy 2 | from pathlib import Path 3 | 4 | from config import FileManager, FileSetter 5 | from data_preparation import ProteinInformation, ProteinResults 6 | 7 | from annotation_transfer import Inference 8 | 9 | 10 | def main(): 11 | 12 | set_name = 'example' # TODO set a meaningful set name used for output files 13 | query_fasta = FileSetter.query_set() 14 | query_sequences = 
FileManager.read_fasta(query_fasta) 15 | query_ids = list(query_sequences.keys()) 16 | 17 | result_folder = FileSetter.mmseqs_output() 18 | Path(result_folder).mkdir(parents=True, exist_ok=True) 19 | 20 | e_val = '1e-3' # E-value used for bindEmbed21HBI, can be changed 21 | proteins_inferred = dict() 22 | 23 | # inference set with annotations 24 | hbi_sequences = FileManager.read_fasta(FileSetter.lookup_fasta()) 25 | labels_hbi = ProteinInformation.get_labels(list(hbi_sequences.keys()), hbi_sequences) 26 | 27 | inference = Inference(query_ids, labels_hbi, query_sequences) 28 | # run HBI and don't allow self-hits 29 | inferred_labels, inferred_proteins = inference.infer_binding_annotations_seq(e_val, 'eval', set_name) 30 | 31 | for p in inferred_proteins: 32 | prot = ProteinResults(p, 3) 33 | prot.set_predictions(numpy.transpose(inferred_labels[p])) 34 | proteins_inferred[p] = prot 35 | 36 | print('Number of proteins with hit: {}'.format(len(proteins_inferred.keys()))) 37 | 38 | # write predictions 39 | predictions_folder = '{}/hbi_predictions/'.format(FileSetter.mmseqs_output()) 40 | Path(predictions_folder).mkdir(parents=True, exist_ok=True) 41 | FileManager.write_predictions(proteins_inferred, predictions_folder, 0.5, True) 42 | 43 | 44 | main() 45 | -------------------------------------------------------------------------------- /trained_models/checkpoint1.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint1.pt -------------------------------------------------------------------------------- /trained_models/checkpoint2.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint2.pt -------------------------------------------------------------------------------- 
/trained_models/checkpoint3.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint3.pt -------------------------------------------------------------------------------- /trained_models/checkpoint4.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint4.pt -------------------------------------------------------------------------------- /trained_models/checkpoint5.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rostlab/bindPredict/aa7acfff3b99b3dcc737a22ab5bd1abff3a0efdb/trained_models/checkpoint5.pt -------------------------------------------------------------------------------- /utils/convert_embeddings.py: -------------------------------------------------------------------------------- 1 | """ 2 | Script to convert embeddings generated with the bio_embeddings pipeline in the expected format 3 | """ 4 | 5 | import h5py 6 | import numpy as np 7 | import sys 8 | 9 | 10 | def main(): 11 | embeddings_in = sys.argv[1] 12 | embeddings_out = sys.argv[2] 13 | 14 | with h5py.File(embeddings_in, 'r') as f_in: 15 | with h5py.File(embeddings_out, 'w') as f_out: 16 | for key, embedding in f_in.items(): 17 | original_id = embedding.attrs['original_id'] 18 | embedding = np.array(embedding) 19 | 20 | f_out.create_dataset(original_id, data=embedding) 21 | 22 | 23 | main() 24 | --------------------------------------------------------------------------------
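The hit filtering in `homology_based_inference.py` relies on the H-value computed by `MMseqsPredictor._calc_hssp`. The following is a minimal standalone sketch of that calculation and of the E-value/H-value filter, useful for checking thresholds without running MMseqs2. The names `hssp_distance` and `passes_filter` are hypothetical helpers introduced here for illustration only; they are not part of the repository.

```python
import math


def hssp_distance(pide_fraction, lali):
    """Distance of an alignment from the HSSP curve (the H-value).

    pide_fraction: pairwise sequence identity as a fraction in [0, 1]
    lali: alignment length excluding gaps (identities + mismatches)
    """
    pide = pide_fraction * 100  # the curve is defined on percentages

    if lali <= 11:
        # very short alignments: the curve sits at 100% identity
        return pide - 100
    if lali > 450:
        # long alignments: the curve flattens at 19.5% identity
        return pide - 19.5

    exponent = -0.32 * (1 + math.exp(-lali / 1000))
    return pide - 480 * lali ** exponent


def passes_filter(e_val, h_val, criterion, evalue_cutoff, hval_cutoff=0):
    """Mirror of the two filter branches: keep a hit if it meets the chosen cutoff."""
    if criterion == 'eval':
        return e_val <= float(evalue_cutoff)
    if criterion == 'hval':
        return h_val >= float(hval_cutoff)
    return False  # unknown criterion: no hit is kept


if __name__ == '__main__':
    # 50% identity over 100 aligned residues lies well above the curve
    print(hssp_distance(0.5, 100))
    print(passes_filter(1e-5, hssp_distance(0.5, 100), 'eval', '1e-3'))
```

A positive H-value means the alignment lies above the curve; with the default `hval=0` cutoff only such hits would be kept when `criterion == 'hval'`.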