├── .gitignore ├── LICENSE ├── contributors.md ├── future.md ├── imgs ├── alphafold_preds.png ├── angle_preds.png ├── elu_resnet_2d.png ├── our_preds.png └── ramachandran_plot.png ├── implementation_details.md ├── models ├── angles │ ├── predicting_angles.ipynb │ ├── resnet_1d_angles.h5 │ └── resnet_1d_angles.py ├── distance_pipeline │ ├── Tutorials │ │ ├── README.pdf │ │ ├── RR_format.py │ │ └── modify_pssm.py │ ├── distance_generator_data.py │ ├── elu_resnet_2d_distances.py │ ├── evaluation_pipeline.py │ ├── func_utils.py │ ├── images │ │ ├── FM_T0869_CASP12.jpg │ │ ├── golden_img_v91_17.png │ │ ├── golden_img_v91_20.png │ │ ├── golden_img_v91_32.png │ │ ├── golden_img_v91_45.png │ │ ├── golden_img_v91_50.png │ │ ├── golden_img_v91_54.png │ │ ├── golden_img_v91_55.png │ │ └── golden_img_v91_56.png │ ├── models │ │ └── tester_28_lxl_golden.h5 │ ├── pipeline_caller.py │ ├── pretrain_model_pssm_l_x_l.ipynb │ ├── record.txt │ └── training_pipeline.py ├── new_distances │ ├── distance_generator_changes.py │ ├── elu_resnet_2d_distances.py │ ├── pretrain_model.ipynb │ ├── resume_training.ipynb │ └── trained_models_h5 │ │ ├── 17_test.h5 │ │ └── tester_28.h5 └── old_distances │ ├── elu_resnet_2d_distances.h5 │ ├── elu_resnet_2d_distances.py │ └── predicting_distances.ipynb ├── preprocessing ├── angle_data_preparation_py.ipynb ├── get_angles_from_coords_py.ipynb ├── get_proteins_under_200aa.jl └── julia_get_proteins_under_200aa.ipynb ├── readme.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # exclude everything 2 | data/* 3 | minifold_journey.md 4 | models/__pycache__/* 5 | models/.ipynb_checkpoints/* 6 | preprocessing/.ipynb_checkpoints/* 7 | # exception to the rule 8 | # !somefolder/.gitkeep -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 
2019 Eric Alcaide 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /contributors.md: -------------------------------------------------------------------------------- 1 | # Contributors 2 | Here's a list of people who have made code contributions to this project: 3 | * [@EricAlcaide](https://github.com/EricAlcaide) 4 | * [@McMenemy](https://github.com/McMenemy) 5 | * [@roberCO](https://github.com/roberCO) 6 | * [@pabloAMC](https://github.com/PabloAMC) 7 | -------------------------------------------------------------------------------- /future.md: -------------------------------------------------------------------------------- 1 | # Project Future - MiniFold 2 | 3 | In a brief way, some promising ideas: 4 | 5 | * Train with crops of 64x64, not windows of 200x200 (and average at prediction time). 
6 | * Use data from Multiple Sequence Alignments (MSA) such as paired changes between AAs. 7 | * Use distance map as potential input for angle prediction (or vice versa?). 8 | * Use physicochemical features of AAs as input. 9 | * Train with more data (in the cloud?) 10 | * Use predictions as constraints to the Rosetta Method for Protein Structure Prediction 11 | * Set up a prediction script/function from raw text/FASTA file 12 | * ... 13 | 14 | *"Science is a Work In Progress."* 15 | 16 | ## Contribute 17 | Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request. 18 | Clone this project to your computer: 19 | 20 | `git clone https://github.com/EricAlcaide/MiniFold` 21 | 22 | By participating in this project, you agree to abide by the thoughtbot [code of conduct](https://thoughtbot.com/open-source-code-of-conduct) 23 | 24 | ## Meta 25 | 26 | * **Author's GitHub Profile**: [Eric Alcaide](https://github.com/EricAlcaide/) 27 | * **Twitter**: [@eric_alcaide](https://twitter.com/eric_alcaide) 28 | * **Email**: ericalcaide1@gmail.com -------------------------------------------------------------------------------- /imgs/alphafold_preds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/imgs/alphafold_preds.png -------------------------------------------------------------------------------- /imgs/angle_preds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/imgs/angle_preds.png -------------------------------------------------------------------------------- /imgs/elu_resnet_2d.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/imgs/elu_resnet_2d.png -------------------------------------------------------------------------------- /imgs/our_preds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/imgs/our_preds.png -------------------------------------------------------------------------------- /imgs/ramachandran_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/imgs/ramachandran_plot.png -------------------------------------------------------------------------------- /implementation_details.md: -------------------------------------------------------------------------------- 1 | # Implementation Details 2 | 3 | ## Proposed Architecture 4 | 5 | The methods implemented are inspired by DeepMind's original post. Two different residual neural networks (ResNets) are used to predict the **angles** between adjacent amino acids (AAs) and the **distance** between every pair of AAs of a protein. A 2D ResNet was used for distance prediction and a 1D ResNet for angle prediction. 6 | 7 |
8 | 9 |
10 | 11 | Image from DeepMind's original blog post. 12 | 13 | ### Distance prediction 14 | 15 | The ResNet for distance prediction is a 2D ResNet that takes input tensors of shape LxLxN (an ordinary RGB image would be LxLx3). The window length L is set to 200: we only train on and predict proteins shorter than 200 AAs, and smaller proteins are padded to the window size. Neither larger proteins nor crops of larger proteins are used. 16 | 17 | The 41 input channels (N) are distributed as follows: 20 for the one-hot encoding of the AAs (LxLx20), 1 for the Van der Waals radius of the encoded AA, and 20 for the Position Specific Scoring Matrix (PSSM). 18 | 19 | The network consists of stacks of residual blocks with the architecture illustrated below, with blocks cycling through strides of 1, 2, 4 and 8, plus an initial ordinary convolutional layer and a final convolutional layer to which a softmax activation is applied to produce an output of shape LxLx7 (6 classes for different distance ranges plus 1 "trash" class for the padding, which is penalized less). 20 | 21 |
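As a rough illustration of the LxLx7 target described above, the sketch below bins a distance matrix into 6 distance classes plus one padding class and one-hot encodes the result. It is a minimal sketch, not the repository's implementation: the bin edges in `CUTS`, the choice of index 0 for the padding class, and the function name `one_hot_distance_map` are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical bin edges (in Angstroms); the cut points actually used are not stated here.
CUTS = [4.0, 6.0, 8.0, 10.0, 12.0]   # 5 edges -> 6 distance classes
N_CLASSES = len(CUTS) + 2            # 6 distance classes + 1 padding class = 7
PAD_CLASS = 0                        # assumption: padding class stored at index 0


def one_hot_distance_map(dist, window=200):
    """Turn an n x n distance matrix into a window x window x 7 one-hot tensor.

    Positions outside the protein (i >= n or j >= n) are assigned to the padding class.
    """
    n = len(dist)
    tensor = []
    for i in range(window):
        row = []
        for j in range(window):
            if i < n and j < n:
                # Distance classes occupy indices 1..6
                cls = 1 + bisect_right(CUTS, dist[i][j])
            else:
                cls = PAD_CLASS
            vec = [0] * N_CLASSES
            vec[cls] = 1
            row.append(vec)
        tensor.append(row)
    return tensor
```

Keeping the padding in its own class lets the loss weight it separately from the real distance classes, which is one way to get the "penalized less" behaviour mentioned above.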
22 | 23 |
24 | 25 | Architecture of the residual block used: a mini version of the block in [this description](http://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf). 26 | 27 | The network was trained on 134 proteins and evaluated on 16 more. This is clearly insufficient data, but memory constraints did not allow for more. By comparison, AlphaFold was trained on 29k proteins. 28 | 29 | The output of the network is therefore a classification of each pair of AAs into one of 6 classes, each corresponding to a range of distances. Below is an example of the distances predicted by AlphaFold and those predicted by our model: 30 | 31 |
32 | 33 |
34 | Ground truth (left) and predicted distances (right) by AlphaFold. 35 | 36 |
37 | 38 |
39 | Ground truth (left) and predicted distances (right) by MiniFold. 40 | 41 | The architecture of the residual network for distance prediction is very similar to AlphaFold's. The main difference is that the model described here was trained with windows of 200x200 AAs, while AlphaFold was trained with crops of 64x64 AAs. At prediction time, AlphaFold used the smaller window size to average across overlapping outputs and obtain a smoother result. Our prediction, by contrast, comes from a single window, so there is no averaging and the predictions are noisier. 42 | 43 | 44 | ### Angles prediction 45 | 46 | The ResNet for angle prediction is a 1D ResNet that takes input tensors of shape LxN. The window length (L) is set to 34, and we only train on and predict angles for proteins with fewer than 200 AAs. Neither larger proteins nor crops of larger proteins are used. 47 | 48 | The 42 input channels (N) are distributed as follows: 20 for the one-hot encoding of the AAs (Lx20), 2 for the Van der Waals radius and the surface accessibility of the encoded AA, and 20 for the Position Specific Scoring Matrix (PSSM). 49 | 50 | We followed the ResNet20 architecture but replaced the 2D convolutions with 1D convolutions. The network output is a vector of 4 numbers representing the `sin` and `cos` of the 2 dihedral angles between two AAs (Phi and Psi). 51 | 52 | Dihedral angles were extracted from the raw coordinates of the protein backbone atoms (the backbone nitrogen, C-alpha and carbonyl carbon of each AA). The plot of Phi against Psi is known as a Ramachandran plot. 53 | 54 |
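The dihedral extraction and the `sin`/`cos` encoding described above can be sketched as follows. This is a minimal standard-library sketch under stated assumptions, not the notebook's code: the helper names (`dihedral`, `encode_angles`, `decode_angles`) are hypothetical, and the order of the four output values is an assumption, since the text does not state it. For reference, Phi of residue i is the dihedral of C(i-1), N(i), C-alpha(i), C(i), and Psi is the dihedral of N(i), C-alpha(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return [a[k] - b[k] for k in range(3)]


def _dot(a, b):
    return sum(a[k] * b[k] for k in range(3))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians defined by four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Components of b0 and b2 perpendicular to the central bond b1
    v = _sub(b0, [_dot(b0, b1) * c for c in b1])
    w = _sub(b2, [_dot(b2, b1) * c for c in b1])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.atan2(y, x)


def encode_angles(phi, psi):
    """Encode two angles as 4 numbers (assumed order: sin/cos of Phi, then of Psi)."""
    return [math.sin(phi), math.cos(phi), math.sin(psi), math.cos(psi)]


def decode_angles(vec):
    """Recover (phi, psi) in radians from the 4-number encoding."""
    return math.atan2(vec[0], vec[1]), math.atan2(vec[2], vec[3])
```

Predicting `sin` and `cos` rather than the raw angle avoids the discontinuity at ±180°, where two nearly identical conformations would otherwise have very different targets. The sign convention of `dihedral` should be checked against the one used in preprocessing before mixing the two.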
55 | 56 |
57 | The cluster in the upper-left region corresponds to the angles between AAs that form a beta-sheet, while the cluster in the central-left region corresponds to the angles between AAs that form an alpha-helix. 58 | 59 | The model's predictions can be observed below: 60 |
61 | 62 |
63 | 64 | The network was trained on 38.7k crops from 600 different proteins and evaluated on some 4.3k more. 65 | 66 | The architecture of the residual network is different from the one implemented in AlphaFold. The model implemented here was inspired by [this paper](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324) and [this one](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205819). 67 | 68 | ## Results 69 | While the architectures implemented in this first preliminary version of the project are inspired by papers with great results, the results obtained here are not as good as they could be. The models were probably held back by the lack of Multiple Sequence Alignment (MSA) data and MSA-based features, the lack of physicochemical properties of AAs (beyond the Van der Waals radius), the limited model and feature engineering, and the small amount of training data. 70 | 71 | For that reason, we conclude that this has been a somewhat naive approach, and we expect to implement further ideas and improvements in these models. As the DeepMind team says: *"With few or no alignments accuracy is much worse"*. It would be interesting to use the predictions made by the models as constraints for a folding algorithm (e.g. Rosetta) in order to visualize our results. 
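One way to hand distance predictions to a folding tool as constraints is a CASP RR-style contact file, which is what `models/distance_pipeline/Tutorials/RR_format.py` in this repository writes. The sketch below follows the same layout (header, sequence, `i j 0 8 p` lines, `END`), but it is only an illustration: the function name `rr_text` and its parameters are hypothetical, and it assumes a per-pair contact probability has already been derived from the class probabilities (for example by summing the classes below 8 Å, which depends on bin edges not stated here).

```python
RR_HEADER = "PFRMAT RR\nMODEL 1\n"


def rr_text(seq, contact_prob, min_sep=5, max_dist=8):
    """Build RR-style text from an L x L matrix of contact probabilities.

    Only pairs at least `min_sep` residues apart are written, using
    1-based residue indices, one 'i j 0 max_dist probability' line per pair.
    """
    n = len(seq)
    lines = []
    for i in range(n):
        for j in range(i + min_sep, n):
            lines.append(f"{i + 1} {j + 1} 0 {max_dist} {contact_prob[i][j]:.3f}")
    return RR_HEADER + seq + "\n" + "\n".join(lines) + "\nEND\n"
```

Writing the returned string to a file gives a candidate input for tools that accept RR contact lists; whether a particular folding program accepts it directly should be checked against that program's documentation.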
72 | 73 | ## References 74 | * [DeepMind original blog post](https://deepmind.com/blog/alphafold/) 75 | * [AlphaFold @ CASP13: “What just happened?”](https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/#s2.2) 76 | * [Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324) 77 | * [AlphaFold slides](http://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf) 78 | * [De novo protein structure prediction using ultra-fast molecular dynamics simulation](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205819) 79 | -------------------------------------------------------------------------------- /models/angles/resnet_1d_angles.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/angles/resnet_1d_angles.h5 -------------------------------------------------------------------------------- /models/angles/resnet_1d_angles.py: -------------------------------------------------------------------------------- 1 | # Import libraries 2 | import keras 3 | import keras.backend as K 4 | from keras.models import Model 5 | # Optimizer and regularization 6 | from keras.regularizers import l2 7 | from keras.losses import mean_squared_error, mean_absolute_error 8 | # Keras layers 9 | from keras.layers.convolutional import Conv1D 10 | from keras.layers import Dense, Dropout, Flatten, Input, BatchNormalization, Activation 11 | from keras.layers.pooling import MaxPooling1D, AveragePooling1D 12 | 13 | 14 | def custom_mse_mae(y_true, y_pred): 15 | """ Custom loss function - MSE + MAE """ 16 | return mean_squared_error(y_true, y_pred)+mean_absolute_error(y_true, y_pred) 17 | 18 | def resnet_layer(inputs, 19 | num_filters=16, 20 | kernel_size=3, 21 | strides=1, 22 | 
activation='relu', 23 | batch_normalization=True, 24 | conv_first=False): 25 | """1D BN-ReLU-Conv (ResNet pre-activation structure) or Conv-BN-ReLU stack builder 26 | 27 | # Arguments 28 | inputs (tensor): input tensor from the input layer or previous layer 29 | num_filters (int): Conv1D number of filters 30 | kernel_size (int): Conv1D kernel size 31 | strides (int): Conv1D stride 32 | activation (string): activation name 33 | batch_normalization (bool): whether to include batch normalization 34 | conv_first (bool): conv-bn-activation (True) or 35 | bn-activation-conv (False) 36 | 37 | # Returns 38 | x (tensor): tensor as input to the next layer 39 | """ 40 | conv = Conv1D(num_filters, 41 | kernel_size=kernel_size, 42 | strides=strides, 43 | padding='same', 44 | kernel_initializer='he_normal', 45 | kernel_regularizer=l2(1e-4)) 46 | 47 | x = inputs 48 | if conv_first: 49 | x = conv(x) 50 | if batch_normalization: 51 | x = BatchNormalization()(x) 52 | if activation is not None: 53 | x = Activation(activation)(x) 54 | else: 55 | if batch_normalization: 56 | x = BatchNormalization()(x) 57 | if activation is not None: 58 | x = Activation(activation)(x) 59 | x = conv(x) 60 | return x 61 | 62 | def resnet_v2(input_shape, depth, num_classes=4, conv_first=True): 63 | """ResNet Version 2 Model builder [b] 64 | 65 | Stacks of (1)-(3)-(1) BN-ReLU-Conv1D, also known as 66 | bottleneck layer 67 | First shortcut connection per layer is a kernel-size-1 Conv1D. 68 | Second and onwards shortcut connection is identity. 69 | At the beginning of each stage, the feature map size is halved (downsampled) 70 | by a convolutional layer with strides=2, while the number of filter maps is 71 | doubled. Within each stage, the layers have the same number of filters and the 72 | same filter map sizes. 
73 | Features maps sizes: 74 | conv1 : 32, 16 75 | stage 0: 32, 64 76 | stage 1: 16, 128 77 | stage 2: 8, 256 78 | 79 | # Arguments 80 | input_shape (tensor): shape of input image tensor 81 | depth (int): number of core convolutional layers 82 | num_classes (int): number of classes (CIFAR10 has 10) 83 | 84 | # Returns 85 | model (Model): Keras model instance 86 | """ 87 | if (depth - 2) % 9 != 0: 88 | raise ValueError('depth should be 9n+2 (eg 56 or 110 in [b])') 89 | # Start model definition. 90 | num_filters_in = 16 91 | num_res_blocks = int((depth - 2) / 9) 92 | 93 | inputs = Input(shape=input_shape) 94 | # v2 performs Conv1D with BN-ReLU on input before splitting into 2 paths 95 | x = resnet_layer(inputs=inputs, 96 | num_filters=num_filters_in, 97 | conv_first=True) 98 | 99 | # Instantiate the stack of residual units 100 | for stage in range(3): 101 | for res_block in range(num_res_blocks): 102 | activation = 'relu' 103 | batch_normalization = True 104 | strides = 1 105 | if stage == 0: 106 | num_filters_out = num_filters_in * 4 107 | if res_block == 0: # first layer and first stage 108 | activation = None 109 | batch_normalization = False 110 | else: 111 | num_filters_out = num_filters_in * 2 112 | if res_block == 0: # first layer but not first stage 113 | strides = 2 # downsample 114 | 115 | # bottleneck residual unit 116 | y = resnet_layer(inputs=x, 117 | num_filters=num_filters_in, 118 | kernel_size=1, 119 | strides=strides, 120 | activation=activation, 121 | batch_normalization=batch_normalization, 122 | conv_first=conv_first) 123 | y = resnet_layer(inputs=y, 124 | num_filters=num_filters_in, 125 | conv_first=conv_first) 126 | y = resnet_layer(inputs=y, 127 | num_filters=num_filters_out, 128 | kernel_size=1, 129 | conv_first=conv_first) 130 | if res_block == 0: 131 | # linear projection residual shortcut connection to match 132 | # changed dims 133 | x = resnet_layer(inputs=x, 134 | num_filters=num_filters_out, 135 | kernel_size=1, 136 | strides=strides, 
137 | activation=None, 138 | batch_normalization=False) 139 | x = keras.layers.add([x, y]) 140 | 141 | num_filters_in = num_filters_out 142 | 143 | # Add classifier on top. 144 | # v2 has BN-ReLU before Pooling 145 | x = BatchNormalization()(x) 146 | x = Activation('relu')(x) 147 | x = AveragePooling1D(pool_size=3)(x) 148 | y = Flatten()(x) 149 | outputs = Dense(num_classes, 150 | activation='linear', 151 | kernel_initializer='he_normal')(y) 152 | 153 | # Instantiate model. 154 | model = Model(inputs=inputs, outputs=outputs) 155 | return model 156 | 157 | # Check it's working 158 | if __name__ == "__main__": 159 | # Using AMSGrad optimizer for speed 160 | kernel_size, filters = 3, 16 161 | adam = keras.optimizers.Adam(amsgrad=True) 162 | # Create model 163 | model = resnet_v2(input_shape=(17*2,41), depth=20, num_classes=4) 164 | model.compile(optimizer=adam, loss=custom_mse_mae, 165 | metrics=["mean_absolute_error", "mean_squared_error"]) 166 | model.summary() 167 | print("Model file works perfectly") 168 | -------------------------------------------------------------------------------- /models/distance_pipeline/Tutorials/README.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/Tutorials/README.pdf -------------------------------------------------------------------------------- /models/distance_pipeline/Tutorials/RR_format.py: -------------------------------------------------------------------------------- 1 | RR_FORMAT = """PFRMAT RR 2 | MODEL 1 3 | """ 4 | def save_rr(seq,sample_pred,file_name_rr): 5 | my_file=open(file_name_rr,'w') 6 | my_file.write(RR_FORMAT) 7 | my_file.write(seq + '\n') 8 | for i in range(0,len(seq)): 9 | for j in range(i+5,len(seq)): 10 | print(max(sample_pred[0][i][j])) 11 | my_file.write(str(i+1)+" "+ str(j+1)+" "+"0 8 "+ str(max(sample_pred[0][i][j]))+"\n") 12 | my_file.write('END\n') 
13 | my_file.close() 14 | save_rr(seq,sample_pred,'file_name_rr') 15 | -------------------------------------------------------------------------------- /models/distance_pipeline/Tutorials/modify_pssm.py: -------------------------------------------------------------------------------- 1 | with open("file_name.pssm", "r") as f: 2 | lines = f.readlines()[2:-6] 3 | key = "ACDEFGHIKLMNPQRSTVWY" 4 | text_keys = lines[0].replace("\n", "").split(" ")[4:] 5 | 6 | # PSSM OPTION 1 7 | text_vals = np.array([" ".join(line.replace("\n", "")[72:-11].split()).split() for line in lines[1:]]).astype(float) 8 | 9 | # PSSM OPTION 2 10 | # text_vals = np.array([" ".join(line.replace("\n", "")[10:72].split()).split() for line in lines[1:]]).astype(float) 11 | 12 | # normalize to [0,1] 13 | # text_vals = text_vals / np.sum(text_vals, axis=1).reshape((len(text_vals), 1)) 14 | for i in range(len(text_vals)): 15 | text_vals[i] = (text_vals[i] - np.amin(text_vals[i])) / (np.amax(text_vals[i] - np.amin(text_vals[i]))) 16 | 17 | # create NxL PSSM 18 | pssm = np.zeros_like(text_vals) 19 | for i,aa in enumerate(text_keys): 20 | pssm[key.index(aa), :] = text_vals[i, :] 21 | 22 | inputs_pssm = wider_pssm(pssm.T, seq) 23 | inputs = np.concatenate((inputs_aa, inputs_pssm), axis=-1) 24 | -------------------------------------------------------------------------------- /models/distance_pipeline/distance_generator_data.py: -------------------------------------------------------------------------------- 1 | import keras 2 | import numpy as np 3 | # Custom functions import 4 | from func_utils import * 5 | 6 | class DataGenerator(keras.utils.Sequence): 7 | 'Generates data for Keras' 8 | def __init__(self, paths=None, max_prots=10, batch_size=8, crop_size=200, pad_size=200, 9 | n_classes=5, class_cuts=[-0.5, 500, 1000, 1700], shuffle=True): 10 | 'Initialization' 11 | # Get data 12 | self.names, self.seqs, self.dists, self.pssms = get_data(paths, max_prots) 13 | self.list_IDs = [i for i in 
range(len(self.seqs))] 14 | # Features 15 | self.batch_size = batch_size 16 | self.crop_size = crop_size 17 | self.pad_size = pad_size 18 | self.n_classes = n_classes 19 | self.class_cuts = class_cuts 20 | if len(self.class_cuts) != self.n_classes-1: 21 | raise ValueError('len(class_cuts) must be n_classes-1') 22 | 23 | self.shuffle = shuffle 24 | self.on_epoch_end() 25 | 26 | def __len__(self): 27 | 'Denotes the number of batches per epoch' 28 | return int(np.floor(len(self.list_IDs) / self.batch_size)) 29 | 30 | def __getitem__(self, index): 31 | 'Generate one batch of data' 32 | # Generate indexes of the batch 33 | indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size] 34 | 35 | # Find list of IDs 36 | list_IDs_temp = [self.list_IDs[k] for k in indexes] 37 | 38 | # Generate data 39 | x, y = self.__data_generation(list_IDs_temp) 40 | 41 | return x, y 42 | 43 | def get_data(self, paths, max_prots): 44 | """ Get the data from files. """ 45 | # Scan first n proteins 46 | names = [] 47 | seqs = [] 48 | dists = [] 49 | pssms = [] 50 | for path in paths: 51 | # Opn file and read text 52 | with open(path, "r") as f: 53 | lines = f.read().split('\n') 54 | 55 | # Extract numeric data from text 56 | for i,line in enumerate(lines): 57 | if len(names) == max_prots+1: 58 | break 59 | # Read each protein separately 60 | if line == "[ID]": 61 | names.append(lines[i+1]) 62 | elif line == "[PRIMARY]": 63 | seqs.append(lines[i+1]) 64 | elif line == "[EVOLUTIONARY]": 65 | pssms.append(parse_lines(lines[i+1:i+21])) 66 | elif line == "[DIST]": 67 | dists.append(parse_lines(lines[i+1:i+len(seqs[-1])+1])) 68 | # Progress control 69 | if len(names)%100 == 0: 70 | print("Currently @ ", len(names), " out of "+str(max_prots)) 71 | try: logger.info("Currently @ ", len(names), " out of "+str(max_prots)) 72 | except:pass 73 | 74 | print("Total length is "+str(len(names)-1)+" out of "+str(max_prots)+" possible.") 75 | try: logger.info("Total length is "+str(len(names)-1)+" out 
of "+str(max_prots)+" possible.") 76 | except:pass 77 | 78 | return names, seqs, dists, pssms 79 | 80 | 81 | def on_epoch_end(self): 82 | 'Updates indexes after each epoch' 83 | self.indexes = np.arange(len(self.list_IDs)) 84 | if self.shuffle == True: 85 | np.random.shuffle(self.indexes) 86 | 87 | def wider(self, seq, n=20): 88 | """ Converts a seq into a one-hot tensor. Not LxN but LxLxN""" 89 | key = "HRKDENQSYTCPAVLIGFWM" 90 | tensor = [] 91 | for i in range(self.pad_size): 92 | d2 = [] 93 | for j in range(self.pad_size): 94 | # Check first for lengths (dont want index_out_of_range) 95 | d1 = [1 if (j=self.class_cuts[-1]).astype(np.int)) 149 | 150 | return np.concatenate([cat.reshape(self.pad_size, self.pad_size, 1) 151 | for cat in med], axis=2) 152 | 153 | # Embed number of rows 154 | def embedding_matrix(self, matrix): 155 | # Embed with extra columns 156 | for i in range(len(matrix)): 157 | while len(matrix[i]) len(seq) 137 | d1.append(1 - abs(i-j)/200) 138 | 139 | d2.append(d1) 140 | tensor.append(d2) 141 | 142 | return np.array(tensor) 143 | 144 | 145 | def embedding_matrix(matrix, l=200): 146 | """ Embeds matrix of nxn into lxl. n=cuts[5] 167 | 168 | return np.concatenate((trash.reshape(l,l,1), 169 | first.reshape(l,l,1), 170 | sec.reshape(l,l,1), 171 | third.reshape(l,l,1), 172 | fourth.reshape(l,l,1), 173 | fifth.reshape(l,l,1), 174 | # sixth.reshape(l,l,1), 175 | seventh.reshape(l,l,1)),axis=2) 176 | 177 | def mirror_diag(image): 178 | """ Mirrors image across its diagonal. 
""" 179 | image = image.astype(float) 180 | # averages image across diagonal and returns 2 simetric parts 181 | for i in range(len(image)): 182 | for j in range(len(image[i])): 183 | image[i,j] = image[j,i] = np.true_divide((image[i,j]+image[j,i]), 2) 184 | 185 | return image -------------------------------------------------------------------------------- /models/distance_pipeline/images/FM_T0869_CASP12.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/FM_T0869_CASP12.jpg -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_17.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_17.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_20.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_20.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_32.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_45.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_45.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_50.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_50.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_54.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_54.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_55.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_55.png -------------------------------------------------------------------------------- /models/distance_pipeline/images/golden_img_v91_56.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/images/golden_img_v91_56.png -------------------------------------------------------------------------------- /models/distance_pipeline/models/tester_28_lxl_golden.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hypnopump/MiniFold/6eb47c5600c22c7dabbb1294adbd8c6704a185cb/models/distance_pipeline/models/tester_28_lxl_golden.h5 
-------------------------------------------------------------------------------- /models/distance_pipeline/pipeline_caller.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | log_name = "genetic_log" 4 | 5 | # genetic algorithm params 6 | RECORD_PATH = "record.txt" 7 | IMPROVE = 1/30 # maximum modification to eaach class 8 | MUTATE = 0.75 # probability of a class mutation 9 | 10 | def stringify(vec): 11 | """ Helper function to save data to .txt file. """ 12 | line = "" 13 | for v in vec: 14 | line += str(v)+"," 15 | return line[:-1] 16 | 17 | for i in range(7*20): 18 | try: 19 | with open(RECORD_PATH, "r") as f: 20 | lines = f.read().split('\n') 21 | 22 | WEIGHTS = [float(w) for w in str(lines[-1]).split(" ")[-1].split(",")] 23 | # generate new_weights if 24 | if int(str(lines[-1]).split(" ")[0]) < i-1: 25 | # -0.4 since its easier to lose a 50% but hard to regain a 100% 26 | WEIGHTS = [w+2*(np.random.random()-0.4)*IMPROVE*w 27 | if np.random.random()=self.class_cuts[-1]).astype(np.int)) 109 | 110 | return np.concatenate([cat.reshape(self.pad_size, self.pad_size, 1) 111 | for cat in med], axis=2) 112 | 113 | # Embed number of rows 114 | def embedding_matrix(self, matrix): 115 | # Embed with extra columns 116 | for i in range(len(matrix)): 117 | while len(matrix[i])17*2:\n", 219 | " long += len(seqs[i])-17*2\n", 220 | " for j in range(17,len(seqs[i])-17):\n", 221 | " # Padd sequence\n", 222 | " input_aa.append(onehotter_aa(seqs[i], j))\n", 223 | " input_pssm.append(pssm_cropper(pssms[i], j))\n", 224 | " outputs.append([phis[i][j], psis[i][j]])\n", 225 | " # break\n", 226 | " # print(i, \"Added: \", len(seqs[i])-34,\"total for now: \", long)\n", 227 | "print(\"TOTAL:\", long, len(input_aa))" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 10, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "name": "stdout", 237 | "output_type": "stream", 238 | "text": [ 239 
| "Outputs: 43001\n", 240 | "Inputs AAs: 43001\n", 241 | "Inputs PSSMs: 43001\n" 242 | ] 243 | } 244 | ], 245 | "source": [ 246 | "#Check everything's fine\n", 247 | "print(\"Outputs: \", len(outputs))\n", 248 | "print(\"Inputs AAs: \", len(input_aa))\n", 249 | "print(\"Inputs PSSMs: \", len(input_pssm))" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "#### Reshape the inputs" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 11, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "data": { 266 | "text/plain": [ 267 | "(43001, 34, 22)" 268 | ] 269 | }, 270 | "execution_count": 11, 271 | "metadata": {}, 272 | "output_type": "execute_result" 273 | } 274 | ], 275 | "source": [ 276 | "input_aa = np.array(input_aa).reshape(len(input_aa), 17*2, 22)\n", 277 | "input_aa.shape" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 12, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "text/plain": [ 288 | "(43001, 34, 21)" 289 | ] 290 | }, 291 | "execution_count": 12, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "input_pssm = np.array(input_pssm).reshape(len(input_pssm), 17*2, 21)\n", 298 | "input_pssm.shape" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 13, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "# Helper function to save data to a .txt file\n", 308 | "def stringify(vec):\n", 309 | " return \"\".join(str(v)+\" \" for v in vec)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 14, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "# Save outputs to txt file\n", 319 | "with open(\"../data/angles/outputs.txt\", \"a\") as f:\n", 320 | " for o in outputs:\n", 321 | " f.write(stringify(o)+\"\\n\")" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 15, 327 | "metadata": 
{}, 328 | "outputs": [], 329 | "source": [ 330 | "# Save AAs & PSSMs data to different files (together makes a 3dims tensor)\n", 331 | "# Will concat later\n", 332 | "with open(\"../data/angles/input_aa.txt\", \"a\") as f:\n", 333 | " for aas in input_aa:\n", 334 | " f.write(\"\\nNEW\\n\")\n", 335 | " for j in range(len(aas)):\n", 336 | " f.write(stringify(aas[j])+\"\\n\")" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 16, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "with open(\"../data/angles/input_pssm.txt\", \"a\") as f:\n", 346 | " for k in range(len(input_pssm)):\n", 347 | " f.write(\"\\nNEW\\n\")\n", 348 | " for j in range(len(input_pssm[k])):\n", 349 | " f.write(stringify(input_pssm[k][j])+\"\\n\")" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "# Done!" 357 | ] 358 | } 359 | ], 360 | "metadata": { 361 | "kernelspec": { 362 | "display_name": "Python 3", 363 | "language": "python", 364 | "name": "python3" 365 | }, 366 | "language_info": { 367 | "codemirror_mode": { 368 | "name": "ipython", 369 | "version": 3 370 | }, 371 | "file_extension": ".py", 372 | "mimetype": "text/x-python", 373 | "name": "python", 374 | "nbconvert_exporter": "python", 375 | "pygments_lexer": "ipython3", 376 | "version": "3.6.7" 377 | } 378 | }, 379 | "nbformat": 4, 380 | "nbformat_minor": 2 381 | } 382 | -------------------------------------------------------------------------------- /preprocessing/get_angles_from_coords_py.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Calculate Dihedral Angles from Coordinates" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Import libraries - LOAD THE DATA\n", 17 | "import numpy as np\n", 18 | "import 
matplotlib.pyplot as plt" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "def parse_line(raw):\n", 28 | " return np.array([[float(x) for x in line.split(\"\\t\") if x != \"\"] for line in raw])" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "names = []\n", 38 | "seqs = []\n", 39 | "psis = []\n", 40 | "phis = []\n", 41 | "pssms = []\n", 42 | "coords = []\n", 43 | "\n", 44 | "path = \"../data/full_under_200.txt\"\n", 45 | "# Open file and read text\n", 46 | "with open(path, \"r\") as f:\n", 47 | " lines = f.read().split('\\n')" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "scrolled": true 55 | }, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "Currently @ 50 out of n\n", 62 | "Currently @ 100 out of n\n", 63 | "Currently @ 150 out of n\n", 64 | "Currently @ 200 out of n\n", 65 | "Currently @ 250 out of n\n", 66 | "Currently @ 300 out of n\n", 67 | "Currently @ 350 out of n\n", 68 | "Currently @ 400 out of n\n", 69 | "Currently @ 450 out of n\n", 70 | "Currently @ 500 out of n\n", 71 | "Currently @ 550 out of n\n", 72 | "Currently @ 600 out of n\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "# Extract numeric data from text\n", 78 | "for i,line in enumerate(lines):\n", 79 | " if len(names) == 601:\n", 80 | " break\n", 81 | " # Read each protein separately\n", 82 | " if line == \"[ID]\":\n", 83 | " names.append(lines[i+1])\n", 84 | " elif line == \"[PRIMARY]\":\n", 85 | " seqs.append(lines[i+1])\n", 86 | " elif line == \"[EVOLUTIONARY]\":\n", 87 | " pssms.append(parse_line(lines[i+1:i+22]))\n", 88 | " elif line == \"[TERTIARY]\":\n", 89 | " coords.append(parse_line(lines[i+1:i+3+1]))\n", 90 | " # Progress control\n", 91 | " if len(names)%50 == 0:\n", 92 | " print(\"Currently @ \", 
len(names), \" out of n\")" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "#Get the coordinates for 1 atom type\n", 102 | "def separate_coords(full_coords, pos): # pos can be either 0(n_term), 1(calpha), 2(cterm)\n", 103 | " res = []\n", 104 | " for i in range(len(full_coords[1])):\n", 105 | " if i%3 == pos:\n", 106 | " res.append([full_coords[j][i] for j in range(3)])\n", 107 | "\n", 108 | " return np.array(res)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 6, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Organize by atom type\n", 118 | "coords_nterm = [separate_coords(full_coords, 0) for full_coords in coords]\n", 119 | "coords_calpha = [separate_coords(full_coords, 1) for full_coords in coords]\n", 120 | "coords_cterm = [separate_coords(full_coords, 2) for full_coords in coords]" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 7, 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "Length coords_calpha: 600\n", 133 | "Length coords_calpha[1]: 142\n", 134 | "Length coords_calpha[1][1]: 3\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "# Check everything's ok\n", 140 | "print(\"Length coords_calpha: \", len(coords_calpha))\n", 141 | "print(\"Length coords_calpha[1]: \", len(coords_calpha[1]))\n", 142 | "print(\"Length coords_calpha[1][1]: \", len(coords_calpha[1][1]))" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 8, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# Helper functions\n", 152 | "def get_dihedral(coords1, coords2, coords3, coords4):\n", 153 | " \"\"\"Returns the dihedral angle in radians.\"\"\"\n", 154 | "\n", 155 | " a1 = coords2 - coords1\n", 156 | " a2 = coords3 - coords2\n", 157 | " a3 = coords4 - coords3\n", 158 | "\n", 159 | " v1 = np.cross(a1, 
a2)\n", 160 | " v1 = v1 / (v1 * v1).sum(-1)**0.5\n", 161 | " v2 = np.cross(a2, a3)\n", 162 | " v2 = v2 / (v2 * v2).sum(-1)**0.5\n", 163 | " porm = np.sign((v1 * a3).sum(-1))\n", 164 | " rad = np.arccos((v1*v2).sum(-1) / ((v1**2).sum(-1) * (v2**2).sum(-1))**0.5)\n", 165 | " if not porm == 0:\n", 166 | " rad = rad * porm\n", 167 | "\n", 168 | " return rad" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 9, 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "name": "stderr", 178 | "output_type": "stream", 179 | "text": [ 180 | "c:\\users\\eric\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\ipykernel_launcher.py:10: RuntimeWarning: invalid value encountered in true_divide\n", 181 | " # Remove the CWD from sys.path while we load stuff.\n", 182 | "c:\\users\\eric\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\ipykernel_launcher.py:12: RuntimeWarning: invalid value encountered in true_divide\n", 183 | " if sys.path[0] == '':\n", 184 | "c:\\users\\eric\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\ipykernel_launcher.py:13: RuntimeWarning: invalid value encountered in sign\n", 185 | " del sys.path[0]\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "# Compute angles for a protein\n", 191 | "phis, psis = [], [] # phi always starts with a 0 and psi ends with a 0\n", 192 | "ph_angle_dists, ps_angle_dists = [], []\n", 193 | "for k in range(len(coords)):\n", 194 | " phi, psi = [0.0], []\n", 195 | " # Use our own functions inspired from bioPython\n", 196 | " for i in range(len(coords_calpha[k])):\n", 197 | " # Calculate phi, psi\n", 198 | " # CALCULATE PHI - Can't calculate for first residue\n", 199 | " if i>0:\n", 200 | " phi.append(get_dihedral(coords_cterm[k][i-1], coords_nterm[k][i], coords_calpha[k][i], coords_cterm[k][i])) # my_calc\n", 201 | " \n", 202 | " # CALCULATE PSI - Can't calculate for last residue\n", 203 | " if i= 0 and test_psi[i] >= 0:\n", 270 | " quads[0] += 1\n", 271 | 
" elif test_phi[i] < 0 and test_psi[i] >= 0:\n", 272 | " quads[1] += 1\n", 273 | " elif test_phi[i] < 0 and test_psi[i] < 0:\n", 274 | " quads[2] += 1\n", 275 | " else:\n", 276 | " quads[3] += 1\n", 277 | " \n", 278 | "print(\"Quadrants: \", quads, \" from \", len(test_phi))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 12, 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "data": { 288 | "image/png": "\n", 289 | "text/plain": [ 290 | "
" 291 | ] 292 | }, 293 | "metadata": { 294 | "needs_background": "light" 295 | }, 296 | "output_type": "display_data" 297 | } 298 | ], 299 | "source": [ 300 | "# Visualize data. Check it matches the Ramachandran Plot distribution\n", 301 | "# (Ergo check if angles are well computed)\n", 302 | "plt.scatter(test_phi, test_psi, marker=\".\")\n", 303 | "plt.xlim(-np.pi, np.pi)\n", 304 | "plt.xlabel(\"Phi\")\n", 305 | "plt.ylabel(\"Psi\")\n", 306 | "plt.ylim(-np.pi, np.pi)\n", 307 | "plt.show()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 13, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "# Data is OK. Can save it to file.\n", 317 | "with open(\"../data/angles/full_angles_under_200.txt\", \"a\") as f:\n", 318 | " for k in range(len(names)-1):\n", 319 | " # ID\n", 320 | " f.write(\"\\n[ID]\\n\")\n", 321 | " f.write(names[k])\n", 322 | " # Seq\n", 323 | " f.write(\"\\n[PRIMARY]\\n\")\n", 324 | " f.write(seqs[k])\n", 325 | " # PSSMS\n", 326 | " f.write(\"\\n[EVOLUTIONARY]\\n\")\n", 327 | " for j in range(len(pssms[k])):\n", 328 | " f.write(stringify(pssms[k][j])+\"\\n\")\n", 329 | " # PHI\n", 330 | " f.write(\"\\n[PHI]\\n\")\n", 331 | " f.write(stringify(phis[k]))\n", 332 | " # PSI\n", 333 | " f.write(\"\\n[PSI]\\n\")\n", 334 | " f.write(stringify(psis[k]))\n" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "# Done!" 
342 | ] 343 | } 344 | ], 345 | "metadata": { 346 | "kernelspec": { 347 | "display_name": "Python 3", 348 | "language": "python", 349 | "name": "python3" 350 | }, 351 | "language_info": { 352 | "codemirror_mode": { 353 | "name": "ipython", 354 | "version": 3 355 | }, 356 | "file_extension": ".py", 357 | "mimetype": "text/x-python", 358 | "name": "python", 359 | "nbconvert_exporter": "python", 360 | "pygments_lexer": "ipython3", 361 | "version": "3.6.7" 362 | } 363 | }, 364 | "nbformat": 4, 365 | "nbformat_minor": 2 366 | } 367 | -------------------------------------------------------------------------------- /preprocessing/get_proteins_under_200aa.jl: -------------------------------------------------------------------------------- 1 | """ 2 | This is a script alternative to the julia_get_proteins_under_200aa.ipynb 3 | notebook for those who don't have IJulia or would like to run it as a script 4 | 5 | Script to preprocess the raw data file and 6 | handle it properly. 7 | Will prune the unnecessary data for now. 8 | Reducing data file from 600mb to 170mb. 9 | 10 | Select only proteins under L aminoacids (AAs). 11 | """ 12 | 13 | L = 200 # Set maximum AA length 14 | N = 995 # Set maximum number of proteins 15 | RAW_DATA_PATH = "../data/training_30.txt" # Path to raw data file 16 | DESTIN_PATH = "../data/full_under_200.txt" # Path to destin file 17 | 18 | # alternatively declare paths from command line 19 | if length(ARGS) > 1 20 | RAW_DATA_PATH = ARGS[1] # Path to raw data file 21 | DESTIN_PATH = ARGS[2] # Path to destin file 22 | end 23 | 24 | # Open the file and read content 25 | f = try open(RAW_DATA_PATH) catch 26 | println("File not found. Check it's there. 
Instructions in the readme.") 27 | exit(0) 28 | end 29 | lines = readlines(f) 30 | 31 | 32 | 33 | function coords_split(lister, splice) 34 | # Split all passed sequences by "splice" and return an array of them 35 | # Convert string fragments to float 36 | coords = [] 37 | for c in lister 38 | push!(coords, [parse(Float64, a) for a in split(c, splice)]) 39 | end 40 | return coords 41 | end 42 | 43 | 44 | function norm(vector) 45 | # Could use "Using LinearAlgebra + built-in norm()" but gotta learn Julia 46 | return sqrt(sum([v*v for v in vector])) 47 | end 48 | 49 | 50 | # Scan first n proteins 51 | names = [] 52 | seqs = [] 53 | coords = [] 54 | pssms = [] 55 | 56 | try 57 | # Record names, seqs and coords for each protein btwn 1-n 58 | for i in 1:length(lines) 59 | if length(coords) == N 60 | break 61 | end 62 | 63 | # Start recording 64 | if lines[i] == "[ID]" 65 | push!(names, lines[i+1]) 66 | elseif lines[i] == "[PRIMARY]" 67 | push!(seqs, lines[i+1]) 68 | elseif lines[i] == "[TERTIARY]" 69 | push!(coords, coords_split(lines[i+1:i+3], "\t")) 70 | elseif lines[i] == "[EVOLUTIONARY]" 71 | push!(pssms, coords_split(lines[i+1:i+21], "\t")) 72 | # Progress control 73 | if length(names)%50 == 0 74 | println("Currently @ ", length(names), " out of n: ", N) 75 | end 76 | end 77 | end 78 | catch 79 | println("Error while reading file. Check it's complete or download again.") 80 | exit(0) 81 | end 82 | 83 | 84 | # Check proteins w/ length under L 85 | println("\n\nTotal number of proteins: ", length(seqs)) 86 | under = [] 87 | for i in 1:length(seqs) 88 | if length(seqs[i])L 161 | println("error when checking protein in dists n: ", 162 | aux[length(aux)], " length: ", length(dists[aux[length(aux)]][1])) 163 | break 164 | else 165 | writedlm(f, dists[aux[length(aux)]]) 166 | end 167 | end 168 | end 169 | 170 | 171 | println("\n\nScript execution went fine. 
Data is ready at: ", DESTIN_PATH) 172 | exit(0) 173 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # MiniFold 2 | 3 | [![DOI](https://zenodo.org/badge/172886347.svg)](https://zenodo.org/badge/latestdoi/172886347) 4 | 5 | ## Abstract 6 | 7 | * **Introduction**: The Protein Folding Problem (predicting a protein's structure from its sequence) is an interesting one, since DNA sequence data is becoming cheaper at an unprecedented rate, even faster than Moore's law [[1](https://www.genome.gov/27541954/dna-sequencing-costs-data/)]. Recent research has applied Deep Learning techniques to accurately predict the structure of polypeptides [[2](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324), [3](http://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf)]. 8 | * **Methods**: In this work, we present an attempt to imitate the architecture of the AlphaFold protein structure prediction system [[3](http://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf)]. We use 1-D Residual Networks (ResNets) to predict dihedral torsion angles and 2-D ResNets to predict distance maps between the amino acids of a protein [[4](https://arxiv.org/abs/1512.03385)]. We use the CASP7 section of the ProteinNet dataset for training and evaluation of the model [[5](https://arxiv.org/abs/1902.00249)]. An open-source implementation of the system described can be found [here](https://github.com/EricAlcaide/MiniFold). 9 | * **Results**: We are able to obtain distance maps and torsion angle predictions for a protein given its sequence and PSSM. Our angle prediction model achieves an MAE (Mean Absolute Error) of 0.39, and R^2 coefficients of 0.39 and 0.43 for Phi and Psi respectively, whereas SoTA is around 0.69 (Phi) and 0.73 (Psi). 
Our methods do not include post-processing of Deep Learning outputs, which can be very noisy. 10 | * **Conclusion**: We have shown the potential of Deep Learning methods and their possible application to the Protein Folding Problem. Despite technical limitations, Neural Networks are able to capture relations in the data. Despite visually pleasing results, our system lacks two components: the reconstruction of the protein structure from the predicted dihedral torsion angles and distance map, and the post-processing of our predictions to reduce noise. 11 | 12 | #### Citation 13 | ``` 14 | @misc{ericalcaide2019, 15 | title = {MiniFold: a DeepLearning-based Mini Protein Folding Engine}, 16 | publisher = {GitHub}, 17 | journal = {GitHub repository}, 18 | author = {Alcaide, Eric}, 19 | year = {2019}, 20 | howpublished = {\url{https://github.com/EricAlcaide/MiniFold/}}, 21 | doi = {10.5281/zenodo.3774491}, 22 | url = {https://doi.org/10.5281/zenodo.3774491} 23 | } 24 | ``` 25 | 26 | 27 | ## Introduction 28 | 29 | [DeepMind](https://deepmind.com), a company affiliated with Google and specialized in AI, presented a novel algorithm for Protein Structure Prediction at [CASP13](http://predictioncenter.org/casp13/index.cgi) (a competition whose goal is to find the best algorithms for predicting protein structures in different categories). 30 | 31 | The Protein Folding Problem is an interesting one since there's tons of DNA sequence data available and it's becoming cheaper and cheaper at an unprecedented rate (faster than [Moore's law](https://www.genome.gov/27541954/dna-sequencing-costs-data/)). Cells build the proteins they need through **transcription** (from DNA to RNA) and **translation** (from RNA to amino acids (AAs)). However, the function of a protein does not depend solely on the sequence of AAs that form it, but also on their spatial 3D folding. Thus, it's hard to predict the function of a protein from its DNA sequence. 
**AI** can help solve this problem by learning the relations that exist between a given sequence and its spatial 3D folding. 32 | 33 | The DeepMind work presented @ CASP was not a technological breakthrough (they did not invent any new type of AI) but an **engineering** one: they applied well-known AI algorithms to a problem along with lots of data and computing power and found a great solution through model design, feature engineering, model ensembling and so on. DeepMind has no plan to open source the code of their model nor set up a prediction server. 34 | 35 | Based on the premise above, the aim of this project is to build a model suitable for protein 3D structure prediction, inspired by AlphaFold and by other AI solutions that may appear and achieve SOTA results. 36 | 37 | 38 | ## Methods 39 | ### Proposed Architecture 40 | 41 | The [methods implemented](implementation_details.md) are inspired by DeepMind's original post. Two different residual neural networks (ResNets) are used to predict **angles** between adjacent amino acids (AAs) and **distance** between every pair of AAs of a protein. For distance prediction a 2D ResNet was used, while for angle prediction a 1D ResNet was used. 42 | 43 |
44 | 45 |
46 | 47 | Image from DeepMind's original blogpost. 48 | 49 | #### Distance prediction 50 | 51 | The ResNet for distance prediction is built as a 2D-ResNet and takes as input tensors of shape LxLxN (a normal image would be LxLx3). The window length is set to 200 (we only train on and predict proteins of fewer than 200 AAs) and smaller proteins are padded to match the window size. Neither larger proteins nor crops of larger proteins are used. 52 | 53 | The 41 channels of the input are distributed as follows: 20 for AAs in one-hot encoding (LxLx20), 1 for the Van der Waals radius of the AA encoded previously and 20 channels for the Position Specific Scoring Matrix. 54 | 55 | The network consists of groups of residual blocks (architecture illustrated below), with blocks cycling through strides of 1, 2, 4 and 8, plus an initial plain convolutional layer and a final convolutional layer with a Softmax activation that produces an output of shape LxLx7 (6 classes for different distance ranges + 1 trash class for the padding, which is penalized less). 56 | 57 |
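The LxLx7 target described above can be sketched as follows. This is a minimal illustration rather than the repository's pipeline: the bin edges in `BIN_EDGES` are hypothetical (the README does not state the actual distance ranges), and `distance_labels` and `one_hot` are names introduced here only for the example. It assumes one 3D coordinate per residue.

```python
import numpy as np

WINDOW = 200             # window length stated above
N_DISTANCE_CLASSES = 6   # number of distance-range classes stated above
PAD_CLASS = 6            # index of the extra "trash" class used for padding
# Hypothetical bin edges in angstroms (assumption: the real ranges are not given).
BIN_EDGES = np.array([4.0, 6.0, 8.0, 10.0, 12.0])


def distance_labels(coords):
    """Build a (WINDOW, WINDOW) map of class indices for one protein.

    coords: array of shape (n_residues, 3), one point per residue.
    """
    n = len(coords)
    if n > WINDOW:
        raise ValueError("protein is longer than the window")
    # Pairwise Euclidean distances between all residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Five edges give six bins, indexed 0..5.
    classes = np.digitize(dist, BIN_EDGES)
    # Everything outside the real protein is marked as padding.
    labels = np.full((WINDOW, WINDOW), PAD_CLASS, dtype=np.int64)
    labels[:n, :n] = classes
    return labels


def one_hot(labels, n_classes=N_DISTANCE_CLASSES + 1):
    """Turn class indices into a (WINDOW, WINDOW, n_classes) one-hot tensor."""
    return np.eye(n_classes, dtype=np.float32)[labels]


# Toy protein with three residues on a line (made-up coordinates).
toy = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [11.0, 0.0, 0.0]])
target = one_hot(distance_labels(toy))
print(target.shape)
```

Keeping padding as its own class lets a loss function weight it separately from the six real distance classes, which is one way to penalize it less.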
58 | 59 |
60 | 61 | Architecture of the residual block used. A mini version of the block in [this description](http://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf) 62 | 63 | The network has been trained with 134 proteins and evaluated with 16 more. This is clearly insufficient data, but memory constraints didn't allow for more. By comparison, AlphaFold was trained with 29k proteins. 64 | 65 | The output of the network is, then, a classification among 6 classes, which are ranges of distances between a pair of AAs. Here is an example of the distances predicted by AlphaFold and the distances predicted by our model: 66 | 67 |
68 | 69 |
70 | Ground truth (left) and predicted distances (right) by AlphaFold. 71 | 72 |
73 | 74 |
75 | Ground truth (left) and predicted distances (right) by MiniFold. 76 | 77 | The architecture of the Residual Network for distance prediction is very similar to AlphaFold's, the main difference being that the model described here was trained with windows of 200x200 AAs while AlphaFold was trained with crops of 64x64 AAs. When it comes to prediction, AlphaFold used the smaller window size to average across different outputs and achieve a smoother result. Our prediction, however, comes from a single window, so there is no averaging (noisier predictions). 78 | 79 | 80 | #### Angles prediction 81 | 82 | The ResNet for angles prediction is built as a 1D-ResNet and takes as input tensors of shape LxN. The window length is set to 34 and we only train on and predict angles of proteins with fewer than 200 (L) AAs. Neither larger proteins nor crops of larger proteins are used. 83 | 84 | The 42 (N) channels of the input are distributed as follows: 20 for AAs in one-hot encoding (Lx20), 2 for the Van der Waals radius and the surface accessibility of the AA encoded previously and 20 channels for the Position Specific Scoring Matrix. 85 | 86 | We followed the ResNet20 architecture but replaced the 2D convolutions with 1D convolutions. The network output consists of a vector of 4 numbers that represent the `sin` and `cos` of the 2 dihedral angles between two AAs (Phi and Psi). 87 | 88 | Dihedral angles were extracted from raw coordinates of the protein backbone atoms (N-terminus, C-alpha and C-terminus of each AA). The plot of Phi against Psi is called a Ramachandran plot: 89 | 90 |
91 | 92 |
93 | The cluster observed in the upper-left region corresponds to the angles adopted by AAs when they form a Beta-sheet, while the cluster observed in the central-left region corresponds to the angles adopted by AAs when they form an Alpha-helix. 94 | 95 | The results of the model when making predictions can be observed below: 96 |
97 | 98 |
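As a minimal sketch of the angle handling described in this section (not the repository's exact code), the snippet below computes a signed dihedral angle in radians from four backbone atom positions with an `arctan2` formulation, then encodes a (Phi, Psi) pair as four numbers and decodes it back. The names `dihedral`, `encode_angles` and `decode_angles` are hypothetical, and the ordering `[sin(phi), cos(phi), sin(psi), cos(psi)]` is an assumption, since the README only says the output holds the `sin` and `cos` of both angles.

```python
import numpy as np


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians, in [-pi, pi], for four 3D points."""
    b0 = p1 - p0
    b1 = p2 - p1
    b2 = p3 - p2
    n1 = np.cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = np.cross(b1, b2)  # normal of the plane through p1, p2, p3
    # x and y share the same scale factor, so no normalization is needed.
    x = np.dot(n1, n2)
    y = np.linalg.norm(b1) * np.dot(b0, n2)
    return np.arctan2(y, x)


def encode_angles(phi, psi):
    """Encode two angles as [sin(phi), cos(phi), sin(psi), cos(psi)] (assumed order)."""
    return np.array([np.sin(phi), np.cos(phi), np.sin(psi), np.cos(psi)])


def decode_angles(vec):
    """Recover (phi, psi) in radians from the 4-number encoding."""
    sin_phi, cos_phi, sin_psi, cos_psi = vec
    return np.arctan2(sin_phi, cos_phi), np.arctan2(sin_psi, cos_psi)


# Toy example with made-up coordinates (not taken from the dataset).
# Phi uses C(i-1), N(i), C-alpha(i), C(i), as in the preprocessing notebook.
c_prev = np.array([1.0, 0.0, 0.0])
n_atom = np.array([0.0, 0.0, 0.0])
c_alpha = np.array([0.0, 0.0, 1.0])
c_atom = np.array([0.0, 1.0, 1.0])
phi = dihedral(c_prev, n_atom, c_alpha, c_atom)
target = encode_angles(phi, psi=-0.7)
print(decode_angles(target))
```

Representing each angle by its sine and cosine avoids the jump between -pi and pi, so two nearly identical angles on either side of that boundary get nearly identical targets.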
99 | 100 | The network has been trained with 38.7k crops from 600 different proteins and evaluated with some 4.3k more. 101 | 102 | The architecture of the Residual Network is different from the one implemented in AlphaFold. The model implemented here was inspired by [this paper](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324) and [this one](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205819). 103 | 104 | ## Results 105 | While the architectures implemented in this first preliminary version of the project are inspired by papers with great results, the results obtained here are not as good as they could be. It's likely that the lack of Multiple Sequence Alignment (MSA), MSA-based features and physicochemical properties of AAs (beyond Van der Waals radius), together with the lack of both model and feature engineering, have affected the models negatively, as has the small amount of data they were trained on. 106 | 107 | For that reason, we can conclude that it has been a somewhat naive approach and we expect to further implement some ideas/improvements to these models. As the DeepMind team says: *"With few or no alignments accuracy is much worse"*. It would be interesting to use the predictions made by the models as constraints to a folding algorithm (e.g. Rosetta) in order to visualize our results. 108 | 109 | 110 | ### Reproducing the results 111 | 112 | Follow these steps to run the code locally or in the cloud: 113 | 1. Clone the repo: `git clone https://github.com/EricAlcaide/MiniFold` 114 | 2. Install dependencies: `pip install -r requirements.txt` 115 | 3. Get & format the data 116 | 1. Download data [here](https://github.com/aqlaboratory/proteinnet) (select CASP7 text-based format) 117 | 2. Extract/Decompress the data in any directory 118 | 3. Create the `/data` folder inside the `MiniFold` directory and copy the `training_30`, `training_70` and `training_90` files to it. Change extensions to `.txt`. 
119 | 4. Execute data preprocessing notebooks (`preprocessing` folder) in the following order (we plan to release simple scripts instead of notebooks very soon): 120 | 1. `get_proteins_under_200aa.jl *source_path* *destin_path*`: - selects proteins under 200 residues from the *source_path* file (alternatively can be declared in the script itself) - (you will need the [Julia programming language](https://julialang.org/) v1.0 in order to run it) 121 | 1. **Alternatively**: `julia_get_proteins_under_200aa.ipynb` (you will need Julia as well as [IJulia](https://github.com/JuliaLang/IJulia.jl)) 122 | 3. `get_angles_from_coords_py.ipynb` - calculates dihedral angles from raw coordinates 123 | 4. `angle_data_preparation_py.ipynb` 124 | 5. Run the models! 125 | 1. For **angles prediction**: `models/angles/predicting_angles.ipynb` 126 | 2. For **distance prediction**: 127 | 1. `models/distance_pipeline/pretrain_model_pssm_l_x_l.ipynb` 128 | 2. `models/distance_pipeline/pipeline_caller.py` 129 | 6. 3D structure modelling from predicted results 130 | 1. For **RR format conversion and 3D structure modelling** follow the steps given in `models/distance_pipeline/Tutorials/README.pdf` 131 | 132 | If you encounter any errors during installation, don't hesitate to open an [issue](https://github.com/EricAlcaide/MiniFold/issues). 133 | 134 | #### Post processing of predictions (added end 2020 - not by the original author) 135 | Currently, post-processing of the predictions is done with a Python script that converts the predicted results into RR (Residue-Residue contact prediction) format. This format represents the probability of contact between pairs of residues. Data in this format are inserted between the MODEL and END records of the submission file. 
The prediction starts with the sequence of the predicted target, split across lines. The sequence is followed by the list of contacts in the five-column format shown below: 136 | ``` 137 | PFRMAT RR 138 | TARGET T0999 139 | AUTHOR 1234-5678-9000 140 | REMARK Predictor remarks 141 | METHOD Description of methods used 142 | METHOD Description of methods used 143 | MODEL 1 144 | HLEGSIGILLKKHEIVFDGC # <- entire target sequence (up to 50 145 | HDFGRTYIWQMSDASHMD # residues per line) 146 | 1 8 0 8 0.720 147 | 1 10 0 8 0.715 # <- i=1 j=10: indices of residues (integers), 148 | 31 38 0 8 0.710 149 | 10 20 0 8 0.690 # <- d1=0 d2=8: the range of Cb-Cb distance 150 | 30 37 0 8 0.678 # predicted for the residue pair (i,j) 151 | 11 29 0 8 0.673 152 | 1 9 0 8 0.63 # <- p=0.63: probability of the residues i=1 and j=9 153 | 21 37 0 8 0.502 # being in contact (in descending order) 154 | 8 15 0 8 0.401 155 | 3 14 0 8 0.400 156 | 5 15 0 8 0.307 157 | 7 14 0 8 0.30 158 | END 159 | ``` 160 | The predictions in this format can then be used as input to build 3D models with structure modelling software. 161 | 162 | ## Discussion 163 | ### Future 164 | 165 | There are plenty of ideas that could not be tried in this project due to computational and time constraints. Briefly, some promising ideas or future directions are listed below: 166 | 167 | * Train with crops of 64x64 AAs, not windows of 200x200 AAs and average at prediction time. 168 | * Use data from Multiple Sequence Alignments (MSA) such as paired changes between AAs. 169 | * Use distance map as potential input for angle prediction or vice versa. 170 | * Train with more data 171 | * Use predictions as constraints to a Protein Structure Prediction pipeline (CNS, Rosetta Solve or others). 172 | * Set up a prediction script/pipeline from raw text/FASTA file 173 | 174 | ### Limitations 175 | 176 | This project has been developed mainly during 3 weeks by 1 person and, therefore, many limitations have appeared. 
177 | They are listed below in order to give a sense of what this project is and what it's not. 178 | 179 | * **No usage of Multiple Sequence Alignments (MSA)**: The methods developed in this project don't use [MSA](https://www.ncbi.nlm.nih.gov/pubmed/27896722) nor MSA-based features as input. 180 | * **Computing power/memory**: Development of the project took place on a computer with the following specifications: Intel i7-6700k, 8gb RAM, NVIDIA GTX-1060Ti 6gb and 256gb of storage. The capacity for data exploration, processing, training and evaluating the models is limited. 181 | * **GPU/TPUs for training**: The models were trained and evaluated on a single GPU. No cloud servers were used. 182 | * **Time**: Three weeks of development during spare time. 183 | * **Domain expertise**: No experts in the field of genomics, proteomics or bioinformatics. The author knows the basics of Biochemistry and Deep Learning. 184 | * **Data**: The average paper about Protein Structure Prediction uses a personalized dataset acquired from the Protein Data Bank [(PDB)](https://www.ncbi.nlm.nih.gov/pubmed/28573592). No such dataset was used. Instead, we used a subset of the [ProteinNet](https://github.com/aqlaboratory/proteinnet) dataset from CASP7. Our models are trained with just 150 proteins (distance prediction) and 600 proteins (angles prediction) due to memory constraints. 185 | 186 | Due to these limitations and/or constraints, the precision/accuracy that the methods developed here can achieve is limited when compared against State Of The Art algorithms. 
187 | 188 | 189 | ## References 190 | * [DeepMind original blog post](https://deepmind.com/blog/alphafold/) 191 | * [AlphaFold @ CASP13: “What just happened?”](https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/#s2.2) 192 | * [Siraj Raval's YT video on AlphaFold](https://www.youtube.com/watch?v=cw6_OP5An8s) 193 | * [ProteinNet dataset](https://github.com/aqlaboratory/proteinnet) 194 | * [Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324) 195 | * [AlphaFold slides](http://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf) 196 | * [De novo protein structure prediction using ultra-fast molecular dynamics simulation](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205819) 197 | 198 | 199 | 200 | ## Contribute 201 | Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request. 
202 | Clone this project to your computer: 203 | 204 | `git clone https://github.com/EricAlcaide/MiniFold` 205 | 206 | By participating in this project, you agree to abide by the thoughtbot [code of conduct](https://thoughtbot.com/open-source-code-of-conduct) 207 | 208 | ## Meta 209 | 210 | * **Author's GitHub Profile**: [Eric Alcaide](https://github.com/hypnopump/) 211 | * **Twitter**: [@eric_alcaide](https://twitter.com/eric_alcaide) 212 | * **Email**: ericalcaide1@gmail.com 213 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.6.1 2 | astor==0.7.1 3 | backcall==0.1.0 4 | biopython==1.73 5 | certifi==2018.11.29 6 | chardet==3.0.4 7 | colorama==0.4.1 8 | cycler==0.10.0 9 | decorator==4.3.0 10 | defusedxml==0.5.0 11 | docopt==0.6.2 12 | entrypoints==0.2.3 13 | gast==0.2.0 14 | GridDataFormats==0.4.0 15 | grpcio==1.16.1 16 | h5py==2.8.0 17 | idna==2.8 18 | ipykernel==5.1.0 19 | ipython==7.2.0 20 | ipython-genutils==0.2.0 21 | ipywidgets==7.4.2 22 | jedi==0.13.1 23 | Jinja2==2.11.3 24 | joblib==0.13.2 25 | jsonschema==2.6.0 26 | jupyter==1.0.0 27 | jupyter-client==5.2.3 28 | jupyter-console==6.0.0 29 | jupyter-core==4.4.0 30 | jupyterlab==0.35.4 31 | jupyterlab-server==0.2.0 32 | Keras==2.2.4 33 | Keras-Applications==1.0.6 34 | Keras-Preprocessing==1.0.5 35 | keras-resnet==0.1.0 36 | kiwisolver==1.0.1 37 | Markdown==3.0.1 38 | MarkupSafe==1.1.0 39 | matplotlib==3.0.2 40 | MDAnalysis==0.19.2 41 | mistune==0.8.4 42 | mmtf-python==1.1.2 43 | mock==2.0.0 44 | msgpack==0.6.1 45 | nbconvert==5.4.0 46 | nbformat==4.4.0 47 | networkx==2.2 48 | notebook==6.1.5 49 | numpy==1.15.4 50 | pandas==0.23.4 51 | pandocfilters==1.4.2 52 | parso==0.3.1 53 | pbr==5.1.2 54 | pickleshare==0.7.5 55 | Pillow==8.1.1 56 | pipreqs==0.4.9 57 | prometheus-client==0.4.2 58 | prompt-toolkit==2.0.7 59 | protobuf==3.6.1 60 | PyAutoGUI==0.9.39 61 | 
Pygments==2.3.0 62 | PyMsgBox==1.0.6 63 | pyparsing==2.3.0 64 | pyperclip==1.7.0 65 | PyScreeze==0.1.18 66 | python-dateutil==2.7.5 67 | PyTweening==1.0.3 68 | pytz==2018.7 69 | pywinpty==0.5.4 70 | PyYAML==5.4 71 | pyzmq==17.1.2 72 | qtconsole==4.4.3 73 | requests==2.21.0 74 | rosetta==0.3 75 | scikit-learn==0.20.2 76 | scipy==1.1.0 77 | Send2Trash==1.5.0 78 | singledispatch==3.4.0.3 79 | six==1.11.0 80 | sklearn==0.0 81 | tensorboard==1.12.0 82 | tensorflow-gpu==1.12.0 83 | termcolor==1.1.0 84 | terminado==0.8.1 85 | testpath==0.4.2 86 | torch==1.0.0 87 | tornado==5.1.1 88 | traitlets==4.3.2 89 | urllib3==1.24.2 90 | wcwidth==0.1.7 91 | webencodings==0.5.1 92 | Werkzeug==0.15.3 93 | widgetsnbextension==3.4.2 94 | yarg==0.1.9 95 | --------------------------------------------------------------------------------