├── README.md
├── final_phase
│   └── final_extraction.py
├── license.txt
├── phase_1
│   ├── Extracting_morgan.py
│   ├── Extracting_smiles.py
│   ├── molecular_file_count_updated.py
│   ├── sampling.py
│   └── sanity_check.py
├── phase_2-3
│   ├── Extract_labels.py
│   ├── ML
│   │   ├── DDCallbacks.py
│   │   ├── DDMetrics.py
│   │   ├── DDModel.py
│   │   ├── DDModelExceptions.py
│   │   ├── Models.py
│   │   ├── Parser.py
│   │   ├── Tokenizer.py
│   │   └── lasso_regularizer.py
│   ├── Prediction_morgan_1024.py
│   ├── activation_script.sh
│   ├── hyperparameter_result_evaluation.py
│   ├── progressive_docking.py
│   ├── simple_job_models.py
│   └── simple_job_predictions.py
└── utilities
    └── Morgan_fing.py
/README.md:
--------------------------------------------------------------------------------
1 | # Deep-Docking-NonAutomated
2 | 
3 | Deep docking (DD) is a deep learning-based tool developed to accelerate docking-based virtual screening. Using a docking program of choice, one can screen extensive chemical libraries like ZINC15 (containing > 1.3 billion molecules) 50 times faster than typical docking. For further details on the processes behind DD, please refer to our paper (https://doi.org/10.1021/acscentsci.0c00229). This repository provides all the scripts required to run the DD process, except ligand preparation and molecular docking (which can be done with many licensed and/or free programs).
4 | 
5 | This version of DD is non-automated and thus requires users to use their own docking programs and ligand preparation tools.
6 | 
7 | If you use DD, please cite:
8 | 
9 | Gentile F, Agrawal V, Hsing M, Ton A-T, Ban F, Norinder U, et al. *Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery.* ACS Cent Sci 2020:acscentsci.0c00229.
10 | 
11 | ## Requirements
12 | * rdkit
13 | * tensorflow >= 1.14.0 (1.15 GPU version recommended. If you are using CUDA 11, please use [nvidia-tensorflow](https://developer.nvidia.com/blog/accelerating-tensorflow-on-a100-gpus/))
14 | * pandas
15 | * numpy
16 | * keras
17 | * matplotlib
18 | * scikit-learn
19 | * Ligand preparation tool
20 | * Docking program
21 | 
22 | ## Help
23 | For help with the options for a specific script, type
24 | 
25 | ```
26 | python script.py -h
27 | ```
28 | ## Before Starting
29 | You need to fill in the `activation_script.sh` file in `phase_2-3` so that it can automatically activate your conda environment when training.
30 | 
31 | ## Preparing a database for Deep Docking
32 | Databases for DD should be in SMILES format. For each compound of the database, DD requires the Morgan fingerprint of radius 2 and size 1024 bits, represented as the list of indices of the bits that are set to 1.
33 | 
34 | First, it is recommended to split the SMILES into a number of evenly populated files to facilitate other steps such as random sampling and inference, and to place these files into a new folder. This reorganization can be achieved, for example, with the `split` command in bash, and the resulting files should have `.txt` extensions. For example, given a `smiles.smi` file with one billion compounds, to obtain 1000 evenly split `.txt` files of 1 million lines each we run:
35 | 
36 | ```bash
37 | split -d -l 1000000 smiles.smi smile_all_ --additional-suffix=.txt
38 | ```
39 | 
40 | Ideally, the number of split files should be equal to the number of CPUs used for random sampling (phase 1, see below), but it should always be larger than the number of GPUs used for inference (phase 3, see below).
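For reference, the short sketch below shows how a single compound maps to the fingerprint representation described above (radius-2 Morgan fingerprint, 1024 bits, stored as the list of indices of the bits set to 1). It is only illustrative: `Morgan_fing.py` (described next) is the script that actually generates the database files, and the exact line layout shown here (ID followed by comma-separated bit indices) is an assumption based on how the phase 1 scripts parse these files.

```python
# Illustrative sketch only -- not part of the DD scripts.
from rdkit import Chem
from rdkit.Chem import AllChem

smiles, mol_id = "CCO", "MOL0000001"  # hypothetical compound and ID
mol = Chem.MolFromSmiles(smiles)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)  # radius 2, 1024 bits
on_bits = list(fp.GetOnBits())  # indices of the bits that are set to 1
# One compound per line: the ID followed by its on-bit indices
print(mol_id + "," + ",".join(str(i) for i in on_bits))
```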
41 | 
42 | Morgan fingerprints can then be generated in the correct DD format using the `Morgan_fing.py` script (located in the `utilities` folder):
43 | 
44 | ```bash
45 | python Morgan_fing.py -sfp path_smile_folder -fp path_to_morgan_folder -fn name_morgan_folder -tp num_cpus
46 | ```
47 | This will create all the fingerprints and place them in `path_to_morgan_folder/name_morgan_folder`.
48 | 
49 | 
50 | ## Phase 1. Random sampling of molecules
51 | In phase 1, molecules are randomly sampled from the database to build/augment a training set. During the first iteration, this phase also samples molecules for the validation and test sets.
52 | 
53 | #### Running phase 1
54 | To run phase 1, run the following sequence of scripts to randomly sample the database, and to extract the Morgan fingerprints and SMILES of the sampled molecules:
55 | 
56 | ```bash
57 | python molecular_file_count_updated.py -pt project_name -it current_iteration -cdd left_mol_directory -t_pos num_cpus -t_samp molecules_to_dock
58 | python sampling.py -pt project_name -fp path_to_project_without_name -it current_iteration -dd left_mol_directory -t_pos total_processors -tr_sz train_size -vl_sz val_size
59 | python sanity_check.py -pt project_name -fp path_to_project_without_name -it current_iteration
60 | python Extracting_morgan.py -pt project_name -fp path_to_project_without_name -it current_iteration -md morgan_directory -t_pos total_processors
61 | python Extracting_smiles.py -pt project_name -fp path_to_project_without_name -it current_iteration -smd smile_directory -t_pos num_cpus
62 | ```
63 | 
64 | * `molecular_file_count_updated.py` determines the number of molecules to be sampled from each file of the database, according to the desired total number of molecules to sample. The per-file sample sizes are stored in the `Mol_ct_file_updated.csv` file created in the `left_mol_directory` directory (see the sketch at the end of this section).
65 | 
66 | * `sampling.py` randomly samples the desired number of molecules for the training, validation and testing sets (again, note that only during the first iteration do we generate the validation and testing sets).
67 | 
68 | * `sanity_check.py` removes overlaps between the sampled sets.
69 | 
70 | * `Extracting_morgan.py` and `Extracting_smiles.py` extract Morgan fingerprints and SMILES for the compounds that have been randomly sampled, and organize them in `morgan` and `smiles` folders inside the directory of the current iteration.
71 | 
72 | **IMPORTANT:** For `molecular_file_count_updated.py` AND `sampling.py`, the option `left_mol_directory` is the directory from which molecules are sampled. For iteration 1, `left_mol_directory` is the directory storing the Morgan fingerprints of the database; BUT for subsequent iterations it must be the path to the `morgan_1024_predictions` folder of the previous iteration.
73 | 
74 | For example, in iteration 2:
75 | 
76 | ```bash
77 | python molecular_file_count_updated.py -pt project_name -it current_iteration -cdd /path_to_project/project_name/iteration_1/morgan_1024_predictions -t_pos num_cpus -t_samp molecules_to_dock
78 | python sampling.py -pt project_name -fp path_to_project_without_name -it current_iteration -dd /path_to_project/project_name/iteration_1/morgan_1024_predictions -t_pos total_processors -tr_sz train_size -vl_sz val_size
79 | ```
80 | This will ensure that sampling is done progressively on better-scoring subsets of the database over the course of DD.
81 | 
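To make the bookkeeping of `molecular_file_count_updated.py` concrete, the sketch below reproduces the proportional allocation it writes to `Mol_ct_file_updated.csv`: each split file contributes to the requested sample in proportion to the number of molecules it contains. The file names and counts here are hypothetical.

```python
# Minimal sketch of the allocation stored in Mol_ct_file_updated.csv
# (hypothetical file names and molecule counts).
import pandas as pd

mol_ct = pd.DataFrame({
    "Number_of_Molecules": [1_000_000, 750_000, 500_000],
    "file_name": ["smile_all_000.txt", "smile_all_001.txt", "smile_all_002.txt"],
})
tot_sampling = 100_000  # -t_samp: total number of molecules to sample this iteration
total_available = mol_ct["Number_of_Molecules"].sum()
mol_ct["Sample_for_million"] = (
    tot_sampling / total_available * mol_ct["Number_of_Molecules"]
).astype(int)
print(mol_ct)  # sampling.py reads this table to decide how many molecules to draw per file
```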
82 | ### After phase 1. Docking
83 | After phase 1 is completed, the molecules grouped in the smiles folder need to be prepared and docked to the target. Use your favourite workflow for this step. It is important that the docking results are stored as SDF files in a *docked* folder in the current iteration directory, keeping the same naming convention as the files in the *smile* folder (names can be slightly changed, but the name of the set (e.g. validation, testing, training) should always be present in the name of the respective SDF file).
84 | 
85 | 
86 | ## Phase 2. Neural network training
87 | In phase 2, deep learning models are trained on the docking scores from the previous phase.
88 | 
89 | #### Running phase 2
90 | Again, we just need to run the following scripts in succession:
91 | ```bash
92 | python Extract_labels.py -if True/False -n_it current_iteration -protein project_name -file_path path_to_project_without_name -t_pos num_cpus -score score_keyword
93 | python simple_job_models.py -n_it current_iteration -mdd morgan_directory -time 00-04:00 -file_path project_path -nhp num_hyperparameters -titr total_iterations -n_mol num_molecules --percent_first_mols percent_first_molecules -ct recall_value --percent_last_mols percent_last_mols
94 | ```
95 | * `Extract_labels.py` extracts the docking scores and organizes them to be used for model training. It should generate three comma-separated files, `training_labels.txt`, `validation_labels.txt` and `testing_labels.txt`, inside the current iteration folder.
96 | 
97 | * `simple_job_models.py` creates the bash scripts to run model training using the `progressive_docking.py` script. These scripts are generated inside the `simple_job` folder in the current iteration. Note that if `-ct` is not specified, the recall value will be set to 0.9.
98 | 
99 | The bash scripts generated by `simple_job_models.py` should then be run on GPU nodes to train the DD models. The resulting models will be stored in the `all_models` folder in the current iteration.
100 | 
101 | 
102 | ## Phase 3. Selection of best model and prediction of the entire database
103 | In phase 3, the models from phase 2 are evaluated and the best-performing one is chosen for predicting the scores of all the molecules in the database. This step will create a `morgan_1024_predictions` subfolder which will contain all the molecules that are predicted as virtual hits in the current iteration.
104 | 
105 | #### Running phase 3
106 | To run phase 3, run:
107 | 
108 | ```bash
109 | python -u hyperparameter_result_evaluation.py -n_it current_iteration --data_path project_path -mdd morgan_directory -n_mol num_molecules
110 | python simple_job_predictions.py -protein project_name -file_path path_to_project_without_name -n_it current_iteration -mdd morgan_directory
111 | 
112 | ```
113 | 
114 | * `hyperparameter_result_evaluation.py` evaluates the models generated in phase 2 and selects the best (most precise) one.
115 | 
116 | * `simple_job_predictions.py` creates the bash scripts to run the predictions over the full database using the `Prediction_morgan_1024.py` script. These scripts will be stored in the `simple_job_predictions` folder of the current iteration.
117 | 
118 | The generated bash scripts can be run on GPU nodes to predict the virtual hits from the full database. The predicted compounds will be stored in the `morgan_1024_predictions` folder of the current iteration.
119 | 
120 | 
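Before moving to the final phase, it can be useful to check how many virtual hits were actually predicted in the last iteration, since this determines whether `-mols_to_dock` (next section) truncates the list at all. The following is a minimal sketch, assuming the `morgan_1024_predictions` files contain one predicted molecule per line (the layout read by `final_extraction.py` further below); the path is a placeholder.

```python
# Hypothetical helper -- not part of the DD scripts.
import glob
import os

pred_dir = "/path_to_project/project_name/iteration_N/morgan_1024_predictions"  # placeholder
n_hits = 0
for pred_file in glob.glob(os.path.join(pred_dir, "*")):
    with open(pred_file) as fh:
        n_hits += sum(1 for _ in fh)  # one predicted virtual hit per line
print("Predicted virtual hits in this iteration:", n_hits)
```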
121 | ## After Deep Docking. The final phase
122 | After the last iteration of DD is complete, the SMILES of all the predicted virtual hits, or of a ranked subset of them, can be obtained for the final docking. Ranking is based on the probability of being a virtual hit. Use the following script (available in `final_phase`).
123 | 
124 | ```bash
125 | python final_extraction.py -smile_dir path_to_smile_dir -prediction_dir path_to_predictions_last_iter -processors n_cpus -mols_to_dock num_molecules_to_dock
126 | ```
127 | 
128 | Executing this script will return the list of SMILES of all the predicted virtual hits of the last iteration, or of the top `num_molecules_to_dock` molecules ranked by their probabilities, whichever is smaller. The probabilities will also be returned in a separate file.
129 | 
--------------------------------------------------------------------------------
/final_phase/final_extraction.py:
--------------------------------------------------------------------------------
1 | from multiprocessing import Pool
2 | from contextlib import closing
3 | import multiprocessing
4 | import pandas as pd
5 | import argparse
6 | import glob
7 | import os
8 | 
9 | 
10 | def merge_on_smiles(pred_file):
11 |     print("Merging " + os.path.basename(pred_file) + "...")
12 | 
13 |     # Read the predictions
14 |     pred = pd.read_csv(pred_file, names=["id", "score"], index_col=0)
15 |     pred = pred.drop_duplicates()
16 | 
17 |     # Read the smiles
18 |     smile_file = os.path.join(args.smile_dir, os.path.basename(pred_file))
19 |     smi = pd.read_csv(smile_file, delimiter=" ", names=["smile", "id"], index_col=1)
20 |     smi = smi.drop_duplicates()
21 |     return pd.merge(pred, smi, how="inner", on=["id"])
22 | 
23 | 
24 | if __name__ == '__main__':
25 |     parser = argparse.ArgumentParser()
26 |     parser.add_argument("-smile_dir", required=True, help='Path to SMILES directory for the database')
27 |     parser.add_argument("-prediction_dir", required=True, help='Path to morgan_1024_predictions folder of last iteration')
28 |     parser.add_argument("-processors", required=True, help='Number of CPUs for multiprocessing')
29 |     parser.add_argument("-mols_to_dock", required=False, type=int, help='Desired number of molecules to dock')
30 | 
31 |     args = parser.parse_args()
32 |     predictions = []
33 | 
34 |     # Find all smile files
35 |     print("Morgan Dir: " + args.prediction_dir)
36 |     print("Smile Dir: " + args.smile_dir)
37 |     for file in glob.glob(args.prediction_dir + "/*"):
38 |         if "smile" in os.path.basename(file):
39 |             print(" - " + os.path.basename(file))
40 |             predictions.append(file)
41 | 
42 |     # Create a list of pandas dataframes
43 |     print("Finding smiles...")
44 |     print(int(args.processors), len(predictions))
45 |     print("Number of CPUs: " + str(multiprocessing.cpu_count()))
46 |     num_jobs = min(len(predictions), int(args.processors))
47 |     with closing(Pool(num_jobs)) as pool:
48 |         combined = pool.map(merge_on_smiles, predictions)
49 | 
50 |     # combine all dataframes
51 |     print("Combining " + str(len(combined)) + " dataframes...")
52 |     base = pd.concat(combined)
53 |     combined = None
54 | 
55 |     print("Done combining... Sorting!")
56 |     base = base.sort_values(by="score", ascending=False)
57 | 
58 |     print("Resetting Index...")
59 |     base.reset_index(inplace=True)
60 | 
61 |     print("Finished Sorting... 
Here is the base:") 62 | print(base.head()) 63 | 64 | if args.mols_to_dock is not None: 65 | mtd = args.mols_to_dock 66 | print("Molecules to dock:", mtd) 67 | print("Total molecules:", len(base)) 68 | 69 | if len(base) <= mtd: 70 | print("Our total molecules are less or equal than the number of molecules to dock -> saving all molecules") 71 | else: 72 | print(f"Our total molecules are more than the number of molecules to dock -> saving {mtd} molecules") 73 | base = base.head(mtd) 74 | 75 | print("Saving") 76 | # Rearrange the smiles 77 | smiles = base.drop('score', 1) 78 | smiles = smiles[["smile", "id"]] 79 | print("Here is the smiles:") 80 | print(smiles.head()) 81 | smiles.to_csv("smiles.csv", sep=" ") 82 | 83 | # Rearrange for id,score 84 | base.drop("smile", 1, inplace=True) 85 | base.to_csv("id_score.csv") 86 | print("Here are the ids and scores") 87 | print(base.head()) 88 | 89 | 90 | 91 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 fgentile, jamesgleave, jyaacoub 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /phase_1/Extracting_morgan.py: -------------------------------------------------------------------------------- 1 | # Reads the ids found in sampling and finds the corresponding morgan fingerprint 2 | import argparse 3 | import glob 4 | 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument('-pt', '--protein_name', required=True, help='Name of the DD project') 7 | parser.add_argument('-fp', '--file_path', required=True, help='Path to the project directory, excluding project directory name') 8 | parser.add_argument('-it', '--n_iteration', required=True, help='Number of current iteration') 9 | parser.add_argument('-md', '--morgan_directory', required=True, help='Path to directory containing Morgan fingerprints for the database') 10 | parser.add_argument('-t_pos', '--tot_process', required=True, help='Number of CPUs to use for multiprocessing') 11 | 12 | io_args = parser.parse_args() 13 | 14 | import os 15 | from multiprocessing import Pool 16 | import time 17 | from contextlib import closing 18 | import numpy as np 19 | 20 | protein = io_args.protein_name 21 | file_path = io_args.file_path 22 | n_it = int(io_args.n_iteration) 23 | morgan_directory = io_args.morgan_directory 24 | tot_process = int(io_args.tot_process) 25 | 26 | 27 | def extract_morgan(file_name): 28 | train = {} 29 | test = {} 30 | valid = {} 31 | with open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/train_set.txt", 'r') as ref: 32 | for line in ref: 33 | train[line.rstrip()] = 0 34 | with open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/valid_set.txt", 'r') as ref: 35 | for line in ref: 36 | valid[line.rstrip()] = 0 37 | with open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/test_set.txt", 'r') as ref: 38 | for line in ref: 39 | test[line.rstrip()] = 0 40 | 41 | # for file_name in file_names: 42 | ref1 = open( 43 | file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + 'train_' + file_name.split('/')[-1], 'w') 44 | ref2 = open( 45 | file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + 'valid_' + file_name.split('/')[-1], 'w') 46 | ref3 = open(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + 'test_' + file_name.split('/')[-1], 47 | 'w') 48 | 49 | with open(file_name, 'r') as ref: 50 | flag = 0 51 | for line in ref: 52 | tmpp = line.strip().split(',')[0] 53 | if tmpp in train.keys(): 54 | train[tmpp] += 1 55 | fn = 1 56 | if train[tmpp] == 1: flag = 1 57 | elif tmpp in valid.keys(): 58 | valid[tmpp] += 1 59 | fn = 2 60 | if valid[tmpp] == 1: flag = 1 61 | elif tmpp in test.keys(): 62 | test[tmpp] += 1 63 | fn = 3 64 | if test[tmpp] == 1: flag = 1 65 | if flag == 1: 66 | if fn == 1: 67 | ref1.write(line) 68 | if fn == 2: 69 | ref2.write(line) 70 | if fn == 3: 71 | ref3.write(line) 72 | flag = 0 73 | 74 | 75 | def alternate_concat(files): 76 | to_return = [] 77 | with open(files, 'r') as ref: 78 | for line in ref: 79 | to_return.append(line) 80 | return to_return 81 | 82 | 83 | def delete_all(files): 84 | os.remove(files) 85 | 86 | 87 | def morgan_duplicacy(f_name): 88 | flag = 0 89 | mol_list = {} 90 | ref1 = open(f_name[:-4] + '_updated.csv', 'a') 91 | with open(f_name, 'r') as ref: 92 | for line in ref: 93 | tmpp = line.strip().split(',')[0] 94 | if tmpp not in mol_list: 95 | mol_list[tmpp] = 1 96 | flag = 1 97 | if flag == 1: 98 | ref1.write(line) 99 | flag = 0 100 | os.remove(f_name) 101 | 102 | 103 | if __name__ == '__main__': 104 | try: 105 
| os.mkdir(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan') 106 | except: 107 | pass 108 | 109 | files = [] 110 | for f in glob.glob(morgan_directory + "/*.txt"): 111 | files.append(f) 112 | 113 | t = time.time() 114 | with closing(Pool(np.min([tot_process, len(files)]))) as pool: 115 | pool.map(extract_morgan, files) 116 | print(time.time() - t) 117 | 118 | all_to_delete = [] 119 | for type_to in ['train', 'valid', 'test']: 120 | t = time.time() 121 | files = [] 122 | for f in glob.glob(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + type_to + '*'): 123 | files.append(f) 124 | all_to_delete.append(f) 125 | print(len(files)) 126 | if len(files) == 0: 127 | print("Error in address above") 128 | break 129 | with closing(Pool(np.min([tot_process, len(files)]))) as pool: 130 | to_print = pool.map(alternate_concat, files) 131 | with open(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + type_to + '_morgan_1024.csv', 132 | 'w') as ref: 133 | for file_data in to_print: 134 | for line in file_data: 135 | ref.write(line) 136 | to_print = [] 137 | print(type_to, time.time() - t) 138 | 139 | f_names = [] 140 | for f in glob.glob(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/*morgan*'): 141 | f_names.append(f) 142 | 143 | t = time.time() 144 | with closing(Pool(np.min([tot_process, len(f_names)]))) as pool: 145 | pool.map(morgan_duplicacy, f_names) 146 | print(time.time() - t) 147 | 148 | with closing(Pool(np.min([tot_process, len(all_to_delete)]))) as pool: 149 | pool.map(delete_all, all_to_delete) 150 | -------------------------------------------------------------------------------- /phase_1/Extracting_smiles.py: -------------------------------------------------------------------------------- 1 | from multiprocessing import Pool 2 | from functools import partial 3 | from contextlib import closing 4 | import argparse 5 | import numpy as np 6 | import pickle 7 | import glob 8 | import time 9 | import os 10 | 11 | parser = argparse.ArgumentParser() 12 | parser.add_argument('-pt', '--project_name', required=True, help='Name of the DD project') 13 | parser.add_argument('-fp', '--file_path', required=True, help='Path to the project directory, excluding project directory name') 14 | parser.add_argument('-it', '--n_iteration', required=True, help='Number of current iteration') 15 | parser.add_argument('-smd', '--smile_directory', required=True, help='Path to SMILES directory of the database') 16 | parser.add_argument('-t_pos', '--tot_process', required=True, help='Number of CPUs to use for multiprocessing') 17 | 18 | io_args = parser.parse_args() 19 | protein = io_args.project_name 20 | file_path = io_args.file_path 21 | n_it = int(io_args.n_iteration) 22 | smile_directory = io_args.smile_directory 23 | tot_process = int(io_args.tot_process) 24 | 25 | # This path is used often so we declare it as a constant here. 26 | ITER_PATH = file_path + '/' + protein + '/iteration_' + str(n_it) 27 | 28 | def extract_smile(file_name, train, valid, test): 29 | # This function extracts the smiles from a file to write them to train, test, and valid files for model training. 
30 | ref1 = open(ITER_PATH + '/smile/' + 'train_' + file_name.split('/')[-1], 'w') 31 | ref2 = open(ITER_PATH + '/smile/' + 'valid_' + file_name.split('/')[-1], 'w') 32 | ref3 = open(ITER_PATH + '/smile/' + 'test_' + file_name.split('/')[-1], 'w') 33 | 34 | with open(file_name, 'r') as ref: 35 | ref.readline() 36 | for line in ref: 37 | tmpp = line.strip().split()[-1] 38 | if tmpp in train.keys(): 39 | train[tmpp] += 1 40 | if train[tmpp] == 1: ref1.write(line) 41 | 42 | elif tmpp in valid.keys(): 43 | valid[tmpp] += 1 44 | if valid[tmpp] == 1: ref2.write(line) 45 | 46 | elif tmpp in test.keys(): 47 | test[tmpp] += 1 48 | if test[tmpp] == 1: ref3.write(line) 49 | 50 | def alternate_concat(files): 51 | # Returns a list of the lines in a file 52 | with open(files, 'r') as ref: 53 | return ref.readlines() 54 | 55 | def smile_duplicacy(f_name): 56 | # removes duplicate molec from the file 57 | mol_list = {} # keeping track of which mol have been written 58 | ref1 = open(f_name[:-4] + '_updated.smi', 'a') 59 | with open(f_name, 'r') as ref: 60 | for line in ref: 61 | tmpp = line.strip().split()[-1] 62 | if tmpp not in mol_list: # avoiding duplicates 63 | mol_list[tmpp] = 1 64 | ref1.write(line) 65 | os.remove(f_name) 66 | 67 | def delete_all(files): 68 | os.remove(files) 69 | 70 | if __name__ == '__main__': 71 | try: 72 | os.mkdir(ITER_PATH + '/smile') 73 | except: # catching exception for when the folder already exists 74 | pass 75 | 76 | files_smiles = [] # Getting the path to every smile file from docking 77 | for f in glob.glob(smile_directory + "/*.txt"): 78 | files_smiles.append(f) 79 | 80 | # getting training, validation, and testing molec IDs from the sets created by `sampling.py` 81 | all_train = {} 82 | all_valid = {} 83 | all_test = {} 84 | with open(ITER_PATH + "/train_set.txt", 'r') as ref: # each txt file contains the IDs corresponding to a morgan fingerprint in morgan folder of that iteration 85 | for line in ref: 86 | all_train[line.rstrip()] = 0 87 | 88 | with open(ITER_PATH + "/valid_set.txt", 'r') as ref: 89 | for line in ref: 90 | all_valid[line.rstrip()] = 0 91 | 92 | with open(ITER_PATH + "/test_set.txt", 'r') as ref: 93 | for line in ref: 94 | all_test[line.rstrip()] = 0 95 | 96 | print(len(all_train), len(all_valid), len(all_test)) 97 | 98 | t = time.time() 99 | with closing(Pool(np.min([tot_process, len(files_smiles)]))) as pool: 100 | pool.map(partial(extract_smile, train=all_train, valid=all_valid, test=all_test), files_smiles) 101 | print(time.time() - t) 102 | 103 | all_to_delete = [] 104 | for type_to in ['train', 'valid', 'test']: 105 | t = time.time() 106 | files = [] 107 | for f in glob.glob(ITER_PATH + '/smile/' + type_to + '*'): 108 | files.append(f) 109 | all_to_delete.append(f) 110 | print(len(files)) 111 | if len(files) == 0: 112 | print("Error in address above") 113 | break 114 | with closing(Pool(np.min([tot_process, len(files)]))) as pool: 115 | to_print = pool.map(alternate_concat, files) 116 | with open(ITER_PATH + '/smile/' + type_to + '_smiles_final.csv', 'w') as ref: 117 | for file_data in to_print: 118 | for line in file_data: 119 | ref.write(line) 120 | to_print = [] 121 | print(type_to, time.time() - t) 122 | 123 | f_names = [] 124 | for f in glob.glob(ITER_PATH + '/smile/*final*'): 125 | f_names.append(f) 126 | 127 | t = time.time() 128 | with closing(Pool(np.min([tot_process, len(f_names)]))) as pool: 129 | pool.map(smile_duplicacy, f_names) 130 | print(time.time() - t) 131 | 132 | with closing(Pool(np.min([tot_process, len(all_to_delete)]))) 
as pool: 133 | pool.map(delete_all, all_to_delete) 134 | -------------------------------------------------------------------------------- /phase_1/molecular_file_count_updated.py: -------------------------------------------------------------------------------- 1 | from multiprocessing import Pool 2 | from contextlib import closing 3 | import pandas as pd 4 | import numpy as np 5 | import argparse 6 | import glob 7 | import time 8 | 9 | try: 10 | import __builtin__ 11 | except ImportError: 12 | # Python 3 13 | import builtins as __builtin__ 14 | 15 | # For debugging purposes only: 16 | def print(*args, **kwargs): 17 | __builtin__.print('\t molecular_file_count_updated: ', end="") 18 | return __builtin__.print(*args, **kwargs) 19 | 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument('-pt','--project_name',required=True,help='Name of the DD project') 22 | parser.add_argument('-it','--n_iteration',required=True,help='Number of current DD iteration') 23 | parser.add_argument('-cdd','--data_directory',required=True,help='Path to directory contaning the remaining molecules of the database ') 24 | parser.add_argument('-t_pos','--tot_process',required=True,help='Number of CPUs to use for multiprocessing') 25 | parser.add_argument('-t_samp','--tot_sampling',required=True,help='Total number of molecules to sample in the current iteration; for first iteration, consider training, validation and test sets, for others only training') 26 | io_args = parser.parse_args() 27 | 28 | 29 | protein = io_args.project_name 30 | n_it = int(io_args.n_iteration) 31 | data_directory = io_args.data_directory 32 | tot_process = int(io_args.tot_process) 33 | tot_sampling = int(io_args.tot_sampling) 34 | 35 | print("Parsed Args:") 36 | print(" - Iteration:", n_it) 37 | print(" - Data Directory:", data_directory) 38 | print(" - Training Size:", tot_process) 39 | print(" - Validation Size:", tot_sampling) 40 | 41 | 42 | def write_mol_count_list(file_name,mol_count_list): 43 | with open(file_name,'w') as ref: 44 | for ct,file_name in mol_count_list: 45 | ref.write(str(ct)+","+file_name.split('/')[-1]) 46 | ref.write("\n") 47 | 48 | 49 | def molecule_count(file_name): 50 | temp = 0 51 | with open(file_name,'r') as ref: 52 | ref.readline() 53 | for line in ref: 54 | temp+=1 55 | return temp, file_name 56 | 57 | 58 | if __name__=='__main__': 59 | files = [] 60 | for f in glob.glob(data_directory+'/*.txt'): 61 | files.append(f) 62 | print("Number Of Files:", len(files)) 63 | 64 | t=time.time() 65 | print("Reading Files...") 66 | with closing(Pool(np.min([tot_process,len(files)]))) as pool: 67 | rt = pool.map(molecule_count,files) 68 | print("Done Reading Finals - Time Taken", time.time()-t) 69 | 70 | print("Saving File Count...") 71 | write_mol_count_list(data_directory+'/Mol_ct_file.csv',rt) 72 | mol_ct = pd.read_csv(data_directory+'/Mol_ct_file.csv',header=None) 73 | mol_ct.columns = ['Number_of_Molecules','file_name'] 74 | Total_sampling = tot_sampling 75 | Total_mols_available = np.sum(mol_ct.Number_of_Molecules) 76 | mol_ct['Sample_for_million'] = [int(Total_sampling/Total_mols_available*elem) for elem in mol_ct.Number_of_Molecules] 77 | mol_ct.to_csv(data_directory+'/Mol_ct_file_updated.csv',sep=',',index=False) 78 | print("Done - Time Taken", time.time()-t) 79 | 80 | -------------------------------------------------------------------------------- /phase_1/sampling.py: -------------------------------------------------------------------------------- 1 | from contextlib import closing 2 | from multiprocessing 
import Pool 3 | import pandas as pd 4 | import numpy as np 5 | import argparse 6 | import glob 7 | import time 8 | import os 9 | 10 | try: 11 | import __builtin__ 12 | except ImportError: 13 | # Python 3 14 | import builtins as __builtin__ 15 | 16 | # For debugging purposes only: 17 | def print(*args, **kwargs): 18 | __builtin__.print('\t sampling: ', end="") 19 | return __builtin__.print(*args, **kwargs) 20 | 21 | 22 | parser = argparse.ArgumentParser() 23 | parser.add_argument('-pt', '--project_name',required=True,help='Name of the DD project') 24 | parser.add_argument('-fp', '--file_path',required=True,help='Path to the project directory, excluding project directory name') 25 | parser.add_argument('-it', '--n_iteration',required=True,help='Number of current iteration') 26 | parser.add_argument('-dd', '--data_directory',required=True,help='Path to directory containing the remaining molecules of the database') 27 | parser.add_argument('-t_pos', '--tot_process',required=True,help='Number of CPUs to use for multiprocessing') 28 | parser.add_argument('-tr_sz', '--train_size',required=True,help='Size of training set') 29 | parser.add_argument('-vl_sz', '--val_size',required=True,help='Size of validation and test set') 30 | io_args = parser.parse_args() 31 | 32 | protein = io_args.project_name 33 | file_path = io_args.file_path 34 | n_it = int(io_args.n_iteration) 35 | data_directory = io_args.data_directory 36 | tot_process = int(io_args.tot_process) 37 | tr_sz = int(io_args.train_size) 38 | vl_sz = int(io_args.val_size) 39 | rt_sz = tr_sz/vl_sz 40 | 41 | print("Parsed Args:") 42 | print(" - Iteration:", n_it) 43 | print(" - Data Directory:", data_directory) 44 | print(" - Training Size:", tr_sz) 45 | print(" - Validation Size:", vl_sz) 46 | 47 | 48 | def train_valid_test(file_name): 49 | sampling_start_time = time.time() 50 | f_name = file_name.split('/')[-1] 51 | mol_ct = pd.read_csv(data_directory+"/Mol_ct_file_updated.csv", index_col=1) 52 | if n_it == 1: 53 | to_sample = int(mol_ct.loc[f_name].Sample_for_million/(rt_sz+2)) 54 | else: 55 | to_sample = int(mol_ct.loc[f_name].Sample_for_million/3) 56 | 57 | total_len = int(mol_ct.loc[f_name].Number_of_Molecules) 58 | shuffle_array = np.linspace(0, total_len-1, total_len) 59 | seed = np.random.randint(0, 2**32) 60 | np.random.seed(seed=seed) 61 | np.random.shuffle(shuffle_array) 62 | 63 | if n_it == 1: 64 | train_ind = shuffle_array[:int(rt_sz*to_sample)] 65 | valid_ind = shuffle_array[int(to_sample*rt_sz):int(to_sample*(rt_sz+1))] 66 | test_ind = shuffle_array[int(to_sample*(rt_sz+1)):int(to_sample*(rt_sz+2))] 67 | else: 68 | train_ind = shuffle_array[:to_sample] 69 | valid_ind = shuffle_array[to_sample:to_sample*2] 70 | test_ind = shuffle_array[to_sample*2:to_sample*3] 71 | 72 | train_ind_dict = {} 73 | valid_ind_dict = {} 74 | test_ind_dict = {} 75 | 76 | train_set = open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/train_set.txt", 'a') 77 | test_set = open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/test_set.txt", 'a') 78 | valid_set = open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/valid_set.txt", 'a') 79 | 80 | for i in train_ind: 81 | train_ind_dict[i] = 1 82 | for j in valid_ind: 83 | valid_ind_dict[j] = 1 84 | for k in test_ind: 85 | test_ind_dict[k] = 1 86 | 87 | # Opens the file and write the test, train, and valid files 88 | with open(file_name, 'r') as ref: 89 | for ind, line in enumerate(ref): 90 | molecule_id = line.strip().split(',')[0] 91 | if ind == 1: 92 | print("molecule_id:", 
molecule_id) 93 | 94 | # now we write to the train, test, and validation sets 95 | if ind in train_ind_dict.keys(): 96 | train_set.write(molecule_id + '\n') 97 | elif ind in valid_ind_dict.keys(): 98 | valid_set.write(molecule_id + '\n') 99 | elif ind in test_ind_dict.keys(): 100 | test_set.write(molecule_id + '\n') 101 | 102 | train_set.close() 103 | valid_set.close() 104 | test_set.close() 105 | print("Process finished sampling in " + str(time.time()-sampling_start_time)) 106 | 107 | if __name__ == '__main__': 108 | try: 109 | os.mkdir(file_path+'/'+protein+"/iteration_"+str(n_it)) 110 | except OSError: 111 | pass 112 | 113 | f_names = [] 114 | for f in glob.glob(data_directory+'/*.txt'): 115 | f_names.append(f) 116 | 117 | t = time.time() 118 | print("Starting Processes...") 119 | with closing(Pool(np.min([tot_process, len(f_names)]))) as pool: 120 | pool.map(train_valid_test, f_names) 121 | 122 | print("Compressing smile file...") 123 | print("Sampling Complete - Total Time Taken:", time.time()-t) 124 | 125 | -------------------------------------------------------------------------------- /phase_1/sanity_check.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import glob 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument('-pt','--protein_name',required=True,help='Name of project') 6 | parser.add_argument('-fp','--file_path',required=True,help='Path to project folder without name of project folder') 7 | parser.add_argument('-it','--n_iteration',required=True,help='Number of current iteration') 8 | 9 | io_args = parser.parse_args() 10 | import time 11 | 12 | protein = io_args.protein_name 13 | file_path = io_args.file_path 14 | n_it = int(io_args.n_iteration) 15 | 16 | old_dict = {} 17 | for i in range(1,n_it): 18 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(i)+'/training_labels*')[-1]) as ref: 19 | ref.readline() 20 | for line in ref: 21 | tmpp = line.strip().split(',')[-1] 22 | old_dict[tmpp] = 1 23 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(i)+'/validation_labels*')[-1]) as ref: 24 | ref.readline() 25 | for line in ref: 26 | tmpp = line.strip().split(',')[-1] 27 | old_dict[tmpp] = 1 28 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(i)+'/testing_labels*')[-1]) as ref: 29 | ref.readline() 30 | for line in ref: 31 | tmpp = line.strip().split(',')[-1] 32 | old_dict[tmpp] = 1 33 | 34 | t=time.time() 35 | new_train = {} 36 | new_valid = {} 37 | new_test = {} 38 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(n_it)+'/train_set*')[-1]) as ref: 39 | for line in ref: 40 | tmpp = line.strip().split(',')[0] 41 | new_train[tmpp] = 1 42 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(n_it)+'/valid_set*')[-1]) as ref: 43 | for line in ref: 44 | tmpp = line.strip().split(',')[0] 45 | new_valid[tmpp] = 1 46 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(n_it)+'/test_set*')[-1]) as ref: 47 | for line in ref: 48 | tmpp = line.strip().split(',')[0] 49 | new_test[tmpp] = 1 50 | print(time.time()-t) 51 | 52 | t=time.time() 53 | for keys in new_train.keys(): 54 | if keys in new_valid.keys(): 55 | new_valid.pop(keys) 56 | if keys in new_test.keys(): 57 | new_test.pop(keys) 58 | for keys in new_valid.keys(): 59 | if keys in new_test.keys(): 60 | new_test.pop(keys) 61 | print(time.time()-t) 62 | 63 | for keys in old_dict.keys(): 64 | if keys in new_train.keys(): 65 | new_train.pop(keys) 66 | if keys in new_valid.keys(): 67 | new_valid.pop(keys) 
68 | if keys in new_test.keys(): 69 | new_test.pop(keys) 70 | 71 | with open(file_path+'/'+protein+'/iteration_'+str(n_it)+'/train_set.txt','w') as ref: 72 | for keys in new_train.keys(): 73 | ref.write(keys+'\n') 74 | with open(file_path+'/'+protein+'/iteration_'+str(n_it)+'/valid_set.txt','w') as ref: 75 | for keys in new_valid.keys(): 76 | ref.write(keys+'\n') 77 | with open(file_path+'/'+protein+'/iteration_'+str(n_it)+'/test_set.txt','w') as ref: 78 | for keys in new_test.keys(): 79 | ref.write(keys+'\n') 80 | -------------------------------------------------------------------------------- /phase_2-3/Extract_labels.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import builtins as __builtin__ 3 | import glob 4 | import gzip 5 | import os 6 | from contextlib import closing 7 | from multiprocessing import Pool 8 | 9 | 10 | # For debugging purposes only: 11 | def print(*args, **kwargs): 12 | __builtin__.print('\t extract_L: ', end="") 13 | return __builtin__.print(*args, **kwargs) 14 | 15 | 16 | parser = argparse.ArgumentParser() 17 | parser.add_argument('-if', '--is_final', required=True, help='True/False for is this the final iteration?') 18 | parser.add_argument('-n_it', '--iteration_no', required=True, help='Number of current iteration') 19 | parser.add_argument('-protein', '--protein', required=True, help='Name of the DD project') 20 | parser.add_argument('-file_path', '--file_path', required=True, help='Path to the project directory, excluding project directory name') 21 | parser.add_argument('-t_pos', '--tot_process', required=True, help='Number of CPUs to use for multiprocessing') 22 | parser.add_argument('-score', '--score_keyword', required=True, help='Score keyword. Name of the field storing the docking score in the SDF files of docking results') 23 | 24 | io_args = parser.parse_args() 25 | 26 | is_final = io_args.is_final 27 | n_it = int(io_args.iteration_no) 28 | protein = io_args.protein 29 | file_path = io_args.file_path 30 | tot_process = int(io_args.tot_process) 31 | key_word = str(io_args.score_keyword) 32 | 33 | if is_final == 'False' or is_final == 'false': 34 | is_final = False 35 | elif is_final == 'True' or is_final == 'true': 36 | is_final = True 37 | else: 38 | raise TypeError('-if parameter must be a boolean (true/false)') 39 | 40 | # mol_key = 'ZINC' 41 | print("Keyword: ", key_word) 42 | 43 | 44 | def get_scores(ref): 45 | scores = [] 46 | for line in ref: # Looping through the molecules 47 | zinc_id = line.rstrip() 48 | line = ref.readline() 49 | # '$$$' signifies end of molecule info 50 | while line != '' and line[:4] != '$$$$': # Looping through its information and saving scores 51 | 52 | tmp = line.rstrip().split('<')[-1] 53 | 54 | if key_word == tmp[:-1]: 55 | tmpp = float(ref.readline().rstrip()) 56 | if tmpp > 50 or tmpp < -50: 57 | print(zinc_id, tmpp) 58 | else: 59 | scores.append([zinc_id, tmpp]) 60 | 61 | line = ref.readline() 62 | return scores 63 | 64 | 65 | def extract_glide_score(filen): 66 | scores = [] 67 | try: 68 | # Opening the GNU compressed file 69 | with gzip.open(filen, 'rt') as ref: 70 | scores = get_scores(ref) 71 | 72 | except Exception as e: 73 | print('Exception occured in Extract_labels.py: ', e) 74 | # file is already decompressed 75 | with open(filen, 'r') as ref: 76 | scores = get_scores(ref) 77 | 78 | if 'test' in os.path.basename(filen): 79 | new_name = 'testing' 80 | elif 'valid' in os.path.basename(filen): 81 | new_name = 'validation' 82 | elif 'train' in 
os.path.basename(filen): 83 | new_name = 'training' 84 | else: 85 | print("Could not generate new training set") 86 | exit() 87 | 88 | with open(file_path + '/' + protein + '/iteration_' + str(n_it) + '/' + new_name + '_' + 'labels.txt', 'w') as ref: 89 | ref.write('r_i_docking_score' + ',' + 'ZINC_ID' + '\n') 90 | for z_id, gc in scores: 91 | ref.write(str(gc) + ',' + z_id + '\n') 92 | 93 | 94 | if __name__ == '__main__': 95 | files = [] 96 | iter_path = file_path + '/' + protein + '/iteration_' + str(n_it) 97 | 98 | # Checking to see if the labels have already been extracted: 99 | sets = ["training", "testing", "validation"] 100 | files_labels = glob.glob(iter_path + "/*_labels.txt") 101 | foundAll = True 102 | for s in sets: 103 | found = False 104 | print(s) 105 | for f in files_labels: 106 | set_name = f.split('/')[-1].split("_labels.txt")[0] 107 | if set_name == s: 108 | found = True 109 | print('Found') 110 | break 111 | if not found: 112 | foundAll = False 113 | print('Not Found') 114 | break 115 | if foundAll: 116 | print('Labels have already been extracted...') 117 | print('Remove "_labels.text" files in \"' + iter_path + '\" to re-extract') 118 | exit(0) 119 | 120 | # Checking to see if this is the final iteration to use the right folder 121 | if is_final: 122 | path = file_path + '/' + protein + '/after_iteration/docked/*.sdf*' 123 | else: 124 | path = iter_path + '/docked/*.sdf*' 125 | path_labels = iter_path + '/*labels*' 126 | 127 | for f in glob.glob(path): 128 | files.append(f) 129 | 130 | print("num files in", path, ":", len(files)) 131 | print("Files:", [os.path.basename(f) for f in files]) 132 | if len(files) == 0: 133 | print('NO FILES IN: ', path) 134 | print('CANCEL JOB...') 135 | exit(1) 136 | 137 | # Parallel running of the extract_glide_score() with each file path of the files array 138 | with closing(Pool(len(files))) as pool: 139 | pool.map(extract_glide_score, files) 140 | 141 | if not is_final: 142 | # renaming from f1_f2_f3 to f3_labels.txt 143 | try: 144 | for f in glob.glob(path_labels): 145 | print(f, iter_path + '/' + f.split('/')[-1].split('_')[2] + '_' + 'labels.txt') 146 | os.rename(f, iter_path + '/' + f.split('/')[-1].split('_')[2] + '_' + 'labels.txt') ## why are we renaming this? 
147 | except IndexError: 148 | print("Index error on renaming", f) 149 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDCallbacks.py: -------------------------------------------------------------------------------- 1 | """ 2 | James Gleave 3 | v1.1.0 4 | """ 5 | 6 | from tensorflow.keras.callbacks import Callback 7 | import pandas as pd 8 | import time 9 | import os 10 | 11 | 12 | class DDLogger(Callback): 13 | """ 14 | Logs the important data regarding model training 15 | """ 16 | 17 | def __init__(self, log_path, 18 | max_time=36000, 19 | max_epochs=500, 20 | monitoring='val_loss', ): 21 | super(Callback, self).__init__() 22 | # Params 23 | self.max_time = max_time 24 | self.max_epochs = max_epochs 25 | self.monitoring = monitoring 26 | 27 | # Stats 28 | self.epoch_start_time = 0 29 | self.current_epoch = 0 30 | 31 | # File 32 | self.log_path = log_path 33 | self.model_history = {} 34 | 35 | def on_train_begin(self, logs={}): 36 | self.epoch_start_time = time.time() 37 | 38 | def on_epoch_begin(self, epoch, logs=None): 39 | self.epoch_start_time = time.time() 40 | 41 | def on_epoch_end(self, epoch, logs={}): 42 | # Store the data 43 | current_time = time.time() 44 | epoch_duration = current_time - self.epoch_start_time 45 | logs['time_per_epoch'] = epoch_duration 46 | self.model_history["epoch_" + str(epoch + 1)] = logs 47 | 48 | # Estimate time to completion 49 | estimate, elapsed, (s, p, x) = self.estimate_training_time() 50 | logs['estimate_time'] = estimate 51 | logs['time_elapsed'] = elapsed 52 | self.model_history["epoch_" + str(epoch + 1)] = logs 53 | 54 | # Save the data to a csv 55 | df = pd.DataFrame(self.model_history) 56 | df.to_csv(self.log_path) 57 | 58 | print("Time taken calculating callbacks:", time.time()-current_time) 59 | 60 | def estimate_training_time(self): 61 | max_allotted_time = self.max_time 62 | max_allotted_epochs = self.max_epochs 63 | 64 | # Grab the info about the model 65 | model_loss = [] 66 | time_per_epoch = [] 67 | for epoch in self.model_history: 68 | model_loss.append(self.model_history[epoch]['val_loss']) 69 | time_per_epoch.append(self.model_history[epoch]['time_per_epoch']) 70 | 71 | time_elapsed = sum(time_per_epoch) 72 | average_time_per_epoch = sum(time_per_epoch) / len(time_per_epoch) 73 | current_epoch = len(time_per_epoch) 74 | 75 | # Find out if the model is approaching an early stop 76 | epochs_until_early_stop = 10 77 | stopping_vector = [] 78 | prev_loss = model_loss[0] 79 | for loss in model_loss: 80 | improved = loss < prev_loss 81 | stopping_vector.append(improved) 82 | if improved: 83 | prev_loss = loss 84 | 85 | # Check how close we are to an early stop 86 | longest_failure = 0 87 | for improved in stopping_vector: 88 | if not improved: 89 | longest_failure += 1 90 | else: 91 | longest_failure = 0 92 | 93 | max_time = max_allotted_epochs * average_time_per_epoch if max_allotted_epochs * average_time_per_epoch < max_allotted_time else max_allotted_time 94 | time_if_early_stop = (epochs_until_early_stop - longest_failure) * average_time_per_epoch 95 | 96 | # Estimate a completion time 97 | loss_drops = stopping_vector.count(True) 98 | loss_gains = len(stopping_vector) - loss_drops 99 | try: 100 | gain_drop_ratio = loss_gains / loss_drops 101 | except ZeroDivisionError: 102 | gain_drop_ratio = 0 103 | 104 | # Created a function to estimate training time 105 | power = 1 - (gain_drop_ratio ** 3 / 5) 106 | time_estimate = (max_time ** power) / (1 + longest_failure) 107 | 108 | # Smooth 
out the estimate 109 | if current_epoch > 1: 110 | last = self.model_history['epoch_{}'.format(current_epoch - 1)]['estimate_time'] 111 | time_estimate = (time_estimate + last) / 2 112 | 113 | # If the time estimate surpasses the max time then just show the max time 114 | time_for_remaining_epochs = (self.max_epochs - current_epoch) * average_time_per_epoch 115 | if time_for_remaining_epochs < time_estimate: 116 | time_estimate = time_for_remaining_epochs 117 | 118 | return time_estimate, time_elapsed, (longest_failure, gain_drop_ratio, max_time) 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDMetrics.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow as tf 4 | from tensorflow.keras import backend as K 5 | 6 | 7 | def recall(y_true, y_pred): 8 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 9 | possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) 10 | recall_keras = true_positives / (possible_positives + K.epsilon()) 11 | return recall_keras 12 | 13 | 14 | def precision(y_true, y_pred): 15 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 16 | predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) 17 | precision_keras = true_positives / (predicted_positives + K.epsilon()) 18 | return precision_keras 19 | 20 | 21 | def specificity(y_true, y_pred): 22 | tn = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1))) 23 | fp = K.sum(K.round(K.clip((1 - y_true) * y_pred, 0, 1))) 24 | return tn / (tn + fp + K.epsilon()) 25 | 26 | 27 | def negative_predictive_value(y_true, y_pred): 28 | tn = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1))) 29 | fn = K.sum(K.round(K.clip(y_true * (1 - y_pred), 0, 1))) 30 | return tn / (tn + fn + K.epsilon()) 31 | 32 | 33 | def f1(y_true, y_pred): 34 | p = precision(y_true, y_pred) 35 | r = recall(y_true, y_pred) 36 | return 2 * ((p * r) / (p + r + K.epsilon())) 37 | 38 | 39 | def fbeta(y_true, y_pred, beta=2): 40 | y_pred = K.clip(y_pred, 0, 1) 41 | 42 | tp = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)), axis=1) 43 | fp = K.sum(K.round(K.clip(y_pred - y_true, 0, 1)), axis=1) 44 | fn = K.sum(K.round(K.clip(y_true - y_pred, 0, 1)), axis=1) 45 | 46 | p = tp / (tp + fp + K.epsilon()) 47 | r = tp / (tp + fn + K.epsilon()) 48 | 49 | num = (1 + beta ** 2) * (p * r) 50 | den = (beta ** 2 * p + r + K.epsilon()) 51 | return K.mean(num / den) 52 | 53 | 54 | def matthews_correlation_coefficient(y_true, y_pred): 55 | tp = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 56 | tn = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1))) 57 | fp = K.sum(K.round(K.clip((1 - y_true) * y_pred, 0, 1))) 58 | fn = K.sum(K.round(K.clip(y_true * (1 - y_pred), 0, 1))) 59 | 60 | num = tp * tn - fp * fn 61 | den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) 62 | return num / K.sqrt(den + K.epsilon()) 63 | 64 | 65 | def equal_error_rate(y_true, y_pred): 66 | n_imp = tf.count_nonzero(tf.equal(y_true, 0), dtype=tf.float32) + tf.constant(K.epsilon()) 67 | n_gen = tf.count_nonzero(tf.equal(y_true, 1), dtype=tf.float32) + tf.constant(K.epsilon()) 68 | 69 | scores_imp = tf.boolean_mask(y_pred, tf.equal(y_true, 0)) 70 | scores_gen = tf.boolean_mask(y_pred, tf.equal(y_true, 1)) 71 | 72 | loop_vars = (tf.constant(0.0), tf.constant(1.0), tf.constant(0.0)) 73 | cond = lambda t, fpr, fnr: tf.greater_equal(fpr, fnr) 74 | body = lambda t, fpr, fnr: ( 75 | t + 0.001, 76 | 
tf.divide(tf.count_nonzero(tf.greater_equal(scores_imp, t), dtype=tf.float32), n_imp), 77 | tf.divide(tf.count_nonzero(tf.less(scores_gen, t), dtype=tf.float32), n_gen) 78 | ) 79 | t, fpr, fnr = tf.while_loop(cond, body, loop_vars, back_prop=False) 80 | eer = (fpr + fnr) / 2 81 | 82 | return eer 83 | 84 | 85 | def get_metric(name): 86 | metrics = {"recall": tf.keras.metrics.Recall(), 87 | "precision": tf.keras.metrics.Precision(), 88 | "specificity": specificity, 89 | "negative_predictive_value": negative_predictive_value, 90 | "f1": f1, 91 | "fbeta": fbeta, 92 | "equal_error_rate": equal_error_rate, 93 | "matthews_correlation_coefficient": matthews_correlation_coefficient} 94 | keys = list(metrics.keys()) 95 | assert name in keys, print("Cannot find metric " + name, ". Available metrics are {}".format(keys)) 96 | return metrics[name] 97 | 98 | 99 | class DDMetrics: 100 | def __init__(self, model): 101 | self.model = model 102 | self.params = model.count_params() 103 | 104 | @staticmethod 105 | def scaled_performance(y_true, y_pred): 106 | p = precision(y_true, y_pred) 107 | f = f1(y_true, y_pred) 108 | return ((p*p) + (f*f))/2 109 | 110 | def relative_scaled_performance(self, y_true, y_pred): 111 | params = self.params / 1_000_000 112 | sp = self.scaled_performance(y_true, y_pred) 113 | return sp/(1.03 ** params) 114 | 115 | def relative_precision(self, y_true, y_pred): 116 | p = precision(y_true, y_pred) 117 | params = self.params / 1_000_000 118 | return p/params 119 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDModel.py: -------------------------------------------------------------------------------- 1 | """ 2 | Version 1.1.2 3 | """ 4 | from .Tokenizer import DDTokenizer 5 | from sklearn import preprocessing 6 | from .DDModelExceptions import * 7 | from tensorflow.keras import backend 8 | from .Models import Models 9 | from .Parser import Parser 10 | import tensorflow as tf 11 | import pandas as pd 12 | import numpy as np 13 | import keras 14 | import time 15 | import os 16 | 17 | import warnings 18 | warnings.filterwarnings('ignore') 19 | 20 | 21 | class DDModel(Models): 22 | """ 23 | A class responsible for creating, storing, and working with our deep docking models 24 | """ 25 | 26 | def __init__(self, mode, input_shape, hyperparameters, metrics=None, loss='binary_crossentropy', regression=False, 27 | name="model"): 28 | 29 | """ 30 | Parameters 31 | ---------- 32 | mode : str 33 | A string indicating which model to use 34 | input_shape : tuple or list 35 | The input shape for the model 36 | hyperparameters : dict 37 | A dictionary containing the hyperparameters for the DDModel's model 38 | metrics : list 39 | The metric(s) used by keras 40 | loss : str 41 | The loss function used by keras 42 | regression : bool 43 | Set to true if the model is performing regression 44 | """ 45 | 46 | if metrics is None: 47 | self.metrics = ['accuracy'] 48 | else: 49 | self.metrics = metrics 50 | 51 | # Use regression or use binary classification 52 | output_activation = 'linear' if regression else 'sigmoid' 53 | 54 | # choose the loss function 55 | self.loss_func = loss 56 | if regression and loss == 'binary_crossentropy': 57 | self.loss_func = 'mean_squared_error' 58 | hyperparameters["loss_func"] = self.loss_func 59 | 60 | if mode == "loaded_model": 61 | super().__init__(hyperparameters={'bin_array': [], 62 | 'dropout_rate': 0, 63 | 'learning_rate': 0, 64 | 'num_units': 0, 65 | 'epsilon': 0}, 66 | output_activation=output_activation, 
name=name) 67 | self.mode = "" 68 | 69 | self.input_shape = () 70 | 71 | self.history = keras.callbacks.History() 72 | self.time = {"training_time": -1, "prediction_time": -1} 73 | else: 74 | # Create a model 75 | super().__init__(hyperparameters=hyperparameters, 76 | output_activation=output_activation, name=name) 77 | 78 | self.mode = mode 79 | 80 | self.input_shape = input_shape 81 | 82 | self.history = keras.callbacks.History() 83 | self.time = {'training_time': -1, "prediction_time": -1} 84 | 85 | self.model = self._create_model() 86 | self._compile() 87 | 88 | def fit(self, train_x, train_y, epochs, batch_size, shuffle, class_weight, verbose, validation_data, callbacks): 89 | 90 | """ 91 | Reshapes the input data and fits the model 92 | 93 | Parameters 94 | ---------- 95 | train_x : ndarray 96 | Training data 97 | train_y : ndarray 98 | Training labels 99 | epochs : int 100 | Number of epochs to train on 101 | batch_size : int 102 | The batch size 103 | shuffle : bool 104 | Whether to shuffle the data 105 | class_weight : dict 106 | The class weights 107 | verbose : int 108 | The verbose 109 | validation_data : list 110 | The validation data and labels 111 | callbacks : list 112 | Keras callbacks 113 | """ 114 | 115 | # First reshape the data to fit the chosen model 116 | # Here we form the shape 117 | shape_train_x = [train_x.shape[0]] 118 | shape_valid_x = [validation_data[0].shape[0]] 119 | for val in self.model.input_shape[1:]: 120 | shape_train_x.append(val) 121 | shape_valid_x.append(val) 122 | 123 | # Here we reshape the data 124 | # Format: shape = (size of data, input_shape[0], ..., input_shape[n] 125 | train_x = np.reshape(train_x, shape_train_x) 126 | validation_data_x = np.reshape(validation_data[0], shape_valid_x) 127 | validation_data_y = validation_data[1] 128 | validation_data = (validation_data_x, validation_data_y) 129 | 130 | # Track the training time 131 | training_time = time.time() 132 | 133 | # If we are in regression mode, ignore the class weights 134 | if self.output_activation == 'linear': 135 | class_weight = None 136 | 137 | # Train the model and store the history 138 | self.history = self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, shuffle=shuffle, 139 | class_weight=class_weight, verbose=verbose, validation_data=validation_data, 140 | callbacks=callbacks) 141 | 142 | # Store the training time 143 | training_time = time.time() - training_time 144 | self.time['training_time'] = training_time 145 | print("Training time:", training_time) 146 | 147 | def predict(self, x_test, verbose=0): 148 | 149 | """ 150 | Reshapes the input data and returns the models predictions 151 | 152 | Parameters 153 | ---------- 154 | x_test : ndarray 155 | The test data 156 | 157 | verbose : int 158 | The verbose of the model's prediction 159 | 160 | Returns 161 | ------- 162 | predictions : ndarray 163 | The model's predictions 164 | """ 165 | # We must reshape the test data to fit our model 166 | shape = [x_test.shape[0]] 167 | for val in list(self.model.input_shape)[1:]: 168 | shape.append(val) 169 | 170 | x_test = np.reshape(x_test, newshape=shape) 171 | 172 | # Predict and return the predictions 173 | prediction_time = time.time() # Keep track of how long prediction took 174 | predictions = self.model.predict(x_test, verbose=verbose) # Predict 175 | prediction_time = time.time() - prediction_time # Update prediction time 176 | self.time['prediction_time'] = prediction_time # Store the prediction time 177 | 178 | return predictions 179 | 180 | def 
save(self, path, json=False): 181 | self._write_stats_to_file(path) 182 | if json: 183 | json_model = self.model.to_json() 184 | with open(path + ".json", 'w') as json_file: 185 | json_file.write(json_model) 186 | else: 187 | try: 188 | self.model.save(path, save_format='h5') 189 | except: 190 | print("Could not save as h5 file. This is probably due to tensorflow version.") 191 | print("If the model is saved as a directory, it will cause issues.") 192 | print("Trying to save again...") 193 | self.model.save(path) 194 | 195 | def load_stats(self, path): 196 | """ 197 | Load the stats from a .ddss file into the current DDModel 198 | 199 | Parameters 200 | ---------- 201 | path : str 202 | """ 203 | 204 | info = Parser.parse_ddss(path) 205 | for key in info.keys(): 206 | try: 207 | self.__dict__[key] = info[key] 208 | except KeyError: 209 | print(key, 'is not an attribute of this class.') 210 | self.input_shape = "Loaded Model -> Input shape will be inferred" 211 | 212 | if self.time == {}: 213 | self.time = {"training_time": 'Could Not Be Loaded', "prediction_time": 'Could Not Be Loaded'} 214 | 215 | def _write_stats_to_file(self, path="", return_string=False): 216 | info = "* {}'s Stats * \n".format(self.name) 217 | info += "- Model mode: " + self.mode + " \n" 218 | info += "\n" 219 | # Write the timings 220 | if not isinstance(self.time['training_time'], str) and self.time['training_time'] > -1: 221 | if isinstance(self.history, dict): 222 | num_eps = self.history['total_epochs'] 223 | else: 224 | num_eps = len(self.history.history['loss']) 225 | 226 | info += "- Model Time: \n" 227 | info += " - training_time: {train_time}".format(train_time=self.time['training_time']) + " \n" 228 | info += " - time_per_epoch: {epoch_time}".format(epoch_time=(self.time['training_time'] / num_eps)) + " \n" 229 | info += " - prediction_time: {pred_time}".format(pred_time=self.time['prediction_time']) + " \n" 230 | else: 231 | info += "- Model Time: \n" 232 | info += " - Model has not been trained yet. \n" 233 | info += "\n" 234 | 235 | # Write the history 236 | try: 237 | info += "- History Stats: \n" 238 | if isinstance(self.history, dict): 239 | hist = self.history 240 | else: 241 | hist = self.history.history 242 | 243 | # Get all the history values and keys stored 244 | for key in hist: 245 | try: 246 | info += " - {key}: {val} \n".format(key=key, val=hist[key][-1]) 247 | except TypeError: 248 | info += " - {key}: {val} \n".format(key=key, val=hist[key]) 249 | 250 | try: 251 | try: 252 | info += " - total_epochs: {epochs}".format(epochs=len(hist['loss'])) 253 | except TypeError: 254 | pass 255 | 256 | info += "\n" 257 | except KeyError: 258 | info += " - Model has not been trained yet. \n" 259 | 260 | except (AttributeError, KeyError): 261 | # No history is available -> the model has not been trained yet 262 | info += " - Model has not been trained yet. \n" 263 | info += "\n" 264 | 265 | # Write the hyperparameters 266 | info += "- Hyperparameter Stats: \n" 267 | for key in self.hyperparameters.keys(): 268 | if key != 'bin_array' or len(self.hyperparameters[key]) > 0: 269 | info += " - {key}: {val} \n".format(key=key, val=self.hyperparameters[key]) 270 | info += "\n" 271 | 272 | # Write stats about the model architecture 273 | info += "- Model Architecture Stats: \n" 274 | try: 275 | trainable_count = int( 276 | np.sum([backend.count_params(p) for p in set(self.model.trainable_weights)])) 277 | non_trainable_count = int( 278 | np.sum([backend.count_params(p) for p in set(self.model.non_trainable_weights)])) 279 | 280 | info += ' - total_params: {:,} \n'.format(trainable_count + non_trainable_count) 281 | info += ' - trainable_params: {:,} \n'.format(trainable_count) 282 | info += ' - non_trainable_params: {:,} \n'.format(non_trainable_count) 283 | info += "\n" 284 | except (TypeError, AttributeError): 285 | info += ' - total_params: Cannot be determined \n' 286 | info += ' - trainable_params: Cannot be determined \n' 287 | info += ' - non_trainable_params: Cannot be determined \n' 288 | info += "\n" 289 | 290 | # Create a layer display 291 | display_string = "" 292 | for i, layer in enumerate(self.model.layers): 293 | if i == 0: 294 | display_string += "Input: \n" 295 | 296 | display_string += " ↓ [ {name} ] \n".format(name=layer.name) 297 | info += display_string 298 | 299 | if not return_string: 300 | with open(path + '.ddss', 'w') as stat_file: 301 | stat_file.write(info) 302 | else: 303 | return info 304 | 305 | def _create_model(self): 306 | """Creates and returns a model 307 | 308 | Raises 309 | ------ 310 | IncorrectModelModeError 311 | If a mode was passed that does not exist, this error will be raised 312 | """ 313 | # Try creating the model and raise an exception if it fails 314 | try: 315 | model = getattr(self, self.mode, None)(self.input_shape) 316 | except TypeError: 317 | raise IncorrectModelModeError(self.mode, Models.get_available_modes()) 318 | return model 319 | 320 | def _compile(self): 321 | """Compiles the DDModel object's model""" 322 | 323 | if 'epsilon' not in self.hyperparameters.keys(): 324 | self.hyperparameters['epsilon'] = 1e-06 325 | 326 | adam_opt = tf.keras.optimizers.Adam(learning_rate=self.hyperparameters['learning_rate'], 327 | epsilon=self.hyperparameters['epsilon']) 328 | self.model.compile(optimizer=adam_opt, loss=self.loss_func, metrics=self.metrics) 329 | 330 | @staticmethod 331 | def load(model, **kwargs): 332 | pre_compiled = True 333 | # Can be a path to a model or a model instance 334 | if type(model) is str: 335 | dd_model = DDModel(mode="loaded_model", input_shape=[], hyperparameters={}) 336 | 337 | # If we would like to load from a json, we can do that as well. 338 | if '.json' in model: 339 | dd_model.model = tf.keras.models.model_from_json(open(model).read(), 340 | custom_objects=Models.get_custom_objects()) 341 | model = model.replace('.json', "") 342 | pre_compiled = False 343 | else: 344 | dd_model.model = tf.keras.models.load_model(model, custom_objects=Models.get_custom_objects()) 345 | else: 346 | dd_model = DDModel(mode="loaded_model", input_shape=[], hyperparameters={}) 347 | dd_model.model = model 348 | 349 | if 'kt_hyperparameters' in kwargs.keys(): 350 | hyp = kwargs['kt_hyperparameters'].get_config()['values'] 351 | for key in hyp.keys(): 352 | try: 353 | dd_model.__dict__['hyperparameters'][key] = hyp[key] 354 | if key == 'kernel_reg': 355 | dd_model.__dict__['hyperparameters'][key] = ['None', 'Lasso', 'l1', 'l2'][int(hyp[key])] 356 | 357 | except KeyError: 358 | print(key, 'is not an attribute of this class.') 359 | else: 360 | # Try to load a stats file 361 | try: 362 | dd_model.load_stats(model + ".ddss") 363 | except (TypeError, FileNotFoundError): 364 | print("Could not find a stats file...") 365 | 366 | if 'metrics' in kwargs.keys(): 367 | dd_model.metrics = kwargs['metrics'] 368 | else: 369 | dd_model.metrics = ['accuracy'] 370 | 371 | if not pre_compiled: 372 | dd_model._compile() 373 | 374 | if 'name' in kwargs.keys(): 375 | dd_model.name = kwargs['name'] 376 | 377 | dd_model.mode = 'loaded_model' 378 | return dd_model 379 | 380 | @staticmethod 381 | def process_smiles(smiles, vocab_size=100, fit_range=1000, normalize=True, use_padding=True, padding_size=None, one_hot=False): 382 | # Create the tokenizer 383 | tokenizer = DDTokenizer(vocab_size) 384 | # Fit the tokenizer 385 | tokenizer.fit(smiles[0:fit_range]) 386 | # Encode the smiles 387 | encoded_smiles = tokenizer.encode(data=smiles, use_padding=use_padding, 388 | padding_size=padding_size, normalize=normalize) 389 | 390 | if one_hot: 391 | encoded_smiles = DDModel.one_hot_encode(encoded_smiles, len(tokenizer.word_index)) 392 | 393 | return encoded_smiles 394 | 395 | @staticmethod 396 | def one_hot_encode(encoded_smiles, unique_category_count): 397 | one_hot = keras.backend.one_hot(encoded_smiles, unique_category_count) 398 | return one_hot 399 | 400 | @staticmethod 401 | def normalize(values: pd.Series): 402 | assert type(values) is pd.Series, "Type Error -> Expected pandas.Series" 403 | # Extract the indices and name 404 | indices = values.index 405 | name = values.index.name 406 | 407 | # Normalize the values to [0, 1] 408 | normalized_values = preprocessing.minmax_scale(values, (0, 1)) 409 | 410 | # Create a pandas series to return 411 | values = pd.Series(index=indices, data=normalized_values, name=name) 412 | return values 413 | 414 | def __repr__(self): 415 | return self._write_stats_to_file(return_string=True) 416 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDModelExceptions.py: -------------------------------------------------------------------------------- 1 | class Error(Exception): 2 | """Base class for other exceptions""" 3 | pass 4 | 5 | 6 | class IncorrectModelModeError(Error): 7 | """Exception raised for errors in the model mode. 8 | 9 | Attributes: 10 | mode -- input mode which caused the error 11 | message -- explanation of the error 12 | """ 13 | def __init__(self, mode, available_modes, message="Incorrect model mode. 
Use one of the following modes:"): 14 | self.mode = mode 15 | self.message = message 16 | self.available_modes = available_modes 17 | 18 | def __str__(self): 19 | mode_string = "\n\n" 20 | for mode in self.available_modes: 21 | mode_string += " " + mode + "\n" 22 | 23 | return f'{self.mode} -> {self.message}' + mode_string 24 | -------------------------------------------------------------------------------- /phase_2-3/ML/Models.py: -------------------------------------------------------------------------------- 1 | """ 2 | Version 1.1.0 3 | 4 | The model to be used in deep docking 5 | James Gleave 6 | """ 7 | 8 | import keras 9 | from ML.lasso_regularizer import Lasso 10 | from ML.DDMetrics import * 11 | from tensorflow.keras.models import Model, Sequential 12 | from tensorflow.keras.layers import (Input, Dense, Activation, BatchNormalization, Dropout, LSTM, 13 | Conv2D, MaxPool2D, Flatten, Embedding, MaxPooling1D, 14 | Conv1D) 15 | from tensorflow.keras.regularizers import * 16 | 17 | import warnings 18 | warnings.filterwarnings('ignore') 19 | 20 | # TODO Give user option to make their own model 21 | 22 | 23 | class Models: 24 | def __init__(self, hyperparameters, output_activation, name="model"): 25 | """ 26 | Class to hold various NN architectures allowing for cleaner code when determining which architecture 27 | is best suited for DD. 28 | 29 | :param hyperparameters: a dictionary holding the parameters for the models: 30 | ('bin_array', 'dropout_rate', 'learning_rate', 'num_units') 31 | """ 32 | self.hyperparameters = hyperparameters 33 | self.output_activation = output_activation 34 | self.name = name 35 | 36 | def original(self, input_shape): 37 | x_input = Input(input_shape, name="original") 38 | x = x_input 39 | for j, i in enumerate(self.hyperparameters['bin_array']): 40 | if i == 0: 41 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_%i" % (j + 1))(x) 42 | x = BatchNormalization()(x) 43 | x = Activation('relu')(x) 44 | else: 45 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 46 | x = Dense(1, activation=self.output_activation, name="Output_Layer")(x) 47 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 48 | return model 49 | 50 | def dense_dropout(self, input_shape): 51 | """This is the most simple neural architecture. 52 | Four dense layers, batch normalization, relu activation, and dropout. 53 | """ 54 | # The model input 55 | x_input = Input(input_shape, name='dense_dropout') 56 | x = x_input 57 | 58 | # Model happens here... 
59 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_1")(x) 60 | x = BatchNormalization()(x) 61 | x = Activation('relu')(x) 62 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 63 | 64 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_2")(x) 65 | x = BatchNormalization()(x) 66 | x = Activation('relu')(x) 67 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 68 | 69 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_3")(x) 70 | x = BatchNormalization()(x) 71 | x = Activation('relu')(x) 72 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 73 | 74 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_4")(x) 75 | x = BatchNormalization()(x) 76 | x = Activation('relu')(x) 77 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 78 | 79 | # output 80 | x = Dense(1, activation=self.output_activation, name="Output_Layer")(x) 81 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 82 | return model 83 | 84 | def wide_net(self, input_shape): 85 | """ 86 | A simple square model 87 | """ 88 | # The model input 89 | x_input = Input(input_shape, name='wide_net') 90 | x = x_input 91 | 92 | # The width coefficient 93 | width_coef = len(self.hyperparameters['bin_array'])//2 94 | 95 | for i, layer in enumerate(self.hyperparameters['bin_array']): 96 | if layer == 0: 97 | layer_name = "Hidden_Dense_Layer_" + str(i//2) 98 | x = Dense(self.hyperparameters['num_units'] * width_coef, name=layer_name)(x) 99 | x = Activation('relu')(x) 100 | else: 101 | x = BatchNormalization()(x) 102 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 103 | 104 | # output 105 | x = Dense(1, activation=self.output_activation, name="Output_Layer")(x) 106 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 107 | return model 108 | 109 | def shared_layer(self, input_shape): 110 | """This Model uses a shared layer""" 111 | # The model input 112 | x_input = Input(input_shape, name="shared_layer") 113 | 114 | # Here is a layer that will be shared 115 | shared_layer = Dense(input_shape[0], name="Shared_Hidden_Layer") 116 | 117 | # Apply the layer twice 118 | x = shared_layer(x_input) 119 | x = shared_layer(x) 120 | x = BatchNormalization()(x) 121 | x = Activation('relu')(x) 122 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 123 | 124 | # Apply dropout and normalization 125 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer")(x) 126 | x = BatchNormalization()(x) 127 | x = Activation('relu')(x) 128 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 129 | 130 | for i, layer in enumerate(self.hyperparameters['bin_array']): 131 | if layer == 0: 132 | layer_name = "Hidden_Layer_" + str(i) 133 | x = Dense(self.hyperparameters['num_units']//i, name=layer_name)(x) 134 | x = BatchNormalization()(x) 135 | x = Activation('relu')(x) 136 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 137 | 138 | # output 139 | x = Dense(1, activation=self.output_activation, name="output_layer")(x) 140 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 141 | return model 142 | 143 | @staticmethod 144 | def get_custom_objects(): 145 | return {"Lasso": Lasso} 146 | 147 | @staticmethod 148 | def get_available_modes(): 149 | modes = [] 150 | for attr in Models.__dict__: 151 | if attr[0] != '_' and attr != "get_custom_objects" and attr != "get_available_modes": 152 | modes.append(attr) 153 | return modes 154 | 155 | 156 | class TunerModel: 157 | def __init__(self, input_shape): 158 | self.input_shape = 
input_shape 159 | 160 | def build_tuner_model(self, hp): 161 | """ 162 | This method should be used with keras tuner. 163 | """ 164 | 165 | # Create the hyperparameters 166 | num_hidden_layers = hp.Int('hidden_layers', min_value=1, max_value=4, step=1) 167 | num_units = hp.Int("num_units", min_value=128, max_value=1024) 168 | dropout_rate = hp.Float("dropout_rate", min_value=0.0, max_value=0.8) 169 | learning_rate = hp.Float('learning_rate', min_value=0.00001, max_value=0.001) 170 | epsilon = hp.Float('epsilon', min_value=1e-07, max_value=1e-05) 171 | kernel_reg_func = [None, Lasso, l1, l2][hp.Choice("kernel_reg", values=[0, 1, 2, 3])] 172 | reg_amount = hp.Float("reg_amount", min_value=0.0, max_value=0.001, step=0.0001) 173 | 174 | # Determine how the layer(s) are shared 175 | share_layer = hp.Boolean("shared_layer") 176 | if share_layer: 177 | share_all = hp.Boolean("share_all") 178 | shared_layer_units = hp.Int("num_units", min_value=128, max_value=1024) 179 | shared_layer = Dense(shared_layer_units, name="shared_hidden_layer") 180 | if not share_all: 181 | where_to_share = set() 182 | layer_connections = hp.Int("num_shared_layer_connections", min_value=1, max_value=num_hidden_layers) 183 | for layer in range(layer_connections): 184 | where_to_share.add(hp.Int("where_to_share", min_value=0, max_value=num_hidden_layers, step=1)) 185 | 186 | # Build the model according to the hyperparameters 187 | inputs = Input(shape=self.input_shape, name="input") 188 | x = inputs 189 | 190 | # Determine number of hidden layers 191 | for layer_num in range(num_hidden_layers): 192 | # If we are not using a kernel regulation function or not... 193 | if kernel_reg_func is None: 194 | x = Dense(num_units, name="dense_" + str(layer_num))(x) 195 | else: 196 | x = Dense(num_units, kernel_regularizer=kernel_reg_func(reg_amount), name="dense_" + str(layer_num))(x) 197 | 198 | # If we are using a common shared layer, then connect it. 
199 | if (share_layer and share_all) or (share_layer and layer_num in where_to_share): 200 | x = BatchNormalization()(x) 201 | x = Activation('relu')(x) 202 | x = Dropout(dropout_rate)(x) 203 | x = shared_layer(x) 204 | 205 | # Apply these to every layer 206 | x = BatchNormalization()(x) 207 | x = Activation('relu')(x) 208 | x = Dropout(dropout_rate)(x) 209 | 210 | outputs = Dense(1, activation='sigmoid', name="output_layer")(x) 211 | model = Model(inputs=inputs, outputs=outputs) 212 | model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon), 213 | loss=keras.losses.BinaryCrossentropy(), 214 | metrics=['accuracy', "AUC", "Precision", "Recall", DDMetrics.scaled_performance]) 215 | 216 | print(model.summary()) 217 | return model 218 | 219 | 220 | -------------------------------------------------------------------------------- /phase_2-3/ML/Parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | Version 1.1.2 3 | 4 | Used to read .DDSS files which are useful for monitoring model progress 5 | """ 6 | import pandas as pd 7 | import numpy as np 8 | 9 | 10 | class Parser: 11 | 12 | @staticmethod 13 | def parse_ddss(path): 14 | 15 | architecture = {} 16 | hyperparameters = {} 17 | history = {} 18 | time = {} 19 | info = {'time': time, 'history': history, 'hyperparameters': hyperparameters, 'architecture': architecture} 20 | 21 | with open(path, 'r') as ddss_file: 22 | lines = ddss_file.readlines() 23 | lines.remove('\n') 24 | 25 | for i, line in enumerate(lines): 26 | line = line.strip('\n') 27 | 28 | # Get the model name 29 | if 'Model mode' in line: 30 | info['name'] = line.split()[-1] 31 | 32 | # Get the model timings 33 | if 'training_time' in line: 34 | split_line = line.split() 35 | time['training_time'] = float(split_line[-1]) 36 | if 'prediction_time' in line: 37 | split_line = line.split() 38 | time['prediction_time'] = float(split_line[-1]) 39 | 40 | # Get the history stats 41 | if 'History Stats' in line: 42 | # Grab everything under the history set 43 | for sub_line in lines[i + 1:]: # search the sub lines under history 44 | if '-' not in sub_line or 'Model has not been trained yet' in sub_line: 45 | break 46 | else: # Split up the lines and stores the values 47 | split_line = sub_line.split()[1:] 48 | history_key = split_line[0].replace(":", "") 49 | 50 | value = [] 51 | for v in split_line[1:]: 52 | value.append(float(v.strip(",").strip('[').strip(']'))) 53 | 54 | # If the list has one value, it should be closed to a scalar 55 | if len(value) == 1: 56 | history[history_key] = value[0] 57 | else: 58 | history[history_key] = value 59 | 60 | # Get the history stats 61 | if 'Hyperparameter Stats' in line: 62 | # search the sub lines under history 63 | for sub_line in lines[i + 1:]: 64 | if '-' not in sub_line or 'Model has not been trained yet' in sub_line: 65 | break 66 | else: 67 | sub_line = sub_line.strip(" - ").strip("\n").strip(" ").split(":") 68 | key = sub_line[0].strip(" ") 69 | value = sub_line[1].strip(" ") 70 | 71 | if '[' in value: 72 | value_list = [] 73 | for char in value: 74 | if char.isnumeric(): 75 | value_list.append(int(char)) 76 | value = value_list 77 | else: 78 | try: 79 | value = float(value) 80 | except ValueError: 81 | # If this value error occurs, it is because it has found the non-decimal 82 | # hyperparameters 83 | value = value 84 | 85 | hyperparameters[key] = value 86 | 87 | if 'total_params' in line or 'trainable_params' in line or 'total_params' in line: 88 | if "Cannot be 
determined" not in line: 89 | sub_line = line.strip(" - ").strip("\n").strip(" ").split(":") 90 | architecture[sub_line[0]] = int(sub_line[1].replace(",", "")) 91 | 92 | return info -------------------------------------------------------------------------------- /phase_2-3/ML/Tokenizer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Used during testing to encode smiles as features 3 | This is deprecated. 4 | """ 5 | 6 | from tensorflow.keras.preprocessing.text import Tokenizer 7 | from tensorflow.keras.preprocessing.sequence import pad_sequences 8 | import numpy as np 9 | 10 | 11 | class DDTokenizer: 12 | def __init__(self, num_words, oov_token=''): 13 | self.tokenizer = Tokenizer(num_words=num_words, 14 | oov_token=oov_token, 15 | filters='!"#$%&*+,-./:;<>?\\^_`{|}~\t\n', 16 | char_level=True, 17 | lower=False) 18 | self.has_trained = False 19 | 20 | self.pad_type = 'post' 21 | self.trunc_type = 'post' 22 | 23 | # The encoded data 24 | self.word_index = {} 25 | 26 | def fit(self, train_data): 27 | # Get max training sequence length 28 | print("Training Tokenizer...") 29 | self.tokenizer.fit_on_texts(train_data) 30 | self.has_trained = True 31 | print("Done training...") 32 | 33 | # Get our training data word index 34 | self.word_index = self.tokenizer.word_index 35 | 36 | def encode(self, data, use_padding=True, padding_size=None, normalize=False): 37 | # Encode training data sentences into sequences 38 | train_sequences = self.tokenizer.texts_to_sequences(data) 39 | 40 | # Get max training sequence length if there is none passed 41 | if padding_size is None: 42 | maxlen = max([len(x) for x in train_sequences]) 43 | else: 44 | maxlen = padding_size 45 | 46 | if use_padding: 47 | train_sequences = pad_sequences(train_sequences, padding=self.pad_type, 48 | truncating=self.trunc_type, maxlen=maxlen) 49 | 50 | if normalize: 51 | train_sequences = np.multiply(1/len(self.tokenizer.word_index), train_sequences) 52 | 53 | return train_sequences 54 | 55 | def pad(self, data, padding_size=None): 56 | # Get max training sequence length if there is none passed 57 | if padding_size is None: 58 | padding_size = max([len(x) for x in data]) 59 | 60 | padded_sequence = pad_sequences(data, padding=self.pad_type, 61 | truncating=self.trunc_type, maxlen=padding_size) 62 | 63 | return padded_sequence 64 | 65 | def decode(self, array): 66 | assert self.has_trained, "Train this tokenizer before decoding a string." 67 | return self.tokenizer.sequences_to_texts(array) 68 | 69 | def test(self, string): 70 | encoded = list(self.encode(string)[0]) 71 | decoded = self.decode(self.encode(string)) 72 | 73 | print("\nEncoding:") 74 | print("{original} -> {encoded}".format(original=string[0], encoded=encoded)) 75 | print("\nDecoding:") 76 | print("{original} -> {encoded}".format(original=encoded, encoded=decoded[0].replace(" ", ""))) 77 | 78 | def get_info(self): 79 | return self.tokenizer.index_word 80 | 81 | -------------------------------------------------------------------------------- /phase_2-3/ML/lasso_regularizer.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from tensorflow.keras import backend as K 3 | from tensorflow.keras.regularizers import Regularizer 4 | 5 | 6 | class Lasso(Regularizer): 7 | """Regularizer for L21 regularization. 8 | # Arguments 9 | C: Float; L21 regularization factor. 
10 | """ 11 | 12 | def __init__(self, C=0.): 13 | self.C = K.cast_to_floatx(C) 14 | 15 | def __call__(self, x): 16 | const_coeff = np.sqrt(K.int_shape(x)[1]) 17 | return self.C*const_coeff*K.sum(K.sqrt(K.sum(K.square(x), axis=1))) 18 | 19 | def get_config(self): 20 | return {'C': float(self.C)} -------------------------------------------------------------------------------- /phase_2-3/Prediction_morgan_1024.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import glob 3 | import os 4 | import time 5 | import warnings 6 | import numpy as np 7 | import pandas as pd 8 | from ML.DDModel import DDModel 9 | 10 | try: 11 | import __builtin__ 12 | except ImportError: 13 | # Python 3 14 | import builtins as __builtin__ 15 | 16 | # For debugging purposes only: 17 | def print(*args, **kwargs): 18 | __builtin__.print('\t sampling: ', end="") 19 | return __builtin__.print(*args, **kwargs) 20 | 21 | warnings.filterwarnings('ignore') 22 | 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('-fn','--fn', required=True) 25 | parser.add_argument('-protein', '--protein', required=True) 26 | parser.add_argument('-it', '--it', required=True) 27 | parser.add_argument('-file_path', '--file_path', required=True) 28 | parser.add_argument('-mdd', '--morgan_directory', required=True) 29 | 30 | io_args = parser.parse_args() 31 | fn = io_args.fn 32 | protein = str(io_args.protein) 33 | it = int(io_args.it) 34 | file_path = io_args.file_path 35 | mdd = io_args.morgan_directory 36 | 37 | # This debug feature will allow for speedy testing 38 | DEBUG=False 39 | def prediction_morgan(fname, models, thresh): # TODO: improve runtime with parallelization across multiple nodes 40 | print("Starting Predictions...") 41 | t = time.time() 42 | per_time = 1000000 43 | n_features = 1024 44 | z_id = [] 45 | X_set = np.zeros([per_time, n_features]) 46 | total_passed = 0 47 | 48 | print("We are predicting from the file", fname, "located in", mdd) 49 | with open(mdd+'/'+fname,'r') as ref: 50 | no = 0 51 | for line in ref: 52 | tmp = line.rstrip().split(',') 53 | on_bit_vector = tmp[1:] 54 | z_id.append(tmp[0]) 55 | for elem in on_bit_vector: 56 | X_set[no,int(elem)] = 1 57 | no+=1 58 | if no == per_time: 59 | X_set = X_set[:no, :] 60 | pred = [] 61 | print("We are currently running line", line) 62 | print("(1) Predicting... Time elapsed:", time.time() - t, "seconds.") 63 | for model in models: 64 | pred.append(model.predict(X_set)) 65 | 66 | with open(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions/'+fname, 'a') as ref: 67 | for j in range(len(pred[0])): 68 | is_pass = 0 69 | for i,thr in enumerate(thresh): 70 | if float(pred[i][j])>thr: 71 | is_pass += 1 72 | if is_pass >= 1: 73 | total_passed += 1 74 | line = z_id[j]+','+str(float(pred[i][j]))+'\n' 75 | ref.write(line) 76 | X_set = np.zeros([per_time,n_features]) 77 | z_id = [] 78 | no = 0 79 | 80 | # With debug, we will only predict on 'per_time' molecules 81 | if DEBUG: 82 | break 83 | 84 | if no != 0: 85 | X_set = X_set[:no,:] 86 | pred = [] 87 | print("We are currently running line", line) 88 | print("(2) Predicting... 
Time elapsed:", time.time() - t, "seconds.") 89 | for model in models: 90 | pred.append(model.predict(X_set)) 91 | with open(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions/'+fname, 'a') as ref: 92 | for j in range(len(pred[0])): 93 | is_pass = 0 94 | for i,thr in enumerate(thresh): 95 | if float(pred[i][j])>thr: 96 | is_pass+=1 97 | if is_pass>=1: 98 | total_passed+=1 99 | line = z_id[j]+','+str(float(pred[i][j]))+'\n' 100 | ref.write(line) 101 | print("Prediction time:", time.time() - t) 102 | return total_passed 103 | 104 | 105 | try: 106 | os.mkdir(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions') 107 | except OSError: 108 | print(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions', "already exists") 109 | 110 | thresholds = pd.read_csv(file_path+'/iteration_'+str(it)+'/best_models/thresholds.txt', header=None) 111 | thresholds.columns = ['model_no', 'thresh', 'cutoff'] 112 | 113 | tr = [] 114 | models = [] 115 | for f in glob.glob(file_path+'/iteration_'+str(it)+'/best_models/model_*'): 116 | if "." not in f: # skipping over the .ddss & .csv files 117 | mn = int(f.split('/')[-1].split('_')[1]) 118 | tr.append(thresholds[thresholds.model_no == mn].thresh.iloc[0]) 119 | models.append(DDModel.load(file_path+'/iteration_'+str(it)+'/best_models/model_'+str(mn))) 120 | 121 | print("Number of models to predict:", len(models)) 122 | t = time.time() 123 | returned = prediction_morgan(fn, models, tr) 124 | print(time.time()-t) 125 | 126 | with open(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions/passed_file_ct.txt','a') as ref: 127 | ref.write(fn+','+str(returned)+'\n') 128 | -------------------------------------------------------------------------------- /phase_2-3/activation_script.sh: -------------------------------------------------------------------------------- 1 | echo Activating virtual environment 2 | source ~/.bashrc 3 | 4 | # Activate Your conda environment here: 5 | conda activate MyCondaEnvironment 6 | -------------------------------------------------------------------------------- /phase_2-3/hyperparameter_result_evaluation.py: -------------------------------------------------------------------------------- 1 | import builtins as __builtin__ 2 | import argparse 3 | 4 | import pandas as pd 5 | import numpy as np 6 | import glob 7 | import os 8 | from ML.DDModel import DDModel 9 | from sklearn.metrics import auc 10 | from sklearn.metrics import precision_recall_curve,roc_curve, precision_score, recall_score 11 | from shutil import copy2 12 | 13 | import warnings 14 | warnings.filterwarnings('ignore') 15 | 16 | 17 | # For debugging purposes only: 18 | def print(*args, **kwargs): 19 | __builtin__.print('\t eval_v2: ', end="") 20 | return __builtin__.print(*args, **kwargs) 21 | 22 | 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('-n_it','--n_iteration',required=True,help='Number of current iteration') 25 | parser.add_argument('-d_path','--data_path',required=True,help='Path to project folder, including prject folder name') 26 | parser.add_argument('-mdd','--morgan_directory',required=True,help='Path to Morgan fingerprint directory for the database') 27 | 28 | # adding parameter for where to save all the data to: 29 | parser.add_argument('-s_path', '--save_path', required=False, default=None) 30 | 31 | # allowing for variable number of molecules to train from: 32 | parser.add_argument('-n_mol', '--number_mol', required=False, default=3000000, help='Size of test/validation set to be used') 33 | parser.add_argument('-cont', 
'--continuous', required=False, action='store_true') # Using binary or continuous labels 34 | parser.add_argument('-smile', '--smiles', required=False, action='store_true') # Using smiles or morgan as or continuous labels 35 | 36 | io_args = parser.parse_args() 37 | n_iteration = int(io_args.n_iteration) 38 | mdd = io_args.morgan_directory 39 | num_molec = int(io_args.number_mol) 40 | 41 | DATA_PATH = io_args.data_path # Now == file_path/protein 42 | SAVE_PATH = io_args.save_path 43 | # if no save path is provided we just save it in the same location as the data 44 | if SAVE_PATH is None: SAVE_PATH = DATA_PATH 45 | 46 | 47 | print("Done importing.") 48 | 49 | # Gets the total number of molecules (SEE: simple_job_models.py, line 32) 50 | total_mols = pd.read_csv(mdd+'/Mol_ct_file.csv',header=None)[[0]].sum()[0]/1000000 51 | 52 | # reading in the file created in progressive_docking.py (line 456) 53 | hyperparameters = pd.read_csv(SAVE_PATH+'/iteration_'+str(n_iteration)+'/hyperparameter_morgan_with_freq_v3.csv',header=None) 54 | 55 | # theses are also declared in progressive_docking.py 56 | ### TODO: add these columns in progressive_docking.py as a header instead of declaring them here (Line 456) 57 | hyperparameters.columns = ['Model_no','Over_sampling','Batch_size','Learning_rate','N_layers','N_units','dropout', 58 | 'weight','cutoff','ROC_AUC','Pr_0_9','tot_left_0_9_mil','auc_te','pr_te','re_te','tot_left_0_9_mil_te','tot_positives'] 59 | 60 | # Converting total to per million 61 | hyperparameters.tot_left_0_9_mil = hyperparameters.tot_left_0_9_mil/1000000 62 | hyperparameters.tot_left_0_9_mil_te = hyperparameters.tot_left_0_9_mil_te/1000000 ## What are these? it is never used... 63 | 64 | hyperparameters['re_vl/re_pr'] = 0.9/hyperparameters.re_te # ratio of 90% recall and the recall of the test set 65 | 66 | print('hyp dataframe:', hyperparameters.head()) 67 | 68 | df_grouped_cf = hyperparameters.groupby('cutoff') # Groups them according to cutoff values for calculations 69 | 70 | cf_values = {} # Cutoff values (thresholds for validation set virtual hits) 71 | 72 | print('Got Hyperparams') 73 | # Looping through each group and printing mean and std for that particular cuttoff value 74 | for mini_df in df_grouped_cf: 75 | print(mini_df[0]) # the cutoff value for the group 76 | print(mini_df[1]['re_vl/re_pr'].mean()) 77 | print(mini_df[1]['re_vl/re_pr'].std()) 78 | cf_values[mini_df[0]] = mini_df[1]['re_vl/re_pr'].std() 79 | 80 | print('cf_values:', cf_values) 81 | 82 | model_to_use_with_cf = [] 83 | ind_pr = [] 84 | for cf in cf_values: 85 | models = hyperparameters[hyperparameters.cutoff == cf] # gets all models matching that cutoff 86 | n_models = len(models) 87 | thr = 0.9 # The recall for true positives ### TODO: make this a input for the script 88 | 89 | # Decreasing the threshold until a quarter of the models have recall values greater than it 90 | while len(models[models.re_te >= thr]) <= n_models//4: 91 | thr -= 0.01 92 | 93 | models = models[models.re_te >= thr] 94 | models = models.sort_values('pr_te', ascending=False) # Sorting in descending order 95 | 96 | if cf_values[cf] < 0.01: # Checks to see if std is less than 0.01 97 | model_to_use_with_cf.append([cf, models.Model_no[:1].values]) 98 | ind_pr.append([cf, models.pr_te[:1].values]) 99 | else: 100 | # When std is high we use 3 models to get a better idea of its perfomance as a whole 101 | model_to_use_with_cf.append([cf, models.Model_no[:3].values]) 102 | ind_pr.append([cf, models.pr_te[:3].values]) 103 | 104 | 
print(model_to_use_with_cf) # [ [cf_1, [model_no_1, ...]], [cf_2, [model_no_1, ...]] ... ] 105 | print(ind_pr) # printed for viewing progress? 106 | 107 | 108 | def get_all_x_data(morgan_path, ID_labels): # ID_labels is a dataframe containing the zincIDs and their corresponding labels. 109 | train_set = np.zeros([num_molec,1024], dtype=bool) # using bool to save space 110 | train_id = [] 111 | 112 | print('x data from:', morgan_path) 113 | with open(morgan_path,'r') as ref: 114 | line_no=0 115 | for line in ref: 116 | mol_info=line.rstrip().split(',') 117 | train_id.append(mol_info[0]) 118 | 119 | # "Decompressing" the information from the file about where the 1s are on the 1024 bit vector. 120 | bit_indicies = mol_info[1:] # array of indexes of the binary 1s in the 1024 bit vector representing the morgan fingerprint 121 | for elem in bit_indicies: 122 | train_set[line_no,int(elem)] = 1 123 | 124 | line_no+=1 125 | 126 | train_set = train_set[:line_no,:] 127 | 128 | print('Done...') 129 | train_pd = pd.DataFrame(data=train_set, dtype=np.uint8) 130 | train_pd['ZINC_ID'] = train_id 131 | 132 | score_col = ID_labels.columns.difference(['ZINC_ID'])[0] 133 | train_data = pd.merge(ID_labels, train_pd, how='inner',on=['ZINC_ID']) 134 | X_data = train_data[train_data.columns.difference(['ZINC_ID', score_col])].values # input 135 | y_data = train_data[[score_col]].values # labels 136 | return X_data, y_data 137 | 138 | 139 | # This function gets all the zinc ids their corresponding labels into a pd.dataframe 140 | def get_zinc_and_labels(zinc_path, labels_path): 141 | ids = [] 142 | with open(zinc_path,'r') as ref: 143 | for line in ref: 144 | ids.append(line.split(',')[0]) 145 | zincIDs = pd.DataFrame(ids, columns=['ZINC_ID']) 146 | 147 | labels_df = pd.read_csv(labels_path, header=0) 148 | combined_df = pd.merge(labels_df, zincIDs, how='inner', on=['ZINC_ID']) 149 | return combined_df.set_index('ZINC_ID') 150 | 151 | 152 | main_path = DATA_PATH+'/iteration_1' 153 | print('Geting zinc ids and labels') 154 | # Creating the test, and validation data from the first iteration: 155 | zinc_labels_valid = get_zinc_and_labels(main_path + '/morgan/valid_morgan_1024_updated.csv', main_path +'/validation_labels.txt') 156 | zinc_labels_test = get_zinc_and_labels(main_path + '/morgan/test_morgan_1024_updated.csv', main_path +'/testing_labels.txt') 157 | 158 | print('Generating test, and valid data') 159 | # Getting the x data from the zinc ids (x=input, y=labels) 160 | # decompresses the indexes to 1024 bit vector 161 | X_valid, y_valid = get_all_x_data(main_path +'/morgan/valid_morgan_1024_updated.csv', zinc_labels_valid) 162 | X_test, y_test = get_all_x_data(main_path +'/morgan/test_morgan_1024_updated.csv', zinc_labels_test) 163 | 164 | print('Reducing models,', len(model_to_use_with_cf)) 165 | for i in range(len(model_to_use_with_cf)): 166 | cf = model_to_use_with_cf[i][0] 167 | n_good_mol = len([x for x in y_valid if x < cf]) # num of molec that are under the cutoff value 168 | print('\t',cf,'cf:',len(model_to_use_with_cf[i])) 169 | 170 | # If not enough molecules exceed the cutoff then we only save one of the models and ignore the rest ## Why is this done? 171 | if n_good_mol <= 10000: 172 | model_to_use_with_cf[i][1] = [model_to_use_with_cf[i][1][0]] # Still maintaing the format: [ [cf_1, [model_no_1, ...]], [cf_2, [model_no_1, ...]] ... 
] 173 | 174 | 175 | cf_with_left = {} 176 | main_thresholds = {} 177 | all_sc = {} 178 | path_to_model = SAVE_PATH+'/iteration_'+str(n_iteration)+'/all_models/' 179 | 180 | print('Model_to_use_with_cf:', model_to_use_with_cf) 181 | for i in range(len(model_to_use_with_cf)): 182 | cf = model_to_use_with_cf[i][0] 183 | 184 | # y_train 0.90)[0][-1]]) 211 | main_thresholds[cf] = tr 212 | 213 | print('tr:', tr) 214 | 215 | prediction_test = [] 216 | for model in models: 217 | model_pred = model.predict(X_test) 218 | if model.output_activation == 'linear': 219 | # Converting back to binary values to get stats 220 | model_pred = model_pred < cf 221 | prediction_test.append(model_pred) 222 | 223 | # Calculating the average prediction across the consensus of models. #TODO: change this when dealing with continuous 224 | avg_pred = np.zeros([len(prediction_test[0]),]) 225 | for i in range(len(prediction_test)): 226 | # determining if the model prediction exceeds the threshold and adding the result (1 or 0) 227 | avg_pred += (prediction_test[i] >= tr[i]).reshape(-1,) 228 | 229 | print('avg_pred:', avg_pred) 230 | avg_pred = avg_pred > (len(models)//2) # if greater than 50% of the models agree there would be a hit then that is our consensus value 231 | print('avg_pred:', avg_pred) 232 | 233 | if len(models) > 1: 234 | fpr_te_avg, tpr_te_avg, thresh_te_avg = roc_curve(y_test_cf, avg_pred) 235 | else: 236 | fpr_te_avg, tpr_te_avg, thresh_te_avg = roc_curve(y_test_cf, prediction_test[0]) 237 | 238 | pr_te_avg = precision_score(y_test_cf, avg_pred) 239 | re_te_avg = recall_score(y_test_cf, avg_pred) # TODO: make sure avg_pred is calc properly 240 | 241 | auc_te_avg = auc(fpr_te_avg,tpr_te_avg) 242 | pos_ct_orig = np.sum(y_test_cf) 243 | t_train_mol = len(y_test) 244 | 245 | print(cf,re_te_avg,pr_te_avg,auc_te_avg,pos_ct_orig/t_train_mol) 246 | 247 | total_left_te = re_te_avg*pos_ct_orig/pr_te_avg*total_mols*1000000/t_train_mol # molecules left 248 | all_sc[cf] = [cf,re_te_avg,pr_te_avg,auc_te_avg,total_left_te] 249 | cf_with_left[cf] = total_left_te 250 | 251 | print(cf_with_left) 252 | 253 | min_left_cf = total_mols*1000000 254 | cf_to_use = 0 255 | # cf_to_use is the cf for the lowest num molec 256 | for key in cf_with_left: 257 | if cf_with_left[key] <= min_left_cf: 258 | min_left_cf = cf_with_left[key] 259 | cf_to_use = key 260 | 261 | print('cf_to_use:', cf_to_use) 262 | print('min_left_cf:', min_left_cf) 263 | print('all_sc:', all_sc) 264 | 265 | with open(SAVE_PATH+'/iteration_'+str(n_iteration)+'/best_model_stats.txt', 'w') as ref: 266 | cf,re,pr,auc,tot_le = all_sc[cf_to_use] 267 | m_string = "* Best Model Stats * \n" 268 | m_string += "_" * 20 + "\n" 269 | m_string += "- Model Cutoff: " + str(cf) + "\n" 270 | m_string += "- Model Precision: " + str(pr) + "\n" 271 | m_string += "- Model Recall: " + str(re) + "\n" 272 | m_string += "- Model Auc: " + str(auc) + "\n" 273 | m_string += "- Total Left Testing: " + str(tot_le) + "\n" 274 | 275 | ref.write(m_string) 276 | 277 | try: 278 | os.mkdir(SAVE_PATH+'/iteration_'+str(n_iteration)+'/best_models') 279 | except OSError: # catching file exists error 280 | pass 281 | 282 | for models_cf in model_to_use_with_cf: # looping through the groups of models that match that cutoff 283 | if models_cf[0] == cf_to_use: 284 | count = 0 285 | # Looping through all the models in that group 286 | for mod_no in models_cf[-1]: 287 | with open(SAVE_PATH+'/iteration_'+str(n_iteration)+'/best_models/thresholds.txt', 'w') as ref: 288 | 
ref.write(str(mod_no)+','+str(main_thresholds[cf_to_use][count])+','+str(cf_to_use)+'\n') 289 | 290 | # Copying the specific models by their model_no to the best_models folder 291 | copy2(path_to_model+'/model_'+str(mod_no), SAVE_PATH+'/iteration_'+str(n_iteration) + '/best_models/') 292 | copy2(path_to_model+'/model_'+str(mod_no) + ".ddss", SAVE_PATH+'/iteration_'+str(n_iteration) + '/best_models/') 293 | 294 | count += 1 295 | 296 | 297 | # Deleting all other models that are not in the model_to_use array 298 | # placed at the bottom so we don't lose all our data each time we run phase 5 and it fails. 299 | for f in glob.glob(SAVE_PATH+'/iteration_'+str(n_iteration)+'/all_models/*'): 300 | try: 301 | mn = int(f.split('/')[-1].split('_')[1]) 302 | except: 303 | mn = int(f.split('/')[-1].split('_')[1].split('.')[0]) 304 | found = False 305 | for models in model_to_use_with_cf: 306 | if mn in models[-1]: 307 | found = True 308 | break 309 | if not found and "." in f: 310 | os.remove(f) 311 | -------------------------------------------------------------------------------- /phase_2-3/progressive_docking.py: -------------------------------------------------------------------------------- 1 | """ 2 | V 2.2.1 3 | """ 4 | 5 | import argparse 6 | import glob 7 | import gc 8 | import os 9 | import random 10 | import sys 11 | import time 12 | 13 | import numpy as np 14 | import pandas as pd 15 | from sklearn.metrics import auc 16 | from sklearn.metrics import precision_recall_curve, roc_curve 17 | from tensorflow.keras.callbacks import Callback 18 | from tensorflow.keras.callbacks import EarlyStopping 19 | from ML.DDModel import DDModel 20 | from ML.DDCallbacks import DDLogger 21 | 22 | START_TIME = time.time() 23 | print("Parsing args...") 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('-num_units','--nu',required=True) 26 | parser.add_argument('-dropout','--df',required=True) 27 | parser.add_argument('-learn_rate','--lr',required=True) 28 | parser.add_argument('-bin_array','--ba',required=True) 29 | parser.add_argument('-wt','--wt',required=True) 30 | parser.add_argument('-cf','--cf',required=True) 31 | parser.add_argument('-rec','--rec',required=True) 32 | parser.add_argument('-n_it','--n_it',required=True) 33 | parser.add_argument('-t_mol','--t_mol',required=True) 34 | parser.add_argument('-bs','--bs',required=True) 35 | parser.add_argument('-os','--os',required=True) 36 | parser.add_argument('-d_path','--data_path',required=True) # new! 37 | 38 | # adding parameter for where to save all the data to: 39 | parser.add_argument('-s_path', '--save_path', required=False, default=None) 40 | 41 | # allowing for variable number of molecules to validate and test from: 42 | parser.add_argument('-n_mol', '--number_mol', required=False, default=1000000) 43 | parser.add_argument('-t_n_mol', '--train_num_mol', required=False, default=-1) 44 | parser.add_argument('-cont', '--continuous', required=False, action='store_true') # Using binary or continuous labels 45 | parser.add_argument('-smile', '--smiles', required=False, action='store_true') # Using smiles or morgan as or continuous labels 46 | parser.add_argument('-norm', '--normalization', required=False, action='store_false') # if continuous labels are used -> normalize them? 
47 | 48 | io_args = parser.parse_args() 49 | 50 | print(sys.argv) 51 | 52 | nu = int(io_args.nu) 53 | df = float(io_args.df) 54 | lr = float(io_args.lr) 55 | ba = int(io_args.ba) 56 | wt = float(io_args.wt) 57 | cf = float(io_args.cf) 58 | rec = float(io_args.rec) 59 | n_it = int(io_args.n_it) 60 | bs = int(io_args.bs) 61 | oss = int(io_args.os) 62 | t_mol = float(io_args.t_mol) 63 | 64 | CONTINUOUS = io_args.continuous 65 | NORMALIZE = io_args.normalization 66 | SMILES = io_args.smiles 67 | TRAINING_SIZE = int(io_args.train_num_mol) 68 | num_molec = int(io_args.number_mol) 69 | 70 | DATA_PATH = io_args.data_path # Now == file_path/protein 71 | SAVE_PATH = io_args.save_path 72 | # if no save path is provided we just save it in the same location as the data 73 | if SAVE_PATH is None: SAVE_PATH = DATA_PATH 74 | 75 | 76 | print(nu,df,lr,ba,wt,cf,bs,oss,DATA_PATH) 77 | if TRAINING_SIZE == -1: print("Training size not specified, using entire dataset...") 78 | print("Finished parsing args...") 79 | 80 | 81 | def encode_smiles(series): 82 | print("Encoding smiles") 83 | # parameter is a pd.series with ZINC_IDs as the indicies and smiles as the elements 84 | encoded_smiles = DDModel.process_smiles(series.values, 100, fit_range=100, use_padding=True, normalize=True) 85 | encoded_dict = dict(zip(series.keys(), encoded_smiles)) 86 | # returns a dict array of the smiles. 87 | return encoded_dict 88 | 89 | 90 | def get_oversampled_smiles(Oversampled_zid, smiles_series): 91 | # Must return a dictionary where the keys are the zids and the items are 92 | # numpy ndarrys with n numbers of the same encoded smile 93 | # the n comes from the number of times that particular zid was chosen at random. 94 | oversampled_smiles = {} 95 | encoded_smiles = encode_smiles(smiles_series) 96 | 97 | for key in Oversampled_zid.keys(): 98 | smile = encoded_smiles[key] 99 | oversampled_smiles[key] = np.repeat([smile], Oversampled_zid[key], axis=0) 100 | return oversampled_smiles 101 | 102 | 103 | def get_oversampled_morgan(Oversampled_zid, fname): 104 | print('x data from:', fname) 105 | # Gets only the morgan fingerprints of those randomly selected zinc ids 106 | with open(fname,'r') as ref: 107 | for line in ref: 108 | tmp=line.rstrip().split(',') 109 | 110 | # only extracting those that were randomly selected 111 | if (tmp[0] in Oversampled_zid.keys()) and (type(Oversampled_zid[tmp[0]]) != np.ndarray): 112 | train_set = np.zeros([1,1024], dtype=np.bool) 113 | on_bit_vector = tmp[1:] 114 | 115 | for elem in on_bit_vector: 116 | train_set[0,int(elem)] = 1 117 | 118 | # creates a n x 1024 numpy ndarray where n is the number of times that zinc id was randomly selected 119 | Oversampled_zid[tmp[0]] = np.repeat(train_set, Oversampled_zid[tmp[0]], axis=0) 120 | return Oversampled_zid 121 | 122 | 123 | def get_morgan_and_scores(morgan_path, ID_labels): 124 | # ID_labels is a dataframe containing the zincIDs and their corresponding scores. 125 | train_set = np.zeros([num_molec,1024], dtype=np.bool) # using bool to save space 126 | train_id = [] 127 | print('x data from:', morgan_path) 128 | with open(morgan_path,'r') as ref: 129 | line_no=0 130 | for line in ref: 131 | if line_no >= num_molec: 132 | break 133 | 134 | mol_info=line.rstrip().split(',') 135 | train_id.append(mol_info[0]) 136 | 137 | # "Decompressing" the information from the file about where the 1s are on the 1024 bit vector. 
138 | bit_indicies = mol_info[1:] # array of indexes of the binary 1s in the 1024 bit vector representing the morgan fingerprint 139 | for elem in bit_indicies: 140 | train_set[line_no,int(elem)] = 1 141 | 142 | line_no+=1 143 | 144 | train_set = train_set[:line_no,:] 145 | 146 | print('Done...') 147 | train_pd = pd.DataFrame(data=train_set, dtype=np.bool) 148 | train_pd['ZINC_ID'] = train_id 149 | 150 | ID_labels = ID_labels.to_frame() 151 | print(ID_labels.columns) 152 | score_col = ID_labels.columns.difference(['ZINC_ID'])[0] 153 | print(score_col) 154 | 155 | train_data = pd.merge(ID_labels, train_pd, how='inner',on=['ZINC_ID']) 156 | X_train = train_data[train_data.columns.difference(['ZINC_ID', score_col])].values # input 157 | y_train = train_data[[score_col]].values # labels 158 | return X_train, y_train 159 | 160 | 161 | # Gets the labels data 162 | def get_data(smiles_path, morgan_path, labels_path): 163 | # Loading the docking scores (with corresponding Zinc_IDs) 164 | labels = pd.read_csv(labels_path, sep=',', header=0) 165 | 166 | # Merging and setting index to the ID if smiles flag is set 167 | if SMILES: 168 | smiles = pd.read_csv(smiles_path, sep=' ', names=['smile', 'ZINC_ID']) 169 | data = smiles.merge(labels, on='ZINC_ID') 170 | else: 171 | morgan = pd.read_csv(morgan_path, usecols=[0], header=None, names=['ZINC_ID']) # reading in only the zinc ids 172 | data = morgan.merge(labels, on='ZINC_ID') 173 | data.set_index('ZINC_ID', inplace=True) 174 | return data 175 | 176 | 177 | n_iteration = n_it 178 | total_mols = t_mol 179 | 180 | try: 181 | os.mkdir(SAVE_PATH + '/iteration_'+str(n_iteration)+'/all_models') 182 | except OSError: 183 | pass 184 | 185 | 186 | # Getting data from prev iterations and this iteration 187 | data_from_prev = pd.DataFrame() 188 | train_data = pd.DataFrame() 189 | test_data = pd.DataFrame() 190 | valid_data = pd.DataFrame() 191 | y_valid_first = pd.DataFrame() 192 | y_test_first = pd.DataFrame() 193 | for i in range(1, n_iteration+1): 194 | 195 | # getting all the data 196 | print("\nGetting data from iteration", i) 197 | smiles_path = DATA_PATH + '/iteration_'+str(i)+'/smile/{}_smiles_final_updated.smi' 198 | morgan_path = DATA_PATH + '/iteration_'+str(i)+'/morgan/{}_morgan_1024_updated.csv' 199 | labels_path = DATA_PATH + '/iteration_'+str(i)+'/{}_labels.txt' 200 | 201 | # Resulting dataframe will have cols of smiles (if selected) and docking scores with an index of Zinc IDs 202 | train_data = get_data(smiles_path.format('train'), morgan_path.format('train'), labels_path.format('training')) 203 | test_data = get_data(smiles_path.format('test'), morgan_path.format('test'), labels_path.format('testing')) 204 | valid_data = get_data(smiles_path.format('valid'), morgan_path.format('valid'), labels_path.format('validation')) 205 | 206 | print("Data acquired...") 207 | print("Train shape:", train_data.shape, "Valid shape:", valid_data.shape, "Test shape:", test_data.shape) 208 | 209 | if i == 1: # for the first iteration we only add the training data 210 | # because test and valid from this iteration is used by all subsequent iterations (constant dataset). 
211 | 212 | y_test_first = test_data # test and valid should be seperate from training dataset 213 | y_valid_first = valid_data 214 | y_old = train_data 215 | elif i == n_iteration: break 216 | else: 217 | y_old = pd.concat([train_data, valid_data, test_data], axis=0) 218 | 219 | data_from_prev = pd.concat([y_old, data_from_prev], axis=0) 220 | 221 | print("Data Augmentation iteration {} data shape: {}".format(i, data_from_prev.shape)) 222 | 223 | # Always using the same valid and test dataset across all iterations: 224 | if n_iteration != 1: 225 | train_data = pd.concat([train_data, test_data, valid_data], axis=0) # These datasets are from the current iteration. 226 | train_data = pd.concat([train_data, data_from_prev]) # combining all the datasets into a single training set for iterations after the first 227 | 228 | print("Training labels shape: ", train_data.shape) 229 | 230 | # Exiting if there are not enough hits 231 | if (y_valid_first.r_i_docking_score < cf).values.sum() <= 10 or \ 232 | (y_test_first.r_i_docking_score < cf).values.sum() <= 10: 233 | print("There are not enough hits... exiting.") 234 | sys.exit() 235 | 236 | 237 | if CONTINUOUS: 238 | print('Using continuous labels...') 239 | y_valid = valid_data.r_i_docking_score 240 | y_test = test_data.r_i_docking_score 241 | y_train = train_data.r_i_docking_score 242 | 243 | if NORMALIZE: 244 | print('Adding cutoff to be normalized') 245 | cutoff_ser = pd.Series([cf], index=['cutoff']) 246 | y_train = y_train.append(cutoff_ser) 247 | 248 | print("Normalizing docking scores...") 249 | # Normalize the docking scores 250 | y_valid = DDModel.normalize(y_valid) 251 | y_test = DDModel.normalize(y_test) 252 | y_train = DDModel.normalize(y_train) 253 | 254 | print('Extracting normalized cutoff...') 255 | cf_norm = y_train['cutoff'] 256 | y_train.drop(labels=['cutoff'], inplace=True) # removing it from the dataset 257 | 258 | cf_to_use = cf_norm 259 | else: 260 | cf_to_use = cf 261 | 262 | # Getting all the ids of hits and non hits. 263 | y_pos = y_train[y_train < cf_to_use] 264 | y_neg = y_train[y_train >= cf_to_use] 265 | 266 | else: 267 | print('Using binary labels...') 268 | # valid and testing data is from the first iteration. 269 | y_valid = y_valid_first.r_i_docking_score < cf 270 | y_test = y_test_first.r_i_docking_score < cf 271 | y_train = train_data.r_i_docking_score < cf 272 | 273 | # Getting all the ids of hits and non hits. 
274 | y_pos = y_train[y_train == 1] # true 275 | y_neg = y_train[y_train == 0] # false 276 | 277 | print('Converting y_pos and y_neg to dict (for faster access time)') 278 | y_pos = y_pos.to_dict() 279 | y_neg = y_neg.to_dict() 280 | 281 | num_neg = len(y_neg) 282 | num_pos = len(y_pos) 283 | 284 | sample_size = np.min([num_neg, num_pos*oss]) 285 | # //2 because we sample 1 from pos and 1 from neg: 286 | if TRAINING_SIZE != -1: sample_size = TRAINING_SIZE//2 287 | 288 | print("\nOversampling...", "size:", sample_size) 289 | print("\tNum pos: {} \n\tNum neg: {}".format(num_pos, num_neg)) 290 | Oversampled_zid = {} # Keeps track of how many times that zinc_id is randomly selected 291 | Oversampled_zid_y = {} 292 | 293 | pos_keys = list(y_pos.keys()) 294 | neg_keys = list(y_neg.keys()) 295 | 296 | for i in range(sample_size): 297 | # Randomly sampling equal number of hits and misses: 298 | idx = random.randint(0, num_pos-1) 299 | idx_neg = random.randint(0, num_neg-1) 300 | pos_zid = pos_keys[idx] 301 | neg_zid = neg_keys[idx_neg] 302 | 303 | # Adding both pos and neg to the dictionary 304 | try: 305 | Oversampled_zid[pos_zid] += 1 306 | except KeyError: 307 | Oversampled_zid[pos_zid] = 1 308 | Oversampled_zid_y[pos_zid] = y_pos[pos_zid] 309 | 310 | try: 311 | Oversampled_zid[neg_zid] += 1 312 | except KeyError: 313 | Oversampled_zid[neg_zid] = 1 314 | Oversampled_zid_y[neg_zid] = y_neg[neg_zid] 315 | 316 | # Getting the inputs 317 | if SMILES: 318 | print("Using smiles...") 319 | X_valid, y_valid = np.array(encode_smiles(valid_data.smile).values()), y_valid.to_numpy() 320 | X_test, y_test = np.array(encode_smiles(test_data).values()), y_test.to_numpy() 321 | 322 | # The training data needs to be oversampled: 323 | print("Getting oversampled smiles...") 324 | Oversampled_zid = get_oversampled_smiles(Oversampled_zid, train_data.smile) 325 | Oversampled_X_train = np.zeros([sample_size*2, len(list(Oversampled_zid.values())[0][0])], dtype=np.bool) 326 | print(len(list(Oversampled_zid.values())[0])) 327 | else: 328 | Oversampled_X_train = np.zeros([sample_size*2, 1024], dtype=np.bool) 329 | print('Using morgan fingerprints...') 330 | # this part is what gets the morgan fingerprints: 331 | print('looking through file path:', DATA_PATH + '/iteration_'+str(n_iteration)+'/morgan/*') 332 | for i in range(1, n_iteration+1): 333 | for f in glob.glob(DATA_PATH + '/iteration_'+str(i)+'/morgan/*'): 334 | set_name = f.split('/')[-1].split('_')[0] 335 | print('\t', set_name) 336 | # Valid and test datasets are always going to be from the first iteration. 
337 | if i == 1: 338 | if set_name == 'valid': 339 | X_valid, y_valid = get_morgan_and_scores(f, y_valid) 340 | elif set_name == 'test': 341 | X_test, y_test = get_morgan_and_scores(f, y_test) 342 | 343 | # Fills the dictionary with the actual morgan fingerprints 344 | Oversampled_zid = get_oversampled_morgan(Oversampled_zid, f) 345 | 346 | print("y validation shape:", y_valid.shape) 347 | 348 | ct = 0 349 | Oversampled_y_train = np.zeros([sample_size*2, 1]) 350 | print("oversampled sample:", list(Oversampled_zid.items())[0]) 351 | num_morgan_missing = 0 352 | for key in Oversampled_zid.keys(): 353 | try: 354 | tt = len(Oversampled_zid[key]) 355 | except TypeError as e: 356 | # print("Missing morgan fingerprint for this ZINC ID") 357 | # print(key, Oversampled_zid[key]) 358 | num_morgan_missing += 1 359 | continue # Skipping data that has no labels for it 360 | 361 | Oversampled_X_train[ct:ct+tt] = Oversampled_zid[key] # repeating the same data for as many times as it was selected 362 | Oversampled_y_train[ct:ct+tt] = Oversampled_zid_y[key] 363 | ct += tt 364 | 365 | print("Done oversampling, number of missing morgan fingerprints:", num_morgan_missing) 366 | 367 | class TimedStopping(Callback): 368 | ''' 369 | Stop training when enough time has passed. 370 | # Arguments 371 | seconds: maximum time before stopping. 372 | verbose: verbosity mode. 373 | ''' 374 | def __init__(self, seconds=None, verbose=1): 375 | super(Callback, self).__init__() 376 | 377 | self.start_time = 0 378 | self.seconds = seconds 379 | self.verbose = verbose 380 | 381 | def on_train_begin(self, logs={}): 382 | self.start_time = time.time() 383 | 384 | def on_epoch_end(self, epoch, logs={}): 385 | print('epoch done') 386 | if time.time() - self.start_time > self.seconds: 387 | self.model.stop_training = True 388 | if self.verbose: 389 | print('Stopping after %s seconds.' 
% self.seconds) 390 | #FREE MEMORY 391 | 392 | del data_from_prev 393 | del train_data 394 | del test_data 395 | del valid_data 396 | del Oversampled_zid 397 | del Oversampled_zid_y 398 | del y_valid_first 399 | del y_test_first 400 | gc.collect() 401 | 402 | #END FREE MEMORY 403 | 404 | print("Data prep time:", time.time() - START_TIME) 405 | print("Configuring model...") 406 | 407 | # This is our new model 408 | hyperparameters = {"bin_array": ba*[0,1], "dropout_rate": df, "learning_rate": lr, 409 | "num_units": nu, "batch_size": bs, "class_weight": wt, "epsilon": 1e-06} 410 | print("\n"+"-"*20) 411 | print("Training data info:" + "\n") 412 | print("X Data Shape[1:]", Oversampled_X_train.shape[1:]) 413 | print("X Data Shape", Oversampled_X_train.shape) 414 | print("X Data example", Oversampled_X_train[0]) 415 | print("Hyperparameters", hyperparameters) 416 | 417 | # TODO create a flag for optimizing the models 418 | if False: 419 | # progressive_docking = optimize(technique='bayesian') 420 | pass 421 | else: 422 | from ML.DDMetrics import * 423 | metrics = ['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()] 424 | progressive_docking = DDModel(mode='original', 425 | input_shape=Oversampled_X_train.shape[1:], 426 | hyperparameters=hyperparameters, 427 | metrics=metrics) 428 | progressive_docking.model.summary() 429 | 430 | # keeping track of what model number this currently is and saving 431 | try: 432 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/model_no.txt', 'r') as ref: 433 | mn = int(ref.readline().rstrip())+1 434 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/model_no.txt', 'w') as ref: 435 | ref.write(str(mn)) 436 | 437 | except IOError: # file doesnt exist yet 438 | mn = 1 439 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/model_no.txt', 'w') as ref: 440 | ref.write(str(mn)) 441 | 442 | num_epochs = 500 443 | cw = {0:1, 1:wt} 444 | es = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=0, mode='auto') 445 | es1 = TimedStopping(seconds=36000) # stop training after 10 hours 446 | logger = DDLogger( 447 | log_path=SAVE_PATH + "/iteration_" + str(n_iteration) + "/all_models/model_{}_train_log.csv".format(str(mn)), 448 | max_time=36000, 449 | max_epochs=num_epochs, 450 | monitoring='val_loss' 451 | ) 452 | delta_time = time.time() 453 | 454 | try: 455 | progressive_docking.save(SAVE_PATH + '/iteration_'+str(n_iteration)+'/all_models/model_'+str(mn)) 456 | except tensorflow.python.framework.errors_impl.FailedPreconditionError as e: 457 | print("Error occurred while saving:") 458 | print(" -", e) 459 | 460 | progressive_docking.fit(Oversampled_X_train, 461 | Oversampled_y_train, 462 | epochs=num_epochs, 463 | batch_size=bs, 464 | shuffle=True, 465 | class_weight=cw, 466 | verbose=1, 467 | validation_data=[X_valid, y_valid], 468 | callbacks=[es, es1, logger]) 469 | 470 | # Delta time records how long a model takes to train num_epochs amount of epochs 471 | delta_time = time.time()-delta_time 472 | print("Training Time:", delta_time) 473 | 474 | print("Saving the model...") 475 | progressive_docking.save(SAVE_PATH + '/iteration_'+str(n_iteration)+'/all_models/model_'+str(mn)) 476 | 477 | print('Predicting on validation data') 478 | prediction_valid = progressive_docking.predict(X_valid) 479 | 480 | print('Predicting on testing data') 481 | prediction_test = progressive_docking.predict(X_test) 482 | 483 | if CONTINUOUS: 484 | # Converting back to binary values to get stats 485 | y_valid = y_valid < cf_to_use 486 | 
prediction_valid = prediction_valid < cf_to_use 487 | y_test = y_test < cf_to_use 488 | prediction_test = prediction_test < cf_to_use 489 | 490 | print('Getting stats from predictions...') 491 | # Getting stats for validation 492 | precision_vl, recall_vl, thresholds_vl = precision_recall_curve(y_valid, prediction_valid) 493 | fpr_vl, tpr_vl, thresh_vl = roc_curve(y_valid, prediction_valid) 494 | auc_vl = auc(fpr_vl,tpr_vl) 495 | pr_vl = precision_vl[np.where(recall_vl>rec)[0][-1]] 496 | pos_ct_orig = np.sum(y_valid) 497 | Total_left = rec*pos_ct_orig/pr_vl*total_mols*1000000/len(y_valid) # extrapolates the expected number of predicted virtual hits from the validation set to the full database 498 | tr = thresholds_vl[np.where(recall_vl>rec)[0][-1]] 499 | 500 | # Getting stats for testing 501 | precision_te, recall_te, thresholds_te = precision_recall_curve(y_test,prediction_test) 502 | fpr_te, tpr_te, thresh_te = roc_curve(y_test, prediction_test) 503 | auc_te = auc(fpr_te,tpr_te) 504 | pr_te = precision_te[np.where(thresholds_te>tr)[0][0]] 505 | re_te = recall_te[np.where(thresholds_te>tr)[0][0]] 506 | pos_ct_orig = np.sum(y_test) 507 | Total_left_te = re_te*pos_ct_orig/pr_te*total_mols*1000000/len(y_test) 508 | 509 | 510 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/hyperparameter_morgan_with_freq_v3.csv','a') as ref: 511 | ref.write(str(mn)+','+str(oss)+','+str(bs)+','+str(lr)+','+str(ba)+','+str(nu)+','+str(df)+','+str(wt)+','+str(cf)+','+str(auc_vl)+','+str(pr_vl)+','+str(Total_left)+','+str(auc_te)+','+str(pr_te)+','+str(re_te)+','+str(Total_left_te)+','+str(pos_ct_orig)+'\n') 512 | 513 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/hyperparameter_morgan_with_freq_v3.txt','a') as ref: 514 | # The string of hyperparameter info that will be appended to the file ref 515 | hp = "\n" + "-" * 15 + "\n" + "Hyperparameters:" + "\n" 516 | hp += "- Model Number: " + str(mn) + "\n" 517 | hp += "- Training Time: " + str(round(delta_time, 3)) + "\n" 518 | hp += " - OS: " + str(oss) + "\n" 519 | hp += " - Batch Size: " + str(bs) + "\n" 520 | hp += " - Learning Rate: " + str(lr) + "\n" 521 | hp += " - Bin Array: " + str(ba) + "\n" 522 | hp += " - Num. 
Units: " + str(nu) + "\n" 523 | hp += " - Dropout Freq.: " + str(df) + "\n"*2 524 | 525 | hp += " - Class Weight Parameter wt: " + str(wt) + "\n" 526 | hp += " - cf: " + str(cf) + "\n" 527 | hp += " - auc vl: " + str(auc_vl) + "\n" 528 | hp += " - auc te: " + str(auc_te) + "\n" 529 | hp += " - Precision validation: " + str(pr_vl) + "\n" 530 | hp += " - Precision testing: " + str(pr_te) + "\n" 531 | hp += " - Recall testing: " + str(re_te) + "\n" 532 | hp += " - Pos ct orig: " + str(pos_ct_orig) + "\n" 533 | hp += " - Total Left: " + str(Total_left) + "\n" 534 | hp += " - Total Left testing: " + str(Total_left_te) + "\n\n" 535 | 536 | hp += "-" * 15 537 | ref.write(hp) 538 | 539 | print("Model number", mn, "complete.") 540 | -------------------------------------------------------------------------------- /phase_2-3/simple_job_models.py: -------------------------------------------------------------------------------- 1 | import builtins as __builtin__ 2 | import pandas as pd 3 | import numpy as np 4 | import argparse 5 | import glob 6 | import time 7 | import os 8 | 9 | try: 10 | import __builtin__ 11 | except ImportError: 12 | # Python 3 13 | import builtins as __builtin__ 14 | 15 | # For debugging purposes only: 16 | def print(*args, **kwargs): 17 | __builtin__.print('\t simple_jobs: ', end="") 18 | return __builtin__.print(*args, **kwargs) 19 | 20 | 21 | START_TIME = time.time() 22 | 23 | 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('-n_it','--iteration_no',required=True,help='Number of current iteration') 26 | parser.add_argument('-mdd','--morgan_directory',required=True,help='Path to the Morgan fingerprint directory for the database') 27 | parser.add_argument('-time','--time',required=True,help='Time limit for training') 28 | parser.add_argument('-file_path','--file_path',required=True,help='Path to the project directory, including project directory name') 29 | parser.add_argument('-nhp','--number_of_hyp',required=True,help='Number of hyperparameters') 30 | parser.add_argument('-titr','--total_iterations',required=True,help='Desired total number of iterations') 31 | 32 | parser.add_argument('-isl','--is_last',required=False, action='store_true',help='Flag indicating that this is the last iteration') 33 | 34 | # adding parameter for where to save all the data to: 35 | parser.add_argument('-save', '--save_path', required=False, default=None) 36 | 37 | # allowing for variable number of molecules to test and validate from: 38 | parser.add_argument('-n_mol', '--number_mol', required=False, default=1000000, help='Size of test/validation set to be used') 39 | 40 | parser.add_argument('-pfm', '--percent_first_mols', required=False, default=-1, help='Fraction of top scoring molecules to be considered as virtual hits in the first iteration (for standard DD run on 11 iterations, we recommend 0.01)') # these two inputs are fractions (e.g. 0.01), not percentages 41 | parser.add_argument('-plm', '--percent_last_mols', required=False, default=-1, help='Fraction of top scoring molecules to be considered as virtual hits in the last iteration (for standard DD run on 11 iterations, we recommend 0.0001)') 42 | 43 | 44 | # Pass the threshold 45 | parser.add_argument('-ct', required=False, default=0.9, help='Recall value in the [0,1] range (default 0.9)') 46 | 47 | # Flag for switching between functions that determine how many molecules are left at the end of the iteration 48 | # if neither is provided it defaults to a linear decrease 49 | funct_flags = parser.add_mutually_exclusive_group(required=False) 50 | funct_flags.add_argument('-expdec',
'--exponential_dec', required=False, default=-1) # the base of the exponential must be passed in 51 | funct_flags.add_argument('-polydec', '--polynomial_dec', required=False, default=-1) # the power of the polynomial must also be passed in for this flag 52 | 53 | io_args, extra_args = parser.parse_known_args() 54 | n_it = int(io_args.iteration_no) 55 | mdd = io_args.morgan_directory 56 | time_model = io_args.time 57 | nhp = int(io_args.number_of_hyp) 58 | isl = io_args.is_last 59 | titr = int(io_args.total_iterations) 60 | rec = float(io_args.ct) 61 | 62 | num_molec = int(io_args.number_mol) 63 | 64 | percent_first_mols = float(io_args.percent_first_mols) 65 | percent_last_mols = float(io_args.percent_last_mols) 66 | 67 | exponential_dec = int(io_args.exponential_dec) 68 | polynomial_dec = int(io_args.polynomial_dec) 69 | 70 | DATA_PATH = io_args.file_path # Now == file_path/protein 71 | SAVE_PATH = io_args.save_path 72 | # if no save path is provided we just save it in the same location as the data 73 | if SAVE_PATH is None: SAVE_PATH = DATA_PATH 74 | 75 | # sums the first column and divides it by 1 million (this is our total database size) 76 | t_mol = pd.read_csv(mdd+'/Mol_ct_file.csv',header=None)[[0]].sum()[0]/1000000 # Mol_ct_file.csv lists the number of compounds in each database file 77 | 78 | cummulative = 0.25*n_it 79 | num_units = [100, 1500,2000] 80 | dropout = [0.2, 0.5] 81 | learn_rate = [0.0001] 82 | bin_array = [2, 3] 83 | wt = [2, 3] 84 | if nhp < 144: 85 | bs = [256] 86 | else: 87 | bs = [128, 256] 88 | 89 | if nhp < 48: 90 | oss = [10] 91 | elif nhp < 72: 92 | oss = [5, 10] 93 | else: 94 | oss = [5, 10, 20] 95 | 96 | try: 97 | os.mkdir(SAVE_PATH+'/iteration_'+str(n_it)+'/simple_job') 98 | except OSError: # catching file exists error 99 | pass 100 | 101 | # Removing any job files left over from a previous run of this iteration 102 | for f in glob.glob(SAVE_PATH+'/iteration_'+str(n_it)+'/simple_job/*'): 103 | os.remove(f) 104 | 105 | scores_val = [] 106 | with open(DATA_PATH+'/iteration_'+str(1)+'/validation_labels.txt','r') as ref: 107 | ref.readline() # first line is ignored 108 | for line in ref: 109 | scores_val.append(float(line.rstrip().split(',')[0])) 110 | 111 | scores_val = np.array(scores_val) 112 | 113 | first_mols = int(100*t_mol/13) if percent_first_mols == -1.0 else int(percent_first_mols * len(scores_val)) 114 | last_mols = 100 if percent_last_mols == -1.0 else int(percent_last_mols * len(scores_val)) 115 | 116 | if n_it==1: 117 | # 'good_mol' is the number of top scoring molecules to save at the end of the iteration 118 | good_mol = first_mols 119 | else: 120 | if exponential_dec != -1: 121 | good_mol = int() #TODO: create functions for these 122 | elif polynomial_dec != -1: 123 | good_mol = int() 124 | else: 125 | good_mol = int(((last_mols-first_mols)*n_it + titr*first_mols-last_mols)/(titr-1)) # linear decrease as iterations increase (from first_mols at iteration 1 down to last_mols at the final iteration) 126 | 127 | print(isl) 128 | # If this is the last iteration then we save only 100 molecules 129 | if isl: 130 | # 100 mols corresponds to a fraction of 0.0001 (0.01%) of an initial 1 million input molecules 131 | good_mol = 100 if percent_last_mols == -1.0 else int(percent_last_mols * len(scores_val)) 132 | 133 | cf_start = np.mean(scores_val) # the mean of all the docking scores (labels) of the validation set 134 | t_good = len(scores_val) 135 | 136 | # we decrease the threshold value until the desired number of molecules is left 137 | while t_good > good_mol: 138 | cf_start -= 0.005 139 | t_good = len(scores_val[scores_val