├── README.md
├── final_phase
│   └── final_extraction.py
├── license.txt
├── phase_1
│   ├── Extracting_morgan.py
│   ├── Extracting_smiles.py
│   ├── molecular_file_count_updated.py
│   ├── sampling.py
│   └── sanity_check.py
├── phase_2-3
│   ├── Extract_labels.py
│   ├── ML
│   │   ├── DDCallbacks.py
│   │   ├── DDMetrics.py
│   │   ├── DDModel.py
│   │   ├── DDModelExceptions.py
│   │   ├── Models.py
│   │   ├── Parser.py
│   │   ├── Tokenizer.py
│   │   └── lasso_regularizer.py
│   ├── Prediction_morgan_1024.py
│   ├── activation_script.sh
│   ├── hyperparameter_result_evaluation.py
│   ├── progressive_docking.py
│   ├── simple_job_models.py
│   └── simple_job_predictions.py
└── utilities
    └── Morgan_fing.py
/README.md:
--------------------------------------------------------------------------------
1 | # Deep-Docking-NonAutomated
2 | 
3 | Deep docking (DD) is a deep learning-based tool developed to accelerate docking-based virtual screening. Using a docking program of choice, one can screen extensive chemical libraries like ZINC15 (containing > 1.3 billion molecules) 50 times faster than typical docking. For further details on the processes behind DD, please refer to our paper (https://doi.org/10.1021/acscentsci.0c00229). This repository provides all the scripts required to run the DD process, except ligand preparation and molecular docking (which can be done with many licensed and/or free programs).
4 | 
5 | This version of DD is non-automated and thus requires users to use their own docking programs and ligand preparation tools.
6 | 
7 | If you use DD, please cite:
8 | 
9 | Gentile F, Agrawal V, Hsing M, Ton A-T, Ban F, Norinder U, et al. *Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery.* ACS Cent Sci 2020:acscentsci.0c00229.
10 | 
11 | ## Requirements
12 | * rdkit
13 | * tensorflow >= 1.14.0 (1.15 GPU version recommended. If you are using CUDA 11, please use [nvidia-tensorflow](https://developer.nvidia.com/blog/accelerating-tensorflow-on-a100-gpus/))
14 | * pandas
15 | * numpy
16 | * keras
17 | * matplotlib
18 | * scikit-learn
19 | * Ligand preparation tool
20 | * Docking program
21 | 
22 | ## Help
23 | For help with the options for a specific script, type
24 | 
25 | ```
26 | python script.py -h
27 | ```
28 | ## Before Starting
29 | You need to fill in the `activation_script.sh` file in `phase_2-3` so that it can automatically activate your conda environment when training.
30 | 
31 | ## Preparing a database for Deep Docking
32 | Databases for DD should be in SMILES format. For each compound of the database, DD requires the Morgan fingerprint of radius 2 and size 1024 bits, represented as the list of indices of the bits that are set to 1.
33 | 
34 | First, it is recommended to split the SMILES into a number of evenly populated files to facilitate other steps such as random sampling and inference, and to place these files into a new folder. This reorganization can be achieved, for example, with the `split` command in bash, and the resulting files should have `.txt` extensions. For example, given a `smiles.smi` file with one billion compounds, to obtain 1000 evenly split `.txt` files of 1 million lines each we run:
35 | 
36 | ```bash
37 | split -d -l 1000000 smiles.smi smile_all_ --additional-suffix=.txt
38 | ```
39 | 
40 | Ideally, the number of split files should be equal to the number of CPUs used for random sampling (phase 1, see below), but it should always be larger than the number of GPUs used for inference (phase 3, see below).
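For reference, the short sketch below shows how a single compound maps to the fingerprint representation described above (radius-2 Morgan fingerprint, 1024 bits, stored as the list of indices of the bits set to 1). It is only illustrative: `Morgan_fing.py` (described next) is the script that actually generates the database files, and the exact line layout shown here (ID followed by comma-separated bit indices) is an assumption based on how the phase 1 scripts parse these files.

```python
# Illustrative sketch only -- not part of the DD scripts.
from rdkit import Chem
from rdkit.Chem import AllChem

smiles, mol_id = "CCO", "MOL0000001"  # hypothetical compound and ID
mol = Chem.MolFromSmiles(smiles)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)  # radius 2, 1024 bits
on_bits = list(fp.GetOnBits())  # indices of the bits that are set to 1
# One compound per line: the ID followed by its on-bit indices
print(mol_id + "," + ",".join(str(i) for i in on_bits))
```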
41 | 
42 | Morgan fingerprints can then be generated in the correct DD format using the `Morgan_fing.py` script (located in the `utilities` folder):
43 | 
44 | ```bash
45 | python Morgan_fing.py -sfp path_smile_folder -fp path_to_morgan_folder -fn name_morgan_folder -tp num_cpus
46 | ```
47 | This will create all the fingerprints and place them in `path_to_morgan_folder/name_morgan_folder`.
48 | 
49 | 
50 | ## Phase 1. Random sampling of molecules
51 | In phase 1, molecules are randomly sampled from the database to build/augment a training set. During the first iteration, this phase also samples molecules for the validation and test sets.
52 | 
53 | #### Running phase 1
54 | To run phase 1, run the following sequence of scripts to randomly sample the database, and to extract the Morgan fingerprints and SMILES of the sampled molecules:
55 | 
56 | ```bash
57 | python molecular_file_count_updated.py -pt project_name -it current_iteration -cdd left_mol_directory -t_pos num_cpus -t_samp molecules_to_dock
58 | python sampling.py -pt project_name -fp path_to_project_without_name -it current_iteration -dd left_mol_directory -t_pos total_processors -tr_sz train_size -vl_sz val_size
59 | python sanity_check.py -pt project_name -fp path_to_project_without_name -it current_iteration
60 | python Extracting_morgan.py -pt project_name -fp path_to_project_without_name -it current_iteration -md morgan_directory -t_pos total_processors
61 | python Extracting_smiles.py -pt project_name -fp path_to_project_without_name -it current_iteration -smd smile_directory -t_pos num_cpus
62 | ```
63 | 
64 | * `molecular_file_count_updated.py` determines the number of molecules to be sampled from each file of the database, according to the desired total number of molecules to sample. The per-file sample sizes are stored in the `Mol_ct_file_updated.csv` file created in the `left_mol_directory` directory (see the sketch at the end of this section).
65 | 
66 | * `sampling.py` randomly samples the desired number of molecules for the training, validation and testing sets (again, note that only during the first iteration do we generate the validation and testing sets).
67 | 
68 | * `sanity_check.py` removes overlaps between the sampled sets.
69 | 
70 | * `Extracting_morgan.py` and `Extracting_smiles.py` extract Morgan fingerprints and SMILES for the compounds that have been randomly sampled, and organize them in `morgan` and `smiles` folders inside the directory of the current iteration.
71 | 
72 | **IMPORTANT:** For `molecular_file_count_updated.py` AND `sampling.py`, the option `left_mol_directory` is the directory from which molecules are sampled. For iteration 1, `left_mol_directory` is the directory storing the Morgan fingerprints of the database; BUT for subsequent iterations it must be the path to the `morgan_1024_predictions` folder of the previous iteration.
73 | 
74 | For example, in iteration 2:
75 | 
76 | ```bash
77 | python molecular_file_count_updated.py -pt project_name -it current_iteration -cdd /path_to_project/project_name/iteration_1/morgan_1024_predictions -t_pos num_cpus -t_samp molecules_to_dock
78 | python sampling.py -pt project_name -fp path_to_project_without_name -it current_iteration -dd /path_to_project/project_name/iteration_1/morgan_1024_predictions -t_pos total_processors -tr_sz train_size -vl_sz val_size
79 | ```
80 | This will ensure that sampling is done progressively on better-scoring subsets of the database over the course of DD.
81 | 
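To make the bookkeeping of `molecular_file_count_updated.py` concrete, the sketch below reproduces the proportional allocation it writes to `Mol_ct_file_updated.csv`: each split file contributes to the requested sample in proportion to the number of molecules it contains. The file names and counts here are hypothetical.

```python
# Minimal sketch of the allocation stored in Mol_ct_file_updated.csv
# (hypothetical file names and molecule counts).
import pandas as pd

mol_ct = pd.DataFrame({
    "Number_of_Molecules": [1_000_000, 750_000, 500_000],
    "file_name": ["smile_all_000.txt", "smile_all_001.txt", "smile_all_002.txt"],
})
tot_sampling = 100_000  # -t_samp: total number of molecules to sample this iteration
total_available = mol_ct["Number_of_Molecules"].sum()
mol_ct["Sample_for_million"] = (
    tot_sampling / total_available * mol_ct["Number_of_Molecules"]
).astype(int)
print(mol_ct)  # sampling.py reads this table to decide how many molecules to draw per file
```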
82 | ### After phase 1. Docking
83 | After phase 1 is completed, the molecules grouped in the smiles folder need to be prepared and docked to the target. Use your favourite workflow for this step. It is important that the docking results are stored as SDF files in a *docked* folder in the current iteration directory, keeping the same naming convention as the files in the *smile* folder (names can be slightly changed, but the name of the set (e.g. validation, testing, training) should always be present in the name of the respective SDF file).
84 | 
85 | 
86 | ## Phase 2. Neural network training
87 | In phase 2, deep learning models are trained on the docking scores from the previous phase.
88 | 
89 | #### Running phase 2
90 | Again, we just need to run the following scripts in succession:
91 | ```bash
92 | python Extract_labels.py -if True/False -n_it current_iteration -protein project_name -file_path path_to_project_without_name -t_pos num_cpus -score score_keyword
93 | python simple_job_models.py -n_it current_iteration -mdd morgan_directory -time 00-04:00 -file_path project_path -nhp num_hyperparameters -titr total_iterations -n_mol num_molecules --percent_first_mols percent_first_molecules -ct recall_value --percent_last_mols percent_last_mols
94 | ```
95 | * `Extract_labels.py` extracts the docking scores and organizes them to be used for model training. It should generate three comma-separated files, `training_labels.txt`, `validation_labels.txt` and `testing_labels.txt`, inside the current iteration folder.
96 | 
97 | * `simple_job_models.py` creates the bash scripts to run model training using the `progressive_docking.py` script. These scripts are generated inside the `simple_job` folder in the current iteration. Note that if `-ct` is not specified, the recall value will be set to 0.9.
98 | 
99 | The bash scripts generated by `simple_job_models.py` should then be run on GPU nodes to train the DD models. The resulting models will be stored in the `all_models` folder in the current iteration.
100 | 
101 | 
102 | ## Phase 3. Selection of best model and prediction of the entire database
103 | In phase 3, the models from phase 2 are evaluated and the best-performing one is chosen for predicting the scores of all the molecules in the database. This step will create a `morgan_1024_predictions` subfolder which will contain all the molecules that are predicted as virtual hits in the current iteration.
104 | 
105 | #### Running phase 3
106 | To run phase 3, run:
107 | 
108 | ```bash
109 | python -u hyperparameter_result_evaluation.py -n_it current_iteration --data_path project_path -mdd morgan_directory -n_mol num_molecules
110 | python simple_job_predictions.py -protein project_name -file_path path_to_project_without_name -n_it current_iteration -mdd morgan_directory
111 | 
112 | ```
113 | 
114 | * `hyperparameter_result_evaluation.py` evaluates the models generated in phase 2 and selects the best (most precise) one.
115 | 
116 | * `simple_job_predictions.py` creates the bash scripts to run the predictions over the full database using the `Prediction_morgan_1024.py` script. These scripts will be stored in the `simple_job_predictions` folder of the current iteration.
117 | 
118 | The generated bash scripts can be run on GPU nodes to predict the virtual hits from the full database. The predicted compounds will be stored in the `morgan_1024_predictions` folder of the current iteration.
119 | 
120 | 
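Before moving to the final phase, it can be useful to check how many virtual hits were actually predicted in the last iteration, since this determines whether `-mols_to_dock` (next section) truncates the list at all. The following is a minimal sketch, assuming the `morgan_1024_predictions` files contain one predicted molecule per line (the layout read by `final_extraction.py` further below); the path is a placeholder.

```python
# Hypothetical helper -- not part of the DD scripts.
import glob
import os

pred_dir = "/path_to_project/project_name/iteration_N/morgan_1024_predictions"  # placeholder
n_hits = 0
for pred_file in glob.glob(os.path.join(pred_dir, "*")):
    with open(pred_file) as fh:
        n_hits += sum(1 for _ in fh)  # one predicted virtual hit per line
print("Predicted virtual hits in this iteration:", n_hits)
```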
121 | ## After Deep Docking. The final phase
122 | After the last iteration of DD is complete, the SMILES of all the predicted virtual hits, or of a ranked subset of them, can be obtained for the final docking. Ranking is based on the probability of being a virtual hit. Use the following script (available in `final_phase`).
123 | 
124 | ```bash
125 | python final_extraction.py -smile_dir path_to_smile_dir -prediction_dir path_to_predictions_last_iter -processors n_cpus -mols_to_dock num_molecules_to_dock
126 | ```
127 | 
128 | Executing this script will return the list of SMILES of all the predicted virtual hits of the last iteration, or of the top `num_molecules_to_dock` molecules ranked by their probabilities, whichever is smaller. The probabilities will also be returned in a separate file.
129 | 
--------------------------------------------------------------------------------
/final_phase/final_extraction.py:
--------------------------------------------------------------------------------
1 | from multiprocessing import Pool
2 | from contextlib import closing
3 | import multiprocessing
4 | import pandas as pd
5 | import argparse
6 | import glob
7 | import os
8 | 
9 | 
10 | def merge_on_smiles(pred_file):
11 |     print("Merging " + os.path.basename(pred_file) + "...")
12 | 
13 |     # Read the predictions
14 |     pred = pd.read_csv(pred_file, names=["id", "score"], index_col=0)
15 |     pred = pred.drop_duplicates()
16 | 
17 |     # Read the smiles
18 |     smile_file = os.path.join(args.smile_dir, os.path.basename(pred_file))
19 |     smi = pd.read_csv(smile_file, delimiter=" ", names=["smile", "id"], index_col=1)
20 |     smi = smi.drop_duplicates()
21 |     return pd.merge(pred, smi, how="inner", on=["id"])
22 | 
23 | 
24 | if __name__ == '__main__':
25 |     parser = argparse.ArgumentParser()
26 |     parser.add_argument("-smile_dir", required=True, help='Path to SMILES directory for the database')
27 |     parser.add_argument("-prediction_dir", required=True, help='Path to morgan_1024_predictions folder of last iteration')
28 |     parser.add_argument("-processors", required=True, help='Number of CPUs for multiprocessing')
29 |     parser.add_argument("-mols_to_dock", required=False, type=int, help='Desired number of molecules to dock')
30 | 
31 |     args = parser.parse_args()
32 |     predictions = []
33 | 
34 |     # Find all smile files
35 |     print("Morgan Dir: " + args.prediction_dir)
36 |     print("Smile Dir: " + args.smile_dir)
37 |     for file in glob.glob(args.prediction_dir + "/*"):
38 |         if "smile" in os.path.basename(file):
39 |             print(" - " + os.path.basename(file))
40 |             predictions.append(file)
41 | 
42 |     # Create a list of pandas dataframes
43 |     print("Finding smiles...")
44 |     print(int(args.processors), len(predictions))
45 |     print("Number of CPUs: " + str(multiprocessing.cpu_count()))
46 |     num_jobs = min(len(predictions), int(args.processors))
47 |     with closing(Pool(num_jobs)) as pool:
48 |         combined = pool.map(merge_on_smiles, predictions)
49 | 
50 |     # combine all dataframes
51 |     print("Combining " + str(len(combined)) + " dataframes...")
52 |     base = pd.concat(combined)
53 |     combined = None
54 | 
55 |     print("Done combining... Sorting!")
56 |     base = base.sort_values(by="score", ascending=False)
57 | 
58 |     print("Resetting Index...")
59 |     base.reset_index(inplace=True)
60 | 
61 |     print("Finished Sorting... 
Here is the base:") 62 | print(base.head()) 63 | 64 | if args.mols_to_dock is not None: 65 | mtd = args.mols_to_dock 66 | print("Molecules to dock:", mtd) 67 | print("Total molecules:", len(base)) 68 | 69 | if len(base) <= mtd: 70 | print("Our total molecules are less or equal than the number of molecules to dock -> saving all molecules") 71 | else: 72 | print(f"Our total molecules are more than the number of molecules to dock -> saving {mtd} molecules") 73 | base = base.head(mtd) 74 | 75 | print("Saving") 76 | # Rearrange the smiles 77 | smiles = base.drop('score', 1) 78 | smiles = smiles[["smile", "id"]] 79 | print("Here is the smiles:") 80 | print(smiles.head()) 81 | smiles.to_csv("smiles.csv", sep=" ") 82 | 83 | # Rearrange for id,score 84 | base.drop("smile", 1, inplace=True) 85 | base.to_csv("id_score.csv") 86 | print("Here are the ids and scores") 87 | print(base.head()) 88 | 89 | 90 | 91 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 fgentile, jamesgleave, jyaacoub 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /phase_1/Extracting_morgan.py: -------------------------------------------------------------------------------- 1 | # Reads the ids found in sampling and finds the corresponding morgan fingerprint 2 | import argparse 3 | import glob 4 | 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument('-pt', '--protein_name', required=True, help='Name of the DD project') 7 | parser.add_argument('-fp', '--file_path', required=True, help='Path to the project directory, excluding project directory name') 8 | parser.add_argument('-it', '--n_iteration', required=True, help='Number of current iteration') 9 | parser.add_argument('-md', '--morgan_directory', required=True, help='Path to directory containing Morgan fingerprints for the database') 10 | parser.add_argument('-t_pos', '--tot_process', required=True, help='Number of CPUs to use for multiprocessing') 11 | 12 | io_args = parser.parse_args() 13 | 14 | import os 15 | from multiprocessing import Pool 16 | import time 17 | from contextlib import closing 18 | import numpy as np 19 | 20 | protein = io_args.protein_name 21 | file_path = io_args.file_path 22 | n_it = int(io_args.n_iteration) 23 | morgan_directory = io_args.morgan_directory 24 | tot_process = int(io_args.tot_process) 25 | 26 | 27 | def extract_morgan(file_name): 28 | train = {} 29 | test = {} 30 | valid = {} 31 | with open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/train_set.txt", 'r') as ref: 32 | for line in ref: 33 | train[line.rstrip()] = 0 34 | with open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/valid_set.txt", 'r') as ref: 35 | for line in ref: 36 | valid[line.rstrip()] = 0 37 | with open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/test_set.txt", 'r') as ref: 38 | for line in ref: 39 | test[line.rstrip()] = 0 40 | 41 | # for file_name in file_names: 42 | ref1 = open( 43 | file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + 'train_' + file_name.split('/')[-1], 'w') 44 | ref2 = open( 45 | file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + 'valid_' + file_name.split('/')[-1], 'w') 46 | ref3 = open(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + 'test_' + file_name.split('/')[-1], 47 | 'w') 48 | 49 | with open(file_name, 'r') as ref: 50 | flag = 0 51 | for line in ref: 52 | tmpp = line.strip().split(',')[0] 53 | if tmpp in train.keys(): 54 | train[tmpp] += 1 55 | fn = 1 56 | if train[tmpp] == 1: flag = 1 57 | elif tmpp in valid.keys(): 58 | valid[tmpp] += 1 59 | fn = 2 60 | if valid[tmpp] == 1: flag = 1 61 | elif tmpp in test.keys(): 62 | test[tmpp] += 1 63 | fn = 3 64 | if test[tmpp] == 1: flag = 1 65 | if flag == 1: 66 | if fn == 1: 67 | ref1.write(line) 68 | if fn == 2: 69 | ref2.write(line) 70 | if fn == 3: 71 | ref3.write(line) 72 | flag = 0 73 | 74 | 75 | def alternate_concat(files): 76 | to_return = [] 77 | with open(files, 'r') as ref: 78 | for line in ref: 79 | to_return.append(line) 80 | return to_return 81 | 82 | 83 | def delete_all(files): 84 | os.remove(files) 85 | 86 | 87 | def morgan_duplicacy(f_name): 88 | flag = 0 89 | mol_list = {} 90 | ref1 = open(f_name[:-4] + '_updated.csv', 'a') 91 | with open(f_name, 'r') as ref: 92 | for line in ref: 93 | tmpp = line.strip().split(',')[0] 94 | if tmpp not in mol_list: 95 | mol_list[tmpp] = 1 96 | flag = 1 97 | if flag == 1: 98 | ref1.write(line) 99 | flag = 0 100 | os.remove(f_name) 101 | 102 | 103 | if __name__ == '__main__': 104 | try: 105 
| os.mkdir(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan') 106 | except: 107 | pass 108 | 109 | files = [] 110 | for f in glob.glob(morgan_directory + "/*.txt"): 111 | files.append(f) 112 | 113 | t = time.time() 114 | with closing(Pool(np.min([tot_process, len(files)]))) as pool: 115 | pool.map(extract_morgan, files) 116 | print(time.time() - t) 117 | 118 | all_to_delete = [] 119 | for type_to in ['train', 'valid', 'test']: 120 | t = time.time() 121 | files = [] 122 | for f in glob.glob(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + type_to + '*'): 123 | files.append(f) 124 | all_to_delete.append(f) 125 | print(len(files)) 126 | if len(files) == 0: 127 | print("Error in address above") 128 | break 129 | with closing(Pool(np.min([tot_process, len(files)]))) as pool: 130 | to_print = pool.map(alternate_concat, files) 131 | with open(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/' + type_to + '_morgan_1024.csv', 132 | 'w') as ref: 133 | for file_data in to_print: 134 | for line in file_data: 135 | ref.write(line) 136 | to_print = [] 137 | print(type_to, time.time() - t) 138 | 139 | f_names = [] 140 | for f in glob.glob(file_path + '/' + protein + '/iteration_' + str(n_it) + '/morgan/*morgan*'): 141 | f_names.append(f) 142 | 143 | t = time.time() 144 | with closing(Pool(np.min([tot_process, len(f_names)]))) as pool: 145 | pool.map(morgan_duplicacy, f_names) 146 | print(time.time() - t) 147 | 148 | with closing(Pool(np.min([tot_process, len(all_to_delete)]))) as pool: 149 | pool.map(delete_all, all_to_delete) 150 | -------------------------------------------------------------------------------- /phase_1/Extracting_smiles.py: -------------------------------------------------------------------------------- 1 | from multiprocessing import Pool 2 | from functools import partial 3 | from contextlib import closing 4 | import argparse 5 | import numpy as np 6 | import pickle 7 | import glob 8 | import time 9 | import os 10 | 11 | parser = argparse.ArgumentParser() 12 | parser.add_argument('-pt', '--project_name', required=True, help='Name of the DD project') 13 | parser.add_argument('-fp', '--file_path', required=True, help='Path to the project directory, excluding project directory name') 14 | parser.add_argument('-it', '--n_iteration', required=True, help='Number of current iteration') 15 | parser.add_argument('-smd', '--smile_directory', required=True, help='Path to SMILES directory of the database') 16 | parser.add_argument('-t_pos', '--tot_process', required=True, help='Number of CPUs to use for multiprocessing') 17 | 18 | io_args = parser.parse_args() 19 | protein = io_args.project_name 20 | file_path = io_args.file_path 21 | n_it = int(io_args.n_iteration) 22 | smile_directory = io_args.smile_directory 23 | tot_process = int(io_args.tot_process) 24 | 25 | # This path is used often so we declare it as a constant here. 26 | ITER_PATH = file_path + '/' + protein + '/iteration_' + str(n_it) 27 | 28 | def extract_smile(file_name, train, valid, test): 29 | # This function extracts the smiles from a file to write them to train, test, and valid files for model training. 
30 | ref1 = open(ITER_PATH + '/smile/' + 'train_' + file_name.split('/')[-1], 'w') 31 | ref2 = open(ITER_PATH + '/smile/' + 'valid_' + file_name.split('/')[-1], 'w') 32 | ref3 = open(ITER_PATH + '/smile/' + 'test_' + file_name.split('/')[-1], 'w') 33 | 34 | with open(file_name, 'r') as ref: 35 | ref.readline() 36 | for line in ref: 37 | tmpp = line.strip().split()[-1] 38 | if tmpp in train.keys(): 39 | train[tmpp] += 1 40 | if train[tmpp] == 1: ref1.write(line) 41 | 42 | elif tmpp in valid.keys(): 43 | valid[tmpp] += 1 44 | if valid[tmpp] == 1: ref2.write(line) 45 | 46 | elif tmpp in test.keys(): 47 | test[tmpp] += 1 48 | if test[tmpp] == 1: ref3.write(line) 49 | 50 | def alternate_concat(files): 51 | # Returns a list of the lines in a file 52 | with open(files, 'r') as ref: 53 | return ref.readlines() 54 | 55 | def smile_duplicacy(f_name): 56 | # removes duplicate molec from the file 57 | mol_list = {} # keeping track of which mol have been written 58 | ref1 = open(f_name[:-4] + '_updated.smi', 'a') 59 | with open(f_name, 'r') as ref: 60 | for line in ref: 61 | tmpp = line.strip().split()[-1] 62 | if tmpp not in mol_list: # avoiding duplicates 63 | mol_list[tmpp] = 1 64 | ref1.write(line) 65 | os.remove(f_name) 66 | 67 | def delete_all(files): 68 | os.remove(files) 69 | 70 | if __name__ == '__main__': 71 | try: 72 | os.mkdir(ITER_PATH + '/smile') 73 | except: # catching exception for when the folder already exists 74 | pass 75 | 76 | files_smiles = [] # Getting the path to every smile file from docking 77 | for f in glob.glob(smile_directory + "/*.txt"): 78 | files_smiles.append(f) 79 | 80 | # getting training, validation, and testing molec IDs from the sets created by `sampling.py` 81 | all_train = {} 82 | all_valid = {} 83 | all_test = {} 84 | with open(ITER_PATH + "/train_set.txt", 'r') as ref: # each txt file contains the IDs corresponding to a morgan fingerprint in morgan folder of that iteration 85 | for line in ref: 86 | all_train[line.rstrip()] = 0 87 | 88 | with open(ITER_PATH + "/valid_set.txt", 'r') as ref: 89 | for line in ref: 90 | all_valid[line.rstrip()] = 0 91 | 92 | with open(ITER_PATH + "/test_set.txt", 'r') as ref: 93 | for line in ref: 94 | all_test[line.rstrip()] = 0 95 | 96 | print(len(all_train), len(all_valid), len(all_test)) 97 | 98 | t = time.time() 99 | with closing(Pool(np.min([tot_process, len(files_smiles)]))) as pool: 100 | pool.map(partial(extract_smile, train=all_train, valid=all_valid, test=all_test), files_smiles) 101 | print(time.time() - t) 102 | 103 | all_to_delete = [] 104 | for type_to in ['train', 'valid', 'test']: 105 | t = time.time() 106 | files = [] 107 | for f in glob.glob(ITER_PATH + '/smile/' + type_to + '*'): 108 | files.append(f) 109 | all_to_delete.append(f) 110 | print(len(files)) 111 | if len(files) == 0: 112 | print("Error in address above") 113 | break 114 | with closing(Pool(np.min([tot_process, len(files)]))) as pool: 115 | to_print = pool.map(alternate_concat, files) 116 | with open(ITER_PATH + '/smile/' + type_to + '_smiles_final.csv', 'w') as ref: 117 | for file_data in to_print: 118 | for line in file_data: 119 | ref.write(line) 120 | to_print = [] 121 | print(type_to, time.time() - t) 122 | 123 | f_names = [] 124 | for f in glob.glob(ITER_PATH + '/smile/*final*'): 125 | f_names.append(f) 126 | 127 | t = time.time() 128 | with closing(Pool(np.min([tot_process, len(f_names)]))) as pool: 129 | pool.map(smile_duplicacy, f_names) 130 | print(time.time() - t) 131 | 132 | with closing(Pool(np.min([tot_process, len(all_to_delete)]))) 
as pool: 133 | pool.map(delete_all, all_to_delete) 134 | -------------------------------------------------------------------------------- /phase_1/molecular_file_count_updated.py: -------------------------------------------------------------------------------- 1 | from multiprocessing import Pool 2 | from contextlib import closing 3 | import pandas as pd 4 | import numpy as np 5 | import argparse 6 | import glob 7 | import time 8 | 9 | try: 10 | import __builtin__ 11 | except ImportError: 12 | # Python 3 13 | import builtins as __builtin__ 14 | 15 | # For debugging purposes only: 16 | def print(*args, **kwargs): 17 | __builtin__.print('\t molecular_file_count_updated: ', end="") 18 | return __builtin__.print(*args, **kwargs) 19 | 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument('-pt','--project_name',required=True,help='Name of the DD project') 22 | parser.add_argument('-it','--n_iteration',required=True,help='Number of current DD iteration') 23 | parser.add_argument('-cdd','--data_directory',required=True,help='Path to directory contaning the remaining molecules of the database ') 24 | parser.add_argument('-t_pos','--tot_process',required=True,help='Number of CPUs to use for multiprocessing') 25 | parser.add_argument('-t_samp','--tot_sampling',required=True,help='Total number of molecules to sample in the current iteration; for first iteration, consider training, validation and test sets, for others only training') 26 | io_args = parser.parse_args() 27 | 28 | 29 | protein = io_args.project_name 30 | n_it = int(io_args.n_iteration) 31 | data_directory = io_args.data_directory 32 | tot_process = int(io_args.tot_process) 33 | tot_sampling = int(io_args.tot_sampling) 34 | 35 | print("Parsed Args:") 36 | print(" - Iteration:", n_it) 37 | print(" - Data Directory:", data_directory) 38 | print(" - Training Size:", tot_process) 39 | print(" - Validation Size:", tot_sampling) 40 | 41 | 42 | def write_mol_count_list(file_name,mol_count_list): 43 | with open(file_name,'w') as ref: 44 | for ct,file_name in mol_count_list: 45 | ref.write(str(ct)+","+file_name.split('/')[-1]) 46 | ref.write("\n") 47 | 48 | 49 | def molecule_count(file_name): 50 | temp = 0 51 | with open(file_name,'r') as ref: 52 | ref.readline() 53 | for line in ref: 54 | temp+=1 55 | return temp, file_name 56 | 57 | 58 | if __name__=='__main__': 59 | files = [] 60 | for f in glob.glob(data_directory+'/*.txt'): 61 | files.append(f) 62 | print("Number Of Files:", len(files)) 63 | 64 | t=time.time() 65 | print("Reading Files...") 66 | with closing(Pool(np.min([tot_process,len(files)]))) as pool: 67 | rt = pool.map(molecule_count,files) 68 | print("Done Reading Finals - Time Taken", time.time()-t) 69 | 70 | print("Saving File Count...") 71 | write_mol_count_list(data_directory+'/Mol_ct_file.csv',rt) 72 | mol_ct = pd.read_csv(data_directory+'/Mol_ct_file.csv',header=None) 73 | mol_ct.columns = ['Number_of_Molecules','file_name'] 74 | Total_sampling = tot_sampling 75 | Total_mols_available = np.sum(mol_ct.Number_of_Molecules) 76 | mol_ct['Sample_for_million'] = [int(Total_sampling/Total_mols_available*elem) for elem in mol_ct.Number_of_Molecules] 77 | mol_ct.to_csv(data_directory+'/Mol_ct_file_updated.csv',sep=',',index=False) 78 | print("Done - Time Taken", time.time()-t) 79 | 80 | -------------------------------------------------------------------------------- /phase_1/sampling.py: -------------------------------------------------------------------------------- 1 | from contextlib import closing 2 | from multiprocessing 
import Pool 3 | import pandas as pd 4 | import numpy as np 5 | import argparse 6 | import glob 7 | import time 8 | import os 9 | 10 | try: 11 | import __builtin__ 12 | except ImportError: 13 | # Python 3 14 | import builtins as __builtin__ 15 | 16 | # For debugging purposes only: 17 | def print(*args, **kwargs): 18 | __builtin__.print('\t sampling: ', end="") 19 | return __builtin__.print(*args, **kwargs) 20 | 21 | 22 | parser = argparse.ArgumentParser() 23 | parser.add_argument('-pt', '--project_name',required=True,help='Name of the DD project') 24 | parser.add_argument('-fp', '--file_path',required=True,help='Path to the project directory, excluding project directory name') 25 | parser.add_argument('-it', '--n_iteration',required=True,help='Number of current iteration') 26 | parser.add_argument('-dd', '--data_directory',required=True,help='Path to directory containing the remaining molecules of the database') 27 | parser.add_argument('-t_pos', '--tot_process',required=True,help='Number of CPUs to use for multiprocessing') 28 | parser.add_argument('-tr_sz', '--train_size',required=True,help='Size of training set') 29 | parser.add_argument('-vl_sz', '--val_size',required=True,help='Size of validation and test set') 30 | io_args = parser.parse_args() 31 | 32 | protein = io_args.project_name 33 | file_path = io_args.file_path 34 | n_it = int(io_args.n_iteration) 35 | data_directory = io_args.data_directory 36 | tot_process = int(io_args.tot_process) 37 | tr_sz = int(io_args.train_size) 38 | vl_sz = int(io_args.val_size) 39 | rt_sz = tr_sz/vl_sz 40 | 41 | print("Parsed Args:") 42 | print(" - Iteration:", n_it) 43 | print(" - Data Directory:", data_directory) 44 | print(" - Training Size:", tr_sz) 45 | print(" - Validation Size:", vl_sz) 46 | 47 | 48 | def train_valid_test(file_name): 49 | sampling_start_time = time.time() 50 | f_name = file_name.split('/')[-1] 51 | mol_ct = pd.read_csv(data_directory+"/Mol_ct_file_updated.csv", index_col=1) 52 | if n_it == 1: 53 | to_sample = int(mol_ct.loc[f_name].Sample_for_million/(rt_sz+2)) 54 | else: 55 | to_sample = int(mol_ct.loc[f_name].Sample_for_million/3) 56 | 57 | total_len = int(mol_ct.loc[f_name].Number_of_Molecules) 58 | shuffle_array = np.linspace(0, total_len-1, total_len) 59 | seed = np.random.randint(0, 2**32) 60 | np.random.seed(seed=seed) 61 | np.random.shuffle(shuffle_array) 62 | 63 | if n_it == 1: 64 | train_ind = shuffle_array[:int(rt_sz*to_sample)] 65 | valid_ind = shuffle_array[int(to_sample*rt_sz):int(to_sample*(rt_sz+1))] 66 | test_ind = shuffle_array[int(to_sample*(rt_sz+1)):int(to_sample*(rt_sz+2))] 67 | else: 68 | train_ind = shuffle_array[:to_sample] 69 | valid_ind = shuffle_array[to_sample:to_sample*2] 70 | test_ind = shuffle_array[to_sample*2:to_sample*3] 71 | 72 | train_ind_dict = {} 73 | valid_ind_dict = {} 74 | test_ind_dict = {} 75 | 76 | train_set = open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/train_set.txt", 'a') 77 | test_set = open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/test_set.txt", 'a') 78 | valid_set = open(file_path + '/' + protein + "/iteration_" + str(n_it) + "/valid_set.txt", 'a') 79 | 80 | for i in train_ind: 81 | train_ind_dict[i] = 1 82 | for j in valid_ind: 83 | valid_ind_dict[j] = 1 84 | for k in test_ind: 85 | test_ind_dict[k] = 1 86 | 87 | # Opens the file and write the test, train, and valid files 88 | with open(file_name, 'r') as ref: 89 | for ind, line in enumerate(ref): 90 | molecule_id = line.strip().split(',')[0] 91 | if ind == 1: 92 | print("molecule_id:", 
molecule_id) 93 | 94 | # now we write to the train, test, and validation sets 95 | if ind in train_ind_dict.keys(): 96 | train_set.write(molecule_id + '\n') 97 | elif ind in valid_ind_dict.keys(): 98 | valid_set.write(molecule_id + '\n') 99 | elif ind in test_ind_dict.keys(): 100 | test_set.write(molecule_id + '\n') 101 | 102 | train_set.close() 103 | valid_set.close() 104 | test_set.close() 105 | print("Process finished sampling in " + str(time.time()-sampling_start_time)) 106 | 107 | if __name__ == '__main__': 108 | try: 109 | os.mkdir(file_path+'/'+protein+"/iteration_"+str(n_it)) 110 | except OSError: 111 | pass 112 | 113 | f_names = [] 114 | for f in glob.glob(data_directory+'/*.txt'): 115 | f_names.append(f) 116 | 117 | t = time.time() 118 | print("Starting Processes...") 119 | with closing(Pool(np.min([tot_process, len(f_names)]))) as pool: 120 | pool.map(train_valid_test, f_names) 121 | 122 | print("Compressing smile file...") 123 | print("Sampling Complete - Total Time Taken:", time.time()-t) 124 | 125 | -------------------------------------------------------------------------------- /phase_1/sanity_check.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import glob 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument('-pt','--protein_name',required=True,help='Name of project') 6 | parser.add_argument('-fp','--file_path',required=True,help='Path to project folder without name of project folder') 7 | parser.add_argument('-it','--n_iteration',required=True,help='Number of current iteration') 8 | 9 | io_args = parser.parse_args() 10 | import time 11 | 12 | protein = io_args.protein_name 13 | file_path = io_args.file_path 14 | n_it = int(io_args.n_iteration) 15 | 16 | old_dict = {} 17 | for i in range(1,n_it): 18 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(i)+'/training_labels*')[-1]) as ref: 19 | ref.readline() 20 | for line in ref: 21 | tmpp = line.strip().split(',')[-1] 22 | old_dict[tmpp] = 1 23 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(i)+'/validation_labels*')[-1]) as ref: 24 | ref.readline() 25 | for line in ref: 26 | tmpp = line.strip().split(',')[-1] 27 | old_dict[tmpp] = 1 28 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(i)+'/testing_labels*')[-1]) as ref: 29 | ref.readline() 30 | for line in ref: 31 | tmpp = line.strip().split(',')[-1] 32 | old_dict[tmpp] = 1 33 | 34 | t=time.time() 35 | new_train = {} 36 | new_valid = {} 37 | new_test = {} 38 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(n_it)+'/train_set*')[-1]) as ref: 39 | for line in ref: 40 | tmpp = line.strip().split(',')[0] 41 | new_train[tmpp] = 1 42 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(n_it)+'/valid_set*')[-1]) as ref: 43 | for line in ref: 44 | tmpp = line.strip().split(',')[0] 45 | new_valid[tmpp] = 1 46 | with open(glob.glob(file_path+'/'+protein+'/iteration_'+str(n_it)+'/test_set*')[-1]) as ref: 47 | for line in ref: 48 | tmpp = line.strip().split(',')[0] 49 | new_test[tmpp] = 1 50 | print(time.time()-t) 51 | 52 | t=time.time() 53 | for keys in new_train.keys(): 54 | if keys in new_valid.keys(): 55 | new_valid.pop(keys) 56 | if keys in new_test.keys(): 57 | new_test.pop(keys) 58 | for keys in new_valid.keys(): 59 | if keys in new_test.keys(): 60 | new_test.pop(keys) 61 | print(time.time()-t) 62 | 63 | for keys in old_dict.keys(): 64 | if keys in new_train.keys(): 65 | new_train.pop(keys) 66 | if keys in new_valid.keys(): 67 | new_valid.pop(keys) 
68 | if keys in new_test.keys(): 69 | new_test.pop(keys) 70 | 71 | with open(file_path+'/'+protein+'/iteration_'+str(n_it)+'/train_set.txt','w') as ref: 72 | for keys in new_train.keys(): 73 | ref.write(keys+'\n') 74 | with open(file_path+'/'+protein+'/iteration_'+str(n_it)+'/valid_set.txt','w') as ref: 75 | for keys in new_valid.keys(): 76 | ref.write(keys+'\n') 77 | with open(file_path+'/'+protein+'/iteration_'+str(n_it)+'/test_set.txt','w') as ref: 78 | for keys in new_test.keys(): 79 | ref.write(keys+'\n') 80 | -------------------------------------------------------------------------------- /phase_2-3/Extract_labels.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import builtins as __builtin__ 3 | import glob 4 | import gzip 5 | import os 6 | from contextlib import closing 7 | from multiprocessing import Pool 8 | 9 | 10 | # For debugging purposes only: 11 | def print(*args, **kwargs): 12 | __builtin__.print('\t extract_L: ', end="") 13 | return __builtin__.print(*args, **kwargs) 14 | 15 | 16 | parser = argparse.ArgumentParser() 17 | parser.add_argument('-if', '--is_final', required=True, help='True/False for is this the final iteration?') 18 | parser.add_argument('-n_it', '--iteration_no', required=True, help='Number of current iteration') 19 | parser.add_argument('-protein', '--protein', required=True, help='Name of the DD project') 20 | parser.add_argument('-file_path', '--file_path', required=True, help='Path to the project directory, excluding project directory name') 21 | parser.add_argument('-t_pos', '--tot_process', required=True, help='Number of CPUs to use for multiprocessing') 22 | parser.add_argument('-score', '--score_keyword', required=True, help='Score keyword. Name of the field storing the docking score in the SDF files of docking results') 23 | 24 | io_args = parser.parse_args() 25 | 26 | is_final = io_args.is_final 27 | n_it = int(io_args.iteration_no) 28 | protein = io_args.protein 29 | file_path = io_args.file_path 30 | tot_process = int(io_args.tot_process) 31 | key_word = str(io_args.score_keyword) 32 | 33 | if is_final == 'False' or is_final == 'false': 34 | is_final = False 35 | elif is_final == 'True' or is_final == 'true': 36 | is_final = True 37 | else: 38 | raise TypeError('-if parameter must be a boolean (true/false)') 39 | 40 | # mol_key = 'ZINC' 41 | print("Keyword: ", key_word) 42 | 43 | 44 | def get_scores(ref): 45 | scores = [] 46 | for line in ref: # Looping through the molecules 47 | zinc_id = line.rstrip() 48 | line = ref.readline() 49 | # '$$$' signifies end of molecule info 50 | while line != '' and line[:4] != '$$$$': # Looping through its information and saving scores 51 | 52 | tmp = line.rstrip().split('<')[-1] 53 | 54 | if key_word == tmp[:-1]: 55 | tmpp = float(ref.readline().rstrip()) 56 | if tmpp > 50 or tmpp < -50: 57 | print(zinc_id, tmpp) 58 | else: 59 | scores.append([zinc_id, tmpp]) 60 | 61 | line = ref.readline() 62 | return scores 63 | 64 | 65 | def extract_glide_score(filen): 66 | scores = [] 67 | try: 68 | # Opening the GNU compressed file 69 | with gzip.open(filen, 'rt') as ref: 70 | scores = get_scores(ref) 71 | 72 | except Exception as e: 73 | print('Exception occured in Extract_labels.py: ', e) 74 | # file is already decompressed 75 | with open(filen, 'r') as ref: 76 | scores = get_scores(ref) 77 | 78 | if 'test' in os.path.basename(filen): 79 | new_name = 'testing' 80 | elif 'valid' in os.path.basename(filen): 81 | new_name = 'validation' 82 | elif 'train' in 
os.path.basename(filen): 83 | new_name = 'training' 84 | else: 85 | print("Could not generate new training set") 86 | exit() 87 | 88 | with open(file_path + '/' + protein + '/iteration_' + str(n_it) + '/' + new_name + '_' + 'labels.txt', 'w') as ref: 89 | ref.write('r_i_docking_score' + ',' + 'ZINC_ID' + '\n') 90 | for z_id, gc in scores: 91 | ref.write(str(gc) + ',' + z_id + '\n') 92 | 93 | 94 | if __name__ == '__main__': 95 | files = [] 96 | iter_path = file_path + '/' + protein + '/iteration_' + str(n_it) 97 | 98 | # Checking to see if the labels have already been extracted: 99 | sets = ["training", "testing", "validation"] 100 | files_labels = glob.glob(iter_path + "/*_labels.txt") 101 | foundAll = True 102 | for s in sets: 103 | found = False 104 | print(s) 105 | for f in files_labels: 106 | set_name = f.split('/')[-1].split("_labels.txt")[0] 107 | if set_name == s: 108 | found = True 109 | print('Found') 110 | break 111 | if not found: 112 | foundAll = False 113 | print('Not Found') 114 | break 115 | if foundAll: 116 | print('Labels have already been extracted...') 117 | print('Remove "_labels.text" files in \"' + iter_path + '\" to re-extract') 118 | exit(0) 119 | 120 | # Checking to see if this is the final iteration to use the right folder 121 | if is_final: 122 | path = file_path + '/' + protein + '/after_iteration/docked/*.sdf*' 123 | else: 124 | path = iter_path + '/docked/*.sdf*' 125 | path_labels = iter_path + '/*labels*' 126 | 127 | for f in glob.glob(path): 128 | files.append(f) 129 | 130 | print("num files in", path, ":", len(files)) 131 | print("Files:", [os.path.basename(f) for f in files]) 132 | if len(files) == 0: 133 | print('NO FILES IN: ', path) 134 | print('CANCEL JOB...') 135 | exit(1) 136 | 137 | # Parallel running of the extract_glide_score() with each file path of the files array 138 | with closing(Pool(len(files))) as pool: 139 | pool.map(extract_glide_score, files) 140 | 141 | if not is_final: 142 | # renaming from f1_f2_f3 to f3_labels.txt 143 | try: 144 | for f in glob.glob(path_labels): 145 | print(f, iter_path + '/' + f.split('/')[-1].split('_')[2] + '_' + 'labels.txt') 146 | os.rename(f, iter_path + '/' + f.split('/')[-1].split('_')[2] + '_' + 'labels.txt') ## why are we renaming this? 
147 | except IndexError: 148 | print("Index error on renaming", f) 149 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDCallbacks.py: -------------------------------------------------------------------------------- 1 | """ 2 | James Gleave 3 | v1.1.0 4 | """ 5 | 6 | from tensorflow.keras.callbacks import Callback 7 | import pandas as pd 8 | import time 9 | import os 10 | 11 | 12 | class DDLogger(Callback): 13 | """ 14 | Logs the important data regarding model training 15 | """ 16 | 17 | def __init__(self, log_path, 18 | max_time=36000, 19 | max_epochs=500, 20 | monitoring='val_loss', ): 21 | super(Callback, self).__init__() 22 | # Params 23 | self.max_time = max_time 24 | self.max_epochs = max_epochs 25 | self.monitoring = monitoring 26 | 27 | # Stats 28 | self.epoch_start_time = 0 29 | self.current_epoch = 0 30 | 31 | # File 32 | self.log_path = log_path 33 | self.model_history = {} 34 | 35 | def on_train_begin(self, logs={}): 36 | self.epoch_start_time = time.time() 37 | 38 | def on_epoch_begin(self, epoch, logs=None): 39 | self.epoch_start_time = time.time() 40 | 41 | def on_epoch_end(self, epoch, logs={}): 42 | # Store the data 43 | current_time = time.time() 44 | epoch_duration = current_time - self.epoch_start_time 45 | logs['time_per_epoch'] = epoch_duration 46 | self.model_history["epoch_" + str(epoch + 1)] = logs 47 | 48 | # Estimate time to completion 49 | estimate, elapsed, (s, p, x) = self.estimate_training_time() 50 | logs['estimate_time'] = estimate 51 | logs['time_elapsed'] = elapsed 52 | self.model_history["epoch_" + str(epoch + 1)] = logs 53 | 54 | # Save the data to a csv 55 | df = pd.DataFrame(self.model_history) 56 | df.to_csv(self.log_path) 57 | 58 | print("Time taken calculating callbacks:", time.time()-current_time) 59 | 60 | def estimate_training_time(self): 61 | max_allotted_time = self.max_time 62 | max_allotted_epochs = self.max_epochs 63 | 64 | # Grab the info about the model 65 | model_loss = [] 66 | time_per_epoch = [] 67 | for epoch in self.model_history: 68 | model_loss.append(self.model_history[epoch]['val_loss']) 69 | time_per_epoch.append(self.model_history[epoch]['time_per_epoch']) 70 | 71 | time_elapsed = sum(time_per_epoch) 72 | average_time_per_epoch = sum(time_per_epoch) / len(time_per_epoch) 73 | current_epoch = len(time_per_epoch) 74 | 75 | # Find out if the model is approaching an early stop 76 | epochs_until_early_stop = 10 77 | stopping_vector = [] 78 | prev_loss = model_loss[0] 79 | for loss in model_loss: 80 | improved = loss < prev_loss 81 | stopping_vector.append(improved) 82 | if improved: 83 | prev_loss = loss 84 | 85 | # Check how close we are to an early stop 86 | longest_failure = 0 87 | for improved in stopping_vector: 88 | if not improved: 89 | longest_failure += 1 90 | else: 91 | longest_failure = 0 92 | 93 | max_time = max_allotted_epochs * average_time_per_epoch if max_allotted_epochs * average_time_per_epoch < max_allotted_time else max_allotted_time 94 | time_if_early_stop = (epochs_until_early_stop - longest_failure) * average_time_per_epoch 95 | 96 | # Estimate a completion time 97 | loss_drops = stopping_vector.count(True) 98 | loss_gains = len(stopping_vector) - loss_drops 99 | try: 100 | gain_drop_ratio = loss_gains / loss_drops 101 | except ZeroDivisionError: 102 | gain_drop_ratio = 0 103 | 104 | # Created a function to estimate training time 105 | power = 1 - (gain_drop_ratio ** 3 / 5) 106 | time_estimate = (max_time ** power) / (1 + longest_failure) 107 | 108 | # Smooth 
out the estimate 109 | if current_epoch > 1: 110 | last = self.model_history['epoch_{}'.format(current_epoch - 1)]['estimate_time'] 111 | time_estimate = (time_estimate + last) / 2 112 | 113 | # If the time estimate surpasses the max time then just show the max time 114 | time_for_remaining_epochs = (self.max_epochs - current_epoch) * average_time_per_epoch 115 | if time_for_remaining_epochs < time_estimate: 116 | time_estimate = time_for_remaining_epochs 117 | 118 | return time_estimate, time_elapsed, (longest_failure, gain_drop_ratio, max_time) 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDMetrics.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow as tf 4 | from tensorflow.keras import backend as K 5 | 6 | 7 | def recall(y_true, y_pred): 8 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 9 | possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) 10 | recall_keras = true_positives / (possible_positives + K.epsilon()) 11 | return recall_keras 12 | 13 | 14 | def precision(y_true, y_pred): 15 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 16 | predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) 17 | precision_keras = true_positives / (predicted_positives + K.epsilon()) 18 | return precision_keras 19 | 20 | 21 | def specificity(y_true, y_pred): 22 | tn = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1))) 23 | fp = K.sum(K.round(K.clip((1 - y_true) * y_pred, 0, 1))) 24 | return tn / (tn + fp + K.epsilon()) 25 | 26 | 27 | def negative_predictive_value(y_true, y_pred): 28 | tn = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1))) 29 | fn = K.sum(K.round(K.clip(y_true * (1 - y_pred), 0, 1))) 30 | return tn / (tn + fn + K.epsilon()) 31 | 32 | 33 | def f1(y_true, y_pred): 34 | p = precision(y_true, y_pred) 35 | r = recall(y_true, y_pred) 36 | return 2 * ((p * r) / (p + r + K.epsilon())) 37 | 38 | 39 | def fbeta(y_true, y_pred, beta=2): 40 | y_pred = K.clip(y_pred, 0, 1) 41 | 42 | tp = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)), axis=1) 43 | fp = K.sum(K.round(K.clip(y_pred - y_true, 0, 1)), axis=1) 44 | fn = K.sum(K.round(K.clip(y_true - y_pred, 0, 1)), axis=1) 45 | 46 | p = tp / (tp + fp + K.epsilon()) 47 | r = tp / (tp + fn + K.epsilon()) 48 | 49 | num = (1 + beta ** 2) * (p * r) 50 | den = (beta ** 2 * p + r + K.epsilon()) 51 | return K.mean(num / den) 52 | 53 | 54 | def matthews_correlation_coefficient(y_true, y_pred): 55 | tp = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 56 | tn = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1))) 57 | fp = K.sum(K.round(K.clip((1 - y_true) * y_pred, 0, 1))) 58 | fn = K.sum(K.round(K.clip(y_true * (1 - y_pred), 0, 1))) 59 | 60 | num = tp * tn - fp * fn 61 | den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) 62 | return num / K.sqrt(den + K.epsilon()) 63 | 64 | 65 | def equal_error_rate(y_true, y_pred): 66 | n_imp = tf.count_nonzero(tf.equal(y_true, 0), dtype=tf.float32) + tf.constant(K.epsilon()) 67 | n_gen = tf.count_nonzero(tf.equal(y_true, 1), dtype=tf.float32) + tf.constant(K.epsilon()) 68 | 69 | scores_imp = tf.boolean_mask(y_pred, tf.equal(y_true, 0)) 70 | scores_gen = tf.boolean_mask(y_pred, tf.equal(y_true, 1)) 71 | 72 | loop_vars = (tf.constant(0.0), tf.constant(1.0), tf.constant(0.0)) 73 | cond = lambda t, fpr, fnr: tf.greater_equal(fpr, fnr) 74 | body = lambda t, fpr, fnr: ( 75 | t + 0.001, 76 | 
tf.divide(tf.count_nonzero(tf.greater_equal(scores_imp, t), dtype=tf.float32), n_imp), 77 | tf.divide(tf.count_nonzero(tf.less(scores_gen, t), dtype=tf.float32), n_gen) 78 | ) 79 | t, fpr, fnr = tf.while_loop(cond, body, loop_vars, back_prop=False) 80 | eer = (fpr + fnr) / 2 81 | 82 | return eer 83 | 84 | 85 | def get_metric(name): 86 | metrics = {"recall": tf.keras.metrics.Recall(), 87 | "precision": tf.keras.metrics.Precision(), 88 | "specificity": specificity, 89 | "negative_predictive_value": negative_predictive_value, 90 | "f1": f1, 91 | "fbeta": fbeta, 92 | "equal_error_rate": equal_error_rate, 93 | "matthews_correlation_coefficient": matthews_correlation_coefficient} 94 | keys = list(metrics.keys()) 95 | assert name in keys, print("Cannot find metric " + name, ". Available metrics are {}".format(keys)) 96 | return metrics[name] 97 | 98 | 99 | class DDMetrics: 100 | def __init__(self, model): 101 | self.model = model 102 | self.params = model.count_params() 103 | 104 | @staticmethod 105 | def scaled_performance(y_true, y_pred): 106 | p = precision(y_true, y_pred) 107 | f = f1(y_true, y_pred) 108 | return ((p*p) + (f*f))/2 109 | 110 | def relative_scaled_performance(self, y_true, y_pred): 111 | params = self.params / 1_000_000 112 | sp = self.scaled_performance(y_true, y_pred) 113 | return sp/(1.03 ** params) 114 | 115 | def relative_precision(self, y_true, y_pred): 116 | p = precision(y_true, y_pred) 117 | params = self.params / 1_000_000 118 | return p/params 119 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDModel.py: -------------------------------------------------------------------------------- 1 | """ 2 | Version 1.1.2 3 | """ 4 | from .Tokenizer import DDTokenizer 5 | from sklearn import preprocessing 6 | from .DDModelExceptions import * 7 | from tensorflow.keras import backend 8 | from .Models import Models 9 | from .Parser import Parser 10 | import tensorflow as tf 11 | import pandas as pd 12 | import numpy as np 13 | import keras 14 | import time 15 | import os 16 | 17 | import warnings 18 | warnings.filterwarnings('ignore') 19 | 20 | 21 | class DDModel(Models): 22 | """ 23 | A class responsible for creating, storing, and working with our deep docking models 24 | """ 25 | 26 | def __init__(self, mode, input_shape, hyperparameters, metrics=None, loss='binary_crossentropy', regression=False, 27 | name="model"): 28 | 29 | """ 30 | Parameters 31 | ---------- 32 | mode : str 33 | A string indicating which model to use 34 | input_shape : tuple or list 35 | The input shape for the model 36 | hyperparameters : dict 37 | A dictionary containing the hyperparameters for the DDModel's model 38 | metrics : list 39 | The metric(s) used by keras 40 | loss : str 41 | The loss function used by keras 42 | regression : bool 43 | Set to true if the model is performing regression 44 | """ 45 | 46 | if metrics is None: 47 | self.metrics = ['accuracy'] 48 | else: 49 | self.metrics = metrics 50 | 51 | # Use regression or use binary classification 52 | output_activation = 'linear' if regression else 'sigmoid' 53 | 54 | # choose the loss function 55 | self.loss_func = loss 56 | if regression and loss == 'binary_crossentropy': 57 | self.loss_func = 'mean_squared_error' 58 | hyperparameters["loss_func"] = self.loss_func 59 | 60 | if mode == "loaded_model": 61 | super().__init__(hyperparameters={'bin_array': [], 62 | 'dropout_rate': 0, 63 | 'learning_rate': 0, 64 | 'num_units': 0, 65 | 'epsilon': 0}, 66 | output_activation=output_activation, 
name=name) 67 | self.mode = "" 68 | 69 | self.input_shape = () 70 | 71 | self.history = keras.callbacks.History() 72 | self.time = {"training_time": -1, "prediction_time": -1} 73 | else: 74 | # Create a model 75 | super().__init__(hyperparameters=hyperparameters, 76 | output_activation=output_activation, name=name) 77 | 78 | self.mode = mode 79 | 80 | self.input_shape = input_shape 81 | 82 | self.history = keras.callbacks.History() 83 | self.time = {'training_time': -1, "prediction_time": -1} 84 | 85 | self.model = self._create_model() 86 | self._compile() 87 | 88 | def fit(self, train_x, train_y, epochs, batch_size, shuffle, class_weight, verbose, validation_data, callbacks): 89 | 90 | """ 91 | Reshapes the input data and fits the model 92 | 93 | Parameters 94 | ---------- 95 | train_x : ndarray 96 | Training data 97 | train_y : ndarray 98 | Training labels 99 | epochs : int 100 | Number of epochs to train on 101 | batch_size : int 102 | The batch size 103 | shuffle : bool 104 | Whether to shuffle the data 105 | class_weight : dict 106 | The class weights 107 | verbose : int 108 | The verbose 109 | validation_data : list 110 | The validation data and labels 111 | callbacks : list 112 | Keras callbacks 113 | """ 114 | 115 | # First reshape the data to fit the chosen model 116 | # Here we form the shape 117 | shape_train_x = [train_x.shape[0]] 118 | shape_valid_x = [validation_data[0].shape[0]] 119 | for val in self.model.input_shape[1:]: 120 | shape_train_x.append(val) 121 | shape_valid_x.append(val) 122 | 123 | # Here we reshape the data 124 | # Format: shape = (size of data, input_shape[0], ..., input_shape[n] 125 | train_x = np.reshape(train_x, shape_train_x) 126 | validation_data_x = np.reshape(validation_data[0], shape_valid_x) 127 | validation_data_y = validation_data[1] 128 | validation_data = (validation_data_x, validation_data_y) 129 | 130 | # Track the training time 131 | training_time = time.time() 132 | 133 | # If we are in regression mode, ignore the class weights 134 | if self.output_activation == 'linear': 135 | class_weight = None 136 | 137 | # Train the model and store the history 138 | self.history = self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, shuffle=shuffle, 139 | class_weight=class_weight, verbose=verbose, validation_data=validation_data, 140 | callbacks=callbacks) 141 | 142 | # Store the training time 143 | training_time = time.time() - training_time 144 | self.time['training_time'] = training_time 145 | print("Training time:", training_time) 146 | 147 | def predict(self, x_test, verbose=0): 148 | 149 | """ 150 | Reshapes the input data and returns the models predictions 151 | 152 | Parameters 153 | ---------- 154 | x_test : ndarray 155 | The test data 156 | 157 | verbose : int 158 | The verbose of the model's prediction 159 | 160 | Returns 161 | ------- 162 | predictions : ndarray 163 | The model's predictions 164 | """ 165 | # We must reshape the test data to fit our model 166 | shape = [x_test.shape[0]] 167 | for val in list(self.model.input_shape)[1:]: 168 | shape.append(val) 169 | 170 | x_test = np.reshape(x_test, newshape=shape) 171 | 172 | # Predict and return the predictions 173 | prediction_time = time.time() # Keep track of how long prediction took 174 | predictions = self.model.predict(x_test, verbose=verbose) # Predict 175 | prediction_time = time.time() - prediction_time # Update prediction time 176 | self.time['prediction_time'] = prediction_time # Store the prediction time 177 | 178 | return predictions 179 | 180 | def 
save(self, path, json=False): 181 | self._write_stats_to_file(path) 182 | if json: 183 | json_model = self.model.to_json() 184 | with open(path + ".json", 'w') as json_file: 185 | json_file.write(json_model) 186 | else: 187 | try: 188 | self.model.save(path, save_format='h5') 189 | except: 190 | print("Could not save as h5 file. This is probably due to tensorflow version.") 191 | print("If the model is saved as a directory, it will cause issues.") 192 | print("Trying to save again...") 193 | self.model.save(path) 194 | 195 | def load_stats(self, path): 196 | """ 197 | Load the stats from a .ddss file into the current DDModel 198 | 199 | Parameters 200 | ---------- 201 | path : str 202 | """ 203 | 204 | info = Parser.parse_ddss(path) 205 | for key in info.keys(): 206 | try: 207 | self.__dict__[key] = info[key] 208 | except KeyError: 209 | print(key, 'is not an attribute of this class.') 210 | self.input_shape = "Loaded Model -> Input shape will be inferred" 211 | 212 | if self.time == {}: 213 | self.time = {"training_time": 'Could Not Be Loaded', "prediction_time": 'Could Not Be Loaded'} 214 | 215 | def _write_stats_to_file(self, path="", return_string=False): 216 | info = "* {}'s Stats * \n".format(self.name) 217 | info += "- Model mode: " + self.mode + " \n" 218 | info += "\n" 219 | # Write the timings 220 | if not isinstance(self.time['training_time'], str) and self.time['training_time'] > -1: 221 | if isinstance(self.history, dict): 222 | num_eps = self.history['total_epochs'] 223 | else: 224 | num_eps = len(self.history.history['loss']) 225 | 226 | info += "- Model Time: \n" 227 | info += " - training_time: {train_time}".format(train_time=self.time['training_time']) + " \n" 228 | info += " - time_per_epoch: {epoch_time}".format(epoch_time=(self.time['training_time'] / num_eps)) + " \n" 229 | info += " - prediction_time: {pred_time}".format(pred_time=self.time['prediction_time']) + " \n" 230 | else: 231 | info += "- Model Time: \n" 232 | info += " - Model has not been trained yet. \n" 233 | info += "\n" 234 | 235 | # Write the history 236 | try: 237 | info += "- History Stats: \n" 238 | if isinstance(self.history, dict): 239 | hist = self.history 240 | else: 241 | hist = self.history.history 242 | 243 | # Get all the history values and keys stored 244 | for key in hist: 245 | try: 246 | info += " - {key}: {val} \n".format(key=key, val=hist[key][-1]) 247 | except TypeError: 248 | info += " - {key}: {val} \n".format(key=key, val=hist[key]) 249 | 250 | try: 251 | try: 252 | info += " - total_epochs: {epochs}".format(epochs=len(hist['loss'])) 253 | except TypeError: 254 | pass 255 | 256 | info += "\n" 257 | except KeyError: 258 | info += " - Model has not been trained yet. \n" 259 | 260 | except (AttributeError, KeyError): 261 | # No history is available -> the model has not been trained yet 262 | info += " - Model has not been trained yet. \n" 263 | info += "\n" 264 | 265 | # Write the hyperparameters 266 | info += "- Hyperparameter Stats: \n" 267 | for key in self.hyperparameters.keys(): 268 | if key != 'bin_array' or len(self.hyperparameters[key]) > 0: 269 | info += " - {key}: {val} \n".format(key=key, val=self.hyperparameters[key]) 270 | info += "\n" 271 | 272 | # Write stats about the model architecture 273 | info += "- Model Architecture Stats: \n" 274 | try: 275 | trainable_count = int( 276 | np.sum([backend.count_params(p) for p in set(self.model.trainable_weights)])) 277 | non_trainable_count = int( 278 | np.sum([backend.count_params(p) for p in set(self.model.non_trainable_weights)])) 279 | 280 | info += ' - total_params: {:,} \n'.format(trainable_count + non_trainable_count) 281 | info += ' - trainable_params: {:,} \n'.format(trainable_count) 282 | info += ' - non_trainable_params: {:,} \n'.format(non_trainable_count) 283 | info += "\n" 284 | except (TypeError, AttributeError): 285 | info += ' - total_params: Cannot be determined \n' 286 | info += ' - trainable_params: Cannot be determined \n' 287 | info += ' - non_trainable_params: Cannot be determined \n' 288 | info += "\n" 289 | 290 | # Create a layer display 291 | display_string = "" 292 | for i, layer in enumerate(self.model.layers): 293 | if i == 0: 294 | display_string += "Input: \n" 295 | 296 | display_string += " ↓ [ {name} ] \n".format(name=layer.name) 297 | info += display_string 298 | 299 | if not return_string: 300 | with open(path + '.ddss', 'w') as stat_file: 301 | stat_file.write(info) 302 | else: 303 | return info 304 | 305 | def _create_model(self): 306 | """Creates and returns a model 307 | 308 | Raises 309 | ------ 310 | IncorrectModelModeError 311 | If a mode was passed that does not exist, this error will be raised 312 | """ 313 | # Try creating the model and raise an exception if it fails 314 | try: 315 | model = getattr(self, self.mode, None)(self.input_shape) 316 | except TypeError: 317 | raise IncorrectModelModeError(self.mode, Models.get_available_modes()) 318 | return model 319 | 320 | def _compile(self): 321 | """Compiles the DDModel object's model""" 322 | 323 | if 'epsilon' not in self.hyperparameters.keys(): 324 | self.hyperparameters['epsilon'] = 1e-06 325 | 326 | adam_opt = tf.keras.optimizers.Adam(learning_rate=self.hyperparameters['learning_rate'], 327 | epsilon=self.hyperparameters['epsilon']) 328 | self.model.compile(optimizer=adam_opt, loss=self.loss_func, metrics=self.metrics) 329 | 330 | @staticmethod 331 | def load(model, **kwargs): 332 | pre_compiled = True 333 | # Can be a path to a model or a model instance 334 | if type(model) is str: 335 | dd_model = DDModel(mode="loaded_model", input_shape=[], hyperparameters={}) 336 | 337 | # If we would like to load from a json, we can do that as well. 338 | if '.json' in model: 339 | dd_model.model = tf.keras.models.model_from_json(open(model).read(), 340 | custom_objects=Models.get_custom_objects()) 341 | model = model.replace('.json', "") 342 | pre_compiled = False 343 | else: 344 | dd_model.model = tf.keras.models.load_model(model, custom_objects=Models.get_custom_objects()) 345 | else: 346 | dd_model = DDModel(mode="loaded_model", input_shape=[], hyperparameters={}) 347 | dd_model.model = model 348 | 349 | if 'kt_hyperparameters' in kwargs.keys(): 350 | hyp = kwargs['kt_hyperparameters'].get_config()['values'] 351 | for key in hyp.keys(): 352 | try: 353 | dd_model.__dict__['hyperparameters'][key] = hyp[key] 354 | if key == 'kernel_reg': 355 | dd_model.__dict__['hyperparameters'][key] = ['None', 'Lasso', 'l1', 'l2'][int(hyp[key])] 356 | 357 | except KeyError: 358 | print(key, 'is not an attribute of this class.') 359 | else: 360 | # Try to load a stats file 361 | try: 362 | dd_model.load_stats(model + ".ddss") 363 | except (TypeError, FileNotFoundError): 364 | print("Could not find a stats file...") 365 | 366 | if 'metrics' in kwargs.keys(): 367 | dd_model.metrics = kwargs['metrics'] 368 | else: 369 | dd_model.metrics = ['accuracy'] 370 | 371 | if not pre_compiled: 372 | dd_model._compile() 373 | 374 | if 'name' in kwargs.keys(): 375 | dd_model.name = kwargs['name'] 376 | 377 | dd_model.mode = 'loaded_model' 378 | return dd_model 379 | 380 | @staticmethod 381 | def process_smiles(smiles, vocab_size=100, fit_range=1000, normalize=True, use_padding=True, padding_size=None, one_hot=False): 382 | # Create the tokenizer 383 | tokenizer = DDTokenizer(vocab_size) 384 | # Fit the tokenizer 385 | tokenizer.fit(smiles[0:fit_range]) 386 | # Encode the smiles 387 | encoded_smiles = tokenizer.encode(data=smiles, use_padding=use_padding, 388 | padding_size=padding_size, normalize=normalize) 389 | 390 | if one_hot: 391 | encoded_smiles = DDModel.one_hot_encode(encoded_smiles, len(tokenizer.word_index)) 392 | 393 | return encoded_smiles 394 | 395 | @staticmethod 396 | def one_hot_encode(encoded_smiles, unique_category_count): 397 | one_hot = keras.backend.one_hot(encoded_smiles, unique_category_count) 398 | return one_hot 399 | 400 | @staticmethod 401 | def normalize(values: pd.Series): 402 | assert type(values) is pd.Series, "Type Error -> Expected pandas.Series" 403 | # Extract the indices and name 404 | indices = values.index 405 | name = values.index.name 406 | 407 | # Normalize the values to [0, 1] 408 | normalized_values = preprocessing.minmax_scale(values, (0, 1)) 409 | 410 | # Create a pandas series to return 411 | values = pd.Series(index=indices, data=normalized_values, name=name) 412 | return values 413 | 414 | def __repr__(self): 415 | return self._write_stats_to_file(return_string=True) 416 | -------------------------------------------------------------------------------- /phase_2-3/ML/DDModelExceptions.py: -------------------------------------------------------------------------------- 1 | class Error(Exception): 2 | """Base class for other exceptions""" 3 | pass 4 | 5 | 6 | class IncorrectModelModeError(Error): 7 | """Exception raised for errors in the model mode. 8 | 9 | Attributes: 10 | mode -- input mode which caused the error 11 | message -- explanation of the error 12 | """ 13 | def __init__(self, mode, available_modes, message="Incorrect model mode. 
Use one of the following modes:"): 14 | self.mode = mode 15 | self.message = message 16 | self.available_modes = available_modes 17 | 18 | def __str__(self): 19 | mode_string = "\n\n" 20 | for mode in self.available_modes: 21 | mode_string += " " + mode + "\n" 22 | 23 | return f'{self.mode} -> {self.message}' + mode_string 24 | -------------------------------------------------------------------------------- /phase_2-3/ML/Models.py: -------------------------------------------------------------------------------- 1 | """ 2 | Version 1.1.0 3 | 4 | The model to be used in deep docking 5 | James Gleave 6 | """ 7 | 8 | import keras 9 | from ML.lasso_regularizer import Lasso 10 | from ML.DDMetrics import * 11 | from tensorflow.keras.models import Model, Sequential 12 | from tensorflow.keras.layers import (Input, Dense, Activation, BatchNormalization, Dropout, LSTM, 13 | Conv2D, MaxPool2D, Flatten, Embedding, MaxPooling1D, 14 | Conv1D) 15 | from tensorflow.keras.regularizers import * 16 | 17 | import warnings 18 | warnings.filterwarnings('ignore') 19 | 20 | # TODO Give user option to make their own model 21 | 22 | 23 | class Models: 24 | def __init__(self, hyperparameters, output_activation, name="model"): 25 | """ 26 | Class to hold various NN architectures allowing for cleaner code when determining which architecture 27 | is best suited for DD. 28 | 29 | :param hyperparameters: a dictionary holding the parameters for the models: 30 | ('bin_array', 'dropout_rate', 'learning_rate', 'num_units') 31 | """ 32 | self.hyperparameters = hyperparameters 33 | self.output_activation = output_activation 34 | self.name = name 35 | 36 | def original(self, input_shape): 37 | x_input = Input(input_shape, name="original") 38 | x = x_input 39 | for j, i in enumerate(self.hyperparameters['bin_array']): 40 | if i == 0: 41 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_%i" % (j + 1))(x) 42 | x = BatchNormalization()(x) 43 | x = Activation('relu')(x) 44 | else: 45 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 46 | x = Dense(1, activation=self.output_activation, name="Output_Layer")(x) 47 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 48 | return model 49 | 50 | def dense_dropout(self, input_shape): 51 | """This is the most simple neural architecture. 52 | Four dense layers, batch normalization, relu activation, and dropout. 53 | """ 54 | # The model input 55 | x_input = Input(input_shape, name='dense_dropout') 56 | x = x_input 57 | 58 | # Model happens here... 
59 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_1")(x) 60 | x = BatchNormalization()(x) 61 | x = Activation('relu')(x) 62 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 63 | 64 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_2")(x) 65 | x = BatchNormalization()(x) 66 | x = Activation('relu')(x) 67 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 68 | 69 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_3")(x) 70 | x = BatchNormalization()(x) 71 | x = Activation('relu')(x) 72 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 73 | 74 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer_4")(x) 75 | x = BatchNormalization()(x) 76 | x = Activation('relu')(x) 77 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 78 | 79 | # output 80 | x = Dense(1, activation=self.output_activation, name="Output_Layer")(x) 81 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 82 | return model 83 | 84 | def wide_net(self, input_shape): 85 | """ 86 | A simple square model 87 | """ 88 | # The model input 89 | x_input = Input(input_shape, name='wide_net') 90 | x = x_input 91 | 92 | # The width coefficient 93 | width_coef = len(self.hyperparameters['bin_array'])//2 94 | 95 | for i, layer in enumerate(self.hyperparameters['bin_array']): 96 | if layer == 0: 97 | layer_name = "Hidden_Dense_Layer_" + str(i//2) 98 | x = Dense(self.hyperparameters['num_units'] * width_coef, name=layer_name)(x) 99 | x = Activation('relu')(x) 100 | else: 101 | x = BatchNormalization()(x) 102 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 103 | 104 | # output 105 | x = Dense(1, activation=self.output_activation, name="Output_Layer")(x) 106 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 107 | return model 108 | 109 | def shared_layer(self, input_shape): 110 | """This Model uses a shared layer""" 111 | # The model input 112 | x_input = Input(input_shape, name="shared_layer") 113 | 114 | # Here is a layer that will be shared 115 | shared_layer = Dense(input_shape[0], name="Shared_Hidden_Layer") 116 | 117 | # Apply the layer twice 118 | x = shared_layer(x_input) 119 | x = shared_layer(x) 120 | x = BatchNormalization()(x) 121 | x = Activation('relu')(x) 122 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 123 | 124 | # Apply dropout and normalization 125 | x = Dense(self.hyperparameters['num_units'], name="Hidden_Layer")(x) 126 | x = BatchNormalization()(x) 127 | x = Activation('relu')(x) 128 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 129 | 130 | for i, layer in enumerate(self.hyperparameters['bin_array']): 131 | if layer == 0: 132 | layer_name = "Hidden_Layer_" + str(i) 133 | x = Dense(self.hyperparameters['num_units']//i, name=layer_name)(x) 134 | x = BatchNormalization()(x) 135 | x = Activation('relu')(x) 136 | x = Dropout(self.hyperparameters['dropout_rate'])(x) 137 | 138 | # output 139 | x = Dense(1, activation=self.output_activation, name="output_layer")(x) 140 | model = Model(inputs=x_input, outputs=x, name='Progressive_Docking') 141 | return model 142 | 143 | @staticmethod 144 | def get_custom_objects(): 145 | return {"Lasso": Lasso} 146 | 147 | @staticmethod 148 | def get_available_modes(): 149 | modes = [] 150 | for attr in Models.__dict__: 151 | if attr[0] != '_' and attr != "get_custom_objects" and attr != "get_available_modes": 152 | modes.append(attr) 153 | return modes 154 | 155 | 156 | class TunerModel: 157 | def __init__(self, input_shape): 158 | self.input_shape = 
input_shape 159 | 160 | def build_tuner_model(self, hp): 161 | """ 162 | This method should be used with keras tuner. 163 | """ 164 | 165 | # Create the hyperparameters 166 | num_hidden_layers = hp.Int('hidden_layers', min_value=1, max_value=4, step=1) 167 | num_units = hp.Int("num_units", min_value=128, max_value=1024) 168 | dropout_rate = hp.Float("dropout_rate", min_value=0.0, max_value=0.8) 169 | learning_rate = hp.Float('learning_rate', min_value=0.00001, max_value=0.001) 170 | epsilon = hp.Float('epsilon', min_value=1e-07, max_value=1e-05) 171 | kernel_reg_func = [None, Lasso, l1, l2][hp.Choice("kernel_reg", values=[0, 1, 2, 3])] 172 | reg_amount = hp.Float("reg_amount", min_value=0.0, max_value=0.001, step=0.0001) 173 | 174 | # Determine how the layer(s) are shared 175 | share_layer = hp.Boolean("shared_layer") 176 | if share_layer: 177 | share_all = hp.Boolean("share_all") 178 | shared_layer_units = hp.Int("num_units", min_value=128, max_value=1024) 179 | shared_layer = Dense(shared_layer_units, name="shared_hidden_layer") 180 | if not share_all: 181 | where_to_share = set() 182 | layer_connections = hp.Int("num_shared_layer_connections", min_value=1, max_value=num_hidden_layers) 183 | for layer in range(layer_connections): 184 | where_to_share.add(hp.Int("where_to_share", min_value=0, max_value=num_hidden_layers, step=1)) 185 | 186 | # Build the model according to the hyperparameters 187 | inputs = Input(shape=self.input_shape, name="input") 188 | x = inputs 189 | 190 | # Determine number of hidden layers 191 | for layer_num in range(num_hidden_layers): 192 | # If we are not using a kernel regulation function or not... 193 | if kernel_reg_func is None: 194 | x = Dense(num_units, name="dense_" + str(layer_num))(x) 195 | else: 196 | x = Dense(num_units, kernel_regularizer=kernel_reg_func(reg_amount), name="dense_" + str(layer_num))(x) 197 | 198 | # If we are using a common shared layer, then connect it. 
199 | if (share_layer and share_all) or (share_layer and layer_num in where_to_share): 200 | x = BatchNormalization()(x) 201 | x = Activation('relu')(x) 202 | x = Dropout(dropout_rate)(x) 203 | x = shared_layer(x) 204 | 205 | # Apply these to every layer 206 | x = BatchNormalization()(x) 207 | x = Activation('relu')(x) 208 | x = Dropout(dropout_rate)(x) 209 | 210 | outputs = Dense(1, activation='sigmoid', name="output_layer")(x) 211 | model = Model(inputs=inputs, outputs=outputs) 212 | model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon), 213 | loss=keras.losses.BinaryCrossentropy(), 214 | metrics=['accuracy', "AUC", "Precision", "Recall", DDMetrics.scaled_performance]) 215 | 216 | print(model.summary()) 217 | return model 218 | 219 | 220 | -------------------------------------------------------------------------------- /phase_2-3/ML/Parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | Version 1.1.2 3 | 4 | Used to read .DDSS files which are useful for monitoring model progress 5 | """ 6 | import pandas as pd 7 | import numpy as np 8 | 9 | 10 | class Parser: 11 | 12 | @staticmethod 13 | def parse_ddss(path): 14 | 15 | architecture = {} 16 | hyperparameters = {} 17 | history = {} 18 | time = {} 19 | info = {'time': time, 'history': history, 'hyperparameters': hyperparameters, 'architecture': architecture} 20 | 21 | with open(path, 'r') as ddss_file: 22 | lines = ddss_file.readlines() 23 | lines.remove('\n') 24 | 25 | for i, line in enumerate(lines): 26 | line = line.strip('\n') 27 | 28 | # Get the model name 29 | if 'Model mode' in line: 30 | info['name'] = line.split()[-1] 31 | 32 | # Get the model timings 33 | if 'training_time' in line: 34 | split_line = line.split() 35 | time['training_time'] = float(split_line[-1]) 36 | if 'prediction_time' in line: 37 | split_line = line.split() 38 | time['prediction_time'] = float(split_line[-1]) 39 | 40 | # Get the history stats 41 | if 'History Stats' in line: 42 | # Grab everything under the history set 43 | for sub_line in lines[i + 1:]: # search the sub lines under history 44 | if '-' not in sub_line or 'Model has not been trained yet' in sub_line: 45 | break 46 | else: # Split up the lines and stores the values 47 | split_line = sub_line.split()[1:] 48 | history_key = split_line[0].replace(":", "") 49 | 50 | value = [] 51 | for v in split_line[1:]: 52 | value.append(float(v.strip(",").strip('[').strip(']'))) 53 | 54 | # If the list has one value, it should be closed to a scalar 55 | if len(value) == 1: 56 | history[history_key] = value[0] 57 | else: 58 | history[history_key] = value 59 | 60 | # Get the history stats 61 | if 'Hyperparameter Stats' in line: 62 | # search the sub lines under history 63 | for sub_line in lines[i + 1:]: 64 | if '-' not in sub_line or 'Model has not been trained yet' in sub_line: 65 | break 66 | else: 67 | sub_line = sub_line.strip(" - ").strip("\n").strip(" ").split(":") 68 | key = sub_line[0].strip(" ") 69 | value = sub_line[1].strip(" ") 70 | 71 | if '[' in value: 72 | value_list = [] 73 | for char in value: 74 | if char.isnumeric(): 75 | value_list.append(int(char)) 76 | value = value_list 77 | else: 78 | try: 79 | value = float(value) 80 | except ValueError: 81 | # If this value error occurs, it is because it has found the non-decimal 82 | # hyperparameters 83 | value = value 84 | 85 | hyperparameters[key] = value 86 | 87 | if 'total_params' in line or 'trainable_params' in line or 'total_params' in line: 88 | if "Cannot be 
determined" not in line: 89 | sub_line = line.strip(" - ").strip("\n").strip(" ").split(":") 90 | architecture[sub_line[0]] = int(sub_line[1].replace(",", "")) 91 | 92 | return info -------------------------------------------------------------------------------- /phase_2-3/ML/Tokenizer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Used during testing to encode smiles as features 3 | This is deprecated. 4 | """ 5 | 6 | from tensorflow.keras.preprocessing.text import Tokenizer 7 | from tensorflow.keras.preprocessing.sequence import pad_sequences 8 | import numpy as np 9 | 10 | 11 | class DDTokenizer: 12 | def __init__(self, num_words, oov_token=''): 13 | self.tokenizer = Tokenizer(num_words=num_words, 14 | oov_token=oov_token, 15 | filters='!"#$%&*+,-./:;<>?\\^_`{|}~\t\n', 16 | char_level=True, 17 | lower=False) 18 | self.has_trained = False 19 | 20 | self.pad_type = 'post' 21 | self.trunc_type = 'post' 22 | 23 | # The encoded data 24 | self.word_index = {} 25 | 26 | def fit(self, train_data): 27 | # Get max training sequence length 28 | print("Training Tokenizer...") 29 | self.tokenizer.fit_on_texts(train_data) 30 | self.has_trained = True 31 | print("Done training...") 32 | 33 | # Get our training data word index 34 | self.word_index = self.tokenizer.word_index 35 | 36 | def encode(self, data, use_padding=True, padding_size=None, normalize=False): 37 | # Encode training data sentences into sequences 38 | train_sequences = self.tokenizer.texts_to_sequences(data) 39 | 40 | # Get max training sequence length if there is none passed 41 | if padding_size is None: 42 | maxlen = max([len(x) for x in train_sequences]) 43 | else: 44 | maxlen = padding_size 45 | 46 | if use_padding: 47 | train_sequences = pad_sequences(train_sequences, padding=self.pad_type, 48 | truncating=self.trunc_type, maxlen=maxlen) 49 | 50 | if normalize: 51 | train_sequences = np.multiply(1/len(self.tokenizer.word_index), train_sequences) 52 | 53 | return train_sequences 54 | 55 | def pad(self, data, padding_size=None): 56 | # Get max training sequence length if there is none passed 57 | if padding_size is None: 58 | padding_size = max([len(x) for x in data]) 59 | 60 | padded_sequence = pad_sequences(data, padding=self.pad_type, 61 | truncating=self.trunc_type, maxlen=padding_size) 62 | 63 | return padded_sequence 64 | 65 | def decode(self, array): 66 | assert self.has_trained, "Train this tokenizer before decoding a string." 67 | return self.tokenizer.sequences_to_texts(array) 68 | 69 | def test(self, string): 70 | encoded = list(self.encode(string)[0]) 71 | decoded = self.decode(self.encode(string)) 72 | 73 | print("\nEncoding:") 74 | print("{original} -> {encoded}".format(original=string[0], encoded=encoded)) 75 | print("\nDecoding:") 76 | print("{original} -> {encoded}".format(original=encoded, encoded=decoded[0].replace(" ", ""))) 77 | 78 | def get_info(self): 79 | return self.tokenizer.index_word 80 | 81 | -------------------------------------------------------------------------------- /phase_2-3/ML/lasso_regularizer.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from tensorflow.keras import backend as K 3 | from tensorflow.keras.regularizers import Regularizer 4 | 5 | 6 | class Lasso(Regularizer): 7 | """Regularizer for L21 regularization. 8 | # Arguments 9 | C: Float; L21 regularization factor. 
10 | """ 11 | 12 | def __init__(self, C=0.): 13 | self.C = K.cast_to_floatx(C) 14 | 15 | def __call__(self, x): 16 | const_coeff = np.sqrt(K.int_shape(x)[1]) 17 | return self.C*const_coeff*K.sum(K.sqrt(K.sum(K.square(x), axis=1))) 18 | 19 | def get_config(self): 20 | return {'C': float(self.C)} -------------------------------------------------------------------------------- /phase_2-3/Prediction_morgan_1024.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import glob 3 | import os 4 | import time 5 | import warnings 6 | import numpy as np 7 | import pandas as pd 8 | from ML.DDModel import DDModel 9 | 10 | try: 11 | import __builtin__ 12 | except ImportError: 13 | # Python 3 14 | import builtins as __builtin__ 15 | 16 | # For debugging purposes only: 17 | def print(*args, **kwargs): 18 | __builtin__.print('\t sampling: ', end="") 19 | return __builtin__.print(*args, **kwargs) 20 | 21 | warnings.filterwarnings('ignore') 22 | 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('-fn','--fn', required=True) 25 | parser.add_argument('-protein', '--protein', required=True) 26 | parser.add_argument('-it', '--it', required=True) 27 | parser.add_argument('-file_path', '--file_path', required=True) 28 | parser.add_argument('-mdd', '--morgan_directory', required=True) 29 | 30 | io_args = parser.parse_args() 31 | fn = io_args.fn 32 | protein = str(io_args.protein) 33 | it = int(io_args.it) 34 | file_path = io_args.file_path 35 | mdd = io_args.morgan_directory 36 | 37 | # This debug feature will allow for speedy testing 38 | DEBUG=False 39 | def prediction_morgan(fname, models, thresh): # TODO: improve runtime with parallelization across multiple nodes 40 | print("Starting Predictions...") 41 | t = time.time() 42 | per_time = 1000000 43 | n_features = 1024 44 | z_id = [] 45 | X_set = np.zeros([per_time, n_features]) 46 | total_passed = 0 47 | 48 | print("We are predicting from the file", fname, "located in", mdd) 49 | with open(mdd+'/'+fname,'r') as ref: 50 | no = 0 51 | for line in ref: 52 | tmp = line.rstrip().split(',') 53 | on_bit_vector = tmp[1:] 54 | z_id.append(tmp[0]) 55 | for elem in on_bit_vector: 56 | X_set[no,int(elem)] = 1 57 | no+=1 58 | if no == per_time: 59 | X_set = X_set[:no, :] 60 | pred = [] 61 | print("We are currently running line", line) 62 | print("(1) Predicting... Time elapsed:", time.time() - t, "seconds.") 63 | for model in models: 64 | pred.append(model.predict(X_set)) 65 | 66 | with open(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions/'+fname, 'a') as ref: 67 | for j in range(len(pred[0])): 68 | is_pass = 0 69 | for i,thr in enumerate(thresh): 70 | if float(pred[i][j])>thr: 71 | is_pass += 1 72 | if is_pass >= 1: 73 | total_passed += 1 74 | line = z_id[j]+','+str(float(pred[i][j]))+'\n' 75 | ref.write(line) 76 | X_set = np.zeros([per_time,n_features]) 77 | z_id = [] 78 | no = 0 79 | 80 | # With debug, we will only predict on 'per_time' molecules 81 | if DEBUG: 82 | break 83 | 84 | if no != 0: 85 | X_set = X_set[:no,:] 86 | pred = [] 87 | print("We are currently running line", line) 88 | print("(2) Predicting... 
Time elapsed:", time.time() - t, "seconds.") 89 | for model in models: 90 | pred.append(model.predict(X_set)) 91 | with open(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions/'+fname, 'a') as ref: 92 | for j in range(len(pred[0])): 93 | is_pass = 0 94 | for i,thr in enumerate(thresh): 95 | if float(pred[i][j])>thr: 96 | is_pass+=1 97 | if is_pass>=1: 98 | total_passed+=1 99 | line = z_id[j]+','+str(float(pred[i][j]))+'\n' 100 | ref.write(line) 101 | print("Prediction time:", time.time() - t) 102 | return total_passed 103 | 104 | 105 | try: 106 | os.mkdir(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions') 107 | except OSError: 108 | print(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions', "already exists") 109 | 110 | thresholds = pd.read_csv(file_path+'/iteration_'+str(it)+'/best_models/thresholds.txt', header=None) 111 | thresholds.columns = ['model_no', 'thresh', 'cutoff'] 112 | 113 | tr = [] 114 | models = [] 115 | for f in glob.glob(file_path+'/iteration_'+str(it)+'/best_models/model_*'): 116 | if "." not in f: # skipping over the .ddss & .csv files 117 | mn = int(f.split('/')[-1].split('_')[1]) 118 | tr.append(thresholds[thresholds.model_no == mn].thresh.iloc[0]) 119 | models.append(DDModel.load(file_path+'/iteration_'+str(it)+'/best_models/model_'+str(mn))) 120 | 121 | print("Number of models to predict:", len(models)) 122 | t = time.time() 123 | returned = prediction_morgan(fn, models, tr) 124 | print(time.time()-t) 125 | 126 | with open(file_path+'/iteration_'+str(it)+'/morgan_1024_predictions/passed_file_ct.txt','a') as ref: 127 | ref.write(fn+','+str(returned)+'\n') 128 | -------------------------------------------------------------------------------- /phase_2-3/activation_script.sh: -------------------------------------------------------------------------------- 1 | echo Activating virtual environment 2 | source ~/.bashrc 3 | 4 | # Activate Your conda environment here: 5 | conda activate MyCondaEnvironment 6 | -------------------------------------------------------------------------------- /phase_2-3/hyperparameter_result_evaluation.py: -------------------------------------------------------------------------------- 1 | import builtins as __builtin__ 2 | import argparse 3 | 4 | import pandas as pd 5 | import numpy as np 6 | import glob 7 | import os 8 | from ML.DDModel import DDModel 9 | from sklearn.metrics import auc 10 | from sklearn.metrics import precision_recall_curve,roc_curve, precision_score, recall_score 11 | from shutil import copy2 12 | 13 | import warnings 14 | warnings.filterwarnings('ignore') 15 | 16 | 17 | # For debugging purposes only: 18 | def print(*args, **kwargs): 19 | __builtin__.print('\t eval_v2: ', end="") 20 | return __builtin__.print(*args, **kwargs) 21 | 22 | 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('-n_it','--n_iteration',required=True,help='Number of current iteration') 25 | parser.add_argument('-d_path','--data_path',required=True,help='Path to project folder, including prject folder name') 26 | parser.add_argument('-mdd','--morgan_directory',required=True,help='Path to Morgan fingerprint directory for the database') 27 | 28 | # adding parameter for where to save all the data to: 29 | parser.add_argument('-s_path', '--save_path', required=False, default=None) 30 | 31 | # allowing for variable number of molecules to train from: 32 | parser.add_argument('-n_mol', '--number_mol', required=False, default=3000000, help='Size of test/validation set to be used') 33 | parser.add_argument('-cont', 
'--continuous', required=False, action='store_true') # Using binary or continuous labels 34 | parser.add_argument('-smile', '--smiles', required=False, action='store_true') # Using smiles or morgan as or continuous labels 35 | 36 | io_args = parser.parse_args() 37 | n_iteration = int(io_args.n_iteration) 38 | mdd = io_args.morgan_directory 39 | num_molec = int(io_args.number_mol) 40 | 41 | DATA_PATH = io_args.data_path # Now == file_path/protein 42 | SAVE_PATH = io_args.save_path 43 | # if no save path is provided we just save it in the same location as the data 44 | if SAVE_PATH is None: SAVE_PATH = DATA_PATH 45 | 46 | 47 | print("Done importing.") 48 | 49 | # Gets the total number of molecules (SEE: simple_job_models.py, line 32) 50 | total_mols = pd.read_csv(mdd+'/Mol_ct_file.csv',header=None)[[0]].sum()[0]/1000000 51 | 52 | # reading in the file created in progressive_docking.py (line 456) 53 | hyperparameters = pd.read_csv(SAVE_PATH+'/iteration_'+str(n_iteration)+'/hyperparameter_morgan_with_freq_v3.csv',header=None) 54 | 55 | # theses are also declared in progressive_docking.py 56 | ### TODO: add these columns in progressive_docking.py as a header instead of declaring them here (Line 456) 57 | hyperparameters.columns = ['Model_no','Over_sampling','Batch_size','Learning_rate','N_layers','N_units','dropout', 58 | 'weight','cutoff','ROC_AUC','Pr_0_9','tot_left_0_9_mil','auc_te','pr_te','re_te','tot_left_0_9_mil_te','tot_positives'] 59 | 60 | # Converting total to per million 61 | hyperparameters.tot_left_0_9_mil = hyperparameters.tot_left_0_9_mil/1000000 62 | hyperparameters.tot_left_0_9_mil_te = hyperparameters.tot_left_0_9_mil_te/1000000 ## What are these? it is never used... 63 | 64 | hyperparameters['re_vl/re_pr'] = 0.9/hyperparameters.re_te # ratio of 90% recall and the recall of the test set 65 | 66 | print('hyp dataframe:', hyperparameters.head()) 67 | 68 | df_grouped_cf = hyperparameters.groupby('cutoff') # Groups them according to cutoff values for calculations 69 | 70 | cf_values = {} # Cutoff values (thresholds for validation set virtual hits) 71 | 72 | print('Got Hyperparams') 73 | # Looping through each group and printing mean and std for that particular cuttoff value 74 | for mini_df in df_grouped_cf: 75 | print(mini_df[0]) # the cutoff value for the group 76 | print(mini_df[1]['re_vl/re_pr'].mean()) 77 | print(mini_df[1]['re_vl/re_pr'].std()) 78 | cf_values[mini_df[0]] = mini_df[1]['re_vl/re_pr'].std() 79 | 80 | print('cf_values:', cf_values) 81 | 82 | model_to_use_with_cf = [] 83 | ind_pr = [] 84 | for cf in cf_values: 85 | models = hyperparameters[hyperparameters.cutoff == cf] # gets all models matching that cutoff 86 | n_models = len(models) 87 | thr = 0.9 # The recall for true positives ### TODO: make this a input for the script 88 | 89 | # Decreasing the threshold until a quarter of the models have recall values greater than it 90 | while len(models[models.re_te >= thr]) <= n_models//4: 91 | thr -= 0.01 92 | 93 | models = models[models.re_te >= thr] 94 | models = models.sort_values('pr_te', ascending=False) # Sorting in descending order 95 | 96 | if cf_values[cf] < 0.01: # Checks to see if std is less than 0.01 97 | model_to_use_with_cf.append([cf, models.Model_no[:1].values]) 98 | ind_pr.append([cf, models.pr_te[:1].values]) 99 | else: 100 | # When std is high we use 3 models to get a better idea of its perfomance as a whole 101 | model_to_use_with_cf.append([cf, models.Model_no[:3].values]) 102 | ind_pr.append([cf, models.pr_te[:3].values]) 103 | 104 | 
print(model_to_use_with_cf) # [ [cf_1, [model_no_1, ...]], [cf_2, [model_no_1, ...]] ... ] 105 | print(ind_pr) # printed for viewing progress? 106 | 107 | 108 | def get_all_x_data(morgan_path, ID_labels): # ID_labels is a dataframe containing the zincIDs and their corresponding labels. 109 | train_set = np.zeros([num_molec,1024], dtype=bool) # using bool to save space 110 | train_id = [] 111 | 112 | print('x data from:', morgan_path) 113 | with open(morgan_path,'r') as ref: 114 | line_no=0 115 | for line in ref: 116 | mol_info=line.rstrip().split(',') 117 | train_id.append(mol_info[0]) 118 | 119 | # "Decompressing" the information from the file about where the 1s are on the 1024 bit vector. 120 | bit_indicies = mol_info[1:] # array of indexes of the binary 1s in the 1024 bit vector representing the morgan fingerprint 121 | for elem in bit_indicies: 122 | train_set[line_no,int(elem)] = 1 123 | 124 | line_no+=1 125 | 126 | train_set = train_set[:line_no,:] 127 | 128 | print('Done...') 129 | train_pd = pd.DataFrame(data=train_set, dtype=np.uint8) 130 | train_pd['ZINC_ID'] = train_id 131 | 132 | score_col = ID_labels.columns.difference(['ZINC_ID'])[0] 133 | train_data = pd.merge(ID_labels, train_pd, how='inner',on=['ZINC_ID']) 134 | X_data = train_data[train_data.columns.difference(['ZINC_ID', score_col])].values # input 135 | y_data = train_data[[score_col]].values # labels 136 | return X_data, y_data 137 | 138 | 139 | # This function gets all the zinc ids their corresponding labels into a pd.dataframe 140 | def get_zinc_and_labels(zinc_path, labels_path): 141 | ids = [] 142 | with open(zinc_path,'r') as ref: 143 | for line in ref: 144 | ids.append(line.split(',')[0]) 145 | zincIDs = pd.DataFrame(ids, columns=['ZINC_ID']) 146 | 147 | labels_df = pd.read_csv(labels_path, header=0) 148 | combined_df = pd.merge(labels_df, zincIDs, how='inner', on=['ZINC_ID']) 149 | return combined_df.set_index('ZINC_ID') 150 | 151 | 152 | main_path = DATA_PATH+'/iteration_1' 153 | print('Geting zinc ids and labels') 154 | # Creating the test, and validation data from the first iteration: 155 | zinc_labels_valid = get_zinc_and_labels(main_path + '/morgan/valid_morgan_1024_updated.csv', main_path +'/validation_labels.txt') 156 | zinc_labels_test = get_zinc_and_labels(main_path + '/morgan/test_morgan_1024_updated.csv', main_path +'/testing_labels.txt') 157 | 158 | print('Generating test, and valid data') 159 | # Getting the x data from the zinc ids (x=input, y=labels) 160 | # decompresses the indexes to 1024 bit vector 161 | X_valid, y_valid = get_all_x_data(main_path +'/morgan/valid_morgan_1024_updated.csv', zinc_labels_valid) 162 | X_test, y_test = get_all_x_data(main_path +'/morgan/test_morgan_1024_updated.csv', zinc_labels_test) 163 | 164 | print('Reducing models,', len(model_to_use_with_cf)) 165 | for i in range(len(model_to_use_with_cf)): 166 | cf = model_to_use_with_cf[i][0] 167 | n_good_mol = len([x for x in y_valid if x < cf]) # num of molec that are under the cutoff value 168 | print('\t',cf,'cf:',len(model_to_use_with_cf[i])) 169 | 170 | # If not enough molecules exceed the cutoff then we only save one of the models and ignore the rest ## Why is this done? 171 | if n_good_mol <= 10000: 172 | model_to_use_with_cf[i][1] = [model_to_use_with_cf[i][1][0]] # Still maintaing the format: [ [cf_1, [model_no_1, ...]], [cf_2, [model_no_1, ...]] ... 
] 173 | 174 | 175 | cf_with_left = {} 176 | main_thresholds = {} 177 | all_sc = {} 178 | path_to_model = SAVE_PATH+'/iteration_'+str(n_iteration)+'/all_models/' 179 | 180 | print('Model_to_use_with_cf:', model_to_use_with_cf) 181 | for i in range(len(model_to_use_with_cf)): 182 | cf = model_to_use_with_cf[i][0] 183 | 184 | # y_train 0.90)[0][-1]]) 211 | main_thresholds[cf] = tr 212 | 213 | print('tr:', tr) 214 | 215 | prediction_test = [] 216 | for model in models: 217 | model_pred = model.predict(X_test) 218 | if model.output_activation == 'linear': 219 | # Converting back to binary values to get stats 220 | model_pred = model_pred < cf 221 | prediction_test.append(model_pred) 222 | 223 | # Calculating the average prediction across the consensus of models. #TODO: change this when dealing with continuous 224 | avg_pred = np.zeros([len(prediction_test[0]),]) 225 | for i in range(len(prediction_test)): 226 | # determining if the model prediction exceeds the threshold and adding the result (1 or 0) 227 | avg_pred += (prediction_test[i] >= tr[i]).reshape(-1,) 228 | 229 | print('avg_pred:', avg_pred) 230 | avg_pred = avg_pred > (len(models)//2) # if greater than 50% of the models agree there would be a hit then that is our consensus value 231 | print('avg_pred:', avg_pred) 232 | 233 | if len(models) > 1: 234 | fpr_te_avg, tpr_te_avg, thresh_te_avg = roc_curve(y_test_cf, avg_pred) 235 | else: 236 | fpr_te_avg, tpr_te_avg, thresh_te_avg = roc_curve(y_test_cf, prediction_test[0]) 237 | 238 | pr_te_avg = precision_score(y_test_cf, avg_pred) 239 | re_te_avg = recall_score(y_test_cf, avg_pred) # TODO: make sure avg_pred is calc properly 240 | 241 | auc_te_avg = auc(fpr_te_avg,tpr_te_avg) 242 | pos_ct_orig = np.sum(y_test_cf) 243 | t_train_mol = len(y_test) 244 | 245 | print(cf,re_te_avg,pr_te_avg,auc_te_avg,pos_ct_orig/t_train_mol) 246 | 247 | total_left_te = re_te_avg*pos_ct_orig/pr_te_avg*total_mols*1000000/t_train_mol # molecules left 248 | all_sc[cf] = [cf,re_te_avg,pr_te_avg,auc_te_avg,total_left_te] 249 | cf_with_left[cf] = total_left_te 250 | 251 | print(cf_with_left) 252 | 253 | min_left_cf = total_mols*1000000 254 | cf_to_use = 0 255 | # cf_to_use is the cf for the lowest num molec 256 | for key in cf_with_left: 257 | if cf_with_left[key] <= min_left_cf: 258 | min_left_cf = cf_with_left[key] 259 | cf_to_use = key 260 | 261 | print('cf_to_use:', cf_to_use) 262 | print('min_left_cf:', min_left_cf) 263 | print('all_sc:', all_sc) 264 | 265 | with open(SAVE_PATH+'/iteration_'+str(n_iteration)+'/best_model_stats.txt', 'w') as ref: 266 | cf,re,pr,auc,tot_le = all_sc[cf_to_use] 267 | m_string = "* Best Model Stats * \n" 268 | m_string += "_" * 20 + "\n" 269 | m_string += "- Model Cutoff: " + str(cf) + "\n" 270 | m_string += "- Model Precision: " + str(pr) + "\n" 271 | m_string += "- Model Recall: " + str(re) + "\n" 272 | m_string += "- Model Auc: " + str(auc) + "\n" 273 | m_string += "- Total Left Testing: " + str(tot_le) + "\n" 274 | 275 | ref.write(m_string) 276 | 277 | try: 278 | os.mkdir(SAVE_PATH+'/iteration_'+str(n_iteration)+'/best_models') 279 | except OSError: # catching file exists error 280 | pass 281 | 282 | for models_cf in model_to_use_with_cf: # looping through the groups of models that match that cutoff 283 | if models_cf[0] == cf_to_use: 284 | count = 0 285 | # Looping through all the models in that group 286 | for mod_no in models_cf[-1]: 287 | with open(SAVE_PATH+'/iteration_'+str(n_iteration)+'/best_models/thresholds.txt', 'w') as ref: 288 | 
ref.write(str(mod_no)+','+str(main_thresholds[cf_to_use][count])+','+str(cf_to_use)+'\n') 289 | 290 | # Copying the specific models by their model_no to the best_models folder 291 | copy2(path_to_model+'/model_'+str(mod_no), SAVE_PATH+'/iteration_'+str(n_iteration) + '/best_models/') 292 | copy2(path_to_model+'/model_'+str(mod_no) + ".ddss", SAVE_PATH+'/iteration_'+str(n_iteration) + '/best_models/') 293 | 294 | count += 1 295 | 296 | 297 | # Deleting all other models that are not in the model_to_use array 298 | # placed at the bottom so we don't lose all our data each time we run phase 5 and it fails. 299 | for f in glob.glob(SAVE_PATH+'/iteration_'+str(n_iteration)+'/all_models/*'): 300 | try: 301 | mn = int(f.split('/')[-1].split('_')[1]) 302 | except: 303 | mn = int(f.split('/')[-1].split('_')[1].split('.')[0]) 304 | found = False 305 | for models in model_to_use_with_cf: 306 | if mn in models[-1]: 307 | found = True 308 | break 309 | if not found and "." in f: 310 | os.remove(f) 311 | -------------------------------------------------------------------------------- /phase_2-3/progressive_docking.py: -------------------------------------------------------------------------------- 1 | """ 2 | V 2.2.1 3 | """ 4 | 5 | import argparse 6 | import glob 7 | import gc 8 | import os 9 | import random 10 | import sys 11 | import time 12 | 13 | import numpy as np 14 | import pandas as pd 15 | from sklearn.metrics import auc 16 | from sklearn.metrics import precision_recall_curve, roc_curve 17 | from tensorflow.keras.callbacks import Callback 18 | from tensorflow.keras.callbacks import EarlyStopping 19 | from ML.DDModel import DDModel 20 | from ML.DDCallbacks import DDLogger 21 | 22 | START_TIME = time.time() 23 | print("Parsing args...") 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('-num_units','--nu',required=True) 26 | parser.add_argument('-dropout','--df',required=True) 27 | parser.add_argument('-learn_rate','--lr',required=True) 28 | parser.add_argument('-bin_array','--ba',required=True) 29 | parser.add_argument('-wt','--wt',required=True) 30 | parser.add_argument('-cf','--cf',required=True) 31 | parser.add_argument('-rec','--rec',required=True) 32 | parser.add_argument('-n_it','--n_it',required=True) 33 | parser.add_argument('-t_mol','--t_mol',required=True) 34 | parser.add_argument('-bs','--bs',required=True) 35 | parser.add_argument('-os','--os',required=True) 36 | parser.add_argument('-d_path','--data_path',required=True) # new! 37 | 38 | # adding parameter for where to save all the data to: 39 | parser.add_argument('-s_path', '--save_path', required=False, default=None) 40 | 41 | # allowing for variable number of molecules to validate and test from: 42 | parser.add_argument('-n_mol', '--number_mol', required=False, default=1000000) 43 | parser.add_argument('-t_n_mol', '--train_num_mol', required=False, default=-1) 44 | parser.add_argument('-cont', '--continuous', required=False, action='store_true') # Using binary or continuous labels 45 | parser.add_argument('-smile', '--smiles', required=False, action='store_true') # Using smiles or morgan as or continuous labels 46 | parser.add_argument('-norm', '--normalization', required=False, action='store_false') # if continuous labels are used -> normalize them? 
47 | 48 | io_args = parser.parse_args() 49 | 50 | print(sys.argv) 51 | 52 | nu = int(io_args.nu) 53 | df = float(io_args.df) 54 | lr = float(io_args.lr) 55 | ba = int(io_args.ba) 56 | wt = float(io_args.wt) 57 | cf = float(io_args.cf) 58 | rec = float(io_args.rec) 59 | n_it = int(io_args.n_it) 60 | bs = int(io_args.bs) 61 | oss = int(io_args.os) 62 | t_mol = float(io_args.t_mol) 63 | 64 | CONTINUOUS = io_args.continuous 65 | NORMALIZE = io_args.normalization 66 | SMILES = io_args.smiles 67 | TRAINING_SIZE = int(io_args.train_num_mol) 68 | num_molec = int(io_args.number_mol) 69 | 70 | DATA_PATH = io_args.data_path # Now == file_path/protein 71 | SAVE_PATH = io_args.save_path 72 | # if no save path is provided we just save it in the same location as the data 73 | if SAVE_PATH is None: SAVE_PATH = DATA_PATH 74 | 75 | 76 | print(nu,df,lr,ba,wt,cf,bs,oss,DATA_PATH) 77 | if TRAINING_SIZE == -1: print("Training size not specified, using entire dataset...") 78 | print("Finished parsing args...") 79 | 80 | 81 | def encode_smiles(series): 82 | print("Encoding smiles") 83 | # parameter is a pd.series with ZINC_IDs as the indicies and smiles as the elements 84 | encoded_smiles = DDModel.process_smiles(series.values, 100, fit_range=100, use_padding=True, normalize=True) 85 | encoded_dict = dict(zip(series.keys(), encoded_smiles)) 86 | # returns a dict array of the smiles. 87 | return encoded_dict 88 | 89 | 90 | def get_oversampled_smiles(Oversampled_zid, smiles_series): 91 | # Must return a dictionary where the keys are the zids and the items are 92 | # numpy ndarrys with n numbers of the same encoded smile 93 | # the n comes from the number of times that particular zid was chosen at random. 94 | oversampled_smiles = {} 95 | encoded_smiles = encode_smiles(smiles_series) 96 | 97 | for key in Oversampled_zid.keys(): 98 | smile = encoded_smiles[key] 99 | oversampled_smiles[key] = np.repeat([smile], Oversampled_zid[key], axis=0) 100 | return oversampled_smiles 101 | 102 | 103 | def get_oversampled_morgan(Oversampled_zid, fname): 104 | print('x data from:', fname) 105 | # Gets only the morgan fingerprints of those randomly selected zinc ids 106 | with open(fname,'r') as ref: 107 | for line in ref: 108 | tmp=line.rstrip().split(',') 109 | 110 | # only extracting those that were randomly selected 111 | if (tmp[0] in Oversampled_zid.keys()) and (type(Oversampled_zid[tmp[0]]) != np.ndarray): 112 | train_set = np.zeros([1,1024], dtype=np.bool) 113 | on_bit_vector = tmp[1:] 114 | 115 | for elem in on_bit_vector: 116 | train_set[0,int(elem)] = 1 117 | 118 | # creates a n x 1024 numpy ndarray where n is the number of times that zinc id was randomly selected 119 | Oversampled_zid[tmp[0]] = np.repeat(train_set, Oversampled_zid[tmp[0]], axis=0) 120 | return Oversampled_zid 121 | 122 | 123 | def get_morgan_and_scores(morgan_path, ID_labels): 124 | # ID_labels is a dataframe containing the zincIDs and their corresponding scores. 125 | train_set = np.zeros([num_molec,1024], dtype=np.bool) # using bool to save space 126 | train_id = [] 127 | print('x data from:', morgan_path) 128 | with open(morgan_path,'r') as ref: 129 | line_no=0 130 | for line in ref: 131 | if line_no >= num_molec: 132 | break 133 | 134 | mol_info=line.rstrip().split(',') 135 | train_id.append(mol_info[0]) 136 | 137 | # "Decompressing" the information from the file about where the 1s are on the 1024 bit vector. 
138 | bit_indicies = mol_info[1:] # array of indexes of the binary 1s in the 1024 bit vector representing the morgan fingerprint 139 | for elem in bit_indicies: 140 | train_set[line_no,int(elem)] = 1 141 | 142 | line_no+=1 143 | 144 | train_set = train_set[:line_no,:] 145 | 146 | print('Done...') 147 | train_pd = pd.DataFrame(data=train_set, dtype=np.bool) 148 | train_pd['ZINC_ID'] = train_id 149 | 150 | ID_labels = ID_labels.to_frame() 151 | print(ID_labels.columns) 152 | score_col = ID_labels.columns.difference(['ZINC_ID'])[0] 153 | print(score_col) 154 | 155 | train_data = pd.merge(ID_labels, train_pd, how='inner',on=['ZINC_ID']) 156 | X_train = train_data[train_data.columns.difference(['ZINC_ID', score_col])].values # input 157 | y_train = train_data[[score_col]].values # labels 158 | return X_train, y_train 159 | 160 | 161 | # Gets the labels data 162 | def get_data(smiles_path, morgan_path, labels_path): 163 | # Loading the docking scores (with corresponding Zinc_IDs) 164 | labels = pd.read_csv(labels_path, sep=',', header=0) 165 | 166 | # Merging and setting index to the ID if smiles flag is set 167 | if SMILES: 168 | smiles = pd.read_csv(smiles_path, sep=' ', names=['smile', 'ZINC_ID']) 169 | data = smiles.merge(labels, on='ZINC_ID') 170 | else: 171 | morgan = pd.read_csv(morgan_path, usecols=[0], header=None, names=['ZINC_ID']) # reading in only the zinc ids 172 | data = morgan.merge(labels, on='ZINC_ID') 173 | data.set_index('ZINC_ID', inplace=True) 174 | return data 175 | 176 | 177 | n_iteration = n_it 178 | total_mols = t_mol 179 | 180 | try: 181 | os.mkdir(SAVE_PATH + '/iteration_'+str(n_iteration)+'/all_models') 182 | except OSError: 183 | pass 184 | 185 | 186 | # Getting data from prev iterations and this iteration 187 | data_from_prev = pd.DataFrame() 188 | train_data = pd.DataFrame() 189 | test_data = pd.DataFrame() 190 | valid_data = pd.DataFrame() 191 | y_valid_first = pd.DataFrame() 192 | y_test_first = pd.DataFrame() 193 | for i in range(1, n_iteration+1): 194 | 195 | # getting all the data 196 | print("\nGetting data from iteration", i) 197 | smiles_path = DATA_PATH + '/iteration_'+str(i)+'/smile/{}_smiles_final_updated.smi' 198 | morgan_path = DATA_PATH + '/iteration_'+str(i)+'/morgan/{}_morgan_1024_updated.csv' 199 | labels_path = DATA_PATH + '/iteration_'+str(i)+'/{}_labels.txt' 200 | 201 | # Resulting dataframe will have cols of smiles (if selected) and docking scores with an index of Zinc IDs 202 | train_data = get_data(smiles_path.format('train'), morgan_path.format('train'), labels_path.format('training')) 203 | test_data = get_data(smiles_path.format('test'), morgan_path.format('test'), labels_path.format('testing')) 204 | valid_data = get_data(smiles_path.format('valid'), morgan_path.format('valid'), labels_path.format('validation')) 205 | 206 | print("Data acquired...") 207 | print("Train shape:", train_data.shape, "Valid shape:", valid_data.shape, "Test shape:", test_data.shape) 208 | 209 | if i == 1: # for the first iteration we only add the training data 210 | # because test and valid from this iteration is used by all subsequent iterations (constant dataset). 
211 | 212 | y_test_first = test_data # test and valid should be seperate from training dataset 213 | y_valid_first = valid_data 214 | y_old = train_data 215 | elif i == n_iteration: break 216 | else: 217 | y_old = pd.concat([train_data, valid_data, test_data], axis=0) 218 | 219 | data_from_prev = pd.concat([y_old, data_from_prev], axis=0) 220 | 221 | print("Data Augmentation iteration {} data shape: {}".format(i, data_from_prev.shape)) 222 | 223 | # Always using the same valid and test dataset across all iterations: 224 | if n_iteration != 1: 225 | train_data = pd.concat([train_data, test_data, valid_data], axis=0) # These datasets are from the current iteration. 226 | train_data = pd.concat([train_data, data_from_prev]) # combining all the datasets into a single training set for iterations after the first 227 | 228 | print("Training labels shape: ", train_data.shape) 229 | 230 | # Exiting if there are not enough hits 231 | if (y_valid_first.r_i_docking_score < cf).values.sum() <= 10 or \ 232 | (y_test_first.r_i_docking_score < cf).values.sum() <= 10: 233 | print("There are not enough hits... exiting.") 234 | sys.exit() 235 | 236 | 237 | if CONTINUOUS: 238 | print('Using continuous labels...') 239 | y_valid = valid_data.r_i_docking_score 240 | y_test = test_data.r_i_docking_score 241 | y_train = train_data.r_i_docking_score 242 | 243 | if NORMALIZE: 244 | print('Adding cutoff to be normalized') 245 | cutoff_ser = pd.Series([cf], index=['cutoff']) 246 | y_train = y_train.append(cutoff_ser) 247 | 248 | print("Normalizing docking scores...") 249 | # Normalize the docking scores 250 | y_valid = DDModel.normalize(y_valid) 251 | y_test = DDModel.normalize(y_test) 252 | y_train = DDModel.normalize(y_train) 253 | 254 | print('Extracting normalized cutoff...') 255 | cf_norm = y_train['cutoff'] 256 | y_train.drop(labels=['cutoff'], inplace=True) # removing it from the dataset 257 | 258 | cf_to_use = cf_norm 259 | else: 260 | cf_to_use = cf 261 | 262 | # Getting all the ids of hits and non hits. 263 | y_pos = y_train[y_train < cf_to_use] 264 | y_neg = y_train[y_train >= cf_to_use] 265 | 266 | else: 267 | print('Using binary labels...') 268 | # valid and testing data is from the first iteration. 269 | y_valid = y_valid_first.r_i_docking_score < cf 270 | y_test = y_test_first.r_i_docking_score < cf 271 | y_train = train_data.r_i_docking_score < cf 272 | 273 | # Getting all the ids of hits and non hits. 
274 | y_pos = y_train[y_train == 1] # true 275 | y_neg = y_train[y_train == 0] # false 276 | 277 | print('Converting y_pos and y_neg to dict (for faster access time)') 278 | y_pos = y_pos.to_dict() 279 | y_neg = y_neg.to_dict() 280 | 281 | num_neg = len(y_neg) 282 | num_pos = len(y_pos) 283 | 284 | sample_size = np.min([num_neg, num_pos*oss]) 285 | # //2 because we sample 1 from pos and 1 from neg: 286 | if TRAINING_SIZE != -1: sample_size = TRAINING_SIZE//2 287 | 288 | print("\nOversampling...", "size:", sample_size) 289 | print("\tNum pos: {} \n\tNum neg: {}".format(num_pos, num_neg)) 290 | Oversampled_zid = {} # Keeps track of how many times that zinc_id is randomly selected 291 | Oversampled_zid_y = {} 292 | 293 | pos_keys = list(y_pos.keys()) 294 | neg_keys = list(y_neg.keys()) 295 | 296 | for i in range(sample_size): 297 | # Randomly sampling equal number of hits and misses: 298 | idx = random.randint(0, num_pos-1) 299 | idx_neg = random.randint(0, num_neg-1) 300 | pos_zid = pos_keys[idx] 301 | neg_zid = neg_keys[idx_neg] 302 | 303 | # Adding both pos and neg to the dictionary 304 | try: 305 | Oversampled_zid[pos_zid] += 1 306 | except KeyError: 307 | Oversampled_zid[pos_zid] = 1 308 | Oversampled_zid_y[pos_zid] = y_pos[pos_zid] 309 | 310 | try: 311 | Oversampled_zid[neg_zid] += 1 312 | except KeyError: 313 | Oversampled_zid[neg_zid] = 1 314 | Oversampled_zid_y[neg_zid] = y_neg[neg_zid] 315 | 316 | # Getting the inputs 317 | if SMILES: 318 | print("Using smiles...") 319 | X_valid, y_valid = np.array(encode_smiles(valid_data.smile).values()), y_valid.to_numpy() 320 | X_test, y_test = np.array(encode_smiles(test_data).values()), y_test.to_numpy() 321 | 322 | # The training data needs to be oversampled: 323 | print("Getting oversampled smiles...") 324 | Oversampled_zid = get_oversampled_smiles(Oversampled_zid, train_data.smile) 325 | Oversampled_X_train = np.zeros([sample_size*2, len(list(Oversampled_zid.values())[0][0])], dtype=np.bool) 326 | print(len(list(Oversampled_zid.values())[0])) 327 | else: 328 | Oversampled_X_train = np.zeros([sample_size*2, 1024], dtype=np.bool) 329 | print('Using morgan fingerprints...') 330 | # this part is what gets the morgan fingerprints: 331 | print('looking through file path:', DATA_PATH + '/iteration_'+str(n_iteration)+'/morgan/*') 332 | for i in range(1, n_iteration+1): 333 | for f in glob.glob(DATA_PATH + '/iteration_'+str(i)+'/morgan/*'): 334 | set_name = f.split('/')[-1].split('_')[0] 335 | print('\t', set_name) 336 | # Valid and test datasets are always going to be from the first iteration. 
337 | if i == 1: 338 | if set_name == 'valid': 339 | X_valid, y_valid = get_morgan_and_scores(f, y_valid) 340 | elif set_name == 'test': 341 | X_test, y_test = get_morgan_and_scores(f, y_test) 342 | 343 | # Fills the dictionary with the actual morgan fingerprints 344 | Oversampled_zid = get_oversampled_morgan(Oversampled_zid, f) 345 | 346 | print("y validation shape:", y_valid.shape) 347 | 348 | ct = 0 349 | Oversampled_y_train = np.zeros([sample_size*2, 1]) 350 | print("oversampled sample:", list(Oversampled_zid.items())[0]) 351 | num_morgan_missing = 0 352 | for key in Oversampled_zid.keys(): 353 | try: 354 | tt = len(Oversampled_zid[key]) 355 | except TypeError as e: 356 | # print("Missing morgan fingerprint for this ZINC ID") 357 | # print(key, Oversampled_zid[key]) 358 | num_morgan_missing += 1 359 | continue # Skipping data that has no labels for it 360 | 361 | Oversampled_X_train[ct:ct+tt] = Oversampled_zid[key] # repeating the same data for as many times as it was selected 362 | Oversampled_y_train[ct:ct+tt] = Oversampled_zid_y[key] 363 | ct += tt 364 | 365 | print("Done oversampling, number of missing morgan fingerprints:", num_morgan_missing) 366 | 367 | class TimedStopping(Callback): 368 | ''' 369 | Stop training when enough time has passed. 370 | # Arguments 371 | seconds: maximum time before stopping. 372 | verbose: verbosity mode. 373 | ''' 374 | def __init__(self, seconds=None, verbose=1): 375 | super(Callback, self).__init__() 376 | 377 | self.start_time = 0 378 | self.seconds = seconds 379 | self.verbose = verbose 380 | 381 | def on_train_begin(self, logs={}): 382 | self.start_time = time.time() 383 | 384 | def on_epoch_end(self, epoch, logs={}): 385 | print('epoch done') 386 | if time.time() - self.start_time > self.seconds: 387 | self.model.stop_training = True 388 | if self.verbose: 389 | print('Stopping after %s seconds.' 
% self.seconds) 390 | #FREE MEMORY 391 | 392 | del data_from_prev 393 | del train_data 394 | del test_data 395 | del valid_data 396 | del Oversampled_zid 397 | del Oversampled_zid_y 398 | del y_valid_first 399 | del y_test_first 400 | gc.collect() 401 | 402 | #END FREE MEMORY 403 | 404 | print("Data prep time:", time.time() - START_TIME) 405 | print("Configuring model...") 406 | 407 | # This is our new model 408 | hyperparameters = {"bin_array": ba*[0,1], "dropout_rate": df, "learning_rate": lr, 409 | "num_units": nu, "batch_size": bs, "class_weight": wt, "epsilon": 1e-06} 410 | print("\n"+"-"*20) 411 | print("Training data info:" + "\n") 412 | print("X Data Shape[1:]", Oversampled_X_train.shape[1:]) 413 | print("X Data Shape", Oversampled_X_train.shape) 414 | print("X Data example", Oversampled_X_train[0]) 415 | print("Hyperparameters", hyperparameters) 416 | 417 | # TODO create a flag for optimizing the models 418 | if False: 419 | # progressive_docking = optimize(technique='bayesian') 420 | pass 421 | else: 422 | from ML.DDMetrics import * 423 | metrics = ['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()] 424 | progressive_docking = DDModel(mode='original', 425 | input_shape=Oversampled_X_train.shape[1:], 426 | hyperparameters=hyperparameters, 427 | metrics=metrics) 428 | progressive_docking.model.summary() 429 | 430 | # keeping track of what model number this currently is and saving 431 | try: 432 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/model_no.txt', 'r') as ref: 433 | mn = int(ref.readline().rstrip())+1 434 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/model_no.txt', 'w') as ref: 435 | ref.write(str(mn)) 436 | 437 | except IOError: # file doesnt exist yet 438 | mn = 1 439 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/model_no.txt', 'w') as ref: 440 | ref.write(str(mn)) 441 | 442 | num_epochs = 500 443 | cw = {0:1, 1:wt} 444 | es = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=0, mode='auto') 445 | es1 = TimedStopping(seconds=36000) # stop training after 10 hours 446 | logger = DDLogger( 447 | log_path=SAVE_PATH + "/iteration_" + str(n_iteration) + "/all_models/model_{}_train_log.csv".format(str(mn)), 448 | max_time=36000, 449 | max_epochs=num_epochs, 450 | monitoring='val_loss' 451 | ) 452 | delta_time = time.time() 453 | 454 | try: 455 | progressive_docking.save(SAVE_PATH + '/iteration_'+str(n_iteration)+'/all_models/model_'+str(mn)) 456 | except tensorflow.python.framework.errors_impl.FailedPreconditionError as e: 457 | print("Error occurred while saving:") 458 | print(" -", e) 459 | 460 | progressive_docking.fit(Oversampled_X_train, 461 | Oversampled_y_train, 462 | epochs=num_epochs, 463 | batch_size=bs, 464 | shuffle=True, 465 | class_weight=cw, 466 | verbose=1, 467 | validation_data=[X_valid, y_valid], 468 | callbacks=[es, es1, logger]) 469 | 470 | # Delta time records how long a model takes to train num_epochs amount of epochs 471 | delta_time = time.time()-delta_time 472 | print("Training Time:", delta_time) 473 | 474 | print("Saving the model...") 475 | progressive_docking.save(SAVE_PATH + '/iteration_'+str(n_iteration)+'/all_models/model_'+str(mn)) 476 | 477 | print('Predicting on validation data') 478 | prediction_valid = progressive_docking.predict(X_valid) 479 | 480 | print('Predicting on testing data') 481 | prediction_test = progressive_docking.predict(X_test) 482 | 483 | if CONTINUOUS: 484 | # Converting back to binary values to get stats 485 | y_valid = y_valid < cf_to_use 486 | 
prediction_valid = prediction_valid < cf_to_use 487 | y_test = y_test < cf_to_use 488 | prediction_test = prediction_test < cf_to_use 489 | 490 | print('Getting stats from predictions...') 491 | # Getting stats for validation 492 | precision_vl, recall_vl, thresholds_vl = precision_recall_curve(y_valid, prediction_valid) 493 | fpr_vl, tpr_vl, thresh_vl = roc_curve(y_valid, prediction_valid) 494 | auc_vl = auc(fpr_vl,tpr_vl) 495 | pr_vl = precision_vl[np.where(recall_vl>rec)[0][-1]] 496 | pos_ct_orig = np.sum(y_valid) 497 | Total_left = rec*pos_ct_orig/pr_vl*total_mols*1000000/len(y_valid) # extrapolates the expected number of predicted virtual hits from the validation set to the full database 498 | tr = thresholds_vl[np.where(recall_vl>rec)[0][-1]] 499 | 500 | # Getting stats for testing 501 | precision_te, recall_te, thresholds_te = precision_recall_curve(y_test,prediction_test) 502 | fpr_te, tpr_te, thresh_te = roc_curve(y_test, prediction_test) 503 | auc_te = auc(fpr_te,tpr_te) 504 | pr_te = precision_te[np.where(thresholds_te>tr)[0][0]] 505 | re_te = recall_te[np.where(thresholds_te>tr)[0][0]] 506 | pos_ct_orig = np.sum(y_test) 507 | Total_left_te = re_te*pos_ct_orig/pr_te*total_mols*1000000/len(y_test) 508 | 509 | 510 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/hyperparameter_morgan_with_freq_v3.csv','a') as ref: 511 | ref.write(str(mn)+','+str(oss)+','+str(bs)+','+str(lr)+','+str(ba)+','+str(nu)+','+str(df)+','+str(wt)+','+str(cf)+','+str(auc_vl)+','+str(pr_vl)+','+str(Total_left)+','+str(auc_te)+','+str(pr_te)+','+str(re_te)+','+str(Total_left_te)+','+str(pos_ct_orig)+'\n') 512 | 513 | with open(SAVE_PATH + '/iteration_'+str(n_iteration)+'/hyperparameter_morgan_with_freq_v3.txt','a') as ref: 514 | # The string of hyperparameter info that will be appended to the file ref 515 | hp = "\n" + "-" * 15 + "\n" + "Hyperparameters:" + "\n" 516 | hp += "- Model Number: " + str(mn) + "\n" 517 | hp += "- Training Time: " + str(round(delta_time, 3)) + "\n" 518 | hp += " - OS: " + str(oss) + "\n" 519 | hp += " - Batch Size: " + str(bs) + "\n" 520 | hp += " - Learning Rate: " + str(lr) + "\n" 521 | hp += " - Bin Array: " + str(ba) + "\n" 522 | hp += " - Num. 
Units: " + str(nu) + "\n" 523 | hp += " - Dropout Freq.: " + str(df) + "\n"*2 524 | 525 | hp += " - Class Weight Parameter wt: " + str(wt) + "\n" 526 | hp += " - cf: " + str(cf) + "\n" 527 | hp += " - auc vl: " + str(auc_vl) + "\n" 528 | hp += " - auc te: " + str(auc_te) + "\n" 529 | hp += " - Precision validation: " + str(pr_vl) + "\n" 530 | hp += " - Precision testing: " + str(pr_te) + "\n" 531 | hp += " - Recall testing: " + str(re_te) + "\n" 532 | hp += " - Pos ct orig: " + str(pos_ct_orig) + "\n" 533 | hp += " - Total Left: " + str(Total_left) + "\n" 534 | hp += " - Total Left testing: " + str(Total_left_te) + "\n\n" 535 | 536 | hp += "-" * 15 537 | ref.write(hp) 538 | 539 | print("Model number", mn, "complete.") 540 | -------------------------------------------------------------------------------- /phase_2-3/simple_job_models.py: -------------------------------------------------------------------------------- 1 | import builtins as __builtin__ 2 | import pandas as pd 3 | import numpy as np 4 | import argparse 5 | import glob 6 | import time 7 | import os 8 | 9 | try: 10 | import __builtin__ 11 | except ImportError: 12 | # Python 3 13 | import builtins as __builtin__ 14 | 15 | # For debugging purposes only: 16 | def print(*args, **kwargs): 17 | __builtin__.print('\t simple_jobs: ', end="") 18 | return __builtin__.print(*args, **kwargs) 19 | 20 | 21 | START_TIME = time.time() 22 | 23 | 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('-n_it','--iteration_no',required=True,help='Number of current iteration') 26 | parser.add_argument('-mdd','--morgan_directory',required=True,help='Path to the Morgan fingerprint directory for the database') 27 | parser.add_argument('-time','--time',required=True,help='Time limit for training') 28 | parser.add_argument('-file_path','--file_path',required=True,help='Path to the project directory, including project directory name') 29 | parser.add_argument('-nhp','--number_of_hyp',required=True,help='Number of hyperparameters') 30 | parser.add_argument('-titr','--total_iterations',required=True,help='Desired total number of iterations') 31 | 32 | parser.add_argument('-isl','--is_last',required=False, action='store_true',help='Flag indicating that this is the last iteration') 33 | 34 | # adding parameter for where to save all the data to: 35 | parser.add_argument('-save', '--save_path', required=False, default=None) 36 | 37 | # allowing for variable number of molecules to test and validate from: 38 | parser.add_argument('-n_mol', '--number_mol', required=False, default=1000000, help='Size of test/validation set to be used') 39 | 40 | parser.add_argument('-pfm', '--percent_first_mols', required=False, default=-1, help='Fraction of top scoring molecules to be considered as virtual hits in the first iteration (for standard DD run on 11 iterations, we recommend 0.01)') # these two inputs are fractions (e.g. 0.01), not percentages 41 | parser.add_argument('-plm', '--percent_last_mols', required=False, default=-1, help='Fraction of top scoring molecules to be considered as virtual hits in the last iteration (for standard DD run on 11 iterations, we recommend 0.0001)') 42 | 43 | 44 | # Pass the threshold 45 | parser.add_argument('-ct', required=False, default=0.9, help='Recall value in the [0,1] range (default 0.9)') 46 | 47 | # Flag for switching between functions that determine how many molecules are left at the end of the iteration 48 | # if neither is provided it defaults to a linear decrease 49 | funct_flags = parser.add_mutually_exclusive_group(required=False) 50 | funct_flags.add_argument('-expdec',
'--exponential_dec', required=False, default=-1) # the base of the exponential must be passed in 51 | funct_flags.add_argument('-polydec', '--polynomial_dec', required=False, default=-1) # the power of the polynomial must also be passed in for this flag 52 | 53 | io_args, extra_args = parser.parse_known_args() 54 | n_it = int(io_args.iteration_no) 55 | mdd = io_args.morgan_directory 56 | time_model = io_args.time 57 | nhp = int(io_args.number_of_hyp) 58 | isl = io_args.is_last 59 | titr = int(io_args.total_iterations) 60 | rec = float(io_args.ct) 61 | 62 | num_molec = int(io_args.number_mol) 63 | 64 | percent_first_mols = float(io_args.percent_first_mols) 65 | percent_last_mols = float(io_args.percent_last_mols) 66 | 67 | exponential_dec = int(io_args.exponential_dec) 68 | polynomial_dec = int(io_args.polynomial_dec) 69 | 70 | DATA_PATH = io_args.file_path # Now == file_path/protein 71 | SAVE_PATH = io_args.save_path 72 | # if no save path is provided we just save it in the same location as the data 73 | if SAVE_PATH is None: SAVE_PATH = DATA_PATH 74 | 75 | # sums the first column and divides it by 1 million (this is our total database size) 76 | t_mol = pd.read_csv(mdd+'/Mol_ct_file.csv',header=None)[[0]].sum()[0]/1000000 # Mol_ct_file.csv lists the number of compounds in each database file 77 | 78 | cummulative = 0.25*n_it 79 | num_units = [100, 1500,2000] 80 | dropout = [0.2, 0.5] 81 | learn_rate = [0.0001] 82 | bin_array = [2, 3] 83 | wt = [2, 3] 84 | if nhp < 144: 85 | bs = [256] 86 | else: 87 | bs = [128, 256] 88 | 89 | if nhp < 48: 90 | oss = [10] 91 | elif nhp < 72: 92 | oss = [5, 10] 93 | else: 94 | oss = [5, 10, 20] 95 | 96 | try: 97 | os.mkdir(SAVE_PATH+'/iteration_'+str(n_it)+'/simple_job') 98 | except OSError: # catching file exists error 99 | pass 100 | 101 | # Removing any job files left over from a previous run of this iteration 102 | for f in glob.glob(SAVE_PATH+'/iteration_'+str(n_it)+'/simple_job/*'): 103 | os.remove(f) 104 | 105 | scores_val = [] 106 | with open(DATA_PATH+'/iteration_'+str(1)+'/validation_labels.txt','r') as ref: 107 | ref.readline() # first line is ignored 108 | for line in ref: 109 | scores_val.append(float(line.rstrip().split(',')[0])) 110 | 111 | scores_val = np.array(scores_val) 112 | 113 | first_mols = int(100*t_mol/13) if percent_first_mols == -1.0 else int(percent_first_mols * len(scores_val)) 114 | last_mols = 100 if percent_last_mols == -1.0 else int(percent_last_mols * len(scores_val)) 115 | 116 | if n_it==1: 117 | # 'good_mol' is the number of top scoring molecules to save at the end of the iteration 118 | good_mol = first_mols 119 | else: 120 | if exponential_dec != -1: 121 | good_mol = int() #TODO: create functions for these 122 | elif polynomial_dec != -1: 123 | good_mol = int() 124 | else: 125 | good_mol = int(((last_mols-first_mols)*n_it + titr*first_mols-last_mols)/(titr-1)) # linear decrease as iterations increase (from first_mols at iteration 1 down to last_mols at the final iteration) 126 | 127 | print(isl) 128 | # If this is the last iteration then we save only 100 molecules 129 | if isl: 130 | # 100 mols corresponds to a fraction of 0.0001 (0.01%) of an initial 1 million input molecules 131 | good_mol = 100 if percent_last_mols == -1.0 else int(percent_last_mols * len(scores_val)) 132 | 133 | cf_start = np.mean(scores_val) # the mean of all the docking scores (labels) of the validation set 134 | t_good = len(scores_val) 135 | 136 | # we decrease the threshold value until the desired number of molecules is left 137 | while t_good > good_mol: 138 | cf_start -= 0.005 139 | t_good = len(scores_val[scores_val