├── .gitignore ├── LICENSE ├── README.md ├── cover-image.png ├── data ├── fine-tuning │ ├── QSAR_model_example.pickle │ └── gdb13_1K-debug │ │ ├── preprocessing_params.csv │ │ ├── pretrained_model.pth │ │ └── train.csv └── pre-training │ ├── DRD2_actives │ ├── test.smi │ ├── train.smi │ └── valid.smi │ ├── gdb13_1K-debug │ ├── preprocessing_params.csv │ ├── test.h5 │ ├── test.smi │ ├── train.csv │ ├── train.h5 │ ├── train.smi │ ├── valid.h5 │ └── valid.smi │ └── gdb13_1K │ ├── preprocessing_params.csv │ ├── test.h5 │ ├── test.smi │ ├── train.csv │ ├── train.h5 │ ├── train.smi │ ├── valid.h5 │ └── valid.smi ├── environments └── graphinvent.yml ├── graphinvent ├── Analyzer.py ├── BlockDatasetLoader.py ├── DataProcesser.py ├── GraphGenerator.py ├── GraphGeneratorRL.py ├── MolecularGraph.py ├── ScoringFunction.py ├── Workflow.py ├── __init__.py ├── gnn │ ├── __init__.py │ ├── aggregation_mpnn.py │ ├── edge_mpnn.py │ ├── modules.py │ ├── mpnn.py │ └── summation_mpnn.py ├── main.py ├── parameters │ ├── __init__.py │ ├── args.py │ ├── constants.py │ ├── defaults.py │ └── load.py └── util.py ├── output └── input.csv ├── submit-fine-tuning.py ├── submit-pre-training-supercloud.py ├── submit-pre-training.py ├── tools ├── README.md ├── atom_types.py ├── combine_HDFs.py ├── formal_charges.py ├── max_n_nodes.py ├── submit-split-preprocessing-supercloud.py ├── tdc-create-dataset.py └── utils.py └── tutorials ├── 0_setting_up_environment.md ├── 1_introduction.md ├── 2_using_a_new_dataset.md ├── 3_visualizing_molecules.md ├── 4_transfer_learning.md ├── 5_benchmarking_with_moses.md ├── 6_preprocessing_large_datasets.md ├── 7_reinforcement_learning.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | data/fine-tuning/qsar_model.pickle 2 | .vscode/* 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2020 Rocío Mercado. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **Please note: this repository is no longer being maintained.** 2 | 3 | # GraphINVENT 4 | 5 | ![cover image](./cover-image.png) 6 | 7 | ## Description 8 | GraphINVENT is a platform for graph-based molecular generation using graph neural networks. 
GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time. All models implemented in GraphINVENT can quickly learn to build molecules resembling training set molecules without any explicit programming of chemical rules. The models have been benchmarked using the MOSES distribution-based metrics, showing that the best GraphINVENT model compares well with state-of-the-art generative models. 9 | 10 | ## Updates 11 | The following versions of GraphINVENT exist in this repository: 12 | * v1.0 (and all commits up to here) is the label corresponding to the "original" version, and corresponds to the publications below. 13 | * v2.0 is an outdated version, created March 10, 2021. 14 | * v3.0 is the latest version, created August 20, 2021. 15 | 16 | *20-08-2021*: 17 | 18 | Large update: 19 | * Added a reinforcement learning framework to allow for fine-tuning of models. Fine-tuning jobs can now be run using the --job-type "fine-tune" flag. 20 | * An example submission script for fine-tuning jobs was added (`submit-fine-tuning.py`), and the old example submission script was renamed (`submit.py` --> `submit-pre-training.py`). 21 | * Note: the tutorials have not yet been updated to reflect these changes; this will be done soon, but for now be aware that there may be small discrepancies between what is written in the tutorials and the actual instructions. I will delete this bullet point when I have updated the tutorials. 22 | 23 | *26-03-2021*: 24 | 25 | Small update: 26 | * Pre-trained models created with GraphINVENT v1.0 can now be used with GraphINVENT v2.0. 27 | 28 | *10-03-2021*: 29 | 30 | The biggest changes in v2.0 relative to v1.0 are summarized below: 31 | * Data preprocessing was updated for readability (now done in `DataProcesser.py`). 32 | * Graph generation was updated for readability (now done in `GraphGenerator.py`), and bugs were fixed in how implicit Hs and chirality were handled on the GPU (not used before, despite being available for preprocessing/training). 33 | * Data analysis code was updated for readability (now done in `Analyzer.py`). 34 | * The learning rate decay scheme was changed from a custom learning rate scheduler to the OneCycle scheduler (so far, it appears to be working well enough, and with a reduced set of parameters; see the illustrative sketch below). 35 | * The code now runs using the latest version of PyTorch (1.8.0); the previous version ran on PyTorch 1.3. The environment has correspondingly been updated (and renamed from "GraphINVENT-env" to "graphinvent"). 36 | * Redundant hyperparameters were removed; additionally, hyperparameters that were not observed to improve performance were removed from `defaults.py`, such as the optimizer weight decay (now fixed at 0.0) and the weights initialization (now fixed to Xavier uniform). 37 | * Some old modules, such as `models.py` and `loss.py`, were consolidated into `Workflow.py`. 38 | * A validation loss calculation was added to keep track of model training. 39 | 40 | Additionally, minor typos and bugs were corrected, and the docstrings and error messages were updated. Examples of minor bugs/changes: 41 | * A bug in how the fraction of properly terminated graphs (and the fraction of properly terminated graphs that are valid) was calculated (the wrong function was used for the data type, which led to errors in rare instances). 42 | * Errors in how analysis histograms were written to TensorBoard; these were also of questionable utility, so they have simply been removed. 43 | * Some values (like the "NLL diff") were removed, as they were also not found to be useful.
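For reference, below is a minimal, illustrative sketch of how PyTorch's `OneCycleLR` scheduler is typically wired into a training loop. This is not the actual code in `Workflow.py`; the model, optimizer, and hyperparameter values are placeholders chosen only to make the example runnable.

```
import torch

model     = torch.nn.Linear(10, 1)  # stand-in for a GraphINVENT model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,       # peak learning rate (placeholder value)
    total_steps=1000,  # total number of optimizer steps (placeholder value)
)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per mini-batch, not once per epoch
```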
44 | 45 | If you spot any issues (big or small) since the update, please create an issue or a pull request (if you are able to fix it), and we will be happy to make changes. 46 | 47 | ## Prerequisites 48 | * Anaconda or Miniconda with Python 3.6 or 3.8. 49 | * (for GPU training only) A CUDA-enabled GPU. 50 | 51 | ## Instructions and tutorials 52 | For detailed guides on how to use GraphINVENT, see the [tutorials](./tutorials/). 53 | 54 | ## Examples 55 | An example training set is available in [./data/pre-training/gdb13_1K/](./data/pre-training/gdb13_1K/). It is a small (1K) subset of GDB-13 and is already preprocessed. 56 | 57 | ## Contributors 58 | [@rociomer](https://www.github.com/rociomer) 59 | 60 | [@rastemo](https://www.github.com/rastemo) 61 | 62 | [@edvardlindelof](https://www.github.com/edvardlindelof) 63 | 64 | [@sararromeo](https://www.github.com/sararromeo) 65 | 66 | [@JuanViguera](https://www.github.com/JuanViguera) 67 | 68 | [@psolsson](https://www.github.com/psolsson) 69 | 70 | ## Contributions 71 | 72 | Contributions are welcome in the form of issues or pull requests. To report a bug, please submit an issue. Thank you to everyone who has used the code and provided feedback thus far. 73 | 74 | 75 | ## References 76 | ### Relevant publications 77 | If you use GraphINVENT in your research, please reference our [publication](https://doi.org/10.1088/2632-2153/abcf91). 78 | 79 | Additional details related to the development of GraphINVENT are available in our [technical note](https://doi.org/10.1002/ail2.18). You might find this note useful if you're interested in either exploring different hyperparameters or developing your own generative models. 80 | 81 | The references in BibTeX format are available below: 82 | 83 | ``` 84 | @article{mercado2020graph, 85 | author = "Rocío Mercado and Tobias Rastemo and Edvard Lindelöf and Günter Klambauer and Ola Engkvist and Hongming Chen and Esben Jannik Bjerrum", 86 | title = "{Graph Networks for Molecular Design}", 87 | journal = {Machine Learning: Science and Technology}, 88 | year = {2020}, 89 | publisher = {IOP Publishing}, 90 | doi = "10.1088/2632-2153/abcf91" 91 | } 92 | 93 | @article{mercado2020practical, 94 | author = "Rocío Mercado and Tobias Rastemo and Edvard Lindelöf and Günter Klambauer and Ola Engkvist and Hongming Chen and Esben Jannik Bjerrum", 95 | title = "{Practical Notes on Building Molecular Graph Generative Models}", 96 | journal = {Applied AI Letters}, 97 | year = {2020}, 98 | publisher = {Wiley Online Library}, 99 | doi = "10.1002/ail2.18" 100 | } 101 | ``` 102 | 103 | ### Related work 104 | #### MPNNs 105 | The MPNN implementations used in this work were pulled from Edvard Lindelöf's repo in October 2018, while he was a master's student in the MAI group. This work is available at 106 | 107 | https://github.com/edvardlindelof/graph-neural-networks-for-drug-discovery. 108 | 109 | His master's thesis, describing the EMN implementation, can be found at 110 | 111 | https://odr.chalmers.se/handle/20.500.12380/256629. 112 | 113 | #### MOSES 114 | The MOSES repo is available at https://github.com/molecularsets/moses. 115 | 116 | #### GDB-13 117 | The example dataset provided is a subset of GDB-13. It was obtained by randomly sampling 1000 structures from the entire GDB-13 dataset. The full dataset is available for download at http://gdb.unibe.ch/downloads/.
118 | 119 | 120 | #### RL-GraphINVENT 121 | Version 3.0 incorporates Sara's work into the latest GraphINVENT framework: [repo](https://github.com/olsson-group/RL-GraphINVENT) and [paper](https://doi.org/10.33774/chemrxiv-2021-9w3tc). Her work was presented at the [RL4RealLife](https://sites.google.com/view/RL4RealLife) workshop at ICML 2021. 122 | 123 | #### Exploring graph traversal algorithms in GraphINVENT 124 | In [this](https://doi.org/10.33774/chemrxiv-2021-5c5l1) pre-print, we look into the effect of different graph traversal algorithms on the types of structures that are generated by GraphINVENT. We find that a BFS generally leads to better molecules than a DFS, unless the model is overtrained, at which point both graph traversal algorithms lead to indistinguishible sets of structures. 125 | 126 | ## License 127 | 128 | GraphINVENT is licensed under the MIT license and is free and provided as-is. 129 | 130 | ## Link 131 | https://github.com/MolecularAI/GraphINVENT/ 132 | -------------------------------------------------------------------------------- /cover-image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/cover-image.png -------------------------------------------------------------------------------- /data/fine-tuning/QSAR_model_example.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/fine-tuning/QSAR_model_example.pickle -------------------------------------------------------------------------------- /data/fine-tuning/gdb13_1K-debug/preprocessing_params.csv: -------------------------------------------------------------------------------- 1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | chirality;['None', 'R', 'S'] 3 | formal_charge;[-1, 0, 1] 4 | ignore_H;True 5 | imp_H;[0, 1, 2, 3] 6 | max_n_nodes;13 7 | use_aromatic_bonds;False 8 | use_chirality;False 9 | use_explicit_H;False 10 | -------------------------------------------------------------------------------- /data/fine-tuning/gdb13_1K-debug/pretrained_model.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/fine-tuning/gdb13_1K-debug/pretrained_model.pth -------------------------------------------------------------------------------- /data/fine-tuning/gdb13_1K-debug/train.csv: -------------------------------------------------------------------------------- 1 | ('Training set', 'n_nodes_hist');[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 2.25] 2 | ('Training set', 'avg_n_nodes');12.625 3 | ('Training set', 'atom_type_hist');[25.5, 6.75, 5.25, 0.0, 0.0] 4 | ('Training set', 'formal_charge_hist');[0.0, 37.5, 0.0] 5 | ('Training set', 'n_edges_hist');[10.0, 14.75, 12.0, 0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 6 | ('Training set', 'avg_n_edges');2.072 7 | ('Training set', 'edge_feature_hist');[33.25, 5.5, 0.5] 8 | ('Training set', 'fraction_unique');0.0 9 | ('Training set', 'fraction_valid');1.0 10 | ('Training set', 'numh_hist');[] 11 | ('Training set', 'chirality_hist');[] 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/preprocessing_params.csv: -------------------------------------------------------------------------------- 
1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | chirality;['None', 'R', 'S'] 3 | formal_charge;[-1, 0, 1] 4 | ignore_H;True 5 | imp_H;[0, 1, 2, 3] 6 | max_n_nodes;13 7 | use_aromatic_bonds;False 8 | use_chirality;False 9 | use_explicit_H;False 10 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/test.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K-debug/test.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/test.smi: -------------------------------------------------------------------------------- 1 | SMILES Name 2 | OC1C(O)C2C1NCC2O 1665110 3 | CC1CC2C(C1)C21CN=CN1 2415507 4 | CC=CC1(N)C=CC2C(N)C12 3385993 5 | CC#CC12COC(=O)CC1O2 5741941 6 | COC(C)C#CC(O)C(C)=O 6426824 7 | OCC=C1COC2CCCC12 6724947 8 | C=C1C2CC2CC1(C=O)C#N 7240989 9 | CC1CN(CC(CO)C=C)C1 8022539 10 | CC1N(C=N)N=C(C)OC1=O 8541293 11 | CCC1(C)C2(N)COCC12N 9017466 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/train.csv: -------------------------------------------------------------------------------- 1 | ('Training set', 'n_nodes_hist');[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 2.25] 2 | ('Training set', 'avg_n_nodes');12.625 3 | ('Training set', 'atom_type_hist');[25.5, 6.75, 5.25, 0.0, 0.0] 4 | ('Training set', 'formal_charge_hist');[0.0, 37.5, 0.0] 5 | ('Training set', 'n_edges_hist');[10.0, 14.75, 12.0, 0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 6 | ('Training set', 'avg_n_edges');2.072 7 | ('Training set', 'edge_feature_hist');[33.25, 5.5, 0.5] 8 | ('Training set', 'fraction_unique');0.0 9 | ('Training set', 'fraction_valid');1.0 10 | ('Training set', 'numh_hist');[] 11 | ('Training set', 'chirality_hist');[] 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/train.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K-debug/train.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/train.smi: -------------------------------------------------------------------------------- 1 | SMILES Name 2 | CC1C2N1CC1=C2CC=C1 69719 3 | CC(C)C1=CCC2C3C=COC123 41613507 4 | CN1CC(N)C2CC2(CO)ON=C1 676397262 5 | CC1NC1C1ON=CC(C)C1CN 481064820 6 | CN=C1CN(C)CC(CCO)CN1 695143543 7 | CC1C2CC1(O)C(C=O)=C2C 4967159 8 | COCC=CC(C)N(C=N)N(C)C 758801042 9 | OC(C=C)C1=CC(=O)C2CC2OC1 356175703 10 | CNC(C#C)C1=NN=C(CO)N1O 708514546 11 | COC(=O)NC(C)C1CNNC1=O 766737120 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/valid.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K-debug/valid.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/valid.smi: -------------------------------------------------------------------------------- 1 | SMILES Name 2 | CCC(C=CC)=CC(C)C 28951 3 | N#CC1CCC2NC2CN1 375643 4 
| OCC1(CC1)C(=O)OC=C 692764 5 | OCCNCCN(O)C=N 1739460 6 | CC12C3C4C(C4C1=C)C2C3N 2179443 7 | NC1CC(N)C(C#C)C1C#C 4528171 8 | CC1CC2(CC=C)CC2C1=O 5070952 9 | CCC(CO)C1=CC=CC1=O 6015379 10 | CC12CC=NN=CC1NC2=O 7076956 11 | CC1C2CC1(C)NC2C=NO 7236816 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/preprocessing_params.csv: -------------------------------------------------------------------------------- 1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | chirality;['None', 'R', 'S'] 3 | formal_charge;[-1, 0, 1] 4 | ignore_H;True 5 | imp_H;[0, 1, 2, 3] 6 | max_n_nodes;13 7 | use_aromatic_bonds;False 8 | use_chirality;False 9 | use_explicit_H;False 10 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/test.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K/test.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/train.csv: -------------------------------------------------------------------------------- 1 | ('Training set', 'n_nodes_hist');[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.064, 1.414, 9.493, 53.61] 2 | ('Training set', 'avg_n_nodes');12.787 3 | ('Training set', 'atom_type_hist');[589.076, 125.834, 104.132, 7.85, 0.137] 4 | ('Training set', 'formal_charge_hist');[0.25, 826.53, 0.25] 5 | ('Training set', 'n_edges_hist');[187.398, 361.517, 236.11, 42.004, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 6 | ('Training set', 'avg_n_edges');2.164 7 | ('Training set', 'edge_feature_hist');[752.936, 127.018, 13.436] 8 | ('Training set', 'fraction_unique');0.0 9 | ('Training set', 'fraction_valid');1.0 10 | ('Training set', 'numh_hist');[] 11 | ('Training set', 'chirality_hist');[] 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/train.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K/train.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/valid.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K/valid.h5 -------------------------------------------------------------------------------- /environments/graphinvent.yml: -------------------------------------------------------------------------------- 1 | name: graphinvent 2 | channels: 3 | - pytorch 4 | - anaconda 5 | - conda-forge 6 | - defaults 7 | dependencies: 8 | - _libgcc_mutex=0.1=conda_forge 9 | - _openmp_mutex=4.5=1_gnu 10 | - absl-py=0.11.0=py38h578d9bd_0 11 | - astroid=2.4.2=py38_0 12 | - blas=1.0=mkl 13 | - boost=1.74.0=py38hc10631b_3 14 | - boost-cpp=1.74.0=h9359b55_0 15 | - bzip2=1.0.8=h7b6447c_0 16 | - c-ares=1.17.1=h7f98852_1 17 | - ca-certificates=2020.10.14=0 18 | - cached-property=1.5.1=py_0 19 | - cairo=1.16.0=h3fc0475_1005 20 | - certifi=2020.6.20=py38_0 21 | - cudatoolkit=10.1.243=h6bb024c_0 22 | - cycler=0.10.0=py_2 23 | - ffmpeg=4.3=hf484d3e_0 24 | - fontconfig=2.13.1=hba837de_1004 25 | - freetype=2.10.4=h5ab3b9f_0 26 | - glib=2.67.4=h36276a3_1 27 | - 
gmp=6.2.1=h2531618_2 28 | - gnutls=3.6.5=h71b1129_1002 29 | - grpcio=1.36.1=py38hdd6454d_0 30 | - h5py=3.1.0=nompi_py38hafa665b_100 31 | - hdf5=1.10.6=nompi_h6a2412b_1114 32 | - icu=67.1=he1b5a44_0 33 | - importlib-metadata=3.7.2=py38h578d9bd_0 34 | - intel-openmp=2020.2=254 35 | - isort=5.6.4=py_0 36 | - joblib=1.0.1=pyhd8ed1ab_0 37 | - jpeg=9b=h024ee3a_2 38 | - kiwisolver=1.3.1=py38h1fd1430_1 39 | - krb5=1.17.2=h926e7f8_0 40 | - lame=3.100=h7b6447c_0 41 | - lazy-object-proxy=1.4.3=py38h7b6447c_0 42 | - lcms2=2.11=h396b838_0 43 | - ld_impl_linux-64=2.33.1=h53a641e_7 44 | - libblas=3.9.0=1_h86c2bf4_netlib 45 | - libcblas=3.9.0=5_h92ddd45_netlib 46 | - libcurl=7.75.0=hc4aaa36_0 47 | - libedit=3.1.20191231=h14c3975_1 48 | - libev=4.33=h516909a_1 49 | - libffi=3.3=he6710b0_2 50 | - libgcc-ng=9.3.0=h2828fa1_18 51 | - libgfortran-ng=9.3.0=hff62375_18 52 | - libgfortran5=9.3.0=hff62375_18 53 | - libgomp=9.3.0=h2828fa1_18 54 | - libiconv=1.15=h63c8f33_5 55 | - liblapack=3.9.0=5_h92ddd45_netlib 56 | - libnghttp2=1.43.0=h812cca2_0 57 | - libpng=1.6.37=hbc83047_0 58 | - libprotobuf=3.15.5=h780b84a_0 59 | - libssh2=1.9.0=hab1572f_5 60 | - libstdcxx-ng=9.3.0=h6de172a_18 61 | - libtiff=4.1.0=h2733197_1 62 | - libuuid=2.32.1=h7f98852_1000 63 | - libuv=1.40.0=h7b6447c_0 64 | - libxcb=1.13=h7f98852_1003 65 | - libxml2=2.9.10=h72b56ed_2 66 | - lz4-c=1.9.3=h2531618_0 67 | - markdown=3.3.4=pyhd8ed1ab_0 68 | - matplotlib-base=3.3.4=py38h0efea84_0 69 | - mccabe=0.6.1=py38_1 70 | - mkl=2020.2=256 71 | - mkl-service=2.3.0=py38he904b0f_0 72 | - mkl_fft=1.3.0=py38h54f3939_0 73 | - mkl_random=1.1.1=py38h0573a6f_0 74 | - ncurses=6.2=he6710b0_1 75 | - nettle=3.4.1=hbb512f6_0 76 | - ninja=1.10.2=py38hff7bd54_0 77 | - numpy=1.19.2=py38h54aff64_0 78 | - numpy-base=1.19.2=py38hfa32c7d_0 79 | - olefile=0.46=py_0 80 | - openh264=2.1.0=hd408876_0 81 | - openssl=1.1.1k=h7f98852_0 82 | - pandas=1.2.3=py38h51da96c_0 83 | - pcre=8.44=he1b5a44_0 84 | - pillow=8.1.1=py38he98fc37_0 85 | - pip=21.0.1=py38h06a4308_0 86 | - pixman=0.38.0=h516909a_1003 87 | - protobuf=3.15.5=py38h709712a_0 88 | - pthread-stubs=0.4=h36c2ea0_1001 89 | - pycairo=1.20.0=py38h323dad1_1 90 | - pylint=2.6.0=py38_0 91 | - pyparsing=2.4.7=pyh9f0ad1d_0 92 | - python=3.8.8=hdb3f193_4 93 | - python-dateutil=2.8.1=py_0 94 | - python_abi=3.8=1_cp38 95 | - pytorch=1.8.0=py3.8_cuda10.1_cudnn7.6.3_0 96 | - pytz=2021.1=pyhd8ed1ab_0 97 | - rdkit=2020.09.5=py38h2bca085_0 98 | - readline=8.1=h27cfd23_0 99 | - reportlab=3.5.63=py38hadf75a6_0 100 | - scikit-learn=0.21.1=py38hd81dba3_0 101 | - scipy=1.7.0=py38h7b17777_1 102 | - setuptools=52.0.0=py38h06a4308_0 103 | - six=1.15.0=py38h06a4308_0 104 | - sqlalchemy=1.3.23=py38h497a2fe_0 105 | - sqlite=3.33.0=h62c20be_0 106 | - tensorboard=1.15.0=py38_0 107 | - threadpoolctl=2.2.0=pyh8a188c0_0 108 | - tk=8.6.10=hbc83047_0 109 | - toml=0.10.1=py_0 110 | - torchaudio=0.8.0=py38 111 | - torchvision=0.9.0=py38_cu101 112 | - tornado=6.1=py38h497a2fe_1 113 | - tqdm=4.59.0=pyhd8ed1ab_0 114 | - typing_extensions=3.7.4.3=pyha847dfd_0 115 | - werkzeug=1.0.1=pyh9f0ad1d_0 116 | - wheel=0.36.2=pyhd3eb1b0_0 117 | - wrapt=1.11.2=py38h7b6447c_0 118 | - xorg-kbproto=1.0.7=h7f98852_1002 119 | - xorg-libice=1.0.10=h7f98852_0 120 | - xorg-libsm=1.2.3=hd9c2040_1000 121 | - xorg-libx11=1.7.0=h7f98852_0 122 | - xorg-libxau=1.0.9=h7f98852_0 123 | - xorg-libxdmcp=1.1.3=h7f98852_0 124 | - xorg-libxext=1.3.4=h7f98852_1 125 | - xorg-libxrender=0.9.10=h7f98852_1003 126 | - xorg-renderproto=0.11.1=h7f98852_1002 127 | - xorg-xextproto=7.3.0=h7f98852_1002 128 
| - xorg-xproto=7.0.31=h7f98852_1007 129 | - xz=5.2.5=h7b6447c_0 130 | - zipp=3.4.1=pyhd8ed1ab_0 131 | - zlib=1.2.11=h7b6447c_3 132 | - zstd=1.4.5=h9ceee32_0 133 | -------------------------------------------------------------------------------- /graphinvent/BlockDatasetLoader.py: -------------------------------------------------------------------------------- 1 | """ 2 | The `BlockDatasetLoader` defines custom `DataLoader`s and `Dataset`s used to 3 | efficiently load data from HDF files in this work 4 | """ 5 | # load general packages and functions 6 | from typing import Tuple 7 | import torch 8 | import h5py 9 | 10 | 11 | class BlockDataLoader(torch.utils.data.DataLoader): 12 | """ 13 | Main `DataLoader` class which has been modified so as to read training data 14 | from disk in blocks, as opposed to a single line at a time (as is done in 15 | the original `DataLoader` class). 16 | """ 17 | def __init__(self, dataset : torch.utils.data.Dataset, batch_size : int=100, 18 | block_size : int=10000, shuffle : bool=True, n_workers : int=0, 19 | pin_memory : bool=True) -> None: 20 | 21 | # define variables to be used throughout dataloading 22 | self.dataset = dataset # `HDFDataset` object 23 | self.batch_size = batch_size # `int` 24 | self.block_size = block_size # `int` 25 | self.shuffle = shuffle # `bool` 26 | self.n_workers = n_workers # `int` 27 | self.pin_memory = pin_memory # `bool` 28 | self.block_dataset = BlockDataset(self.dataset, 29 | batch_size=self.batch_size, 30 | block_size=self.block_size) 31 | 32 | def __iter__(self) -> torch.Tensor: 33 | 34 | # define a regular `DataLoader` using the `BlockDataset` 35 | block_loader = torch.utils.data.DataLoader(self.block_dataset, 36 | shuffle=self.shuffle, 37 | num_workers=self.n_workers) 38 | 39 | # define a condition for determining whether to drop the last block this 40 | # is done if the remainder block is very small (less than a tenth the 41 | # size of a normal block) 42 | condition = bool( 43 | int(self.block_dataset.__len__()/self.block_size) > 1 & 44 | self.block_dataset.__len__()%self.block_size < self.block_size/10 45 | ) 46 | 47 | # loop through and load BLOCKS of data every iteration 48 | for block in block_loader: 49 | block = [torch.squeeze(b) for b in block] 50 | 51 | # wrap each block in a `ShuffleBlock` so that data can be shuffled 52 | # within blocks 53 | batch_loader = torch.utils.data.DataLoader( 54 | dataset=ShuffleBlockWrapper(block), 55 | shuffle=self.shuffle, 56 | batch_size=self.batch_size, 57 | num_workers=self.n_workers, 58 | pin_memory=self.pin_memory, 59 | drop_last=condition 60 | ) 61 | 62 | for batch in batch_loader: 63 | yield batch 64 | 65 | def __len__(self) -> int: 66 | # returns the number of graphs in the DataLoader 67 | n_blocks = len(self.dataset) // self.block_size 68 | n_rem = len(self.dataset) % self.block_size 69 | n_batch_per_block = self.__ceil__(self.block_size, self.batch_size) 70 | n_last = self.__ceil__(n_rem, self.batch_size) 71 | return n_batch_per_block * n_blocks + n_last 72 | 73 | def __ceil__(self, i : int, j : int) -> int: 74 | return (i + j - 1) // j 75 | 76 | 77 | class BlockDataset(torch.utils.data.Dataset): 78 | """ 79 | Modified `Dataset` class which returns BLOCKS of data when `__getitem__()` 80 | is called. 81 | """ 82 | def __init__(self, dataset : torch.utils.data.Dataset, batch_size : int=100, 83 | block_size : int=10000) -> None: 84 | 85 | assert block_size >= batch_size, "Block size should be > batch size." 
86 | 87 | self.block_size = block_size # `int` 88 | self.batch_size = batch_size # `int` 89 | self.dataset = dataset # `HDFDataset` 90 | 91 | def __getitem__(self, idx : int) -> torch.Tensor: 92 | # returns a block of data from the dataset 93 | start = idx * self.block_size 94 | end = min((idx + 1) * self.block_size, len(self.dataset)) 95 | return self.dataset[start:end] 96 | 97 | def __len__(self) -> int: 98 | # returns the number of blocks in the dataset 99 | return (len(self.dataset) + self.block_size - 1) // self.block_size 100 | 101 | 102 | class ShuffleBlockWrapper: 103 | """ 104 | Extra class used to wrap a block of data, enabling data to get shuffled 105 | *within* a block. 106 | """ 107 | def __init__(self, data : torch.Tensor) -> None: 108 | self.data = data 109 | 110 | def __getitem__(self, idx : int) -> torch.Tensor: 111 | return [d[idx] for d in self.data] 112 | 113 | def __len__(self) -> int: 114 | return len(self.data[0]) 115 | 116 | 117 | class HDFDataset(torch.utils.data.Dataset): 118 | """ 119 | Reads and collects data from an HDF file with three datasets: "nodes", 120 | "edges", and "APDs". 121 | """ 122 | def __init__(self, path : str) -> None: 123 | 124 | self.path = path 125 | hdf_file = h5py.File(self.path, "r+", swmr=True) 126 | 127 | # load each HDF dataset 128 | self.nodes = hdf_file.get("nodes") 129 | self.edges = hdf_file.get("edges") 130 | self.apds = hdf_file.get("APDs") 131 | 132 | # get the number of elements in the dataset 133 | self.n_subgraphs = self.nodes.shape[0] 134 | 135 | def __getitem__(self, idx : int) -> \ 136 | Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 137 | 138 | # returns specific graph elements 139 | nodes_i = torch.from_numpy(self.nodes[idx]).type(torch.float32) 140 | edges_i = torch.from_numpy(self.edges[idx]).type(torch.float32) 141 | apd_i = torch.from_numpy(self.apds[idx]).type(torch.float32) 142 | 143 | return (nodes_i, edges_i, apd_i) 144 | 145 | def __len__(self) -> int: 146 | # returns the number of graphs in the dataset 147 | return self.n_subgraphs 148 | -------------------------------------------------------------------------------- /graphinvent/DataProcesser.py: -------------------------------------------------------------------------------- 1 | """ 2 | The `DataProcesser` class contains functions for pre-processing training data. 3 | """ 4 | # load general packages and functions 5 | import os 6 | import numpy as np 7 | import rdkit 8 | import h5py 9 | from tqdm import tqdm 10 | 11 | # load GraphINVENT-specific functions 12 | from Analyzer import Analyzer 13 | from parameters.constants import constants 14 | import parameters.load as load 15 | from MolecularGraph import PreprocessingGraph 16 | import util 17 | 18 | 19 | class DataProcesser: 20 | """ 21 | A class for preprocessing molecular sets and writing them to HDF files. 22 | """ 23 | def __init__(self, path : str, is_training_set : bool=False) -> None: 24 | """ 25 | Args: 26 | ---- 27 | path (string) : Full path/filename to SMILES file containing 28 | molecules. 29 | is_training_set (bool) : Indicates if this is the training set, as we 30 | calculate a few additional things for the training 31 | set. 
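        Example (illustrative sketch only; assumes the interpreter is started
        from the repository root and that `parameters.constants` has already
        been configured for this dataset):

        >>> processor = DataProcesser(path="data/pre-training/gdb13_1K-debug/train.smi",
        ...                           is_training_set=True)
        >>> processor.preprocess()  # writes "train.h5" alongside "train.smi"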
32 | """ 33 | # define some variables for later use 34 | self.path = path 35 | self.is_training_set = is_training_set 36 | self.dataset_names = ["nodes", "edges", "APDs"] 37 | self.get_dataset_dims() # creates `self.dims` 38 | 39 | # load the molecules 40 | self.molecule_set = load.molecules(self.path) 41 | 42 | # placeholders 43 | self.molecule_subset = None 44 | self.dataset = None 45 | self.skip_collection = None 46 | self.resume_idx = None 47 | self.ts_properties = None 48 | self.restart_index_file = None 49 | self.hdf_file = None 50 | self.dataset_size = None 51 | 52 | # get total number of molecules, and total number of subgraphs in their 53 | # decoding routes 54 | self.n_molecules = len(self.molecule_set) 55 | self.total_n_subgraphs = self.get_n_subgraphs() 56 | print(f"-- {self.n_molecules} molecules in set.", flush=True) 57 | print(f"-- {self.total_n_subgraphs} total subgraphs in set.", 58 | flush=True) 59 | 60 | def preprocess(self) -> None: 61 | """ 62 | Prepares an HDF file to save three different datasets to it (`nodes`, 63 | `edges`, `APDs`), and slowly fills it in by looping over all the 64 | molecules in the data in groups (or "mini-batches"). 65 | """ 66 | with h5py.File(f"{self.path[:-3]}h5.chunked", "a") as self.hdf_file: 67 | 68 | self.restart_index_file = constants.dataset_dir + "index.restart" 69 | 70 | if constants.restart and os.path.exists(self.restart_index_file): 71 | self.restart_preprocessing_job() 72 | else: 73 | self.start_new_preprocessing_job() 74 | 75 | # keep track of the dataset size (to resize later) 76 | self.dataset_size = 0 77 | 78 | self.ts_properties = None 79 | 80 | # this is where we fill the datasets with actual data by looping 81 | # over subgraphs in blocks of size `constants.batch_size` 82 | for idx in range(0, self.total_n_subgraphs, constants.batch_size): 83 | 84 | if not self.skip_collection: 85 | 86 | self.get_molecule_subset() 87 | 88 | # add `constants.batch_size` subgraphs from 89 | # `self.molecule_subset` to the dataset (and if training 90 | # set, calculate their properties and add these to 91 | # `self.ts_properties`) 92 | self.get_subgraphs(init_idx=idx) 93 | 94 | util.write_last_molecule_idx( 95 | last_molecule_idx=self.resume_idx, 96 | dataset_size=self.dataset_size, 97 | restart_file_path=constants.dataset_dir 98 | ) 99 | 100 | 101 | if self.resume_idx == self.n_molecules: 102 | # all molecules have been processed 103 | 104 | self.resize_datasets() # remove padding from initialization 105 | print("Datasets resized.", flush=True) 106 | 107 | if self.is_training_set and not constants.restart: 108 | 109 | print("Writing training set properties.", flush=True) 110 | util.write_ts_properties( 111 | training_set_properties=self.ts_properties 112 | ) 113 | 114 | break 115 | 116 | print("* Resaving datasets in unchunked format.") 117 | self.resave_datasets_unchunked() 118 | 119 | def restart_preprocessing_job(self) -> None: 120 | """ 121 | Restarts a preprocessing job. Uses an index specified in the dataset 122 | directory to know where to resume preprocessing. 
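        Sketch of the expected behaviour (based on the body below): the index
        is read back as a `(last_molecule_idx, dataset_size)` pair via
        `util.read_last_molecule_idx()`; if the restart file cannot be read,
        both counters fall back to 0 and preprocessing restarts from the first
        molecule.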
123 | """ 124 | try: 125 | self.resume_idx, self.dataset_size = util.read_last_molecule_idx( 126 | restart_file_path=constants.dataset_dir 127 | ) 128 | except: 129 | self.resume_idx, self.dataset_size = 0, 0 130 | self.skip_collection = bool( 131 | self.resume_idx == self.n_molecules and self.is_training_set 132 | ) 133 | 134 | # load dictionary of previously created datasets (`self.dataset`) 135 | self.load_datasets(hdf_file=self.hdf_file) 136 | 137 | def start_new_preprocessing_job(self) -> None: 138 | """ 139 | Starts a fresh preprocessing job. 140 | """ 141 | self.resume_idx = 0 142 | self.skip_collection = False 143 | 144 | # create a dictionary of empty HDF datasets (`self.dataset`) 145 | self.create_datasets(hdf_file=self.hdf_file) 146 | 147 | def resave_datasets_unchunked(self) -> None: 148 | """ 149 | Resaves the HDF datasets in an unchunked format to remove initial 150 | padding. 151 | """ 152 | with h5py.File(f"{self.path[:-3]}h5.chunked", "r", swmr=True) as chunked_file: 153 | keys = list(chunked_file.keys()) 154 | data = [chunked_file.get(key)[:] for key in keys] 155 | data_zipped = tuple(zip(data, keys)) 156 | 157 | with h5py.File(f"{self.path[:-3]}h5", "w") as unchunked_file: 158 | for d, k in tqdm(data_zipped): 159 | unchunked_file.create_dataset( 160 | k, chunks=None, data=d, dtype=np.dtype("int8") 161 | ) 162 | 163 | # remove the restart file and chunked file (don't need them anymore) 164 | os.remove(self.restart_index_file) 165 | os.remove(f"{self.path[:-3]}h5.chunked") 166 | 167 | def get_subgraphs(self, init_idx : int) -> None: 168 | """ 169 | Adds `constants.batch_size` subgraphs from `self.molecule_subset` to the 170 | HDF dataset (and if currently processing the training set, also 171 | calculates the full graphs' properties and adds these to 172 | `self.ts_properties`). 173 | 174 | Args: 175 | ---- 176 | init_idx (int) : As analysis is done in blocks/slices, `init_idx` is 177 | the start index for the next block/slice to be taken 178 | from `self.molecule_subset`. 
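        Toy illustration of the APD-merging rule used below (the array length
        is hypothetical, not a real APD dimension): when a newly decoded
        subgraph matches one already collected, its APD is summed into the
        existing entry rather than appended as a new one.

        >>> import numpy as np
        >>> existing_apd = np.array([0., 1., 0.])
        >>> existing_apd += np.array([1., 0., 0.])  # duplicate subgraph encountered
        >>> existing_apd
        array([1., 1., 0.])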
179 | """ 180 | data_subgraphs, data_apds, molecular_graph_list = [], [], [] # initialize 181 | 182 | # convert all molecules in `self.molecules_subset` to `PreprocessingGraphs` 183 | molecular_graph_generator = map(self.get_graph, self.molecule_subset) 184 | 185 | molecules_processed = 0 # keep track of the number of molecules processed 186 | 187 | # loop over all the `PreprocessingGraph`s 188 | for graph in molecular_graph_generator: 189 | molecules_processed += 1 190 | 191 | # store `PreprocessingGraph` object 192 | molecular_graph_list.append(graph) 193 | 194 | # get the number of decoding graphs 195 | n_subgraphs = graph.get_decoding_route_length() 196 | 197 | for new_subgraph_idx in range(n_subgraphs): 198 | 199 | # `get_decoding_route_state() returns a list of [`subgraph`, `apd`], 200 | subgraph, apd = graph.get_decoding_route_state( 201 | subgraph_idx=new_subgraph_idx 202 | ) 203 | 204 | # "collect" all APDs corresponding to pre-existing subgraphs, 205 | # otherwise append both new subgraph and new APD 206 | count = 0 207 | for idx, existing_subgraph in enumerate(data_subgraphs): 208 | 209 | count += 1 210 | # check if subgraph `subgraph` is "already" in 211 | # `data_subgraphs` as `existing_subgraph`, and if so, add 212 | # the "new" APD to the "old" 213 | try: # first compare the node feature matrices 214 | nodes_equal = (subgraph[0] == existing_subgraph[0]).all() 215 | except AttributeError: 216 | nodes_equal = False 217 | try: # then compare the edge feature tensors 218 | edges_equal = (subgraph[1] == existing_subgraph[1]).all() 219 | except AttributeError: 220 | edges_equal = False 221 | 222 | # if both matrices have a match, then subgraphs are the same 223 | if nodes_equal and edges_equal: 224 | existing_apd = data_apds[idx] 225 | existing_apd += apd 226 | break 227 | 228 | # if subgraph is not already in `data_subgraphs`, append it 229 | if count == len(data_subgraphs) or count == 0: 230 | data_subgraphs.append(subgraph) 231 | data_apds.append(apd) 232 | 233 | # if `constants.batch_size` unique subgraphs have been 234 | # processed, save group to the HDF dataset 235 | len_data_subgraphs = len(data_subgraphs) 236 | if len_data_subgraphs == constants.batch_size: 237 | self.save_group(data_subgraphs=data_subgraphs, 238 | data_apds=data_apds, 239 | group_size=len_data_subgraphs, 240 | init_idx=init_idx) 241 | 242 | # get molecular properties for group iff it's the training set 243 | self.get_ts_properties(molecular_graphs=molecular_graph_list, 244 | group_size=constants.batch_size) 245 | 246 | # keep track of the last molecule to be processed in 247 | # `self.resume_idx` 248 | # number of molecules processed: 249 | self.resume_idx += molecules_processed 250 | # subgraphs processed: 251 | self.dataset_size += constants.batch_size 252 | 253 | return None 254 | 255 | n_processed_subgraphs = len(data_subgraphs) 256 | 257 | # save group with < `constants.batch_size` subgraphs (e.g. 
last block) 258 | self.save_group(data_subgraphs=data_subgraphs, 259 | data_apds=data_apds, 260 | group_size=n_processed_subgraphs, 261 | init_idx=init_idx) 262 | 263 | # get molecular properties for this group iff it's the training set 264 | self.get_ts_properties(molecular_graphs=molecular_graph_list, 265 | group_size=constants.batch_size) 266 | 267 | # keep track of the last molecule to be processed in `self.resume_idx` 268 | self.resume_idx += molecules_processed # number of molecules processed 269 | self.dataset_size += molecules_processed # subgraphs processed 270 | 271 | return None 272 | 273 | def create_datasets(self, hdf_file : h5py._hl.files.File) -> None: 274 | """ 275 | Creates a dictionary of HDF5 datasets (`self.dataset`). 276 | 277 | Args: 278 | ---- 279 | hdf_file (h5py._hl.files.File) : HDF5 file which will contain the datasets. 280 | """ 281 | self.dataset = {} # initialize 282 | 283 | for ds_name in self.dataset_names: 284 | self.dataset[ds_name] = hdf_file.create_dataset( 285 | ds_name, 286 | (self.total_n_subgraphs, *self.dims[ds_name]), 287 | chunks=True, # must be True for resizing later 288 | dtype=np.dtype("int8") 289 | ) 290 | 291 | def resize_datasets(self) -> None: 292 | """ 293 | Resizes the HDF datasets, since much longer datasets are initialized 294 | when first creating the HDF datasets (it it is impossible to predict 295 | how many graphs will be equivalent beforehand). 296 | """ 297 | for dataset_name in self.dataset_names: 298 | try: 299 | self.dataset[dataset_name].resize( 300 | (self.dataset_size, *self.dims[dataset_name])) 301 | except KeyError: # `f_term` has no extra dims 302 | self.dataset[dataset_name].resize((self.dataset_size,)) 303 | 304 | def get_dataset_dims(self) -> None: 305 | """ 306 | Calculates the dimensions of the node features, edge features, and APDs, 307 | and stores them as lists in a dict (`self.dims`), where keys are the 308 | dataset name. 309 | 310 | Shapes: 311 | ------ 312 | dims["nodes"] : [max N nodes, N atom types + N formal charges] 313 | dims["edges"] : [max N nodes, max N nodes, N bond types] 314 | dims["APDs"] : [APD length = f_add length + f_conn length + f_term length] 315 | """ 316 | self.dims = {} 317 | self.dims["nodes"] = constants.dim_nodes 318 | self.dims["edges"] = constants.dim_edges 319 | self.dims["APDs"] = constants.dim_apd 320 | 321 | def get_graph(self, mol : rdkit.Chem.Mol) -> PreprocessingGraph: 322 | """ 323 | Converts an `rdkit.Chem.Mol` object to `PreprocessingGraph`. 324 | 325 | Args: 326 | ---- 327 | mol (rdkit.Chem.Mol) : Molecule to convert. 328 | 329 | Returns: 330 | ------- 331 | molecular_graph (PreprocessingGraph) : Molecule, now as a graph. 332 | """ 333 | if mol is not None: 334 | if not constants.use_aromatic_bonds: 335 | rdkit.Chem.Kekulize(mol, clearAromaticFlags=True) 336 | molecular_graph = PreprocessingGraph(molecule=mol, 337 | constants=constants) 338 | return molecular_graph 339 | 340 | def get_molecule_subset(self) -> None: 341 | """ 342 | Slices `self.molecule_set` into a subset of molecules of size 343 | `constants.batch_size`, starting from `self.resume_idx`. 344 | `self.n_molecules` is the number of molecules in the full 345 | `self.molecule_set`. 
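        Worked example (hypothetical values): with `self.resume_idx == 100` and
        `constants.batch_size == 50`, this collects the non-`None` molecules
        with running index 100 through 149 into `self.molecule_subset`.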
346 | """ 347 | init_idx = self.resume_idx 348 | subset_size = constants.batch_size 349 | self.molecule_subset = [] 350 | max_idx = min(init_idx + subset_size, self.n_molecules) 351 | 352 | count = -1 353 | for mol in self.molecule_set: 354 | if mol is not None: 355 | count += 1 356 | if count < init_idx: 357 | continue 358 | elif count >= max_idx: 359 | return self.molecule_subset 360 | else: 361 | self.molecule_subset.append(mol) 362 | 363 | def get_n_subgraphs(self) -> int: 364 | """ 365 | Calculates the total number of subgraphs in the decoding route of all 366 | molecules in `self.molecule_set`. Loads training, testing, or validation 367 | set. First, the `PreprocessingGraph` for each molecule is obtained, and 368 | then the length of the decoding route is trivially calculated for each. 369 | 370 | Returns: 371 | ------- 372 | n_subgraphs (int) : Sum of number of subgraphs in decoding routes for 373 | all molecules in `self.molecule_set`. 374 | """ 375 | n_subgraphs = 0 # start the count 376 | 377 | # convert molecules in `self.molecule_set` to `PreprocessingGraph`s 378 | molecular_graph_generator = map(self.get_graph, self.molecule_set) 379 | 380 | # loop over all the `PreprocessingGraph`s 381 | for molecular_graph in molecular_graph_generator: 382 | 383 | # get the number of decoding graphs (i.e. the decoding route length) 384 | # and add them to the running count 385 | n_subgraphs += molecular_graph.get_decoding_route_length() 386 | 387 | return int(n_subgraphs) 388 | 389 | def get_ts_properties(self, molecular_graphs : list, group_size : int) -> \ 390 | None: 391 | """ 392 | Gets molecular properties for group of molecular graphs, only for the 393 | training set. 394 | 395 | Args: 396 | ---- 397 | molecular_graphs (list) : Contains `PreprocessingGraph`s. 398 | group_size (int) : Size of "group" (i.e. slice of graphs). 399 | """ 400 | if self.is_training_set: 401 | 402 | analyzer = Analyzer() 403 | ts_properties = analyzer.evaluate_training_set( 404 | preprocessing_graphs=molecular_graphs 405 | ) 406 | 407 | # merge properties of current group with the previous group analyzed 408 | if self.ts_properties: # `self.ts_properties` is a dictionary 409 | self.ts_properties = analyzer.combine_ts_properties( 410 | prev_properties=self.ts_properties, 411 | next_properties=ts_properties, 412 | weight_next=group_size 413 | ) 414 | else: # `self.ts_properties` is None (has not been calculated yet) 415 | self.ts_properties = ts_properties 416 | else: 417 | self.ts_properties = None 418 | 419 | def load_datasets(self, hdf_file : h5py._hl.files.File) -> None: 420 | """ 421 | Creates a dictionary of HDF datasets (`self.dataset`) which have been 422 | previously created (for restart jobs only). 423 | 424 | Args: 425 | ---- 426 | hdf_file (h5py._hl.files.File) : HDF file containing all the datasets. 427 | """ 428 | self.dataset = {} # initialize dictionary of datasets 429 | 430 | # use the names of the datasets as the keys in `self.dataset` 431 | for ds_name in self.dataset_names: 432 | self.dataset[ds_name] = hdf_file.get(ds_name) 433 | 434 | def save_group(self, data_subgraphs : list, data_apds : list, 435 | group_size : int, init_idx : int) -> None: 436 | """ 437 | Saves a group of padded subgraphs and their corresponding APDs to the HDF 438 | datasets as `numpy.ndarray`s. 439 | 440 | Args: 441 | ---- 442 | data_subgraphs (list) : Contains molecular subgraphs. 443 | data_apds (list) : Contains APDs. 444 | group_size (int) : Size of HDF "slice". 445 | init_idx (int) : Index to begin slicing. 
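        Worked example (hypothetical values): with `init_idx == 2000` and
        `group_size == 1000`, the arrays built below are written into rows
        2000:3000 of the "nodes", "edges", and "APDs" HDF datasets.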
446 | """ 447 | # convert to `np.ndarray`s 448 | nodes = np.array([graph_tuple[0] for graph_tuple in data_subgraphs]) 449 | edges = np.array([graph_tuple[1] for graph_tuple in data_subgraphs]) 450 | apds = np.array(data_apds) 451 | 452 | end_idx = init_idx + group_size # idx to end slicing 453 | 454 | # once data is padded, save it to dataset slice 455 | self.dataset["nodes"][init_idx:end_idx] = nodes 456 | self.dataset["edges"][init_idx:end_idx] = edges 457 | self.dataset["APDs"][init_idx:end_idx] = apds 458 | -------------------------------------------------------------------------------- /graphinvent/ScoringFunction.py: -------------------------------------------------------------------------------- 1 | """ 2 | This class is used for defining the scoring function(s) which can be used during 3 | fine-tuning. 4 | """ 5 | # load general packages and functions 6 | from collections import namedtuple 7 | import torch 8 | from rdkit import DataStructs 9 | from rdkit.Chem import QED, AllChem 10 | import numpy as np 11 | import sklearn 12 | from sklearn import svm 13 | 14 | class ScoringFunction: 15 | """ 16 | A class for defining the scoring function components. 17 | """ 18 | def __init__(self, constants : namedtuple) -> None: 19 | """ 20 | Args: 21 | ---- 22 | constants (namedtuple) : Contains job parameters as well as global 23 | constants. 24 | """ 25 | self.score_components = constants.score_components # list 26 | self.score_type = constants.score_type # list 27 | self.qsar_models = constants.qsar_models # dict 28 | self.device = constants.device 29 | self.max_n_nodes = constants.max_n_nodes 30 | self.score_thresholds = constants.score_thresholds 31 | 32 | self.n_graphs = None # placeholder 33 | 34 | assert len(self.score_components) == len(self.score_thresholds), \ 35 | "`score_components` and `score_thresholds` do not match." 36 | 37 | def compute_score(self, graphs : list, termination : torch.Tensor, 38 | validity : torch.Tensor, uniqueness : torch.Tensor) -> \ 39 | torch.Tensor: 40 | """ 41 | Computes the overall score for the input molecular graphs. 42 | 43 | Args: 44 | ---- 45 | graphs (list) : Contains molecular graphs to evaluate. 46 | termination (torch.Tensor) : Termination status of input molecular 47 | graphs. 48 | validity (torch.Tensor) : Validity of input molecular graphs. 49 | uniqueness (torch.Tensor) : Uniqueness of input molecular graphs. 50 | 51 | Returns: 52 | ------- 53 | final_score (torch.Tensor) : The final scores for each input graph. 
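        Worked example (hypothetical numbers): with `score_type == "binary"`,
        two score components, and `score_thresholds == [0.5, 0.5]`, a graph
        whose raw component scores are (0.8, 0.3) receives masks (1, 0), so its
        final score is 1 * 0 = 0 even before the uniqueness, validity, and
        termination factors are applied.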
54 | """ 55 | self.n_graphs = len(graphs) 56 | contributions_to_score = self.get_contributions_to_score(graphs=graphs) 57 | 58 | if len(self.score_components) == 1: 59 | final_score = contributions_to_score[0] 60 | 61 | elif self.score_type == "continuous": 62 | final_score = contributions_to_score[0] 63 | for component in contributions_to_score[1:]: 64 | final_score *= component 65 | 66 | elif self.score_type == "binary": 67 | component_masks = [] 68 | for idx, score_component in enumerate(contributions_to_score): 69 | component_mask = torch.where( 70 | score_component > self.score_thresholds[idx], 71 | torch.ones(self.n_graphs, device=self.device, dtype=torch.uint8), 72 | torch.zeros(self.n_graphs, device=self.device, dtype=torch.uint8) 73 | ) 74 | component_masks.append(component_mask) 75 | 76 | final_score = component_masks[0] 77 | for mask in component_masks[1:]: 78 | final_score *= mask 79 | final_score = final_score.float() 80 | 81 | else: 82 | raise NotImplementedError 83 | 84 | # remove contribution of duplicate molecules to the score 85 | final_score *= uniqueness 86 | 87 | # remove contribution of invalid molecules to the score 88 | final_score *= validity 89 | 90 | # remove contribution of improperly-terminated molecules to the score 91 | final_score *= termination 92 | 93 | return final_score 94 | 95 | def get_contributions_to_score(self, graphs : list) -> list: 96 | """ 97 | Returns the different elements of the score. 98 | 99 | Args: 100 | ---- 101 | graphs (list) : Contains molecular graphs to evaluate. 102 | 103 | Returns: 104 | ------- 105 | contributions_to_score (list) : Contains elements of the score due to 106 | each scoring function component. 107 | """ 108 | contributions_to_score = [] 109 | 110 | for score_component in self.score_components: 111 | if "target_size" in score_component: 112 | 113 | target_size = int(score_component[12:]) 114 | 115 | assert target_size <= self.max_n_nodes, \ 116 | "Target size > largest possible size (`max_n_nodes`)." 117 | assert 0 < target_size, "Target size must be greater than 0." 118 | 119 | target_size *= torch.ones(self.n_graphs, device=self.device) 120 | n_nodes = torch.tensor([graph.n_nodes for graph in graphs], 121 | device=self.device) 122 | max_nodes = self.max_n_nodes 123 | score = ( 124 | torch.ones(self.n_graphs, device=self.device) 125 | - torch.abs(n_nodes - target_size) 126 | / (max_nodes - target_size) 127 | ) 128 | 129 | contributions_to_score.append(score) 130 | 131 | elif score_component == "QED": 132 | mols = [graph.molecule for graph in graphs] 133 | 134 | # compute the QED score for each molecule (if possible) 135 | qed = [] 136 | for mol in mols: 137 | try: 138 | qed.append(QED.qed(mol)) 139 | except: 140 | qed.append(0.0) 141 | score = torch.tensor(qed, device=self.device) 142 | 143 | contributions_to_score.append(score) 144 | 145 | elif "activity" in score_component: 146 | mols = [graph.molecule for graph in graphs] 147 | 148 | # `score_component` has to be the key to the QSAR model in the 149 | # `self.qsar_models` dict 150 | qsar_model = self.qsar_models[score_component] 151 | score = self.compute_activity(mols, qsar_model) 152 | 153 | contributions_to_score.append(score) 154 | 155 | else: 156 | raise NotImplementedError("The score component is not defined. 
" 157 | "You can define it in " 158 | "`ScoringFunction.py`.") 159 | 160 | return contributions_to_score 161 | 162 | def compute_activity(self, mols : list, 163 | activity_model : sklearn.svm.classes.SVC) -> list: 164 | """ 165 | Note: this function may have to be tuned/replicated depending on how 166 | the activity model is saved. 167 | 168 | Args: 169 | ---- 170 | mols (list) : Contains `rdkit.Mol` objects corresponding to molecular 171 | graphs sampled. 172 | activity_model (sklearn.svm.classes.SVC) : Pre-trained QSAR model. 173 | 174 | Returns: 175 | ------- 176 | activity (list) : Contains predicted activities for input molecules. 177 | """ 178 | n_mols = len(mols) 179 | activity = torch.zeros(n_mols, device=self.device) 180 | 181 | for idx, mol in enumerate(mols): 182 | try: 183 | fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 184 | 2, 185 | nBits=2048) 186 | ecfp4 = np.zeros((2048,)) 187 | DataStructs.ConvertToNumpyArray(fingerprint, ecfp4) 188 | activity[idx] = activity_model.predict_proba([ecfp4])[0][1] 189 | except: 190 | pass # activity[idx] will remain 0.0 191 | 192 | return activity 193 | -------------------------------------------------------------------------------- /graphinvent/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/graphinvent/__init__.py -------------------------------------------------------------------------------- /graphinvent/gnn/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/graphinvent/gnn/__init__.py -------------------------------------------------------------------------------- /graphinvent/gnn/aggregation_mpnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines the `AggregationMPNN` class. 3 | """ 4 | # load general packages and functions 5 | from collections import namedtuple 6 | import torch 7 | 8 | 9 | class AggregationMPNN(torch.nn.Module): 10 | """ 11 | Abstract `AggregationMPNN` class. Specific models using this class are 12 | defined in `mpnn.py`; these are the attention networks AttS2V and AttGGNN. 13 | """ 14 | def __init__(self, constants : namedtuple) -> None: 15 | super().__init__() 16 | 17 | self.hidden_node_features = constants.hidden_node_features 18 | self.edge_features = constants.n_edge_features 19 | self.message_size = constants.message_size 20 | self.message_passes = constants.message_passes 21 | self.constants = constants 22 | 23 | def aggregate_message(self, nodes : torch.Tensor, node_neighbours : torch.Tensor, 24 | edges : torch.Tensor, mask : torch.Tensor) -> None: 25 | """ 26 | Message aggregation function, to be implemented in all `AggregationMPNN` subclasses. 27 | 28 | Args: 29 | ---- 30 | nodes (torch.Tensor) : Batch of node feature vectors. 31 | node_neighbours (torch.Tensor) : Batch of node feature vectors for neighbors. 32 | edges (torch.Tensor) : Batch of edge feature vectors. 33 | mask (torch.Tensor) : Mask for non-existing neighbors, where 34 | elements are 1 if corresponding element 35 | exists and 0 otherwise. 
36 | 37 | Shapes: 38 | ------ 39 | nodes : (total N nodes in batch, N node features) 40 | node_neighbours : (total N nodes in batch, max node degree, N node features) 41 | edges : (total N nodes in batch, max node degree, N edge features) 42 | mask : (total N nodes in batch, max node degree) 43 | """ 44 | raise NotImplementedError 45 | 46 | def update(self, nodes : torch.Tensor, messages : torch.Tensor) -> None: 47 | """ 48 | Message update function, to be implemented in all `AggregationMPNN` subclasses. 49 | 50 | Args: 51 | ---- 52 | nodes (torch.Tensor) : Batch of node feature vectors. 53 | messages (torch.Tensor) : Batch of incoming messages. 54 | 55 | Shapes: 56 | ------ 57 | nodes : (total N nodes in batch, N node features) 58 | messages : (total N nodes in batch, N node features) 59 | """ 60 | raise NotImplementedError 61 | 62 | def readout(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 63 | node_mask : torch.Tensor) -> None: 64 | """ 65 | Local readout function, to be implemented in all `AggregationMPNN` subclasses. 66 | 67 | Args: 68 | ---- 69 | hidden_nodes (torch.Tensor) : Batch of node feature vectors. 70 | input_nodes (torch.Tensor) : Batch of node feature vectors. 71 | node_mask (torch.Tensor) : Mask for non-existing neighbors, where 72 | elements are 1 if corresponding element 73 | exists and 0 otherwise. 74 | 75 | Shapes: 76 | ------ 77 | hidden_nodes : (total N nodes in batch, N node features) 78 | input_nodes : (total N nodes in batch, N node features) 79 | node_mask : (total N nodes in batch, N features) 80 | """ 81 | raise NotImplementedError 82 | 83 | def forward(self, nodes : torch.Tensor, edges : torch.Tensor) -> torch.Tensor: 84 | """ 85 | Defines forward pass. 86 | 87 | Args: 88 | ---- 89 | nodes (torch.Tensor) : Batch of node feature matrices. 90 | edges (torch.Tensor) : Batch of edge feature tensors. 91 | 92 | Shapes: 93 | ------ 94 | nodes : (batch size, N nodes, N node features) 95 | edges : (batch size, N nodes, N nodes, N edge features) 96 | 97 | Returns: 98 | ------- 99 | output (torch.Tensor) : This would normally be the learned graph 100 | representation, but in all MPNN readout functions 101 | in this work, the last layer is used to predict 102 | the action probability distribution for a batch 103 | of graphs from the learned graph representation. 
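        Illustrative output shape (assuming the APD layout described in
        `DataProcesser.get_dataset_dims`): (batch size, APD length), where the
        APD length is the summed length of the flattened f_add, f_conn, and
        f_term components.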
104 | """ 105 | adjacency = torch.sum(edges, dim=3) 106 | 107 | # **note: "idc" == "indices", "nghb{s}" == "neighbour(s)" 108 | edge_batch_batch_idc, edge_batch_node_idc, edge_batch_nghb_idc = \ 109 | adjacency.nonzero(as_tuple=True) 110 | 111 | node_batch_batch_idc, node_batch_node_idc = adjacency.sum(-1).nonzero(as_tuple=True) 112 | node_batch_adj = adjacency[node_batch_batch_idc, node_batch_node_idc, :] 113 | node_batch_size = node_batch_batch_idc.shape[0] 114 | node_degrees = node_batch_adj.sum(-1).long() 115 | max_node_degree = node_degrees.max() 116 | 117 | node_batch_node_nghbs = torch.zeros(node_batch_size, 118 | max_node_degree, 119 | self.hidden_node_features, 120 | device=self.constants.device) 121 | node_batch_edges = torch.zeros(node_batch_size, 122 | max_node_degree, 123 | self.edge_features, 124 | device=self.constants.device) 125 | 126 | node_batch_nghb_nghb_idc = torch.cat( 127 | [torch.arange(i) for i in node_degrees] 128 | ).long() 129 | 130 | edge_batch_node_batch_idc = torch.cat( 131 | [i * torch.ones(degree) for i, degree in enumerate(node_degrees)] 132 | ).long() 133 | 134 | node_batch_node_nghb_mask = torch.zeros(node_batch_size, 135 | max_node_degree, 136 | device=self.constants.device) 137 | 138 | node_batch_node_nghb_mask[edge_batch_node_batch_idc, node_batch_nghb_nghb_idc] = 1 139 | 140 | node_batch_edges[edge_batch_node_batch_idc, node_batch_nghb_nghb_idc, :] = \ 141 | edges[edge_batch_batch_idc, edge_batch_node_idc, edge_batch_nghb_idc, :] 142 | 143 | # pad up the hidden nodes 144 | hidden_nodes = torch.zeros(nodes.shape[0], 145 | nodes.shape[1], 146 | self.hidden_node_features, 147 | device=self.constants.device) 148 | hidden_nodes[:nodes.shape[0], :nodes.shape[1], :nodes.shape[2]] = nodes.clone() 149 | 150 | for _ in range(self.message_passes): 151 | 152 | node_batch_nodes = hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] 153 | node_batch_node_nghbs[edge_batch_node_batch_idc, node_batch_nghb_nghb_idc, :] = \ 154 | hidden_nodes[edge_batch_batch_idc, edge_batch_nghb_idc, :] 155 | 156 | messages = self.aggregate_message(nodes=node_batch_nodes, 157 | node_neighbours=node_batch_node_nghbs.clone(), 158 | edges=node_batch_edges, 159 | mask=node_batch_node_nghb_mask) 160 | 161 | hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] = \ 162 | self.update(node_batch_nodes.clone(), messages) 163 | 164 | node_mask = (adjacency.sum(-1) != 0) 165 | 166 | output = self.readout(hidden_nodes, nodes, node_mask) 167 | 168 | return output 169 | -------------------------------------------------------------------------------- /graphinvent/gnn/edge_mpnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines the `EdgeMPNN` class. 3 | """# load general packages and functions 4 | from collections import namedtuple 5 | import torch 6 | 7 | 8 | class EdgeMPNN(torch.nn.Module): 9 | """ 10 | Abstract `EdgeMPNN` class. A specific model using this class is defined 11 | in `mpnn.py`; this is the EMN. 
12 | """ 13 | def __init__(self, constants : namedtuple) -> None: 14 | super().__init__() 15 | 16 | self.edge_features = constants.edge_features 17 | self.edge_embedding_size = constants.edge_embedding_size 18 | self.message_passes = constants.message_passes 19 | self.n_nodes_largest_graph = constants.max_n_nodes 20 | self.constants = constants 21 | 22 | def preprocess_edges(self, nodes : torch.Tensor, node_neighbours : torch.Tensor, 23 | edges : torch.Tensor) -> None: 24 | """ 25 | Edge preprocessing step, to be implemented in all `EdgeMPNN` subclasses. 26 | 27 | Args: 28 | ---- 29 | nodes (torch.Tensor) : Batch of node feature vectors. 30 | node_neighbours (torch.Tensor) : Batch of node feature vectors for neighbors. 31 | edges (torch.Tensor) : Batch of edge feature vectors. 32 | 33 | Shapes: 34 | ------ 35 | nodes : (total N nodes in batch, N node features) 36 | node_neighbours : (total N nodes in batch, max node degree, N node features) 37 | edges : (total N nodes in batch, max node degree, N edge features) 38 | """ 39 | raise NotImplementedError 40 | 41 | def propagate_edges(self, edges : torch.Tensor, ingoing_edge_memories : torch.Tensor, 42 | ingoing_edges_mask : torch.Tensor) -> None: 43 | """ 44 | Edge propagation rule, to be implemented in all `EdgeMPNN` subclasses. 45 | 46 | Args: 47 | ---- 48 | edges (torch.Tensor) : Batch of edge feature tensors. 49 | ingoing_edge_memories (torch.Tensor) : Batch of memories for all 50 | ingoing edges. 51 | ingoing_edges_mask (torch.Tensor) : Mask for ingoing edges. 52 | 53 | Shapes: 54 | ------ 55 | edges : (batch size, N nodes, N nodes, total N edge features) 56 | ingoing_edge_memories : (total N edges in batch, total N edge features) 57 | ingoing_edges_mask : (total N edges in batch, max node degree, total N edge features) 58 | """ 59 | raise NotImplementedError 60 | 61 | def readout(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 62 | node_mask : torch.Tensor) -> None: 63 | """ 64 | Local readout function, to be implemented in all `EdgeMPNN` subclasses. 65 | 66 | Args: 67 | ---- 68 | hidden_nodes (torch.Tensor) : Batch of node feature vectors. 69 | input_nodes (torch.Tensor) : Batch of node feature vectors. 70 | node_mask (torch.Tensor) : Mask for non-existing neighbors, where 71 | elements are 1 if corresponding element 72 | exists and 0 otherwise. 73 | 74 | Shapes: 75 | ------ 76 | hidden_nodes : (total N nodes in batch, N node features) 77 | input_nodes : (total N nodes in batch, N node features) 78 | node_mask : (total N nodes in batch, N features) 79 | """ 80 | raise NotImplementedError 81 | 82 | def forward(self, nodes : torch.Tensor, edges : torch.Tensor) -> torch.Tensor: 83 | """ 84 | Defines forward pass. 85 | 86 | Args: 87 | ---- 88 | nodes (torch.Tensor) : Batch of node feature matrices. 89 | edges (torch.Tensor) : Batch of edge feature tensors. 90 | 91 | Shapes: 92 | ------ 93 | nodes : (batch size, N nodes, N node features) 94 | edges : (batch size, N nodes, N nodes, N edge features) 95 | 96 | Returns: 97 | ------- 98 | output (torch.Tensor) : This would normally be the learned graph representation, 99 | but in all MPNN readout functions in this work, 100 | the last layer is used to predict the action 101 | probability distribution for a batch of graphs from 102 | the learned graph representation. 
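The `forward()` implementation below starts by collapsing the one-hot edge-feature tensor into a plain adjacency matrix (`adjacency = torch.sum(edges, dim=3)`) and then recovers edge indices with `nonzero()`. On a toy single-graph, 3-node chain (bond-type channels following the `bondtype_to_int` convention in `parameters/constants.py`) that looks like:

```
import torch

# one graph, 3 nodes, 4 one-hot bond-type channels (single/double/triple/aromatic)
edges = torch.zeros(1, 3, 3, 4)
edges[0, 0, 1, 0] = 1.0   # bond 0-1 is a single bond (channel 0) ...
edges[0, 1, 0, 0] = 1.0   # ... stored symmetrically
edges[0, 1, 2, 1] = 1.0   # bond 1-2 is a double bond (channel 1)
edges[0, 2, 1, 1] = 1.0

adjacency = torch.sum(edges, dim=3)   # (1, 3, 3) connectivity matrix
print(adjacency[0])
# tensor([[0., 1., 0.],
#         [1., 0., 1.],
#         [0., 1., 0.]])
print(adjacency.nonzero(as_tuple=True))
# (batch indices, node indices, neighbour indices) for the four directed edges
```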
103 | """ 104 | adjacency = torch.sum(edges, dim=3) 105 | 106 | # indices for finding edges in batch; `edges_b_idx` is batch index, 107 | # `edges_n_idx` is the node index, and `edges_nghb_idx` is the index 108 | # that each node in `edges_n_idx` is bound to 109 | edges_b_idx, edges_n_idx, edges_nghb_idx = adjacency.nonzero(as_tuple=True) 110 | 111 | n_edges = edges_n_idx.shape[0] 112 | adj_of_edge_batch_idc = adjacency.clone().long() 113 | 114 | # +1 to distinguish idx 0 from empty elements, subtracted few lines down 115 | r = torch.arange(1, n_edges + 1, device=self.constants.device) 116 | 117 | adj_of_edge_batch_idc[edges_b_idx, edges_n_idx, edges_nghb_idx] = r 118 | 119 | ingoing_edges_eb_idx = ( 120 | torch.cat([row[row.nonzero()] for row in 121 | adj_of_edge_batch_idc[edges_b_idx, edges_nghb_idx, :]]) - 1 122 | ).squeeze() 123 | 124 | edge_degrees = adjacency[edges_b_idx, edges_nghb_idx, :].sum(-1).long() 125 | ingoing_edges_igeb_idx = torch.cat( 126 | [i * torch.ones(d) for i, d in enumerate(edge_degrees)] 127 | ).long() 128 | ingoing_edges_ige_idx = torch.cat([torch.arange(i) for i in edge_degrees]).long() 129 | 130 | 131 | batch_size = adjacency.shape[0] 132 | n_nodes = adjacency.shape[1] 133 | max_node_degree = adjacency.sum(-1).max().int() 134 | edge_memories = torch.zeros(n_edges, 135 | self.edge_embedding_size, 136 | device=self.constants.device) 137 | 138 | ingoing_edge_memories = torch.zeros(n_edges, max_node_degree, 139 | self.edge_embedding_size, 140 | device=self.constants.device) 141 | ingoing_edges_mask = torch.zeros(n_edges, 142 | max_node_degree, 143 | device=self.constants.device) 144 | 145 | edge_batch_nodes = nodes[edges_b_idx, edges_n_idx, :] 146 | # **note: "nghb{s}" == "neighbour(s)" 147 | edge_batch_nghbs = nodes[edges_b_idx, edges_nghb_idx, :] 148 | edge_batch_edges = edges[edges_b_idx, edges_n_idx, edges_nghb_idx, :] 149 | edge_batch_edges = self.preprocess_edges(nodes=edge_batch_nodes, 150 | node_neighbours=edge_batch_nghbs, 151 | edges=edge_batch_edges) 152 | 153 | # remove h_ji:s influence on h_ij 154 | ingoing_edges_nghb_idx = edges_nghb_idx[ingoing_edges_eb_idx] 155 | ingoing_edges_receiving_edge_n_idx = edges_n_idx[ingoing_edges_igeb_idx] 156 | diff_idx = (ingoing_edges_receiving_edge_n_idx != ingoing_edges_nghb_idx).nonzero() 157 | 158 | try: 159 | ingoing_edges_eb_idx = ingoing_edges_eb_idx[diff_idx].squeeze() 160 | ingoing_edges_ige_idx = ingoing_edges_ige_idx[diff_idx].squeeze() 161 | ingoing_edges_igeb_idx = ingoing_edges_igeb_idx[diff_idx].squeeze() 162 | except: 163 | pass 164 | 165 | ingoing_edges_mask[ingoing_edges_igeb_idx, ingoing_edges_ige_idx] = 1 166 | 167 | for _ in range(self.message_passes): 168 | ingoing_edge_memories[ingoing_edges_igeb_idx, ingoing_edges_ige_idx, :] = \ 169 | edge_memories[ingoing_edges_eb_idx, :] 170 | edge_memories = self.propagate_edges( 171 | edges=edge_batch_edges, 172 | ingoing_edge_memories=ingoing_edge_memories.clone(), 173 | ingoing_edges_mask=ingoing_edges_mask 174 | ) 175 | 176 | node_mask = (adjacency.sum(-1) != 0) 177 | 178 | node_sets = torch.zeros(batch_size, 179 | n_nodes, 180 | max_node_degree, 181 | self.edge_embedding_size, 182 | device=self.constants.device) 183 | 184 | edge_batch_edge_memory_idc = torch.cat( 185 | [torch.arange(row.sum()) for row in adjacency.view(-1, n_nodes)] 186 | ).long() 187 | 188 | node_sets[edges_b_idx, edges_n_idx, edge_batch_edge_memory_idc, :] = edge_memories 189 | graph_sets = node_sets.sum(2) 190 | 191 | output = self.readout(graph_sets, graph_sets, node_mask) 192 | 
return output 193 | -------------------------------------------------------------------------------- /graphinvent/gnn/modules.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines MPNN modules and readout functions, and APD readout functions. 3 | """ 4 | # load general packages and functions 5 | from collections import namedtuple 6 | import torch 7 | 8 | # load GraphINVENT-specific functions 9 | # (none) 10 | 11 | 12 | class GraphGather(torch.nn.Module): 13 | """ 14 | GGNN readout function. 15 | """ 16 | def __init__(self, node_features : int, hidden_node_features : int, 17 | out_features : int, att_depth : int, att_hidden_dim : int, 18 | att_dropout_p : float, emb_depth : int, emb_hidden_dim : int, 19 | emb_dropout_p : float, big_positive : float) -> None: 20 | 21 | super().__init__() 22 | 23 | self.big_positive = big_positive 24 | 25 | self.att_nn = MLP( 26 | in_features=node_features + hidden_node_features, 27 | hidden_layer_sizes=[att_hidden_dim] * att_depth, 28 | out_features=out_features, 29 | dropout_p=att_dropout_p 30 | ) 31 | 32 | self.emb_nn = MLP( 33 | in_features=hidden_node_features, 34 | hidden_layer_sizes=[emb_hidden_dim] * emb_depth, 35 | out_features=out_features, 36 | dropout_p=emb_dropout_p 37 | ) 38 | 39 | def forward(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 40 | node_mask : torch.Tensor) -> torch.Tensor: 41 | """ 42 | Defines forward pass. 43 | """ 44 | Softmax = torch.nn.Softmax(dim=1) 45 | 46 | cat = torch.cat((hidden_nodes, input_nodes), dim=2) 47 | energy_mask = (node_mask == 0).float() * self.big_positive 48 | energies = self.att_nn(cat) - energy_mask.unsqueeze(-1) 49 | attention = Softmax(energies) 50 | embedding = self.emb_nn(hidden_nodes) 51 | 52 | return torch.sum(attention * embedding, dim=1) 53 | 54 | 55 | class Set2Vec(torch.nn.Module): 56 | """ 57 | S2V readout function. 58 | """ 59 | def __init__(self, node_features : int, hidden_node_features : int, 60 | lstm_computations : int, memory_size : int, 61 | constants : namedtuple) -> None: 62 | 63 | super().__init__() 64 | 65 | self.constants = constants 66 | self.lstm_computations = lstm_computations 67 | self.memory_size = memory_size 68 | 69 | self.embedding_matrix = torch.nn.Linear( 70 | in_features=node_features + hidden_node_features, 71 | out_features=self.memory_size, 72 | bias=True 73 | ) 74 | 75 | self.lstm = torch.nn.LSTMCell( 76 | input_size=self.memory_size, 77 | hidden_size=self.memory_size, 78 | bias=True 79 | ) 80 | 81 | def forward(self, hidden_output_nodes : torch.Tensor, input_nodes : torch.Tensor, 82 | node_mask : torch.Tensor) -> torch.Tensor: 83 | """ 84 | Defines forward pass. 
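At a shape level, the `GraphGather` readout above reduces padded per-node states to one embedding per graph. A small usage sketch follows; the dimensions and hyperparameter values are made up for illustration (they are not the repository defaults), and the import path assumes the working directory is `graphinvent/`.

```
import torch
from gnn.modules import GraphGather

gather = GraphGather(node_features=8, hidden_node_features=16, out_features=32,
                     att_depth=2, att_hidden_dim=64, att_dropout_p=0.0,
                     emb_depth=2, emb_hidden_dim=64, emb_dropout_p=0.0,
                     big_positive=1e6)

hidden_nodes = torch.rand(5, 13, 16)   # (batch, max nodes, hidden node features)
input_nodes  = torch.rand(5, 13, 8)    # (batch, max nodes, input node features)
node_mask    = torch.ones(5, 13)       # 1 = node exists, 0 = padded slot

graph_embeddings = gather(hidden_nodes, input_nodes, node_mask)   # shape (5, 32)
```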
85 | """ 86 | Softmax = torch.nn.Softmax(dim=1) 87 | 88 | batch_size = input_nodes.shape[0] 89 | energy_mask = torch.bitwise_not(node_mask).float() * self.C.big_negative 90 | lstm_input = torch.zeros(batch_size, self.memory_size, device=self.constants.device) 91 | cat = torch.cat((hidden_output_nodes, input_nodes), dim=2) 92 | memory = self.embedding_matrix(cat) 93 | hidden_state = torch.zeros(batch_size, self.memory_size, device=self.constants.device) 94 | cell_state = torch.zeros(batch_size, self.memory_size, device=self.constants.device) 95 | 96 | for _ in range(self.lstm_computations): 97 | query, cell_state = self.lstm(lstm_input, (hidden_state, cell_state)) 98 | 99 | # dot product query x memory 100 | energies = (query.view(batch_size, 1, self.memory_size) * memory).sum(dim=-1) 101 | attention = Softmax(energies + energy_mask) 102 | read = (attention.unsqueeze(-1) * memory).sum(dim=1) 103 | 104 | hidden_state = query 105 | lstm_input = read 106 | 107 | cat = torch.cat((query, read), dim=1) 108 | return cat 109 | 110 | 111 | class MLP(torch.nn.Module): 112 | """ 113 | Multi-layer perceptron. Applies SELU after every linear layer. 114 | 115 | Args: 116 | ---- 117 | in_features (int) : Size of each input sample. 118 | hidden_layer_sizes (list) : Hidden layer sizes. 119 | out_features (int) : Size of each output sample. 120 | dropout_p (float) : Probability of dropping a weight. 121 | """ 122 | 123 | def __init__(self, in_features : int, hidden_layer_sizes : list, out_features : int, 124 | dropout_p : float) -> None: 125 | super().__init__() 126 | 127 | activation_function = torch.nn.SELU 128 | 129 | # create list of all layer feature sizes 130 | fs = [in_features, *hidden_layer_sizes, out_features] 131 | 132 | # create list of linear_blocks 133 | layers = [self._linear_block(in_f, out_f, 134 | activation_function, 135 | dropout_p) 136 | for in_f, out_f in zip(fs, fs[1:])] 137 | 138 | # concatenate modules in all sequentials in layers list 139 | layers = [module for sq in layers for module in sq.children()] 140 | 141 | # add modules to sequential container 142 | self.seq = torch.nn.Sequential(*layers) 143 | 144 | def _linear_block(self, in_f : int, out_f : int, activation : torch.nn.Module, 145 | dropout_p : float) -> torch.nn.Sequential: 146 | """ 147 | Returns a linear block consisting of a linear layer, an activation function 148 | (SELU), and dropout (optional) stack. 149 | 150 | Args: 151 | ---- 152 | in_f (int) : Size of each input sample. 153 | out_f (int) : Size of each output sample. 154 | activation (torch.nn.Module) : Activation function. 155 | dropout_p (float) : Probability of dropping a weight. 156 | 157 | Returns: 158 | ------- 159 | torch.nn.Sequential : The linear block. 160 | """ 161 | # bias must be used in most MLPs in our models to learn from empty graphs 162 | linear = torch.nn.Linear(in_f, out_f, bias=True) 163 | torch.nn.init.xavier_uniform_(linear.weight) 164 | return torch.nn.Sequential(linear, activation(), torch.nn.AlphaDropout(dropout_p)) 165 | 166 | def forward(self, layers_input : torch.nn.Sequential) -> torch.nn.Sequential: 167 | """ 168 | Defines forward pass. 169 | """ 170 | return self.seq(layers_input) 171 | 172 | 173 | class GlobalReadout(torch.nn.Module): 174 | """ 175 | Global readout function class. Used to predict the action probability distributions 176 | (APDs) for molecular graphs. 177 | 178 | The first tier of two `MLP`s take as input, for each graph in the batch, the 179 | final transformed node feature vectors. 
These feed-forward networks correspond 180 | to the preliminary "f_add" and "f_conn" distributions. 181 | 182 | The second tier of three `MLP`s takes as input the output of the first tier 183 | of `MLP`s (the "preliminary" APDs) as well as the graph embeddings for all 184 | graphs in the batch. Output are the final APD components, which are then flattened 185 | and concatenated. No activation function is applied after the final layer, so 186 | that this can be done outside (e.g. in the loss function, and before sampling). 187 | """ 188 | def __init__(self, f_add_elems : int, f_conn_elems : int, f_term_elems : int, 189 | mlp1_depth : int, mlp1_dropout_p : float, mlp1_hidden_dim : int, 190 | mlp2_depth : int, mlp2_dropout_p : float, mlp2_hidden_dim : int, 191 | graph_emb_size : int, max_n_nodes : int, node_emb_size : int, 192 | device : str) -> None: 193 | super().__init__() 194 | 195 | self.device = device 196 | 197 | # preliminary f_add 198 | self.fAddNet1 = MLP( 199 | in_features=node_emb_size, 200 | hidden_layer_sizes=[mlp1_hidden_dim] * mlp1_depth, 201 | out_features=f_add_elems, 202 | dropout_p=mlp1_dropout_p 203 | ) 204 | 205 | # preliminary f_conn 206 | self.fConnNet1 = MLP( 207 | in_features=node_emb_size, 208 | hidden_layer_sizes=[mlp1_hidden_dim] * mlp1_depth, 209 | out_features=f_conn_elems, 210 | dropout_p=mlp1_dropout_p 211 | ) 212 | 213 | # final f_add 214 | self.fAddNet2 = MLP( 215 | in_features=(max_n_nodes * f_add_elems + graph_emb_size), 216 | hidden_layer_sizes=[mlp2_hidden_dim] * mlp2_depth, 217 | out_features=f_add_elems * max_n_nodes, 218 | dropout_p=mlp2_dropout_p 219 | ) 220 | 221 | # final f_conn 222 | self.fConnNet2 = MLP( 223 | in_features=(max_n_nodes * f_conn_elems + graph_emb_size), 224 | hidden_layer_sizes=[mlp2_hidden_dim] * mlp2_depth, 225 | out_features=f_conn_elems * max_n_nodes, 226 | dropout_p=mlp2_dropout_p 227 | ) 228 | 229 | # final f_term (only takes as input graph embeddings) 230 | self.fTermNet2 = MLP( 231 | in_features=graph_emb_size, 232 | hidden_layer_sizes=[mlp2_hidden_dim] * mlp2_depth, 233 | out_features=f_term_elems, 234 | dropout_p=mlp2_dropout_p 235 | ) 236 | 237 | def forward(self, node_level_output : torch.Tensor, 238 | graph_embedding_batch : torch.Tensor) -> torch.Tensor: 239 | """ 240 | Defines forward pass. 241 | """ 242 | if self.device == "cuda": 243 | self.fAddNet1 = self.fAddNet1.to("cuda", non_blocking=True) 244 | self.fConnNet1 = self.fConnNet1.to("cuda", non_blocking=True) 245 | self.fAddNet2 = self.fAddNet2.to("cuda", non_blocking=True) 246 | self.fConnNet2 = self.fConnNet2.to("cuda", non_blocking=True) 247 | self.fTermNet2 = self.fTermNet2.to("cuda", non_blocking=True) 248 | 249 | # get preliminary f_add and f_conn 250 | f_add_1 = self.fAddNet1(node_level_output) 251 | f_conn_1 = self.fConnNet1(node_level_output) 252 | 253 | if self.device == "cuda": 254 | f_add_1 = f_add_1.to("cuda", non_blocking=True) 255 | f_conn_1 = f_conn_1.to("cuda", non_blocking=True) 256 | 257 | # reshape preliminary APDs into flattenened vectors (e.g. 
one vector per 258 | # graph in batch) 259 | f_add_1_size = f_add_1.size() 260 | f_conn_1_size = f_conn_1.size() 261 | f_add_1 = f_add_1.view((f_add_1_size[0], f_add_1_size[1] * f_add_1_size[2])) 262 | f_conn_1 = f_conn_1.view((f_conn_1_size[0], f_conn_1_size[1] * f_conn_1_size[2])) 263 | 264 | # get final f_add, f_conn, and f_term 265 | f_add_2 = self.fAddNet2( 266 | torch.cat((f_add_1, graph_embedding_batch), dim=1).unsqueeze(dim=1) 267 | ) 268 | f_conn_2 = self.fConnNet2( 269 | torch.cat((f_conn_1, graph_embedding_batch), dim=1).unsqueeze(dim=1) 270 | ) 271 | f_term_2 = self.fTermNet2(graph_embedding_batch) 272 | 273 | if self.device == "cuda": 274 | f_add_2 = f_add_2.to("cuda", non_blocking=True) 275 | f_conn_2 = f_conn_2.to("cuda", non_blocking=True) 276 | f_term_2 = f_term_2.to("cuda", non_blocking=True) 277 | 278 | # flatten and concatenate 279 | cat = torch.cat((f_add_2.squeeze(dim=1), f_conn_2.squeeze(dim=1), f_term_2), dim=1) 280 | 281 | return cat # note: no activation function before returning 282 | -------------------------------------------------------------------------------- /graphinvent/gnn/summation_mpnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines the `SummationMPNN` class. 3 | """ 4 | # load general packages and functions 5 | from collections import namedtuple 6 | import torch 7 | 8 | 9 | class SummationMPNN(torch.nn.Module): 10 | """ 11 | Abstract `SummationMPNN` class. Specific models using this class are 12 | defined in `mpnn.py`; these are MNN, S2V, and GGNN. 13 | """ 14 | def __init__(self, constants : namedtuple): 15 | 16 | super().__init__() 17 | 18 | self.hidden_node_features = constants.hidden_node_features 19 | self.edge_features = constants.n_edge_features 20 | self.message_size = constants.message_size 21 | self.message_passes = constants.message_passes 22 | self.constants = constants 23 | 24 | def message_terms(self, nodes : torch.Tensor, node_neighbours : torch.Tensor, 25 | edges : torch.Tensor) -> None: 26 | """ 27 | Message passing function, to be implemented in all `SummationMPNN` subclasses. 28 | 29 | Args: 30 | ---- 31 | nodes (torch.Tensor) : Batch of node feature vectors. 32 | node_neighbours (torch.Tensor) : Batch of node feature vectors for neighbors. 33 | edges (torch.Tensor) : Batch of edge feature vectors. 34 | 35 | Shapes: 36 | ------ 37 | nodes : (total N nodes in batch, N node features) 38 | node_neighbours : (total N nodes in batch, max node degree, N node features) 39 | edges : (total N nodes in batch, max node degree, N edge features) 40 | """ 41 | raise NotImplementedError 42 | 43 | def update(self, nodes : torch.Tensor, messages : torch.Tensor) -> None: 44 | """ 45 | Message update function, to be implemented in all `SummationMPNN` subclasses. 46 | 47 | Args: 48 | ---- 49 | nodes (torch.Tensor) : Batch of node feature vectors. 50 | messages (torch.Tensor) : Batch of incoming messages. 51 | 52 | Shapes: 53 | ------ 54 | nodes : (total N nodes in batch, N node features) 55 | messages : (total N nodes in batch, N node features) 56 | """ 57 | raise NotImplementedError 58 | 59 | def readout(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 60 | node_mask : torch.Tensor) -> None: 61 | """ 62 | Local readout function, to be implemented in all `SummationMPNN` subclasses. 63 | 64 | Args: 65 | ---- 66 | hidden_nodes (torch.Tensor) : Batch of node feature vectors. 67 | input_nodes (torch.Tensor) : Batch of node feature vectors. 
68 | node_mask (torch.Tensor) : Mask for non-existing neighbors, where elements 69 | are 1 if corresponding element exists and 0 70 | otherwise. 71 | 72 | Shapes: 73 | ------ 74 | hidden_nodes : (total N nodes in batch, N node features) 75 | input_nodes : (total N nodes in batch, N node features) 76 | node_mask : (total N nodes in batch, N features) 77 | """ 78 | raise NotImplementedError 79 | 80 | def forward(self, nodes : torch.Tensor, edges : torch.Tensor) -> None: 81 | """ 82 | Defines forward pass. 83 | 84 | Args: 85 | ---- 86 | nodes (torch.Tensor) : Batch of node feature matrices. 87 | edges (torch.Tensor) : Batch of edge feature tensors. 88 | 89 | Shapes: 90 | ------ 91 | nodes : (batch size, N nodes, N node features) 92 | edges : (batch size, N nodes, N nodes, N edge features) 93 | 94 | Returns: 95 | ------- 96 | output (torch.Tensor) : This would normally be the learned graph representation, 97 | but in all MPNN readout functions in this work, 98 | the last layer is used to predict the action 99 | probability distribution for a batch of graphs 100 | from the learned graph representation. 101 | """ 102 | adjacency = torch.sum(edges, dim=3) 103 | 104 | # **note: "idc" == "indices", "nghb{s}" == "neighbour(s)" 105 | (edge_batch_batch_idc, 106 | edge_batch_node_idc, 107 | edge_batch_nghb_idc) = adjacency.nonzero(as_tuple=True) 108 | 109 | (node_batch_batch_idc, node_batch_node_idc) = adjacency.sum(-1).nonzero(as_tuple=True) 110 | 111 | same_batch = node_batch_batch_idc.view(-1, 1) == edge_batch_batch_idc 112 | same_node = node_batch_node_idc.view(-1, 1) == edge_batch_node_idc 113 | 114 | # element ij of `message_summation_matrix` is 1 if `edge_batch_edges[j]` 115 | # is connected with `node_batch_nodes[i]`, else 0 116 | message_summation_matrix = (same_batch * same_node).float() 117 | 118 | edge_batch_edges = edges[edge_batch_batch_idc, edge_batch_node_idc, edge_batch_nghb_idc, :] 119 | 120 | # pad up the hidden nodes 121 | hidden_nodes = torch.zeros(nodes.shape[0], 122 | nodes.shape[1], 123 | self.hidden_node_features, 124 | device=self.constants.device) 125 | hidden_nodes[:nodes.shape[0], :nodes.shape[1], :nodes.shape[2]] = nodes.clone() 126 | node_batch_nodes = hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] 127 | 128 | for _ in range(self.message_passes): 129 | edge_batch_nodes = hidden_nodes[edge_batch_batch_idc, edge_batch_node_idc, :] 130 | 131 | edge_batch_nghbs = hidden_nodes[edge_batch_batch_idc, edge_batch_nghb_idc, :] 132 | 133 | message_terms = self.message_terms(edge_batch_nodes, 134 | edge_batch_nghbs, 135 | edge_batch_edges) 136 | 137 | if len(message_terms.size()) == 1: # if a single graph in batch 138 | message_terms = message_terms.unsqueeze(0) 139 | 140 | # the summation in eq. 1 of the NMPQC paper happens here 141 | messages = torch.matmul(message_summation_matrix, message_terms) 142 | 143 | node_batch_nodes = self.update(node_batch_nodes, messages) 144 | hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] = node_batch_nodes.clone() 145 | 146 | node_mask = adjacency.sum(-1) != 0 147 | output = self.readout(hidden_nodes, nodes, node_mask) 148 | 149 | return output 150 | -------------------------------------------------------------------------------- /graphinvent/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Main function for running GraphINVENT jobs. 
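The single `torch.matmul` in the `SummationMPNN.forward()` loop above is what routes per-edge message terms to their receiving nodes: row i of `message_summation_matrix` flags every directed edge whose node index is i. A toy single-graph illustration (so the `same_batch` factor collapses to all-ones and is dropped):

```
import torch

# a 3-node chain 0-1-2 gives four directed edges with node indices [0, 1, 1, 2]
edge_batch_node_idc = torch.tensor([0, 1, 1, 2])
node_batch_node_idc = torch.tensor([0, 1, 2])

same_node = node_batch_node_idc.view(-1, 1) == edge_batch_node_idc
message_summation_matrix = same_node.float()    # (3 nodes, 4 edges)

message_terms = torch.tensor([[1.0], [2.0], [3.0], [4.0]])   # toy 1-D term per edge
print(torch.matmul(message_summation_matrix, message_terms))
# tensor([[1.], [5.], [4.]])  -> node 1 sums the terms of its two incident edges
```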
3 | 4 | Examples: 5 | -------- 6 | * If you define an "input.csv" with desired job parameters in job_dir/: 7 | (graphinvent) ~/GraphINVENT$ python main.py --job_dir path/to/job_dir/ 8 | * If you instead want to run your job using the submission scripts: 9 | (graphinvent) ~/GraphINVENT$ python submit-fine-tuning.py 10 | """ 11 | # load general packages and functions 12 | import datetime 13 | 14 | # load GraphINVENT-specific functions 15 | import util 16 | from parameters.constants import constants 17 | from Workflow import Workflow 18 | 19 | # suppress minor warnings 20 | util.suppress_warnings() 21 | 22 | 23 | def main(): 24 | """ 25 | Defines the type of job (preprocessing, training, generation, testing, or 26 | fine-tuning), writes the job parameters (for future reference), and runs 27 | the job. 28 | """ 29 | _ = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") # fix date/time 30 | 31 | workflow = Workflow(constants=constants) 32 | 33 | job_type = constants.job_type 34 | print(f"* Run mode: '{job_type}'", flush=True) 35 | 36 | if job_type == "preprocess": 37 | # write preprocessing parameters 38 | util.write_preprocessing_parameters(params=constants) 39 | 40 | # preprocess all datasets 41 | workflow.preprocess_phase() 42 | 43 | elif job_type == "train": 44 | # write training parameters 45 | util.write_job_parameters(params=constants) 46 | 47 | # train model and generate graphs 48 | workflow.training_phase() 49 | 50 | elif job_type == "generate": 51 | # write generation parameters 52 | util.write_job_parameters(params=constants) 53 | 54 | # generate molecules only 55 | workflow.generation_phase() 56 | 57 | elif job_type == "test": 58 | # write testing parameters 59 | util.write_job_parameters(params=constants) 60 | 61 | # evaluate best model using the test set data 62 | workflow.testing_phase() 63 | 64 | elif job_type == "fine-tune": 65 | # write training parameters 66 | util.write_job_parameters(params=constants) 67 | 68 | # fine-tune the model and generate graphs 69 | workflow.learning_phase() 70 | 71 | else: 72 | raise NotImplementedError("Not a valid `job_type`.") 73 | 74 | 75 | if __name__ == "__main__": 76 | main() 77 | -------------------------------------------------------------------------------- /graphinvent/parameters/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/graphinvent/parameters/__init__.py -------------------------------------------------------------------------------- /graphinvent/parameters/args.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines `ArgumentParser` for specifying job directory using command-line. 
3 | """# load general packages and functions 4 | import argparse 5 | 6 | 7 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 8 | add_help=False) 9 | parser.add_argument("--job-dir", 10 | type=str, 11 | default="../output/", 12 | help="Directory in which to write all output.") 13 | 14 | 15 | args = parser.parse_args() 16 | 17 | args_dict = vars(args) 18 | job_dir = args_dict["job_dir"] 19 | -------------------------------------------------------------------------------- /graphinvent/parameters/constants.py: -------------------------------------------------------------------------------- 1 | """ 2 | Loads input parameters from `defaults.py`, and defines other global constants 3 | that depend on the input features, creating a `namedtuple` from them; 4 | additionally, if there exists an `input.csv` in the job directory, loads those 5 | arguments and overrides default values in `defaults.py`. 6 | """ 7 | # load general packages and functions 8 | from collections import namedtuple 9 | import pickle 10 | import csv 11 | import os 12 | import sys 13 | from typing import Tuple 14 | import numpy as np 15 | from rdkit.Chem.rdchem import BondType 16 | 17 | # load GraphINVENT-specific functions 18 | sys.path.insert(1, "./parameters/") # search "parameters/" directory 19 | import parameters.args as args 20 | import parameters.defaults as defaults 21 | 22 | 23 | def get_feature_dimensions(parameters : dict) -> Tuple[int, int, int, int]: 24 | """ 25 | Returns dimensions of all node features. 26 | """ 27 | n_atom_types = len(parameters["atom_types"]) 28 | n_formal_charge = len(parameters["formal_charge"]) 29 | n_numh = ( 30 | int(not parameters["use_explicit_H"] and not parameters["ignore_H"]) 31 | * len(parameters["imp_H"]) 32 | ) 33 | n_chirality = int(parameters["use_chirality"]) * len(parameters["chirality"]) 34 | 35 | return n_atom_types, n_formal_charge, n_numh, n_chirality 36 | 37 | 38 | def get_tensor_dimensions(n_atom_types : int, n_formal_charge : int, n_num_h : int, 39 | n_chirality : int, n_node_features : int, n_edge_features : int, 40 | parameters : dict) -> Tuple[list, list, list, list, int]: 41 | """ 42 | Returns dimensions for all tensors that describe molecular graphs. Tensor dimensions 43 | are `list`s, except for `dim_f_term` which is simply an `int`. Each element 44 | of the lists indicate the corresponding dimension of a particular subgraph matrix 45 | (i.e. `nodes`, `f_add`, etc). 46 | """ 47 | max_nodes = parameters["max_n_nodes"] 48 | 49 | # define the matrix dimensions as `list`s 50 | # first for the graph reps... 51 | dim_nodes = [max_nodes, n_node_features] 52 | 53 | dim_edges = [max_nodes, max_nodes, n_edge_features] 54 | 55 | # ... 
then for the APDs 56 | if parameters["use_chirality"]: 57 | if parameters["use_explicit_H"] or parameters["ignore_H"]: 58 | dim_f_add = [ 59 | parameters["max_n_nodes"], 60 | n_atom_types, 61 | n_formal_charge, 62 | n_chirality, 63 | n_edge_features, 64 | ] 65 | else: 66 | dim_f_add = [ 67 | parameters["max_n_nodes"], 68 | n_atom_types, 69 | n_formal_charge, 70 | n_num_h, 71 | n_chirality, 72 | n_edge_features, 73 | ] 74 | else: 75 | if parameters["use_explicit_H"] or parameters["ignore_H"]: 76 | dim_f_add = [ 77 | parameters["max_n_nodes"], 78 | n_atom_types, 79 | n_formal_charge, 80 | n_edge_features, 81 | ] 82 | else: 83 | dim_f_add = [ 84 | parameters["max_n_nodes"], 85 | n_atom_types, 86 | n_formal_charge, 87 | n_num_h, 88 | n_edge_features, 89 | ] 90 | 91 | dim_f_conn = [parameters["max_n_nodes"], n_edge_features] 92 | 93 | dim_f_term = 1 94 | 95 | return dim_nodes, dim_edges, dim_f_add, dim_f_conn, dim_f_term 96 | 97 | 98 | def load_params(input_csv_path : str) -> dict: 99 | """ 100 | Loads job parameters/hyperparameters from CSV (in `input_csv_path`). 101 | """ 102 | params_to_override = {} 103 | with open(input_csv_path, "r") as csv_file: 104 | 105 | params_reader = csv.reader(csv_file, delimiter=";") 106 | 107 | for key, value in params_reader: 108 | try: 109 | params_to_override[key] = eval(value) 110 | except NameError: # `value` is a `str` 111 | params_to_override[key] = value 112 | except SyntaxError: # to avoid "unexpected `EOF`" 113 | params_to_override[key] = value 114 | 115 | return params_to_override 116 | 117 | 118 | def override_params(all_params : dict) -> dict: 119 | """ 120 | If there exists an `input.csv` in the job directory, loads those arguments 121 | and overrides their default values from `features.py`. 122 | """ 123 | input_csv_path = all_params["job_dir"] + "input.csv" 124 | 125 | # check if there exists and `input.csv` in working directory 126 | if os.path.exists(input_csv_path): 127 | # override default values for parameters in `input.csv` 128 | params_to_override_dict = load_params(input_csv_path) 129 | for key, value in params_to_override_dict.items(): 130 | all_params[key] = value 131 | 132 | return all_params 133 | 134 | 135 | def collect_global_constants(parameters : dict, job_dir : str) -> namedtuple: 136 | """ 137 | Collects constants defined in `features.py` with those defined by the 138 | ArgParser (`args.py`), and returns the bundle as a `namedtuple`. 139 | 140 | Args: 141 | ---- 142 | parameters (dict) : Dictionary of parameters defined in `features.py`. 143 | job_dir (str) : Current job directory, defined on the command line. 144 | 145 | Returns: 146 | ------- 147 | constants (namedtuple) : Collected constants. 148 | """ 149 | # first override any arguments from `input.csv`: 150 | parameters["job_dir"] = job_dir 151 | parameters = override_params(all_params=parameters) 152 | 153 | # then calculate any global constants below: 154 | if parameters["use_explicit_H"] and parameters["ignore_H"]: 155 | raise ValueError("Cannot use explicit Hs and ignore Hs at " 156 | "the same time. 
Please fix flags.") 157 | 158 | # define edge feature (rdkit `GetBondType()` result -> `int`) constants 159 | bondtype_to_int = {BondType.SINGLE: 0, BondType.DOUBLE: 1, BondType.TRIPLE: 2} 160 | 161 | if parameters["use_aromatic_bonds"]: 162 | bondtype_to_int[BondType.AROMATIC] = 3 163 | 164 | int_to_bondtype = dict(map(reversed, bondtype_to_int.items())) 165 | 166 | n_edge_features = len(bondtype_to_int) 167 | 168 | # define node feature constants 169 | n_atom_types, n_formal_charge, n_imp_H, n_chirality = get_feature_dimensions(parameters) 170 | 171 | n_node_features = n_atom_types + n_formal_charge + n_imp_H + n_chirality 172 | 173 | # define matrix dimensions 174 | (dim_nodes, dim_edges, dim_f_add, 175 | dim_f_conn, dim_f_term) = get_tensor_dimensions(n_atom_types, 176 | n_formal_charge, 177 | n_imp_H, 178 | n_chirality, 179 | n_node_features, 180 | n_edge_features, 181 | parameters) 182 | 183 | len_f_add = np.prod(dim_f_add[:]) 184 | len_f_add_per_node = np.prod(dim_f_add[1:]) 185 | len_f_conn = np.prod(dim_f_conn[:]) 186 | len_f_conn_per_node = np.prod(dim_f_conn[1:]) 187 | 188 | # create a dictionary of global constants, and add `job_dir` to it; this 189 | # will ultimately be converted to a `namedtuple` 190 | constants_dict = { 191 | "big_negative" : -1e6, 192 | "big_positive" : 1e6, 193 | "bondtype_to_int" : bondtype_to_int, 194 | "int_to_bondtype" : int_to_bondtype, 195 | "n_edge_features" : n_edge_features, 196 | "n_atom_types" : n_atom_types, 197 | "n_formal_charge" : n_formal_charge, 198 | "n_imp_H" : n_imp_H, 199 | "n_chirality" : n_chirality, 200 | "n_node_features" : n_node_features, 201 | "dim_nodes" : dim_nodes, 202 | "dim_edges" : dim_edges, 203 | "dim_f_add" : dim_f_add, 204 | "dim_f_conn" : dim_f_conn, 205 | "dim_f_term" : dim_f_term, 206 | "dim_apd" : [np.prod(dim_f_add) + np.prod(dim_f_conn) + 1], 207 | "len_f_add" : len_f_add, 208 | "len_f_add_per_node" : len_f_add_per_node, 209 | "len_f_conn" : len_f_conn, 210 | "len_f_conn_per_node": len_f_conn_per_node, 211 | } 212 | 213 | # join with `features.args_dict` 214 | constants_dict.update(parameters) 215 | 216 | # define path to dataset splits 217 | constants_dict["test_set"] = parameters["dataset_dir"] + "test.smi" 218 | constants_dict["training_set"] = parameters["dataset_dir"] + "train.smi" 219 | constants_dict["validation_set"] = parameters["dataset_dir"] + "valid.smi" 220 | 221 | # check (if a job is not a preprocessing job) that parameters match those for 222 | # the original preprocessing job 223 | if constants_dict["job_type"] != "preprocess": 224 | print( 225 | "* Running job using HDF datasets located at " 226 | + parameters["dataset_dir"], 227 | flush=True, 228 | ) 229 | print( 230 | "* Checking that the relevant parameters match " 231 | "those used in preprocessing the dataset.", 232 | flush=True, 233 | ) 234 | 235 | # load preprocessing parameters for comparison (if they exist already) 236 | csv_file = parameters["dataset_dir"] + "preprocessing_params.csv" 237 | params_to_check = load_params(input_csv_path=csv_file) 238 | 239 | for key, value in params_to_check.items(): 240 | if key in constants_dict.keys() and value != constants_dict[key]: 241 | raise ValueError( 242 | f"Check that training job parameters match those used in " 243 | f"preprocessing. {key} does not match." 244 | ) 245 | 246 | # if above error never raised, then all relevant parameters match! 
:) 247 | print("-- Job parameters match preprocessing parameters.", flush=True) 248 | 249 | # load QSAR models (sklearn activity model) 250 | if constants_dict["job_type"] == "fine-tune": 251 | print("-- Loading pre-trained scikit-learn activity model.", flush=True) 252 | for qsar_model_name, qsar_model_path in constants_dict["qsar_models"].items(): 253 | with open(qsar_model_path, 'rb') as file: 254 | model_dict = pickle.load(file) 255 | activity_model = model_dict["classifier_sv"] 256 | constants_dict["qsar_models"][qsar_model_name] = activity_model 257 | 258 | # convert `CONSTANTS` dictionary into a namedtuple (immutable + cleaner) 259 | Constants = namedtuple("CONSTANTS", sorted(constants_dict)) 260 | constants = Constants(**constants_dict) 261 | 262 | return constants 263 | 264 | # collect the constants using the functions defined above 265 | constants = collect_global_constants(parameters=defaults.parameters, 266 | job_dir=args.job_dir) 267 | -------------------------------------------------------------------------------- /graphinvent/parameters/load.py: -------------------------------------------------------------------------------- 1 | """ 2 | Functions for loading molecules from SMILES, as well as loading the model type. 3 | """ 4 | # load general packages and functions 5 | import csv 6 | import rdkit 7 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier 8 | 9 | 10 | def molecules(path : str) -> rdkit.Chem.rdmolfiles.SmilesMolSupplier: 11 | """ 12 | Reads a SMILES file (full path/filename specified by `path`) and returns 13 | `rdkit.Mol` objects. 14 | """ 15 | # check first line of SMILES file to see if contains header 16 | with open(path) as smi_file: 17 | first_line = smi_file.readline() 18 | has_header = bool("SMILES" in first_line) 19 | smi_file.close() 20 | 21 | # read file 22 | molecule_set = SmilesMolSupplier(path, 23 | sanitize=True, 24 | nameColumn=-1, 25 | titleLine=has_header) 26 | return molecule_set 27 | 28 | def which_model(input_csv_path : str) -> str: 29 | """ 30 | Gets the type of model to use by reading it from CSV (in "input.csv"). 31 | 32 | Args: 33 | ---- 34 | input_csv_path (str) : The full path/filename to "input.csv" file 35 | containing parameters to overwrite from defaults. 36 | 37 | Returns: 38 | ------- 39 | value (str) : Name of model to use. 40 | """ 41 | with open(input_csv_path, "r") as csv_file: 42 | 43 | params_reader = csv.reader(csv_file, delimiter=";") 44 | 45 | for key, value in params_reader: 46 | if key == "model": 47 | return value # string describing model e.g. "GGNN" 48 | 49 | raise ValueError("Model type not specified.") 50 | -------------------------------------------------------------------------------- /output/input.csv: -------------------------------------------------------------------------------- 1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | formal_charge;[-1, 0, 1] 3 | chirality;['None', 'R', 'S'] 4 | max_n_nodes;13 5 | job_type;train 6 | dataset_dir;/path/to/GraphINVENT/data/gdb13_1K/ 7 | model;GGNN 8 | -------------------------------------------------------------------------------- /submit-fine-tuning.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example submission script for a GraphINVENT fine-tuning job. This can be used to 3 | fine-tune a pre-trained model via reinforcement learning. 
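Fine-tuning jobs assume a pickled activity model is available on disk. A rough sketch of how a compatible pickle could be produced is shown below; the molecules, labels, and file name are dummies, the descriptor settings mirror `ScoringFunction.compute_activity()` (ECFP4, radius 2, 2048 bits), and the dict layout with a "classifier_sv" key is what `parameters/constants.py` unpickles:

```
import pickle
import numpy as np
from sklearn.svm import SVC
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles):
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)
    arr = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# dummy training data: 1 = "active", 0 = "inactive"
smiles = ["CCO", "CCC", "CCCC", "CCN", "CC(C)O",
          "c1ccccc1", "Cc1ccccc1", "c1ccncc1", "c1ccsc1", "c1ccoc1"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
features = np.array([ecfp4(s) for s in smiles])

classifier = SVC(probability=True)   # predict_proba is required by ScoringFunction.py
classifier.fit(features, labels)

with open("data/fine-tuning/qsar_model.pickle", "wb") as f:   # placeholder path
    pickle.dump({"classifier_sv": classifier}, f)
```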
4 | 5 | To run, type: 6 | (graphinvent) ~/GraphINVENT$ python submit-fine-tuning.py 7 | """ 8 | # load general packages and functions 9 | import csv 10 | import sys 11 | import os 12 | from pathlib import Path 13 | import subprocess 14 | import time 15 | import torch 16 | 17 | 18 | # define what you want to do for the specified job(s) 19 | DATASET = "gdb13_1K-debug" 20 | JOB_TYPE = "fine-tune" # "fine-tune", or "generate" 21 | JOBDIR_START_IDX = 0 # where to start indexing job dirs 22 | N_JOBS = 1 # number of jobs to run per model 23 | RESTART = False 24 | FORCE_OVERWRITE = True # overwrite job directories which already exist 25 | JOBNAME = "example_job_name" # used to create a sub directory 26 | 27 | # if running using SLURM sbatch, specify params below 28 | USE_SLURM = False # use SLURM or not 29 | RUN_TIME = "1-00:00:00" # hh:mm:ss 30 | MEM_GB = 20 # required RAM in GB 31 | 32 | # for SLURM jobs, set partition to run job on (preprocessing jobs run entirely on 33 | # CPU, so no need to request GPU partition; all other job types benefit from running 34 | # on a GPU) 35 | if JOB_TYPE == "preprocess": 36 | PARTITION = "core" 37 | CPUS_PER_TASK = 1 38 | else: 39 | PARTITION = "gpu" 40 | CPUS_PER_TASK = 4 41 | 42 | # set paths here 43 | HOME = str(Path.home()) 44 | PYTHON_PATH = f"{HOME}/miniconda3/envs/graphinvent/bin/python" 45 | GRAPHINVENT_PATH = "./graphinvent/" 46 | DATA_PATH = "./data/fine-tuning/" 47 | 48 | if torch.cuda.is_available(): 49 | DEVICE = "cuda" 50 | else: 51 | DEVICE = "cpu" 52 | 53 | # define dataset-specific parameters 54 | params = { 55 | "atom_types" : ["C", "N", "O", "S", "Cl"], # <-- should match pre-trained model param 56 | "formal_charge" : [-1, 0, +1], # <-- should match pre-trained model param 57 | "max_n_nodes" : 13, # <-- should match pre-trained model param 58 | "job_type" : JOB_TYPE, 59 | "dataset_dir" : f"{DATA_PATH}{DATASET}/", 60 | "restart" : RESTART, 61 | "device" : DEVICE, 62 | "model" : "GGNN", # <-- should match pre-trained model param 63 | "sample_every" : 2, 64 | "init_lr" : 1e-4, 65 | "epochs" : 100, # <-- number of fine-tuning steps 66 | "batch_size" : 64, 67 | "block_size" : 1000, 68 | "n_workers" : 0, 69 | "sigma" : 20, # <-- see loss function 70 | "alpha" : 0.5, # <-- see loss function 71 | "pretrained_model_dir": f"output_{DATASET}/example/job_0/", 72 | "generation_epoch" : 80, # <-- which pre-trained model epoch to use 73 | "n_samples" : 100, # <-- how many graphs to sample every step 74 | # additional paramaters can be defined here, if different from the "defaults" 75 | } 76 | 77 | 78 | def submit() -> None: 79 | """ 80 | Creates and submits submission script. Uses global variables defined at top 81 | of this file. 
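For orientation, `write_input_csv()` further down serialises this `params` dictionary one key per row with ";" as the delimiter, so each job directory ends up with an `input.csv` along the following lines (abridged; values reflect the defaults at the top of this script, and run-specific keys such as `job_dir` are also written):

```
atom_types;['C', 'N', 'O', 'S', 'Cl']
formal_charge;[-1, 0, 1]
max_n_nodes;13
job_type;fine-tune
dataset_dir;./data/fine-tuning/gdb13_1K-debug/
model;GGNN
sigma;20
alpha;0.5
pretrained_model_dir;output_gdb13_1K-debug/example/job_0/
generation_epoch;80
n_samples;100
```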
82 | """ 83 | check_paths() 84 | 85 | # create an output directory 86 | dataset_output_path = f"{HOME}/GraphINVENT/output_{DATASET}" 87 | tensorboard_path = os.path.join(dataset_output_path, "tensorboard") 88 | if JOBNAME != "": 89 | dataset_output_path = os.path.join(dataset_output_path, JOBNAME) 90 | tensorboard_path = os.path.join(tensorboard_path, JOBNAME) 91 | 92 | os.makedirs(dataset_output_path, exist_ok=True) 93 | os.makedirs(tensorboard_path, exist_ok=True) 94 | print(f"* Creating dataset directory {dataset_output_path}/", flush=True) 95 | 96 | # submit `N_JOBS` separate jobs 97 | jobdir_end_idx = JOBDIR_START_IDX + N_JOBS 98 | for job_idx in range(JOBDIR_START_IDX, jobdir_end_idx): 99 | 100 | # specify and create the job subdirectory if it does not exist 101 | params["job_dir"] = f"{dataset_output_path}/job_{job_idx}/" 102 | params["tensorboard_dir"] = f"{tensorboard_path}/job_{job_idx}/" 103 | 104 | # create the directory if it does not exist already, otherwise raises an 105 | # error, which is good because *might* not want to override data our 106 | # existing directories! 107 | os.makedirs(params["tensorboard_dir"], exist_ok=True) 108 | try: 109 | job_dir_exists_already = bool( 110 | JOB_TYPE in ["generate", "test"] or FORCE_OVERWRITE 111 | ) 112 | os.makedirs(params["job_dir"], exist_ok=job_dir_exists_already) 113 | print( 114 | f"* Creating model subdirectory {dataset_output_path}/job_{job_idx}/", 115 | flush=True, 116 | ) 117 | except FileExistsError: 118 | print( 119 | f"-- Model subdirectory {dataset_output_path}/job_{job_idx}/ already exists.", 120 | flush=True, 121 | ) 122 | if not RESTART: 123 | continue 124 | 125 | # write the `input.csv` file 126 | write_input_csv(params_dict=params, filename="input.csv") 127 | 128 | # write `submit.sh` and submit 129 | if USE_SLURM: 130 | print("* Writing submission script.", flush=True) 131 | write_submission_script(job_dir=params["job_dir"], 132 | job_idx=job_idx, 133 | job_type=params["job_type"], 134 | max_n_nodes=params["max_n_nodes"], 135 | runtime=RUN_TIME, 136 | mem=MEM_GB, 137 | ptn=PARTITION, 138 | cpu_per_task=CPUS_PER_TASK, 139 | python_bin_path=PYTHON_PATH) 140 | 141 | print("* Submitting job to SLURM.", flush=True) 142 | subprocess.run(["sbatch", params["job_dir"] + "submit.sh"], 143 | check=True) 144 | else: 145 | print("* Running job as a normal process.", flush=True) 146 | subprocess.run(["ls", f"{PYTHON_PATH}"], check=True) 147 | subprocess.run([f"{PYTHON_PATH}", 148 | f"{GRAPHINVENT_PATH}main.py", 149 | "--job-dir", 150 | params["job_dir"]], 151 | check=True) 152 | 153 | # sleep a few secs before submitting next job 154 | print("-- Sleeping 2 seconds.") 155 | time.sleep(2) 156 | 157 | 158 | def write_input_csv(params_dict : dict, filename : str="params.csv") -> None: 159 | """ 160 | Writes job parameters/hyperparameters in `params_dict` to CSV using the specified 161 | `filename`. 162 | """ 163 | dict_path = params_dict["job_dir"] + filename 164 | 165 | with open(dict_path, "w") as csv_file: 166 | 167 | writer = csv.writer(csv_file, delimiter=";") 168 | for key, value in params_dict.items(): 169 | writer.writerow([key, value]) 170 | 171 | 172 | def write_submission_script(job_dir : str, job_idx : int, job_type : str, max_n_nodes : int, 173 | runtime : str, mem : int, ptn : str, cpu_per_task : int, 174 | python_bin_path : str) -> None: 175 | """ 176 | Writes a submission script (`submit.sh`). 177 | 178 | Args: 179 | ---- 180 | job_dir (str) : Job running directory. 181 | job_idx (int) : Job idx. 
182 | job_type (str) : Type of job to run. 183 | max_n_nodes (int) : Maximum number of nodes in dataset. 184 | runtime (str) : Job run-time limit in hh:mm:ss format. 185 | mem (int) : Gigabytes to reserve. 186 | ptn (str) : Partition to use, either "core" (CPU) or "gpu" (GPU). 187 | cpu_per_task (int) : How many CPUs to use per task. 188 | python_bin_path (str) : Path to Python binary to use. 189 | """ 190 | submit_filename = job_dir + "submit.sh" 191 | with open(submit_filename, "w") as submit_file: 192 | submit_file.write("#!/bin/bash\n") 193 | submit_file.write(f"#SBATCH --job-name={job_type}{max_n_nodes}_{job_idx}\n") 194 | submit_file.write(f"#SBATCH --output={job_type}{max_n_nodes}_{job_idx}o\n") 195 | submit_file.write(f"#SBATCH --time={runtime}\n") 196 | submit_file.write(f"#SBATCH --mem={mem}g\n") 197 | submit_file.write(f"#SBATCH --partition={ptn}\n") 198 | submit_file.write("#SBATCH --nodes=1\n") 199 | submit_file.write(f"#SBATCH --cpus-per-task={cpu_per_task}\n") 200 | if ptn == "gpu": 201 | submit_file.write("#SBATCH --gres=gpu:1\n") 202 | submit_file.write("hostname\n") 203 | submit_file.write("export QT_QPA_PLATFORM='offscreen'\n") 204 | submit_file.write(f"{python_bin_path} {GRAPHINVENT_PATH}main.py --job-dir {job_dir}") 205 | submit_file.write(f" > {job_dir}output.o${{SLURM_JOB_ID}}\n") 206 | 207 | 208 | def check_paths() -> None: 209 | """ 210 | Checks that paths to Python binary, data, and GraphINVENT are properly 211 | defined before running a job, and tells the user to define them if not. 212 | """ 213 | for path in [PYTHON_PATH, GRAPHINVENT_PATH, DATA_PATH]: 214 | if "path/to/" in path: 215 | print("!!!") 216 | print("* Update the following paths in `submit.py` before running:") 217 | print("-- `PYTHON_PATH`\n-- `GRAPHINVENT_PATH`\n-- `DATA_PATH`") 218 | sys.exit(0) 219 | 220 | if __name__ == "__main__": 221 | submit() 222 | -------------------------------------------------------------------------------- /submit-pre-training-supercloud.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example submission script for a GraphINVENT training job (distribution- 3 | based training, not fine-tuning/optimization job). This can be used to 4 | pre-train a model before a fine-tuning (via reinforcement learning) job. 5 | 6 | To run, type: 7 | (graphinvent) ~/GraphINVENT$ python submit-pre-training.py 8 | 9 | This script was modified to run on the MIT Supercloud. 
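All of the submission scripts in this repository build the same output layout from `HOME`, `DATASET`, and `JOBNAME`. With the defaults in `submit-fine-tuning.py` above (DATASET "gdb13_1K-debug", JOBNAME "example_job_name", one job starting at index 0), `submit()` creates roughly:

```
~/GraphINVENT/output_gdb13_1K-debug/
├── example_job_name/
│   └── job_0/
│       ├── input.csv    # written by write_input_csv()
│       └── submit.sh    # only written when USE_SLURM (or USE_LLSUB) is enabled
└── tensorboard/
    └── example_job_name/
        └── job_0/
```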
10 | """ 11 | # load general packages and functions 12 | import csv 13 | import sys 14 | import os 15 | from pathlib import Path 16 | import subprocess 17 | import time 18 | import torch 19 | 20 | 21 | # define what you want to do for the specified job(s) 22 | DATASET = "gdb13_1K-debug" # dataset name in "./data/pre-training/" 23 | JOB_TYPE = "train" # "preprocess", "train", "generate", or "test" 24 | JOBDIR_START_IDX = 0 # where to start indexing job dirs 25 | N_JOBS = 1 # number of jobs to run per model 26 | RESTART = False # whether or not this is a restart job 27 | FORCE_OVERWRITE = True # overwrite job directories which already exist 28 | JOBNAME = "example-job-name" # used to create a sub directory 29 | 30 | # if running using LLsub, specify params below 31 | USE_LLSUB = True # use LLsub or not 32 | MEM_GB = 20 # required RAM in GB 33 | 34 | # for LLsub jobs, set number of CPUs per task 35 | if JOB_TYPE == "preprocess": 36 | CPUS_PER_TASK = 1 37 | DEVICE = "cpu" 38 | else: 39 | CPUS_PER_TASK = 10 40 | DEVICE = "cuda" 41 | 42 | # set paths here 43 | HOME = str(Path.home()) 44 | PYTHON_PATH = f"{HOME}/path/to/graphinvent/bin/python" 45 | GRAPHINVENT_PATH = "./graphinvent/" 46 | DATA_PATH = "./data/pre-training/" 47 | 48 | # define dataset-specific parameters 49 | params = { 50 | "atom_types" : ["C", "N", "O", "S", "Cl"], 51 | "formal_charge": [-1, 0, +1], 52 | "max_n_nodes" : 13, 53 | "job_type" : JOB_TYPE, 54 | "dataset_dir" : f"{DATA_PATH}{DATASET}/", 55 | "restart" : RESTART, 56 | "model" : "GGNN", 57 | "sample_every" : 2, 58 | "init_lr" : 1e-4, 59 | "epochs" : 100, 60 | "batch_size" : 50, 61 | "block_size" : 1000, 62 | "device" : DEVICE, 63 | "n_samples" : 100, 64 | # additional paramaters can be defined here, if different from the "defaults" 65 | # for instance, for "generate" jobs, don't forget to specify "generation_epoch" 66 | # and "n_samples" 67 | } 68 | 69 | 70 | def submit() -> None: 71 | """ 72 | Creates and submits submission script. Uses global variables defined at top 73 | of this file. 74 | """ 75 | check_paths() 76 | 77 | # create an output directory 78 | dataset_output_path = f"{HOME}/GraphINVENT/output_{DATASET}" 79 | tensorboard_path = os.path.join(dataset_output_path, "tensorboard") 80 | if JOBNAME != "": 81 | dataset_output_path = os.path.join(dataset_output_path, JOBNAME) 82 | tensorboard_path = os.path.join(tensorboard_path, JOBNAME) 83 | 84 | os.makedirs(dataset_output_path, exist_ok=True) 85 | os.makedirs(tensorboard_path, exist_ok=True) 86 | print(f"* Creating dataset directory {dataset_output_path}/", flush=True) 87 | 88 | # submit `N_JOBS` separate jobs 89 | jobdir_end_idx = JOBDIR_START_IDX + N_JOBS 90 | for job_idx in range(JOBDIR_START_IDX, jobdir_end_idx): 91 | 92 | # specify and create the job subdirectory if it does not exist 93 | params["job_dir"] = f"{dataset_output_path}/job_{job_idx}/" 94 | params["tensorboard_dir"] = f"{tensorboard_path}/job_{job_idx}/" 95 | 96 | # create the directory if it does not exist already, otherwise raises an 97 | # error, which is good because *might* not want to override data our 98 | # existing directories! 
99 | os.makedirs(params["tensorboard_dir"], exist_ok=True) 100 | try: 101 | job_dir_exists_already = bool( 102 | JOB_TYPE in ["generate", "test"] or FORCE_OVERWRITE 103 | ) 104 | os.makedirs(params["job_dir"], exist_ok=job_dir_exists_already) 105 | print( 106 | f"* Creating model subdirectory {dataset_output_path}/job_{job_idx}/", 107 | flush=True, 108 | ) 109 | except FileExistsError: 110 | print( 111 | f"-- Model subdirectory {dataset_output_path}/job_{job_idx}/ already exists.", 112 | flush=True, 113 | ) 114 | if not RESTART: 115 | continue 116 | 117 | # write the `input.csv` file 118 | write_input_csv(params_dict=params, filename="input.csv") 119 | 120 | # write `submit.sh` and submit 121 | if USE_LLSUB: 122 | print("* Writing submission script.", flush=True) 123 | write_submission_script(job_dir=params["job_dir"], 124 | job_idx=job_idx, 125 | job_type=params["job_type"], 126 | max_n_nodes=params["max_n_nodes"], 127 | cpu_per_task=CPUS_PER_TASK, 128 | python_bin_path=PYTHON_PATH) 129 | 130 | print("* Submitting batch job using LLsub.", flush=True) 131 | subprocess.run(["LLsub", params["job_dir"] + "submit.sh"], 132 | check=True) 133 | else: 134 | print("* Running job as a normal process.", flush=True) 135 | subprocess.run(["ls", f"{PYTHON_PATH}"], check=True) 136 | subprocess.run([f"{PYTHON_PATH}", 137 | f"{GRAPHINVENT_PATH}main.py", 138 | "--job-dir", 139 | params["job_dir"]], 140 | check=True) 141 | 142 | # sleep a few secs before submitting next job 143 | print("-- Sleeping 2 seconds.") 144 | time.sleep(2) 145 | 146 | 147 | def write_input_csv(params_dict : dict, filename : str="params.csv") -> None: 148 | """ 149 | Writes job parameters/hyperparameters in `params_dict` to CSV using the specified 150 | `filename`. 151 | """ 152 | dict_path = params_dict["job_dir"] + filename 153 | 154 | with open(dict_path, "w") as csv_file: 155 | 156 | writer = csv.writer(csv_file, delimiter=";") 157 | for key, value in params_dict.items(): 158 | writer.writerow([key, value]) 159 | 160 | 161 | def write_submission_script(job_dir : str, job_idx : int, job_type : str, max_n_nodes : int, 162 | cpu_per_task : int, python_bin_path : str) -> None: 163 | """ 164 | Writes a submission script (`submit.sh`). 165 | 166 | Args: 167 | ---- 168 | job_dir (str) : Job running directory. 169 | job_idx (int) : Job idx. 170 | job_type (str) : Type of job to run. 171 | max_n_nodes (int) : Maximum number of nodes in dataset. 172 | cpu_per_task (int) : How many CPUs to use per task. 173 | python_bin_path (str) : Path to Python binary to use. 174 | """ 175 | submit_filename = job_dir + "submit.sh" 176 | with open(submit_filename, "w") as submit_file: 177 | submit_file.write("#!/bin/bash\n") 178 | submit_file.write(f"#SBATCH --job-name={job_type}{max_n_nodes}_{job_idx}\n") 179 | submit_file.write(f"#SBATCH --output={job_type}{max_n_nodes}_{job_idx}o\n") 180 | submit_file.write(f"#SBATCH --cpus-per-task={cpu_per_task}\n") 181 | if DEVICE == "cuda": 182 | submit_file.write("#SBATCH --gres=gpu:volta:1\n") 183 | submit_file.write("hostname\n") 184 | submit_file.write("export QT_QPA_PLATFORM='offscreen'\n") 185 | submit_file.write(f"{python_bin_path} {GRAPHINVENT_PATH}main.py --job-dir {job_dir}") 186 | submit_file.write(f" > {job_dir}output.o${{LLSUB_RANK}}\n") 187 | 188 | 189 | def check_paths() -> None: 190 | """ 191 | Checks that paths to Python binary, data, and GraphINVENT are properly 192 | defined before running a job, and tells the user to define them if not. 
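Concretely, with the defaults above (JOB_TYPE "train", max_n_nodes 13, job index 0, 10 CPUs per task, CUDA), `write_submission_script()` below produces a `submit.sh` along these lines, where `<python>` and `<job_dir>` stand in for the configured `PYTHON_PATH` and job directory:

```
#!/bin/bash
#SBATCH --job-name=train13_0
#SBATCH --output=train13_0o
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:volta:1
hostname
export QT_QPA_PLATFORM='offscreen'
<python> ./graphinvent/main.py --job-dir <job_dir> > <job_dir>output.o${LLSUB_RANK}
```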
193 | """ 194 | for path in [PYTHON_PATH, GRAPHINVENT_PATH, DATA_PATH]: 195 | if "path/to/" in path: 196 | print("!!!") 197 | print("* Update the following paths in `submit.py` before running:") 198 | print("-- `PYTHON_PATH`\n-- `GRAPHINVENT_PATH`\n-- `DATA_PATH`") 199 | sys.exit(0) 200 | 201 | if __name__ == "__main__": 202 | submit() 203 | -------------------------------------------------------------------------------- /submit-pre-training.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example submission script for a GraphINVENT training job (distribution- 3 | based training, not fine-tuning/optimization job). This can be used to 4 | pre-train a model before a fine-tuning (via reinforcement learning) job. 5 | 6 | To run, type: 7 | (graphinvent) ~/GraphINVENT$ python submit-pre-training.py 8 | """ 9 | # load general packages and functions 10 | import csv 11 | import sys 12 | import os 13 | from pathlib import Path 14 | import subprocess 15 | import time 16 | import torch 17 | 18 | 19 | # define what you want to do for the specified job(s) 20 | DATASET = "gdb13_1K-debug" # dataset name in "./data/pre-training/" 21 | JOB_TYPE = "train" # "preprocess", "train", "generate", or "test" 22 | JOBDIR_START_IDX = 0 # where to start indexing job dirs 23 | N_JOBS = 1 # number of jobs to run per model 24 | RESTART = False # whether or not this is a restart job 25 | FORCE_OVERWRITE = True # overwrite job directories which already exist 26 | JOBNAME = "example-job-name" # used to create a sub directory 27 | 28 | # if running using SLURM sbatch, specify params below 29 | USE_SLURM = False # use SLURM or not 30 | RUN_TIME = "1-00:00:00" # hh:mm:ss 31 | MEM_GB = 20 # required RAM in GB 32 | 33 | # for SLURM jobs, set partition to run job on (preprocessing jobs run entirely on 34 | # CPU, so no need to request GPU partition; all other job types benefit from running 35 | # on a GPU) 36 | if JOB_TYPE == "preprocess": 37 | PARTITION = "core" 38 | CPUS_PER_TASK = 1 39 | else: 40 | PARTITION = "gpu" 41 | CPUS_PER_TASK = 4 42 | 43 | # set paths here 44 | HOME = str(Path.home()) 45 | PYTHON_PATH = f"{HOME}/miniconda3/envs/graphinvent/bin/python" 46 | GRAPHINVENT_PATH = "./graphinvent/" 47 | DATA_PATH = "./data/pre-training/" 48 | 49 | if torch.cuda.is_available(): 50 | DEVICE = "cuda" 51 | else: 52 | DEVICE = "cpu" 53 | 54 | # define dataset-specific parameters 55 | params = { 56 | "atom_types" : ["C", "N", "O", "S", "Cl"], 57 | "formal_charge": [-1, 0, +1], 58 | "max_n_nodes" : 13, 59 | "job_type" : JOB_TYPE, 60 | "dataset_dir" : f"{DATA_PATH}{DATASET}/", 61 | "restart" : RESTART, 62 | "model" : "GGNN", 63 | "sample_every" : 2, 64 | "init_lr" : 1e-4, 65 | "epochs" : 100, 66 | "batch_size" : 50, 67 | "block_size" : 1000, 68 | "device" : DEVICE, 69 | "n_samples" : 100, 70 | # additional paramaters can be defined here, if different from the "defaults" 71 | # for instance, for "generate" jobs, don't forget to specify "generation_epoch" 72 | # and "n_samples" 73 | } 74 | 75 | 76 | def submit() -> None: 77 | """ 78 | Creates and submits submission script. Uses global variables defined at top 79 | of this file. 
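In practice a dataset is pushed through this script more than once, editing the constants at the top between runs: first "preprocess", then "train", then "generate" (or "test"). A sketch of the sequence, where the epoch and sample count are example values rather than defaults:

```
# run 1: JOB_TYPE = "preprocess"  -> writes the train/valid/test HDF files into dataset_dir
# run 2: JOB_TYPE = "train"       -> distribution-based training using those HDF files
# run 3: JOB_TYPE = "generate"    -> samples graphs from a saved model state; as the
#        comment in `params` above notes, also specify for example:
params["generation_epoch"] = 80     # example value: which saved epoch to sample from
params["n_samples"]        = 1000   # example value: how many graphs to sample
```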
80 | """ 81 | check_paths() 82 | 83 | # create an output directory 84 | dataset_output_path = f"{HOME}/GraphINVENT/output_{DATASET}" 85 | tensorboard_path = os.path.join(dataset_output_path, "tensorboard") 86 | if JOBNAME != "": 87 | dataset_output_path = os.path.join(dataset_output_path, JOBNAME) 88 | tensorboard_path = os.path.join(tensorboard_path, JOBNAME) 89 | 90 | os.makedirs(dataset_output_path, exist_ok=True) 91 | os.makedirs(tensorboard_path, exist_ok=True) 92 | print(f"* Creating dataset directory {dataset_output_path}/", flush=True) 93 | 94 | # submit `N_JOBS` separate jobs 95 | jobdir_end_idx = JOBDIR_START_IDX + N_JOBS 96 | for job_idx in range(JOBDIR_START_IDX, jobdir_end_idx): 97 | 98 | # specify and create the job subdirectory if it does not exist 99 | params["job_dir"] = f"{dataset_output_path}/job_{job_idx}/" 100 | params["tensorboard_dir"] = f"{tensorboard_path}/job_{job_idx}/" 101 | 102 | # create the directory if it does not exist already, otherwise raises an 103 | # error, which is good because *might* not want to override data our 104 | # existing directories! 105 | os.makedirs(params["tensorboard_dir"], exist_ok=True) 106 | try: 107 | job_dir_exists_already = bool( 108 | JOB_TYPE in ["generate", "test"] or FORCE_OVERWRITE 109 | ) 110 | os.makedirs(params["job_dir"], exist_ok=job_dir_exists_already) 111 | print( 112 | f"* Creating model subdirectory {dataset_output_path}/job_{job_idx}/", 113 | flush=True, 114 | ) 115 | except FileExistsError: 116 | print( 117 | f"-- Model subdirectory {dataset_output_path}/job_{job_idx}/ already exists.", 118 | flush=True, 119 | ) 120 | if not RESTART: 121 | continue 122 | 123 | # write the `input.csv` file 124 | write_input_csv(params_dict=params, filename="input.csv") 125 | 126 | # write `submit.sh` and submit 127 | if USE_SLURM: 128 | print("* Writing submission script.", flush=True) 129 | write_submission_script(job_dir=params["job_dir"], 130 | job_idx=job_idx, 131 | job_type=params["job_type"], 132 | max_n_nodes=params["max_n_nodes"], 133 | runtime=RUN_TIME, 134 | mem=MEM_GB, 135 | ptn=PARTITION, 136 | cpu_per_task=CPUS_PER_TASK, 137 | python_bin_path=PYTHON_PATH) 138 | 139 | print("* Submitting job to SLURM.", flush=True) 140 | subprocess.run(["sbatch", params["job_dir"] + "submit.sh"], 141 | check=True) 142 | else: 143 | print("* Running job as a normal process.", flush=True) 144 | subprocess.run(["ls", f"{PYTHON_PATH}"], check=True) 145 | subprocess.run([f"{PYTHON_PATH}", 146 | f"{GRAPHINVENT_PATH}main.py", 147 | "--job-dir", 148 | params["job_dir"]], 149 | check=True) 150 | 151 | # sleep a few secs before submitting next job 152 | print("-- Sleeping 2 seconds.") 153 | time.sleep(2) 154 | 155 | 156 | def write_input_csv(params_dict : dict, filename : str="params.csv") -> None: 157 | """ 158 | Writes job parameters/hyperparameters in `params_dict` to CSV using the specified 159 | `filename`. 160 | """ 161 | dict_path = params_dict["job_dir"] + filename 162 | 163 | with open(dict_path, "w") as csv_file: 164 | 165 | writer = csv.writer(csv_file, delimiter=";") 166 | for key, value in params_dict.items(): 167 | writer.writerow([key, value]) 168 | 169 | 170 | def write_submission_script(job_dir : str, job_idx : int, job_type : str, max_n_nodes : int, 171 | runtime : str, mem : int, ptn : str, cpu_per_task : int, 172 | python_bin_path : str) -> None: 173 | """ 174 | Writes a submission script (`submit.sh`). 175 | 176 | Args: 177 | ---- 178 | job_dir (str) : Job running directory. 179 | job_idx (int) : Job idx. 
180 | job_type (str) : Type of job to run. 181 | max_n_nodes (int) : Maximum number of nodes in dataset. 182 | runtime (str) : Job run-time limit in hh:mm:ss format. 183 | mem (int) : Gigabytes to reserve. 184 | ptn (str) : Partition to use, either "core" (CPU) or "gpu" (GPU). 185 | cpu_per_task (int) : How many CPUs to use per task. 186 | python_bin_path (str) : Path to Python binary to use. 187 | """ 188 | submit_filename = job_dir + "submit.sh" 189 | with open(submit_filename, "w") as submit_file: 190 | submit_file.write("#!/bin/bash\n") 191 | submit_file.write(f"#SBATCH --job-name={job_type}{max_n_nodes}_{job_idx}\n") 192 | submit_file.write(f"#SBATCH --output={job_type}{max_n_nodes}_{job_idx}o\n") 193 | submit_file.write(f"#SBATCH --time={runtime}\n") 194 | submit_file.write(f"#SBATCH --mem={mem}g\n") 195 | submit_file.write(f"#SBATCH --partition={ptn}\n") 196 | submit_file.write("#SBATCH --nodes=1\n") 197 | submit_file.write(f"#SBATCH --cpus-per-task={cpu_per_task}\n") 198 | if ptn == "gpu": 199 | submit_file.write("#SBATCH --gres=gpu:1\n") 200 | submit_file.write("hostname\n") 201 | submit_file.write("export QT_QPA_PLATFORM='offscreen'\n") 202 | submit_file.write(f"{python_bin_path} {GRAPHINVENT_PATH}main.py --job-dir {job_dir}") 203 | submit_file.write(f" > {job_dir}output.o${{SLURM_JOB_ID}}\n") 204 | 205 | 206 | def check_paths() -> None: 207 | """ 208 | Checks that paths to Python binary, data, and GraphINVENT are properly 209 | defined before running a job, and tells the user to define them if not. 210 | """ 211 | for path in [PYTHON_PATH, GRAPHINVENT_PATH, DATA_PATH]: 212 | if "path/to/" in path: 213 | print("!!!") 214 | print("* Update the following paths in `submit.py` before running:") 215 | print("-- `PYTHON_PATH`\n-- `GRAPHINVENT_PATH`\n-- `DATA_PATH`") 216 | sys.exit(0) 217 | 218 | if __name__ == "__main__": 219 | submit() 220 | -------------------------------------------------------------------------------- /tools/README.md: -------------------------------------------------------------------------------- 1 | # Tools 2 | This directory contains various tools for analyzing datasets: 3 | 4 | * [max_n_nodes.py](./max_n_nodes.py): Gets the maximum number of nodes per molecule in a set of molecules. 5 | * [atom_types.py](./atom_types.py) : Gets the atom types present in a set of molecules. 6 | * [formal_charges.py](./formal_charges.py) : Gets the formal charges present in a set of molecules. 7 | * [tdc-create-dataset.py](./tdc-create-dataset.py) : Downloads a dataset, such as ChEMBL or MOSES, from the Therapeutics Data Commons (TDC). 8 | * [submit-split-preprocessing-supercloud.py](./submit-split-preprocessing-supercloud.py) : Example submission script for preprocessing a very large dataset in parallel. 9 | 10 | --- 11 | 12 | To use the first 3 tools in this directory ([max_n_nodes.py](./max_n_nodes.py), [atom_types.py](./atom_types.py), or [formal_charges.py](./formal_charges.py)), first activate the GraphINVENT virtual environment, then run: 13 | 14 | ``` 15 | (graphinvent)$ python {script} --smi path/to/file.smi 16 | ``` 17 | 18 | Simply replace *{script}* by the name of the script e.g. *max_n_nodes.py*, and *path/to/file* with the name of the SMILES file to analyze. 19 | 20 | --- 21 | If you would like to download a dataset such as ChEMBL or MOSES from the TDC and preprocess it slightly (e.g. remove molecular with high formal charges, filter to molecules with <= 80 heavy atoms, etc), then you can use the [tdc-create-dataset.py](./tdc-create-dataset.py) script. 
22 | 23 | To use script to download, for example, the MOSES dataset, run (from within the GraphINVENT environment): 24 | ``` 25 | (graphinvent)$ python tdc-create-dataset.py --dataset MOSES 26 | ``` 27 | 28 | You can change the flag to speficy other datasets available via the TDC. 29 | 30 | Furthermore, you can manually edit the script to do other things you would like (for instance, set the number of heavy atoms and formal charges to filter). 31 | 32 | --- 33 | 34 | In some cases, if you have a really large dataset, it might be easier to preprocess it in pieces (i.e. in parallel on different nodes) rather than all in serial. To do this, you can use the [submit-split-preprocessing-supercloud.py](./submit-split-preprocessing-supercloud.py) script. 35 | 36 | To use it, you will first need to split your dataset by running, **from within an interactive session**, the following command: 37 | ``` 38 | (graphinvent)$ python submit-split-preprocessing-supercloud.py --type split 39 | ``` 40 | 41 | Then, once the dataset has been split, you can submit the separate splits as individual preprocessing jobs as follows: 42 | ``` 43 | (graphinvent)$ python submit-split-preprocessing-supercloud.py --type submit 44 | ``` 45 | 46 | When the above jobs have completed, you can aggregate the generated HDFs for each dataset split into a single HDF in the main dataset dir: 47 | ``` 48 | (graphinvent)$ python submit-split-preprocessing-supercloud.py --type aggregate 49 | ``` -------------------------------------------------------------------------------- /tools/atom_types.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gets the atom types present in a set of molecules. 3 | 4 | To use script, run: 5 | python atom_types.py --smi path/to/file.smi 6 | """ 7 | import argparse 8 | import rdkit 9 | from utils import load_molecules 10 | 11 | 12 | # define the argument parser 13 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 14 | add_help=False) 15 | 16 | # define two potential arguments to use when drawing SMILES from a file 17 | parser.add_argument("--smi", 18 | type=str, 19 | default="data/gdb13_1K/train.smi", 20 | help="SMILES file containing molecules to analyse.") 21 | args = parser.parse_args() 22 | 23 | 24 | def get_atom_types(smi_file : str) -> list: 25 | """ 26 | Determines the atom types present in an input SMILES file. 27 | 28 | Args: 29 | ---- 30 | smi_file (str) : Full path/filename to SMILES file. 31 | """ 32 | molecules = load_molecules(path=smi_file) 33 | 34 | # create a list of all the atom types 35 | atom_types = list() 36 | for mol in molecules: 37 | for atom in mol.GetAtoms(): 38 | atom_types.append(atom.GetAtomicNum()) 39 | 40 | # remove duplicate atom types then sort by atomic number 41 | set_of_atom_types = set(atom_types) 42 | atom_types_sorted = list(set_of_atom_types) 43 | atom_types_sorted.sort() 44 | 45 | # return the symbols, for convenience 46 | return [rdkit.Chem.Atom(atom).GetSymbol() for atom in atom_types_sorted] 47 | 48 | 49 | if __name__ == "__main__": 50 | atom_types = get_atom_types(smi_file=args.smi) 51 | print("* Atom types present in input file:", atom_types, flush=True) 52 | print("Done.", flush=True) 53 | -------------------------------------------------------------------------------- /tools/combine_HDFs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Combines preprocessed HDF files. 
Useful when preprocessing large datasets, as 3 | one can split the `{split}.smi` into multiple files (and directories), preprocess 4 | them separately, and then combine using this script. 5 | 6 | To use script, modify the variables below to automatically create a list of 7 | paths **assuming** HDFs were created with the following directory structure: 8 | data/ 9 | |-- {dataset}_1/ 10 | |-- {dataset}_2/ 11 | |-- {dataset}_3/ 12 | |... 13 | |-- {dataset}_{n_dirs}/ 14 | 15 | The variables are also used in setting the dimensions of the HDF datasets later on. 16 | 17 | If directories were not named as above, then simply replace `path_list` below 18 | with a list of the paths to all the HDFs to combine. 19 | 20 | Then, run: 21 | python combine_HDFs.py 22 | """ 23 | import csv 24 | import numpy as np 25 | import h5py 26 | import torch 27 | from typing import Union 28 | 29 | 30 | def load_ts_properties_from_csv(csv_path : str) -> Union[dict, None]: 31 | """ 32 | Loads CSV file containing training set properties and returns contents as a dictionary. 33 | """ 34 | print("* Loading training set properties.", flush=True) 35 | 36 | # read dictionaries from csv 37 | try: 38 | with open(csv_path, "r") as csv_file: 39 | reader = csv.reader(csv_file, delimiter=";") 40 | csv_dict = dict(reader) 41 | except: 42 | return None 43 | 44 | # fix file types within dict in going from `csv_dict` --> `properties_dict` 45 | properties_dict = {} 46 | for key, value in csv_dict.items(): 47 | 48 | # first determine if key is a tuple 49 | key = eval(key) 50 | if len(key) > 1: 51 | tuple_key = (str(key[0]), str(key[1])) 52 | else: 53 | tuple_key = key 54 | 55 | # then convert the values to the correct data type 56 | try: 57 | properties_dict[tuple_key] = eval(value) 58 | except (SyntaxError, NameError): 59 | properties_dict[tuple_key] = value 60 | 61 | # convert any `list`s to `torch.Tensor`s (for consistency) 62 | if type(properties_dict[tuple_key]) == list: 63 | properties_dict[tuple_key] = torch.Tensor(properties_dict[tuple_key]) 64 | 65 | return properties_dict 66 | 67 | def write_ts_properties_to_csv(ts_properties_dict : dict) -> None: 68 | """ 69 | Writes the training set properties in `ts_properties_dict` to a CSV file. 70 | """ 71 | dict_path = f"data/{dataset}/{split}.csv" 72 | 73 | with open(dict_path, "w") as csv_file: 74 | 75 | csv_writer = csv.writer(csv_file, delimiter=";") 76 | for key, value in ts_properties_dict.items(): 77 | if "validity_tensor" in key: 78 | continue # skip writing the validity tensor because it is really long 79 | elif type(value) == np.ndarray: 80 | csv_writer.writerow([key, list(value)]) 81 | elif type(value) == torch.Tensor: 82 | try: 83 | csv_writer.writerow([key, float(value)]) 84 | except ValueError: 85 | csv_writer.writerow([key, [float(i) for i in value]]) 86 | else: 87 | csv_writer.writerow([key, value]) 88 | 89 | def get_dims() -> dict: 90 | """ 91 | Gets the dims corresponding to the three datasets in each preprocessed HDF 92 | file: "nodes", "edges", and "APDs". 
93 | """ 94 | dims = {} 95 | dims["nodes"] = [max_n_nodes, n_atom_types + n_formal_charges] 96 | dims["edges"] = [max_n_nodes, max_n_nodes, n_bond_types] 97 | dim_f_add = [max_n_nodes, n_atom_types, n_formal_charges, n_bond_types] 98 | dim_f_conn = [max_n_nodes, n_bond_types] 99 | dims["APDs"] = [np.prod(dim_f_add) + np.prod(dim_f_conn) + 1] 100 | 101 | return dims 102 | 103 | def get_total_n_subgraphs(paths : list) -> int: 104 | """ 105 | Gets the total number of subgraphs saved in all the HDF files in the `paths`, 106 | where `paths` is a list of strings containing the path to each HDF file we want 107 | to combine. 108 | """ 109 | total_n_subgraphs = 0 110 | for path in paths: 111 | print("path:", path) 112 | hdf_file = h5py.File(path, "r") 113 | nodes = hdf_file.get("nodes") 114 | n_subgraphs = nodes.shape[0] 115 | total_n_subgraphs += n_subgraphs 116 | hdf_file.close() 117 | 118 | return total_n_subgraphs 119 | 120 | def main(paths : list, training_set : bool) -> None: 121 | """ 122 | Combine many small HDF files (their paths defined in `paths`) into one large HDF file. 123 | """ 124 | total_n_subgraphs = get_total_n_subgraphs(paths) 125 | dims = get_dims() 126 | 127 | print(f"* Creating HDF file to contain {total_n_subgraphs} subgraphs") 128 | new_hdf_file = h5py.File(f"data/{dataset}/{split}.h5", "a") 129 | new_dataset_nodes = new_hdf_file.create_dataset("nodes", 130 | (total_n_subgraphs, *dims["nodes"]), 131 | dtype=np.dtype("int8")) 132 | new_dataset_edges = new_hdf_file.create_dataset("edges", 133 | (total_n_subgraphs, *dims["edges"]), 134 | dtype=np.dtype("int8")) 135 | new_dataset_APDs = new_hdf_file.create_dataset("APDs", 136 | (total_n_subgraphs, *dims["APDs"]), 137 | dtype=np.dtype("int8")) 138 | 139 | print("* Combining data from smaller HDFs into a new larger HDF.") 140 | init_index = 0 141 | for path in paths: 142 | print("path:", path) 143 | hdf_file = h5py.File(path, "r") 144 | 145 | nodes = hdf_file.get("nodes") 146 | edges = hdf_file.get("edges") 147 | APDs = hdf_file.get("APDs") 148 | 149 | n_subgraphs = nodes.shape[0] 150 | 151 | new_dataset_nodes[init_index:(init_index + n_subgraphs)] = nodes 152 | new_dataset_edges[init_index:(init_index + n_subgraphs)] = edges 153 | new_dataset_APDs[init_index:(init_index + n_subgraphs)] = APDs 154 | 155 | init_index += n_subgraphs 156 | hdf_file.close() 157 | 158 | new_hdf_file.close() 159 | 160 | if training_set: 161 | print(f"* Combining data from respective `{split}.csv` files into one.") 162 | csv_list = [f"{path[:-2]}csv" for path in paths] 163 | 164 | ts_properties_old = None 165 | csv_files_processed = 0 166 | for path in csv_list: 167 | ts_properties = load_ts_properties_from_csv(csv_path=path) 168 | ts_properties_new = {} 169 | if ts_properties_old and ts_properties: 170 | for key, value in ts_properties_old.items(): 171 | if type(value) == float: 172 | ts_properties_new[key] = ( 173 | value * csv_files_processed + ts_properties[key] 174 | )/(csv_files_processed + 1) 175 | else: 176 | new_list = [] 177 | for i, value_i in enumerate(value): 178 | new_list.append( 179 | float( 180 | value_i * csv_files_processed + ts_properties[key][i] 181 | )/(csv_files_processed + 1) 182 | ) 183 | ts_properties_new[key] = new_list 184 | else: 185 | ts_properties_new = ts_properties 186 | ts_properties_old = ts_properties_new 187 | csv_files_processed += 1 188 | 189 | write_ts_properties_to_csv(ts_properties_dict=ts_properties_new) 190 | 191 | 192 | if __name__ == "__main__": 193 | # combine the HDFs defined in `path_list` 194 | 195 | # 
set variables 196 | dataset = "ChEMBL" 197 | n_atom_types = 15 # number of atom types used in preprocessing the data 198 | n_formal_charges = 3 # number of formal charges used in preprocessing the data 199 | n_bond_types = 3 # number of bond types used in preprocessing the data 200 | max_n_nodes = 40 # maximum number of nodes in the data 201 | 202 | # combine the training files 203 | n_dirs = 12 # how many times was `{split}.smi` split? 204 | split = "train" # train, test, or valid 205 | path_list = [f"data/{dataset}_{i}/{split}.h5" for i in range(0, n_dirs)] 206 | main(path_list, training_set=True) 207 | 208 | # combine the test files 209 | n_dirs = 4 # how many times was `{split}.smi` split? 210 | split = "test" # train, test, or valid 211 | path_list = [f"data/{dataset}_{i}/{split}.h5" for i in range(0, n_dirs)] 212 | main(path_list, training_set=False) 213 | 214 | # combine the validation files 215 | n_dirs = 2 # how many times was `{split}.smi` split? 216 | split = "valid" # train, test, or valid 217 | path_list = [f"data/{dataset}_{i}/{split}.h5" for i in range(0, n_dirs)] 218 | main(path_list, training_set=False) 219 | 220 | print("Done.", flush=True) 221 | -------------------------------------------------------------------------------- /tools/formal_charges.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gets the formal charges present in a set of molecules. 3 | 4 | To use script, run: 5 | python formal_charges.py --smi path/to/file.smi 6 | """ 7 | import argparse 8 | from utils import load_molecules 9 | 10 | 11 | # define the argument parser 12 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 13 | add_help=False) 14 | 15 | # define two potential arguments to use when drawing SMILES from a file 16 | parser.add_argument("--smi", 17 | type=str, 18 | default="data/gdb13_1K/train.smi", 19 | help="SMILES file containing molecules to analyse.") 20 | args = parser.parse_args() 21 | 22 | 23 | def get_formal_charges(smi_file : str) -> list: 24 | """ 25 | Determines the formal charges present in an input SMILES file. 26 | 27 | Args: 28 | ---- 29 | smi_file (str) : Full path/filename to SMILES file. 30 | """ 31 | molecules = load_molecules(path=smi_file) 32 | 33 | # create a list of all the formal charges 34 | formal_charges = list() 35 | for mol in molecules: 36 | for atom in mol.GetAtoms(): 37 | formal_charges.append(atom.GetFormalCharge()) 38 | 39 | # remove duplicate formal charges then sort 40 | set_of_formal_charges = set(formal_charges) 41 | formal_charges_sorted = list(set_of_formal_charges) 42 | formal_charges_sorted.sort() 43 | 44 | return formal_charges_sorted 45 | 46 | 47 | if __name__ == "__main__": 48 | formal_charges = get_formal_charges(smi_file=args.smi) 49 | print("* Formal charges present in input file:", formal_charges, flush=True) 50 | print("Done.", flush=True) 51 | -------------------------------------------------------------------------------- /tools/max_n_nodes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gets the maximum number of nodes per molecule present in a set of molecules. 
3 | 4 | To use script, run: 5 | python max_n_nodes.py --smi path/to/file.smi 6 | """ 7 | import argparse 8 | from utils import load_molecules 9 | 10 | 11 | # define the argument parser 12 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 13 | add_help=False) 14 | 15 | # define two potential arguments to use when drawing SMILES from a file 16 | parser.add_argument("--smi", 17 | type=str, 18 | default="data/gdb13_1K/train.smi", 19 | help="SMILES file containing molecules to analyse.") 20 | args = parser.parse_args() 21 | 22 | 23 | def get_max_n_atoms(smi_file : str) -> int: 24 | """ 25 | Determines the maximum number of atoms per molecule in an input SMILES file. 26 | 27 | Args: 28 | ---- 29 | smi_file (str) : Full path/filename to SMILES file. 30 | """ 31 | molecules = load_molecules(path=smi_file) 32 | 33 | max_n_atoms = 0 34 | for mol in molecules: 35 | n_atoms = mol.GetNumAtoms() 36 | 37 | if n_atoms > max_n_atoms: 38 | max_n_atoms = n_atoms 39 | 40 | return max_n_atoms 41 | 42 | 43 | if __name__ == "__main__": 44 | max_n_atoms = get_max_n_atoms(smi_file=args.smi) 45 | print("* Max number of atoms in input file:", max_n_atoms, flush=True) 46 | print("Done.", flush=True) 47 | -------------------------------------------------------------------------------- /tools/tdc-create-dataset.py: -------------------------------------------------------------------------------- 1 | """ 2 | Uses the Therapeutics Data Commons (TDC) to get datasets for goal-directed 3 | molecular optimization tasks and then filters the molecules based on number 4 | of heavy atoms and formal charge. 5 | 6 | See: 7 | * https://tdcommons.ai/ 8 | * https://github.com/mims-harvard/TDC 9 | 10 | To use script, run: 11 | (graphinvent)$ python tdc-create-dataset.py --dataset MOSES 12 | """ 13 | import os 14 | import argparse 15 | from pathlib import Path 16 | import shutil 17 | from tdc.generation import MolGen 18 | import rdkit 19 | from rdkit import Chem 20 | 21 | # define the argument parser 22 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 23 | add_help=False) 24 | 25 | # define two potential arguments to use when drawing SMILES from a file 26 | parser.add_argument("--dataset", 27 | type=str, 28 | default="ChEMBL", 29 | help="Specifies the dataset to use for creating the data. Options " 30 | "are: 'ChEMBL', 'MOSES', or 'ZINC'.") 31 | args = parser.parse_args() 32 | 33 | 34 | def save_smiles(smi_file : str, smi_list : list) -> None: 35 | """Saves input list of SMILES to the specified file path.""" 36 | smi_writer = rdkit.Chem.rdmolfiles.SmilesWriter(smi_file) 37 | for smi in smi_list: 38 | try: 39 | mol = rdkit.Chem.MolFromSmiles(smi[0]) 40 | if mol.GetNumAtoms() < 81: # filter out molecules with >= 81 atoms 41 | save = True 42 | for atom in mol.GetAtoms(): 43 | if atom.GetFormalCharge() not in [-1, 0, +1]: # filter out molecules with large formal charge 44 | save = False 45 | break 46 | if save: 47 | smi_writer.write(mol) 48 | except: # likely TypeError or AttributeError e.g. 
"smi[0]" is "nan" 49 | continue 50 | smi_writer.close() 51 | 52 | 53 | if __name__ == "__main__": 54 | print(f"* Loading {args.dataset} dataset using the TDC.") 55 | data = MolGen(name=args.dataset) 56 | split = data.get_split() 57 | HOME = str(Path.home()) 58 | DATA_PATH = f"./data/{args.dataset}/" 59 | try: 60 | os.mkdir(DATA_PATH) 61 | print(f"-- Creating dataset at {DATA_PATH}") 62 | except FileExistsError: 63 | shutil.rmtree(DATA_PATH) 64 | os.mkdir(DATA_PATH) 65 | print(f"-- Removed old directory at {DATA_PATH}") 66 | print(f"-- Creating new dataset at {DATA_PATH}") 67 | 68 | print(f"* Re-saving {args.dataset} dataset in a format GraphINVENT can parse.") 69 | print("-- Saving training data...") 70 | save_smiles(smi_file=f"{DATA_PATH}train.smi", smi_list=split["train"].values) 71 | print("-- Saving testing data...") 72 | save_smiles(smi_file=f"{DATA_PATH}test.smi", smi_list=split["test"].values) 73 | print("-- Saving validation data...") 74 | save_smiles(smi_file=f"{DATA_PATH}valid.smi", smi_list=split["valid"].values) 75 | 76 | # # delete the raw downloaded files 77 | # dir_path = "./data/" 78 | # shutil.rmtree(dir_path) 79 | print("Done.", flush=True) 80 | -------------------------------------------------------------------------------- /tools/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Miscellaneous functions. 3 | """ 4 | import rdkit 5 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier 6 | 7 | 8 | def load_molecules(path : str) -> rdkit.Chem.rdmolfiles.SmilesMolSupplier: 9 | """ 10 | Reads a SMILES file (full path/filename specified by `path`) and returns the 11 | `rdkit.Mol` object "supplier". 12 | """ 13 | # check first line of SMILES file to see if contains header 14 | with open(path) as smi_file: 15 | first_line = smi_file.readline() 16 | has_header = bool("SMILES" in first_line) 17 | smi_file.close() 18 | 19 | # read file 20 | molecule_set = SmilesMolSupplier(path, sanitize=True, nameColumn=-1, titleLine=has_header) 21 | 22 | return molecule_set 23 | -------------------------------------------------------------------------------- /tutorials/0_setting_up_environment.md: -------------------------------------------------------------------------------- 1 | ## Setting up the environment 2 | Before doing anything with GraphINVENT, you will need to configure the GraphINVENT virtual environment, as the code is dependent on very specific versions of packages. You can use [conda](https://docs.conda.io/en/latest/) for this. 3 | 4 | The [../environments/graphinvent.yml](../environments/graphinvent.yml) file lists all the packages required for GraphINVENT to run. From within the [GraphINVENT/](../) directory, a virtual environment can be easily created using the YAML file and conda by typing into the terminal: 5 | 6 | ``` 7 | conda env create -f environments/graphinvent.yml 8 | ``` 9 | 10 | Then, to activate the environment: 11 | 12 | ``` 13 | conda activate graphinvent 14 | ``` 15 | 16 | To install additional packages to the virtual environment, should the need arise, use: 17 | 18 | ``` 19 | conda install -n graphinvent {package_name} 20 | ``` 21 | 22 | To save an updated environment as a YAML file using conda, use: 23 | 24 | ``` 25 | conda env export > path/to/environment.yml 26 | ``` 27 | 28 | And that's it! To learn how to start training models, go to [1_introduction](1_introduction.md). 
29 | -------------------------------------------------------------------------------- /tutorials/1_introduction.md: -------------------------------------------------------------------------------- 1 | ## Introduction to GraphINVENT 2 | As shown in our recent [publication](https://chemrxiv.org/articles/preprint/Graph_Networks_for_Molecular_Design/12843137/1), GraphINVENT can be used to learn the structure and connectivity of sets of molecular graphs, thus making it a promising tool for the generation of molecules resembling an input dataset. As models in GraphINVENT are probabilistic, they can be used to discover new molecules that are not present in the training set. 3 | 4 | There are six GNN-based models implemented in GraphINVENT: the MNN, GGNN, AttGGNN, S2V, AttS2V, and EMN models. The GGNN has shown the best performance when weighed against the computational time required for training, and is therefore used as the default model. 5 | 6 | To begin using GraphINVENT, we have prepared the following tutorial to guide a new user through the molecular generation workflow using a small example dataset. The example dataset is a 1K random subset of GDB-13. It has already been preprocessed, so you can use it directly for Training and Generation, as we will show in this tutorial. If this is too simple and you would like to learn how to train GraphINVENT models using a new molecular dataset, see [2_using_a_new_dataset](./2_using_a_new_dataset.md). 7 | 8 | ### Training using the example dataset 9 | #### Preparing a training job 10 | The example dataset is located in [../data/pre-training/gdb13_1K/](../data/pre-training/gdb13_1K/) and contains the following: 11 | * 1K molecules in each of the training, validation, and test sets 12 | * atom types : {C, N, O, S, Cl} 13 | * formal charges : {-1, 0, +1} 14 | * max num nodes : 13 (it is a subset of GDB-13). 15 | 16 | The last three points of information must be included in the submission script, as well as any additional parameters and hyperparameters to use for the training job. 17 | 18 | A sample submission script [submit-pre-training.py](../submit-pre-training.py) has been provided. Begin by modifying the submission script to specify where the dataset can be found and what type of job you want to run. For training on the example set, the settings below are recommended: 19 | 20 | ``` 21 | submit.py > 22 | # define what you want to do for the specified job(s) 23 | dataset = "gdb13_1K" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ 24 | job_type = "train" # this tells the code that this is a training job 25 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 26 | n_jobs = 1 # if you want to run multiple jobs (e.g. for collecting statistics), set this to >1 27 | restart = False # this tells the code that this is not a restart job 28 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 29 | jobname = "example" # this is the name of the job, to be used in labeling directories where output will be written 30 | ``` 31 | 32 | Then, specify whether you want the job to run using [SLURM](https://slurm.schedmd.com/overview.html). In the example below, we specify that we want the job to run as a regular process (i.e. no SLURM). In such cases, any specified run time and memory requirements will be ignored by the script.
Note: if you want to use a different scheduler, this can be easily changed in the submission script (search for "sbatch" and change it to your scheduler's submission command). 33 | 34 | ``` 35 | submit.py > 36 | # if running using SLURM, specify the parameters below 37 | use_slurm = False # this tells the code to NOT use SLURM 38 | run_time = "1-00:00:00" # d-hh:mm:ss (will be ignored here) 39 | mem_GB = 20 # memory in GB (will be ignored here) 40 | ``` 41 | 42 | Then, specify the path to the Python binary in the GraphINVENT virtual environment. You probably won't need to change *graphinvent_path* or *data_path*, unless you want to run the code from a different directory. 43 | 44 | ``` 45 | submit.py > 46 | # set paths here 47 | python_path = f"../miniconda3/envs/graphinvent/bin/python" # this is the path to the Python binary to use (change to your own) 48 | graphinvent_path = f"./graphinvent/" # this is the directory containing the source code 49 | data_path = f"./data/" # this is the directory where all datasets are found 50 | ``` 51 | 52 | Finally, details regarding the specific dataset and parameters you want to use need to be entered. If they are not specified in *submit.py* before running, the model will use the default values in [./graphinvent/parameters/defaults.py](./graphinvent/parameters/defaults.py), but it is not always the case that the "default" values will work well for your dataset. The models are sensitive to the hyperparameters used for each dataset, especially the learning rate and learning rate decay. For the example dataset, the following parameters are recommended: 53 | 54 | ``` 55 | submit.py > 56 | # define dataset-specific parameters 57 | params = { 58 | "atom_types": ["C", "N", "O", "S", "Cl"], 59 | "formal_charge": [-1, 0, +1], 60 | "max_n_nodes": 13, 61 | "job_type": job_type, 62 | "dataset_dir": f"{data_path}{dataset}/", 63 | "restart": restart, 64 | "model": "GGNN", 65 | "sample_every": 10, 66 | "init_lr": 1e-4, # (!) 67 | "epochs": 400, 68 | "batch_size": 1000, 69 | "block_size": 100000, 70 | } 71 | ``` 72 | 73 | Above, (!) indicates that a parameter is strongly dependent on the dataset used. Note that, depending on your system, you might need to tune the mini-batch and/or block size so as to reduce/increase the memory requirement for training jobs. There is an inverse relationship between the batch size and the time required to train a model. As such, only reduce the batch size if necessary, as decreasing the batch size will lead to noticeably slower training. 74 | 75 | At this point, you are done editing the *submit.py* file and are ready to submit a training job. 76 | 77 | #### Running a training job 78 | Using the prepared *submit.py*, you can run a GraphINVENT training job from the terminal using the following command: 79 | 80 | ``` 81 | (graphinvent)$ python submit.py 82 | ``` 83 | 84 | Note that for the code to run, you need to have configured and activated the GraphINVENT environment (see [0_setting_up_environment](0_setting_up_environment.md) for help with this). 85 | 86 | As the models are training, you should see the progress bar updating on the terminal every epoch. The training status will be saved every epoch to the job directory, *output_{dataset}/{jobname}/job_{jobdir_start_idx}/*, which should be *output_gdb13_1K/example/job_0/* if you followed the settings above. Additionally, the evaluation scores will be saved every evaluation epoch to the job directory. 
Among the files written to this directory will be: 87 | 88 | * *generation.log*, containing various evaluation metrics for the generated set, calculated during evaluation epochs 89 | * *convergence.log*, containing the loss and learning rate for every epoch 90 | * *validation.log*, containing model scores (e.g. NLLs, UC-JSD), calculated during evaluation epochs 91 | * *model_restart_{epoch}.pth*, which are the model states for use in restarting jobs, or running generation/validation jobs with a trained model 92 | * *generation/*, a directory containing structures generated during evaluation epochs (\*.smi), as well as information on each structure's NLL (\*.nll) and validity (\*.valid) 93 | 94 | It is good to check the *generation.log* to verify that the generated set features indeed converge to those of the training set (first entry). If they do not then something is wrong (most likely bad hyperparameters). Furthermore, it is good to check the *convergence.log* to make sure the loss is smoothly decreasing during training. 95 | 96 | #### Restarting a training job 97 | If for any reason you want to restart a training job from a previous epoch (e.g. you cancelled a training job before it reached convergence), then you can do this by setting *restart = True* in *submit.py* and rerunning. While it is possible to change certain parameters in *submit.py* before rerunning (e.g. *init_lr* or *epochs*), parameters related to the model should not be changed, as the program will load an existing model from the last saved *model_restart_{epoch}.pth* file (hence there will be a mismatch between the previous parameters and those you changed). Similarly, any settings related to the file location or job name should not be changed, as the program uses those settings to search in the right directory for the previously saved model. Finally, parameters related to the dataset (e.g. *atom_types*) should not be changed, not only for a restart job but throughout the entire workflow of a dataset. If you want to use different features in the node and edge feature representations, you will have to create a copy of the dataset in [../data/](../data/), give it a unique name, and preprocess it using the desired settings. 98 | 99 | ### Generation using a trained model 100 | #### Running a generation job 101 | Once you have trained a model, you can use a saved state (e.g. *model_restart_400.pth*) to generate molecules. To do this, *submit.py* needs to be updated to specify a generation job. The first setting that needs to be changed is the *job_type*; all other settings here should be kept fixed so that the program can find the correct job directory: 102 | 103 | ``` 104 | submit.py > 105 | # define what you want to do for the specified job(s) 106 | dataset = "gdb13_1K" 107 | job_type = "generate" # this tells the code that this is a generation job 108 | jobdir_start_idx = 0 109 | n_jobs = 1 110 | restart = False 111 | force_overwrite = False 112 | jobname = "example" 113 | ``` 114 | 115 | You will then need to update the *generation_epoch* and *n_samples* parameters in *submit.py*: 116 | 117 | ``` 118 | submit.py > 119 | # define dataset-specific parameters 120 | params = { 121 | "atom_types": ["C", "N", "O", "S", "Cl"], 122 | "formal_charge": [-1, 0, +1], 123 | "max_n_nodes": 13, 124 | "job_type": job_type, 125 | "dataset_dir": f"{data_path}{dataset}/", 126 | "restart": restart, 127 | "model": "GGNN", 128 | "sample_every": 10, 129 | "init_lr": 1e-4, # (!) 
130 | "epochs": 400, 131 | "batch_size": 1000, 132 | "block_size": 100000, 133 | "generation_epoch": 400, # <-- which model to use (i.e. which epoch) 134 | "n_samples": 30000, # <-- how many structures to generate 135 | } 136 | ``` 137 | 138 | The *generation_epoch* should correspond to the saved model state that you want to use for generation. In the example above, the parameters specify that the model saved at Epoch 400 should be used to generate 30,000 molecules. All other parameters should be kept the same (if they are related to training, such as *epochs* or *init_lr*, they will be ignored during generation jobs). 139 | 140 | Structures will be generated in batches of size *batch_size*. If you encounter memory problems during generation jobs, reducing the batch size should once again solve them. Generated structures, along with their corresponding metadata, will be written to the *generation/* directory within the existing job directory. These files are: 141 | 142 | * *epochGEN{generation_epoch}_{batch}.smi*, containing molecules generated at the epoch specified 143 | * *epochGEN{generation_epoch}_{batch}.nll*, containing their respective NLLs 144 | * *epochGEN{generation_epoch}_{batch}.valid*, containing their respective validity (0: invalid, 1: valid) 145 | 146 | Additionally, the *generation.log* file will be updated with the various evaluation metrics for the generated structures. 147 | 148 | If you've followed the tutorial up to here, it means you can successfully create new molecules using a trained GNN-based model. 149 | 150 | #### (Optional) Postprocessing 151 | 152 | To make things more convenient for any subsequent analyses, you can concatenate all structures generated in different batches into one file using: 153 | 154 | ``` 155 | for i in epochGEN{generation_epoch}_*.smi; do cat $i >> epochGEN{generation_epoch}.smi; done 156 | ``` 157 | 158 | Above, *{generation_epoch}* should be replaced with a number corresponding to a valid epoch. You can do similar things for the NLL and validity files, as the rows in those files correspond to the rows in the SMILES files. 159 | 160 | Note that "Xe" and empty graphs may appear in the generated structures, even if the models are well-trained, as there is always a small probability of sampling invalid actions. If you do not want to include invalid entries in your analysis, these can be filtered out by typing: 161 | 162 | ``` 163 | sed -i "/Xe/d" path/to/file.smi # remove "Xe" placeholders from file 164 | sed -i "/^ [0-9]\+$/d" path/to/file.smi # remove empty graphs from file 165 | ``` 166 | 167 | See [3_visualizing_molecules](./3_visualizing_molecules.md) for examples on how to draw grids of molecules. 168 | 169 | ### Summary 170 | Now you know how to train models and generate structures using the example dataset. However, the example dataset structures are not drug-like, and are therefore not the most interesting to study for drug discovery applications. To learn how to train GraphINVENT models on custom datasets, see [2_using_a_new_dataset](./2_using_a_new_dataset.md). 171 | -------------------------------------------------------------------------------- /tutorials/2_using_a_new_dataset.md: -------------------------------------------------------------------------------- 1 | ## Using a new dataset in GraphINVENT 2 | In this tutorial, you will be guided through the steps of using a new dataset in GraphINVENT. 
3 | 4 | ### Selecting a new dataset 5 | Before getting carried away with the possibilities of molecular graph generative models, it should be clear that the GraphINVENT models are computationally demanding, especially compared to string-based models. As such, you should keep in mind the capabilities of your system when selecting a new dataset to study, such as how much disk space you have available, how much RAM, and how fast is your GPU. 6 | 7 | In our recent [publication](https://chemrxiv.org/articles/preprint/Graph_Networks_for_Molecular_Design/12843137/1), we report the computational requirements for Preprocessing, Training, Generation, and Benchmarking jobs using the various GraphINVENT models. We summarize some of the results here for the largest dataset we trained on: 8 | 9 | | Dataset | Train | Test | Valid | Largest Molecule | Atom Types | Formal Charges | 10 | |---|---|---|---|---|---|---| 11 | | [MOSES](https://github.com/molecularsets/moses/tree/master/data) | 1.5M | 176K | 10K | 27 atoms | {C, N, O, F, S, Cl, Br} | {0} | 12 | 13 | The disk space used by the different splits, before and after preprocessing (using the best parameters from the paper), are as follows: 14 | 15 | | | Train | Test | Valid | 16 | |---|---|---|---| 17 | | Before | 65M | 7.1M | 403K | 18 | | After | 75G | 9.5G | 559M | 19 | 20 | We point this out to emphasize that if you intend to use a large dataset (such as the MOSES dataset), you need to have considerable disk space available. The sizes of these files can be reduced by specifying a larger *group_size* (default: 1000), but increasing the group size will also increase the time required for preprocessing while having a small effect on decreasing the training time. 21 | 22 | Training and Generation jobs using the above dataset generally require <10 GB GPU memory. A model can be fully trained on MOSES after around 5 days of training on a single GPU (using a batch size of 1000). 23 | 24 | When selecting a dataset to study, thus keep in mind that more structures in your dataset means 1) more disk space will be required to save processed dataset splits and 2) more computational time will be required for training. The number of structures should not have a significant effect on the RAM requirements of a job, as this can be controlled by the batch and block sizes used. However, the number of atom types present in the dataset will have an effect on the memory and disk space requirements of a job, as this is directly correlated to the sizes of the node and edge features tensors, as well as the sizes of the APDs. As such, you might not want to use the entire periodic table in your generative models. 25 | 26 | Finally, as all molecules are padded up to the size of the largest graph in the dataset during Preprocessing jobs, if you have a dataset where most molecules have fewer nodes than *N*, and you have only a few structures where the number of nodes is >>*N*, a good strategy to reduce the computational requirements for this dataset would be to simply remove all molecules with >*N* nodes. The same thing could be said for the atom types and formal charges. We recommend to only keep any "outliers" in a dataset if they are deemed essential. 
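If you decide to trim such outliers, a simple RDKit pre-filter applied to your SMILES files before preprocessing is usually enough. The sketch below is only an illustration of the idea; the size cutoff, allowed atom types, allowed formal charges, and file names are placeholders that you would replace with values appropriate for your own dataset:

```
filter_outliers.py >
# illustrative pre-filter; adjust the cutoffs and file names to your dataset
from rdkit import Chem

MAX_N_NODES     = 40                                     # placeholder: max heavy atoms to keep
ALLOWED_ATOMS   = {"C", "N", "O", "F", "S", "Cl", "Br"}  # placeholder atom types
ALLOWED_CHARGES = {-1, 0, 1}                             # placeholder formal charges

def keep(mol):
    """Returns True if the molecule respects the chosen size/type/charge limits."""
    if mol is None or mol.GetNumAtoms() > MAX_N_NODES:
        return False
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ALLOWED_ATOMS or atom.GetFormalCharge() not in ALLOWED_CHARGES:
            return False
    return True

supplier = Chem.SmilesMolSupplier("train.smi", sanitize=True, nameColumn=-1, titleLine=False)
with open("train_filtered.smi", "w") as out_file:
    for mol in supplier:
        if keep(mol):
            out_file.write(Chem.MolToSmiles(mol) + "\n")
```

Anything you filter out here is something the model never has to represent in its node features or APDs, which is exactly where the savings in disk space and memory come from.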
27 | 28 | To summarize, 29 | 30 | Increases disk space requirement: 31 | * more molecules in dataset 32 | * more atom types present in dataset 33 | * more formal charges present 34 | * larger molecules in dataset (really, larger *max_n_nodes*) 35 | * smaller group size 36 | 37 | Increases RAM: 38 | * using a larger batch size 39 | * using a larger block size 40 | 41 | Increases run time: 42 | * more molecules in dataset 43 | * using a smaller batch size 44 | * larger group size (Preprocessing jobs only) 45 | 46 | Hopefully these guidelines help you in selecting an appropriate dataset to study using GraphINVENT. 47 | 48 | ### Preparing a new dataset 49 | Once you have selected a dataset to study, you must prepare it so that it agrees with the format expected by the program. GraphINVENT expects, for each dataset, three splits in SMILES format. Each split should be named as follows: 50 | 51 | * *train.smi* 52 | * *test.smi* 53 | * *valid.smi* 54 | 55 | These should contain the training set, test set, and validation set, respectively. It is not important for the SMILES to be canonical, and it also does not matter if the file has a header or not. How many structures you put in each split is also up to you (generally the training set is larger than the testing and validation set). 56 | 57 | You should then create a new directory in [../data/](../data/) where the name of this directory corresponds to a unique name for your dataset: 58 | 59 | ``` 60 | mkdir path/to/GraphINVENT/data/your_dataset_name/ 61 | mv train.smi valid.smi test.smi path/to/GraphINVENT/data/your_dataset_name/. 62 | ``` 63 | 64 | You will want to replace *your_dataset_name* above with the actual name for your dataset (e.g. *ChEMBL_subset*, *DRD2_actives*, etc). 65 | 66 | 67 | ### Preprocessing the new dataset 68 | Once you have prepared your dataset in the aforementioned format, you can move on to preprocessing it using GraphINVENT. To preprocess it, you will need to know the following information: 69 | 70 | * *max_n_nodes* 71 | * *atom_types* 72 | * *formal_charge* 73 | 74 | We have provided a few scripts to help you calculate these properties in [../tools/](../tools/). 75 | 76 | Once you know these values, you can move on to preparing a submission script. A sample submission script [../submit.py](../submit.py) has been provided. Begin by modifying the submission script to specify where the dataset can be found and what type of job you want to run. 
For preprocessing a new dataset, you can use the settings below, substituting in your own values where necessary: 77 | 78 | ``` 79 | submit.py > 80 | # define what you want to do for the specified job(s) 81 | dataset = "your_dataset_name" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ 82 | job_type = "preprocess" # this tells the code that this is a preprocessing job 83 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 84 | n_jobs = 1 # if you want to run multiple jobs (not recommended for preprocessing), set this to >1 85 | restart = False # this tells the code that this is not a restart job 86 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 87 | jobname = "preprocess" # this is the name of the job, to be used in labeling directories where output will be written 88 | ``` 89 | 90 | Then, specify whether you want the job to run using [SLURM](https://slurm.schedmd.com/overview.html). In the example below, we specify that we want the job to run as a regular process (i.e. no SLURM). In such cases, any specified run time and memory requirements will be ignored by the script. Note: if you want to use a different scheduler, this can be easily changed in the submission script (search for "sbatch" and change it to your scheduler's submission command). 91 | 92 | ``` 93 | submit.py > 94 | # if running using SLURM, specify the parameters below 95 | use_slurm = False # this tells the code to NOT use SLURM 96 | run_time = "1-00:00:00" # d-hh:mm:ss (will be ignored here) 97 | mem_GB = 20 # memory in GB (will be ignored here) 98 | ``` 99 | 100 | Then, specify the path to the Python binary in the GraphINVENT virtual environment. You probably won't need to change *graphinvent_path* or *data_path*, unless you want to run the code from a different directory. 101 | 102 | ``` 103 | submit.py > 104 | # set paths here 105 | python_path = f"{home}/miniconda3/envs/graphinvent/bin/python" # this is the path to the Python binary to use (change to your own) 106 | graphinvent_path = f"./graphinvent/" # this is the directory containing the source code 107 | data_path = f"./data/" # this is the directory where all datasets are found 108 | ``` 109 | 110 | Finally, details regarding the specific dataset you want to use need to be entered: 111 | 112 | ``` 113 | submit.py > 114 | # define dataset-specific parameters 115 | params = { 116 | "atom_types": ["C", "N", "O", "S", "Cl"], # <-- change to your dataset's atom types 117 | "formal_charge": [-1, 0, +1], # <-- change to your dataset's formal charges 118 | "chirality": ["None", "R", "S"], # <-- ignored, unless you also specify `use_chirality`=True 119 | "max_n_nodes": 13, # <-- change to your dataset's value 120 | "job_type": job_type, 121 | "dataset_dir": f"{data_path}{dataset}/", 122 | "restart": restart, 123 | } 124 | ``` 125 | 126 | At this point, you are done editing the *submit.py* file and are ready to submit a preprocesing job. 
You can submit the job from the terminal using the following command: 127 | 128 | ``` 129 | (graphinvent)$ python submit.py 130 | ``` 131 | 132 | During preprocessing jobs, the following will be written to the specified *dataset_dir*: 133 | * 3 HDF files (*train.h5*, *valid.h5*, and *test.h5*) 134 | * *preprocessing_params.csv*, containing parameters used in preprocessing the dataset (for later reference) 135 | * *train.csv*, containing training set properties (e.g. histograms of number of nodes per molecule, number of edges per node, etc) 136 | 137 | A preprocessing job can take a few seconds to a few hours to finish, depending on the size of your dataset. Once the preprocessing job is done and you have the above files, you are ready to run a training job using your processed dataset. 138 | 139 | ### Training models using the new dataset 140 | You can modify the same *submit.py* script to instead run a training job using your dataset. Begin by changing the *job_type* and *jobname*; all other settings can be kept the same: 141 | 142 | ``` 143 | submit.py > 144 | # define what you want to do for the specified job(s) 145 | dataset = "your_dataset_name" 146 | job_type = "train" # this tells the code that this is a training job 147 | jobdir_start_idx = 0 148 | n_jobs = 1 149 | restart = False 150 | force_overwrite = False 151 | jobname = "train" # this is the name of the job, to be used in labeling directories where output will be written 152 | ``` 153 | 154 | If you would like to change the SLURM settings, you should do that next, but for this example we will keep them the same. You will then need to specify all parameters that you want to use for training: 155 | 156 | 157 | ``` 158 | submit.py > 159 | # define dataset-specific parameters 160 | params = { 161 | "atom_types": ["C", "N", "O", "S", "Cl"], # change to your dataset's atom types 162 | "formal_charge": [-1, 0, +1], # change to your dataset's formal charges 163 | "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True 164 | "max_n_nodes": 13, # change to your dataset's value 165 | "job_type": job_type, 166 | "dataset_dir": f"{data_path}{dataset}/", 167 | "restart": restart, 168 | "model": "GGNN", # <-- which model to use (GGNN is the default, but showing it here to be explicit) 169 | "sample_every": 2, # <-- how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often) 170 | "init_lr": 1e-4, # <-- tune the initial learning rate if needed 171 | "epochs": 100, # <-- how many epochs you want to train for (you can experiment with this) 172 | "batch_size": 1000, # <-- tune the batch size if needed 173 | "block_size": 100000, # <-- tune the block size if needed 174 | } 175 | ``` 176 | 177 | If any parameters are not specified in *submit.py* before running, the model will use the default values in [../graphinvent/parameters/defaults.py](../graphinvent/parameters/defaults.py), but it is not always the case that the "default" values will work well for your dataset. For instance, the parameters related to the learning rate decay are strongly dependent on the dataset used, and you might have to tune them to get optimal performance using your dataset. Depending on your system, you might also need to tune the mini-batch and/or block size so as to reduce/increase the memory requirement for training jobs. 
178 | 179 | You can then run a GraphINVENT training job from the terminal using the following command: 180 | 181 | ``` 182 | (graphinvent)$ python submit.py 183 | ``` 184 | 185 | As the models are training, you should see the progress bar updating on the terminal every epoch. The training status will be saved every epoch to the job directory, *output_{your_dataset_name}/{jobname}/job_{jobdir_start_idx}/*, which should be *output_{your_dataset_name}/train/job_0/* if you followed the settings above. Additionally, the evaluation scores will be saved every evaluation epoch to the job directory. Among the files written to this directory will be: 186 | 187 | * *generation.log*, containing various evaluation metrics for the generated set, calculated during evaluation epochs 188 | * *convergence.log*, containing the loss and learning rate for every epoch 189 | * *validation.log*, containing model scores (e.g. NLLs, UC-JSD), calculated during evaluation epochs 190 | * *model_restart_{epoch}.pth*, which are the model states for use in restarting jobs, or running generation/validation jobs with a trained model 191 | * *generation/*, a directory containing structures generated during evaluation epochs (\*.smi), as well as information on each structure's NLL (\*.nll) and validity (\*.valid) 192 | 193 | It is good to check the *generation.log* to verify that the generated set features indeed converge to those of the training set (first entry). If they do not, then you will have to tune the hyperparameters to get better performance. Furthermore, it is good to check the *convergence.log* to make sure the loss is smoothly decreasing during training. 194 | 195 | #### Restarting a training job 196 | If for any reason you want to restart a training job from a previous epoch (e.g. you cancelled a training job before it reached convergence), then you can do this by setting *restart = True* in *submit.py* and rerunning. While it is possible to change certain parameters in *submit.py* before rerunning (e.g. *init_lr* or *epochs*), parameters related to the model should not be changed, as the program will load an existing model from the last saved *model_restart_{epoch}.pth* file (hence there will be a mismatch between the previous parameters and those you changed). Similarly, any settings related to the file location or job name should not be changed, as the program uses those settings to search in the right directory for the previously saved model. Finally, parameters related to the dataset (e.g. *atom_types*) should not be changed, not only for a restart job but throughout the entire workflow of a dataset. If you want to use different features in the node and edge feature representations, you will have to create a copy of the dataset in [../data/](../data/), give it a unique name, and preprocess it using the desired settings. 197 | 198 | ### Generating structures using the newly trained models 199 | Once you have trained a model, you can use a saved state (e.g. *model_restart_100.pth*) to generate molecules. To do this, *submit.py* needs to be updated to specify a generation job. 
The first setting that needs to be changed is the *job_type*; all other settings here should be kept fixed so that the program can find the correct job directory: 200 | 201 | ``` 202 | submit.py > 203 | # define what you want to do for the specified job(s) 204 | dataset = "your_dataset_name" 205 | job_type = "generate" # this tells the code that this is a generation job 206 | jobdir_start_idx = 0 207 | n_jobs = 1 208 | restart = False 209 | force_overwrite = False 210 | jobname = "train" # don't change the jobname, or the program won't find the saved model 211 | ``` 212 | 213 | You will then need to update the *generation_epoch* and *n_samples* parameter in *submit.py*: 214 | 215 | ``` 216 | submit.py > 217 | # define dataset-specific parameters 218 | params = { 219 | "atom_types": ["C", "N", "O", "S", "Cl"], # change to your dataset's atom types 220 | "formal_charge": [-1, 0, +1], # change to your dataset's formal charges 221 | "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True 222 | "max_n_nodes": 13, # change to your dataset's value 223 | "job_type": job_type, 224 | "dataset_dir": f"{data_path}{dataset}/", 225 | "restart": restart, 226 | "model": "GGNN", 227 | "sample_every": 2, # how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often) 228 | "init_lr": 1e-4, # tune the initial learning rate if needed 229 | "epochs": 100, # how many epochs you want to train for (you can experiment with this) 230 | "batch_size": 1000, # <-- tune the batch size if needed 231 | "block_size": 100000, # tune the block size if needed 232 | "generation_epoch": 100, # <-- specify which saved model (i.e. at which epoch) to use for training) 233 | "n_samples": 30000, # <-- specify how many structures you want to generate 234 | } 235 | ``` 236 | 237 | The *generation_epoch* should correspond to the saved model state that you want to use for generation, and *n_samples* tells the program how many structures you want to generate. In the example above, the parameters specify that the model saved at Epoch 100 should be used to generate 30,000 structures. All other parameters should be kept the same (if they are related to training, such as *epochs* or *init_lr*, they will be ignored during generation jobs). 238 | 239 | Structures will be generated in batches of size *batch_size*. If you encounter memory problems during generation jobs, reducing the batch size should once again solve them. Generated structures, along with their corresponding metadata, will be written to the *generation/* directory within the existing job directory. These files are: 240 | 241 | * *epochGEN{generation_epoch}_{batch}.smi*, containing molecules generated at the epoch specified 242 | * *epochGEN{generation_epoch}_{batch}.nll*, containing their respective NLLs 243 | * *epochGEN{generation_epoch}_{batch}.valid*, containing their respective validity (0: invalid, 1: valid) 244 | 245 | Additionally, the *generation.log* file will be updated with the various evaluation metrics for the generated structures. 246 | 247 | If you've followed the tutorial up to here, it means you can successfully create new molecules using a GNN-based model trained on a custom dataset. 
248 | 249 | #### (Optional) Postprocessing 250 | 251 | To make things more convenient for any subsequent analyses, you can concatenate all structures generated in different batches into one file using: 252 | 253 | ``` 254 | for i in epochGEN{generation_epoch}_*.smi; do cat $i >> epochGEN{generation_epoch}.smi; done 255 | ``` 256 | 257 | Above, *{generation_epoch}* should be replaced with a number corresponding to a valid epoch. You can do similar things for the NLL and validity files, as the rows in those files correspond to the rows in the SMILES files. 258 | 259 | Note that "Xe" and empty graphs may appear in the generated structures, even if the models are well-trained, as there is always a small probability of sampling invalid actions. If you do not want to include invalid entries in your analysis, these can be filtered out by typing: 260 | 261 | ``` 262 | sed -i "/Xe/d" path/to/file.smi # remove "Xe" placeholders from file 263 | sed -i "/^ [0-9]\+$/d" path/to/file.smi # remove empty graphs from file 264 | ``` 265 | 266 | See [3_visualizing_molecules](./3_visualizing_molecules.md) for examples on how to draw grids of molecules. 267 | 268 | ### A note about hyperparameters 269 | If you've reached this part of the tutorial, you now have a good idea of how to train GraphINVENT models on custom datasets. Nonetheless, as hinted above, some hyperparameters are highly dependent on the dataset used, and you may have to do some hyperparameter tuning to obtain the best performance using your specific dataset. In particular, parameters related to the learning rate decay are sensitive to the dataset, so a bit of experimentation here is recommended when using a new dataset as these parameters can make a difference between an "okay" model and a well-trained model. These parameters are: 270 | 271 | * *init_lr* 272 | * *min_rel_lr* 273 | * *lrdf* 274 | * *lrdi* 275 | 276 | If any parameters are not specified in the submission script, the program will use the default values from [../graphinvent/parameters/defaults.py](../graphinvent/parameters/defaults.py). Have a look there if you want to learn more about any additional hyperparameters that may not have been discussed in this tutorial. Note that not all parameters defined in *../graphinvent/parameters/defaults.py* are model-related hyperparameters; many are simply practical parameters and settings, such as the path to the datasets being studied. 277 | 278 | ### Summary 279 | Hopefully you are now able to train models on custom datasets using GraphINVENT. If anything is unclear in this tutorial, or if you have any questions that have not been addressed by this guide, feel free to contact the authors for assistance. Note that a lot of useful information centered about hyperparameter tuning is available in our [technical note](https://chemrxiv.org/articles/preprint/Practical_Notes_on_Building_Molecular_Graph_Generative_Models/12888383/1). 280 | 281 | We look forward to seeing the molecules you've generated using GraphINVENT. 282 | -------------------------------------------------------------------------------- /tutorials/3_visualizing_molecules.md: -------------------------------------------------------------------------------- 1 | ## Visualizing molecules 2 | After generating structures using GraphINVENT, you will almost certainly want to visualize them. Below we provide some examples using RDKit for visualizing the molecules in simple but elegant grids. 
3 | 
4 | ### Drawing a grid of molecules
5 | Assuming you use the trained models to generate thousands (if not more) of molecules, you *probably* don't want to visualize all of them in one massive grid. A more reasonable thing to do is to randomly sample a small subset for visualization.
6 | 
7 | An example script for drawing 100 randomly selected molecules is shown below:
8 | 
9 | ```
10 | example_visualization_script.py >
11 | import math
12 | import random
13 | import rdkit
14 | from rdkit.Chem.Draw import MolsToGridImage
15 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier
16 | 
17 | smi_file = "path/to/file.smi"
18 | 
19 | # load molecules from file
20 | mols = SmilesMolSupplier(smi_file, sanitize=True, nameColumn=-1)
21 | 
22 | n_samples = 100
23 | mols_list = [mol for mol in mols]
24 | mols_sampled = random.sample(mols_list, n_samples)  # sample 100 random molecules to visualize
25 | 
26 | mols_per_row = int(math.sqrt(n_samples))  # make a square grid
27 | 
28 | png_filename = smi_file[:-3] + "png"  # name of PNG file to create
29 | labels = list(range(n_samples))  # label structures with a number
30 | 
31 | # draw the molecules (creates a PIL image)
32 | img = MolsToGridImage(mols=mols_sampled,
33 |                       molsPerRow=mols_per_row,
34 |                       legends=[str(i) for i in labels])
35 | 
36 | img.save(png_filename)
37 | ```
38 | 
39 | Alternatively, you could first randomly sample 100 molecules from your source file, save them in a new file, and draw everything in the new file:
40 | 
41 | ```
42 | shuf -n 100 path/to/file.smi > path/to/file_100_shuffled.smi
43 | ```
44 | 
45 | ```
46 | example_visualization_script_2.py >
47 | import rdkit
48 | from rdkit.Chem.Draw import MolsToGridImage
49 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier
50 | 
51 | smi_file = "path/to/file_100_shuffled.smi"
52 | 
53 | # load molecules from file into a list
54 | mols = [mol for mol in SmilesMolSupplier(smi_file, sanitize=True, nameColumn=-1)]
55 | 
56 | png_filename = smi_file[:-3] + "png"  # name of PNG file to create
57 | labels = list(range(len(mols)))  # label structures with a number
58 | 
59 | # draw the molecules (creates a PIL image)
60 | img = MolsToGridImage(mols=mols,
61 |                       molsPerRow=10,
62 |                       legends=[str(i) for i in labels])
63 | 
64 | img.save(png_filename)
65 | ```
66 | 
67 | ### Filtering out invalid entries
68 | By default, GraphINVENT writes a "Xe" placeholder when an invalid molecular graph is generated, as an invalid molecular graph cannot be converted to a SMILES string for saving. The placeholder is used because the NLL is written for all generated graphs in a separate file, where the same line number in the \*.nll file corresponds to the same line number in the \*.smi file. Similarly, if an empty graph samples an invalid action as its first action, no SMILES can be generated for it, so the corresponding line in the SMILES file contains only the "ID" of the molecule.
69 | 
70 | For visualization, you might be interested in viewing only the valid molecular graphs.
The SMILES for the generated molecules can thus be post-processed as follows to remove empty and invalid entries from a file before visualization: 71 | 72 | ``` 73 | sed -i "/Xe/d" path/to/file.smi # remove "Xe" placeholders from file 74 | sed -i "/^ [0-9]\+$/d" path/to/file.smi # remove empty graphs from file 75 | ``` 76 | -------------------------------------------------------------------------------- /tutorials/4_transfer_learning.md: -------------------------------------------------------------------------------- 1 | ## Transfer learning using GraphINVENT 2 | In this tutorial, you will be guided through the process of generating molecules with targeted properties using transfer learning (TL). 3 | 4 | This tutorial assumes that you have looked through tutorials [1_introduction](./1_introduction.md) and [2_using_a_new_dataset](./2_using_a_new_dataset.md). 5 | 6 | ### Selecting two (or more) datasets 7 | In order to do transfer learning, you must first select two datasets which you would like to work with. The first and (probably) larger dataset should be one that you can use to train your model generally, whereas the second should be one containing (a few) examples of molecules exhibiting the properties you desire in your generated molecules (e.g. known actives). 8 | 9 | When choosing your datasets, first, remember that GraphINVENT models are computationally demanding; I recommend you go back and review the *Selecting a new dataset* guidelines provided in [2_using_a_new_dataset](./2_using_a_new_dataset.md). 10 | 11 | Second, ideally there is some amount of overlap between the structures in your general training set (set 1) and your targeted training set (set 2). If the two sets are totally different, it will be difficult for your model to learn how to apply what it learns from set 1 to set 2. However, they also should not come from the exact same distributions (otherwise, what's the point of doing TL...). 12 | 13 | 14 | ### Preparing a new dataset 15 | Once you have chosen your two datasets, you must prepare them so that they agree with the format expected by the program. GraphINVENT expects, for each dataset, three splits in SMILES format. Each split should be named as follows: 16 | 17 | * *train.smi* 18 | * *test.smi* 19 | * *valid.smi* 20 | 21 | These should contain the training set, test set, and validation set, respectively. It is not important for the SMILES to be canonical, and it also does not matter if the file has a header or not. How many structures you put in each split is also up to you. 22 | 23 | You should then create two new directories in [../data/](../data/), one for each dataset, where the name of each directory corresponds to a unique name for the dataset it contains: 24 | 25 | ``` 26 | mkdir path/to/GraphINVENT/data/set_1/ 27 | ./split_dataset set_1.smi # example script that writes a train.smi, valid.smi, and test.smi from set_1.smi 28 | mv train.smi valid.smi test.smi path/to/GraphINVENT/data/set_1/. 29 | 30 | mkdir path/to/GraphINVENT/data/set_2/ 31 | ./split_dataset set_2.smi # example script that writes a train.smi, valid.smi, and test.smi from set_2.smi 32 | mv train.smi valid.smi test.smi path/to/GraphINVENT/data/set_2/. 33 | 34 | ``` 35 | 36 | You will want to replace *set_1* and *set_2* above with the actual names for your datasets (e.g. *ChEMBL_subset*, *DRD2_actives*, etc). 37 | 38 | 39 | ### Preprocessing the new dataset 40 | Once you have prepared your datasets in the aforementioned format, you can move on to preprocessing them using GraphINVENT. 
To preprocess them, you will need to know the following information: 41 | 42 | * *max_n_nodes* 43 | * *atom_types* 44 | * *formal_charge* 45 | 46 | Be careful to calculate this for BOTH sets, and not just one e.g. if the *max_n_nodes* in set 1 is 38, and the *max_n_nodes* in set 2 is 15, then the *max_n_nodes* for BOTH sets will be 38. Similarly, if the *atom_types* in set 1 are ["C", "N", "O"] and the *atom_types* in set 2 are ["C", "O", "S"], then the *atom_types* for BOTH sets will be ["C", "N", "O", "S"]. Here, the specific order of elements in *atom_types* does not matter, so long as the order is the same for BOTH sets. 47 | 48 | We have provided a few scripts to help you calculate these properties in [../tools/](../tools/). 49 | 50 | Once you know these values, you can move on to preparing a submission script for preprocessing the first dataset. A sample submission script [../submit.py](../submit.py) has been provided. Begin by modifying the submission script to specify where the dataset can be found and what type of job you want to run. For preprocessing a new dataset, you can use the settings below, substituting in your own values where necessary: 51 | 52 | ``` 53 | submit.py > 54 | # define what you want to do for the specified job(s) 55 | dataset = "set 1" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ 56 | job_type = "preprocess" # this tells the code that this is a preprocessing job 57 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 58 | n_jobs = 1 # if you want to run multiple jobs (not recommended for preprocessing), set this to >1 59 | restart = False # this tells the code that this is not a restart job 60 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 61 | jobname = "preprocess" # this is the name of the job, to be used in labeling directories where output will be written 62 | ``` 63 | 64 | Then, specify whether you want the job to run using [SLURM](https://slurm.schedmd.com/overview.html). In the example below, we specify that we want the job to run as a regular process (i.e. no SLURM). In such cases, any specified run time and memory requirements will be ignored by the script. Note: if you want to use a different scheduler, this can be easily changed in the submission script (search for "sbatch" and change it to your scheduler's submission command). 65 | 66 | ``` 67 | submit.py > 68 | # if running using SLURM, specify the parameters below 69 | use_slurm = False # this tells the code to NOT use SLURM 70 | run_time = "1-00:00:00" # d-hh:mm:ss (will be ignored here) 71 | mem_GB = 20 # memory in GB (will be ignored here) 72 | ``` 73 | 74 | Then, specify the path to the Python binary in the GraphINVENT virtual environment. You probably won't need to change *graphinvent_path* or *data_path*, unless you want to run the code from a different directory. 75 | 76 | ``` 77 | submit.py > 78 | # set paths here 79 | python_path = f"../miniconda3/envs/graphinvent/bin/python" # this is the path to the Python binary to use (change to your own) 80 | graphinvent_path = f"./graphinvent/" # this is the directory containing the source code 81 | data_path = f"./data/" # this is the directory where all datasets are found 82 | ``` 83 | 84 | Finally, details regarding the specific dataset you want to use need to be entered. 
Here, you must remember to use the properties for BOTH datasets: 85 | 86 | ``` 87 | submit.py > 88 | # define dataset-specific parameters 89 | params = { 90 | "atom_types": ["C", "N", "O", "S"], # <-- change to your datasets' atom types 91 | "formal_charge": [-1, 0, +1], # <-- change to your datasets' formal charges 92 | "chirality": ["None", "R", "S"], # <-- ignored, unless you also specify `use_chirality`=True 93 | "max_n_nodes": 38, # <-- change to your datasets' value 94 | "job_type": job_type, 95 | "dataset_dir": f"{data_path}{dataset}/", 96 | "restart": restart, 97 | } 98 | ``` 99 | 100 | At this point, you are done editing the *submit.py* file and are ready to submit a preprocesing job. You can submit the job from the terminal using the following command: 101 | 102 | ``` 103 | (graphinvent)$ python submit.py 104 | ``` 105 | 106 | During preprocessing jobs, the following will be written to the specified *dataset_dir*: 107 | * 3 HDF files (*train.h5*, *valid.h5*, and *test.h5*) 108 | * *preprocessing_params.csv*, containing parameters used in preprocessing the dataset (for later reference) 109 | * *train.csv*, containing training set properties (e.g. histograms of number of nodes per molecule, number of edges per node, etc) 110 | 111 | A preprocessing job can take a few seconds to a few hours to finish, depending on the size of your dataset. 112 | 113 | Once you have preprocessed the first dataset, you must go back and preprocess the second dataset. To do this, you can use the same *submit.py* file; simply go back and change the dataset name: 114 | 115 | ``` 116 | submit.py > 117 | # define what you want to do for the specified job(s) 118 | dataset = "set 2" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ <-- this line changed 119 | job_type = "preprocess" # this tells the code that this is a preprocessing job 120 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 121 | n_jobs = 1 # if you want to run multiple jobs (not recommended for preprocessing), set this to >1 122 | restart = False # this tells the code that this is not a restart job 123 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 124 | jobname = "preprocess" # this is the name of the job, to be used in labeling directories where output will be written 125 | ``` 126 | 127 | ...and re-run: 128 | 129 | ``` 130 | (graphinvent)$ python submit.py 131 | ``` 132 | 133 | Once you have preprocessed both datasets, you are ready to run a general training job using the first dataset. 134 | 135 | ### Training models generally 136 | You can modify the same *submit.py* script to instead run a training job using the general dataset (set 1). Begin by changing the *job_type* and *jobname*; all other settings can be kept the same: 137 | 138 | ``` 139 | submit.py > 140 | # define what you want to do for the specified job(s) 141 | dataset = "set_1" 142 | job_type = "train" # this tells the code that this is a training job 143 | jobdir_start_idx = 0 144 | n_jobs = 1 145 | restart = False 146 | force_overwrite = False 147 | jobname = "train" # this is the name of the job, to be used in labeling directories where output will be written 148 | ``` 149 | 150 | If you would like to change the SLURM settings, you should do that next, but for this example we will keep them the same. 
You will then need to specify all parameters that you want to use for training: 151 | 152 | 153 | ``` 154 | submit.py > 155 | # define dataset-specific parameters 156 | params = { 157 | "atom_types": ["C", "N", "O", "S"], # change to your datasets' atom types 158 | "formal_charge": [-1, 0, +1], # change to your datasets' formal charges 159 | "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True 160 | "max_n_nodes": 38, # change to your datasets' value 161 | "job_type": job_type, 162 | "dataset_dir": f"{data_path}{dataset}/", 163 | "restart": restart, 164 | "model": "GGNN", # <-- which model to use (GGNN is the default, but showing it here to be explicit) 165 | "sample_every": 2, # <-- how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often) 166 | "init_lr": 1e-4, # <-- tune the initial learning rate if needed 167 | "epochs": 100, # <-- how many epochs you want to train for (you can experiment with this) 168 | "batch_size": 1000, # <-- tune the batch size if needed 169 | "block_size": 100000, # <-- tune the block size if needed 170 | } 171 | ``` 172 | 173 | If any parameters are not specified in *submit.py* before running, the model will use the default values in [../graphinvent/parameters/defaults.py](../graphinvent/parameters/defaults.py), but it is not always the case that the "default" values will work well for your datasets. For instance, the parameters related to the learning rate decay are strongly dependent on the dataset used, and you might have to tune them to get optimal performance using your datasets. Depending on your system, you might also need to tune the mini-batch and/or block size so as to reduce/increase the memory requirement for training jobs. 174 | 175 | You can then run a GraphINVENT training job from the terminal using the following command: 176 | 177 | ``` 178 | (graphinvent)$ python submit.py 179 | ``` 180 | 181 | As the models are training, you should see the progress bar updating on the terminal every epoch. The training status will be saved every epoch to the job directory, *output_{your_dataset_name}/{jobname}/job_{jobdir_start_idx}/*, which should be *output_{your_dataset_name}/train/job_0/* if you followed the settings above. Additionally, the evaluation scores will be saved every evaluation epoch to the job directory. Among the files written to this directory will be: 182 | 183 | * *generation.log*, containing various evaluation metrics for the generated set, calculated during evaluation epochs 184 | * *convergence.log*, containing the loss and learning rate for every epoch 185 | * *validation.log*, containing model scores (e.g. NLLs, UC-JSD), calculated during evaluation epochs 186 | * *model_restart_{epoch}.pth*, which are the model states for use in restarting jobs, or running generation/validation jobs with a trained model 187 | * *generation/*, a directory containing structures generated during evaluation epochs (\*.smi), as well as information on each structure's NLL (\*.nll) and validity (\*.valid) 188 | 189 | It is good to check the *generation.log* to verify that the generated set features indeed converge to those of the training set (first entry). If they do not, then you will have to tune the hyperparameters to get better performance. Furthermore, it is good to check the *convergence.log* to make sure the loss is smoothly decreasing during training. 
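A quick way to do the latter check is to plot the loss recorded in *convergence.log*. The snippet below is just a rough sketch: the exact column layout of *convergence.log* may differ from what is assumed here (epoch in the first column, loss in the second), so inspect the file and adjust the indices if necessary.

```
import matplotlib.pyplot as plt

epochs, losses = [], []
with open("output_set_1/train/job_0/convergence.log") as log_file:  # <-- adjust to your job directory
    for line in log_file:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank/malformed lines
        try:
            epochs.append(float(fields[0]))  # assumed: first column is the epoch
            losses.append(float(fields[1]))  # assumed: second column is the loss
        except ValueError:
            continue  # skip any header lines

plt.plot(epochs, losses)
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.savefig("convergence.png")
```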
190 | 
191 | #### Restarting a training job
192 | If for any reason you want to restart a training job from a previous epoch (e.g. you cancelled a training job before it reached convergence), then you can do this by setting *restart = True* in *submit.py* and rerunning. While it is possible to change certain parameters in *submit.py* before rerunning (e.g. *init_lr* or *epochs*), parameters related to the model should not be changed, as the program will load an existing model from the last saved *model_restart_{epoch}.pth* file (hence there will be a mismatch between the previous parameters and those you changed). Similarly, any settings related to the file location or job name should not be changed, as the program uses those settings to search in the right directory for the previously saved model. Finally, parameters related to the dataset (e.g. *atom_types*) should not be changed, not only for a restart job but throughout the entire workflow of a dataset. If you want to use different features in the node and edge feature representations, you will have to create a copy of the dataset in [../data/](../data/), give it a unique name, and preprocess it using the desired settings.
193 | 
194 | ### Fine-tuning your predictions
195 | By this point, you have trained a model on a general dataset, so it has ideally learned how to form chemically valid compounds. The next step is to fine-tune the model on a smaller set of molecules possessing the molecular properties that we would like our generated molecules to have. To do this, we can resume training from the generally trained model.
196 | 
197 | In practice, this means once again modifying *submit.py*, this time to specify a restart job on the second dataset:
198 | 
199 | ```
200 | submit.py >
201 | # define what you want to do for the specified job(s)
202 | dataset = "set_2" # <-- change this from "set_1" to "set_2"
203 | job_type = "train" # this tells the code that this is a training job
204 | jobdir_start_idx = 0
205 | n_jobs = 1
206 | restart = True # <-- specify a restart job
207 | force_overwrite = False
208 | jobname = "train" # this is the name of the job, to be used in labeling directories where output will be written (don't change this! otherwise GraphINVENT won't find the saved model states)
209 | ```
210 | 
211 | At this point, you can also fine-tune the training parameters, but below we have chosen to keep them all the same (you will have to see what works and what doesn't work for your dataset):
212 | 
213 | ```
214 | submit.py >
215 | # define dataset-specific parameters
216 | params = {
217 |     "atom_types": ["C", "N", "O", "S"], # change to your datasets' atom types
218 |     "formal_charge": [-1, 0, +1], # change to your datasets' formal charges
219 |     "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True
220 |     "max_n_nodes": 38, # change to your datasets' value
221 |     "job_type": job_type,
222 |     "dataset_dir": f"{data_path}{dataset}/",
223 |     "restart": restart,
224 |     "model": "GGNN", # <-- which model to use (GGNN is the default, but showing it here to be explicit)
225 |     "sample_every": 2, # <-- how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often)
226 |     "init_lr": 1e-4, # <-- tune the initial learning rate if needed
227 |     "epochs": 100, # <-- how many epochs you want to train for (you can experiment with this)
228 |     "batch_size": 1000, # <-- tune the batch size if needed
229 |     "block_size": 100000, # <-- tune the block size if needed
230 | }
231 | ```
232 | 
233 | Before submitting, you must also create a new output directory (manually) for set 2, containing the saved model state and the *convergence.log* file from the set 1 job, following the same directory structure as the output directory for set 1:
234 | 
235 | ```
236 | mkdir -p output_set_2/train/job_0/
237 | cp output_set_1/train/job_0/model_restart_100.pth output_set_2/train/job_0/.
238 | cp output_set_1/train/job_0/convergence.log output_set_2/train/job_0/.
239 | ```
240 | 
241 | This is necessary in order for GraphINVENT to successfully find the previous saved model state, containing the "generally" trained model.
242 | 
243 | Once you have done this, you can run the new training job from the terminal using the following command:
244 | 
245 | ```
246 | (graphinvent)$ python submit.py
247 | ```
248 | 
249 | The job will restart from the last saved state, so, for example, if your first training job on set 1 reached Epoch 100, then training on set 2 will resume from the model state saved then; just make sure your settings allow the job to train for additional epochs on set 2 (the generation example below assumes the fine-tuned model reaches Epoch 200).
250 | 
251 | ### Generating structures using the fine-tuned model
252 | Once you have fine-tuned your model, you can use a saved state (e.g. *model_restart_200.pth*) to generate targeted molecules. To do this, *submit.py* needs to be updated to specify a generation job. The first setting that needs to be changed is the *job_type*; all other settings here should be kept fixed so that the program can find the correct job directory:
253 | 
254 | ```
255 | submit.py >
256 | # define what you want to do for the specified job(s)
257 | dataset = "set_2"
258 | job_type = "generate" # this tells the code that this is a generation job
259 | jobdir_start_idx = 0
260 | n_jobs = 1
261 | restart = False
262 | force_overwrite = False
263 | jobname = "train" # don't change the jobname, or the program won't find the saved model
264 | ```
265 | 
266 | You will then need to update the *generation_epoch* and *n_samples* parameters in *submit.py*:
267 | 
268 | ```
269 | submit.py >
270 | # define dataset-specific parameters
271 | params = {
272 |     "atom_types": ["C", "N", "O", "S"], # change to your dataset's atom types
273 |     "formal_charge": [-1, 0, +1], # change to your dataset's formal charges
274 |     "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True
275 |     "max_n_nodes": 38, # change to your dataset's value
276 |     "job_type": job_type,
277 |     "dataset_dir": f"{data_path}{dataset}/",
278 |     "restart": restart,
279 |     "model": "GGNN",
280 |     "sample_every": 2, # how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often)
281 |     "init_lr": 1e-4, # tune the initial learning rate if needed
282 |     "epochs": 200, # how many epochs you want to train for (you can experiment with this)
283 |     "batch_size": 1000, # <-- tune the batch size if needed
284 |     "block_size": 100000, # tune the block size if needed
285 |     "generation_epoch": 200, # <-- specify which saved model (i.e. at which epoch) to use for generation
286 |     "n_samples": 30000, # <-- specify how many structures you want to generate
287 | }
288 | ```
289 | 
290 | The *generation_epoch* should correspond to the saved model state that you want to use for generation, and *n_samples* tells the program how many structures you want to generate. In the example above, the parameters specify that the model saved at Epoch 200 should be used to generate 30,000 structures. All other parameters should be kept the same (if they are related to training, such as *epochs* or *init_lr*, they will be ignored during generation jobs).
291 | 
292 | Structures will be generated in batches of size *batch_size*. If you encounter memory problems during generation jobs, reducing the batch size should once again solve them. Generated structures, along with their corresponding metadata, will be written to the *generation/* directory within the existing job directory. These files are:
293 | 
294 | * *epochGEN{generation_epoch}_{batch}.smi*, containing molecules generated at the epoch specified
295 | * *epochGEN{generation_epoch}_{batch}.nll*, containing their respective NLLs
296 | * *epochGEN{generation_epoch}_{batch}.valid*, containing their respective validity (0: invalid, 1: valid)
297 | 
298 | Additionally, the *generation.log* file will be updated with the various evaluation metrics for the generated structures.
299 | 
300 | If you've followed the tutorial up to here, it means you can successfully create new, targeted molecules using transfer learning.
301 | 
302 | Please see the other tutorials (e.g. [1_introduction](./1_introduction.md) and [2_using_a_new_dataset](./2_using_a_new_dataset.md)) for details on how one can post-process the structures for easy visualization, as well as how one can tune the hyperparameters to improve model performance using the different datasets.
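Before wrapping up, it can also be useful to run a quick sanity check on the fine-tuned model's output, e.g. counting how many generated structures are valid and how many of them already occur in the targeted training set (set 2). Below is a minimal RDKit sketch of such a check; the paths, the epochGEN filename, and the assumption that the first whitespace-separated token on each line is the SMILES are placeholders/assumptions that you should adapt to your own job.

```
from rdkit import Chem

def load_canonical_smiles(path):
    # read a SMILES file and return the set of canonical SMILES that RDKit can parse
    canonical = set()
    with open(path) as smi_file:
        for line in smi_file:
            tokens = line.split()
            if not tokens or "Xe" in tokens[0]:
                continue  # skip blank lines and "Xe" placeholders for invalid graphs
            mol = Chem.MolFromSmiles(tokens[0])
            if mol is not None:
                canonical.add(Chem.MolToSmiles(mol))
    return canonical

generated = load_canonical_smiles("output_set_2/train/job_0/generation/epochGEN200_0.smi")  # <-- adjust
fine_tuning_set = load_canonical_smiles("data/set_2/train.smi")                             # <-- adjust

print(f"unique valid generated structures: {len(generated)}")
print(f"fraction already in the fine-tuning set: {len(generated & fine_tuning_set) / max(len(generated), 1):.2f}")
```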
303 | 
304 | ### Summary
305 | Hopefully you are now able to train models to generate targeted molecules using transfer learning in GraphINVENT. If anything is unclear in this tutorial, or if you have any questions that have not been addressed by this guide, feel free to contact [me](https://github.com/rociomer).
306 | 
--------------------------------------------------------------------------------
 /tutorials/5_benchmarking_with_moses.md:
--------------------------------------------------------------------------------
1 | ## Benchmarking models with MOSES
2 | Models can be easily benchmarked using MOSES. To do this, we recommend reading the MOSES documentation, available at https://github.com/molecularsets/moses. If you want to compare to previously benchmarked models, you will need to train models using the MOSES datasets, available [here](https://github.com/molecularsets/moses/tree/master/data).
3 | 
4 | Once you have a satisfactorily trained model, you can run a Generation job to create 30,000 new structures (see [2_using_a_new_dataset](./2_using_a_new_dataset.md) and follow the instructions using the MOSES dataset). The generated structures can then be used as the generated set in MOSES evaluation jobs.
5 | 
6 | From our experience, MOSES benchmarking jobs require ca. 30 GB of RAM and finish in about an hour.
7 | 
--------------------------------------------------------------------------------
 /tutorials/6_preprocessing_large_datasets.md:
--------------------------------------------------------------------------------
1 | ## Preprocessing large datasets
2 | 
3 | For preprocessing very large datasets (e.g. MOSES, with over 1M structures in the training set), it is recommended to split up the data and preprocess the splits on separate CPUs.
4 | 
5 | Until I get around to adding a proper way to do this in the code, one can do it the hacky way: simply split up the large dataset into many smaller datasets, preprocess them as separate CPU jobs, and then combine the results with a small script at the end.
6 | 
7 | So first, split up the desired SMILES file by running
8 | 
9 | ```
10 | split -l 100000 train.smi
11 | ```
12 | 
13 | The above line assumes that you want to split the training data.
14 | 
15 | Then, place each of the splits in a separate directory in [../data/](../data/), such as *my_dataset_1/train.smi*, making sure to rename each split to "train.smi" from whatever name `split` gives it (e.g. "xaa", "xab", etc.).
16 | 
17 | Then, comment out the lines for preprocessing the validation and test sets in [../graphinvent/Workflow.py](../graphinvent/Workflow.py):
18 | 
19 | ```
20 | # self.preprocess_valid_data()
21 | # self.preprocess_test_data()
22 | ```
23 | 
24 | Finally, set your desired parameters in *submit.py* and run a preprocessing job for each split (within the GraphINVENT conda environment):
25 | 
26 | ```
27 | (graphinvent)$ python submit.py
28 | ```
29 | 
30 | Once all the HDF files are preprocessed, they can be combined using [../tools/combine_HDFs.py](../tools/combine_HDFs.py).
31 | 
32 | Don't forget to uncomment the above lines in *Workflow.py* when you are done.
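As a final note, if you prefer to script the split/rename steps above rather than doing them by hand, something like the following sketch does the same thing (the dataset name *my_dataset* and the 100,000-line chunk size are just examples matching the commands above). Each resulting directory can then be pointed to from *submit.py* for its own preprocessing job, and the resulting HDF files combined with *combine_HDFs.py* as described.

```
from pathlib import Path

source = Path("train.smi")  # <-- the large training split you want to divide
chunk_size = 100_000        # same as `split -l 100000` above

with open(source) as smi_file:
    lines = smi_file.readlines()

for i in range(0, len(lines), chunk_size):
    # e.g. data/my_dataset_1/, data/my_dataset_2/, ... (one small "dataset" per chunk)
    subdir = Path(f"data/my_dataset_{i // chunk_size + 1}")
    subdir.mkdir(parents=True, exist_ok=True)
    with open(subdir / "train.smi", "w") as out_file:
        out_file.writelines(lines[i:i + chunk_size])
```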
33 | 
34 | 
--------------------------------------------------------------------------------
 /tutorials/7_reinforcement_learning.md:
--------------------------------------------------------------------------------
1 | ## Reinforcement learning
2 | TODO
3 | 
--------------------------------------------------------------------------------
 /tutorials/README.md:
--------------------------------------------------------------------------------
1 | # GraphINVENT tutorials
2 | 
3 | ![School vector created by pch.vector - www.freepik.com](https://image.freepik.com/free-vector/online-tutorials-concept_52683-37480.jpg)
4 | 
5 | ## Description
6 | This directory contains guides on how to use GraphINVENT. If viewing in a browser (recommended), simply click the link to the desired tutorial to view it.
7 | 
8 | ## Tutorials
9 | * [0_setting_up_environment](./0_setting_up_environment.md) : Instructions on how to set up the GraphINVENT virtual environment.
10 | * [1_introduction](./1_introduction.md) : A quick introduction to GraphINVENT. Uses the example dataset *gdb13_1K* to guide new users through Training and Generation jobs in the code.
11 | * [2_using_a_new_dataset](./2_using_a_new_dataset.md) : A tutorial on how to use new datasets to train models in GraphINVENT.
12 | * [3_visualizing_molecules](./3_visualizing_molecules.md) : A quick guide on how to visualize grids of molecules using RDKit.
13 | * [4_transfer_learning](./4_transfer_learning.md) : A guide on how to use GraphINVENT for transfer learning tasks.
14 | * [5_benchmarking_with_moses](./5_benchmarking_with_moses.md) : A guide on how to benchmark GraphINVENT models using the MOSES distribution-based benchmarks.
15 | * [6_preprocessing_large_datasets](./6_preprocessing_large_datasets.md) : A guide on how to preprocess large datasets in GraphINVENT.
16 | * [7_reinforcement_learning](./7_reinforcement_learning.md) : A guide on how to fine-tune GraphINVENT models for molecular optimization and *de novo* design tasks. [TODO]
17 | 
18 | ## Comments
19 | If a tutorial doesn't exist for something you'd like to do, contact [me](https://github.com/rociomer) and I'll be happy to create one (if I think others would benefit from it and I have time). Similarly, if you find an error in a tutorial, please let me know so that I can correct it.
20 | 
21 | ## Author
22 | Rocío Mercado
23 | 
--------------------------------------------------------------------------------