├── .gitignore ├── LICENSE ├── README.md ├── cover-image.png ├── data ├── fine-tuning │ ├── QSAR_model_example.pickle │ └── gdb13_1K-debug │ │ ├── preprocessing_params.csv │ │ ├── pretrained_model.pth │ │ └── train.csv └── pre-training │ ├── DRD2_actives │ ├── test.smi │ ├── train.smi │ └── valid.smi │ ├── gdb13_1K-debug │ ├── preprocessing_params.csv │ ├── test.h5 │ ├── test.smi │ ├── train.csv │ ├── train.h5 │ ├── train.smi │ ├── valid.h5 │ └── valid.smi │ └── gdb13_1K │ ├── preprocessing_params.csv │ ├── test.h5 │ ├── test.smi │ ├── train.csv │ ├── train.h5 │ ├── train.smi │ ├── valid.h5 │ └── valid.smi ├── environments └── graphinvent.yml ├── graphinvent ├── Analyzer.py ├── BlockDatasetLoader.py ├── DataProcesser.py ├── GraphGenerator.py ├── GraphGeneratorRL.py ├── MolecularGraph.py ├── ScoringFunction.py ├── Workflow.py ├── __init__.py ├── gnn │ ├── __init__.py │ ├── aggregation_mpnn.py │ ├── edge_mpnn.py │ ├── modules.py │ ├── mpnn.py │ └── summation_mpnn.py ├── main.py ├── parameters │ ├── __init__.py │ ├── args.py │ ├── constants.py │ ├── defaults.py │ └── load.py └── util.py ├── output └── input.csv ├── submit-fine-tuning.py ├── submit-pre-training-supercloud.py ├── submit-pre-training.py ├── tools ├── README.md ├── atom_types.py ├── combine_HDFs.py ├── formal_charges.py ├── max_n_nodes.py ├── submit-split-preprocessing-supercloud.py ├── tdc-create-dataset.py └── utils.py └── tutorials ├── 0_setting_up_environment.md ├── 1_introduction.md ├── 2_using_a_new_dataset.md ├── 3_visualizing_molecules.md ├── 4_transfer_learning.md ├── 5_benchmarking_with_moses.md ├── 6_preprocessing_large_datasets.md ├── 7_reinforcement_learning.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | data/fine-tuning/qsar_model.pickle 2 | .vscode/* 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2020 Rocío Mercado. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **Please note: this repository is no longer being maintained.** 2 | 3 | # GraphINVENT 4 | 5 | ![cover image](./cover-image.png) 6 | 7 | ## Description 8 | GraphINVENT is a platform for graph-based molecular generation using graph neural networks. 
GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time. All models implemented in GraphINVENT can quickly learn to build molecules resembling training set molecules without any explicit programming of chemical rules. The models have been benchmarked using the MOSES distribution-based metrics, showing that the best GraphINVENT model compares well with state-of-the-art generative models. 9 | 10 | ## Updates 11 | The following versions of GraphINVENT exist in this repository: 12 | * v1.0 (and all commits up to here) is the label corresponding to the "original" version, and corresponds to the publications below. 13 | * v2.0 is an outdated version, created March 10, 2021. 14 | * v3.0 is the latest version, created August 20, 2021. 15 | 16 | *20-08-2021*: 17 | 18 | Large update: 19 | * Added a reinforcement learning framework to allow for fine-tuning of models. Fine-tuning jobs can now be run using the --job-type "fine-tune" flag. 20 | * An example submission script for fine-tuning jobs was added (`submit-fine-tuning.py`), and the old example submission script was renamed (`submit.py` --> `submit-pre-training.py`). 21 | * Note: the tutorials have not yet been updated to reflect these changes; this will be done soon, but for now be aware that there may be small discrepancies between what is written in the tutorials and the actual instructions. I will delete this bullet point when I have updated the tutorials. 22 | 23 | *26-03-2021*: 24 | 25 | Small update: 26 | * Pre-trained models created with GraphINVENT v1.0 can now be used with GraphINVENT v2.0. 27 | 28 | *10-03-2021*: 29 | 30 | The biggest changes in v2.0 relative to v1.0 are summarized below: 31 | * Data preprocessing was updated for readability (now done in `DataProcesser.py`). 32 | * Graph generation was updated for readability (now done in `GraphGenerator.py`), and bugs were fixed in how implicit Hs and chirality were handled on the GPU (not used before, despite being available for preprocessing/training). 33 | * Data analysis code was updated for readability (now done in `Analyzer.py`). 34 | * The learning rate decay scheme was changed from a custom learning rate scheduler to the OneCycle scheduler (so far, it appears to be working well enough, and with a reduced set of parameters; see the illustrative sketch below). 35 | * The code now runs using the latest version of PyTorch (1.8.0); the previous version ran on PyTorch 1.3. The environment has correspondingly been updated (and renamed from "GraphINVENT-env" to "graphinvent"). 36 | * Redundant hyperparameters were removed; additionally, hyperparameters that were not observed to improve performance were removed from `defaults.py`, such as the optimizer weight decay (now fixed at 0.0) and the weights initialization (now fixed to Xavier uniform). 37 | * Some old modules, such as `models.py` and `loss.py`, were consolidated into `Workflow.py`. 38 | * A validation loss calculation was added to keep track of model training. 39 | 40 | Additionally, minor typos and bugs were corrected, and the docstrings and error messages were updated. Examples of minor bugs/changes: 41 | * A bug in how the fraction of properly terminated graphs (and the fraction of properly terminated graphs that are valid) was calculated (the wrong function was used for the data type, which led to errors in rare instances). 42 | * Errors in how analysis histograms were written to TensorBoard; these were also of questionable utility, so they have simply been removed. 43 | * Some values (like the "NLL diff") were removed, as they were also not found to be useful.
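For reference, below is a minimal, illustrative sketch of how PyTorch's `OneCycleLR` scheduler is typically wired into a training loop. This is not the actual code in `Workflow.py`; the model, optimizer, and hyperparameter values are placeholders chosen only to make the example runnable.

```
import torch

model     = torch.nn.Linear(10, 1)  # stand-in for a GraphINVENT model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,       # peak learning rate (placeholder value)
    total_steps=1000,  # total number of optimizer steps (placeholder value)
)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per mini-batch, not once per epoch
```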
44 | 45 | If you spot any issues (big or small) since the update, please create an issue or a pull request (if you are able to fix it), and we will be happy to make changes. 46 | 47 | ## Prerequisites 48 | * Anaconda or Miniconda with Python 3.6 or 3.8. 49 | * (for GPU training only) A CUDA-enabled GPU. 50 | 51 | ## Instructions and tutorials 52 | For detailed guides on how to use GraphINVENT, see the [tutorials](./tutorials/). 53 | 54 | ## Examples 55 | An example training set is available in [./data/pre-training/gdb13_1K/](./data/pre-training/gdb13_1K/). It is a small (1K) subset of GDB-13 and is already preprocessed. 56 | 57 | ## Contributors 58 | [@rociomer](https://www.github.com/rociomer) 59 | 60 | [@rastemo](https://www.github.com/rastemo) 61 | 62 | [@edvardlindelof](https://www.github.com/edvardlindelof) 63 | 64 | [@sararromeo](https://www.github.com/sararromeo) 65 | 66 | [@JuanViguera](https://www.github.com/JuanViguera) 67 | 68 | [@psolsson](https://www.github.com/psolsson) 69 | 70 | ## Contributions 71 | 72 | Contributions are welcome in the form of issues or pull requests. To report a bug, please submit an issue. Thank you to everyone who has used the code and provided feedback thus far. 73 | 74 | 75 | ## References 76 | ### Relevant publications 77 | If you use GraphINVENT in your research, please reference our [publication](https://doi.org/10.1088/2632-2153/abcf91). 78 | 79 | Additional details related to the development of GraphINVENT are available in our [technical note](https://doi.org/10.1002/ail2.18). You might find this note useful if you're interested in either exploring different hyperparameters or developing your own generative models. 80 | 81 | The references in BibTeX format are available below: 82 | 83 | ``` 84 | @article{mercado2020graph, 85 | author = "Rocío Mercado and Tobias Rastemo and Edvard Lindelöf and Günter Klambauer and Ola Engkvist and Hongming Chen and Esben Jannik Bjerrum", 86 | title = "{Graph Networks for Molecular Design}", 87 | journal = {Machine Learning: Science and Technology}, 88 | year = {2020}, 89 | publisher = {IOP Publishing}, 90 | doi = "10.1088/2632-2153/abcf91" 91 | } 92 | 93 | @article{mercado2020practical, 94 | author = "Rocío Mercado and Tobias Rastemo and Edvard Lindelöf and Günter Klambauer and Ola Engkvist and Hongming Chen and Esben Jannik Bjerrum", 95 | title = "{Practical Notes on Building Molecular Graph Generative Models}", 96 | journal = {Applied AI Letters}, 97 | year = {2020}, 98 | publisher = {Wiley Online Library}, 99 | doi = "10.1002/ail2.18" 100 | } 101 | ``` 102 | 103 | ### Related work 104 | #### MPNNs 105 | The MPNN implementations used in this work were pulled from Edvard Lindelöf's repo in October 2018, while he was a master's student in the MAI group. This work is available at 106 | 107 | https://github.com/edvardlindelof/graph-neural-networks-for-drug-discovery. 108 | 109 | His master's thesis, describing the EMN implementation, can be found at 110 | 111 | https://odr.chalmers.se/handle/20.500.12380/256629. 112 | 113 | #### MOSES 114 | The MOSES repo is available at https://github.com/molecularsets/moses. 115 | 116 | #### GDB-13 117 | The example dataset provided is a subset of GDB-13. It was obtained by randomly sampling 1000 structures from the entire GDB-13 dataset. The full dataset is available for download at http://gdb.unibe.ch/downloads/.
118 | 119 | 120 | #### RL-GraphINVENT 121 | Version 3.0 incorporates Sara's work into the latest GraphINVENT framework: [repo](https://github.com/olsson-group/RL-GraphINVENT) and [paper](https://doi.org/10.33774/chemrxiv-2021-9w3tc). Her work was presented at the [RL4RealLife](https://sites.google.com/view/RL4RealLife) workshop at ICML 2021. 122 | 123 | #### Exploring graph traversal algorithms in GraphINVENT 124 | In [this](https://doi.org/10.33774/chemrxiv-2021-5c5l1) pre-print, we look into the effect of different graph traversal algorithms on the types of structures that are generated by GraphINVENT. We find that a BFS generally leads to better molecules than a DFS, unless the model is overtrained, at which point both graph traversal algorithms lead to indistinguishible sets of structures. 125 | 126 | ## License 127 | 128 | GraphINVENT is licensed under the MIT license and is free and provided as-is. 129 | 130 | ## Link 131 | https://github.com/MolecularAI/GraphINVENT/ 132 | -------------------------------------------------------------------------------- /cover-image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/cover-image.png -------------------------------------------------------------------------------- /data/fine-tuning/QSAR_model_example.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/fine-tuning/QSAR_model_example.pickle -------------------------------------------------------------------------------- /data/fine-tuning/gdb13_1K-debug/preprocessing_params.csv: -------------------------------------------------------------------------------- 1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | chirality;['None', 'R', 'S'] 3 | formal_charge;[-1, 0, 1] 4 | ignore_H;True 5 | imp_H;[0, 1, 2, 3] 6 | max_n_nodes;13 7 | use_aromatic_bonds;False 8 | use_chirality;False 9 | use_explicit_H;False 10 | -------------------------------------------------------------------------------- /data/fine-tuning/gdb13_1K-debug/pretrained_model.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/fine-tuning/gdb13_1K-debug/pretrained_model.pth -------------------------------------------------------------------------------- /data/fine-tuning/gdb13_1K-debug/train.csv: -------------------------------------------------------------------------------- 1 | ('Training set', 'n_nodes_hist');[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 2.25] 2 | ('Training set', 'avg_n_nodes');12.625 3 | ('Training set', 'atom_type_hist');[25.5, 6.75, 5.25, 0.0, 0.0] 4 | ('Training set', 'formal_charge_hist');[0.0, 37.5, 0.0] 5 | ('Training set', 'n_edges_hist');[10.0, 14.75, 12.0, 0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 6 | ('Training set', 'avg_n_edges');2.072 7 | ('Training set', 'edge_feature_hist');[33.25, 5.5, 0.5] 8 | ('Training set', 'fraction_unique');0.0 9 | ('Training set', 'fraction_valid');1.0 10 | ('Training set', 'numh_hist');[] 11 | ('Training set', 'chirality_hist');[] 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/preprocessing_params.csv: -------------------------------------------------------------------------------- 
1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | chirality;['None', 'R', 'S'] 3 | formal_charge;[-1, 0, 1] 4 | ignore_H;True 5 | imp_H;[0, 1, 2, 3] 6 | max_n_nodes;13 7 | use_aromatic_bonds;False 8 | use_chirality;False 9 | use_explicit_H;False 10 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/test.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K-debug/test.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/test.smi: -------------------------------------------------------------------------------- 1 | SMILES Name 2 | OC1C(O)C2C1NCC2O 1665110 3 | CC1CC2C(C1)C21CN=CN1 2415507 4 | CC=CC1(N)C=CC2C(N)C12 3385993 5 | CC#CC12COC(=O)CC1O2 5741941 6 | COC(C)C#CC(O)C(C)=O 6426824 7 | OCC=C1COC2CCCC12 6724947 8 | C=C1C2CC2CC1(C=O)C#N 7240989 9 | CC1CN(CC(CO)C=C)C1 8022539 10 | CC1N(C=N)N=C(C)OC1=O 8541293 11 | CCC1(C)C2(N)COCC12N 9017466 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/train.csv: -------------------------------------------------------------------------------- 1 | ('Training set', 'n_nodes_hist');[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 2.25] 2 | ('Training set', 'avg_n_nodes');12.625 3 | ('Training set', 'atom_type_hist');[25.5, 6.75, 5.25, 0.0, 0.0] 4 | ('Training set', 'formal_charge_hist');[0.0, 37.5, 0.0] 5 | ('Training set', 'n_edges_hist');[10.0, 14.75, 12.0, 0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 6 | ('Training set', 'avg_n_edges');2.072 7 | ('Training set', 'edge_feature_hist');[33.25, 5.5, 0.5] 8 | ('Training set', 'fraction_unique');0.0 9 | ('Training set', 'fraction_valid');1.0 10 | ('Training set', 'numh_hist');[] 11 | ('Training set', 'chirality_hist');[] 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/train.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K-debug/train.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/train.smi: -------------------------------------------------------------------------------- 1 | SMILES Name 2 | CC1C2N1CC1=C2CC=C1 69719 3 | CC(C)C1=CCC2C3C=COC123 41613507 4 | CN1CC(N)C2CC2(CO)ON=C1 676397262 5 | CC1NC1C1ON=CC(C)C1CN 481064820 6 | CN=C1CN(C)CC(CCO)CN1 695143543 7 | CC1C2CC1(O)C(C=O)=C2C 4967159 8 | COCC=CC(C)N(C=N)N(C)C 758801042 9 | OC(C=C)C1=CC(=O)C2CC2OC1 356175703 10 | CNC(C#C)C1=NN=C(CO)N1O 708514546 11 | COC(=O)NC(C)C1CNNC1=O 766737120 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/valid.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K-debug/valid.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K-debug/valid.smi: -------------------------------------------------------------------------------- 1 | SMILES Name 2 | CCC(C=CC)=CC(C)C 28951 3 | N#CC1CCC2NC2CN1 375643 4 
| OCC1(CC1)C(=O)OC=C 692764 5 | OCCNCCN(O)C=N 1739460 6 | CC12C3C4C(C4C1=C)C2C3N 2179443 7 | NC1CC(N)C(C#C)C1C#C 4528171 8 | CC1CC2(CC=C)CC2C1=O 5070952 9 | CCC(CO)C1=CC=CC1=O 6015379 10 | CC12CC=NN=CC1NC2=O 7076956 11 | CC1C2CC1(C)NC2C=NO 7236816 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/preprocessing_params.csv: -------------------------------------------------------------------------------- 1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | chirality;['None', 'R', 'S'] 3 | formal_charge;[-1, 0, 1] 4 | ignore_H;True 5 | imp_H;[0, 1, 2, 3] 6 | max_n_nodes;13 7 | use_aromatic_bonds;False 8 | use_chirality;False 9 | use_explicit_H;False 10 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/test.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K/test.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/train.csv: -------------------------------------------------------------------------------- 1 | ('Training set', 'n_nodes_hist');[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.064, 1.414, 9.493, 53.61] 2 | ('Training set', 'avg_n_nodes');12.787 3 | ('Training set', 'atom_type_hist');[589.076, 125.834, 104.132, 7.85, 0.137] 4 | ('Training set', 'formal_charge_hist');[0.25, 826.53, 0.25] 5 | ('Training set', 'n_edges_hist');[187.398, 361.517, 236.11, 42.004, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 6 | ('Training set', 'avg_n_edges');2.164 7 | ('Training set', 'edge_feature_hist');[752.936, 127.018, 13.436] 8 | ('Training set', 'fraction_unique');0.0 9 | ('Training set', 'fraction_valid');1.0 10 | ('Training set', 'numh_hist');[] 11 | ('Training set', 'chirality_hist');[] 12 | -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/train.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K/train.h5 -------------------------------------------------------------------------------- /data/pre-training/gdb13_1K/valid.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/data/pre-training/gdb13_1K/valid.h5 -------------------------------------------------------------------------------- /environments/graphinvent.yml: -------------------------------------------------------------------------------- 1 | name: graphinvent 2 | channels: 3 | - pytorch 4 | - anaconda 5 | - conda-forge 6 | - defaults 7 | dependencies: 8 | - _libgcc_mutex=0.1=conda_forge 9 | - _openmp_mutex=4.5=1_gnu 10 | - absl-py=0.11.0=py38h578d9bd_0 11 | - astroid=2.4.2=py38_0 12 | - blas=1.0=mkl 13 | - boost=1.74.0=py38hc10631b_3 14 | - boost-cpp=1.74.0=h9359b55_0 15 | - bzip2=1.0.8=h7b6447c_0 16 | - c-ares=1.17.1=h7f98852_1 17 | - ca-certificates=2020.10.14=0 18 | - cached-property=1.5.1=py_0 19 | - cairo=1.16.0=h3fc0475_1005 20 | - certifi=2020.6.20=py38_0 21 | - cudatoolkit=10.1.243=h6bb024c_0 22 | - cycler=0.10.0=py_2 23 | - ffmpeg=4.3=hf484d3e_0 24 | - fontconfig=2.13.1=hba837de_1004 25 | - freetype=2.10.4=h5ab3b9f_0 26 | - glib=2.67.4=h36276a3_1 27 | - 
gmp=6.2.1=h2531618_2 28 | - gnutls=3.6.5=h71b1129_1002 29 | - grpcio=1.36.1=py38hdd6454d_0 30 | - h5py=3.1.0=nompi_py38hafa665b_100 31 | - hdf5=1.10.6=nompi_h6a2412b_1114 32 | - icu=67.1=he1b5a44_0 33 | - importlib-metadata=3.7.2=py38h578d9bd_0 34 | - intel-openmp=2020.2=254 35 | - isort=5.6.4=py_0 36 | - joblib=1.0.1=pyhd8ed1ab_0 37 | - jpeg=9b=h024ee3a_2 38 | - kiwisolver=1.3.1=py38h1fd1430_1 39 | - krb5=1.17.2=h926e7f8_0 40 | - lame=3.100=h7b6447c_0 41 | - lazy-object-proxy=1.4.3=py38h7b6447c_0 42 | - lcms2=2.11=h396b838_0 43 | - ld_impl_linux-64=2.33.1=h53a641e_7 44 | - libblas=3.9.0=1_h86c2bf4_netlib 45 | - libcblas=3.9.0=5_h92ddd45_netlib 46 | - libcurl=7.75.0=hc4aaa36_0 47 | - libedit=3.1.20191231=h14c3975_1 48 | - libev=4.33=h516909a_1 49 | - libffi=3.3=he6710b0_2 50 | - libgcc-ng=9.3.0=h2828fa1_18 51 | - libgfortran-ng=9.3.0=hff62375_18 52 | - libgfortran5=9.3.0=hff62375_18 53 | - libgomp=9.3.0=h2828fa1_18 54 | - libiconv=1.15=h63c8f33_5 55 | - liblapack=3.9.0=5_h92ddd45_netlib 56 | - libnghttp2=1.43.0=h812cca2_0 57 | - libpng=1.6.37=hbc83047_0 58 | - libprotobuf=3.15.5=h780b84a_0 59 | - libssh2=1.9.0=hab1572f_5 60 | - libstdcxx-ng=9.3.0=h6de172a_18 61 | - libtiff=4.1.0=h2733197_1 62 | - libuuid=2.32.1=h7f98852_1000 63 | - libuv=1.40.0=h7b6447c_0 64 | - libxcb=1.13=h7f98852_1003 65 | - libxml2=2.9.10=h72b56ed_2 66 | - lz4-c=1.9.3=h2531618_0 67 | - markdown=3.3.4=pyhd8ed1ab_0 68 | - matplotlib-base=3.3.4=py38h0efea84_0 69 | - mccabe=0.6.1=py38_1 70 | - mkl=2020.2=256 71 | - mkl-service=2.3.0=py38he904b0f_0 72 | - mkl_fft=1.3.0=py38h54f3939_0 73 | - mkl_random=1.1.1=py38h0573a6f_0 74 | - ncurses=6.2=he6710b0_1 75 | - nettle=3.4.1=hbb512f6_0 76 | - ninja=1.10.2=py38hff7bd54_0 77 | - numpy=1.19.2=py38h54aff64_0 78 | - numpy-base=1.19.2=py38hfa32c7d_0 79 | - olefile=0.46=py_0 80 | - openh264=2.1.0=hd408876_0 81 | - openssl=1.1.1k=h7f98852_0 82 | - pandas=1.2.3=py38h51da96c_0 83 | - pcre=8.44=he1b5a44_0 84 | - pillow=8.1.1=py38he98fc37_0 85 | - pip=21.0.1=py38h06a4308_0 86 | - pixman=0.38.0=h516909a_1003 87 | - protobuf=3.15.5=py38h709712a_0 88 | - pthread-stubs=0.4=h36c2ea0_1001 89 | - pycairo=1.20.0=py38h323dad1_1 90 | - pylint=2.6.0=py38_0 91 | - pyparsing=2.4.7=pyh9f0ad1d_0 92 | - python=3.8.8=hdb3f193_4 93 | - python-dateutil=2.8.1=py_0 94 | - python_abi=3.8=1_cp38 95 | - pytorch=1.8.0=py3.8_cuda10.1_cudnn7.6.3_0 96 | - pytz=2021.1=pyhd8ed1ab_0 97 | - rdkit=2020.09.5=py38h2bca085_0 98 | - readline=8.1=h27cfd23_0 99 | - reportlab=3.5.63=py38hadf75a6_0 100 | - scikit-learn=0.21.1=py38hd81dba3_0 101 | - scipy=1.7.0=py38h7b17777_1 102 | - setuptools=52.0.0=py38h06a4308_0 103 | - six=1.15.0=py38h06a4308_0 104 | - sqlalchemy=1.3.23=py38h497a2fe_0 105 | - sqlite=3.33.0=h62c20be_0 106 | - tensorboard=1.15.0=py38_0 107 | - threadpoolctl=2.2.0=pyh8a188c0_0 108 | - tk=8.6.10=hbc83047_0 109 | - toml=0.10.1=py_0 110 | - torchaudio=0.8.0=py38 111 | - torchvision=0.9.0=py38_cu101 112 | - tornado=6.1=py38h497a2fe_1 113 | - tqdm=4.59.0=pyhd8ed1ab_0 114 | - typing_extensions=3.7.4.3=pyha847dfd_0 115 | - werkzeug=1.0.1=pyh9f0ad1d_0 116 | - wheel=0.36.2=pyhd3eb1b0_0 117 | - wrapt=1.11.2=py38h7b6447c_0 118 | - xorg-kbproto=1.0.7=h7f98852_1002 119 | - xorg-libice=1.0.10=h7f98852_0 120 | - xorg-libsm=1.2.3=hd9c2040_1000 121 | - xorg-libx11=1.7.0=h7f98852_0 122 | - xorg-libxau=1.0.9=h7f98852_0 123 | - xorg-libxdmcp=1.1.3=h7f98852_0 124 | - xorg-libxext=1.3.4=h7f98852_1 125 | - xorg-libxrender=0.9.10=h7f98852_1003 126 | - xorg-renderproto=0.11.1=h7f98852_1002 127 | - xorg-xextproto=7.3.0=h7f98852_1002 128 
| - xorg-xproto=7.0.31=h7f98852_1007 129 | - xz=5.2.5=h7b6447c_0 130 | - zipp=3.4.1=pyhd8ed1ab_0 131 | - zlib=1.2.11=h7b6447c_3 132 | - zstd=1.4.5=h9ceee32_0 133 | -------------------------------------------------------------------------------- /graphinvent/BlockDatasetLoader.py: -------------------------------------------------------------------------------- 1 | """ 2 | The `BlockDatasetLoader` defines custom `DataLoader`s and `Dataset`s used to 3 | efficiently load data from HDF files in this work 4 | """ 5 | # load general packages and functions 6 | from typing import Tuple 7 | import torch 8 | import h5py 9 | 10 | 11 | class BlockDataLoader(torch.utils.data.DataLoader): 12 | """ 13 | Main `DataLoader` class which has been modified so as to read training data 14 | from disk in blocks, as opposed to a single line at a time (as is done in 15 | the original `DataLoader` class). 16 | """ 17 | def __init__(self, dataset : torch.utils.data.Dataset, batch_size : int=100, 18 | block_size : int=10000, shuffle : bool=True, n_workers : int=0, 19 | pin_memory : bool=True) -> None: 20 | 21 | # define variables to be used throughout dataloading 22 | self.dataset = dataset # `HDFDataset` object 23 | self.batch_size = batch_size # `int` 24 | self.block_size = block_size # `int` 25 | self.shuffle = shuffle # `bool` 26 | self.n_workers = n_workers # `int` 27 | self.pin_memory = pin_memory # `bool` 28 | self.block_dataset = BlockDataset(self.dataset, 29 | batch_size=self.batch_size, 30 | block_size=self.block_size) 31 | 32 | def __iter__(self) -> torch.Tensor: 33 | 34 | # define a regular `DataLoader` using the `BlockDataset` 35 | block_loader = torch.utils.data.DataLoader(self.block_dataset, 36 | shuffle=self.shuffle, 37 | num_workers=self.n_workers) 38 | 39 | # define a condition for determining whether to drop the last block this 40 | # is done if the remainder block is very small (less than a tenth the 41 | # size of a normal block) 42 | condition = bool( 43 | int(self.block_dataset.__len__()/self.block_size) > 1 & 44 | self.block_dataset.__len__()%self.block_size < self.block_size/10 45 | ) 46 | 47 | # loop through and load BLOCKS of data every iteration 48 | for block in block_loader: 49 | block = [torch.squeeze(b) for b in block] 50 | 51 | # wrap each block in a `ShuffleBlock` so that data can be shuffled 52 | # within blocks 53 | batch_loader = torch.utils.data.DataLoader( 54 | dataset=ShuffleBlockWrapper(block), 55 | shuffle=self.shuffle, 56 | batch_size=self.batch_size, 57 | num_workers=self.n_workers, 58 | pin_memory=self.pin_memory, 59 | drop_last=condition 60 | ) 61 | 62 | for batch in batch_loader: 63 | yield batch 64 | 65 | def __len__(self) -> int: 66 | # returns the number of graphs in the DataLoader 67 | n_blocks = len(self.dataset) // self.block_size 68 | n_rem = len(self.dataset) % self.block_size 69 | n_batch_per_block = self.__ceil__(self.block_size, self.batch_size) 70 | n_last = self.__ceil__(n_rem, self.batch_size) 71 | return n_batch_per_block * n_blocks + n_last 72 | 73 | def __ceil__(self, i : int, j : int) -> int: 74 | return (i + j - 1) // j 75 | 76 | 77 | class BlockDataset(torch.utils.data.Dataset): 78 | """ 79 | Modified `Dataset` class which returns BLOCKS of data when `__getitem__()` 80 | is called. 81 | """ 82 | def __init__(self, dataset : torch.utils.data.Dataset, batch_size : int=100, 83 | block_size : int=10000) -> None: 84 | 85 | assert block_size >= batch_size, "Block size should be > batch size." 
86 | 87 | self.block_size = block_size # `int` 88 | self.batch_size = batch_size # `int` 89 | self.dataset = dataset # `HDFDataset` 90 | 91 | def __getitem__(self, idx : int) -> torch.Tensor: 92 | # returns a block of data from the dataset 93 | start = idx * self.block_size 94 | end = min((idx + 1) * self.block_size, len(self.dataset)) 95 | return self.dataset[start:end] 96 | 97 | def __len__(self) -> int: 98 | # returns the number of blocks in the dataset 99 | return (len(self.dataset) + self.block_size - 1) // self.block_size 100 | 101 | 102 | class ShuffleBlockWrapper: 103 | """ 104 | Extra class used to wrap a block of data, enabling data to get shuffled 105 | *within* a block. 106 | """ 107 | def __init__(self, data : torch.Tensor) -> None: 108 | self.data = data 109 | 110 | def __getitem__(self, idx : int) -> torch.Tensor: 111 | return [d[idx] for d in self.data] 112 | 113 | def __len__(self) -> int: 114 | return len(self.data[0]) 115 | 116 | 117 | class HDFDataset(torch.utils.data.Dataset): 118 | """ 119 | Reads and collects data from an HDF file with three datasets: "nodes", 120 | "edges", and "APDs". 121 | """ 122 | def __init__(self, path : str) -> None: 123 | 124 | self.path = path 125 | hdf_file = h5py.File(self.path, "r+", swmr=True) 126 | 127 | # load each HDF dataset 128 | self.nodes = hdf_file.get("nodes") 129 | self.edges = hdf_file.get("edges") 130 | self.apds = hdf_file.get("APDs") 131 | 132 | # get the number of elements in the dataset 133 | self.n_subgraphs = self.nodes.shape[0] 134 | 135 | def __getitem__(self, idx : int) -> \ 136 | Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 137 | 138 | # returns specific graph elements 139 | nodes_i = torch.from_numpy(self.nodes[idx]).type(torch.float32) 140 | edges_i = torch.from_numpy(self.edges[idx]).type(torch.float32) 141 | apd_i = torch.from_numpy(self.apds[idx]).type(torch.float32) 142 | 143 | return (nodes_i, edges_i, apd_i) 144 | 145 | def __len__(self) -> int: 146 | # returns the number of graphs in the dataset 147 | return self.n_subgraphs 148 | -------------------------------------------------------------------------------- /graphinvent/DataProcesser.py: -------------------------------------------------------------------------------- 1 | """ 2 | The `DataProcesser` class contains functions for pre-processing training data. 3 | """ 4 | # load general packages and functions 5 | import os 6 | import numpy as np 7 | import rdkit 8 | import h5py 9 | from tqdm import tqdm 10 | 11 | # load GraphINVENT-specific functions 12 | from Analyzer import Analyzer 13 | from parameters.constants import constants 14 | import parameters.load as load 15 | from MolecularGraph import PreprocessingGraph 16 | import util 17 | 18 | 19 | class DataProcesser: 20 | """ 21 | A class for preprocessing molecular sets and writing them to HDF files. 22 | """ 23 | def __init__(self, path : str, is_training_set : bool=False) -> None: 24 | """ 25 | Args: 26 | ---- 27 | path (string) : Full path/filename to SMILES file containing 28 | molecules. 29 | is_training_set (bool) : Indicates if this is the training set, as we 30 | calculate a few additional things for the training 31 | set. 
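        Example (illustrative sketch only; assumes the interpreter is started
        from the repository root and that `parameters.constants` has already
        been configured for this dataset):

        >>> processor = DataProcesser(path="data/pre-training/gdb13_1K-debug/train.smi",
        ...                           is_training_set=True)
        >>> processor.preprocess()  # writes "train.h5" alongside "train.smi"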
32 | """ 33 | # define some variables for later use 34 | self.path = path 35 | self.is_training_set = is_training_set 36 | self.dataset_names = ["nodes", "edges", "APDs"] 37 | self.get_dataset_dims() # creates `self.dims` 38 | 39 | # load the molecules 40 | self.molecule_set = load.molecules(self.path) 41 | 42 | # placeholders 43 | self.molecule_subset = None 44 | self.dataset = None 45 | self.skip_collection = None 46 | self.resume_idx = None 47 | self.ts_properties = None 48 | self.restart_index_file = None 49 | self.hdf_file = None 50 | self.dataset_size = None 51 | 52 | # get total number of molecules, and total number of subgraphs in their 53 | # decoding routes 54 | self.n_molecules = len(self.molecule_set) 55 | self.total_n_subgraphs = self.get_n_subgraphs() 56 | print(f"-- {self.n_molecules} molecules in set.", flush=True) 57 | print(f"-- {self.total_n_subgraphs} total subgraphs in set.", 58 | flush=True) 59 | 60 | def preprocess(self) -> None: 61 | """ 62 | Prepares an HDF file to save three different datasets to it (`nodes`, 63 | `edges`, `APDs`), and slowly fills it in by looping over all the 64 | molecules in the data in groups (or "mini-batches"). 65 | """ 66 | with h5py.File(f"{self.path[:-3]}h5.chunked", "a") as self.hdf_file: 67 | 68 | self.restart_index_file = constants.dataset_dir + "index.restart" 69 | 70 | if constants.restart and os.path.exists(self.restart_index_file): 71 | self.restart_preprocessing_job() 72 | else: 73 | self.start_new_preprocessing_job() 74 | 75 | # keep track of the dataset size (to resize later) 76 | self.dataset_size = 0 77 | 78 | self.ts_properties = None 79 | 80 | # this is where we fill the datasets with actual data by looping 81 | # over subgraphs in blocks of size `constants.batch_size` 82 | for idx in range(0, self.total_n_subgraphs, constants.batch_size): 83 | 84 | if not self.skip_collection: 85 | 86 | self.get_molecule_subset() 87 | 88 | # add `constants.batch_size` subgraphs from 89 | # `self.molecule_subset` to the dataset (and if training 90 | # set, calculate their properties and add these to 91 | # `self.ts_properties`) 92 | self.get_subgraphs(init_idx=idx) 93 | 94 | util.write_last_molecule_idx( 95 | last_molecule_idx=self.resume_idx, 96 | dataset_size=self.dataset_size, 97 | restart_file_path=constants.dataset_dir 98 | ) 99 | 100 | 101 | if self.resume_idx == self.n_molecules: 102 | # all molecules have been processed 103 | 104 | self.resize_datasets() # remove padding from initialization 105 | print("Datasets resized.", flush=True) 106 | 107 | if self.is_training_set and not constants.restart: 108 | 109 | print("Writing training set properties.", flush=True) 110 | util.write_ts_properties( 111 | training_set_properties=self.ts_properties 112 | ) 113 | 114 | break 115 | 116 | print("* Resaving datasets in unchunked format.") 117 | self.resave_datasets_unchunked() 118 | 119 | def restart_preprocessing_job(self) -> None: 120 | """ 121 | Restarts a preprocessing job. Uses an index specified in the dataset 122 | directory to know where to resume preprocessing. 
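        Sketch of the expected behaviour (based on the body below): the index
        is read back as a `(last_molecule_idx, dataset_size)` pair via
        `util.read_last_molecule_idx()`; if the restart file cannot be read,
        both counters fall back to 0 and preprocessing restarts from the first
        molecule.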
123 | """ 124 | try: 125 | self.resume_idx, self.dataset_size = util.read_last_molecule_idx( 126 | restart_file_path=constants.dataset_dir 127 | ) 128 | except: 129 | self.resume_idx, self.dataset_size = 0, 0 130 | self.skip_collection = bool( 131 | self.resume_idx == self.n_molecules and self.is_training_set 132 | ) 133 | 134 | # load dictionary of previously created datasets (`self.dataset`) 135 | self.load_datasets(hdf_file=self.hdf_file) 136 | 137 | def start_new_preprocessing_job(self) -> None: 138 | """ 139 | Starts a fresh preprocessing job. 140 | """ 141 | self.resume_idx = 0 142 | self.skip_collection = False 143 | 144 | # create a dictionary of empty HDF datasets (`self.dataset`) 145 | self.create_datasets(hdf_file=self.hdf_file) 146 | 147 | def resave_datasets_unchunked(self) -> None: 148 | """ 149 | Resaves the HDF datasets in an unchunked format to remove initial 150 | padding. 151 | """ 152 | with h5py.File(f"{self.path[:-3]}h5.chunked", "r", swmr=True) as chunked_file: 153 | keys = list(chunked_file.keys()) 154 | data = [chunked_file.get(key)[:] for key in keys] 155 | data_zipped = tuple(zip(data, keys)) 156 | 157 | with h5py.File(f"{self.path[:-3]}h5", "w") as unchunked_file: 158 | for d, k in tqdm(data_zipped): 159 | unchunked_file.create_dataset( 160 | k, chunks=None, data=d, dtype=np.dtype("int8") 161 | ) 162 | 163 | # remove the restart file and chunked file (don't need them anymore) 164 | os.remove(self.restart_index_file) 165 | os.remove(f"{self.path[:-3]}h5.chunked") 166 | 167 | def get_subgraphs(self, init_idx : int) -> None: 168 | """ 169 | Adds `constants.batch_size` subgraphs from `self.molecule_subset` to the 170 | HDF dataset (and if currently processing the training set, also 171 | calculates the full graphs' properties and adds these to 172 | `self.ts_properties`). 173 | 174 | Args: 175 | ---- 176 | init_idx (int) : As analysis is done in blocks/slices, `init_idx` is 177 | the start index for the next block/slice to be taken 178 | from `self.molecule_subset`. 
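        Toy illustration of the APD-merging rule used below (the array length
        is hypothetical, not a real APD dimension): when a newly decoded
        subgraph matches one already collected, its APD is summed into the
        existing entry rather than appended as a new one.

        >>> import numpy as np
        >>> existing_apd = np.array([0., 1., 0.])
        >>> existing_apd += np.array([1., 0., 0.])  # duplicate subgraph encountered
        >>> existing_apd
        array([1., 1., 0.])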
179 | """ 180 | data_subgraphs, data_apds, molecular_graph_list = [], [], [] # initialize 181 | 182 | # convert all molecules in `self.molecules_subset` to `PreprocessingGraphs` 183 | molecular_graph_generator = map(self.get_graph, self.molecule_subset) 184 | 185 | molecules_processed = 0 # keep track of the number of molecules processed 186 | 187 | # loop over all the `PreprocessingGraph`s 188 | for graph in molecular_graph_generator: 189 | molecules_processed += 1 190 | 191 | # store `PreprocessingGraph` object 192 | molecular_graph_list.append(graph) 193 | 194 | # get the number of decoding graphs 195 | n_subgraphs = graph.get_decoding_route_length() 196 | 197 | for new_subgraph_idx in range(n_subgraphs): 198 | 199 | # `get_decoding_route_state() returns a list of [`subgraph`, `apd`], 200 | subgraph, apd = graph.get_decoding_route_state( 201 | subgraph_idx=new_subgraph_idx 202 | ) 203 | 204 | # "collect" all APDs corresponding to pre-existing subgraphs, 205 | # otherwise append both new subgraph and new APD 206 | count = 0 207 | for idx, existing_subgraph in enumerate(data_subgraphs): 208 | 209 | count += 1 210 | # check if subgraph `subgraph` is "already" in 211 | # `data_subgraphs` as `existing_subgraph`, and if so, add 212 | # the "new" APD to the "old" 213 | try: # first compare the node feature matrices 214 | nodes_equal = (subgraph[0] == existing_subgraph[0]).all() 215 | except AttributeError: 216 | nodes_equal = False 217 | try: # then compare the edge feature tensors 218 | edges_equal = (subgraph[1] == existing_subgraph[1]).all() 219 | except AttributeError: 220 | edges_equal = False 221 | 222 | # if both matrices have a match, then subgraphs are the same 223 | if nodes_equal and edges_equal: 224 | existing_apd = data_apds[idx] 225 | existing_apd += apd 226 | break 227 | 228 | # if subgraph is not already in `data_subgraphs`, append it 229 | if count == len(data_subgraphs) or count == 0: 230 | data_subgraphs.append(subgraph) 231 | data_apds.append(apd) 232 | 233 | # if `constants.batch_size` unique subgraphs have been 234 | # processed, save group to the HDF dataset 235 | len_data_subgraphs = len(data_subgraphs) 236 | if len_data_subgraphs == constants.batch_size: 237 | self.save_group(data_subgraphs=data_subgraphs, 238 | data_apds=data_apds, 239 | group_size=len_data_subgraphs, 240 | init_idx=init_idx) 241 | 242 | # get molecular properties for group iff it's the training set 243 | self.get_ts_properties(molecular_graphs=molecular_graph_list, 244 | group_size=constants.batch_size) 245 | 246 | # keep track of the last molecule to be processed in 247 | # `self.resume_idx` 248 | # number of molecules processed: 249 | self.resume_idx += molecules_processed 250 | # subgraphs processed: 251 | self.dataset_size += constants.batch_size 252 | 253 | return None 254 | 255 | n_processed_subgraphs = len(data_subgraphs) 256 | 257 | # save group with < `constants.batch_size` subgraphs (e.g. 
last block) 258 | self.save_group(data_subgraphs=data_subgraphs, 259 | data_apds=data_apds, 260 | group_size=n_processed_subgraphs, 261 | init_idx=init_idx) 262 | 263 | # get molecular properties for this group iff it's the training set 264 | self.get_ts_properties(molecular_graphs=molecular_graph_list, 265 | group_size=constants.batch_size) 266 | 267 | # keep track of the last molecule to be processed in `self.resume_idx` 268 | self.resume_idx += molecules_processed # number of molecules processed 269 | self.dataset_size += molecules_processed # subgraphs processed 270 | 271 | return None 272 | 273 | def create_datasets(self, hdf_file : h5py._hl.files.File) -> None: 274 | """ 275 | Creates a dictionary of HDF5 datasets (`self.dataset`). 276 | 277 | Args: 278 | ---- 279 | hdf_file (h5py._hl.files.File) : HDF5 file which will contain the datasets. 280 | """ 281 | self.dataset = {} # initialize 282 | 283 | for ds_name in self.dataset_names: 284 | self.dataset[ds_name] = hdf_file.create_dataset( 285 | ds_name, 286 | (self.total_n_subgraphs, *self.dims[ds_name]), 287 | chunks=True, # must be True for resizing later 288 | dtype=np.dtype("int8") 289 | ) 290 | 291 | def resize_datasets(self) -> None: 292 | """ 293 | Resizes the HDF datasets, since much longer datasets are initialized 294 | when first creating the HDF datasets (it it is impossible to predict 295 | how many graphs will be equivalent beforehand). 296 | """ 297 | for dataset_name in self.dataset_names: 298 | try: 299 | self.dataset[dataset_name].resize( 300 | (self.dataset_size, *self.dims[dataset_name])) 301 | except KeyError: # `f_term` has no extra dims 302 | self.dataset[dataset_name].resize((self.dataset_size,)) 303 | 304 | def get_dataset_dims(self) -> None: 305 | """ 306 | Calculates the dimensions of the node features, edge features, and APDs, 307 | and stores them as lists in a dict (`self.dims`), where keys are the 308 | dataset name. 309 | 310 | Shapes: 311 | ------ 312 | dims["nodes"] : [max N nodes, N atom types + N formal charges] 313 | dims["edges"] : [max N nodes, max N nodes, N bond types] 314 | dims["APDs"] : [APD length = f_add length + f_conn length + f_term length] 315 | """ 316 | self.dims = {} 317 | self.dims["nodes"] = constants.dim_nodes 318 | self.dims["edges"] = constants.dim_edges 319 | self.dims["APDs"] = constants.dim_apd 320 | 321 | def get_graph(self, mol : rdkit.Chem.Mol) -> PreprocessingGraph: 322 | """ 323 | Converts an `rdkit.Chem.Mol` object to `PreprocessingGraph`. 324 | 325 | Args: 326 | ---- 327 | mol (rdkit.Chem.Mol) : Molecule to convert. 328 | 329 | Returns: 330 | ------- 331 | molecular_graph (PreprocessingGraph) : Molecule, now as a graph. 332 | """ 333 | if mol is not None: 334 | if not constants.use_aromatic_bonds: 335 | rdkit.Chem.Kekulize(mol, clearAromaticFlags=True) 336 | molecular_graph = PreprocessingGraph(molecule=mol, 337 | constants=constants) 338 | return molecular_graph 339 | 340 | def get_molecule_subset(self) -> None: 341 | """ 342 | Slices `self.molecule_set` into a subset of molecules of size 343 | `constants.batch_size`, starting from `self.resume_idx`. 344 | `self.n_molecules` is the number of molecules in the full 345 | `self.molecule_set`. 
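        Worked example (hypothetical values): with `self.resume_idx == 100` and
        `constants.batch_size == 50`, this collects the non-`None` molecules
        with running index 100 through 149 into `self.molecule_subset`.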
346 | """ 347 | init_idx = self.resume_idx 348 | subset_size = constants.batch_size 349 | self.molecule_subset = [] 350 | max_idx = min(init_idx + subset_size, self.n_molecules) 351 | 352 | count = -1 353 | for mol in self.molecule_set: 354 | if mol is not None: 355 | count += 1 356 | if count < init_idx: 357 | continue 358 | elif count >= max_idx: 359 | return self.molecule_subset 360 | else: 361 | self.molecule_subset.append(mol) 362 | 363 | def get_n_subgraphs(self) -> int: 364 | """ 365 | Calculates the total number of subgraphs in the decoding route of all 366 | molecules in `self.molecule_set`. Loads training, testing, or validation 367 | set. First, the `PreprocessingGraph` for each molecule is obtained, and 368 | then the length of the decoding route is trivially calculated for each. 369 | 370 | Returns: 371 | ------- 372 | n_subgraphs (int) : Sum of number of subgraphs in decoding routes for 373 | all molecules in `self.molecule_set`. 374 | """ 375 | n_subgraphs = 0 # start the count 376 | 377 | # convert molecules in `self.molecule_set` to `PreprocessingGraph`s 378 | molecular_graph_generator = map(self.get_graph, self.molecule_set) 379 | 380 | # loop over all the `PreprocessingGraph`s 381 | for molecular_graph in molecular_graph_generator: 382 | 383 | # get the number of decoding graphs (i.e. the decoding route length) 384 | # and add them to the running count 385 | n_subgraphs += molecular_graph.get_decoding_route_length() 386 | 387 | return int(n_subgraphs) 388 | 389 | def get_ts_properties(self, molecular_graphs : list, group_size : int) -> \ 390 | None: 391 | """ 392 | Gets molecular properties for group of molecular graphs, only for the 393 | training set. 394 | 395 | Args: 396 | ---- 397 | molecular_graphs (list) : Contains `PreprocessingGraph`s. 398 | group_size (int) : Size of "group" (i.e. slice of graphs). 399 | """ 400 | if self.is_training_set: 401 | 402 | analyzer = Analyzer() 403 | ts_properties = analyzer.evaluate_training_set( 404 | preprocessing_graphs=molecular_graphs 405 | ) 406 | 407 | # merge properties of current group with the previous group analyzed 408 | if self.ts_properties: # `self.ts_properties` is a dictionary 409 | self.ts_properties = analyzer.combine_ts_properties( 410 | prev_properties=self.ts_properties, 411 | next_properties=ts_properties, 412 | weight_next=group_size 413 | ) 414 | else: # `self.ts_properties` is None (has not been calculated yet) 415 | self.ts_properties = ts_properties 416 | else: 417 | self.ts_properties = None 418 | 419 | def load_datasets(self, hdf_file : h5py._hl.files.File) -> None: 420 | """ 421 | Creates a dictionary of HDF datasets (`self.dataset`) which have been 422 | previously created (for restart jobs only). 423 | 424 | Args: 425 | ---- 426 | hdf_file (h5py._hl.files.File) : HDF file containing all the datasets. 427 | """ 428 | self.dataset = {} # initialize dictionary of datasets 429 | 430 | # use the names of the datasets as the keys in `self.dataset` 431 | for ds_name in self.dataset_names: 432 | self.dataset[ds_name] = hdf_file.get(ds_name) 433 | 434 | def save_group(self, data_subgraphs : list, data_apds : list, 435 | group_size : int, init_idx : int) -> None: 436 | """ 437 | Saves a group of padded subgraphs and their corresponding APDs to the HDF 438 | datasets as `numpy.ndarray`s. 439 | 440 | Args: 441 | ---- 442 | data_subgraphs (list) : Contains molecular subgraphs. 443 | data_apds (list) : Contains APDs. 444 | group_size (int) : Size of HDF "slice". 445 | init_idx (int) : Index to begin slicing. 
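        Worked example (hypothetical values): with `init_idx == 2000` and
        `group_size == 1000`, the arrays built below are written into rows
        2000:3000 of the "nodes", "edges", and "APDs" HDF datasets.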
446 | """ 447 | # convert to `np.ndarray`s 448 | nodes = np.array([graph_tuple[0] for graph_tuple in data_subgraphs]) 449 | edges = np.array([graph_tuple[1] for graph_tuple in data_subgraphs]) 450 | apds = np.array(data_apds) 451 | 452 | end_idx = init_idx + group_size # idx to end slicing 453 | 454 | # once data is padded, save it to dataset slice 455 | self.dataset["nodes"][init_idx:end_idx] = nodes 456 | self.dataset["edges"][init_idx:end_idx] = edges 457 | self.dataset["APDs"][init_idx:end_idx] = apds 458 | -------------------------------------------------------------------------------- /graphinvent/ScoringFunction.py: -------------------------------------------------------------------------------- 1 | """ 2 | This class is used for defining the scoring function(s) which can be used during 3 | fine-tuning. 4 | """ 5 | # load general packages and functions 6 | from collections import namedtuple 7 | import torch 8 | from rdkit import DataStructs 9 | from rdkit.Chem import QED, AllChem 10 | import numpy as np 11 | import sklearn 12 | from sklearn import svm 13 | 14 | class ScoringFunction: 15 | """ 16 | A class for defining the scoring function components. 17 | """ 18 | def __init__(self, constants : namedtuple) -> None: 19 | """ 20 | Args: 21 | ---- 22 | constants (namedtuple) : Contains job parameters as well as global 23 | constants. 24 | """ 25 | self.score_components = constants.score_components # list 26 | self.score_type = constants.score_type # list 27 | self.qsar_models = constants.qsar_models # dict 28 | self.device = constants.device 29 | self.max_n_nodes = constants.max_n_nodes 30 | self.score_thresholds = constants.score_thresholds 31 | 32 | self.n_graphs = None # placeholder 33 | 34 | assert len(self.score_components) == len(self.score_thresholds), \ 35 | "`score_components` and `score_thresholds` do not match." 36 | 37 | def compute_score(self, graphs : list, termination : torch.Tensor, 38 | validity : torch.Tensor, uniqueness : torch.Tensor) -> \ 39 | torch.Tensor: 40 | """ 41 | Computes the overall score for the input molecular graphs. 42 | 43 | Args: 44 | ---- 45 | graphs (list) : Contains molecular graphs to evaluate. 46 | termination (torch.Tensor) : Termination status of input molecular 47 | graphs. 48 | validity (torch.Tensor) : Validity of input molecular graphs. 49 | uniqueness (torch.Tensor) : Uniqueness of input molecular graphs. 50 | 51 | Returns: 52 | ------- 53 | final_score (torch.Tensor) : The final scores for each input graph. 
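        Worked example (hypothetical numbers): with `score_type == "binary"`,
        two score components, and `score_thresholds == [0.5, 0.5]`, a graph
        whose raw component scores are (0.8, 0.3) receives masks (1, 0), so its
        final score is 1 * 0 = 0 even before the uniqueness, validity, and
        termination factors are applied.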
54 | """ 55 | self.n_graphs = len(graphs) 56 | contributions_to_score = self.get_contributions_to_score(graphs=graphs) 57 | 58 | if len(self.score_components) == 1: 59 | final_score = contributions_to_score[0] 60 | 61 | elif self.score_type == "continuous": 62 | final_score = contributions_to_score[0] 63 | for component in contributions_to_score[1:]: 64 | final_score *= component 65 | 66 | elif self.score_type == "binary": 67 | component_masks = [] 68 | for idx, score_component in enumerate(contributions_to_score): 69 | component_mask = torch.where( 70 | score_component > self.score_thresholds[idx], 71 | torch.ones(self.n_graphs, device=self.device, dtype=torch.uint8), 72 | torch.zeros(self.n_graphs, device=self.device, dtype=torch.uint8) 73 | ) 74 | component_masks.append(component_mask) 75 | 76 | final_score = component_masks[0] 77 | for mask in component_masks[1:]: 78 | final_score *= mask 79 | final_score = final_score.float() 80 | 81 | else: 82 | raise NotImplementedError 83 | 84 | # remove contribution of duplicate molecules to the score 85 | final_score *= uniqueness 86 | 87 | # remove contribution of invalid molecules to the score 88 | final_score *= validity 89 | 90 | # remove contribution of improperly-terminated molecules to the score 91 | final_score *= termination 92 | 93 | return final_score 94 | 95 | def get_contributions_to_score(self, graphs : list) -> list: 96 | """ 97 | Returns the different elements of the score. 98 | 99 | Args: 100 | ---- 101 | graphs (list) : Contains molecular graphs to evaluate. 102 | 103 | Returns: 104 | ------- 105 | contributions_to_score (list) : Contains elements of the score due to 106 | each scoring function component. 107 | """ 108 | contributions_to_score = [] 109 | 110 | for score_component in self.score_components: 111 | if "target_size" in score_component: 112 | 113 | target_size = int(score_component[12:]) 114 | 115 | assert target_size <= self.max_n_nodes, \ 116 | "Target size > largest possible size (`max_n_nodes`)." 117 | assert 0 < target_size, "Target size must be greater than 0." 118 | 119 | target_size *= torch.ones(self.n_graphs, device=self.device) 120 | n_nodes = torch.tensor([graph.n_nodes for graph in graphs], 121 | device=self.device) 122 | max_nodes = self.max_n_nodes 123 | score = ( 124 | torch.ones(self.n_graphs, device=self.device) 125 | - torch.abs(n_nodes - target_size) 126 | / (max_nodes - target_size) 127 | ) 128 | 129 | contributions_to_score.append(score) 130 | 131 | elif score_component == "QED": 132 | mols = [graph.molecule for graph in graphs] 133 | 134 | # compute the QED score for each molecule (if possible) 135 | qed = [] 136 | for mol in mols: 137 | try: 138 | qed.append(QED.qed(mol)) 139 | except: 140 | qed.append(0.0) 141 | score = torch.tensor(qed, device=self.device) 142 | 143 | contributions_to_score.append(score) 144 | 145 | elif "activity" in score_component: 146 | mols = [graph.molecule for graph in graphs] 147 | 148 | # `score_component` has to be the key to the QSAR model in the 149 | # `self.qsar_models` dict 150 | qsar_model = self.qsar_models[score_component] 151 | score = self.compute_activity(mols, qsar_model) 152 | 153 | contributions_to_score.append(score) 154 | 155 | else: 156 | raise NotImplementedError("The score component is not defined. 
" 157 | "You can define it in " 158 | "`ScoringFunction.py`.") 159 | 160 | return contributions_to_score 161 | 162 | def compute_activity(self, mols : list, 163 | activity_model : sklearn.svm.classes.SVC) -> list: 164 | """ 165 | Note: this function may have to be tuned/replicated depending on how 166 | the activity model is saved. 167 | 168 | Args: 169 | ---- 170 | mols (list) : Contains `rdkit.Mol` objects corresponding to molecular 171 | graphs sampled. 172 | activity_model (sklearn.svm.classes.SVC) : Pre-trained QSAR model. 173 | 174 | Returns: 175 | ------- 176 | activity (list) : Contains predicted activities for input molecules. 177 | """ 178 | n_mols = len(mols) 179 | activity = torch.zeros(n_mols, device=self.device) 180 | 181 | for idx, mol in enumerate(mols): 182 | try: 183 | fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 184 | 2, 185 | nBits=2048) 186 | ecfp4 = np.zeros((2048,)) 187 | DataStructs.ConvertToNumpyArray(fingerprint, ecfp4) 188 | activity[idx] = activity_model.predict_proba([ecfp4])[0][1] 189 | except: 190 | pass # activity[idx] will remain 0.0 191 | 192 | return activity 193 | -------------------------------------------------------------------------------- /graphinvent/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/graphinvent/__init__.py -------------------------------------------------------------------------------- /graphinvent/gnn/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/graphinvent/gnn/__init__.py -------------------------------------------------------------------------------- /graphinvent/gnn/aggregation_mpnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines the `AggregationMPNN` class. 3 | """ 4 | # load general packages and functions 5 | from collections import namedtuple 6 | import torch 7 | 8 | 9 | class AggregationMPNN(torch.nn.Module): 10 | """ 11 | Abstract `AggregationMPNN` class. Specific models using this class are 12 | defined in `mpnn.py`; these are the attention networks AttS2V and AttGGNN. 13 | """ 14 | def __init__(self, constants : namedtuple) -> None: 15 | super().__init__() 16 | 17 | self.hidden_node_features = constants.hidden_node_features 18 | self.edge_features = constants.n_edge_features 19 | self.message_size = constants.message_size 20 | self.message_passes = constants.message_passes 21 | self.constants = constants 22 | 23 | def aggregate_message(self, nodes : torch.Tensor, node_neighbours : torch.Tensor, 24 | edges : torch.Tensor, mask : torch.Tensor) -> None: 25 | """ 26 | Message aggregation function, to be implemented in all `AggregationMPNN` subclasses. 27 | 28 | Args: 29 | ---- 30 | nodes (torch.Tensor) : Batch of node feature vectors. 31 | node_neighbours (torch.Tensor) : Batch of node feature vectors for neighbors. 32 | edges (torch.Tensor) : Batch of edge feature vectors. 33 | mask (torch.Tensor) : Mask for non-existing neighbors, where 34 | elements are 1 if corresponding element 35 | exists and 0 otherwise. 
36 | 37 | Shapes: 38 | ------ 39 | nodes : (total N nodes in batch, N node features) 40 | node_neighbours : (total N nodes in batch, max node degree, N node features) 41 | edges : (total N nodes in batch, max node degree, N edge features) 42 | mask : (total N nodes in batch, max node degree) 43 | """ 44 | raise NotImplementedError 45 | 46 | def update(self, nodes : torch.Tensor, messages : torch.Tensor) -> None: 47 | """ 48 | Message update function, to be implemented in all `AggregationMPNN` subclasses. 49 | 50 | Args: 51 | ---- 52 | nodes (torch.Tensor) : Batch of node feature vectors. 53 | messages (torch.Tensor) : Batch of incoming messages. 54 | 55 | Shapes: 56 | ------ 57 | nodes : (total N nodes in batch, N node features) 58 | messages : (total N nodes in batch, N node features) 59 | """ 60 | raise NotImplementedError 61 | 62 | def readout(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 63 | node_mask : torch.Tensor) -> None: 64 | """ 65 | Local readout function, to be implemented in all `AggregationMPNN` subclasses. 66 | 67 | Args: 68 | ---- 69 | hidden_nodes (torch.Tensor) : Batch of node feature vectors. 70 | input_nodes (torch.Tensor) : Batch of node feature vectors. 71 | node_mask (torch.Tensor) : Mask for non-existing neighbors, where 72 | elements are 1 if corresponding element 73 | exists and 0 otherwise. 74 | 75 | Shapes: 76 | ------ 77 | hidden_nodes : (total N nodes in batch, N node features) 78 | input_nodes : (total N nodes in batch, N node features) 79 | node_mask : (total N nodes in batch, N features) 80 | """ 81 | raise NotImplementedError 82 | 83 | def forward(self, nodes : torch.Tensor, edges : torch.Tensor) -> torch.Tensor: 84 | """ 85 | Defines forward pass. 86 | 87 | Args: 88 | ---- 89 | nodes (torch.Tensor) : Batch of node feature matrices. 90 | edges (torch.Tensor) : Batch of edge feature tensors. 91 | 92 | Shapes: 93 | ------ 94 | nodes : (batch size, N nodes, N node features) 95 | edges : (batch size, N nodes, N nodes, N edge features) 96 | 97 | Returns: 98 | ------- 99 | output (torch.Tensor) : This would normally be the learned graph 100 | representation, but in all MPNN readout functions 101 | in this work, the last layer is used to predict 102 | the action probability distribution for a batch 103 | of graphs from the learned graph representation. 
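        Illustrative output shape (assuming the APD layout described in
        `DataProcesser.get_dataset_dims`): (batch size, APD length), where the
        APD length is the summed length of the flattened f_add, f_conn, and
        f_term components.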
104 | """ 105 | adjacency = torch.sum(edges, dim=3) 106 | 107 | # **note: "idc" == "indices", "nghb{s}" == "neighbour(s)" 108 | edge_batch_batch_idc, edge_batch_node_idc, edge_batch_nghb_idc = \ 109 | adjacency.nonzero(as_tuple=True) 110 | 111 | node_batch_batch_idc, node_batch_node_idc = adjacency.sum(-1).nonzero(as_tuple=True) 112 | node_batch_adj = adjacency[node_batch_batch_idc, node_batch_node_idc, :] 113 | node_batch_size = node_batch_batch_idc.shape[0] 114 | node_degrees = node_batch_adj.sum(-1).long() 115 | max_node_degree = node_degrees.max() 116 | 117 | node_batch_node_nghbs = torch.zeros(node_batch_size, 118 | max_node_degree, 119 | self.hidden_node_features, 120 | device=self.constants.device) 121 | node_batch_edges = torch.zeros(node_batch_size, 122 | max_node_degree, 123 | self.edge_features, 124 | device=self.constants.device) 125 | 126 | node_batch_nghb_nghb_idc = torch.cat( 127 | [torch.arange(i) for i in node_degrees] 128 | ).long() 129 | 130 | edge_batch_node_batch_idc = torch.cat( 131 | [i * torch.ones(degree) for i, degree in enumerate(node_degrees)] 132 | ).long() 133 | 134 | node_batch_node_nghb_mask = torch.zeros(node_batch_size, 135 | max_node_degree, 136 | device=self.constants.device) 137 | 138 | node_batch_node_nghb_mask[edge_batch_node_batch_idc, node_batch_nghb_nghb_idc] = 1 139 | 140 | node_batch_edges[edge_batch_node_batch_idc, node_batch_nghb_nghb_idc, :] = \ 141 | edges[edge_batch_batch_idc, edge_batch_node_idc, edge_batch_nghb_idc, :] 142 | 143 | # pad up the hidden nodes 144 | hidden_nodes = torch.zeros(nodes.shape[0], 145 | nodes.shape[1], 146 | self.hidden_node_features, 147 | device=self.constants.device) 148 | hidden_nodes[:nodes.shape[0], :nodes.shape[1], :nodes.shape[2]] = nodes.clone() 149 | 150 | for _ in range(self.message_passes): 151 | 152 | node_batch_nodes = hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] 153 | node_batch_node_nghbs[edge_batch_node_batch_idc, node_batch_nghb_nghb_idc, :] = \ 154 | hidden_nodes[edge_batch_batch_idc, edge_batch_nghb_idc, :] 155 | 156 | messages = self.aggregate_message(nodes=node_batch_nodes, 157 | node_neighbours=node_batch_node_nghbs.clone(), 158 | edges=node_batch_edges, 159 | mask=node_batch_node_nghb_mask) 160 | 161 | hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] = \ 162 | self.update(node_batch_nodes.clone(), messages) 163 | 164 | node_mask = (adjacency.sum(-1) != 0) 165 | 166 | output = self.readout(hidden_nodes, nodes, node_mask) 167 | 168 | return output 169 | -------------------------------------------------------------------------------- /graphinvent/gnn/edge_mpnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines the `EdgeMPNN` class. 3 | """# load general packages and functions 4 | from collections import namedtuple 5 | import torch 6 | 7 | 8 | class EdgeMPNN(torch.nn.Module): 9 | """ 10 | Abstract `EdgeMPNN` class. A specific model using this class is defined 11 | in `mpnn.py`; this is the EMN. 
12 | """ 13 | def __init__(self, constants : namedtuple) -> None: 14 | super().__init__() 15 | 16 | self.edge_features = constants.edge_features 17 | self.edge_embedding_size = constants.edge_embedding_size 18 | self.message_passes = constants.message_passes 19 | self.n_nodes_largest_graph = constants.max_n_nodes 20 | self.constants = constants 21 | 22 | def preprocess_edges(self, nodes : torch.Tensor, node_neighbours : torch.Tensor, 23 | edges : torch.Tensor) -> None: 24 | """ 25 | Edge preprocessing step, to be implemented in all `EdgeMPNN` subclasses. 26 | 27 | Args: 28 | ---- 29 | nodes (torch.Tensor) : Batch of node feature vectors. 30 | node_neighbours (torch.Tensor) : Batch of node feature vectors for neighbors. 31 | edges (torch.Tensor) : Batch of edge feature vectors. 32 | 33 | Shapes: 34 | ------ 35 | nodes : (total N nodes in batch, N node features) 36 | node_neighbours : (total N nodes in batch, max node degree, N node features) 37 | edges : (total N nodes in batch, max node degree, N edge features) 38 | """ 39 | raise NotImplementedError 40 | 41 | def propagate_edges(self, edges : torch.Tensor, ingoing_edge_memories : torch.Tensor, 42 | ingoing_edges_mask : torch.Tensor) -> None: 43 | """ 44 | Edge propagation rule, to be implemented in all `EdgeMPNN` subclasses. 45 | 46 | Args: 47 | ---- 48 | edges (torch.Tensor) : Batch of edge feature tensors. 49 | ingoing_edge_memories (torch.Tensor) : Batch of memories for all 50 | ingoing edges. 51 | ingoing_edges_mask (torch.Tensor) : Mask for ingoing edges. 52 | 53 | Shapes: 54 | ------ 55 | edges : (batch size, N nodes, N nodes, total N edge features) 56 | ingoing_edge_memories : (total N edges in batch, total N edge features) 57 | ingoing_edges_mask : (total N edges in batch, max node degree, total N edge features) 58 | """ 59 | raise NotImplementedError 60 | 61 | def readout(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 62 | node_mask : torch.Tensor) -> None: 63 | """ 64 | Local readout function, to be implemented in all `EdgeMPNN` subclasses. 65 | 66 | Args: 67 | ---- 68 | hidden_nodes (torch.Tensor) : Batch of node feature vectors. 69 | input_nodes (torch.Tensor) : Batch of node feature vectors. 70 | node_mask (torch.Tensor) : Mask for non-existing neighbors, where 71 | elements are 1 if corresponding element 72 | exists and 0 otherwise. 73 | 74 | Shapes: 75 | ------ 76 | hidden_nodes : (total N nodes in batch, N node features) 77 | input_nodes : (total N nodes in batch, N node features) 78 | node_mask : (total N nodes in batch, N features) 79 | """ 80 | raise NotImplementedError 81 | 82 | def forward(self, nodes : torch.Tensor, edges : torch.Tensor) -> torch.Tensor: 83 | """ 84 | Defines forward pass. 85 | 86 | Args: 87 | ---- 88 | nodes (torch.Tensor) : Batch of node feature matrices. 89 | edges (torch.Tensor) : Batch of edge feature tensors. 90 | 91 | Shapes: 92 | ------ 93 | nodes : (batch size, N nodes, N node features) 94 | edges : (batch size, N nodes, N nodes, N edge features) 95 | 96 | Returns: 97 | ------- 98 | output (torch.Tensor) : This would normally be the learned graph representation, 99 | but in all MPNN readout functions in this work, 100 | the last layer is used to predict the action 101 | probability distribution for a batch of graphs from 102 | the learned graph representation. 
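The `forward()` implementation below starts by collapsing the one-hot edge-feature tensor into a plain adjacency matrix (`adjacency = torch.sum(edges, dim=3)`) and then recovers edge indices with `nonzero()`. On a toy single-graph, 3-node chain (bond-type channels following the `bondtype_to_int` convention in `parameters/constants.py`) that looks like:

```
import torch

# one graph, 3 nodes, 4 one-hot bond-type channels (single/double/triple/aromatic)
edges = torch.zeros(1, 3, 3, 4)
edges[0, 0, 1, 0] = 1.0   # bond 0-1 is a single bond (channel 0) ...
edges[0, 1, 0, 0] = 1.0   # ... stored symmetrically
edges[0, 1, 2, 1] = 1.0   # bond 1-2 is a double bond (channel 1)
edges[0, 2, 1, 1] = 1.0

adjacency = torch.sum(edges, dim=3)   # (1, 3, 3) connectivity matrix
print(adjacency[0])
# tensor([[0., 1., 0.],
#         [1., 0., 1.],
#         [0., 1., 0.]])
print(adjacency.nonzero(as_tuple=True))
# (batch indices, node indices, neighbour indices) for the four directed edges
```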
103 | """ 104 | adjacency = torch.sum(edges, dim=3) 105 | 106 | # indices for finding edges in batch; `edges_b_idx` is batch index, 107 | # `edges_n_idx` is the node index, and `edges_nghb_idx` is the index 108 | # that each node in `edges_n_idx` is bound to 109 | edges_b_idx, edges_n_idx, edges_nghb_idx = adjacency.nonzero(as_tuple=True) 110 | 111 | n_edges = edges_n_idx.shape[0] 112 | adj_of_edge_batch_idc = adjacency.clone().long() 113 | 114 | # +1 to distinguish idx 0 from empty elements, subtracted few lines down 115 | r = torch.arange(1, n_edges + 1, device=self.constants.device) 116 | 117 | adj_of_edge_batch_idc[edges_b_idx, edges_n_idx, edges_nghb_idx] = r 118 | 119 | ingoing_edges_eb_idx = ( 120 | torch.cat([row[row.nonzero()] for row in 121 | adj_of_edge_batch_idc[edges_b_idx, edges_nghb_idx, :]]) - 1 122 | ).squeeze() 123 | 124 | edge_degrees = adjacency[edges_b_idx, edges_nghb_idx, :].sum(-1).long() 125 | ingoing_edges_igeb_idx = torch.cat( 126 | [i * torch.ones(d) for i, d in enumerate(edge_degrees)] 127 | ).long() 128 | ingoing_edges_ige_idx = torch.cat([torch.arange(i) for i in edge_degrees]).long() 129 | 130 | 131 | batch_size = adjacency.shape[0] 132 | n_nodes = adjacency.shape[1] 133 | max_node_degree = adjacency.sum(-1).max().int() 134 | edge_memories = torch.zeros(n_edges, 135 | self.edge_embedding_size, 136 | device=self.constants.device) 137 | 138 | ingoing_edge_memories = torch.zeros(n_edges, max_node_degree, 139 | self.edge_embedding_size, 140 | device=self.constants.device) 141 | ingoing_edges_mask = torch.zeros(n_edges, 142 | max_node_degree, 143 | device=self.constants.device) 144 | 145 | edge_batch_nodes = nodes[edges_b_idx, edges_n_idx, :] 146 | # **note: "nghb{s}" == "neighbour(s)" 147 | edge_batch_nghbs = nodes[edges_b_idx, edges_nghb_idx, :] 148 | edge_batch_edges = edges[edges_b_idx, edges_n_idx, edges_nghb_idx, :] 149 | edge_batch_edges = self.preprocess_edges(nodes=edge_batch_nodes, 150 | node_neighbours=edge_batch_nghbs, 151 | edges=edge_batch_edges) 152 | 153 | # remove h_ji:s influence on h_ij 154 | ingoing_edges_nghb_idx = edges_nghb_idx[ingoing_edges_eb_idx] 155 | ingoing_edges_receiving_edge_n_idx = edges_n_idx[ingoing_edges_igeb_idx] 156 | diff_idx = (ingoing_edges_receiving_edge_n_idx != ingoing_edges_nghb_idx).nonzero() 157 | 158 | try: 159 | ingoing_edges_eb_idx = ingoing_edges_eb_idx[diff_idx].squeeze() 160 | ingoing_edges_ige_idx = ingoing_edges_ige_idx[diff_idx].squeeze() 161 | ingoing_edges_igeb_idx = ingoing_edges_igeb_idx[diff_idx].squeeze() 162 | except: 163 | pass 164 | 165 | ingoing_edges_mask[ingoing_edges_igeb_idx, ingoing_edges_ige_idx] = 1 166 | 167 | for _ in range(self.message_passes): 168 | ingoing_edge_memories[ingoing_edges_igeb_idx, ingoing_edges_ige_idx, :] = \ 169 | edge_memories[ingoing_edges_eb_idx, :] 170 | edge_memories = self.propagate_edges( 171 | edges=edge_batch_edges, 172 | ingoing_edge_memories=ingoing_edge_memories.clone(), 173 | ingoing_edges_mask=ingoing_edges_mask 174 | ) 175 | 176 | node_mask = (adjacency.sum(-1) != 0) 177 | 178 | node_sets = torch.zeros(batch_size, 179 | n_nodes, 180 | max_node_degree, 181 | self.edge_embedding_size, 182 | device=self.constants.device) 183 | 184 | edge_batch_edge_memory_idc = torch.cat( 185 | [torch.arange(row.sum()) for row in adjacency.view(-1, n_nodes)] 186 | ).long() 187 | 188 | node_sets[edges_b_idx, edges_n_idx, edge_batch_edge_memory_idc, :] = edge_memories 189 | graph_sets = node_sets.sum(2) 190 | 191 | output = self.readout(graph_sets, graph_sets, node_mask) 192 | 
return output 193 | -------------------------------------------------------------------------------- /graphinvent/gnn/modules.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines MPNN modules and readout functions, and APD readout functions. 3 | """ 4 | # load general packages and functions 5 | from collections import namedtuple 6 | import torch 7 | 8 | # load GraphINVENT-specific functions 9 | # (none) 10 | 11 | 12 | class GraphGather(torch.nn.Module): 13 | """ 14 | GGNN readout function. 15 | """ 16 | def __init__(self, node_features : int, hidden_node_features : int, 17 | out_features : int, att_depth : int, att_hidden_dim : int, 18 | att_dropout_p : float, emb_depth : int, emb_hidden_dim : int, 19 | emb_dropout_p : float, big_positive : float) -> None: 20 | 21 | super().__init__() 22 | 23 | self.big_positive = big_positive 24 | 25 | self.att_nn = MLP( 26 | in_features=node_features + hidden_node_features, 27 | hidden_layer_sizes=[att_hidden_dim] * att_depth, 28 | out_features=out_features, 29 | dropout_p=att_dropout_p 30 | ) 31 | 32 | self.emb_nn = MLP( 33 | in_features=hidden_node_features, 34 | hidden_layer_sizes=[emb_hidden_dim] * emb_depth, 35 | out_features=out_features, 36 | dropout_p=emb_dropout_p 37 | ) 38 | 39 | def forward(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 40 | node_mask : torch.Tensor) -> torch.Tensor: 41 | """ 42 | Defines forward pass. 43 | """ 44 | Softmax = torch.nn.Softmax(dim=1) 45 | 46 | cat = torch.cat((hidden_nodes, input_nodes), dim=2) 47 | energy_mask = (node_mask == 0).float() * self.big_positive 48 | energies = self.att_nn(cat) - energy_mask.unsqueeze(-1) 49 | attention = Softmax(energies) 50 | embedding = self.emb_nn(hidden_nodes) 51 | 52 | return torch.sum(attention * embedding, dim=1) 53 | 54 | 55 | class Set2Vec(torch.nn.Module): 56 | """ 57 | S2V readout function. 58 | """ 59 | def __init__(self, node_features : int, hidden_node_features : int, 60 | lstm_computations : int, memory_size : int, 61 | constants : namedtuple) -> None: 62 | 63 | super().__init__() 64 | 65 | self.constants = constants 66 | self.lstm_computations = lstm_computations 67 | self.memory_size = memory_size 68 | 69 | self.embedding_matrix = torch.nn.Linear( 70 | in_features=node_features + hidden_node_features, 71 | out_features=self.memory_size, 72 | bias=True 73 | ) 74 | 75 | self.lstm = torch.nn.LSTMCell( 76 | input_size=self.memory_size, 77 | hidden_size=self.memory_size, 78 | bias=True 79 | ) 80 | 81 | def forward(self, hidden_output_nodes : torch.Tensor, input_nodes : torch.Tensor, 82 | node_mask : torch.Tensor) -> torch.Tensor: 83 | """ 84 | Defines forward pass. 
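At a shape level, the `GraphGather` readout above reduces padded per-node states to one embedding per graph. A small usage sketch follows; the dimensions and hyperparameter values are made up for illustration (they are not the repository defaults), and the import path assumes the working directory is `graphinvent/`.

```
import torch
from gnn.modules import GraphGather

gather = GraphGather(node_features=8, hidden_node_features=16, out_features=32,
                     att_depth=2, att_hidden_dim=64, att_dropout_p=0.0,
                     emb_depth=2, emb_hidden_dim=64, emb_dropout_p=0.0,
                     big_positive=1e6)

hidden_nodes = torch.rand(5, 13, 16)   # (batch, max nodes, hidden node features)
input_nodes  = torch.rand(5, 13, 8)    # (batch, max nodes, input node features)
node_mask    = torch.ones(5, 13)       # 1 = node exists, 0 = padded slot

graph_embeddings = gather(hidden_nodes, input_nodes, node_mask)   # shape (5, 32)
```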
85 | """ 86 | Softmax = torch.nn.Softmax(dim=1) 87 | 88 | batch_size = input_nodes.shape[0] 89 | energy_mask = torch.bitwise_not(node_mask).float() * self.C.big_negative 90 | lstm_input = torch.zeros(batch_size, self.memory_size, device=self.constants.device) 91 | cat = torch.cat((hidden_output_nodes, input_nodes), dim=2) 92 | memory = self.embedding_matrix(cat) 93 | hidden_state = torch.zeros(batch_size, self.memory_size, device=self.constants.device) 94 | cell_state = torch.zeros(batch_size, self.memory_size, device=self.constants.device) 95 | 96 | for _ in range(self.lstm_computations): 97 | query, cell_state = self.lstm(lstm_input, (hidden_state, cell_state)) 98 | 99 | # dot product query x memory 100 | energies = (query.view(batch_size, 1, self.memory_size) * memory).sum(dim=-1) 101 | attention = Softmax(energies + energy_mask) 102 | read = (attention.unsqueeze(-1) * memory).sum(dim=1) 103 | 104 | hidden_state = query 105 | lstm_input = read 106 | 107 | cat = torch.cat((query, read), dim=1) 108 | return cat 109 | 110 | 111 | class MLP(torch.nn.Module): 112 | """ 113 | Multi-layer perceptron. Applies SELU after every linear layer. 114 | 115 | Args: 116 | ---- 117 | in_features (int) : Size of each input sample. 118 | hidden_layer_sizes (list) : Hidden layer sizes. 119 | out_features (int) : Size of each output sample. 120 | dropout_p (float) : Probability of dropping a weight. 121 | """ 122 | 123 | def __init__(self, in_features : int, hidden_layer_sizes : list, out_features : int, 124 | dropout_p : float) -> None: 125 | super().__init__() 126 | 127 | activation_function = torch.nn.SELU 128 | 129 | # create list of all layer feature sizes 130 | fs = [in_features, *hidden_layer_sizes, out_features] 131 | 132 | # create list of linear_blocks 133 | layers = [self._linear_block(in_f, out_f, 134 | activation_function, 135 | dropout_p) 136 | for in_f, out_f in zip(fs, fs[1:])] 137 | 138 | # concatenate modules in all sequentials in layers list 139 | layers = [module for sq in layers for module in sq.children()] 140 | 141 | # add modules to sequential container 142 | self.seq = torch.nn.Sequential(*layers) 143 | 144 | def _linear_block(self, in_f : int, out_f : int, activation : torch.nn.Module, 145 | dropout_p : float) -> torch.nn.Sequential: 146 | """ 147 | Returns a linear block consisting of a linear layer, an activation function 148 | (SELU), and dropout (optional) stack. 149 | 150 | Args: 151 | ---- 152 | in_f (int) : Size of each input sample. 153 | out_f (int) : Size of each output sample. 154 | activation (torch.nn.Module) : Activation function. 155 | dropout_p (float) : Probability of dropping a weight. 156 | 157 | Returns: 158 | ------- 159 | torch.nn.Sequential : The linear block. 160 | """ 161 | # bias must be used in most MLPs in our models to learn from empty graphs 162 | linear = torch.nn.Linear(in_f, out_f, bias=True) 163 | torch.nn.init.xavier_uniform_(linear.weight) 164 | return torch.nn.Sequential(linear, activation(), torch.nn.AlphaDropout(dropout_p)) 165 | 166 | def forward(self, layers_input : torch.nn.Sequential) -> torch.nn.Sequential: 167 | """ 168 | Defines forward pass. 169 | """ 170 | return self.seq(layers_input) 171 | 172 | 173 | class GlobalReadout(torch.nn.Module): 174 | """ 175 | Global readout function class. Used to predict the action probability distributions 176 | (APDs) for molecular graphs. 177 | 178 | The first tier of two `MLP`s take as input, for each graph in the batch, the 179 | final transformed node feature vectors. 
These feed-forward networks correspond 180 | to the preliminary "f_add" and "f_conn" distributions. 181 | 182 | The second tier of three `MLP`s takes as input the output of the first tier 183 | of `MLP`s (the "preliminary" APDs) as well as the graph embeddings for all 184 | graphs in the batch. Output are the final APD components, which are then flattened 185 | and concatenated. No activation function is applied after the final layer, so 186 | that this can be done outside (e.g. in the loss function, and before sampling). 187 | """ 188 | def __init__(self, f_add_elems : int, f_conn_elems : int, f_term_elems : int, 189 | mlp1_depth : int, mlp1_dropout_p : float, mlp1_hidden_dim : int, 190 | mlp2_depth : int, mlp2_dropout_p : float, mlp2_hidden_dim : int, 191 | graph_emb_size : int, max_n_nodes : int, node_emb_size : int, 192 | device : str) -> None: 193 | super().__init__() 194 | 195 | self.device = device 196 | 197 | # preliminary f_add 198 | self.fAddNet1 = MLP( 199 | in_features=node_emb_size, 200 | hidden_layer_sizes=[mlp1_hidden_dim] * mlp1_depth, 201 | out_features=f_add_elems, 202 | dropout_p=mlp1_dropout_p 203 | ) 204 | 205 | # preliminary f_conn 206 | self.fConnNet1 = MLP( 207 | in_features=node_emb_size, 208 | hidden_layer_sizes=[mlp1_hidden_dim] * mlp1_depth, 209 | out_features=f_conn_elems, 210 | dropout_p=mlp1_dropout_p 211 | ) 212 | 213 | # final f_add 214 | self.fAddNet2 = MLP( 215 | in_features=(max_n_nodes * f_add_elems + graph_emb_size), 216 | hidden_layer_sizes=[mlp2_hidden_dim] * mlp2_depth, 217 | out_features=f_add_elems * max_n_nodes, 218 | dropout_p=mlp2_dropout_p 219 | ) 220 | 221 | # final f_conn 222 | self.fConnNet2 = MLP( 223 | in_features=(max_n_nodes * f_conn_elems + graph_emb_size), 224 | hidden_layer_sizes=[mlp2_hidden_dim] * mlp2_depth, 225 | out_features=f_conn_elems * max_n_nodes, 226 | dropout_p=mlp2_dropout_p 227 | ) 228 | 229 | # final f_term (only takes as input graph embeddings) 230 | self.fTermNet2 = MLP( 231 | in_features=graph_emb_size, 232 | hidden_layer_sizes=[mlp2_hidden_dim] * mlp2_depth, 233 | out_features=f_term_elems, 234 | dropout_p=mlp2_dropout_p 235 | ) 236 | 237 | def forward(self, node_level_output : torch.Tensor, 238 | graph_embedding_batch : torch.Tensor) -> torch.Tensor: 239 | """ 240 | Defines forward pass. 241 | """ 242 | if self.device == "cuda": 243 | self.fAddNet1 = self.fAddNet1.to("cuda", non_blocking=True) 244 | self.fConnNet1 = self.fConnNet1.to("cuda", non_blocking=True) 245 | self.fAddNet2 = self.fAddNet2.to("cuda", non_blocking=True) 246 | self.fConnNet2 = self.fConnNet2.to("cuda", non_blocking=True) 247 | self.fTermNet2 = self.fTermNet2.to("cuda", non_blocking=True) 248 | 249 | # get preliminary f_add and f_conn 250 | f_add_1 = self.fAddNet1(node_level_output) 251 | f_conn_1 = self.fConnNet1(node_level_output) 252 | 253 | if self.device == "cuda": 254 | f_add_1 = f_add_1.to("cuda", non_blocking=True) 255 | f_conn_1 = f_conn_1.to("cuda", non_blocking=True) 256 | 257 | # reshape preliminary APDs into flattenened vectors (e.g. 
one vector per 258 | # graph in batch) 259 | f_add_1_size = f_add_1.size() 260 | f_conn_1_size = f_conn_1.size() 261 | f_add_1 = f_add_1.view((f_add_1_size[0], f_add_1_size[1] * f_add_1_size[2])) 262 | f_conn_1 = f_conn_1.view((f_conn_1_size[0], f_conn_1_size[1] * f_conn_1_size[2])) 263 | 264 | # get final f_add, f_conn, and f_term 265 | f_add_2 = self.fAddNet2( 266 | torch.cat((f_add_1, graph_embedding_batch), dim=1).unsqueeze(dim=1) 267 | ) 268 | f_conn_2 = self.fConnNet2( 269 | torch.cat((f_conn_1, graph_embedding_batch), dim=1).unsqueeze(dim=1) 270 | ) 271 | f_term_2 = self.fTermNet2(graph_embedding_batch) 272 | 273 | if self.device == "cuda": 274 | f_add_2 = f_add_2.to("cuda", non_blocking=True) 275 | f_conn_2 = f_conn_2.to("cuda", non_blocking=True) 276 | f_term_2 = f_term_2.to("cuda", non_blocking=True) 277 | 278 | # flatten and concatenate 279 | cat = torch.cat((f_add_2.squeeze(dim=1), f_conn_2.squeeze(dim=1), f_term_2), dim=1) 280 | 281 | return cat # note: no activation function before returning 282 | -------------------------------------------------------------------------------- /graphinvent/gnn/summation_mpnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines the `SummationMPNN` class. 3 | """ 4 | # load general packages and functions 5 | from collections import namedtuple 6 | import torch 7 | 8 | 9 | class SummationMPNN(torch.nn.Module): 10 | """ 11 | Abstract `SummationMPNN` class. Specific models using this class are 12 | defined in `mpnn.py`; these are MNN, S2V, and GGNN. 13 | """ 14 | def __init__(self, constants : namedtuple): 15 | 16 | super().__init__() 17 | 18 | self.hidden_node_features = constants.hidden_node_features 19 | self.edge_features = constants.n_edge_features 20 | self.message_size = constants.message_size 21 | self.message_passes = constants.message_passes 22 | self.constants = constants 23 | 24 | def message_terms(self, nodes : torch.Tensor, node_neighbours : torch.Tensor, 25 | edges : torch.Tensor) -> None: 26 | """ 27 | Message passing function, to be implemented in all `SummationMPNN` subclasses. 28 | 29 | Args: 30 | ---- 31 | nodes (torch.Tensor) : Batch of node feature vectors. 32 | node_neighbours (torch.Tensor) : Batch of node feature vectors for neighbors. 33 | edges (torch.Tensor) : Batch of edge feature vectors. 34 | 35 | Shapes: 36 | ------ 37 | nodes : (total N nodes in batch, N node features) 38 | node_neighbours : (total N nodes in batch, max node degree, N node features) 39 | edges : (total N nodes in batch, max node degree, N edge features) 40 | """ 41 | raise NotImplementedError 42 | 43 | def update(self, nodes : torch.Tensor, messages : torch.Tensor) -> None: 44 | """ 45 | Message update function, to be implemented in all `SummationMPNN` subclasses. 46 | 47 | Args: 48 | ---- 49 | nodes (torch.Tensor) : Batch of node feature vectors. 50 | messages (torch.Tensor) : Batch of incoming messages. 51 | 52 | Shapes: 53 | ------ 54 | nodes : (total N nodes in batch, N node features) 55 | messages : (total N nodes in batch, N node features) 56 | """ 57 | raise NotImplementedError 58 | 59 | def readout(self, hidden_nodes : torch.Tensor, input_nodes : torch.Tensor, 60 | node_mask : torch.Tensor) -> None: 61 | """ 62 | Local readout function, to be implemented in all `SummationMPNN` subclasses. 63 | 64 | Args: 65 | ---- 66 | hidden_nodes (torch.Tensor) : Batch of node feature vectors. 67 | input_nodes (torch.Tensor) : Batch of node feature vectors. 
68 | node_mask (torch.Tensor) : Mask for non-existing neighbors, where elements 69 | are 1 if corresponding element exists and 0 70 | otherwise. 71 | 72 | Shapes: 73 | ------ 74 | hidden_nodes : (total N nodes in batch, N node features) 75 | input_nodes : (total N nodes in batch, N node features) 76 | node_mask : (total N nodes in batch, N features) 77 | """ 78 | raise NotImplementedError 79 | 80 | def forward(self, nodes : torch.Tensor, edges : torch.Tensor) -> None: 81 | """ 82 | Defines forward pass. 83 | 84 | Args: 85 | ---- 86 | nodes (torch.Tensor) : Batch of node feature matrices. 87 | edges (torch.Tensor) : Batch of edge feature tensors. 88 | 89 | Shapes: 90 | ------ 91 | nodes : (batch size, N nodes, N node features) 92 | edges : (batch size, N nodes, N nodes, N edge features) 93 | 94 | Returns: 95 | ------- 96 | output (torch.Tensor) : This would normally be the learned graph representation, 97 | but in all MPNN readout functions in this work, 98 | the last layer is used to predict the action 99 | probability distribution for a batch of graphs 100 | from the learned graph representation. 101 | """ 102 | adjacency = torch.sum(edges, dim=3) 103 | 104 | # **note: "idc" == "indices", "nghb{s}" == "neighbour(s)" 105 | (edge_batch_batch_idc, 106 | edge_batch_node_idc, 107 | edge_batch_nghb_idc) = adjacency.nonzero(as_tuple=True) 108 | 109 | (node_batch_batch_idc, node_batch_node_idc) = adjacency.sum(-1).nonzero(as_tuple=True) 110 | 111 | same_batch = node_batch_batch_idc.view(-1, 1) == edge_batch_batch_idc 112 | same_node = node_batch_node_idc.view(-1, 1) == edge_batch_node_idc 113 | 114 | # element ij of `message_summation_matrix` is 1 if `edge_batch_edges[j]` 115 | # is connected with `node_batch_nodes[i]`, else 0 116 | message_summation_matrix = (same_batch * same_node).float() 117 | 118 | edge_batch_edges = edges[edge_batch_batch_idc, edge_batch_node_idc, edge_batch_nghb_idc, :] 119 | 120 | # pad up the hidden nodes 121 | hidden_nodes = torch.zeros(nodes.shape[0], 122 | nodes.shape[1], 123 | self.hidden_node_features, 124 | device=self.constants.device) 125 | hidden_nodes[:nodes.shape[0], :nodes.shape[1], :nodes.shape[2]] = nodes.clone() 126 | node_batch_nodes = hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] 127 | 128 | for _ in range(self.message_passes): 129 | edge_batch_nodes = hidden_nodes[edge_batch_batch_idc, edge_batch_node_idc, :] 130 | 131 | edge_batch_nghbs = hidden_nodes[edge_batch_batch_idc, edge_batch_nghb_idc, :] 132 | 133 | message_terms = self.message_terms(edge_batch_nodes, 134 | edge_batch_nghbs, 135 | edge_batch_edges) 136 | 137 | if len(message_terms.size()) == 1: # if a single graph in batch 138 | message_terms = message_terms.unsqueeze(0) 139 | 140 | # the summation in eq. 1 of the NMPQC paper happens here 141 | messages = torch.matmul(message_summation_matrix, message_terms) 142 | 143 | node_batch_nodes = self.update(node_batch_nodes, messages) 144 | hidden_nodes[node_batch_batch_idc, node_batch_node_idc, :] = node_batch_nodes.clone() 145 | 146 | node_mask = adjacency.sum(-1) != 0 147 | output = self.readout(hidden_nodes, nodes, node_mask) 148 | 149 | return output 150 | -------------------------------------------------------------------------------- /graphinvent/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Main function for running GraphINVENT jobs. 
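The single `torch.matmul` in the `SummationMPNN.forward()` loop above is what routes per-edge message terms to their receiving nodes: row i of `message_summation_matrix` flags every directed edge whose node index is i. A toy single-graph illustration (so the `same_batch` factor collapses to all-ones and is dropped):

```
import torch

# a 3-node chain 0-1-2 gives four directed edges with node indices [0, 1, 1, 2]
edge_batch_node_idc = torch.tensor([0, 1, 1, 2])
node_batch_node_idc = torch.tensor([0, 1, 2])

same_node = node_batch_node_idc.view(-1, 1) == edge_batch_node_idc
message_summation_matrix = same_node.float()    # (3 nodes, 4 edges)

message_terms = torch.tensor([[1.0], [2.0], [3.0], [4.0]])   # toy 1-D term per edge
print(torch.matmul(message_summation_matrix, message_terms))
# tensor([[1.], [5.], [4.]])  -> node 1 sums the terms of its two incident edges
```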
3 | 4 | Examples: 5 | -------- 6 | * If you define an "input.csv" with desired job parameters in job_dir/: 7 | (graphinvent) ~/GraphINVENT$ python main.py --job_dir path/to/job_dir/ 8 | * If you instead want to run your job using the submission scripts: 9 | (graphinvent) ~/GraphINVENT$ python submit-fine-tuning.py 10 | """ 11 | # load general packages and functions 12 | import datetime 13 | 14 | # load GraphINVENT-specific functions 15 | import util 16 | from parameters.constants import constants 17 | from Workflow import Workflow 18 | 19 | # suppress minor warnings 20 | util.suppress_warnings() 21 | 22 | 23 | def main(): 24 | """ 25 | Defines the type of job (preprocessing, training, generation, testing, or 26 | fine-tuning), writes the job parameters (for future reference), and runs 27 | the job. 28 | """ 29 | _ = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") # fix date/time 30 | 31 | workflow = Workflow(constants=constants) 32 | 33 | job_type = constants.job_type 34 | print(f"* Run mode: '{job_type}'", flush=True) 35 | 36 | if job_type == "preprocess": 37 | # write preprocessing parameters 38 | util.write_preprocessing_parameters(params=constants) 39 | 40 | # preprocess all datasets 41 | workflow.preprocess_phase() 42 | 43 | elif job_type == "train": 44 | # write training parameters 45 | util.write_job_parameters(params=constants) 46 | 47 | # train model and generate graphs 48 | workflow.training_phase() 49 | 50 | elif job_type == "generate": 51 | # write generation parameters 52 | util.write_job_parameters(params=constants) 53 | 54 | # generate molecules only 55 | workflow.generation_phase() 56 | 57 | elif job_type == "test": 58 | # write testing parameters 59 | util.write_job_parameters(params=constants) 60 | 61 | # evaluate best model using the test set data 62 | workflow.testing_phase() 63 | 64 | elif job_type == "fine-tune": 65 | # write training parameters 66 | util.write_job_parameters(params=constants) 67 | 68 | # fine-tune the model and generate graphs 69 | workflow.learning_phase() 70 | 71 | else: 72 | raise NotImplementedError("Not a valid `job_type`.") 73 | 74 | 75 | if __name__ == "__main__": 76 | main() 77 | -------------------------------------------------------------------------------- /graphinvent/parameters/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MolecularAI/GraphINVENT/6ef587ddb983f0c853dc8bc7b418f43cb69420c9/graphinvent/parameters/__init__.py -------------------------------------------------------------------------------- /graphinvent/parameters/args.py: -------------------------------------------------------------------------------- 1 | """ 2 | Defines `ArgumentParser` for specifying job directory using command-line. 
3 | """# load general packages and functions 4 | import argparse 5 | 6 | 7 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 8 | add_help=False) 9 | parser.add_argument("--job-dir", 10 | type=str, 11 | default="../output/", 12 | help="Directory in which to write all output.") 13 | 14 | 15 | args = parser.parse_args() 16 | 17 | args_dict = vars(args) 18 | job_dir = args_dict["job_dir"] 19 | -------------------------------------------------------------------------------- /graphinvent/parameters/constants.py: -------------------------------------------------------------------------------- 1 | """ 2 | Loads input parameters from `defaults.py`, and defines other global constants 3 | that depend on the input features, creating a `namedtuple` from them; 4 | additionally, if there exists an `input.csv` in the job directory, loads those 5 | arguments and overrides default values in `defaults.py`. 6 | """ 7 | # load general packages and functions 8 | from collections import namedtuple 9 | import pickle 10 | import csv 11 | import os 12 | import sys 13 | from typing import Tuple 14 | import numpy as np 15 | from rdkit.Chem.rdchem import BondType 16 | 17 | # load GraphINVENT-specific functions 18 | sys.path.insert(1, "./parameters/") # search "parameters/" directory 19 | import parameters.args as args 20 | import parameters.defaults as defaults 21 | 22 | 23 | def get_feature_dimensions(parameters : dict) -> Tuple[int, int, int, int]: 24 | """ 25 | Returns dimensions of all node features. 26 | """ 27 | n_atom_types = len(parameters["atom_types"]) 28 | n_formal_charge = len(parameters["formal_charge"]) 29 | n_numh = ( 30 | int(not parameters["use_explicit_H"] and not parameters["ignore_H"]) 31 | * len(parameters["imp_H"]) 32 | ) 33 | n_chirality = int(parameters["use_chirality"]) * len(parameters["chirality"]) 34 | 35 | return n_atom_types, n_formal_charge, n_numh, n_chirality 36 | 37 | 38 | def get_tensor_dimensions(n_atom_types : int, n_formal_charge : int, n_num_h : int, 39 | n_chirality : int, n_node_features : int, n_edge_features : int, 40 | parameters : dict) -> Tuple[list, list, list, list, int]: 41 | """ 42 | Returns dimensions for all tensors that describe molecular graphs. Tensor dimensions 43 | are `list`s, except for `dim_f_term` which is simply an `int`. Each element 44 | of the lists indicate the corresponding dimension of a particular subgraph matrix 45 | (i.e. `nodes`, `f_add`, etc). 46 | """ 47 | max_nodes = parameters["max_n_nodes"] 48 | 49 | # define the matrix dimensions as `list`s 50 | # first for the graph reps... 51 | dim_nodes = [max_nodes, n_node_features] 52 | 53 | dim_edges = [max_nodes, max_nodes, n_edge_features] 54 | 55 | # ... 
then for the APDs 56 | if parameters["use_chirality"]: 57 | if parameters["use_explicit_H"] or parameters["ignore_H"]: 58 | dim_f_add = [ 59 | parameters["max_n_nodes"], 60 | n_atom_types, 61 | n_formal_charge, 62 | n_chirality, 63 | n_edge_features, 64 | ] 65 | else: 66 | dim_f_add = [ 67 | parameters["max_n_nodes"], 68 | n_atom_types, 69 | n_formal_charge, 70 | n_num_h, 71 | n_chirality, 72 | n_edge_features, 73 | ] 74 | else: 75 | if parameters["use_explicit_H"] or parameters["ignore_H"]: 76 | dim_f_add = [ 77 | parameters["max_n_nodes"], 78 | n_atom_types, 79 | n_formal_charge, 80 | n_edge_features, 81 | ] 82 | else: 83 | dim_f_add = [ 84 | parameters["max_n_nodes"], 85 | n_atom_types, 86 | n_formal_charge, 87 | n_num_h, 88 | n_edge_features, 89 | ] 90 | 91 | dim_f_conn = [parameters["max_n_nodes"], n_edge_features] 92 | 93 | dim_f_term = 1 94 | 95 | return dim_nodes, dim_edges, dim_f_add, dim_f_conn, dim_f_term 96 | 97 | 98 | def load_params(input_csv_path : str) -> dict: 99 | """ 100 | Loads job parameters/hyperparameters from CSV (in `input_csv_path`). 101 | """ 102 | params_to_override = {} 103 | with open(input_csv_path, "r") as csv_file: 104 | 105 | params_reader = csv.reader(csv_file, delimiter=";") 106 | 107 | for key, value in params_reader: 108 | try: 109 | params_to_override[key] = eval(value) 110 | except NameError: # `value` is a `str` 111 | params_to_override[key] = value 112 | except SyntaxError: # to avoid "unexpected `EOF`" 113 | params_to_override[key] = value 114 | 115 | return params_to_override 116 | 117 | 118 | def override_params(all_params : dict) -> dict: 119 | """ 120 | If there exists an `input.csv` in the job directory, loads those arguments 121 | and overrides their default values from `features.py`. 122 | """ 123 | input_csv_path = all_params["job_dir"] + "input.csv" 124 | 125 | # check if there exists and `input.csv` in working directory 126 | if os.path.exists(input_csv_path): 127 | # override default values for parameters in `input.csv` 128 | params_to_override_dict = load_params(input_csv_path) 129 | for key, value in params_to_override_dict.items(): 130 | all_params[key] = value 131 | 132 | return all_params 133 | 134 | 135 | def collect_global_constants(parameters : dict, job_dir : str) -> namedtuple: 136 | """ 137 | Collects constants defined in `features.py` with those defined by the 138 | ArgParser (`args.py`), and returns the bundle as a `namedtuple`. 139 | 140 | Args: 141 | ---- 142 | parameters (dict) : Dictionary of parameters defined in `features.py`. 143 | job_dir (str) : Current job directory, defined on the command line. 144 | 145 | Returns: 146 | ------- 147 | constants (namedtuple) : Collected constants. 148 | """ 149 | # first override any arguments from `input.csv`: 150 | parameters["job_dir"] = job_dir 151 | parameters = override_params(all_params=parameters) 152 | 153 | # then calculate any global constants below: 154 | if parameters["use_explicit_H"] and parameters["ignore_H"]: 155 | raise ValueError("Cannot use explicit Hs and ignore Hs at " 156 | "the same time. 
Please fix flags.") 157 | 158 | # define edge feature (rdkit `GetBondType()` result -> `int`) constants 159 | bondtype_to_int = {BondType.SINGLE: 0, BondType.DOUBLE: 1, BondType.TRIPLE: 2} 160 | 161 | if parameters["use_aromatic_bonds"]: 162 | bondtype_to_int[BondType.AROMATIC] = 3 163 | 164 | int_to_bondtype = dict(map(reversed, bondtype_to_int.items())) 165 | 166 | n_edge_features = len(bondtype_to_int) 167 | 168 | # define node feature constants 169 | n_atom_types, n_formal_charge, n_imp_H, n_chirality = get_feature_dimensions(parameters) 170 | 171 | n_node_features = n_atom_types + n_formal_charge + n_imp_H + n_chirality 172 | 173 | # define matrix dimensions 174 | (dim_nodes, dim_edges, dim_f_add, 175 | dim_f_conn, dim_f_term) = get_tensor_dimensions(n_atom_types, 176 | n_formal_charge, 177 | n_imp_H, 178 | n_chirality, 179 | n_node_features, 180 | n_edge_features, 181 | parameters) 182 | 183 | len_f_add = np.prod(dim_f_add[:]) 184 | len_f_add_per_node = np.prod(dim_f_add[1:]) 185 | len_f_conn = np.prod(dim_f_conn[:]) 186 | len_f_conn_per_node = np.prod(dim_f_conn[1:]) 187 | 188 | # create a dictionary of global constants, and add `job_dir` to it; this 189 | # will ultimately be converted to a `namedtuple` 190 | constants_dict = { 191 | "big_negative" : -1e6, 192 | "big_positive" : 1e6, 193 | "bondtype_to_int" : bondtype_to_int, 194 | "int_to_bondtype" : int_to_bondtype, 195 | "n_edge_features" : n_edge_features, 196 | "n_atom_types" : n_atom_types, 197 | "n_formal_charge" : n_formal_charge, 198 | "n_imp_H" : n_imp_H, 199 | "n_chirality" : n_chirality, 200 | "n_node_features" : n_node_features, 201 | "dim_nodes" : dim_nodes, 202 | "dim_edges" : dim_edges, 203 | "dim_f_add" : dim_f_add, 204 | "dim_f_conn" : dim_f_conn, 205 | "dim_f_term" : dim_f_term, 206 | "dim_apd" : [np.prod(dim_f_add) + np.prod(dim_f_conn) + 1], 207 | "len_f_add" : len_f_add, 208 | "len_f_add_per_node" : len_f_add_per_node, 209 | "len_f_conn" : len_f_conn, 210 | "len_f_conn_per_node": len_f_conn_per_node, 211 | } 212 | 213 | # join with `features.args_dict` 214 | constants_dict.update(parameters) 215 | 216 | # define path to dataset splits 217 | constants_dict["test_set"] = parameters["dataset_dir"] + "test.smi" 218 | constants_dict["training_set"] = parameters["dataset_dir"] + "train.smi" 219 | constants_dict["validation_set"] = parameters["dataset_dir"] + "valid.smi" 220 | 221 | # check (if a job is not a preprocessing job) that parameters match those for 222 | # the original preprocessing job 223 | if constants_dict["job_type"] != "preprocess": 224 | print( 225 | "* Running job using HDF datasets located at " 226 | + parameters["dataset_dir"], 227 | flush=True, 228 | ) 229 | print( 230 | "* Checking that the relevant parameters match " 231 | "those used in preprocessing the dataset.", 232 | flush=True, 233 | ) 234 | 235 | # load preprocessing parameters for comparison (if they exist already) 236 | csv_file = parameters["dataset_dir"] + "preprocessing_params.csv" 237 | params_to_check = load_params(input_csv_path=csv_file) 238 | 239 | for key, value in params_to_check.items(): 240 | if key in constants_dict.keys() and value != constants_dict[key]: 241 | raise ValueError( 242 | f"Check that training job parameters match those used in " 243 | f"preprocessing. {key} does not match." 244 | ) 245 | 246 | # if above error never raised, then all relevant parameters match! 
:) 247 | print("-- Job parameters match preprocessing parameters.", flush=True) 248 | 249 | # load QSAR models (sklearn activity model) 250 | if constants_dict["job_type"] == "fine-tune": 251 | print("-- Loading pre-trained scikit-learn activity model.", flush=True) 252 | for qsar_model_name, qsar_model_path in constants_dict["qsar_models"].items(): 253 | with open(qsar_model_path, 'rb') as file: 254 | model_dict = pickle.load(file) 255 | activity_model = model_dict["classifier_sv"] 256 | constants_dict["qsar_models"][qsar_model_name] = activity_model 257 | 258 | # convert `CONSTANTS` dictionary into a namedtuple (immutable + cleaner) 259 | Constants = namedtuple("CONSTANTS", sorted(constants_dict)) 260 | constants = Constants(**constants_dict) 261 | 262 | return constants 263 | 264 | # collect the constants using the functions defined above 265 | constants = collect_global_constants(parameters=defaults.parameters, 266 | job_dir=args.job_dir) 267 | -------------------------------------------------------------------------------- /graphinvent/parameters/load.py: -------------------------------------------------------------------------------- 1 | """ 2 | Functions for loading molecules from SMILES, as well as loading the model type. 3 | """ 4 | # load general packages and functions 5 | import csv 6 | import rdkit 7 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier 8 | 9 | 10 | def molecules(path : str) -> rdkit.Chem.rdmolfiles.SmilesMolSupplier: 11 | """ 12 | Reads a SMILES file (full path/filename specified by `path`) and returns 13 | `rdkit.Mol` objects. 14 | """ 15 | # check first line of SMILES file to see if contains header 16 | with open(path) as smi_file: 17 | first_line = smi_file.readline() 18 | has_header = bool("SMILES" in first_line) 19 | smi_file.close() 20 | 21 | # read file 22 | molecule_set = SmilesMolSupplier(path, 23 | sanitize=True, 24 | nameColumn=-1, 25 | titleLine=has_header) 26 | return molecule_set 27 | 28 | def which_model(input_csv_path : str) -> str: 29 | """ 30 | Gets the type of model to use by reading it from CSV (in "input.csv"). 31 | 32 | Args: 33 | ---- 34 | input_csv_path (str) : The full path/filename to "input.csv" file 35 | containing parameters to overwrite from defaults. 36 | 37 | Returns: 38 | ------- 39 | value (str) : Name of model to use. 40 | """ 41 | with open(input_csv_path, "r") as csv_file: 42 | 43 | params_reader = csv.reader(csv_file, delimiter=";") 44 | 45 | for key, value in params_reader: 46 | if key == "model": 47 | return value # string describing model e.g. "GGNN" 48 | 49 | raise ValueError("Model type not specified.") 50 | -------------------------------------------------------------------------------- /output/input.csv: -------------------------------------------------------------------------------- 1 | atom_types;['C', 'N', 'O', 'S', 'Cl'] 2 | formal_charge;[-1, 0, 1] 3 | chirality;['None', 'R', 'S'] 4 | max_n_nodes;13 5 | job_type;train 6 | dataset_dir;/path/to/GraphINVENT/data/gdb13_1K/ 7 | model;GGNN 8 | -------------------------------------------------------------------------------- /submit-fine-tuning.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example submission script for a GraphINVENT fine-tuning job. This can be used to 3 | fine-tune a pre-trained model via reinforcement learning. 
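Fine-tuning jobs assume a pickled activity model is available on disk. A rough sketch of how a compatible pickle could be produced is shown below; the molecules, labels, and file name are dummies, the descriptor settings mirror `ScoringFunction.compute_activity()` (ECFP4, radius 2, 2048 bits), and the dict layout with a "classifier_sv" key is what `parameters/constants.py` unpickles:

```
import pickle
import numpy as np
from sklearn.svm import SVC
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles):
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)
    arr = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# dummy training data: 1 = "active", 0 = "inactive"
smiles = ["CCO", "CCC", "CCCC", "CCN", "CC(C)O",
          "c1ccccc1", "Cc1ccccc1", "c1ccncc1", "c1ccsc1", "c1ccoc1"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
features = np.array([ecfp4(s) for s in smiles])

classifier = SVC(probability=True)   # predict_proba is required by ScoringFunction.py
classifier.fit(features, labels)

with open("data/fine-tuning/qsar_model.pickle", "wb") as f:   # placeholder path
    pickle.dump({"classifier_sv": classifier}, f)
```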
4 | 5 | To run, type: 6 | (graphinvent) ~/GraphINVENT$ python submit-fine-tuning.py 7 | """ 8 | # load general packages and functions 9 | import csv 10 | import sys 11 | import os 12 | from pathlib import Path 13 | import subprocess 14 | import time 15 | import torch 16 | 17 | 18 | # define what you want to do for the specified job(s) 19 | DATASET = "gdb13_1K-debug" 20 | JOB_TYPE = "fine-tune" # "fine-tune", or "generate" 21 | JOBDIR_START_IDX = 0 # where to start indexing job dirs 22 | N_JOBS = 1 # number of jobs to run per model 23 | RESTART = False 24 | FORCE_OVERWRITE = True # overwrite job directories which already exist 25 | JOBNAME = "example_job_name" # used to create a sub directory 26 | 27 | # if running using SLURM sbatch, specify params below 28 | USE_SLURM = False # use SLURM or not 29 | RUN_TIME = "1-00:00:00" # hh:mm:ss 30 | MEM_GB = 20 # required RAM in GB 31 | 32 | # for SLURM jobs, set partition to run job on (preprocessing jobs run entirely on 33 | # CPU, so no need to request GPU partition; all other job types benefit from running 34 | # on a GPU) 35 | if JOB_TYPE == "preprocess": 36 | PARTITION = "core" 37 | CPUS_PER_TASK = 1 38 | else: 39 | PARTITION = "gpu" 40 | CPUS_PER_TASK = 4 41 | 42 | # set paths here 43 | HOME = str(Path.home()) 44 | PYTHON_PATH = f"{HOME}/miniconda3/envs/graphinvent/bin/python" 45 | GRAPHINVENT_PATH = "./graphinvent/" 46 | DATA_PATH = "./data/fine-tuning/" 47 | 48 | if torch.cuda.is_available(): 49 | DEVICE = "cuda" 50 | else: 51 | DEVICE = "cpu" 52 | 53 | # define dataset-specific parameters 54 | params = { 55 | "atom_types" : ["C", "N", "O", "S", "Cl"], # <-- should match pre-trained model param 56 | "formal_charge" : [-1, 0, +1], # <-- should match pre-trained model param 57 | "max_n_nodes" : 13, # <-- should match pre-trained model param 58 | "job_type" : JOB_TYPE, 59 | "dataset_dir" : f"{DATA_PATH}{DATASET}/", 60 | "restart" : RESTART, 61 | "device" : DEVICE, 62 | "model" : "GGNN", # <-- should match pre-trained model param 63 | "sample_every" : 2, 64 | "init_lr" : 1e-4, 65 | "epochs" : 100, # <-- number of fine-tuning steps 66 | "batch_size" : 64, 67 | "block_size" : 1000, 68 | "n_workers" : 0, 69 | "sigma" : 20, # <-- see loss function 70 | "alpha" : 0.5, # <-- see loss function 71 | "pretrained_model_dir": f"output_{DATASET}/example/job_0/", 72 | "generation_epoch" : 80, # <-- which pre-trained model epoch to use 73 | "n_samples" : 100, # <-- how many graphs to sample every step 74 | # additional paramaters can be defined here, if different from the "defaults" 75 | } 76 | 77 | 78 | def submit() -> None: 79 | """ 80 | Creates and submits submission script. Uses global variables defined at top 81 | of this file. 
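For orientation, `write_input_csv()` further down serialises this `params` dictionary one key per row with ";" as the delimiter, so each job directory ends up with an `input.csv` along the following lines (abridged; values reflect the defaults at the top of this script, and run-specific keys such as `job_dir` are also written):

```
atom_types;['C', 'N', 'O', 'S', 'Cl']
formal_charge;[-1, 0, 1]
max_n_nodes;13
job_type;fine-tune
dataset_dir;./data/fine-tuning/gdb13_1K-debug/
model;GGNN
sigma;20
alpha;0.5
pretrained_model_dir;output_gdb13_1K-debug/example/job_0/
generation_epoch;80
n_samples;100
```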
82 | """ 83 | check_paths() 84 | 85 | # create an output directory 86 | dataset_output_path = f"{HOME}/GraphINVENT/output_{DATASET}" 87 | tensorboard_path = os.path.join(dataset_output_path, "tensorboard") 88 | if JOBNAME != "": 89 | dataset_output_path = os.path.join(dataset_output_path, JOBNAME) 90 | tensorboard_path = os.path.join(tensorboard_path, JOBNAME) 91 | 92 | os.makedirs(dataset_output_path, exist_ok=True) 93 | os.makedirs(tensorboard_path, exist_ok=True) 94 | print(f"* Creating dataset directory {dataset_output_path}/", flush=True) 95 | 96 | # submit `N_JOBS` separate jobs 97 | jobdir_end_idx = JOBDIR_START_IDX + N_JOBS 98 | for job_idx in range(JOBDIR_START_IDX, jobdir_end_idx): 99 | 100 | # specify and create the job subdirectory if it does not exist 101 | params["job_dir"] = f"{dataset_output_path}/job_{job_idx}/" 102 | params["tensorboard_dir"] = f"{tensorboard_path}/job_{job_idx}/" 103 | 104 | # create the directory if it does not exist already, otherwise raises an 105 | # error, which is good because *might* not want to override data our 106 | # existing directories! 107 | os.makedirs(params["tensorboard_dir"], exist_ok=True) 108 | try: 109 | job_dir_exists_already = bool( 110 | JOB_TYPE in ["generate", "test"] or FORCE_OVERWRITE 111 | ) 112 | os.makedirs(params["job_dir"], exist_ok=job_dir_exists_already) 113 | print( 114 | f"* Creating model subdirectory {dataset_output_path}/job_{job_idx}/", 115 | flush=True, 116 | ) 117 | except FileExistsError: 118 | print( 119 | f"-- Model subdirectory {dataset_output_path}/job_{job_idx}/ already exists.", 120 | flush=True, 121 | ) 122 | if not RESTART: 123 | continue 124 | 125 | # write the `input.csv` file 126 | write_input_csv(params_dict=params, filename="input.csv") 127 | 128 | # write `submit.sh` and submit 129 | if USE_SLURM: 130 | print("* Writing submission script.", flush=True) 131 | write_submission_script(job_dir=params["job_dir"], 132 | job_idx=job_idx, 133 | job_type=params["job_type"], 134 | max_n_nodes=params["max_n_nodes"], 135 | runtime=RUN_TIME, 136 | mem=MEM_GB, 137 | ptn=PARTITION, 138 | cpu_per_task=CPUS_PER_TASK, 139 | python_bin_path=PYTHON_PATH) 140 | 141 | print("* Submitting job to SLURM.", flush=True) 142 | subprocess.run(["sbatch", params["job_dir"] + "submit.sh"], 143 | check=True) 144 | else: 145 | print("* Running job as a normal process.", flush=True) 146 | subprocess.run(["ls", f"{PYTHON_PATH}"], check=True) 147 | subprocess.run([f"{PYTHON_PATH}", 148 | f"{GRAPHINVENT_PATH}main.py", 149 | "--job-dir", 150 | params["job_dir"]], 151 | check=True) 152 | 153 | # sleep a few secs before submitting next job 154 | print("-- Sleeping 2 seconds.") 155 | time.sleep(2) 156 | 157 | 158 | def write_input_csv(params_dict : dict, filename : str="params.csv") -> None: 159 | """ 160 | Writes job parameters/hyperparameters in `params_dict` to CSV using the specified 161 | `filename`. 162 | """ 163 | dict_path = params_dict["job_dir"] + filename 164 | 165 | with open(dict_path, "w") as csv_file: 166 | 167 | writer = csv.writer(csv_file, delimiter=";") 168 | for key, value in params_dict.items(): 169 | writer.writerow([key, value]) 170 | 171 | 172 | def write_submission_script(job_dir : str, job_idx : int, job_type : str, max_n_nodes : int, 173 | runtime : str, mem : int, ptn : str, cpu_per_task : int, 174 | python_bin_path : str) -> None: 175 | """ 176 | Writes a submission script (`submit.sh`). 177 | 178 | Args: 179 | ---- 180 | job_dir (str) : Job running directory. 181 | job_idx (int) : Job idx. 
182 | job_type (str) : Type of job to run. 183 | max_n_nodes (int) : Maximum number of nodes in dataset. 184 | runtime (str) : Job run-time limit in hh:mm:ss format. 185 | mem (int) : Gigabytes to reserve. 186 | ptn (str) : Partition to use, either "core" (CPU) or "gpu" (GPU). 187 | cpu_per_task (int) : How many CPUs to use per task. 188 | python_bin_path (str) : Path to Python binary to use. 189 | """ 190 | submit_filename = job_dir + "submit.sh" 191 | with open(submit_filename, "w") as submit_file: 192 | submit_file.write("#!/bin/bash\n") 193 | submit_file.write(f"#SBATCH --job-name={job_type}{max_n_nodes}_{job_idx}\n") 194 | submit_file.write(f"#SBATCH --output={job_type}{max_n_nodes}_{job_idx}o\n") 195 | submit_file.write(f"#SBATCH --time={runtime}\n") 196 | submit_file.write(f"#SBATCH --mem={mem}g\n") 197 | submit_file.write(f"#SBATCH --partition={ptn}\n") 198 | submit_file.write("#SBATCH --nodes=1\n") 199 | submit_file.write(f"#SBATCH --cpus-per-task={cpu_per_task}\n") 200 | if ptn == "gpu": 201 | submit_file.write("#SBATCH --gres=gpu:1\n") 202 | submit_file.write("hostname\n") 203 | submit_file.write("export QT_QPA_PLATFORM='offscreen'\n") 204 | submit_file.write(f"{python_bin_path} {GRAPHINVENT_PATH}main.py --job-dir {job_dir}") 205 | submit_file.write(f" > {job_dir}output.o${{SLURM_JOB_ID}}\n") 206 | 207 | 208 | def check_paths() -> None: 209 | """ 210 | Checks that paths to Python binary, data, and GraphINVENT are properly 211 | defined before running a job, and tells the user to define them if not. 212 | """ 213 | for path in [PYTHON_PATH, GRAPHINVENT_PATH, DATA_PATH]: 214 | if "path/to/" in path: 215 | print("!!!") 216 | print("* Update the following paths in `submit.py` before running:") 217 | print("-- `PYTHON_PATH`\n-- `GRAPHINVENT_PATH`\n-- `DATA_PATH`") 218 | sys.exit(0) 219 | 220 | if __name__ == "__main__": 221 | submit() 222 | -------------------------------------------------------------------------------- /submit-pre-training-supercloud.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example submission script for a GraphINVENT training job (distribution- 3 | based training, not fine-tuning/optimization job). This can be used to 4 | pre-train a model before a fine-tuning (via reinforcement learning) job. 5 | 6 | To run, type: 7 | (graphinvent) ~/GraphINVENT$ python submit-pre-training.py 8 | 9 | This script was modified to run on the MIT Supercloud. 
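All of the submission scripts in this repository build the same output layout from `HOME`, `DATASET`, and `JOBNAME`. With the defaults in `submit-fine-tuning.py` above (DATASET "gdb13_1K-debug", JOBNAME "example_job_name", one job starting at index 0), `submit()` creates roughly:

```
~/GraphINVENT/output_gdb13_1K-debug/
├── example_job_name/
│   └── job_0/
│       ├── input.csv    # written by write_input_csv()
│       └── submit.sh    # only written when USE_SLURM (or USE_LLSUB) is enabled
└── tensorboard/
    └── example_job_name/
        └── job_0/
```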
10 | """ 11 | # load general packages and functions 12 | import csv 13 | import sys 14 | import os 15 | from pathlib import Path 16 | import subprocess 17 | import time 18 | import torch 19 | 20 | 21 | # define what you want to do for the specified job(s) 22 | DATASET = "gdb13_1K-debug" # dataset name in "./data/pre-training/" 23 | JOB_TYPE = "train" # "preprocess", "train", "generate", or "test" 24 | JOBDIR_START_IDX = 0 # where to start indexing job dirs 25 | N_JOBS = 1 # number of jobs to run per model 26 | RESTART = False # whether or not this is a restart job 27 | FORCE_OVERWRITE = True # overwrite job directories which already exist 28 | JOBNAME = "example-job-name" # used to create a sub directory 29 | 30 | # if running using LLsub, specify params below 31 | USE_LLSUB = True # use LLsub or not 32 | MEM_GB = 20 # required RAM in GB 33 | 34 | # for LLsub jobs, set number of CPUs per task 35 | if JOB_TYPE == "preprocess": 36 | CPUS_PER_TASK = 1 37 | DEVICE = "cpu" 38 | else: 39 | CPUS_PER_TASK = 10 40 | DEVICE = "cuda" 41 | 42 | # set paths here 43 | HOME = str(Path.home()) 44 | PYTHON_PATH = f"{HOME}/path/to/graphinvent/bin/python" 45 | GRAPHINVENT_PATH = "./graphinvent/" 46 | DATA_PATH = "./data/pre-training/" 47 | 48 | # define dataset-specific parameters 49 | params = { 50 | "atom_types" : ["C", "N", "O", "S", "Cl"], 51 | "formal_charge": [-1, 0, +1], 52 | "max_n_nodes" : 13, 53 | "job_type" : JOB_TYPE, 54 | "dataset_dir" : f"{DATA_PATH}{DATASET}/", 55 | "restart" : RESTART, 56 | "model" : "GGNN", 57 | "sample_every" : 2, 58 | "init_lr" : 1e-4, 59 | "epochs" : 100, 60 | "batch_size" : 50, 61 | "block_size" : 1000, 62 | "device" : DEVICE, 63 | "n_samples" : 100, 64 | # additional paramaters can be defined here, if different from the "defaults" 65 | # for instance, for "generate" jobs, don't forget to specify "generation_epoch" 66 | # and "n_samples" 67 | } 68 | 69 | 70 | def submit() -> None: 71 | """ 72 | Creates and submits submission script. Uses global variables defined at top 73 | of this file. 74 | """ 75 | check_paths() 76 | 77 | # create an output directory 78 | dataset_output_path = f"{HOME}/GraphINVENT/output_{DATASET}" 79 | tensorboard_path = os.path.join(dataset_output_path, "tensorboard") 80 | if JOBNAME != "": 81 | dataset_output_path = os.path.join(dataset_output_path, JOBNAME) 82 | tensorboard_path = os.path.join(tensorboard_path, JOBNAME) 83 | 84 | os.makedirs(dataset_output_path, exist_ok=True) 85 | os.makedirs(tensorboard_path, exist_ok=True) 86 | print(f"* Creating dataset directory {dataset_output_path}/", flush=True) 87 | 88 | # submit `N_JOBS` separate jobs 89 | jobdir_end_idx = JOBDIR_START_IDX + N_JOBS 90 | for job_idx in range(JOBDIR_START_IDX, jobdir_end_idx): 91 | 92 | # specify and create the job subdirectory if it does not exist 93 | params["job_dir"] = f"{dataset_output_path}/job_{job_idx}/" 94 | params["tensorboard_dir"] = f"{tensorboard_path}/job_{job_idx}/" 95 | 96 | # create the directory if it does not exist already, otherwise raises an 97 | # error, which is good because *might* not want to override data our 98 | # existing directories! 
99 | os.makedirs(params["tensorboard_dir"], exist_ok=True) 100 | try: 101 | job_dir_exists_already = bool( 102 | JOB_TYPE in ["generate", "test"] or FORCE_OVERWRITE 103 | ) 104 | os.makedirs(params["job_dir"], exist_ok=job_dir_exists_already) 105 | print( 106 | f"* Creating model subdirectory {dataset_output_path}/job_{job_idx}/", 107 | flush=True, 108 | ) 109 | except FileExistsError: 110 | print( 111 | f"-- Model subdirectory {dataset_output_path}/job_{job_idx}/ already exists.", 112 | flush=True, 113 | ) 114 | if not RESTART: 115 | continue 116 | 117 | # write the `input.csv` file 118 | write_input_csv(params_dict=params, filename="input.csv") 119 | 120 | # write `submit.sh` and submit 121 | if USE_LLSUB: 122 | print("* Writing submission script.", flush=True) 123 | write_submission_script(job_dir=params["job_dir"], 124 | job_idx=job_idx, 125 | job_type=params["job_type"], 126 | max_n_nodes=params["max_n_nodes"], 127 | cpu_per_task=CPUS_PER_TASK, 128 | python_bin_path=PYTHON_PATH) 129 | 130 | print("* Submitting batch job using LLsub.", flush=True) 131 | subprocess.run(["LLsub", params["job_dir"] + "submit.sh"], 132 | check=True) 133 | else: 134 | print("* Running job as a normal process.", flush=True) 135 | subprocess.run(["ls", f"{PYTHON_PATH}"], check=True) 136 | subprocess.run([f"{PYTHON_PATH}", 137 | f"{GRAPHINVENT_PATH}main.py", 138 | "--job-dir", 139 | params["job_dir"]], 140 | check=True) 141 | 142 | # sleep a few secs before submitting next job 143 | print("-- Sleeping 2 seconds.") 144 | time.sleep(2) 145 | 146 | 147 | def write_input_csv(params_dict : dict, filename : str="params.csv") -> None: 148 | """ 149 | Writes job parameters/hyperparameters in `params_dict` to CSV using the specified 150 | `filename`. 151 | """ 152 | dict_path = params_dict["job_dir"] + filename 153 | 154 | with open(dict_path, "w") as csv_file: 155 | 156 | writer = csv.writer(csv_file, delimiter=";") 157 | for key, value in params_dict.items(): 158 | writer.writerow([key, value]) 159 | 160 | 161 | def write_submission_script(job_dir : str, job_idx : int, job_type : str, max_n_nodes : int, 162 | cpu_per_task : int, python_bin_path : str) -> None: 163 | """ 164 | Writes a submission script (`submit.sh`). 165 | 166 | Args: 167 | ---- 168 | job_dir (str) : Job running directory. 169 | job_idx (int) : Job idx. 170 | job_type (str) : Type of job to run. 171 | max_n_nodes (int) : Maximum number of nodes in dataset. 172 | cpu_per_task (int) : How many CPUs to use per task. 173 | python_bin_path (str) : Path to Python binary to use. 174 | """ 175 | submit_filename = job_dir + "submit.sh" 176 | with open(submit_filename, "w") as submit_file: 177 | submit_file.write("#!/bin/bash\n") 178 | submit_file.write(f"#SBATCH --job-name={job_type}{max_n_nodes}_{job_idx}\n") 179 | submit_file.write(f"#SBATCH --output={job_type}{max_n_nodes}_{job_idx}o\n") 180 | submit_file.write(f"#SBATCH --cpus-per-task={cpu_per_task}\n") 181 | if DEVICE == "cuda": 182 | submit_file.write("#SBATCH --gres=gpu:volta:1\n") 183 | submit_file.write("hostname\n") 184 | submit_file.write("export QT_QPA_PLATFORM='offscreen'\n") 185 | submit_file.write(f"{python_bin_path} {GRAPHINVENT_PATH}main.py --job-dir {job_dir}") 186 | submit_file.write(f" > {job_dir}output.o${{LLSUB_RANK}}\n") 187 | 188 | 189 | def check_paths() -> None: 190 | """ 191 | Checks that paths to Python binary, data, and GraphINVENT are properly 192 | defined before running a job, and tells the user to define them if not. 
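Concretely, with the defaults above (JOB_TYPE "train", max_n_nodes 13, job index 0, 10 CPUs per task, CUDA), `write_submission_script()` below produces a `submit.sh` along these lines, where `<python>` and `<job_dir>` stand in for the configured `PYTHON_PATH` and job directory:

```
#!/bin/bash
#SBATCH --job-name=train13_0
#SBATCH --output=train13_0o
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:volta:1
hostname
export QT_QPA_PLATFORM='offscreen'
<python> ./graphinvent/main.py --job-dir <job_dir> > <job_dir>output.o${LLSUB_RANK}
```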
193 | """ 194 | for path in [PYTHON_PATH, GRAPHINVENT_PATH, DATA_PATH]: 195 | if "path/to/" in path: 196 | print("!!!") 197 | print("* Update the following paths in `submit.py` before running:") 198 | print("-- `PYTHON_PATH`\n-- `GRAPHINVENT_PATH`\n-- `DATA_PATH`") 199 | sys.exit(0) 200 | 201 | if __name__ == "__main__": 202 | submit() 203 | -------------------------------------------------------------------------------- /submit-pre-training.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example submission script for a GraphINVENT training job (distribution- 3 | based training, not fine-tuning/optimization job). This can be used to 4 | pre-train a model before a fine-tuning (via reinforcement learning) job. 5 | 6 | To run, type: 7 | (graphinvent) ~/GraphINVENT$ python submit-pre-training.py 8 | """ 9 | # load general packages and functions 10 | import csv 11 | import sys 12 | import os 13 | from pathlib import Path 14 | import subprocess 15 | import time 16 | import torch 17 | 18 | 19 | # define what you want to do for the specified job(s) 20 | DATASET = "gdb13_1K-debug" # dataset name in "./data/pre-training/" 21 | JOB_TYPE = "train" # "preprocess", "train", "generate", or "test" 22 | JOBDIR_START_IDX = 0 # where to start indexing job dirs 23 | N_JOBS = 1 # number of jobs to run per model 24 | RESTART = False # whether or not this is a restart job 25 | FORCE_OVERWRITE = True # overwrite job directories which already exist 26 | JOBNAME = "example-job-name" # used to create a sub directory 27 | 28 | # if running using SLURM sbatch, specify params below 29 | USE_SLURM = False # use SLURM or not 30 | RUN_TIME = "1-00:00:00" # hh:mm:ss 31 | MEM_GB = 20 # required RAM in GB 32 | 33 | # for SLURM jobs, set partition to run job on (preprocessing jobs run entirely on 34 | # CPU, so no need to request GPU partition; all other job types benefit from running 35 | # on a GPU) 36 | if JOB_TYPE == "preprocess": 37 | PARTITION = "core" 38 | CPUS_PER_TASK = 1 39 | else: 40 | PARTITION = "gpu" 41 | CPUS_PER_TASK = 4 42 | 43 | # set paths here 44 | HOME = str(Path.home()) 45 | PYTHON_PATH = f"{HOME}/miniconda3/envs/graphinvent/bin/python" 46 | GRAPHINVENT_PATH = "./graphinvent/" 47 | DATA_PATH = "./data/pre-training/" 48 | 49 | if torch.cuda.is_available(): 50 | DEVICE = "cuda" 51 | else: 52 | DEVICE = "cpu" 53 | 54 | # define dataset-specific parameters 55 | params = { 56 | "atom_types" : ["C", "N", "O", "S", "Cl"], 57 | "formal_charge": [-1, 0, +1], 58 | "max_n_nodes" : 13, 59 | "job_type" : JOB_TYPE, 60 | "dataset_dir" : f"{DATA_PATH}{DATASET}/", 61 | "restart" : RESTART, 62 | "model" : "GGNN", 63 | "sample_every" : 2, 64 | "init_lr" : 1e-4, 65 | "epochs" : 100, 66 | "batch_size" : 50, 67 | "block_size" : 1000, 68 | "device" : DEVICE, 69 | "n_samples" : 100, 70 | # additional paramaters can be defined here, if different from the "defaults" 71 | # for instance, for "generate" jobs, don't forget to specify "generation_epoch" 72 | # and "n_samples" 73 | } 74 | 75 | 76 | def submit() -> None: 77 | """ 78 | Creates and submits submission script. Uses global variables defined at top 79 | of this file. 
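In practice a dataset is pushed through this script more than once, editing the constants at the top between runs: first "preprocess", then "train", then "generate" (or "test"). A sketch of the sequence, where the epoch and sample count are example values rather than defaults:

```
# run 1: JOB_TYPE = "preprocess"  -> writes the train/valid/test HDF files into dataset_dir
# run 2: JOB_TYPE = "train"       -> distribution-based training using those HDF files
# run 3: JOB_TYPE = "generate"    -> samples graphs from a saved model state; as the
#        comment in `params` above notes, also specify for example:
params["generation_epoch"] = 80     # example value: which saved epoch to sample from
params["n_samples"]        = 1000   # example value: how many graphs to sample
```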
80 | """ 81 | check_paths() 82 | 83 | # create an output directory 84 | dataset_output_path = f"{HOME}/GraphINVENT/output_{DATASET}" 85 | tensorboard_path = os.path.join(dataset_output_path, "tensorboard") 86 | if JOBNAME != "": 87 | dataset_output_path = os.path.join(dataset_output_path, JOBNAME) 88 | tensorboard_path = os.path.join(tensorboard_path, JOBNAME) 89 | 90 | os.makedirs(dataset_output_path, exist_ok=True) 91 | os.makedirs(tensorboard_path, exist_ok=True) 92 | print(f"* Creating dataset directory {dataset_output_path}/", flush=True) 93 | 94 | # submit `N_JOBS` separate jobs 95 | jobdir_end_idx = JOBDIR_START_IDX + N_JOBS 96 | for job_idx in range(JOBDIR_START_IDX, jobdir_end_idx): 97 | 98 | # specify and create the job subdirectory if it does not exist 99 | params["job_dir"] = f"{dataset_output_path}/job_{job_idx}/" 100 | params["tensorboard_dir"] = f"{tensorboard_path}/job_{job_idx}/" 101 | 102 | # create the directory if it does not exist already, otherwise raises an 103 | # error, which is good because *might* not want to override data our 104 | # existing directories! 105 | os.makedirs(params["tensorboard_dir"], exist_ok=True) 106 | try: 107 | job_dir_exists_already = bool( 108 | JOB_TYPE in ["generate", "test"] or FORCE_OVERWRITE 109 | ) 110 | os.makedirs(params["job_dir"], exist_ok=job_dir_exists_already) 111 | print( 112 | f"* Creating model subdirectory {dataset_output_path}/job_{job_idx}/", 113 | flush=True, 114 | ) 115 | except FileExistsError: 116 | print( 117 | f"-- Model subdirectory {dataset_output_path}/job_{job_idx}/ already exists.", 118 | flush=True, 119 | ) 120 | if not RESTART: 121 | continue 122 | 123 | # write the `input.csv` file 124 | write_input_csv(params_dict=params, filename="input.csv") 125 | 126 | # write `submit.sh` and submit 127 | if USE_SLURM: 128 | print("* Writing submission script.", flush=True) 129 | write_submission_script(job_dir=params["job_dir"], 130 | job_idx=job_idx, 131 | job_type=params["job_type"], 132 | max_n_nodes=params["max_n_nodes"], 133 | runtime=RUN_TIME, 134 | mem=MEM_GB, 135 | ptn=PARTITION, 136 | cpu_per_task=CPUS_PER_TASK, 137 | python_bin_path=PYTHON_PATH) 138 | 139 | print("* Submitting job to SLURM.", flush=True) 140 | subprocess.run(["sbatch", params["job_dir"] + "submit.sh"], 141 | check=True) 142 | else: 143 | print("* Running job as a normal process.", flush=True) 144 | subprocess.run(["ls", f"{PYTHON_PATH}"], check=True) 145 | subprocess.run([f"{PYTHON_PATH}", 146 | f"{GRAPHINVENT_PATH}main.py", 147 | "--job-dir", 148 | params["job_dir"]], 149 | check=True) 150 | 151 | # sleep a few secs before submitting next job 152 | print("-- Sleeping 2 seconds.") 153 | time.sleep(2) 154 | 155 | 156 | def write_input_csv(params_dict : dict, filename : str="params.csv") -> None: 157 | """ 158 | Writes job parameters/hyperparameters in `params_dict` to CSV using the specified 159 | `filename`. 160 | """ 161 | dict_path = params_dict["job_dir"] + filename 162 | 163 | with open(dict_path, "w") as csv_file: 164 | 165 | writer = csv.writer(csv_file, delimiter=";") 166 | for key, value in params_dict.items(): 167 | writer.writerow([key, value]) 168 | 169 | 170 | def write_submission_script(job_dir : str, job_idx : int, job_type : str, max_n_nodes : int, 171 | runtime : str, mem : int, ptn : str, cpu_per_task : int, 172 | python_bin_path : str) -> None: 173 | """ 174 | Writes a submission script (`submit.sh`). 175 | 176 | Args: 177 | ---- 178 | job_dir (str) : Job running directory. 179 | job_idx (int) : Job idx. 
180 | job_type (str) : Type of job to run. 181 | max_n_nodes (int) : Maximum number of nodes in dataset. 182 | runtime (str) : Job run-time limit in hh:mm:ss format. 183 | mem (int) : Gigabytes to reserve. 184 | ptn (str) : Partition to use, either "core" (CPU) or "gpu" (GPU). 185 | cpu_per_task (int) : How many CPUs to use per task. 186 | python_bin_path (str) : Path to Python binary to use. 187 | """ 188 | submit_filename = job_dir + "submit.sh" 189 | with open(submit_filename, "w") as submit_file: 190 | submit_file.write("#!/bin/bash\n") 191 | submit_file.write(f"#SBATCH --job-name={job_type}{max_n_nodes}_{job_idx}\n") 192 | submit_file.write(f"#SBATCH --output={job_type}{max_n_nodes}_{job_idx}o\n") 193 | submit_file.write(f"#SBATCH --time={runtime}\n") 194 | submit_file.write(f"#SBATCH --mem={mem}g\n") 195 | submit_file.write(f"#SBATCH --partition={ptn}\n") 196 | submit_file.write("#SBATCH --nodes=1\n") 197 | submit_file.write(f"#SBATCH --cpus-per-task={cpu_per_task}\n") 198 | if ptn == "gpu": 199 | submit_file.write("#SBATCH --gres=gpu:1\n") 200 | submit_file.write("hostname\n") 201 | submit_file.write("export QT_QPA_PLATFORM='offscreen'\n") 202 | submit_file.write(f"{python_bin_path} {GRAPHINVENT_PATH}main.py --job-dir {job_dir}") 203 | submit_file.write(f" > {job_dir}output.o${{SLURM_JOB_ID}}\n") 204 | 205 | 206 | def check_paths() -> None: 207 | """ 208 | Checks that paths to Python binary, data, and GraphINVENT are properly 209 | defined before running a job, and tells the user to define them if not. 210 | """ 211 | for path in [PYTHON_PATH, GRAPHINVENT_PATH, DATA_PATH]: 212 | if "path/to/" in path: 213 | print("!!!") 214 | print("* Update the following paths in `submit.py` before running:") 215 | print("-- `PYTHON_PATH`\n-- `GRAPHINVENT_PATH`\n-- `DATA_PATH`") 216 | sys.exit(0) 217 | 218 | if __name__ == "__main__": 219 | submit() 220 | -------------------------------------------------------------------------------- /tools/README.md: -------------------------------------------------------------------------------- 1 | # Tools 2 | This directory contains various tools for analyzing datasets: 3 | 4 | * [max_n_nodes.py](./max_n_nodes.py): Gets the maximum number of nodes per molecule in a set of molecules. 5 | * [atom_types.py](./atom_types.py) : Gets the atom types present in a set of molecules. 6 | * [formal_charges.py](./formal_charges.py) : Gets the formal charges present in a set of molecules. 7 | * [tdc-create-dataset.py](./tdc-create-dataset.py) : Downloads a dataset, such as ChEMBL or MOSES, from the Therapeutics Data Commons (TDC). 8 | * [submit-split-preprocessing-supercloud.py](./submit-split-preprocessing-supercloud.py) : Example submission script for preprocessing a very large dataset in parallel. 9 | 10 | --- 11 | 12 | To use the first 3 tools in this directory ([max_n_nodes.py](./max_n_nodes.py), [atom_types.py](./atom_types.py), or [formal_charges.py](./formal_charges.py)), first activate the GraphINVENT virtual environment, then run: 13 | 14 | ``` 15 | (graphinvent)$ python {script} --smi path/to/file.smi 16 | ``` 17 | 18 | Simply replace *{script}* by the name of the script e.g. *max_n_nodes.py*, and *path/to/file* with the name of the SMILES file to analyze. 19 | 20 | --- 21 | If you would like to download a dataset such as ChEMBL or MOSES from the TDC and preprocess it slightly (e.g. remove molecular with high formal charges, filter to molecules with <= 80 heavy atoms, etc), then you can use the [tdc-create-dataset.py](./tdc-create-dataset.py) script. 
22 | 23 | To use script to download, for example, the MOSES dataset, run (from within the GraphINVENT environment): 24 | ``` 25 | (graphinvent)$ python tdc-create-dataset.py --dataset MOSES 26 | ``` 27 | 28 | You can change the flag to speficy other datasets available via the TDC. 29 | 30 | Furthermore, you can manually edit the script to do other things you would like (for instance, set the number of heavy atoms and formal charges to filter). 31 | 32 | --- 33 | 34 | In some cases, if you have a really large dataset, it might be easier to preprocess it in pieces (i.e. in parallel on different nodes) rather than all in serial. To do this, you can use the [submit-split-preprocessing-supercloud.py](./submit-split-preprocessing-supercloud.py) script. 35 | 36 | To use it, you will first need to split your dataset by running, **from within an interactive session**, the following command: 37 | ``` 38 | (graphinvent)$ python submit-split-preprocessing-supercloud.py --type split 39 | ``` 40 | 41 | Then, once the dataset has been split, you can submit the separate splits as individual preprocessing jobs as follows: 42 | ``` 43 | (graphinvent)$ python submit-split-preprocessing-supercloud.py --type submit 44 | ``` 45 | 46 | When the above jobs have completed, you can aggregate the generated HDFs for each dataset split into a single HDF in the main dataset dir: 47 | ``` 48 | (graphinvent)$ python submit-split-preprocessing-supercloud.py --type aggregate 49 | ``` -------------------------------------------------------------------------------- /tools/atom_types.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gets the atom types present in a set of molecules. 3 | 4 | To use script, run: 5 | python atom_types.py --smi path/to/file.smi 6 | """ 7 | import argparse 8 | import rdkit 9 | from utils import load_molecules 10 | 11 | 12 | # define the argument parser 13 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 14 | add_help=False) 15 | 16 | # define two potential arguments to use when drawing SMILES from a file 17 | parser.add_argument("--smi", 18 | type=str, 19 | default="data/gdb13_1K/train.smi", 20 | help="SMILES file containing molecules to analyse.") 21 | args = parser.parse_args() 22 | 23 | 24 | def get_atom_types(smi_file : str) -> list: 25 | """ 26 | Determines the atom types present in an input SMILES file. 27 | 28 | Args: 29 | ---- 30 | smi_file (str) : Full path/filename to SMILES file. 31 | """ 32 | molecules = load_molecules(path=smi_file) 33 | 34 | # create a list of all the atom types 35 | atom_types = list() 36 | for mol in molecules: 37 | for atom in mol.GetAtoms(): 38 | atom_types.append(atom.GetAtomicNum()) 39 | 40 | # remove duplicate atom types then sort by atomic number 41 | set_of_atom_types = set(atom_types) 42 | atom_types_sorted = list(set_of_atom_types) 43 | atom_types_sorted.sort() 44 | 45 | # return the symbols, for convenience 46 | return [rdkit.Chem.Atom(atom).GetSymbol() for atom in atom_types_sorted] 47 | 48 | 49 | if __name__ == "__main__": 50 | atom_types = get_atom_types(smi_file=args.smi) 51 | print("* Atom types present in input file:", atom_types, flush=True) 52 | print("Done.", flush=True) 53 | -------------------------------------------------------------------------------- /tools/combine_HDFs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Combines preprocessed HDF files. 
Useful when preprocessing large datasets, as 3 | one can split the `{split}.smi` into multiple files (and directories), preprocess 4 | them separately, and then combine using this script. 5 | 6 | To use script, modify the variables below to automatically create a list of 7 | paths **assuming** HDFs were created with the following directory structure: 8 | data/ 9 | |-- {dataset}_1/ 10 | |-- {dataset}_2/ 11 | |-- {dataset}_3/ 12 | |... 13 | |-- {dataset}_{n_dirs}/ 14 | 15 | The variables are also used in setting the dimensions of the HDF datasets later on. 16 | 17 | If directories were not named as above, then simply replace `path_list` below 18 | with a list of the paths to all the HDFs to combine. 19 | 20 | Then, run: 21 | python combine_HDFs.py 22 | """ 23 | import csv 24 | import numpy as np 25 | import h5py 26 | import torch 27 | from typing import Union 28 | 29 | 30 | def load_ts_properties_from_csv(csv_path : str) -> Union[dict, None]: 31 | """ 32 | Loads CSV file containing training set properties and returns contents as a dictionary. 33 | """ 34 | print("* Loading training set properties.", flush=True) 35 | 36 | # read dictionaries from csv 37 | try: 38 | with open(csv_path, "r") as csv_file: 39 | reader = csv.reader(csv_file, delimiter=";") 40 | csv_dict = dict(reader) 41 | except: 42 | return None 43 | 44 | # fix file types within dict in going from `csv_dict` --> `properties_dict` 45 | properties_dict = {} 46 | for key, value in csv_dict.items(): 47 | 48 | # first determine if key is a tuple 49 | key = eval(key) 50 | if len(key) > 1: 51 | tuple_key = (str(key[0]), str(key[1])) 52 | else: 53 | tuple_key = key 54 | 55 | # then convert the values to the correct data type 56 | try: 57 | properties_dict[tuple_key] = eval(value) 58 | except (SyntaxError, NameError): 59 | properties_dict[tuple_key] = value 60 | 61 | # convert any `list`s to `torch.Tensor`s (for consistency) 62 | if type(properties_dict[tuple_key]) == list: 63 | properties_dict[tuple_key] = torch.Tensor(properties_dict[tuple_key]) 64 | 65 | return properties_dict 66 | 67 | def write_ts_properties_to_csv(ts_properties_dict : dict) -> None: 68 | """ 69 | Writes the training set properties in `ts_properties_dict` to a CSV file. 70 | """ 71 | dict_path = f"data/{dataset}/{split}.csv" 72 | 73 | with open(dict_path, "w") as csv_file: 74 | 75 | csv_writer = csv.writer(csv_file, delimiter=";") 76 | for key, value in ts_properties_dict.items(): 77 | if "validity_tensor" in key: 78 | continue # skip writing the validity tensor because it is really long 79 | elif type(value) == np.ndarray: 80 | csv_writer.writerow([key, list(value)]) 81 | elif type(value) == torch.Tensor: 82 | try: 83 | csv_writer.writerow([key, float(value)]) 84 | except ValueError: 85 | csv_writer.writerow([key, [float(i) for i in value]]) 86 | else: 87 | csv_writer.writerow([key, value]) 88 | 89 | def get_dims() -> dict: 90 | """ 91 | Gets the dims corresponding to the three datasets in each preprocessed HDF 92 | file: "nodes", "edges", and "APDs". 
93 | """ 94 | dims = {} 95 | dims["nodes"] = [max_n_nodes, n_atom_types + n_formal_charges] 96 | dims["edges"] = [max_n_nodes, max_n_nodes, n_bond_types] 97 | dim_f_add = [max_n_nodes, n_atom_types, n_formal_charges, n_bond_types] 98 | dim_f_conn = [max_n_nodes, n_bond_types] 99 | dims["APDs"] = [np.prod(dim_f_add) + np.prod(dim_f_conn) + 1] 100 | 101 | return dims 102 | 103 | def get_total_n_subgraphs(paths : list) -> int: 104 | """ 105 | Gets the total number of subgraphs saved in all the HDF files in the `paths`, 106 | where `paths` is a list of strings containing the path to each HDF file we want 107 | to combine. 108 | """ 109 | total_n_subgraphs = 0 110 | for path in paths: 111 | print("path:", path) 112 | hdf_file = h5py.File(path, "r") 113 | nodes = hdf_file.get("nodes") 114 | n_subgraphs = nodes.shape[0] 115 | total_n_subgraphs += n_subgraphs 116 | hdf_file.close() 117 | 118 | return total_n_subgraphs 119 | 120 | def main(paths : list, training_set : bool) -> None: 121 | """ 122 | Combine many small HDF files (their paths defined in `paths`) into one large HDF file. 123 | """ 124 | total_n_subgraphs = get_total_n_subgraphs(paths) 125 | dims = get_dims() 126 | 127 | print(f"* Creating HDF file to contain {total_n_subgraphs} subgraphs") 128 | new_hdf_file = h5py.File(f"data/{dataset}/{split}.h5", "a") 129 | new_dataset_nodes = new_hdf_file.create_dataset("nodes", 130 | (total_n_subgraphs, *dims["nodes"]), 131 | dtype=np.dtype("int8")) 132 | new_dataset_edges = new_hdf_file.create_dataset("edges", 133 | (total_n_subgraphs, *dims["edges"]), 134 | dtype=np.dtype("int8")) 135 | new_dataset_APDs = new_hdf_file.create_dataset("APDs", 136 | (total_n_subgraphs, *dims["APDs"]), 137 | dtype=np.dtype("int8")) 138 | 139 | print("* Combining data from smaller HDFs into a new larger HDF.") 140 | init_index = 0 141 | for path in paths: 142 | print("path:", path) 143 | hdf_file = h5py.File(path, "r") 144 | 145 | nodes = hdf_file.get("nodes") 146 | edges = hdf_file.get("edges") 147 | APDs = hdf_file.get("APDs") 148 | 149 | n_subgraphs = nodes.shape[0] 150 | 151 | new_dataset_nodes[init_index:(init_index + n_subgraphs)] = nodes 152 | new_dataset_edges[init_index:(init_index + n_subgraphs)] = edges 153 | new_dataset_APDs[init_index:(init_index + n_subgraphs)] = APDs 154 | 155 | init_index += n_subgraphs 156 | hdf_file.close() 157 | 158 | new_hdf_file.close() 159 | 160 | if training_set: 161 | print(f"* Combining data from respective `{split}.csv` files into one.") 162 | csv_list = [f"{path[:-2]}csv" for path in paths] 163 | 164 | ts_properties_old = None 165 | csv_files_processed = 0 166 | for path in csv_list: 167 | ts_properties = load_ts_properties_from_csv(csv_path=path) 168 | ts_properties_new = {} 169 | if ts_properties_old and ts_properties: 170 | for key, value in ts_properties_old.items(): 171 | if type(value) == float: 172 | ts_properties_new[key] = ( 173 | value * csv_files_processed + ts_properties[key] 174 | )/(csv_files_processed + 1) 175 | else: 176 | new_list = [] 177 | for i, value_i in enumerate(value): 178 | new_list.append( 179 | float( 180 | value_i * csv_files_processed + ts_properties[key][i] 181 | )/(csv_files_processed + 1) 182 | ) 183 | ts_properties_new[key] = new_list 184 | else: 185 | ts_properties_new = ts_properties 186 | ts_properties_old = ts_properties_new 187 | csv_files_processed += 1 188 | 189 | write_ts_properties_to_csv(ts_properties_dict=ts_properties_new) 190 | 191 | 192 | if __name__ == "__main__": 193 | # combine the HDFs defined in `path_list` 194 | 195 | # 
set variables 196 | dataset = "ChEMBL" 197 | n_atom_types = 15 # number of atom types used in preprocessing the data 198 | n_formal_charges = 3 # number of formal charges used in preprocessing the data 199 | n_bond_types = 3 # number of bond types used in preprocessing the data 200 | max_n_nodes = 40 # maximum number of nodes in the data 201 | 202 | # combine the training files 203 | n_dirs = 12 # how many times was `{split}.smi` split? 204 | split = "train" # train, test, or valid 205 | path_list = [f"data/{dataset}_{i}/{split}.h5" for i in range(0, n_dirs)] 206 | main(path_list, training_set=True) 207 | 208 | # combine the test files 209 | n_dirs = 4 # how many times was `{split}.smi` split? 210 | split = "test" # train, test, or valid 211 | path_list = [f"data/{dataset}_{i}/{split}.h5" for i in range(0, n_dirs)] 212 | main(path_list, training_set=False) 213 | 214 | # combine the validation files 215 | n_dirs = 2 # how many times was `{split}.smi` split? 216 | split = "valid" # train, test, or valid 217 | path_list = [f"data/{dataset}_{i}/{split}.h5" for i in range(0, n_dirs)] 218 | main(path_list, training_set=False) 219 | 220 | print("Done.", flush=True) 221 | -------------------------------------------------------------------------------- /tools/formal_charges.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gets the formal charges present in a set of molecules. 3 | 4 | To use script, run: 5 | python formal_charges.py --smi path/to/file.smi 6 | """ 7 | import argparse 8 | from utils import load_molecules 9 | 10 | 11 | # define the argument parser 12 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 13 | add_help=False) 14 | 15 | # define two potential arguments to use when drawing SMILES from a file 16 | parser.add_argument("--smi", 17 | type=str, 18 | default="data/gdb13_1K/train.smi", 19 | help="SMILES file containing molecules to analyse.") 20 | args = parser.parse_args() 21 | 22 | 23 | def get_formal_charges(smi_file : str) -> list: 24 | """ 25 | Determines the formal charges present in an input SMILES file. 26 | 27 | Args: 28 | ---- 29 | smi_file (str) : Full path/filename to SMILES file. 30 | """ 31 | molecules = load_molecules(path=smi_file) 32 | 33 | # create a list of all the formal charges 34 | formal_charges = list() 35 | for mol in molecules: 36 | for atom in mol.GetAtoms(): 37 | formal_charges.append(atom.GetFormalCharge()) 38 | 39 | # remove duplicate formal charges then sort 40 | set_of_formal_charges = set(formal_charges) 41 | formal_charges_sorted = list(set_of_formal_charges) 42 | formal_charges_sorted.sort() 43 | 44 | return formal_charges_sorted 45 | 46 | 47 | if __name__ == "__main__": 48 | formal_charges = get_formal_charges(smi_file=args.smi) 49 | print("* Formal charges present in input file:", formal_charges, flush=True) 50 | print("Done.", flush=True) 51 | -------------------------------------------------------------------------------- /tools/max_n_nodes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gets the maximum number of nodes per molecule present in a set of molecules. 
3 | 4 | To use script, run: 5 | python max_n_nodes.py --smi path/to/file.smi 6 | """ 7 | import argparse 8 | from utils import load_molecules 9 | 10 | 11 | # define the argument parser 12 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 13 | add_help=False) 14 | 15 | # define two potential arguments to use when drawing SMILES from a file 16 | parser.add_argument("--smi", 17 | type=str, 18 | default="data/gdb13_1K/train.smi", 19 | help="SMILES file containing molecules to analyse.") 20 | args = parser.parse_args() 21 | 22 | 23 | def get_max_n_atoms(smi_file : str) -> int: 24 | """ 25 | Determines the maximum number of atoms per molecule in an input SMILES file. 26 | 27 | Args: 28 | ---- 29 | smi_file (str) : Full path/filename to SMILES file. 30 | """ 31 | molecules = load_molecules(path=smi_file) 32 | 33 | max_n_atoms = 0 34 | for mol in molecules: 35 | n_atoms = mol.GetNumAtoms() 36 | 37 | if n_atoms > max_n_atoms: 38 | max_n_atoms = n_atoms 39 | 40 | return max_n_atoms 41 | 42 | 43 | if __name__ == "__main__": 44 | max_n_atoms = get_max_n_atoms(smi_file=args.smi) 45 | print("* Max number of atoms in input file:", max_n_atoms, flush=True) 46 | print("Done.", flush=True) 47 | -------------------------------------------------------------------------------- /tools/tdc-create-dataset.py: -------------------------------------------------------------------------------- 1 | """ 2 | Uses the Therapeutics Data Commons (TDC) to get datasets for goal-directed 3 | molecular optimization tasks and then filters the molecules based on number 4 | of heavy atoms and formal charge. 5 | 6 | See: 7 | * https://tdcommons.ai/ 8 | * https://github.com/mims-harvard/TDC 9 | 10 | To use script, run: 11 | (graphinvent)$ python tdc-create-dataset.py --dataset MOSES 12 | """ 13 | import os 14 | import argparse 15 | from pathlib import Path 16 | import shutil 17 | from tdc.generation import MolGen 18 | import rdkit 19 | from rdkit import Chem 20 | 21 | # define the argument parser 22 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 23 | add_help=False) 24 | 25 | # define two potential arguments to use when drawing SMILES from a file 26 | parser.add_argument("--dataset", 27 | type=str, 28 | default="ChEMBL", 29 | help="Specifies the dataset to use for creating the data. Options " 30 | "are: 'ChEMBL', 'MOSES', or 'ZINC'.") 31 | args = parser.parse_args() 32 | 33 | 34 | def save_smiles(smi_file : str, smi_list : list) -> None: 35 | """Saves input list of SMILES to the specified file path.""" 36 | smi_writer = rdkit.Chem.rdmolfiles.SmilesWriter(smi_file) 37 | for smi in smi_list: 38 | try: 39 | mol = rdkit.Chem.MolFromSmiles(smi[0]) 40 | if mol.GetNumAtoms() < 81: # filter out molecules with >= 81 atoms 41 | save = True 42 | for atom in mol.GetAtoms(): 43 | if atom.GetFormalCharge() not in [-1, 0, +1]: # filter out molecules with large formal charge 44 | save = False 45 | break 46 | if save: 47 | smi_writer.write(mol) 48 | except: # likely TypeError or AttributeError e.g. 
"smi[0]" is "nan" 49 | continue 50 | smi_writer.close() 51 | 52 | 53 | if __name__ == "__main__": 54 | print(f"* Loading {args.dataset} dataset using the TDC.") 55 | data = MolGen(name=args.dataset) 56 | split = data.get_split() 57 | HOME = str(Path.home()) 58 | DATA_PATH = f"./data/{args.dataset}/" 59 | try: 60 | os.mkdir(DATA_PATH) 61 | print(f"-- Creating dataset at {DATA_PATH}") 62 | except FileExistsError: 63 | shutil.rmtree(DATA_PATH) 64 | os.mkdir(DATA_PATH) 65 | print(f"-- Removed old directory at {DATA_PATH}") 66 | print(f"-- Creating new dataset at {DATA_PATH}") 67 | 68 | print(f"* Re-saving {args.dataset} dataset in a format GraphINVENT can parse.") 69 | print("-- Saving training data...") 70 | save_smiles(smi_file=f"{DATA_PATH}train.smi", smi_list=split["train"].values) 71 | print("-- Saving testing data...") 72 | save_smiles(smi_file=f"{DATA_PATH}test.smi", smi_list=split["test"].values) 73 | print("-- Saving validation data...") 74 | save_smiles(smi_file=f"{DATA_PATH}valid.smi", smi_list=split["valid"].values) 75 | 76 | # # delete the raw downloaded files 77 | # dir_path = "./data/" 78 | # shutil.rmtree(dir_path) 79 | print("Done.", flush=True) 80 | -------------------------------------------------------------------------------- /tools/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Miscellaneous functions. 3 | """ 4 | import rdkit 5 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier 6 | 7 | 8 | def load_molecules(path : str) -> rdkit.Chem.rdmolfiles.SmilesMolSupplier: 9 | """ 10 | Reads a SMILES file (full path/filename specified by `path`) and returns the 11 | `rdkit.Mol` object "supplier". 12 | """ 13 | # check first line of SMILES file to see if contains header 14 | with open(path) as smi_file: 15 | first_line = smi_file.readline() 16 | has_header = bool("SMILES" in first_line) 17 | smi_file.close() 18 | 19 | # read file 20 | molecule_set = SmilesMolSupplier(path, sanitize=True, nameColumn=-1, titleLine=has_header) 21 | 22 | return molecule_set 23 | -------------------------------------------------------------------------------- /tutorials/0_setting_up_environment.md: -------------------------------------------------------------------------------- 1 | ## Setting up the environment 2 | Before doing anything with GraphINVENT, you will need to configure the GraphINVENT virtual environment, as the code is dependent on very specific versions of packages. You can use [conda](https://docs.conda.io/en/latest/) for this. 3 | 4 | The [../environments/graphinvent.yml](../environments/graphinvent.yml) file lists all the packages required for GraphINVENT to run. From within the [GraphINVENT/](../) directory, a virtual environment can be easily created using the YAML file and conda by typing into the terminal: 5 | 6 | ``` 7 | conda env create -f environments/graphinvent.yml 8 | ``` 9 | 10 | Then, to activate the environment: 11 | 12 | ``` 13 | conda activate graphinvent 14 | ``` 15 | 16 | To install additional packages to the virtual environment, should the need arise, use: 17 | 18 | ``` 19 | conda install -n graphinvent {package_name} 20 | ``` 21 | 22 | To save an updated environment as a YAML file using conda, use: 23 | 24 | ``` 25 | conda env export > path/to/environment.yml 26 | ``` 27 | 28 | And that's it! To learn how to start training models, go to [1_introduction](1_introduction.md). 
29 | -------------------------------------------------------------------------------- /tutorials/1_introduction.md: -------------------------------------------------------------------------------- 1 | ## Introduction to GraphINVENT 2 | As shown in our recent [publication](https://chemrxiv.org/articles/preprint/Graph_Networks_for_Molecular_Design/12843137/1), GraphINVENT can be used to learn the structure and connectivity of sets of molecular graphs, thus making it a promising tool for the generation of molecules resembling an input dataset. As models in GraphINVENT are probabilistic, they can be used to discover new molecules that are not present in the training set. 3 | 4 | There are six GNN-based models implemented in GraphINVENT: the MNN, GGNN, AttGGNN, S2V, AttS2V, and EMN models. The GGNN has shown the best performance when weighed against the computational time required for training, and is therefore used as the default model. 5 | 6 | To begin using GraphINVENT, we have prepared the following tutorial to guide a new user through the molecular generation workflow using a small example dataset. The example dataset is a 1K random subset of GDB-13. It has already been preprocessed, so you can use it directly for Training and Generation, as we will show in this tutorial. If this is too simple and you would like to learn how to train GraphINVENT models using a new molecular dataset, see [2_using_a_new_dataset](./2_using_a_new_dataset.md). 7 | 8 | ### Training using the example dataset 9 | #### Preparing a training job 10 | The example dataset is located in [../data/pre-training/gdb13_1K/](../data/pre-training/gdb13_1K/) and contains the following: 11 | * 1K molecules in each of the training, validation, and test sets 12 | * atom types : {C, N, O, S, Cl} 13 | * formal charges : {-1, 0, +1} 14 | * max num nodes : 13 (it is a subset of GDB-13). 15 | 16 | The last three points of information must be included in the submission script, as well as any additional parameters and hyperparameters to use for the training job. 17 | 18 | A sample submission script [submit-pre-training.py](../submit-pre-training.py) has been provided. Begin by modifying the submission script to specify where the dataset can be found and what type of job you want to run. For training on the example set, the settings below are recommended: 19 | 20 | ``` 21 | submit.py > 22 | # define what you want to do for the specified job(s) 23 | dataset = "gdb13_1K" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ 24 | job_type = "train" # this tells the code that this is a training job 25 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 26 | n_jobs = 1 # if you want to run multiple jobs (e.g. for collecting statistics), set this to >1 27 | restart = False # this tells the code that this is not a restart job 28 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 29 | jobname = "example" # this is the name of the job, to be used in labeling directories where output will be written 30 | ``` 31 | 32 | Then, specify whether you want the job to run using [SLURM](https://slurm.schedmd.com/overview.html). In the example below, we specify that we want the job to run as a regular process (i.e. no SLURM). In such cases, any specified run time and memory requirements will be ignored by the script.
Note: if you want to use a different scheduler, this can be easily changed in the submission script (search for "sbatch" and change it to your scheduler's submission command). 33 | 34 | ``` 35 | submit.py > 36 | # if running using SLURM, specify the parameters below 37 | use_slurm = False # this tells the code to NOT use SLURM 38 | run_time = "1-00:00:00" # d-hh:mm:ss (will be ignored here) 39 | mem_GB = 20 # memory in GB (will be ignored here) 40 | ``` 41 | 42 | Then, specify the path to the Python binary in the GraphINVENT virtual environment. You probably won't need to change *graphinvent_path* or *data_path*, unless you want to run the code from a different directory. 43 | 44 | ``` 45 | submit.py > 46 | # set paths here 47 | python_path = f"../miniconda3/envs/graphinvent/bin/python" # this is the path to the Python binary to use (change to your own) 48 | graphinvent_path = f"./graphinvent/" # this is the directory containing the source code 49 | data_path = f"./data/" # this is the directory where all datasets are found 50 | ``` 51 | 52 | Finally, details regarding the specific dataset and parameters you want to use need to be entered. If they are not specified in *submit.py* before running, the model will use the default values in [./graphinvent/parameters/defaults.py](./graphinvent/parameters/defaults.py), but it is not always the case that the "default" values will work well for your dataset. The models are sensitive to the hyperparameters used for each dataset, especially the learning rate and learning rate decay. For the example dataset, the following parameters are recommended: 53 | 54 | ``` 55 | submit.py > 56 | # define dataset-specific parameters 57 | params = { 58 | "atom_types": ["C", "N", "O", "S", "Cl"], 59 | "formal_charge": [-1, 0, +1], 60 | "max_n_nodes": 13, 61 | "job_type": job_type, 62 | "dataset_dir": f"{data_path}{dataset}/", 63 | "restart": restart, 64 | "model": "GGNN", 65 | "sample_every": 10, 66 | "init_lr": 1e-4, # (!) 67 | "epochs": 400, 68 | "batch_size": 1000, 69 | "block_size": 100000, 70 | } 71 | ``` 72 | 73 | Above, (!) indicates that a parameter is strongly dependent on the dataset used. Note that, depending on your system, you might need to tune the mini-batch and/or block size so as to reduce/increase the memory requirement for training jobs. There is an inverse relationship between the batch size and the time required to train a model. As such, only reduce the batch size if necessary, as decreasing the batch size will lead to noticeably slower training. 74 | 75 | At this point, you are done editing the *submit.py* file and are ready to submit a training job. 76 | 77 | #### Running a training job 78 | Using the prepared *submit.py*, you can run a GraphINVENT training job from the terminal using the following command: 79 | 80 | ``` 81 | (graphinvent)$ python submit.py 82 | ``` 83 | 84 | Note that for the code to run, you need to have configured and activated the GraphINVENT environment (see [0_setting_up_environment](0_setting_up_environment.md) for help with this). 85 | 86 | As the models are training, you should see the progress bar updating on the terminal every epoch. The training status will be saved every epoch to the job directory, *output_{dataset}/{jobname}/job_{jobdir_start_idx}/*, which should be *output_gdb13_1K/example/job_0/* if you followed the settings above. Additionally, the evaluation scores will be saved every evaluation epoch to the job directory. 
Among the files written to this directory will be: 87 | 88 | * *generation.log*, containing various evaluation metrics for the generated set, calculated during evaluation epochs 89 | * *convergence.log*, containing the loss and learning rate for every epoch 90 | * *validation.log*, containing model scores (e.g. NLLs, UC-JSD), calculated during evaluation epochs 91 | * *model_restart_{epoch}.pth*, which are the model states for use in restarting jobs, or running generation/validation jobs with a trained model 92 | * *generation/*, a directory containing structures generated during evaluation epochs (\*.smi), as well as information on each structure's NLL (\*.nll) and validity (\*.valid) 93 | 94 | It is good to check the *generation.log* to verify that the generated set features indeed converge to those of the training set (first entry). If they do not then something is wrong (most likely bad hyperparameters). Furthermore, it is good to check the *convergence.log* to make sure the loss is smoothly decreasing during training. 95 | 96 | #### Restarting a training job 97 | If for any reason you want to restart a training job from a previous epoch (e.g. you cancelled a training job before it reached convergence), then you can do this by setting *restart = True* in *submit.py* and rerunning. While it is possible to change certain parameters in *submit.py* before rerunning (e.g. *init_lr* or *epochs*), parameters related to the model should not be changed, as the program will load an existing model from the last saved *model_restart_{epoch}.pth* file (hence there will be a mismatch between the previous parameters and those you changed). Similarly, any settings related to the file location or job name should not be changed, as the program uses those settings to search in the right directory for the previously saved model. Finally, parameters related to the dataset (e.g. *atom_types*) should not be changed, not only for a restart job but throughout the entire workflow of a dataset. If you want to use different features in the node and edge feature representations, you will have to create a copy of the dataset in [../data/](../data/), give it a unique name, and preprocess it using the desired settings. 98 | 99 | ### Generation using a trained model 100 | #### Running a generation job 101 | Once you have trained a model, you can use a saved state (e.g. *model_restart_400.pth*) to generate molecules. To do this, *submit.py* needs to be updated to specify a generation job. The first setting that needs to be changed is the *job_type*; all other settings here should be kept fixed so that the program can find the correct job directory: 102 | 103 | ``` 104 | submit.py > 105 | # define what you want to do for the specified job(s) 106 | dataset = "gdb13_1K" 107 | job_type = "generate" # this tells the code that this is a generation job 108 | jobdir_start_idx = 0 109 | n_jobs = 1 110 | restart = False 111 | force_overwrite = False 112 | jobname = "example" 113 | ``` 114 | 115 | You will then need to update the *generation_epoch* and *n_samples* parameters in *submit.py*: 116 | 117 | ``` 118 | submit.py > 119 | # define dataset-specific parameters 120 | params = { 121 | "atom_types": ["C", "N", "O", "S", "Cl"], 122 | "formal_charge": [-1, 0, +1], 123 | "max_n_nodes": 13, 124 | "job_type": job_type, 125 | "dataset_dir": f"{data_path}{dataset}/", 126 | "restart": restart, 127 | "model": "GGNN", 128 | "sample_every": 10, 129 | "init_lr": 1e-4, # (!) 
130 | "epochs": 400, 131 | "batch_size": 1000, 132 | "block_size": 100000, 133 | "generation_epoch": 400, # <-- which model to use (i.e. which epoch) 134 | "n_samples": 30000, # <-- how many structures to generate 135 | } 136 | ``` 137 | 138 | The *generation_epoch* should correspond to the saved model state that you want to use for generation. In the example above, the parameters specify that the model saved at Epoch 400 should be used to generate 30,000 molecules. All other parameters should be kept the same (if they are related to training, such as *epochs* or *init_lr*, they will be ignored during generation jobs). 139 | 140 | Structures will be generated in batches of size *batch_size*. If you encounter memory problems during generation jobs, reducing the batch size should once again solve them. Generated structures, along with their corresponding metadata, will be written to the *generation/* directory within the existing job directory. These files are: 141 | 142 | * *epochGEN{generation_epoch}_{batch}.smi*, containing molecules generated at the epoch specified 143 | * *epochGEN{generation_epoch}_{batch}.nll*, containing their respective NLLs 144 | * *epochGEN{generation_epoch}_{batch}.valid*, containing their respective validity (0: invalid, 1: valid) 145 | 146 | Additionally, the *generation.log* file will be updated with the various evaluation metrics for the generated structures. 147 | 148 | If you've followed the tutorial up to here, it means you can successfully create new molecules using a trained GNN-based model. 149 | 150 | #### (Optional) Postprocessing 151 | 152 | To make things more convenient for any subsequent analyses, you can concatenate all structures generated in different batches into one file using: 153 | 154 | ``` 155 | for i in epochGEN{generation_epoch}_*.smi; do cat $i >> epochGEN{generation_epoch}.smi; done 156 | ``` 157 | 158 | Above, *{generation_epoch}* should be replaced with a number corresponding to a valid epoch. You can do similar things for the NLL and validity files, as the rows in those files correspond to the rows in the SMILES files. 159 | 160 | Note that "Xe" and empty graphs may appear in the generated structures, even if the models are well-trained, as there is always a small probability of sampling invalid actions. If you do not want to include invalid entries in your analysis, these can be filtered out by typing: 161 | 162 | ``` 163 | sed -i "/Xe/d" path/to/file.smi # remove "Xe" placeholders from file 164 | sed -i "/^ [0-9]\+$/d" path/to/file.smi # remove empty graphs from file 165 | ``` 166 | 167 | See [3_visualizing_molecules](./3_visualizing_molecules.md) for examples on how to draw grids of molecules. 168 | 169 | ### Summary 170 | Now you know how to train models and generate structures using the example dataset. However, the example dataset structures are not drug-like, and are therefore not the most interesting to study for drug discovery applications. To learn how to train GraphINVENT models on custom datasets, see [2_using_a_new_dataset](./2_using_a_new_dataset.md). 171 | -------------------------------------------------------------------------------- /tutorials/2_using_a_new_dataset.md: -------------------------------------------------------------------------------- 1 | ## Using a new dataset in GraphINVENT 2 | In this tutorial, you will be guided through the steps of using a new dataset in GraphINVENT. 
3 | 4 | ### Selecting a new dataset 5 | Before getting carried away with the possibilities of molecular graph generative models, it should be clear that the GraphINVENT models are computationally demanding, especially compared to string-based models. As such, you should keep in mind the capabilities of your system when selecting a new dataset to study, such as how much disk space you have available, how much RAM, and how fast is your GPU. 6 | 7 | In our recent [publication](https://chemrxiv.org/articles/preprint/Graph_Networks_for_Molecular_Design/12843137/1), we report the computational requirements for Preprocessing, Training, Generation, and Benchmarking jobs using the various GraphINVENT models. We summarize some of the results here for the largest dataset we trained on: 8 | 9 | | Dataset | Train | Test | Valid | Largest Molecule | Atom Types | Formal Charges | 10 | |---|---|---|---|---|---|---| 11 | | [MOSES](https://github.com/molecularsets/moses/tree/master/data) | 1.5M | 176K | 10K | 27 atoms | {C, N, O, F, S, Cl, Br} | {0} | 12 | 13 | The disk space used by the different splits, before and after preprocessing (using the best parameters from the paper), are as follows: 14 | 15 | | | Train | Test | Valid | 16 | |---|---|---|---| 17 | | Before | 65M | 7.1M | 403K | 18 | | After | 75G | 9.5G | 559M | 19 | 20 | We point this out to emphasize that if you intend to use a large dataset (such as the MOSES dataset), you need to have considerable disk space available. The sizes of these files can be reduced by specifying a larger *group_size* (default: 1000), but increasing the group size will also increase the time required for preprocessing while having a small effect on decreasing the training time. 21 | 22 | Training and Generation jobs using the above dataset generally require <10 GB GPU memory. A model can be fully trained on MOSES after around 5 days of training on a single GPU (using a batch size of 1000). 23 | 24 | When selecting a dataset to study, thus keep in mind that more structures in your dataset means 1) more disk space will be required to save processed dataset splits and 2) more computational time will be required for training. The number of structures should not have a significant effect on the RAM requirements of a job, as this can be controlled by the batch and block sizes used. However, the number of atom types present in the dataset will have an effect on the memory and disk space requirements of a job, as this is directly correlated to the sizes of the node and edge features tensors, as well as the sizes of the APDs. As such, you might not want to use the entire periodic table in your generative models. 25 | 26 | Finally, as all molecules are padded up to the size of the largest graph in the dataset during Preprocessing jobs, if you have a dataset where most molecules have fewer nodes than *N*, and you have only a few structures where the number of nodes is >>*N*, a good strategy to reduce the computational requirements for this dataset would be to simply remove all molecules with >*N* nodes. The same thing could be said for the atom types and formal charges. We recommend to only keep any "outliers" in a dataset if they are deemed essential. 
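If you decide to trim such outliers, a simple RDKit pre-filter applied to your SMILES files before preprocessing is usually enough. The sketch below is only an illustration of the idea; the size cutoff, allowed atom types, allowed formal charges, and file names are placeholders that you would replace with values appropriate for your own dataset:

```
filter_outliers.py >
# illustrative pre-filter; adjust the cutoffs and file names to your dataset
from rdkit import Chem

MAX_N_NODES     = 40                                     # placeholder: max heavy atoms to keep
ALLOWED_ATOMS   = {"C", "N", "O", "F", "S", "Cl", "Br"}  # placeholder atom types
ALLOWED_CHARGES = {-1, 0, 1}                             # placeholder formal charges

def keep(mol):
    """Returns True if the molecule respects the chosen size/type/charge limits."""
    if mol is None or mol.GetNumAtoms() > MAX_N_NODES:
        return False
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ALLOWED_ATOMS or atom.GetFormalCharge() not in ALLOWED_CHARGES:
            return False
    return True

supplier = Chem.SmilesMolSupplier("train.smi", sanitize=True, nameColumn=-1, titleLine=False)
with open("train_filtered.smi", "w") as out_file:
    for mol in supplier:
        if keep(mol):
            out_file.write(Chem.MolToSmiles(mol) + "\n")
```

Anything you filter out here is something the model never has to represent in its node features or APDs, which is exactly where the savings in disk space and memory come from.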
27 | 28 | To summarize, 29 | 30 | Increases disk space requirement: 31 | * more molecules in dataset 32 | * more atom types present in dataset 33 | * more formal charges present 34 | * larger molecules in dataset (really, larger *max_n_nodes*) 35 | * smaller group size 36 | 37 | Increases RAM: 38 | * using a larger batch size 39 | * using a larger block size 40 | 41 | Increases run time: 42 | * more molecules in dataset 43 | * using a smaller batch size 44 | * larger group size (Preprocessing jobs only) 45 | 46 | Hopefully these guidelines help you in selecting an appropriate dataset to study using GraphINVENT. 47 | 48 | ### Preparing a new dataset 49 | Once you have selected a dataset to study, you must prepare it so that it agrees with the format expected by the program. GraphINVENT expects, for each dataset, three splits in SMILES format. Each split should be named as follows: 50 | 51 | * *train.smi* 52 | * *test.smi* 53 | * *valid.smi* 54 | 55 | These should contain the training set, test set, and validation set, respectively. It is not important for the SMILES to be canonical, and it also does not matter if the file has a header or not. How many structures you put in each split is also up to you (generally the training set is larger than the testing and validation set). 56 | 57 | You should then create a new directory in [../data/](../data/) where the name of this directory corresponds to a unique name for your dataset: 58 | 59 | ``` 60 | mkdir path/to/GraphINVENT/data/your_dataset_name/ 61 | mv train.smi valid.smi test.smi path/to/GraphINVENT/data/your_dataset_name/. 62 | ``` 63 | 64 | You will want to replace *your_dataset_name* above with the actual name for your dataset (e.g. *ChEMBL_subset*, *DRD2_actives*, etc). 65 | 66 | 67 | ### Preprocessing the new dataset 68 | Once you have prepared your dataset in the aforementioned format, you can move on to preprocessing it using GraphINVENT. To preprocess it, you will need to know the following information: 69 | 70 | * *max_n_nodes* 71 | * *atom_types* 72 | * *formal_charge* 73 | 74 | We have provided a few scripts to help you calculate these properties in [../tools/](../tools/). 75 | 76 | Once you know these values, you can move on to preparing a submission script. A sample submission script [../submit.py](../submit.py) has been provided. Begin by modifying the submission script to specify where the dataset can be found and what type of job you want to run. 
For preprocessing a new dataset, you can use the settings below, substituting in your own values where necessary: 77 | 78 | ``` 79 | submit.py > 80 | # define what you want to do for the specified job(s) 81 | dataset = "your_dataset_name" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ 82 | job_type = "preprocess" # this tells the code that this is a preprocessing job 83 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 84 | n_jobs = 1 # if you want to run multiple jobs (not recommended for preprocessing), set this to >1 85 | restart = False # this tells the code that this is not a restart job 86 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 87 | jobname = "preprocess" # this is the name of the job, to be used in labeling directories where output will be written 88 | ``` 89 | 90 | Then, specify whether you want the job to run using [SLURM](https://slurm.schedmd.com/overview.html). In the example below, we specify that we want the job to run as a regular process (i.e. no SLURM). In such cases, any specified run time and memory requirements will be ignored by the script. Note: if you want to use a different scheduler, this can be easily changed in the submission script (search for "sbatch" and change it to your scheduler's submission command). 91 | 92 | ``` 93 | submit.py > 94 | # if running using SLURM, specify the parameters below 95 | use_slurm = False # this tells the code to NOT use SLURM 96 | run_time = "1-00:00:00" # d-hh:mm:ss (will be ignored here) 97 | mem_GB = 20 # memory in GB (will be ignored here) 98 | ``` 99 | 100 | Then, specify the path to the Python binary in the GraphINVENT virtual environment. You probably won't need to change *graphinvent_path* or *data_path*, unless you want to run the code from a different directory. 101 | 102 | ``` 103 | submit.py > 104 | # set paths here 105 | python_path = f"{home}/miniconda3/envs/graphinvent/bin/python" # this is the path to the Python binary to use (change to your own) 106 | graphinvent_path = f"./graphinvent/" # this is the directory containing the source code 107 | data_path = f"./data/" # this is the directory where all datasets are found 108 | ``` 109 | 110 | Finally, details regarding the specific dataset you want to use need to be entered: 111 | 112 | ``` 113 | submit.py > 114 | # define dataset-specific parameters 115 | params = { 116 | "atom_types": ["C", "N", "O", "S", "Cl"], # <-- change to your dataset's atom types 117 | "formal_charge": [-1, 0, +1], # <-- change to your dataset's formal charges 118 | "chirality": ["None", "R", "S"], # <-- ignored, unless you also specify `use_chirality`=True 119 | "max_n_nodes": 13, # <-- change to your dataset's value 120 | "job_type": job_type, 121 | "dataset_dir": f"{data_path}{dataset}/", 122 | "restart": restart, 123 | } 124 | ``` 125 | 126 | At this point, you are done editing the *submit.py* file and are ready to submit a preprocesing job. 
You can submit the job from the terminal using the following command: 127 | 128 | ``` 129 | (graphinvent)$ python submit.py 130 | ``` 131 | 132 | During preprocessing jobs, the following will be written to the specified *dataset_dir*: 133 | * 3 HDF files (*train.h5*, *valid.h5*, and *test.h5*) 134 | * *preprocessing_params.csv*, containing parameters used in preprocessing the dataset (for later reference) 135 | * *train.csv*, containing training set properties (e.g. histograms of number of nodes per molecule, number of edges per node, etc) 136 | 137 | A preprocessing job can take a few seconds to a few hours to finish, depending on the size of your dataset. Once the preprocessing job is done and you have the above files, you are ready to run a training job using your processed dataset. 138 | 139 | ### Training models using the new dataset 140 | You can modify the same *submit.py* script to instead run a training job using your dataset. Begin by changing the *job_type* and *jobname*; all other settings can be kept the same: 141 | 142 | ``` 143 | submit.py > 144 | # define what you want to do for the specified job(s) 145 | dataset = "your_dataset_name" 146 | job_type = "train" # this tells the code that this is a training job 147 | jobdir_start_idx = 0 148 | n_jobs = 1 149 | restart = False 150 | force_overwrite = False 151 | jobname = "train" # this is the name of the job, to be used in labeling directories where output will be written 152 | ``` 153 | 154 | If you would like to change the SLURM settings, you should do that next, but for this example we will keep them the same. You will then need to specify all parameters that you want to use for training: 155 | 156 | 157 | ``` 158 | submit.py > 159 | # define dataset-specific parameters 160 | params = { 161 | "atom_types": ["C", "N", "O", "S", "Cl"], # change to your dataset's atom types 162 | "formal_charge": [-1, 0, +1], # change to your dataset's formal charges 163 | "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True 164 | "max_n_nodes": 13, # change to your dataset's value 165 | "job_type": job_type, 166 | "dataset_dir": f"{data_path}{dataset}/", 167 | "restart": restart, 168 | "model": "GGNN", # <-- which model to use (GGNN is the default, but showing it here to be explicit) 169 | "sample_every": 2, # <-- how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often) 170 | "init_lr": 1e-4, # <-- tune the initial learning rate if needed 171 | "epochs": 100, # <-- how many epochs you want to train for (you can experiment with this) 172 | "batch_size": 1000, # <-- tune the batch size if needed 173 | "block_size": 100000, # <-- tune the block size if needed 174 | } 175 | ``` 176 | 177 | If any parameters are not specified in *submit.py* before running, the model will use the default values in [../graphinvent/parameters/defaults.py](../graphinvent/parameters/defaults.py), but it is not always the case that the "default" values will work well for your dataset. For instance, the parameters related to the learning rate decay are strongly dependent on the dataset used, and you might have to tune them to get optimal performance using your dataset. Depending on your system, you might also need to tune the mini-batch and/or block size so as to reduce/increase the memory requirement for training jobs. 
178 | 179 | You can then run a GraphINVENT training job from the terminal using the following command: 180 | 181 | ``` 182 | (graphinvent)$ python submit.py 183 | ``` 184 | 185 | As the models are training, you should see the progress bar updating on the terminal every epoch. The training status will be saved every epoch to the job directory, *output_{your_dataset_name}/{jobname}/job_{jobdir_start_idx}/*, which should be *output_{your_dataset_name}/train/job_0/* if you followed the settings above. Additionally, the evaluation scores will be saved every evaluation epoch to the job directory. Among the files written to this directory will be: 186 | 187 | * *generation.log*, containing various evaluation metrics for the generated set, calculated during evaluation epochs 188 | * *convergence.log*, containing the loss and learning rate for every epoch 189 | * *validation.log*, containing model scores (e.g. NLLs, UC-JSD), calculated during evaluation epochs 190 | * *model_restart_{epoch}.pth*, which are the model states for use in restarting jobs, or running generation/validation jobs with a trained model 191 | * *generation/*, a directory containing structures generated during evaluation epochs (\*.smi), as well as information on each structure's NLL (\*.nll) and validity (\*.valid) 192 | 193 | It is good to check the *generation.log* to verify that the generated set features indeed converge to those of the training set (first entry). If they do not, then you will have to tune the hyperparameters to get better performance. Furthermore, it is good to check the *convergence.log* to make sure the loss is smoothly decreasing during training. 194 | 195 | #### Restarting a training job 196 | If for any reason you want to restart a training job from a previous epoch (e.g. you cancelled a training job before it reached convergence), then you can do this by setting *restart = True* in *submit.py* and rerunning. While it is possible to change certain parameters in *submit.py* before rerunning (e.g. *init_lr* or *epochs*), parameters related to the model should not be changed, as the program will load an existing model from the last saved *model_restart_{epoch}.pth* file (hence there will be a mismatch between the previous parameters and those you changed). Similarly, any settings related to the file location or job name should not be changed, as the program uses those settings to search in the right directory for the previously saved model. Finally, parameters related to the dataset (e.g. *atom_types*) should not be changed, not only for a restart job but throughout the entire workflow of a dataset. If you want to use different features in the node and edge feature representations, you will have to create a copy of the dataset in [../data/](../data/), give it a unique name, and preprocess it using the desired settings. 197 | 198 | ### Generating structures using the newly trained models 199 | Once you have trained a model, you can use a saved state (e.g. *model_restart_100.pth*) to generate molecules. To do this, *submit.py* needs to be updated to specify a generation job. 
The first setting that needs to be changed is the *job_type*; all other settings here should be kept fixed so that the program can find the correct job directory: 200 | 201 | ``` 202 | submit.py > 203 | # define what you want to do for the specified job(s) 204 | dataset = "your_dataset_name" 205 | job_type = "generate" # this tells the code that this is a generation job 206 | jobdir_start_idx = 0 207 | n_jobs = 1 208 | restart = False 209 | force_overwrite = False 210 | jobname = "train" # don't change the jobname, or the program won't find the saved model 211 | ``` 212 | 213 | You will then need to update the *generation_epoch* and *n_samples* parameter in *submit.py*: 214 | 215 | ``` 216 | submit.py > 217 | # define dataset-specific parameters 218 | params = { 219 | "atom_types": ["C", "N", "O", "S", "Cl"], # change to your dataset's atom types 220 | "formal_charge": [-1, 0, +1], # change to your dataset's formal charges 221 | "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True 222 | "max_n_nodes": 13, # change to your dataset's value 223 | "job_type": job_type, 224 | "dataset_dir": f"{data_path}{dataset}/", 225 | "restart": restart, 226 | "model": "GGNN", 227 | "sample_every": 2, # how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often) 228 | "init_lr": 1e-4, # tune the initial learning rate if needed 229 | "epochs": 100, # how many epochs you want to train for (you can experiment with this) 230 | "batch_size": 1000, # <-- tune the batch size if needed 231 | "block_size": 100000, # tune the block size if needed 232 | "generation_epoch": 100, # <-- specify which saved model (i.e. at which epoch) to use for training) 233 | "n_samples": 30000, # <-- specify how many structures you want to generate 234 | } 235 | ``` 236 | 237 | The *generation_epoch* should correspond to the saved model state that you want to use for generation, and *n_samples* tells the program how many structures you want to generate. In the example above, the parameters specify that the model saved at Epoch 100 should be used to generate 30,000 structures. All other parameters should be kept the same (if they are related to training, such as *epochs* or *init_lr*, they will be ignored during generation jobs). 238 | 239 | Structures will be generated in batches of size *batch_size*. If you encounter memory problems during generation jobs, reducing the batch size should once again solve them. Generated structures, along with their corresponding metadata, will be written to the *generation/* directory within the existing job directory. These files are: 240 | 241 | * *epochGEN{generation_epoch}_{batch}.smi*, containing molecules generated at the epoch specified 242 | * *epochGEN{generation_epoch}_{batch}.nll*, containing their respective NLLs 243 | * *epochGEN{generation_epoch}_{batch}.valid*, containing their respective validity (0: invalid, 1: valid) 244 | 245 | Additionally, the *generation.log* file will be updated with the various evaluation metrics for the generated structures. 246 | 247 | If you've followed the tutorial up to here, it means you can successfully create new molecules using a GNN-based model trained on a custom dataset. 
248 | 249 | #### (Optional) Postprocessing 250 | 251 | To make things more convenient for any subsequent analyses, you can concatenate all structures generated in different batches into one file using: 252 | 253 | ``` 254 | for i in epochGEN{generation_epoch}_*.smi; do cat $i >> epochGEN{generation_epoch}.smi; done 255 | ``` 256 | 257 | Above, *{generation_epoch}* should be replaced with a number corresponding to a valid epoch. You can do similar things for the NLL and validity files, as the rows in those files correspond to the rows in the SMILES files. 258 | 259 | Note that "Xe" and empty graphs may appear in the generated structures, even if the models are well-trained, as there is always a small probability of sampling invalid actions. If you do not want to include invalid entries in your analysis, these can be filtered out by typing: 260 | 261 | ``` 262 | sed -i "/Xe/d" path/to/file.smi # remove "Xe" placeholders from file 263 | sed -i "/^ [0-9]\+$/d" path/to/file.smi # remove empty graphs from file 264 | ``` 265 | 266 | See [3_visualizing_molecules](./3_visualizing_molecules.md) for examples on how to draw grids of molecules. 267 | 268 | ### A note about hyperparameters 269 | If you've reached this part of the tutorial, you now have a good idea of how to train GraphINVENT models on custom datasets. Nonetheless, as hinted above, some hyperparameters are highly dependent on the dataset used, and you may have to do some hyperparameter tuning to obtain the best performance using your specific dataset. In particular, parameters related to the learning rate decay are sensitive to the dataset, so a bit of experimentation here is recommended when using a new dataset as these parameters can make a difference between an "okay" model and a well-trained model. These parameters are: 270 | 271 | * *init_lr* 272 | * *min_rel_lr* 273 | * *lrdf* 274 | * *lrdi* 275 | 276 | If any parameters are not specified in the submission script, the program will use the default values from [../graphinvent/parameters/defaults.py](../graphinvent/parameters/defaults.py). Have a look there if you want to learn more about any additional hyperparameters that may not have been discussed in this tutorial. Note that not all parameters defined in *../graphinvent/parameters/defaults.py* are model-related hyperparameters; many are simply practical parameters and settings, such as the path to the datasets being studied. 277 | 278 | ### Summary 279 | Hopefully you are now able to train models on custom datasets using GraphINVENT. If anything is unclear in this tutorial, or if you have any questions that have not been addressed by this guide, feel free to contact the authors for assistance. Note that a lot of useful information centered about hyperparameter tuning is available in our [technical note](https://chemrxiv.org/articles/preprint/Practical_Notes_on_Building_Molecular_Graph_Generative_Models/12888383/1). 280 | 281 | We look forward to seeing the molecules you've generated using GraphINVENT. 282 | -------------------------------------------------------------------------------- /tutorials/3_visualizing_molecules.md: -------------------------------------------------------------------------------- 1 | ## Visualizing molecules 2 | After generating structures using GraphINVENT, you will almost certainly want to visualize them. Below we provide some examples using RDKit for visualizing the molecules in simple but elegant grids. 
3 | 
4 | ### Drawing a grid of molecules
5 | Assuming you use the trained models to generate thousands (if not more) of molecules, you *probably* don't want to visualize all of them in one massive grid. A more reasonable thing to do is to randomly sample a small subset for visualization.
6 | 
7 | An example script for drawing 100 randomly selected molecules is shown below:
8 | 
9 | ```
10 | example_visualization_script.py >
11 | import math
12 | import random
13 | import rdkit
14 | from rdkit.Chem.Draw import MolsToGridImage
15 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier
16 | 
17 | smi_file = "path/to/file.smi"
18 | 
19 | # load molecules from file
20 | mols = SmilesMolSupplier(smi_file, sanitize=True, nameColumn=-1)
21 | 
22 | n_samples = 100
23 | mols_list = [mol for mol in mols]
24 | mols_sampled = random.sample(mols_list, n_samples)  # sample 100 random molecules to visualize
25 | 
26 | mols_per_row = int(math.sqrt(n_samples))  # make a square grid
27 | 
28 | png_filename = smi_file[:-3] + "png"  # name of PNG file to create
29 | labels = list(range(n_samples))  # label structures with a number
30 | 
31 | # draw the molecules (creates a PIL image)
32 | img = MolsToGridImage(mols=mols_sampled,
33 |                       molsPerRow=mols_per_row,
34 |                       legends=[str(i) for i in labels])
35 | 
36 | img.save(png_filename)
37 | ```
38 | 
39 | Alternatively, you could first randomly sample 100 molecules from your source file, save them in a new file, and draw everything in the new file:
40 | 
41 | ```
42 | shuf -n 100 path/to/file.smi > path/to/file_100_shuffled.smi
43 | ```
44 | 
45 | ```
46 | example_visualization_script_2.py >
47 | import rdkit
48 | from rdkit.Chem.Draw import MolsToGridImage
49 | from rdkit.Chem.rdmolfiles import SmilesMolSupplier
50 | 
51 | smi_file = "path/to/file_100_shuffled.smi"
52 | 
53 | # load molecules from file into a list
54 | mols = [mol for mol in SmilesMolSupplier(smi_file, sanitize=True, nameColumn=-1)]
55 | 
56 | png_filename = smi_file[:-3] + "png"  # name of PNG file to create
57 | labels = list(range(len(mols)))  # label structures with a number
58 | 
59 | # draw the molecules (creates a PIL image)
60 | img = MolsToGridImage(mols=mols,
61 |                       molsPerRow=10,
62 |                       legends=[str(i) for i in labels])
63 | 
64 | img.save(png_filename)
65 | ```
66 | 
67 | ### Filtering out invalid entries
68 | By default, GraphINVENT writes a "Xe" placeholder when an invalid molecular graph is generated, as an invalid molecular graph cannot be converted to a SMILES string for saving. The placeholder is used because the NLL is written for all generated graphs in a separate file, where the same line number in the \*.nll file corresponds to the same line number in the \*.smi file. Similarly, if an empty graph samples an invalid action as its first action, no SMILES can be generated for it, so the corresponding line in the SMILES file contains only the "ID" of the molecule.
69 | 
70 | For visualization, you might be interested in viewing only the valid molecular graphs.
The SMILES for the generated molecules can thus be post-processed as follows to remove empty and invalid entries from a file before visualization: 71 | 72 | ``` 73 | sed -i "/Xe/d" path/to/file.smi # remove "Xe" placeholders from file 74 | sed -i "/^ [0-9]\+$/d" path/to/file.smi # remove empty graphs from file 75 | ``` 76 | -------------------------------------------------------------------------------- /tutorials/4_transfer_learning.md: -------------------------------------------------------------------------------- 1 | ## Transfer learning using GraphINVENT 2 | In this tutorial, you will be guided through the process of generating molecules with targeted properties using transfer learning (TL). 3 | 4 | This tutorial assumes that you have looked through tutorials [1_introduction](./1_introduction.md) and [2_using_a_new_dataset](./2_using_a_new_dataset.md). 5 | 6 | ### Selecting two (or more) datasets 7 | In order to do transfer learning, you must first select two datasets which you would like to work with. The first and (probably) larger dataset should be one that you can use to train your model generally, whereas the second should be one containing (a few) examples of molecules exhibiting the properties you desire in your generated molecules (e.g. known actives). 8 | 9 | When choosing your datasets, first, remember that GraphINVENT models are computationally demanding; I recommend you go back and review the *Selecting a new dataset* guidelines provided in [2_using_a_new_dataset](./2_using_a_new_dataset.md). 10 | 11 | Second, ideally there is some amount of overlap between the structures in your general training set (set 1) and your targeted training set (set 2). If the two sets are totally different, it will be difficult for your model to learn how to apply what it learns from set 1 to set 2. However, they also should not come from the exact same distributions (otherwise, what's the point of doing TL...). 12 | 13 | 14 | ### Preparing a new dataset 15 | Once you have chosen your two datasets, you must prepare them so that they agree with the format expected by the program. GraphINVENT expects, for each dataset, three splits in SMILES format. Each split should be named as follows: 16 | 17 | * *train.smi* 18 | * *test.smi* 19 | * *valid.smi* 20 | 21 | These should contain the training set, test set, and validation set, respectively. It is not important for the SMILES to be canonical, and it also does not matter if the file has a header or not. How many structures you put in each split is also up to you. 22 | 23 | You should then create two new directories in [../data/](../data/), one for each dataset, where the name of each directory corresponds to a unique name for the dataset it contains: 24 | 25 | ``` 26 | mkdir path/to/GraphINVENT/data/set_1/ 27 | ./split_dataset set_1.smi # example script that writes a train.smi, valid.smi, and test.smi from set_1.smi 28 | mv train.smi valid.smi test.smi path/to/GraphINVENT/data/set_1/. 29 | 30 | mkdir path/to/GraphINVENT/data/set_2/ 31 | ./split_dataset set_2.smi # example script that writes a train.smi, valid.smi, and test.smi from set_2.smi 32 | mv train.smi valid.smi test.smi path/to/GraphINVENT/data/set_2/. 33 | 34 | ``` 35 | 36 | You will want to replace *set_1* and *set_2* above with the actual names for your datasets (e.g. *ChEMBL_subset*, *DRD2_actives*, etc). 37 | 38 | 39 | ### Preprocessing the new dataset 40 | Once you have prepared your datasets in the aforementioned format, you can move on to preprocessing them using GraphINVENT. 
To preprocess them, you will need to know the following information: 41 | 42 | * *max_n_nodes* 43 | * *atom_types* 44 | * *formal_charge* 45 | 46 | Be careful to calculate this for BOTH sets, and not just one e.g. if the *max_n_nodes* in set 1 is 38, and the *max_n_nodes* in set 2 is 15, then the *max_n_nodes* for BOTH sets will be 38. Similarly, if the *atom_types* in set 1 are ["C", "N", "O"] and the *atom_types* in set 2 are ["C", "O", "S"], then the *atom_types* for BOTH sets will be ["C", "N", "O", "S"]. Here, the specific order of elements in *atom_types* does not matter, so long as the order is the same for BOTH sets. 47 | 48 | We have provided a few scripts to help you calculate these properties in [../tools/](../tools/). 49 | 50 | Once you know these values, you can move on to preparing a submission script for preprocessing the first dataset. A sample submission script [../submit.py](../submit.py) has been provided. Begin by modifying the submission script to specify where the dataset can be found and what type of job you want to run. For preprocessing a new dataset, you can use the settings below, substituting in your own values where necessary: 51 | 52 | ``` 53 | submit.py > 54 | # define what you want to do for the specified job(s) 55 | dataset = "set 1" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ 56 | job_type = "preprocess" # this tells the code that this is a preprocessing job 57 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 58 | n_jobs = 1 # if you want to run multiple jobs (not recommended for preprocessing), set this to >1 59 | restart = False # this tells the code that this is not a restart job 60 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 61 | jobname = "preprocess" # this is the name of the job, to be used in labeling directories where output will be written 62 | ``` 63 | 64 | Then, specify whether you want the job to run using [SLURM](https://slurm.schedmd.com/overview.html). In the example below, we specify that we want the job to run as a regular process (i.e. no SLURM). In such cases, any specified run time and memory requirements will be ignored by the script. Note: if you want to use a different scheduler, this can be easily changed in the submission script (search for "sbatch" and change it to your scheduler's submission command). 65 | 66 | ``` 67 | submit.py > 68 | # if running using SLURM, specify the parameters below 69 | use_slurm = False # this tells the code to NOT use SLURM 70 | run_time = "1-00:00:00" # d-hh:mm:ss (will be ignored here) 71 | mem_GB = 20 # memory in GB (will be ignored here) 72 | ``` 73 | 74 | Then, specify the path to the Python binary in the GraphINVENT virtual environment. You probably won't need to change *graphinvent_path* or *data_path*, unless you want to run the code from a different directory. 75 | 76 | ``` 77 | submit.py > 78 | # set paths here 79 | python_path = f"../miniconda3/envs/graphinvent/bin/python" # this is the path to the Python binary to use (change to your own) 80 | graphinvent_path = f"./graphinvent/" # this is the directory containing the source code 81 | data_path = f"./data/" # this is the directory where all datasets are found 82 | ``` 83 | 84 | Finally, details regarding the specific dataset you want to use need to be entered. 
Here, you must remember to use the properties for BOTH datasets: 85 | 86 | ``` 87 | submit.py > 88 | # define dataset-specific parameters 89 | params = { 90 | "atom_types": ["C", "N", "O", "S"], # <-- change to your datasets' atom types 91 | "formal_charge": [-1, 0, +1], # <-- change to your datasets' formal charges 92 | "chirality": ["None", "R", "S"], # <-- ignored, unless you also specify `use_chirality`=True 93 | "max_n_nodes": 38, # <-- change to your datasets' value 94 | "job_type": job_type, 95 | "dataset_dir": f"{data_path}{dataset}/", 96 | "restart": restart, 97 | } 98 | ``` 99 | 100 | At this point, you are done editing the *submit.py* file and are ready to submit a preprocesing job. You can submit the job from the terminal using the following command: 101 | 102 | ``` 103 | (graphinvent)$ python submit.py 104 | ``` 105 | 106 | During preprocessing jobs, the following will be written to the specified *dataset_dir*: 107 | * 3 HDF files (*train.h5*, *valid.h5*, and *test.h5*) 108 | * *preprocessing_params.csv*, containing parameters used in preprocessing the dataset (for later reference) 109 | * *train.csv*, containing training set properties (e.g. histograms of number of nodes per molecule, number of edges per node, etc) 110 | 111 | A preprocessing job can take a few seconds to a few hours to finish, depending on the size of your dataset. 112 | 113 | Once you have preprocessed the first dataset, you must go back and preprocess the second dataset. To do this, you can use the same *submit.py* file; simply go back and change the dataset name: 114 | 115 | ``` 116 | submit.py > 117 | # define what you want to do for the specified job(s) 118 | dataset = "set 2" # this is the dataset name, which corresponds to the directory containing the data, located in GraphINVENT/data/ <-- this line changed 119 | job_type = "preprocess" # this tells the code that this is a preprocessing job 120 | jobdir_start_idx = 0 # this is an index used for labeling the first job directory where output will be written 121 | n_jobs = 1 # if you want to run multiple jobs (not recommended for preprocessing), set this to >1 122 | restart = False # this tells the code that this is not a restart job 123 | force_overwrite = False # if `True`, this will overwrite job directories which already exist with this name (recommend `True` only when debugging) 124 | jobname = "preprocess" # this is the name of the job, to be used in labeling directories where output will be written 125 | ``` 126 | 127 | ...and re-run: 128 | 129 | ``` 130 | (graphinvent)$ python submit.py 131 | ``` 132 | 133 | Once you have preprocessed both datasets, you are ready to run a general training job using the first dataset. 134 | 135 | ### Training models generally 136 | You can modify the same *submit.py* script to instead run a training job using the general dataset (set 1). Begin by changing the *job_type* and *jobname*; all other settings can be kept the same: 137 | 138 | ``` 139 | submit.py > 140 | # define what you want to do for the specified job(s) 141 | dataset = "set_1" 142 | job_type = "train" # this tells the code that this is a training job 143 | jobdir_start_idx = 0 144 | n_jobs = 1 145 | restart = False 146 | force_overwrite = False 147 | jobname = "train" # this is the name of the job, to be used in labeling directories where output will be written 148 | ``` 149 | 150 | If you would like to change the SLURM settings, you should do that next, but for this example we will keep them the same. 
You will then need to specify all parameters that you want to use for training: 151 | 152 | 153 | ``` 154 | submit.py > 155 | # define dataset-specific parameters 156 | params = { 157 | "atom_types": ["C", "N", "O", "S"], # change to your datasets' atom types 158 | "formal_charge": [-1, 0, +1], # change to your datasets' formal charges 159 | "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True 160 | "max_n_nodes": 38, # change to your datasets' value 161 | "job_type": job_type, 162 | "dataset_dir": f"{data_path}{dataset}/", 163 | "restart": restart, 164 | "model": "GGNN", # <-- which model to use (GGNN is the default, but showing it here to be explicit) 165 | "sample_every": 2, # <-- how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often) 166 | "init_lr": 1e-4, # <-- tune the initial learning rate if needed 167 | "epochs": 100, # <-- how many epochs you want to train for (you can experiment with this) 168 | "batch_size": 1000, # <-- tune the batch size if needed 169 | "block_size": 100000, # <-- tune the block size if needed 170 | } 171 | ``` 172 | 173 | If any parameters are not specified in *submit.py* before running, the model will use the default values in [../graphinvent/parameters/defaults.py](../graphinvent/parameters/defaults.py), but it is not always the case that the "default" values will work well for your datasets. For instance, the parameters related to the learning rate decay are strongly dependent on the dataset used, and you might have to tune them to get optimal performance using your datasets. Depending on your system, you might also need to tune the mini-batch and/or block size so as to reduce/increase the memory requirement for training jobs. 174 | 175 | You can then run a GraphINVENT training job from the terminal using the following command: 176 | 177 | ``` 178 | (graphinvent)$ python submit.py 179 | ``` 180 | 181 | As the models are training, you should see the progress bar updating on the terminal every epoch. The training status will be saved every epoch to the job directory, *output_{your_dataset_name}/{jobname}/job_{jobdir_start_idx}/*, which should be *output_{your_dataset_name}/train/job_0/* if you followed the settings above. Additionally, the evaluation scores will be saved every evaluation epoch to the job directory. Among the files written to this directory will be: 182 | 183 | * *generation.log*, containing various evaluation metrics for the generated set, calculated during evaluation epochs 184 | * *convergence.log*, containing the loss and learning rate for every epoch 185 | * *validation.log*, containing model scores (e.g. NLLs, UC-JSD), calculated during evaluation epochs 186 | * *model_restart_{epoch}.pth*, which are the model states for use in restarting jobs, or running generation/validation jobs with a trained model 187 | * *generation/*, a directory containing structures generated during evaluation epochs (\*.smi), as well as information on each structure's NLL (\*.nll) and validity (\*.valid) 188 | 189 | It is good to check the *generation.log* to verify that the generated set features indeed converge to those of the training set (first entry). If they do not, then you will have to tune the hyperparameters to get better performance. Furthermore, it is good to check the *convergence.log* to make sure the loss is smoothly decreasing during training. 
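A quick way to do the latter check is to plot the loss recorded in *convergence.log*. The snippet below is just a rough sketch: the exact column layout of *convergence.log* may differ from what is assumed here (epoch in the first column, loss in the second), so inspect the file and adjust the indices if necessary.

```
import matplotlib.pyplot as plt

epochs, losses = [], []
with open("output_set_1/train/job_0/convergence.log") as log_file:  # <-- adjust to your job directory
    for line in log_file:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank/malformed lines
        try:
            epochs.append(float(fields[0]))  # assumed: first column is the epoch
            losses.append(float(fields[1]))  # assumed: second column is the loss
        except ValueError:
            continue  # skip any header lines

plt.plot(epochs, losses)
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.savefig("convergence.png")
```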
190 | 
191 | #### Restarting a training job
192 | If for any reason you want to restart a training job from a previous epoch (e.g. you cancelled a training job before it reached convergence), then you can do this by setting *restart = True* in *submit.py* and rerunning. While it is possible to change certain parameters in *submit.py* before rerunning (e.g. *init_lr* or *epochs*), parameters related to the model should not be changed, as the program will load an existing model from the last saved *model_restart_{epoch}.pth* file (hence there will be a mismatch between the previous parameters and those you changed). Similarly, any settings related to the file location or job name should not be changed, as the program uses those settings to search in the right directory for the previously saved model. Finally, parameters related to the dataset (e.g. *atom_types*) should not be changed, not only for a restart job but throughout the entire workflow of a dataset. If you want to use different features in the node and edge feature representations, you will have to create a copy of the dataset in [../data/](../data/), give it a unique name, and preprocess it using the desired settings.
193 | 
194 | ### Fine-tuning your predictions
195 | By this point, you have trained a model on a general dataset, so it has ideally learned how to form chemically valid compounds. The next step is to fine-tune the model on a smaller set of molecules possessing the molecular properties that we would like our generated molecules to have. To do this, we can resume training from the generally trained model.
196 | 
197 | In practice, this means once again modifying *submit.py*, this time to specify a restart job on the second dataset:
198 | 
199 | ```
200 | submit.py >
201 | # define what you want to do for the specified job(s)
202 | dataset = "set_2" # <-- change this from "set_1" to "set_2"
203 | job_type = "train" # this tells the code that this is a training job
204 | jobdir_start_idx = 0
205 | n_jobs = 1
206 | restart = True # <-- specify a restart job
207 | force_overwrite = False
208 | jobname = "train" # this is the name of the job, to be used in labeling directories where output will be written (don't change this! otherwise GraphINVENT won't find the saved model states)
209 | ```
210 | 
211 | At this point, you can also fine-tune the training parameters, but below we have chosen to keep them all the same (you will have to see what works and what doesn't work for your dataset):
212 | 
213 | ```
214 | submit.py >
215 | # define dataset-specific parameters
216 | params = {
217 |     "atom_types": ["C", "N", "O", "S"], # change to your datasets' atom types
218 |     "formal_charge": [-1, 0, +1], # change to your datasets' formal charges
219 |     "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True
220 |     "max_n_nodes": 38, # change to your datasets' value
221 |     "job_type": job_type,
222 |     "dataset_dir": f"{data_path}{dataset}/",
223 |     "restart": restart,
224 |     "model": "GGNN", # <-- which model to use (GGNN is the default, but showing it here to be explicit)
225 |     "sample_every": 2, # <-- how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often)
226 |     "init_lr": 1e-4, # <-- tune the initial learning rate if needed
227 |     "epochs": 100, # <-- how many epochs you want to train for (you can experiment with this)
228 |     "batch_size": 1000, # <-- tune the batch size if needed
229 |     "block_size": 100000, # <-- tune the block size if needed
230 | }
231 | ```
232 | 
233 | Before submitting, you must also create a new output directory (manually) for set 2, containing the saved model state and the *convergence.log* file from the set 1 job, following the same directory structure as the output directory for set 1:
234 | 
235 | ```
236 | mkdir -p output_set_2/train/job_0/
237 | cp output_set_1/train/job_0/model_restart_100.pth output_set_2/train/job_0/.
238 | cp output_set_1/train/job_0/convergence.log output_set_2/train/job_0/.
239 | ```
240 | 
241 | This is necessary in order for GraphINVENT to successfully find the previous saved model state, containing the "generally" trained model.
242 | 
243 | Once you have done this, you can run the new training job from the terminal using the following command:
244 | 
245 | ```
246 | (graphinvent)$ python submit.py
247 | ```
248 | 
249 | The job will restart from the last saved state, so, for example, if your first training job on set 1 reached Epoch 100, then training on set 2 will resume from the model state saved then; just make sure your settings allow the job to train for additional epochs on set 2 (the generation example below assumes the fine-tuned model reaches Epoch 200).
250 | 
251 | ### Generating structures using the fine-tuned model
252 | Once you have fine-tuned your model, you can use a saved state (e.g. *model_restart_200.pth*) to generate targeted molecules. To do this, *submit.py* needs to be updated to specify a generation job. The first setting that needs to be changed is the *job_type*; all other settings here should be kept fixed so that the program can find the correct job directory:
253 | 
254 | ```
255 | submit.py >
256 | # define what you want to do for the specified job(s)
257 | dataset = "set_2"
258 | job_type = "generate" # this tells the code that this is a generation job
259 | jobdir_start_idx = 0
260 | n_jobs = 1
261 | restart = False
262 | force_overwrite = False
263 | jobname = "train" # don't change the jobname, or the program won't find the saved model
264 | ```
265 | 
266 | You will then need to update the *generation_epoch* and *n_samples* parameters in *submit.py*:
267 | 
268 | ```
269 | submit.py >
270 | # define dataset-specific parameters
271 | params = {
272 |     "atom_types": ["C", "N", "O", "S"], # change to your dataset's atom types
273 |     "formal_charge": [-1, 0, +1], # change to your dataset's formal charges
274 |     "chirality": ["None", "R", "S"], # ignored, unless you also specify `use_chirality`=True
275 |     "max_n_nodes": 38, # change to your dataset's value
276 |     "job_type": job_type,
277 |     "dataset_dir": f"{data_path}{dataset}/",
278 |     "restart": restart,
279 |     "model": "GGNN",
280 |     "sample_every": 2, # how often you want to sample/evaluate your model during training (for larger datasets, we recommend sampling more often)
281 |     "init_lr": 1e-4, # tune the initial learning rate if needed
282 |     "epochs": 200, # how many epochs you want to train for (you can experiment with this)
283 |     "batch_size": 1000, # <-- tune the batch size if needed
284 |     "block_size": 100000, # tune the block size if needed
285 |     "generation_epoch": 200, # <-- specify which saved model (i.e. at which epoch) to use for generation
286 |     "n_samples": 30000, # <-- specify how many structures you want to generate
287 | }
288 | ```
289 | 
290 | The *generation_epoch* should correspond to the saved model state that you want to use for generation, and *n_samples* tells the program how many structures you want to generate. In the example above, the parameters specify that the model saved at Epoch 200 should be used to generate 30,000 structures. All other parameters should be kept the same (if they are related to training, such as *epochs* or *init_lr*, they will be ignored during generation jobs).
291 | 
292 | Structures will be generated in batches of size *batch_size*. If you encounter memory problems during generation jobs, reducing the batch size should once again solve them. Generated structures, along with their corresponding metadata, will be written to the *generation/* directory within the existing job directory. These files are:
293 | 
294 | * *epochGEN{generation_epoch}_{batch}.smi*, containing molecules generated at the epoch specified
295 | * *epochGEN{generation_epoch}_{batch}.nll*, containing their respective NLLs
296 | * *epochGEN{generation_epoch}_{batch}.valid*, containing their respective validity (0: invalid, 1: valid)
297 | 
298 | Additionally, the *generation.log* file will be updated with the various evaluation metrics for the generated structures.
299 | 
300 | If you've followed the tutorial up to here, it means you can successfully create new, targeted molecules using transfer learning.
301 | 
302 | Please see the other tutorials (e.g. [1_introduction](./1_introduction.md) and [2_using_a_new_dataset](./2_using_a_new_dataset.md)) for details on how one can post-process the structures for easy visualization, as well as how one can tune the hyperparameters to improve model performance using the different datasets.
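Before wrapping up, it can also be useful to run a quick sanity check on the fine-tuned model's output, e.g. counting how many generated structures are valid and how many of them already occur in the targeted training set (set 2). Below is a minimal RDKit sketch of such a check; the paths, the epochGEN filename, and the assumption that the first whitespace-separated token on each line is the SMILES are placeholders/assumptions that you should adapt to your own job.

```
from rdkit import Chem

def load_canonical_smiles(path):
    # read a SMILES file and return the set of canonical SMILES that RDKit can parse
    canonical = set()
    with open(path) as smi_file:
        for line in smi_file:
            tokens = line.split()
            if not tokens or "Xe" in tokens[0]:
                continue  # skip blank lines and "Xe" placeholders for invalid graphs
            mol = Chem.MolFromSmiles(tokens[0])
            if mol is not None:
                canonical.add(Chem.MolToSmiles(mol))
    return canonical

generated = load_canonical_smiles("output_set_2/train/job_0/generation/epochGEN200_0.smi")  # <-- adjust
fine_tuning_set = load_canonical_smiles("data/set_2/train.smi")                             # <-- adjust

print(f"unique valid generated structures: {len(generated)}")
print(f"fraction already in the fine-tuning set: {len(generated & fine_tuning_set) / max(len(generated), 1):.2f}")
```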
303 | 
304 | ### Summary
305 | Hopefully you are now able to train models to generate targeted molecules using transfer learning in GraphINVENT. If anything is unclear in this tutorial, or if you have any questions that have not been addressed by this guide, feel free to contact [me](https://github.com/rociomer).
306 | 
--------------------------------------------------------------------------------
 /tutorials/5_benchmarking_with_moses.md:
--------------------------------------------------------------------------------
1 | ## Benchmarking models with MOSES
2 | Models can be easily benchmarked using MOSES. To do this, we recommend reading the MOSES documentation, available at https://github.com/molecularsets/moses. If you want to compare to previously benchmarked models, you will need to train models using the MOSES datasets, available [here](https://github.com/molecularsets/moses/tree/master/data).
3 | 
4 | Once you have a satisfactorily trained model, you can run a Generation job to create 30,000 new structures (see [2_using_a_new_dataset](./2_using_a_new_dataset.md) and follow the instructions using the MOSES dataset). The generated structures can then be used as the generated set in MOSES evaluation jobs.
5 | 
6 | From our experience, MOSES benchmarking jobs require ca. 30 GB of RAM and finish in about an hour.
7 | 
--------------------------------------------------------------------------------
 /tutorials/6_preprocessing_large_datasets.md:
--------------------------------------------------------------------------------
1 | ## Preprocessing large datasets
2 | 
3 | For preprocessing very large datasets (e.g. MOSES, with over 1M structures in the training set), it is recommended to split up the data and preprocess the splits on separate CPUs.
4 | 
5 | Until I get around to adding a proper way to do this in the code, one can do it the hacky way: simply split up the large dataset into many smaller datasets, preprocess them as separate CPU jobs, and then combine the results with a small script at the end.
6 | 
7 | So first, split up the desired SMILES file by running
8 | 
9 | ```
10 | split -l 100000 train.smi
11 | ```
12 | 
13 | The above line assumes that you want to split the training data.
14 | 
15 | Then, place each of the splits in a separate directory in [../data/](../data/), such as *my_dataset_1/train.smi*, making sure to rename each split to "train.smi" from whatever name `split` gives it (e.g. "xaa", "xab", etc.).
16 | 
17 | Then, comment out the lines for preprocessing the validation and test sets in [../graphinvent/Workflow.py](../graphinvent/Workflow.py):
18 | 
19 | ```
20 | # self.preprocess_valid_data()
21 | # self.preprocess_test_data()
22 | ```
23 | 
24 | Finally, set your desired parameters in *submit.py* and run a preprocessing job for each split (within the GraphINVENT conda environment):
25 | 
26 | ```
27 | (graphinvent)$ python submit.py
28 | ```
29 | 
30 | Once all the HDF files are preprocessed, they can be combined using [../tools/combine_HDFs.py](../tools/combine_HDFs.py).
31 | 
32 | Don't forget to uncomment the above lines in *Workflow.py* when you are done.
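As a final note, if you prefer to script the split/rename steps above rather than doing them by hand, something like the following sketch does the same thing (the dataset name *my_dataset* and the 100,000-line chunk size are just examples matching the commands above). Each resulting directory can then be pointed to from *submit.py* for its own preprocessing job, and the resulting HDF files combined with *combine_HDFs.py* as described.

```
from pathlib import Path

source = Path("train.smi")  # <-- the large training split you want to divide
chunk_size = 100_000        # same as `split -l 100000` above

with open(source) as smi_file:
    lines = smi_file.readlines()

for i in range(0, len(lines), chunk_size):
    # e.g. data/my_dataset_1/, data/my_dataset_2/, ... (one small "dataset" per chunk)
    subdir = Path(f"data/my_dataset_{i // chunk_size + 1}")
    subdir.mkdir(parents=True, exist_ok=True)
    with open(subdir / "train.smi", "w") as out_file:
        out_file.writelines(lines[i:i + chunk_size])
```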
33 | 
34 | 
--------------------------------------------------------------------------------
 /tutorials/7_reinforcement_learning.md:
--------------------------------------------------------------------------------
1 | ## Reinforcement learning
2 | TODO
3 | 
--------------------------------------------------------------------------------
 /tutorials/README.md:
--------------------------------------------------------------------------------
1 | # GraphINVENT tutorials
2 | 
3 | ![School vector created by pch.vector - www.freepik.com](https://image.freepik.com/free-vector/online-tutorials-concept_52683-37480.jpg)
4 | 
5 | ## Description
6 | This directory contains guides on how to use GraphINVENT. If viewing in a browser (recommended), simply click the link to the desired tutorial to view it.
7 | 
8 | ## Tutorials
9 | * [0_setting_up_environment](./0_setting_up_environment.md) : Instructions on how to set up the GraphINVENT virtual environment.
10 | * [1_introduction](./1_introduction.md) : A quick introduction to GraphINVENT. Uses the example dataset *gdb13_1K* to guide new users through Training and Generation jobs in the code.
11 | * [2_using_a_new_dataset](./2_using_a_new_dataset.md) : A tutorial on how to use new datasets to train models in GraphINVENT.
12 | * [3_visualizing_molecules](./3_visualizing_molecules.md) : A quick guide on how to visualize grids of molecules using RDKit.
13 | * [4_transfer_learning](./4_transfer_learning.md) : A guide on how to use GraphINVENT for transfer learning tasks.
14 | * [5_benchmarking_with_moses](./5_benchmarking_with_moses.md) : A guide on how to benchmark GraphINVENT models using the MOSES distribution-based benchmarks.
15 | * [6_preprocessing_large_datasets](./6_preprocessing_large_datasets.md) : A guide on how to preprocess large datasets in GraphINVENT.
16 | * [7_reinforcement_learning](./7_reinforcement_learning.md) : A guide on how to fine-tune GraphINVENT models for molecular optimization and *de novo* design tasks. [TODO]
17 | 
18 | ## Comments
19 | If a tutorial doesn't exist for something you'd like to do, contact [me](https://github.com/rociomer) and I'll be happy to create one (if I think others would benefit from it and I have time). Similarly, if you find an error in a tutorial, please let me know so that I can correct it.
20 | 
21 | ## Author
22 | Rocío Mercado
23 | 
--------------------------------------------------------------------------------