├── .gitignore
├── LICENSE
├── README.md
├── environment.yaml
├── extra_requirements.txt
├── notebooks
    ├── qm9.ipynb
    └── visualization_demo.ipynb
├── semlaflow
    ├── __init__.py
    ├── data
    │   ├── __init__.py
    │   ├── datamodules.py
    │   ├── datasets.py
    │   ├── interpolate.py
    │   └── util.py
    ├── evaluate.py
    ├── models
    │   ├── __init__.py
    │   ├── egnn.py
    │   ├── eqgat.py
    │   ├── fm.py
    │   └── semla.py
    ├── predict.py
    ├── preprocess.py
    ├── scriptutil.py
    ├── train.py
    └── util
    │   ├── __init__.py
    │   ├── functional.py
    │   ├── metrics.py
    │   ├── molrepr.py
    │   ├── rdkit.py
    │   └── tokeniser.py
└── tests
    ├── __init__.py
    └── functional.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Basic Gitignore file
 2 | 
 3 | # User specific paths
 4 | datasets/
 5 | wandb/
 6 | output/
 7 | 
 8 | # Editors
 9 | .vscode/
10 | 
11 | # Jupyter notebook checkpoints
12 | notebooks/.ipynb_checkpoints/
13 | 
14 | # Python cache files
15 | __pycache__/
16 | */__pycache__/
17 | **/__pycache__/
18 | *.pyc
19 | 
20 | # Log files
21 | molproc_logs/
22 | lightning_logs/
23 | notebooks/lightning_logs/
24 | wandb/
25 | nohup.out
26 | 
27 | # Slurm submission scripts and logs
28 | subslurm/
29 | slurm-*
30 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Ross Irwin
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # SemlaFlow - Efficient Molecular Generation with Flow Matching and Semla
 2 | 
 3 | This project creates a novel equivariant attention-based message passing architecture, Semla, for molecular design and dynamics tasks. We train a molecular generation model, SemlaFlow, using flow matching with optimal transport to generate realistic 3D molecular structures.
 4 | 
 5 | 
 6 | ## Installation
 7 | 
 8 | All of the code was run using a mamba/conda environment. You can of course use a different environment manager; all core requirements are contained in the `environment.yaml` file. Using mamba/conda you can recreate the environment as follows:
 9 | 1. `mamba env create --file environment.yaml`
10 | 2. `mamba activate semlaflow`
11 | 
12 | For developing (and to run the notebooks) you will also need to install the extra requirements:
13 | 3. `pip install -r extra_requirements.txt`
14 | 
15 | 
16 | ## Datasets
17 | 
 18 | For ease of use, we have provided the processed data files in a Google Drive folder [here](https://drive.google.com/drive/folders/1rHi5JzN05bsGRGQUcWRmDu-Ilfoa9EAT?usp=sharing). Copy the folder called `smol` from the QM9 or GEOM Drugs folder and point to that `smol` folder when running the scripts. For example, pass `--data_path path/to/data/qm9/smol` to the script you wish to run.
19 | 
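Before launching a script it can help to confirm that the folder you pass as `--data_path` actually contains the split files. The following is a minimal sketch, not part of the repository: the helper name `missing_smol_files` is hypothetical, and the file names `train.smol`, `val.smol` and `test.smol` are assumed from what `notebooks/qm9.ipynb` writes, so the downloaded folders may differ.

```python
from pathlib import Path

# Assumption: these names match the files written by notebooks/qm9.ipynb.
EXPECTED_FILES = ("train.smol", "val.smol", "test.smol")


def missing_smol_files(data_path):
    """Return the expected split files that are absent from data_path."""
    root = Path(data_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical location; replace with the folder you copied from the drive.
    missing = missing_smol_files("path/to/data/qm9/smol")
    print("Missing files:", missing or "none")
```

An empty list means all three split files were found.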
20 | 
21 | ### Data Prep
22 | 
 23 | We copied the code from MiDi (https://github.com/cvignac/MiDi) to download the QM9 dataset and create the data splits. The code to do this, and to create the _Smol_ internal dataset representation used for training, is provided in the `notebooks/qm9.ipynb` notebook.
24 | 
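The split sizes produced by `split_qm9` in the notebook follow from simple arithmetic: a fixed 100,000 training rows, 10% of the rows for testing, and the remainder for validation. The sketch below reproduces only that arithmetic (the function name `qm9_split_sizes` is hypothetical), and the row count of 133,885 is an assumption about the QM9 metadata CSV rather than something stated in this repository.

```python
def qm9_split_sizes(n_samples, n_train=100_000, test_frac=0.1):
    """Mirror the size arithmetic of split_qm9 in notebooks/qm9.ipynb."""
    n_test = int(test_frac * n_samples)
    n_val = n_samples - (n_train + n_test)
    if n_val < 0:
        raise ValueError("Dataset too small for the requested train/test sizes.")
    return n_train, n_val, n_test


# Assumption: the metadata CSV has 133,885 rows.
print(qm9_split_sizes(133_885))  # prints (100000, 20497, 13388)
```

These are the sizes of the CSV splits only. The notebook later drops molecules listed in the skip file and molecules that fail sanitisation, so the saved datasets will be somewhat smaller.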
25 | For GEOM Drugs we also follow the URLs provided in the MiDi repo. GEOM Drugs is preprocessed using the `preprocess.py` script. GEOM Drugs URLs from MiDi are as follows:
26 | * train: https://drive.switch.ch/index.php/s/UauSNgSMUPQdZ9v
27 | * validation: https://drive.switch.ch/index.php/s/YNW5UriYEeVCDnL
28 | * test: https://drive.switch.ch/index.php/s/GQW9ok7mPInPcIo
29 | 
30 | 
31 | ## Running
32 | 
33 | Once you have created and activated the environment successfully, you can run the code.
34 | 
35 | ### Scripts
36 | 
37 | We provide 4 scripts in the repository:
38 | * `preprocess` - Used for preprocessing larger datasets into the internal representation used by the model for training
 39 | * `train` - Trains a SemlaFlow model on preprocessed data
40 | * `evaluate` - Evaluates a trained model and prints the results
41 | * `predict` - Runs the sampling for a trained model and saves the generated molecules
42 | 
43 | Each script can be run as follows (where `<script>` is replaced by the script name above without `.py`): `python -m semlaflow.<script> --data_path <path/to/data> <other_args>`
44 | 
 45 | See the bottom of each script for a full list of the available arguments. Default parameters for those arguments are also given as global declarations at the top of each file. The default arguments in the training script are for GEOM Drugs. To train on QM9 we use a `bond_loss_weight` of 0.5, 2000 `warm_up_steps` and usually 300 `epochs`.
46 | 
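As an illustration of the invocation pattern above, the sketch below assembles the command line for training on QM9 with the settings just mentioned. It is a hedged example rather than a definitive recipe: `build_command` is a hypothetical helper, and it assumes each argument is exposed as a `--<name>` flag in the same way as `--data_path`, which should be checked against the argument list at the bottom of each script.

```python
import shlex
import sys


def build_command(script, data_path, **options):
    """Build the argv list for `python -m semlaflow.<script>`."""
    cmd = [sys.executable, "-m", f"semlaflow.{script}", "--data_path", str(data_path)]
    for name, value in options.items():
        # Assumption: every option maps directly to a `--<name>` flag.
        cmd += [f"--{name}", str(value)]
    return cmd


qm9_train = build_command(
    "train",
    "path/to/data/qm9/smol",
    bond_loss_weight=0.5,
    warm_up_steps=2000,
    epochs=300,
)
print(shlex.join(qm9_train))
# To launch it: import subprocess; subprocess.run(qm9_train, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.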
47 | ### Models
48 | 
 49 | We also provide pretrained model checkpoints for our headline QM9 and GEOM Drugs models [here](https://drive.google.com/drive/folders/1rHi5JzN05bsGRGQUcWRmDu-Ilfoa9EAT?usp=sharing). To evaluate one of these models, pass the checkpoint to the `evaluate` script, for example: `--ckpt_path path/to/models/qm9.ckpt`
50 | 
51 | ### Tests
52 | 
53 | The tests are quite sparse and only test the core functionality of the util functions used throughout the model and the molecular representations. 
54 | 
 55 | To run all tests: `python -m unittest -v tests/*.py`
 56 | 
 57 | Specific test modules can also be run individually, e.g. `python -m unittest -v tests.functional`
58 | 
59 | 
60 | ## Contact
61 | 
62 | If you find a problem with the code feel free to make a PR. If you have questions or other issues with the code you can email me directly -> rossir [at] chalmers [dot] se
63 | 
64 | 
65 | ## Citation
66 | 
67 | ```
68 | @article{irwin2024efficient,
69 |   title={Efficient 3D Molecular Generation with Flow Matching and Scale Optimal Transport},
70 |   author={Irwin, Ross and Tibo, Alessandro and Janet, Jon-Paul and Olsson, Simon},
71 |   journal={arXiv preprint arXiv:2406.07266},
72 |   year={2024}
73 | }
74 | ```
75 | 


--------------------------------------------------------------------------------
/environment.yaml:
--------------------------------------------------------------------------------
  1 | name: semlaflow
 2 | channels:
 3 |   - conda-forge
 4 |   - pytorch
 5 |   - nvidia
 6 | dependencies:
 7 |   - python=3.11
 8 |   - pytorch
 9 |   - pytorch-cuda=12.1
10 |   - ca-certificates
11 |   - cxx-compiler
12 |   - openssl
13 |   - tqdm
14 |   - wandb
15 |   - ipython
16 |   - certifi
17 |   - pip:
18 |     - numpy==1.26.2
19 |     - pandas==2.2.2
20 |     - scipy==1.11.4
21 |     - rdkit
22 |     - lightning
23 |     - torchmetrics
24 |     - openbabel-wheel
25 |     - typing_extensions
26 | 


--------------------------------------------------------------------------------
/extra_requirements.txt:
--------------------------------------------------------------------------------
1 | matplotlib
2 | jupyter
3 | ipykernel
4 | py3Dmol


--------------------------------------------------------------------------------
/notebooks/qm9.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "attachments": {},
  5 |    "cell_type": "markdown",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# Inspect Datasets and Save as Smol Objects"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": null,
 14 |    "metadata": {},
 15 |    "outputs": [],
 16 |    "source": [
 17 |     "import sys\n",
 18 |     "sys.path.append(\"..\")"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": null,
 24 |    "metadata": {},
 25 |    "outputs": [],
 26 |    "source": [
 27 |     "import torch\n",
 28 |     "import numpy as np\n",
 29 |     "import pandas as pd\n",
 30 |     "import matplotlib.pyplot as plt\n",
 31 |     "from tqdm import tqdm\n",
 32 |     "from pathlib import Path\n",
 33 |     "from rdkit import Chem, RDLogger\n",
 34 |     "from torchmetrics import MetricCollection\n",
 35 |     "from rdkit.Chem.Draw import IPythonConsole\n",
 36 |     "IPythonConsole.ipython_3d = True"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": null,
 42 |    "metadata": {},
 43 |    "outputs": [],
 44 |    "source": [
 45 |     "import semlaflow.util.rdkit as smolRD\n",
 46 |     "import semlaflow.util.functional as smolF\n",
 47 |     "import semlaflow.util.metrics as Metrics\n",
 48 |     "from semlaflow.util.tokeniser import Vocabulary\n",
 49 |     "from semlaflow.util.molrepr import GeometricMol, GeometricMolBatch"
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "code",
 54 |    "execution_count": null,
 55 |    "metadata": {},
 56 |    "outputs": [],
 57 |    "source": [
 58 |     "QM9_PATH = \"../../../data/qm9\"\n",
 59 |     "RAW_DIR =\"raw\"\n",
 60 |     "SPLIT_DIR = \"raw_split\"\n",
 61 |     "SAVE_DIR = \"smol\"\n",
 62 |     "SDF_FILE = \"gdb9.sdf\"\n",
 63 |     "METADATA_FILE = \"gdb9.sdf.csv\"\n",
 64 |     "SKIP_FILE = \"uncharacterized.txt\""
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": null,
 70 |    "metadata": {},
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "# Copied from MiDi code, so should create the same splits (they didn't make them available)\n",
 74 |     "def split_qm9(metadata_df):\n",
 75 |     "    n_samples = len(metadata_df)\n",
 76 |     "    n_train = 100000\n",
 77 |     "    n_test = int(0.1 * n_samples)\n",
 78 |     "    n_val = n_samples - (n_train + n_test)\n",
 79 |     "\n",
 80 |     "    # Shuffle dataset with df.sample, then split\n",
 81 |     "    train, val, test = np.split(metadata_df.sample(frac=1, random_state=42), [n_train, n_val + n_train])\n",
 82 |     "    return train, val, test"
 83 |    ]
 84 |   },
 85 |   {
 86 |    "cell_type": "code",
 87 |    "execution_count": null,
 88 |    "metadata": {},
 89 |    "outputs": [],
 90 |    "source": [
 91 |     "# Will skip mol indices which appear in the skip file\n",
 92 |     "def rdkit_mols_from_df(split_path, sdf_path, skip_path):\n",
 93 |     "    target_df = pd.read_csv(split_path, index_col=0)\n",
 94 |     "    target_df.drop(columns=['mol_id'], inplace=True)\n",
 95 |     "\n",
 96 |     "    with open(skip_path, 'r') as f:\n",
 97 |     "        skip = [int(x.split()[0]) - 1 for x in f.read().split('\\n')[9:-2]]\n",
 98 |     "\n",
 99 |     "    suppl = Chem.SDMolSupplier(str(sdf_path), removeHs=False, sanitize=False)\n",
100 |     "\n",
101 |     "    mols = []\n",
102 |     "    all_smiles = []\n",
103 |     "\n",
104 |     "    errors = 0\n",
105 |     "    skipped = 0\n",
106 |     "\n",
107 |     "    for i, mol in enumerate(tqdm(suppl)):\n",
108 |     "        if i not in target_df.index:\n",
109 |     "            continue\n",
110 |     "\n",
111 |     "        if i in skip:\n",
112 |     "            skipped += 1\n",
113 |     "            continue\n",
114 |     "\n",
115 |     "        try:\n",
116 |     "            Chem.SanitizeMol(mol)\n",
117 |     "            smiles = Chem.MolToSmiles(mol, isomericSmiles=False)\n",
118 |     "        except Exception:\n",
119 |     "            smiles = None\n",
120 |     "\n",
121 |     "        if smiles is None:\n",
122 |     "            errors += 1\n",
123 |     "        else:\n",
124 |     "            all_smiles.append(smiles)\n",
125 |     "            mols.append(mol)\n",
126 |     "\n",
127 |     "    print(f\"Skipped {skipped} mols which were in skip file.\")\n",
128 |     "    print(f\"Encountered {errors} molecules which failed sanitisation.\")\n",
129 |     "    print(f\"Completed loading of dataset with {len(mols)} molecules.\")\n",
130 |     "\n",
131 |     "    return mols"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "def build_vocab():\n",
141 |     "    # Need to make sure PAD has index 0\n",
142 |     "    special_tokens = [\"<PAD>\", \"<MASK>\"]\n",
143 |     "    core_atoms = [\"H\", \"C\", \"N\", \"O\", \"F\", \"P\", \"S\", \"Cl\"]\n",
144 |     "    other_atoms = [\"Br\", \"B\", \"Al\", \"Si\", \"As\", \"I\", \"Hg\", \"Bi\"]\n",
145 |     "    tokens = special_tokens + core_atoms + other_atoms\n",
146 |     "    return Vocabulary(tokens)"
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "code",
151 |    "execution_count": null,
152 |    "metadata": {},
153 |    "outputs": [],
154 |    "source": [
155 |     "def matching_smiles(rdkit_mol, smol_mol, vocab):\n",
156 |     "    rdkit_mol2 = smol_mol.to_rdkit(vocab)\n",
157 |     "    smi1 = smolRD.smiles_from_mol(rdkit_mol, canonical=True)\n",
158 |     "    smi2 = smolRD.smiles_from_mol(rdkit_mol2, canonical=True)\n",
159 |     "    return smi1 == smi2"
160 |    ]
161 |   },
162 |   {
163 |    "attachments": {},
164 |    "cell_type": "markdown",
165 |    "metadata": {},
166 |    "source": [
167 |     "## QM9"
168 |    ]
169 |   },
170 |   {
171 |    "attachments": {},
172 |    "cell_type": "markdown",
173 |    "metadata": {},
174 |    "source": [
175 |     "### Split QM9 and load into separate CSVs\n",
176 |     "\n",
177 |     "I have copied the code from the MiDi paper and used the same random seed, so hopefully this will generate the same splits as they used. But they haven't provided their splits so we can't say for sure without these.\n",
178 |     "\n",
179 |     "This code just splits the csv file, which contains metadata and properties for each molecule. The full molecular coordinates are stored in a single sdf file."
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "code",
184 |    "execution_count": null,
185 |    "metadata": {},
186 |    "outputs": [],
187 |    "source": [
188 |     "qm9_path = Path(QM9_PATH)\n",
189 |     "dataset = pd.read_csv(qm9_path / RAW_DIR / METADATA_FILE)\n",
190 |     "train, val, test = split_qm9(dataset)\n",
191 |     "\n",
192 |     "train_csv_path = qm9_path / SPLIT_DIR / \"train.csv\"\n",
193 |     "val_csv_path = qm9_path / SPLIT_DIR / \"val.csv\"\n",
194 |     "test_csv_path = qm9_path / SPLIT_DIR / \"test.csv\"\n",
195 |     "\n",
196 |     "# train.to_csv(train_csv_path)\n",
197 |     "# val.to_csv(val_csv_path)\n",
198 |     "# test.to_csv(test_csv_path)"
199 |    ]
200 |   },
201 |   {
202 |    "attachments": {},
203 |    "cell_type": "markdown",
204 |    "metadata": {},
205 |    "source": [
206 |     "### Create Smol Datasets from RDKit Mols from SDF Files"
207 |    ]
208 |   },
209 |   {
210 |    "cell_type": "code",
211 |    "execution_count": null,
212 |    "metadata": {},
213 |    "outputs": [],
214 |    "source": [
215 |     "RDLogger.DisableLog('rdApp.*')"
216 |    ]
217 |   },
218 |   {
219 |    "cell_type": "code",
220 |    "execution_count": null,
221 |    "metadata": {},
222 |    "outputs": [],
223 |    "source": [
224 |     "vocab = build_vocab()"
225 |    ]
226 |   },
227 |   {
228 |    "cell_type": "code",
229 |    "execution_count": null,
230 |    "metadata": {},
231 |    "outputs": [],
232 |    "source": [
233 |     "sdf_path = qm9_path / RAW_DIR / SDF_FILE\n",
234 |     "skip_path = qm9_path / RAW_DIR / SKIP_FILE\n",
235 |     "\n",
236 |     "print(\"Processing train data...\")\n",
237 |     "train_mols = rdkit_mols_from_df(train_csv_path, sdf_path, skip_path)\n",
238 |     "\n",
239 |     "print(\"Processing val data...\")\n",
240 |     "val_mols = rdkit_mols_from_df(val_csv_path, sdf_path, skip_path)\n",
241 |     "\n",
242 |     "print(\"Processing test data...\")\n",
243 |     "test_mols = rdkit_mols_from_df(test_csv_path, sdf_path, skip_path)"
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "code",
248 |    "execution_count": null,
249 |    "metadata": {},
250 |    "outputs": [],
251 |    "source": [
252 |     "# Create Smol batches for ease of use later on\n",
253 |     "train_batch = GeometricMolBatch([GeometricMol.from_rdkit(mol) for mol in train_mols])\n",
254 |     "val_batch = GeometricMolBatch([GeometricMol.from_rdkit(mol) for mol in val_mols])\n",
255 |     "test_batch = GeometricMolBatch([GeometricMol.from_rdkit(mol) for mol in test_mols])"
256 |    ]
257 |   },
258 |   {
259 |    "cell_type": "code",
260 |    "execution_count": null,
261 |    "metadata": {},
262 |    "outputs": [],
263 |    "source": [
264 |     "# Check it looks right\n",
265 |     "print(\"Dataset sizes:\")\n",
266 |     "print(len(train_batch))\n",
267 |     "print(len(val_batch))\n",
268 |     "print(len(test_batch))\n",
269 |     "\n",
270 |     "example_mol = train_batch[567]\n",
271 |     "print()\n",
272 |     "print(\"Example mol:\")\n",
273 |     "print(example_mol.coords)\n",
274 |     "print(example_mol.atomics)\n",
275 |     "print(example_mol.bonds)\n",
276 |     "print(example_mol.charges)"
277 |    ]
278 |   },
279 |   {
280 |    "cell_type": "code",
281 |    "execution_count": null,
282 |    "metadata": {},
283 |    "outputs": [],
284 |    "source": [
285 |     "example_mol.to_rdkit(vocab)"
286 |    ]
287 |   },
288 |   {
289 |    "cell_type": "code",
290 |    "execution_count": null,
291 |    "metadata": {},
292 |    "outputs": [],
293 |    "source": [
294 |     "for atom in example_mol.to_rdkit(vocab).GetAtoms():\n",
295 |     "    print(f\"Atom {atom.GetSymbol()} -- charge {atom.GetFormalCharge()} -- valence {atom.GetExplicitValence()}\")"
296 |    ]
297 |   },
298 |   {
299 |    "cell_type": "code",
300 |    "execution_count": null,
301 |    "metadata": {},
302 |    "outputs": [],
303 |    "source": [
304 |     "train_path = qm9_path / SAVE_DIR / \"train.smol\"\n",
305 |     "val_path = qm9_path / SAVE_DIR / \"val.smol\"\n",
306 |     "test_path = qm9_path / SAVE_DIR / \"test.smol\"\n",
307 |     "\n",
308 |     "train_bytes = train_batch.to_bytes()\n",
309 |     "val_bytes = val_batch.to_bytes()\n",
310 |     "test_bytes = test_batch.to_bytes()\n",
311 |     "\n",
312 |     "train_path.write_bytes(train_bytes)\n",
313 |     "val_path.write_bytes(val_bytes)\n",
314 |     "test_path.write_bytes(test_bytes)"
315 |    ]
316 |   },
317 |   {
318 |    "cell_type": "code",
319 |    "execution_count": null,
320 |    "metadata": {},
321 |    "outputs": [],
322 |    "source": [
323 |     "train_matching = [matching_smiles(mol1, mol2, vocab) for mol1, mol2 in zip(train_mols, train_batch.to_list())]\n",
324 |     "print(\"Proportion matching\", sum(train_matching) / len(train_matching))"
325 |    ]
326 |   },
327 |   {
328 |    "cell_type": "code",
329 |    "execution_count": null,
330 |    "metadata": {},
331 |    "outputs": [],
332 |    "source": [
333 |     "print(len(train_mols))\n",
334 |     "print(len(train_batch))"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "code",
339 |    "execution_count": null,
340 |    "metadata": {},
341 |    "outputs": [],
342 |    "source": [
343 |     "unmatched_idxs = [idx for idx, matching in enumerate(train_matching) if not matching]"
344 |    ]
345 |   },
346 |   {
347 |    "cell_type": "code",
348 |    "execution_count": null,
349 |    "metadata": {},
350 |    "outputs": [],
351 |    "source": [
352 |     "idx = 100\n",
353 |     "unmatched_idx = unmatched_idxs[idx]\n",
354 |     "print(smolRD.smiles_from_mol(train_mols[unmatched_idx]))\n",
355 |     "print(smolRD.smiles_from_mol(train_batch[unmatched_idx].to_rdkit(vocab)))"
356 |    ]
357 |   },
358 |   {
359 |    "cell_type": "code",
360 |    "execution_count": null,
361 |    "metadata": {},
362 |    "outputs": [],
363 |    "source": [
364 |     "train_valid = [smolRD.mol_is_valid(mol.to_rdkit(vocab)) for mol in train_batch]\n",
365 |     "print(\"Proportion valid\", sum(train_valid) / len(train_valid))"
366 |    ]
367 |   },
368 |   {
369 |    "cell_type": "code",
370 |    "execution_count": null,
371 |    "metadata": {},
372 |    "outputs": [],
373 |    "source": []
374 |   },
375 |   {
376 |    "attachments": {},
377 |    "cell_type": "markdown",
378 |    "metadata": {},
379 |    "source": [
380 |     "## Analyse QM9 Dataset"
381 |    ]
382 |   },
383 |   {
384 |    "cell_type": "code",
385 |    "execution_count": null,
386 |    "metadata": {},
387 |    "outputs": [],
388 |    "source": [
389 |     "train_coords = train_batch.coords\n",
390 |     "train_mask = train_batch.mask\n",
391 |     "\n",
392 |     "_, std_dev = smolF.standardise_coords(train_coords, train_mask)\n",
393 |     "print(\"Coord std dev on train data\", std_dev)"
394 |    ]
395 |   },
396 |   {
397 |    "cell_type": "code",
398 |    "execution_count": null,
399 |    "metadata": {},
400 |    "outputs": [],
401 |    "source": [
402 |     "avg_n_atoms = sum(train_batch.seq_length) / len(train_batch.seq_length)\n",
403 |     "max_n_atoms = max(train_batch.seq_length)\n",
404 |     "min_n_atoms = min(train_batch.seq_length)\n",
405 |     "print(\"avg\", avg_n_atoms)\n",
406 |     "print(\"max\", max_n_atoms)\n",
407 |     "print(\"min\", min_n_atoms)"
408 |    ]
409 |   },
410 |   {
411 |    "cell_type": "code",
412 |    "execution_count": null,
413 |    "metadata": {},
414 |    "outputs": [],
415 |    "source": [
416 |     "plt.hist(train_batch.seq_length, bins=26)\n",
417 |     "plt.show()"
418 |    ]
419 |   },
420 |   {
421 |    "attachments": {},
422 |    "cell_type": "markdown",
423 |    "metadata": {},
424 |    "source": [
425 |     "### Firstly, try loading the saved data"
426 |    ]
427 |   },
428 |   {
429 |    "cell_type": "code",
430 |    "execution_count": null,
431 |    "metadata": {},
432 |    "outputs": [],
433 |    "source": [
434 |     "SAVE_DIR = \"smol\""
435 |    ]
436 |   },
437 |   {
438 |    "cell_type": "code",
439 |    "execution_count": null,
440 |    "metadata": {},
441 |    "outputs": [],
442 |    "source": [
443 |     "qm9_path = Path(QM9_PATH)\n",
444 |     "train_path = qm9_path / SAVE_DIR / \"train.smol\"\n",
445 |     "val_path = qm9_path / SAVE_DIR / \"val.smol\"\n",
446 |     "test_path = qm9_path / SAVE_DIR / \"test.smol\""
447 |    ]
448 |   },
449 |   {
450 |    "cell_type": "code",
451 |    "execution_count": null,
452 |    "metadata": {},
453 |    "outputs": [],
454 |    "source": [
455 |     "train_bytes = train_path.read_bytes()\n",
456 |     "val_bytes = val_path.read_bytes()\n",
457 |     "test_bytes = test_path.read_bytes()\n",
458 |     "\n",
459 |     "train_batch = GeometricMolBatch.from_bytes(train_bytes)\n",
460 |     "val_batch = GeometricMolBatch.from_bytes(val_bytes)\n",
461 |     "test_batch = GeometricMolBatch.from_bytes(test_bytes)"
462 |    ]
463 |   },
464 |   {
465 |    "cell_type": "code",
466 |    "execution_count": null,
467 |    "metadata": {},
468 |    "outputs": [],
469 |    "source": [
470 |     "vocab = build_vocab()"
471 |    ]
472 |   },
473 |   {
474 |    "cell_type": "code",
475 |    "execution_count": null,
476 |    "metadata": {},
477 |    "outputs": [],
478 |    "source": [
479 |     "sample_mols = train_batch.to_list()"
480 |    ]
481 |   },
482 |   {
483 |    "cell_type": "code",
484 |    "execution_count": null,
485 |    "metadata": {},
486 |    "outputs": [],
487 |    "source": [
488 |     "sample_mols[567].to_rdkit(vocab)"
489 |    ]
490 |   },
491 |   {
492 |    "cell_type": "code",
493 |    "execution_count": null,
494 |    "metadata": {},
495 |    "outputs": [],
496 |    "source": [
497 |     "for atom in sample_mols[567].to_rdkit(vocab).GetAtoms():\n",
498 |     "    print(f\"Atom {atom.GetSymbol()} -- charge {atom.GetFormalCharge()} -- valence {atom.GetExplicitValence()}\")"
499 |    ]
500 |   },
501 |   {
502 |    "cell_type": "code",
503 |    "execution_count": null,
504 |    "metadata": {},
505 |    "outputs": [],
506 |    "source": [
507 |     "gen_metrics = {\n",
508 |     "    \"validity\": Metrics.Validity(),\n",
509 |     "    \"fc-validity\": Metrics.Validity(connected=True),\n",
510 |     "    \"uniqueness\": Metrics.Uniqueness(),\n",
511 |     "    \"energy-validity\": Metrics.EnergyValidity(),\n",
512 |     "    \"opt-energy-validity\": Metrics.EnergyValidity(optimise=True),\n",
513 |     "    \"energy\": Metrics.AverageEnergy(),\n",
514 |     "    \"energy-per-atom\": Metrics.AverageEnergy(per_atom=True),\n",
515 |     "    \"strain\": Metrics.AverageStrainEnergy(),\n",
516 |     "    \"strain-per-atom\": Metrics.AverageStrainEnergy(per_atom=True),\n",
517 |     "    \"opt-rmsd\": Metrics.AverageOptRmsd()\n",
518 |     "}\n",
519 |     "gen_metrics = MetricCollection(gen_metrics, compute_groups=False)"
520 |    ]
521 |   },
522 |   {
523 |    "cell_type": "code",
524 |    "execution_count": null,
525 |    "metadata": {},
526 |    "outputs": [],
527 |    "source": [
528 |     "# Compute benchmark metrics on loaded train dataset samples\n",
529 |     "rdkit_sample_mols = [mol.to_rdkit(vocab, sanitize=True) for mol in sample_mols]\n",
530 |     "gen_metrics.reset()\n",
531 |     "gen_metrics.update(rdkit_sample_mols)\n",
532 |     "results = gen_metrics.compute()"
533 |    ]
534 |   },
535 |   {
536 |    "cell_type": "code",
537 |    "execution_count": null,
538 |    "metadata": {},
539 |    "outputs": [],
540 |    "source": [
541 |     "for metric, result in results.items():\n",
542 |     "    print(f\"{metric} -- {result.item():.3f}\")"
543 |    ]
544 |   },
545 |   {
546 |    "cell_type": "code",
547 |    "execution_count": null,
548 |    "metadata": {},
549 |    "outputs": [],
550 |    "source": [
551 |     "# Compute benchmark metrics on original train dataset samples\n",
552 |     "gen_metrics.reset()\n",
553 |     "gen_metrics.update(train_mols)\n",
554 |     "results = gen_metrics.compute()"
555 |    ]
556 |   },
557 |   {
558 |    "cell_type": "code",
559 |    "execution_count": null,
560 |    "metadata": {},
561 |    "outputs": [],
562 |    "source": [
563 |     "for metric, result in results.items():\n",
564 |     "    print(f\"{metric} -- {result.item():.3f}\")"
565 |    ]
566 |   },
567 |   {
568 |    "cell_type": "code",
569 |    "execution_count": null,
570 |    "metadata": {},
571 |    "outputs": [],
572 |    "source": [
573 |     "for idx, mol in enumerate(sample_mols[82008:82010]):\n",
574 |     "    print(idx)\n",
575 |     "    mol.to_rdkit(vocab)"
576 |    ]
577 |   },
578 |   {
579 |    "cell_type": "code",
580 |    "execution_count": null,
581 |    "metadata": {},
582 |    "outputs": [],
583 |    "source": [
584 |     "idx = 82000 + 8\n",
585 |     "sample_mols[idx].to_rdkit(vocab)"
586 |    ]
587 |   },
588 |   {
589 |    "cell_type": "code",
590 |    "execution_count": null,
591 |    "metadata": {},
592 |    "outputs": [],
593 |    "source": [
594 |     "original_mol = Chem.Mol(train_mols[idx])\n",
595 |     "Chem.SanitizeMol(original_mol)\n",
596 |     "original_mol"
597 |    ]
598 |   },
599 |   {
600 |    "attachments": {},
601 |    "cell_type": "markdown",
602 |    "metadata": {},
603 |    "source": [
604 |     "### Recreate this issue with functions in the notebook"
605 |    ]
606 |   },
607 |   {
608 |    "cell_type": "code",
609 |    "execution_count": null,
610 |    "metadata": {},
611 |    "outputs": [],
612 |    "source": [
613 |     "def mol_from_atoms(coords, tokens, bonds):\n",
614 |     "    try:\n",
615 |     "        atomics = [smolRD.PT.atomic_from_symbol(token) for token in tokens]\n",
616 |     "    except Exception:\n",
617 |     "        return None\n",
618 |     "\n",
619 |     "    # Add atom types\n",
620 |     "    mol = Chem.EditableMol(Chem.Mol())\n",
621 |     "    for atomic in atomics:\n",
622 |     "        mol.AddAtom(Chem.Atom(atomic))\n",
623 |     "\n",
624 |     "    # Add 3D coords\n",
625 |     "    conf = Chem.Conformer(coords.shape[0])\n",
626 |     "    for idx, coord in enumerate(coords.tolist()):\n",
627 |     "        conf.SetAtomPosition(idx, coord)\n",
628 |     "\n",
629 |     "    mol = mol.GetMol()\n",
630 |     "    mol.AddConformer(conf)\n",
631 |     "\n",
632 |     "    # Add bonds if they have been provided\n",
633 |     "    mol = Chem.EditableMol(mol)\n",
634 |     "    for bond in bonds.astype(np.int32).tolist():\n",
635 |     "        start, end, b_type = bond\n",
636 |     "\n",
637 |     "        if b_type not in smolRD.IDX_BOND_MAP:\n",
638 |     "            return None\n",
639 |     "\n",
640 |     "        # Don't add self connections\n",
641 |     "        if start != end:\n",
642 |     "            b_type = smolRD.IDX_BOND_MAP[b_type]\n",
643 |     "            mol.AddBond(start, end, b_type)\n",
644 |     "\n",
645 |     "    mol = mol.GetMol()\n",
646 |     "    for atom in mol.GetAtoms():\n",
647 |     "        atom.UpdatePropertyCache(strict=False)\n",
648 |     "\n",
649 |     "    # try:\n",
650 |     "    #     Chem.SanitizeMol(mol)\n",
651 |     "    # except:\n",
652 |     "    #     return None\n",
653 |     "\n",
654 |     "    return mol"
655 |    ]
656 |   },
657 |   {
658 |    "cell_type": "code",
659 |    "execution_count": null,
660 |    "metadata": {},
661 |    "outputs": [],
662 |    "source": [
663 |     "def to_rdkit(mol, vocab):\n",
664 |     "    if len(mol.atomics.size()) == 2:\n",
665 |     "        vocab_indices = torch.argmax(mol.atomics, dim=1).tolist()\n",
666 |     "        tokens = vocab.tokens_from_indices(vocab_indices)\n",
667 |     "\n",
668 |     "    else:\n",
669 |     "        atomics = mol.atomics.tolist()\n",
670 |     "        tokens = [smolRD.PT.symbol_from_atomic(a) for a in atomics]\n",
671 |     "\n",
672 |     "    coords = mol.coords.numpy()\n",
673 |     "    bonds = mol.bonds.numpy()\n",
674 |     "\n",
675 |     "    rdkit_mol = mol_from_atoms(coords, tokens, bonds)\n",
676 |     "    return rdkit_mol"
677 |    ]
678 |   },
679 |   {
680 |    "cell_type": "code",
681 |    "execution_count": null,
682 |    "metadata": {},
683 |    "outputs": [],
684 |    "source": [
685 |     "idx = 82000 + 8\n",
686 |     "problem_mol = sample_mols[idx]\n",
687 |     "rdkit_mol = to_rdkit(problem_mol, vocab)"
688 |    ]
689 |   },
690 |   {
691 |    "cell_type": "code",
692 |    "execution_count": null,
693 |    "metadata": {},
694 |    "outputs": [],
695 |    "source": [
696 |     "for atom in rdkit_mol.GetAtoms():\n",
697 |     "    print(f\"Atom {atom.GetSymbol()} -- charge {atom.GetFormalCharge()} -- valence {atom.GetExplicitValence()}\")\n",
698 |     "\n",
699 |     "print()\n",
700 |     "for atom in original_mol.GetAtoms():\n",
701 |     "    print(f\"Atom {atom.GetSymbol()} -- charge {atom.GetFormalCharge()} -- valence {atom.GetExplicitValence()}\")"
702 |    ]
703 |   }
704 |  ],
705 |  "metadata": {
706 |   "kernelspec": {
707 |    "display_name": "fegnn",
708 |    "language": "python",
709 |    "name": "python3"
710 |   },
711 |   "language_info": {
712 |    "codemirror_mode": {
713 |     "name": "ipython",
714 |     "version": 3
715 |    },
716 |    "file_extension": ".py",
717 |    "mimetype": "text/x-python",
718 |    "name": "python",
719 |    "nbconvert_exporter": "python",
720 |    "pygments_lexer": "ipython3",
721 |    "version": "3.11.8"
722 |   },
723 |   "orig_nbformat": 4
724 |  },
725 |  "nbformat": 4,
726 |  "nbformat_minor": 2
727 | }
728 | 


--------------------------------------------------------------------------------
/semlaflow/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rssrwn/semla-flow/65d7106bc907e4d136deec8de7417386e3f9b10b/semlaflow/__init__.py


--------------------------------------------------------------------------------
/semlaflow/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rssrwn/semla-flow/65d7106bc907e4d136deec8de7417386e3f9b10b/semlaflow/data/__init__.py


--------------------------------------------------------------------------------
/semlaflow/data/datamodules.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | from functools import partial
  3 | 
  4 | import lightning as L
  5 | import torch
  6 | from torch.utils.data import DataLoader
  7 | 
  8 | import semlaflow.util.functional as smolF
  9 | import semlaflow.util.rdkit as smolRD
 10 | from semlaflow.data.util import BucketBatchSampler
 11 | from semlaflow.util.molrepr import GeometricMol, GeometricMolBatch
 12 | 
 13 | 
 14 | class SmolDM(L.LightningDataModule):
 15 |     def __init__(
 16 |         self,
 17 |         train_dataset,
 18 |         val_dataset,
 19 |         test_dataset,
 20 |         batch_cost,
 21 |         bucket_limits=None,
 22 |         bucket_cost_scale="constant",
 23 |         pad_to_bucket=False,
 24 |     ):
 25 |         super().__init__()
 26 | 
 27 |         if bucket_cost_scale not in [None, "constant", "linear", "quadratic"]:
 28 |             raise ValueError(f"Bucket cost scale '{bucket_cost_scale}' is not supported.")
 29 | 
 30 |         if bucket_limits is not None:
 31 |             bucket_limits = sorted(bucket_limits)
 32 |             largest_padding = bucket_limits[-1]
 33 | 
 34 |             if train_dataset is not None and max(train_dataset.lengths) > largest_padding:
 35 |                 raise ValueError("At least one item in train dataset is larger than largest padded size.")
 36 | 
 37 |             if val_dataset is not None and max(val_dataset.lengths) > largest_padding:
 38 |                 raise ValueError("At least one item in val dataset is larger than largest padded size.")
 39 | 
 40 |             if test_dataset is not None and max(test_dataset.lengths) > largest_padding:
 41 |                 raise ValueError("At least one item in test dataset is larger than largest padded size.")
 42 | 
 43 |         self._num_workers = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count() or 1
 44 | 
 45 |         self.train_dataset = train_dataset
 46 |         self.val_dataset = val_dataset
 47 |         self.test_dataset = test_dataset
 48 | 
 49 |         self.batch_cost = batch_cost
 50 |         self.bucket_limits = bucket_limits
 51 |         self.bucket_cost_scale = bucket_cost_scale
 52 |         self.pad_to_bucket = pad_to_bucket
 53 | 
 54 |     @property
 55 |     def hparams(self):
 56 |         train_data = self.train_dataset
 57 |         val_data = self.val_dataset
 58 |         test_data = self.test_dataset
 59 | 
 60 |         train_hps = {f"train-{k}": v for k, v in train_data.hparams.items()} if train_data is not None else {}
 61 |         val_hps = {f"val-{k}": v for k, v in val_data.hparams.items()} if val_data is not None else {}
 62 |         test_hps = {f"test-{k}": v for k, v in test_data.hparams.items()} if test_data is not None else {}
 63 | 
 64 |         hparams = {
 65 |             "batch-cost": self.batch_cost,
 66 |             "buckets": len(self.bucket_limits) if self.bucket_limits is not None else None,
 67 |             "bucket-cost-scale": self.bucket_cost_scale,
 68 |             **train_hps,
 69 |             **val_hps,
 70 |             **test_hps,
 71 |         }
 72 |         return hparams
 73 | 
 74 |     def train_dataloader(self):
 75 |         sampler = self._sampler(self.train_dataset, drop_last=True)
 76 |         batch_size = self.batch_cost if sampler is None else 1
 77 |         shuffle = sampler is None
 78 | 
 79 |         dataloader = DataLoader(
 80 |             self.train_dataset,
 81 |             batch_size=batch_size,
 82 |             shuffle=shuffle,
 83 |             batch_sampler=sampler,
 84 |             num_workers=self._num_workers,
 85 |             pin_memory=True,
 86 |             collate_fn=partial(self._collate, dataset="train"),
 87 |         )
 88 |         return dataloader
 89 | 
 90 |     def val_dataloader(self):
 91 |         sampler = self._sampler(self.val_dataset, drop_last=False)
 92 |         batch_size = self.batch_cost if sampler is None else 1
 93 | 
 94 |         dataloader = DataLoader(
 95 |             self.val_dataset,
 96 |             batch_size=batch_size,
 97 |             shuffle=False,
 98 |             batch_sampler=sampler,
 99 |             num_workers=self._num_workers,
100 |             collate_fn=partial(self._collate, dataset="val"),
101 |         )
102 |         return dataloader
103 | 
104 |     def test_dataloader(self):
105 |         sampler = self._sampler(self.test_dataset, drop_last=False)
106 |         batch_size = self.batch_cost if sampler is None else 1
107 | 
108 |         dataloader = DataLoader(
109 |             self.test_dataset,
110 |             batch_size=batch_size,
111 |             shuffle=False,
112 |             batch_sampler=sampler,
113 |             num_workers=self._num_workers,
114 |             collate_fn=partial(self._collate, dataset="test"),
115 |         )
116 |         return dataloader
117 | 
118 |     def _sampler(self, dataset, drop_last=False):
119 |         sampler = None
120 |         if self.bucket_limits is not None:
121 |             costs = self._get_bucket_costs()
122 |             sampler = BucketBatchSampler(
123 |                 self.bucket_limits,
124 |                 dataset.lengths,
125 |                 self.batch_cost,
126 |                 bucket_costs=costs,
127 |                 drop_last=drop_last,
128 |                 round_batch_to_8=True,
129 |             )
130 | 
131 |         return sampler
132 | 
133 |     def _get_bucket_costs(self):
134 |         if self.bucket_cost_scale is None:
135 |             return None
136 |         elif self.bucket_cost_scale == "constant":
137 |             return [1] * len(self.bucket_limits)
138 |         elif self.bucket_cost_scale == "linear":
139 |             return self.bucket_limits
140 |         elif self.bucket_cost_scale == "quadratic":
141 |             # Divide by 256 and add one to approximate the linear and constant overheads
142 |             # A molecule with 16 atoms will therefore have a cost of 1 + 1
143 |             return [((limit**2) / 256) + 1 for limit in self.bucket_limits]
144 |         else:
145 |             raise ValueError(f"Unknown value for bucket_cost_scale '{self.bucket_cost_scale}'")
146 | 
147 |     # TODO implement this using GeometricDM stuff and add extra collations for other types of SmolMol
148 |     def _collate(self, batch, dataset):
149 |         raise NotImplementedError()
150 | 
151 | 
152 | # TODO could make this more general for all types of SmolMol
153 | # Just have to allow for different types of SmolMol in collate and take different tensors in batch_to_dict
154 | class GeometricDM(SmolDM):
155 |     def _collate(self, batch, dataset):
156 |         if isinstance(batch, GeometricMolBatch):
157 |             return self._batch_to_dict(batch)
158 | 
159 |         elif isinstance(batch[0], GeometricMol):
160 |             smol_batch = GeometricMolBatch.from_list(list(batch))
161 |             return self._batch_to_dict(smol_batch)
162 | 
163 |         # If we don't have a list of Mols, we should have a list of tuples of Mols and other objects
164 |         collated = [self._collate_objs(list(objs)) for objs in tuple(zip(*batch))]
165 |         return collated
166 | 
167 |     def _collate_objs(self, objs):
168 |         if isinstance(objs, GeometricMolBatch):
169 |             return self._batch_to_dict(objs)
170 | 
171 |         elif isinstance(objs, dict):
172 |             return {key: self._collate_objs(val) for key, val in objs.items()}
173 | 
174 |         elif isinstance(objs[0], GeometricMol):
175 |             smol_batch = GeometricMolBatch.from_list(list(objs))
176 |             return self._batch_to_dict(smol_batch)
177 | 
178 |         elif isinstance(objs[0], torch.Tensor):
179 |             return torch.stack(objs)
180 | 
181 |         elif isinstance(objs[0], dict):
182 |             collated = {k: [obj[k] for obj in objs] for k in list(objs[0].keys())}
183 |             return self._collate_objs(collated)
184 | 
185 |         return objs
186 | 
187 |     def _batch_to_dict(self, smol_batch):
188 |         # Pad batch to n_atoms using a fake mol
189 |         # If we are not padding to bucket size get_padded_size will just return largest mol size
190 |         n_atoms = self._get_padded_size(smol_batch)
191 |         batch = [self._fake_mol_like(smol_batch[0], n_atoms)] + smol_batch.to_list()
192 |         batch = GeometricMolBatch.from_list(batch)
193 | 
194 |         coords = batch.coords.float()[1:]
195 |         atomics = batch.atomics.float()[1:]
196 |         bonds = batch.adjacency.float()[1:]
197 |         charges = batch.charges.long()[1:]
198 |         mask = batch.mask.long()[1:]
199 | 
200 |         # Assume that charges have already been transformed to indices
201 |         if charges is not None:
202 |             n_charges = len(smolRD.CHARGE_IDX_MAP.keys())
203 |             charges = smolF.one_hot_encode_tensor(charges, n_charges)
204 | 
205 |         data = {"coords": coords, "atomics": atomics, "bonds": bonds, "charges": charges, "mask": mask}
206 |         return data
207 | 
208 |     def _get_padded_size(self, smol_batch):
209 |         largest_mol_size = max(smol_batch.seq_length)
210 |         if self.bucket_limits is None or not self.pad_to_bucket:
211 |             return largest_mol_size
212 | 
213 |         # Find smallest bucket which all mols will fit in
214 |         for size in self.bucket_limits:
215 |             if size >= largest_mol_size:
216 |                 return size
217 | 
218 |         raise ValueError(f"Mol size of {largest_mol_size} is larger than largest padded size.")
219 | 
220 |     def _fake_mol_like(self, mol, n_atoms):
221 |         coords = torch.zeros((n_atoms, 3))
222 |         if len(mol.atomics.shape) == 1:
223 |             atomics = torch.zeros((n_atoms,))
224 |         else:
225 |             atomics = torch.zeros((n_atoms, mol.atomics.size(1)))
226 | 
227 |         bond_indices = torch.tensor([[0, 0]])
228 |         if len(mol.bond_types.shape) == 1:
229 |             bond_types = torch.tensor([0])
230 |         else:
231 |             bond_types = torch.zeros((1, mol.bond_types.size(1)))
232 | 
233 |         return GeometricMol(coords, atomics, bond_indices=bond_indices, bond_types=bond_types)
234 | 
235 | 
236 | class GeometricInterpolantDM(GeometricDM):
237 |     def __init__(
238 |         self,
239 |         train_dataset,
240 |         val_dataset,
241 |         test_dataset,
242 |         batch_size,
243 |         train_interpolant=None,
244 |         val_interpolant=None,
245 |         test_interpolant=None,
246 |         bucket_limits=None,
247 |         bucket_cost_scale=None,
248 |         pad_to_bucket=False,
249 |     ):
250 | 
251 |         self.train_interpolant = train_interpolant
252 |         self.val_interpolant = val_interpolant
253 |         self.test_interpolant = test_interpolant
254 | 
255 |         super().__init__(
256 |             train_dataset,
257 |             val_dataset,
258 |             test_dataset,
259 |             batch_size,
260 |             bucket_limits=bucket_limits,
261 |             bucket_cost_scale=bucket_cost_scale,
262 |             pad_to_bucket=pad_to_bucket,
263 |         )
264 | 
265 |     @property
266 |     def hparams(self):
267 |         interps = [self.train_interpolant, self.val_interpolant, self.test_interpolant]
268 |         datasets = ["train", "val", "test"]
269 | 
270 |         hparams = []
271 |         for dataset, interp in zip(datasets, interps):
272 |             if interp is not None:
273 |                 interp_hparams = {f"{dataset}-{k}": v for k, v in interp.hparams.items()}
274 |                 hparams.append(interp_hparams)
275 | 
276 |         hparams = {k: v for interp_hparams in hparams for k, v in interp_hparams.items()}
277 |         return {**hparams, **super().hparams}
278 | 
279 |     def _collate(self, batch, dataset):
280 |         if dataset == "train" and self.train_interpolant is not None:
281 |             objs = self.train_interpolant.interpolate(batch)
282 |             batch = list(zip(*objs))
283 | 
284 |         elif dataset == "val" and self.val_interpolant is not None:
285 |             objs = self.val_interpolant.interpolate(batch)
286 |             batch = list(zip(*objs))
287 | 
288 |         elif dataset == "test" and self.test_interpolant is not None:
289 |             objs = self.test_interpolant.interpolate(batch)
290 |             batch = list(zip(*objs))
291 | 
292 |         return super()._collate(batch, dataset)
293 | 
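
The bucket cost scales and the pad-to-bucket lookup in `SmolDM` and `GeometricDM` above can be sketched in isolation as follows. This is a minimal, dependency-free illustration, not the repository's implementation: the function names `bucket_costs`, `padded_size` and `items_per_batch` are hypothetical, and `items_per_batch` in particular is only an assumption about how a fixed `batch_cost` budget might translate into a per-bucket batch size (the actual logic lives in `BucketBatchSampler`).

```python
from bisect import bisect_left


def bucket_costs(bucket_limits, scale):
    """Per-item cost for each bucket, following the three named cost scales."""
    if scale == "constant":
        return [1.0] * len(bucket_limits)
    if scale == "linear":
        return [float(limit) for limit in bucket_limits]
    if scale == "quadratic":
        # Same formula as _get_bucket_costs: a 16-atom bucket costs 1 + 1
        return [(limit**2) / 256 + 1 for limit in bucket_limits]
    raise ValueError(f"Unknown cost scale '{scale}'")


def padded_size(bucket_limits, largest_mol_size):
    """Smallest bucket limit that fits the largest molecule in a batch.

    Assumes bucket_limits is sorted in ascending order.
    """
    idx = bisect_left(bucket_limits, largest_mol_size)
    if idx == len(bucket_limits):
        raise ValueError(f"Mol size of {largest_mol_size} is larger than largest padded size.")
    return bucket_limits[idx]


def items_per_batch(batch_cost, costs):
    """Hypothetical helper: how many items of each bucket fit in one cost budget."""
    return [max(1, int(batch_cost // cost)) for cost in costs]


limits = [16, 32, 64]
quadratic = bucket_costs(limits, "quadratic")
print(quadratic)
print(padded_size(limits, 20))
print(items_per_batch(64, quadratic))
```

Under the quadratic scale, larger buckets become expensive quickly, so a fixed budget yields far fewer large molecules per batch than small ones. That is presumably the motivation for scaling the cost with the square of the bucket limit when the model attends over all atom pairs.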


--------------------------------------------------------------------------------
/semlaflow/data/datasets.py:
--------------------------------------------------------------------------------
  1 | from abc import ABC, abstractmethod
  2 | from pathlib import Path
  3 | 
  4 | import numpy as np
  5 | import torch
  6 | 
  7 | from semlaflow.util.molrepr import GeometricMolBatch
  8 | 
  9 | # *** Util functions ***
 10 | 
 11 | 
 12 | def load_smol_data(data_path, smol_cls):
 13 |     data_path = Path(data_path)
 14 | 
 15 |     # TODO handle having a directory with batched data files
 16 |     if data_path.is_dir():
 17 |         raise NotImplementedError()
 18 | 
 19 |     # TODO maybe read in chunks if this is too big
 20 |     bytes_data = data_path.read_bytes()
 21 |     data = smol_cls.from_bytes(bytes_data)
 22 |     return data
 23 | 
 24 | 
 25 | # *** Abstract class for all Smol data types ***
 26 | 
 27 | 
 28 | class SmolDataset(ABC, torch.utils.data.Dataset):
 29 |     def __init__(self, smol_data, transform=None):
 30 |         super().__init__()
 31 | 
 32 |         self._data = smol_data
 33 |         self.transform = transform
 34 | 
 35 |     @property
 36 |     def hparams(self):
 37 |         return {}
 38 | 
 39 |     @property
 40 |     def lengths(self):
 41 |         return self._data.seq_length
 42 | 
 43 |     def __len__(self):
 44 |         return self._data.batch_size
 45 | 
 46 |     def __getitem__(self, item):
 47 |         molecule = self._data[item]
 48 |         if self.transform is not None:
 49 |             molecule = self.transform(molecule)
 50 | 
 51 |         return molecule
 52 | 
 53 |     @classmethod
 54 |     @abstractmethod
 55 |     def load(cls, data_path, transform=None):
 56 |         pass
 57 | 
 58 | 
 59 | # *** SmolDataset implementations ***
 60 | 
 61 | 
 62 | class GeometricDataset(SmolDataset):
 63 |     def sample(self, n_items, replacement=False):
 64 |         mol_samples = [self._data[int(i)] for i in np.random.choice(len(self), n_items, replace=replacement)]
 65 |         data = GeometricMolBatch.from_list(mol_samples)
 66 |         return GeometricDataset(data, transform=self.transform)
 67 | 
 68 |     @classmethod
 69 |     def load(cls, data_path, transform=None, min_size=None):
 70 |         data = load_smol_data(data_path, GeometricMolBatch)
 71 |         if min_size is not None:
 72 |             mols = [mol for mol in data if mol.seq_length >= min_size]
 73 |             data = GeometricMolBatch.from_list(mols)
 74 | 
 75 |         return GeometricDataset(data, transform=transform)
 76 | 
 77 | 
 78 | # *** Other useful datasets ***
 79 | 
 80 | 
 81 | class SmolPairDataset(torch.utils.data.Dataset):
 82 |     """A dataset which returns pairs of SmolMol objects"""
 83 | 
 84 |     def __init__(self, from_dataset: SmolDataset, to_dataset: SmolDataset):
 85 |         super().__init__()
 86 | 
 87 |         if len(from_dataset) != len(to_dataset):
 88 |             raise ValueError("From and to datasets must have the same number of items.")
 89 | 
 90 |         if from_dataset.lengths != to_dataset.lengths:
 91 |             raise ValueError("From and to datasets must have molecules of the same length at each index.")
 92 | 
 93 |         self.from_dataset = from_dataset
 94 |         self.to_dataset = to_dataset
 95 | 
 96 |     # TODO stop hparams clashing from different sources
 97 |     @property
 98 |     def hparams(self):
 99 |         return {**self.from_dataset.hparams, **self.to_dataset.hparams}
100 | 
101 |     @property
102 |     def lengths(self):
103 |         return self.from_dataset.lengths
104 | 
105 |     def __len__(self):
106 |         return len(self.from_dataset)
107 | 
108 |     def __getitem__(self, item):
109 |         from_mol = self.from_dataset[item]
110 |         to_mol = self.to_dataset[item]
111 |         return from_mol, to_mol
112 | 


--------------------------------------------------------------------------------
/semlaflow/data/interpolate.py:
--------------------------------------------------------------------------------
  1 | from abc import ABC, abstractmethod
  2 | from typing import Optional
  3 | 
  4 | import numpy as np
  5 | import torch
  6 | from scipy.optimize import linear_sum_assignment
  7 | from scipy.spatial.transform import Rotation
  8 | 
  9 | import semlaflow.util.functional as smolF
 10 | from semlaflow.util.molrepr import GeometricMol, GeometricMolBatch, SmolBatch, SmolMol
 11 | 
 12 | SCALE_OT_FACTOR = 0.2
 13 | 
 14 | 
 15 | _InterpT = tuple[list[SmolMol], list[SmolMol], list[SmolMol], torch.Tensor]
 16 | _GeometricInterpT = tuple[list[GeometricMol], list[GeometricMol], list[GeometricMol], torch.Tensor]
 17 | 
 18 | 
 19 | class Interpolant(ABC):
 20 |     @property
 21 |     @abstractmethod
 22 |     def hparams(self):
 23 |         pass
 24 | 
 25 |     @abstractmethod
 26 |     def interpolate(self, to_batch: list[SmolMol]) -> _InterpT:
 27 |         pass
 28 | 
 29 | 
 30 | class NoiseSampler(ABC):
 31 |     @property
 32 |     def hparams(self):
 33 |         pass
 34 | 
 35 |     @abstractmethod
 36 |     def sample_molecule(self, num_atoms: int) -> SmolMol:
 37 |         pass
 38 | 
 39 |     @abstractmethod
 40 |     def sample_batch(self, num_atoms: list[int]) -> SmolBatch:
 41 |         pass
 42 | 
 43 | 
 44 | class GeometricNoiseSampler(NoiseSampler):
 45 |     def __init__(
 46 |         self,
 47 |         vocab_size: int,
 48 |         n_bond_types: int,
 49 |         coord_noise: str = "gaussian",
 50 |         type_noise: str = "uniform-sample",
 51 |         bond_noise: str = "uniform-sample",
 52 |         scale_ot: bool = False,
 53 |         zero_com: bool = True,
 54 |         type_mask_index: Optional[int] = None,
 55 |         bond_mask_index: Optional[int] = None,
 56 |     ):
 57 |         if coord_noise != "gaussian":
 58 |             raise NotImplementedError(f"Coord noise {coord_noise} is not supported.")
 59 | 
 60 |         self._check_cat_noise_type(type_noise, type_mask_index, "type")
 61 |         self._check_cat_noise_type(bond_noise, bond_mask_index, "bond")
 62 | 
 63 |         self.vocab_size = vocab_size
 64 |         self.n_bond_types = n_bond_types
 65 |         self.coord_noise = coord_noise
 66 |         self.type_noise = type_noise
 67 |         self.bond_noise = bond_noise
 68 |         self.scale_ot = scale_ot
 69 |         self.zero_com = zero_com
 70 |         self.type_mask_index = type_mask_index
 71 |         self.bond_mask_index = bond_mask_index
 72 | 
 73 |         self.coord_dist = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))
 74 |         self.atomic_dirichlet = torch.distributions.Dirichlet(torch.ones(vocab_size))
 75 |         self.bond_dirichlet = torch.distributions.Dirichlet(torch.ones(n_bond_types))
 76 | 
 77 |     @property
 78 |     def hparams(self):
 79 |         return {
 80 |             "coord-noise": self.coord_noise,
 81 |             "type-noise": self.type_noise,
 82 |             "bond-noise": self.bond_noise,
 83 |             "noise-scale-ot": self.scale_ot,
 84 |             "zero-com": self.zero_com,
 85 |         }
 86 | 
 87 |     def sample_molecule(self, n_atoms: int) -> GeometricMol:
 88 |         # Sample coords and scale, if required
 89 |         coords = self.coord_dist.sample((n_atoms, 3))
 90 |         if self.scale_ot:
 91 |             coords = coords * np.log(n_atoms + 1) * SCALE_OT_FACTOR
 92 | 
 93 |         # Sample atom types
 94 |         if self.type_noise == "dirichlet":
 95 |             atomics = self.atomic_dirichlet.sample((n_atoms,))
 96 | 
 97 |         elif self.type_noise == "uniform-dist":
 98 |             atomics = torch.ones((n_atoms, self.vocab_size)) / self.vocab_size
 99 | 
100 |         elif self.type_noise == "mask":
101 |             atomics = torch.zeros((n_atoms, self.vocab_size), dtype=torch.float32)
102 |             atomics[:, self.type_mask_index] = 1.0
103 | 
104 |         elif self.type_noise == "uniform-sample":
105 |             atomics = torch.randint(0, self.vocab_size, (n_atoms,))
106 |             atomics = smolF.one_hot_encode_tensor(atomics, self.vocab_size)
107 | 
108 |         # Create bond indices and sample bond types
109 |         bond_indices = torch.ones((n_atoms, n_atoms)).nonzero()
110 |         n_bonds = bond_indices.size(0)
111 | 
112 |         if self.bond_noise == "dirichlet":
113 |             bond_types = self.bond_dirichlet.sample((n_bonds,))
114 | 
115 |         elif self.bond_noise == "uniform-dist":
116 |             bond_types = torch.ones((n_bonds, self.n_bond_types)) / self.n_bond_types
117 | 
118 |         elif self.bond_noise == "mask":
119 |             bond_types = torch.tensor(self.bond_mask_index).repeat(n_bonds)
120 |             bond_types = smolF.one_hot_encode_tensor(bond_types, self.n_bond_types)
121 | 
122 |         elif self.bond_noise == "uniform-sample":
123 |             bond_types = torch.randint(0, self.n_bond_types, size=(n_bonds,))
124 |             bond_types = smolF.one_hot_encode_tensor(bond_types, self.n_bond_types)
125 | 
126 |         # Create smol mol object
127 |         mol = GeometricMol(coords, atomics, bond_indices=bond_indices, bond_types=bond_types)
128 |         if self.zero_com:
129 |             mol = mol.zero_com()
130 | 
131 |         return mol
132 | 
133 |     def sample_batch(self, num_atoms: list[int]) -> GeometricMolBatch:
134 |         mols = [self.sample_molecule(n) for n in num_atoms]
135 |         batch = GeometricMolBatch.from_list(mols)
136 |         return batch
137 | 
138 |     def _check_cat_noise_type(self, noise_type, mask_index, name):
139 |         if noise_type not in ["dirichlet", "uniform-dist", "mask", "uniform-sample"]:
140 |             raise ValueError(f"{name} noise {noise_type} is not supported.")
141 | 
142 |         if noise_type == "mask" and mask_index is None:
143 |             raise ValueError(f"{name}_mask_index must be provided if {name}_noise is 'mask'.")
144 | 
145 | 
146 | class GeometricInterpolant(Interpolant):
147 |     def __init__(
148 |         self,
149 |         prior_sampler: GeometricNoiseSampler,
150 |         coord_interpolation: str = "linear",
151 |         type_interpolation: str = "unmask",
152 |         bond_interpolation: str = "unmask",
153 |         coord_noise_std: float = 0.0,
154 |         type_dist_temp: float = 1.0,
155 |         equivariant_ot: bool = False,
156 |         batch_ot: bool = False,
157 |         time_alpha: float = 1.0,
158 |         time_beta: float = 1.0,
159 |         fixed_time: Optional[float] = None,
160 |     ):
161 | 
162 |         if fixed_time is not None and (fixed_time < 0 or fixed_time > 1):
163 |             raise ValueError("fixed_time must be between 0 and 1 if provided.")
164 | 
165 |         if coord_interpolation != "linear":
166 |             raise ValueError(f"coord interpolation '{coord_interpolation}' not supported.")
167 | 
168 |         if type_interpolation not in ["dirichlet", "unmask"]:
169 |             raise ValueError(f"type interpolation '{type_interpolation}' not supported.")
170 | 
171 |         if bond_interpolation not in ["dirichlet", "unmask"]:
172 |             raise ValueError(f"bond interpolation '{bond_interpolation}' not supported.")
173 | 
174 |         self.prior_sampler = prior_sampler
175 |         self.coord_interpolation = coord_interpolation
176 |         self.type_interpolation = type_interpolation
177 |         self.bond_interpolation = bond_interpolation
178 |         self.coord_noise_std = coord_noise_std
179 |         self.type_dist_temp = type_dist_temp
180 |         self.equivariant_ot = equivariant_ot
181 |         self.batch_ot = batch_ot
182 |         self.time_alpha = time_alpha if fixed_time is None else None
183 |         self.time_beta = time_beta if fixed_time is None else None
184 |         self.fixed_time = fixed_time
185 | 
186 |         self.time_dist = torch.distributions.Beta(time_alpha, time_beta)
187 | 
188 |     @property
189 |     def hparams(self):
190 |         prior_hparams = {f"prior-{k}": v for k, v in self.prior_sampler.hparams.items()}
191 |         hparams = {
192 |             "coord-interpolation": self.coord_interpolation,
193 |             "type-interpolation": self.type_interpolation,
194 |             "bond-interpolation": self.bond_interpolation,
195 |             "coord-noise-std": self.coord_noise_std,
196 |             "type-dist-temp": self.type_dist_temp,
197 |             "equivariant-ot": self.equivariant_ot,
198 |             "batch-ot": self.batch_ot,
199 |             "time-alpha": self.time_alpha,
200 |             "time-beta": self.time_beta,
201 |             **prior_hparams,
202 |         }
203 | 
204 |         if self.fixed_time is not None:
205 |             hparams["fixed-interpolation-time"] = self.fixed_time
206 | 
207 |         return hparams
208 | 
209 |     def interpolate(self, to_mols: list[GeometricMol]) -> _GeometricInterpT:
210 |         batch_size = len(to_mols)
211 |         num_atoms = max([mol.seq_length for mol in to_mols])
212 | 
213 |         from_mols = [self.prior_sampler.sample_molecule(num_atoms) for _ in to_mols]
214 | 
215 |         # Choose best possible matches for the whole batch if using batch OT
216 |         if self.batch_ot:
217 |             from_mols = [mol.zero_com() for mol in from_mols]
218 |             to_mols = [mol.zero_com() for mol in to_mols]
219 |             from_mols = self._ot_map(from_mols, to_mols)
220 | 
221 |         # Within match_mols either just truncate noise to match size of data molecule
222 |         # Or also permute and rotate the noise to best match data molecule
223 |         else:
224 |             from_mols = [self._match_mols(from_mol, to_mol) for from_mol, to_mol in zip(from_mols, to_mols)]
225 | 
226 |         if self.fixed_time is not None:
227 |             times = torch.tensor([self.fixed_time] * batch_size)
228 |         else:
229 |             times = self.time_dist.sample((batch_size,))
230 | 
231 |         tuples = zip(from_mols, to_mols, times.tolist())
232 |         interp_mols = [self._interpolate_mol(from_mol, to_mol, t) for from_mol, to_mol, t in tuples]
233 |         return from_mols, to_mols, interp_mols, times
234 | 
235 |     def _ot_map(self, from_mols: list[GeometricMol], to_mols: list[GeometricMol]) -> list[GeometricMol]:
236 |         """Permute the from_mols batch so that it forms an approximate mini-batch OT map with to_mols"""
237 | 
238 |         mol_matrix = []
239 |         cost_matrix = []
240 | 
241 |         # Create matrix with to mols on outer axis and from mols on inner axis
242 |         for to_mol in to_mols:
243 |             best_from_mols = [self._match_mols(from_mol, to_mol) for from_mol in from_mols]
244 |             best_costs = [self._match_cost(mol, to_mol) for mol in best_from_mols]
245 |             mol_matrix.append(list(best_from_mols))
246 |             cost_matrix.append(list(best_costs))
247 | 
248 |         row_indices, col_indices = linear_sum_assignment(np.array(cost_matrix))
249 |         best_from_mols = [mol_matrix[r][c] for r, c in zip(row_indices, col_indices)]
250 |         return best_from_mols
251 | 
252 |     def _match_mols(self, from_mol: GeometricMol, to_mol: GeometricMol) -> GeometricMol:
253 |         """Permute the from_mol to best match the to_mol and return the permuted from_mol"""
254 | 
255 |         if to_mol.seq_length > from_mol.seq_length:
256 |             raise RuntimeError("from_mol must have at least as many atoms as to_mol.")
257 | 
258 |         # Find best permutation first, then best rotation
259 |         # As done in Equivariant Flow Matching (https://arxiv.org/abs/2306.15030)
260 | 
261 |         # Keep the same number of atoms as the data mol in the noise mol
262 |         from_mol = from_mol.permute(list(range(to_mol.seq_length)))
263 | 
264 |         if not self.equivariant_ot:
265 |             return from_mol
266 | 
267 |         cost_matrix = smolF.inter_distances(to_mol.coords.cpu(), from_mol.coords.cpu(), sqrd=True)
268 |         _, from_mol_indices = linear_sum_assignment(cost_matrix.numpy())
269 |         from_mol = from_mol.permute(from_mol_indices.tolist())
270 | 
271 |         padded_coords = smolF.pad_tensors([from_mol.coords.cpu(), to_mol.coords.cpu()])
272 |         from_mol_coords = padded_coords[0].numpy()
273 |         to_mol_coords = padded_coords[1].numpy()
274 | 
275 |         rotation, _ = Rotation.align_vectors(to_mol_coords, from_mol_coords)
276 |         from_mol = from_mol.rotate(rotation)
277 | 
278 |         return from_mol
279 | 
280 |     def _match_cost(self, from_mol: GeometricMol, to_mol: GeometricMol) -> float:
281 |         """Calculate MSE between mol coords as a match cost"""
282 | 
283 |         sqrd_dists = smolF.inter_distances(from_mol.coords.cpu(), to_mol.coords.cpu(), sqrd=True)
284 |         mse = sqrd_dists.mean().item()
285 |         return mse
286 | 
287 |     def _interpolate_mol(self, from_mol: GeometricMol, to_mol: GeometricMol, t: float) -> GeometricMol:
288 |         """Interpolates mols which have already been sampled according to OT map, if required"""
289 | 
290 |         if from_mol.seq_length != to_mol.seq_length:
291 |             raise RuntimeError("Both molecules to be interpolated must have the same number of atoms.")
292 | 
293 |         # Interpolate coords and add gaussian noise
294 |         coords_mean = (from_mol.coords * (1 - t)) + (to_mol.coords * t)
295 |         coords_noise = torch.randn_like(coords_mean) * self.coord_noise_std
296 |         coords = coords_mean + coords_noise
297 | 
298 |         if self.type_interpolation == "dirichlet":
299 |             to_atomics = torch.softmax(to_mol.atomics / self.type_dist_temp, dim=-1)
300 |             atomics_mean = (from_mol.atomics * (1 - t)) + (to_atomics * t)
301 |             atomics = torch.distributions.Dirichlet(atomics_mean).sample()
302 | 
303 |         elif self.type_interpolation == "unmask":
304 |             atom_mask = torch.rand(from_mol.seq_length) > t
305 |             to_atomics = torch.argmax(to_mol.atomics, dim=-1)
306 |             from_atomics = torch.argmax(from_mol.atomics, dim=-1)
307 |             to_atomics[atom_mask] = from_atomics[atom_mask]
308 |             atomics = smolF.one_hot_encode_tensor(to_atomics, to_mol.atomics.size(-1))
309 | 
310 |         # Interpolate bonds
311 |         if self.bond_interpolation == "dirichlet":
312 |             to_adj = torch.softmax(to_mol.adjacency / self.type_dist_temp, dim=-1)
313 |             adj_mean = (from_mol.adjacency * (1 - t)) + (to_adj * t)
314 |             interp_adj = torch.distributions.Dirichlet(adj_mean).sample()
315 | 
316 |         elif self.bond_interpolation == "unmask":
317 |             to_adj = torch.argmax(to_mol.adjacency, dim=-1)
318 |             from_adj = torch.argmax(from_mol.adjacency, dim=-1)
319 |             bond_mask = torch.rand_like(from_adj.float()) > t
320 |             to_adj[bond_mask] = from_adj[bond_mask]
321 |             interp_adj = smolF.one_hot_encode_tensor(to_adj, to_mol.adjacency.size(-1))
322 | 
323 |         bond_indices = torch.ones((from_mol.seq_length, from_mol.seq_length)).nonzero()
324 |         bond_types = interp_adj[bond_indices[:, 0], bond_indices[:, 1]]
325 | 
326 |         interp_mol = GeometricMol(coords, atomics, bond_indices=bond_indices, bond_types=bond_types)
327 |         return interp_mol
328 | 
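
The linear coordinate interpolation and the "unmask" categorical interpolation used in `_interpolate_mol` can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the repository's implementation: the helper names `interpolate_coords` and `unmask_interpolate` are hypothetical, atom types are plain integers rather than one-hot tensors, and the Gaussian coordinate noise is omitted.

```python
import random


def interpolate_coords(from_coords, to_coords, t):
    """Linear interpolation x_t = (1 - t) * x_0 + t * x_1, applied per coordinate."""
    return [
        [(1 - t) * a + t * b for a, b in zip(from_atom, to_atom)]
        for from_atom, to_atom in zip(from_coords, to_coords)
    ]


def unmask_interpolate(from_types, to_types, t, rng=random):
    """Keep the prior ("from") type where a uniform draw exceeds t, else take the data ("to") type."""
    if len(from_types) != len(to_types):
        raise ValueError("Both molecules must have the same number of atoms.")
    return [f if rng.random() > t else d for f, d in zip(from_types, to_types)]


# Toy inputs, chosen only for illustration
from_coords = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
to_coords = [[1.0, 1.0, 1.0], [4.0, 2.0, 0.0]]
midpoint = interpolate_coords(from_coords, to_coords, 0.5)

rng = random.Random(0)
at_start = unmask_interpolate([0, 0, 0], [1, 2, 3], 0.0, rng)  # t = 0 keeps the prior types
at_end = unmask_interpolate([0, 0, 0], [1, 2, 3], 1.0, rng)  # t = 1 reveals all data types
```

As `t` grows from 0 to 1, a larger share of atoms takes its type from the data molecule, which mirrors the `torch.rand(...) > t` mask in the method above.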


--------------------------------------------------------------------------------
/semlaflow/data/util.py:
--------------------------------------------------------------------------------
 1 | import math
 2 | import random
 3 | from typing import Optional
 4 | 
 5 | from torch.utils.data import RandomSampler, Sampler
 6 | 
 7 | 
 8 | class BucketBatchSampler(Sampler):
 9 |     def __init__(
10 |         self,
11 |         bucket_limits: list[int],
12 |         lengths: list[int],
13 |         batch_cost: float,
14 |         bucket_costs: Optional[list[float]] = None,
15 |         drop_last: bool = True,
16 |         round_batch_to_8: bool = False,
17 |     ):
18 | 
19 |         # Modern GPUs can be more efficient when data is provided as a multiple of 8 (for 16-bit training)
20 |         self.round_batch_to_8 = round_batch_to_8
21 |         self.drop_last = drop_last
22 | 
23 |         if bucket_costs is not None and len(bucket_costs) != len(bucket_limits):
24 |             raise ValueError("The number of costs and buckets must be the same.")
25 | 
26 |         if max(lengths) > max(bucket_limits):
27 |             raise ValueError("Largest length cannot be larger than largest bucket limit.")
28 | 
29 |         bucket_limits = sorted(bucket_limits)
30 | 
31 |         # Use a constant bucket cost by default
32 |         bucket_costs = [1] * len(bucket_limits) if bucket_costs is None else bucket_costs
33 | 
34 |         # Add indices to correct bucket based on seq length
35 |         buckets = [[] for _ in range(len(bucket_limits))]
36 |         for seq_idx, length in enumerate(lengths):
37 |             for b_idx, limit in enumerate(bucket_limits):
38 |                 if limit >= length:
39 |                     buckets[b_idx].append(seq_idx)
40 |                     break
41 | 
42 |         # TODO allow non-shuffled sampling
43 |         samplers = [RandomSampler(idxs, replacement=False) if len(idxs) > 0 else None for idxs in buckets]
44 |         bucket_batch_sizes = [self._round_batch_size(batch_cost / cost) for cost in bucket_costs]
45 | 
46 |         batches_per_bucket = []
47 |         for bucket, batch_size in zip(buckets, bucket_batch_sizes):
48 |             n_batches = int(len(bucket) // batch_size)
49 |             if not drop_last and n_batches * batch_size != len(bucket):
50 |                 n_batches += 1
51 | 
52 |             batches_per_bucket.append(n_batches)
53 | 
54 |         print()
55 |         print("items per bucket", [len(idxs) for idxs in buckets])
56 |         print("bucket batch sizes", bucket_batch_sizes)
57 |         print("batches per bucket", batches_per_bucket)
58 | 
59 |         self.buckets = buckets
60 |         self.samplers = samplers
61 |         self.bucket_batch_sizes = bucket_batch_sizes
62 |         self.batches_per_bucket = batches_per_bucket
63 | 
64 |     def __len__(self):
65 |         return sum(self.batches_per_bucket)
66 | 
67 |     def __iter__(self):
68 |         iters = [iter(sampler) if sampler is not None else None for sampler in self.samplers]
69 |         remaining_batches = self.batches_per_bucket[:]
70 |         remaining_items = [len(items) for items in self.buckets]
71 | 
72 |         while sum(remaining_batches) > 0:
73 |             b_idx = random.choices(range(len(remaining_batches)), weights=remaining_batches, k=1)[0]
74 |             if remaining_batches[b_idx] > 1 or self.drop_last:
75 |                 batch_size = self.bucket_batch_sizes[b_idx]
76 |             else:
77 |                 batch_size = remaining_items[b_idx]
78 | 
79 |             batch_idxs = [next(iters[b_idx]) for _ in range(batch_size)]
80 | 
81 |             # Samplers will produce indices into the list, so look up dataset indices using sampled bucket indices
82 |             batch = [self.buckets[b_idx][idx] for idx in batch_idxs]
83 | 
84 |             remaining_batches[b_idx] -= 1
85 |             remaining_items[b_idx] -= batch_size
86 | 
87 |             yield batch
88 | 
89 |     def _round_batch_size(self, batch_size):
90 |         if not self.round_batch_to_8:
91 |             bs = math.floor(batch_size)
92 |         else:
93 |             bs = 8 * round(batch_size / 8)
94 | 
95 |         bs = 1 if bs == 0 else bs
96 |         return bs
97 | 
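
The bucketing logic in `BucketBatchSampler.__init__` and `_round_batch_size` can be sketched in isolation as follows. This is a minimal standard-library illustration, not the class itself: the function names `assign_buckets` and `batch_size_for` are hypothetical, and the lengths, limits and costs are made-up values.

```python
import math


def assign_buckets(lengths, bucket_limits):
    """Place each sequence index into the first bucket whose limit is >= its length."""
    limits = sorted(bucket_limits)
    buckets = [[] for _ in limits]
    for seq_idx, length in enumerate(lengths):
        for b_idx, limit in enumerate(limits):
            if limit >= length:
                buckets[b_idx].append(seq_idx)
                break
    return buckets


def batch_size_for(batch_cost, bucket_cost, round_to_8=False):
    """Convert a cost budget into a batch size, never returning less than 1."""
    raw = batch_cost / bucket_cost
    size = 8 * round(raw / 8) if round_to_8 else math.floor(raw)
    return max(size, 1)


# Five sequences split across two buckets with limits 10 and 20
buckets = assign_buckets([5, 12, 9, 20, 3], [10, 20])
# A budget of 64 with per-item costs of 10 and 20 gives smaller batches for the larger bucket
sizes = [batch_size_for(64, cost) for cost in (10, 20)]
```

Because the batch size is derived from a per-bucket cost, buckets holding longer sequences receive fewer items per batch, which keeps the cost of each batch roughly constant.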


--------------------------------------------------------------------------------
/semlaflow/evaluate.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | from functools import partial
  3 | from pathlib import Path
  4 | 
  5 | import lightning as L
  6 | import numpy as np
  7 | import torch
  8 | 
  9 | import semlaflow.scriptutil as util
 10 | from semlaflow.data.datamodules import GeometricInterpolantDM
 11 | from semlaflow.data.datasets import GeometricDataset
 12 | from semlaflow.data.interpolate import GeometricInterpolant, GeometricNoiseSampler
 13 | from semlaflow.models.fm import Integrator, MolecularCFM
 14 | from semlaflow.models.semla import EquiInvDynamics, SemlaGenerator
 15 | 
 16 | # Default script arguments
 17 | DEFAULT_DATASET_SPLIT = "test"
 18 | DEFAULT_N_MOLECULES = 10000
 19 | DEFAULT_N_REPLICATES = 3
 20 | DEFAULT_BATCH_COST = 8192
 21 | DEFAULT_BUCKET_COST_SCALE = "linear"
 22 | DEFAULT_INTEGRATION_STEPS = 100
 23 | DEFAULT_CAT_SAMPLING_NOISE_LEVEL = 1
 24 | DEFAULT_ODE_SAMPLING_STRATEGY = "log"
 25 | 
 26 | 
 27 | def load_model(args, vocab):
 28 |     checkpoint = torch.load(args.ckpt_path)
 29 |     hparams = checkpoint["hyper_parameters"]
 30 | 
 31 |     hparams["compile_model"] = False
 32 |     hparams["integration-steps"] = args.integration_steps
 33 |     hparams["sampling_strategy"] = args.ode_sampling_strategy
 34 | 
 35 |     n_bond_types = util.get_n_bond_types(hparams["integration-type-strategy"])
 36 | 
 37 |     # Set default arch to semla if nothing has been saved
 38 |     if hparams.get("architecture") is None:
 39 |         hparams["architecture"] = "semla"
 40 | 
 41 |     if hparams["architecture"] == "semla":
 42 |         dynamics = EquiInvDynamics(
 43 |             hparams["d_model"],
 44 |             hparams["d_message"],
 45 |             hparams["n_coord_sets"],
 46 |             hparams["n_layers"],
 47 |             n_attn_heads=hparams["n_attn_heads"],
 48 |             d_message_hidden=hparams["d_message_hidden"],
 49 |             d_edge=hparams["d_edge"],
 50 |             self_cond=hparams["self_cond"],
 51 |             coord_norm=hparams["coord_norm"],
 52 |         )
 53 |         egnn_gen = SemlaGenerator(
 54 |             hparams["d_model"],
 55 |             dynamics,
 56 |             vocab.size,
 57 |             hparams["n_atom_feats"],
 58 |             d_edge=hparams["d_edge"],
 59 |             n_edge_types=n_bond_types,
 60 |             self_cond=hparams["self_cond"],
 61 |             size_emb=hparams["size_emb"],
 62 |             max_atoms=hparams["max_atoms"],
 63 |         )
 64 | 
 65 |     elif hparams["architecture"] == "eqgat":
 66 |         from semlaflow.models.eqgat import EqgatGenerator
 67 | 
 68 |         egnn_gen = EqgatGenerator(
 69 |             hparams["d_model"],
 70 |             hparams["n_layers"],
 71 |             hparams["n_equi_feats"],
 72 |             vocab.size,
 73 |             hparams["n_atom_feats"],
 74 |             hparams["d_edge"],
 75 |             hparams["n_edge_types"],
 76 |         )
 77 | 
 78 |     elif hparams["architecture"] == "egnn":
 79 |         from semlaflow.models.egnn import VanillaEgnnGenerator
 80 | 
 81 |         n_layers = args.n_layers if hparams.get("n_layers") is None else hparams["n_layers"]
 82 |         if n_layers is None:
 83 |             raise ValueError("No hparam for n_layers was saved, use script arg to provide n_layers")
 84 | 
 85 |         egnn_gen = VanillaEgnnGenerator(
 86 |             hparams["d_model"],
 87 |             n_layers,
 88 |             vocab.size,
 89 |             hparams["n_atom_feats"],
 90 |             d_edge=hparams["d_edge"],
 91 |             n_edge_types=n_bond_types,
 92 |         )
 93 | 
 94 |     else:
 95 |         raise ValueError(f"Unknown architecture hyperparameter: {hparams['architecture']}")
 96 | 
 97 |     type_mask_index = (
 98 |         vocab.indices_from_tokens(["<MASK>"])[0] if hparams["train-type-interpolation"] == "mask" else None
 99 |     )
100 |     bond_mask_index = None
101 | 
102 |     integrator = Integrator(
103 |         args.integration_steps,
104 |         type_strategy=hparams["integration-type-strategy"],
105 |         bond_strategy=hparams["integration-bond-strategy"],
106 |         type_mask_index=type_mask_index,
107 |         bond_mask_index=bond_mask_index,
108 |         cat_noise_level=args.cat_sampling_noise_level,
109 |     )
110 |     fm_model = MolecularCFM.load_from_checkpoint(
111 |         args.ckpt_path,
112 |         gen=egnn_gen,
113 |         vocab=vocab,
114 |         integrator=integrator,
115 |         type_mask_index=type_mask_index,
116 |         bond_mask_index=bond_mask_index,
117 |         **hparams,
118 |     )
119 |     return fm_model
120 | 
121 | 
122 | def build_dm(args, hparams, vocab):
123 |     if args.dataset == "qm9":
124 |         coord_std = util.QM9_COORDS_STD_DEV
125 |         bucket_limits = util.QM9_BUCKET_LIMITS
126 | 
127 |     elif args.dataset == "geom-drugs":
128 |         coord_std = util.GEOM_COORDS_STD_DEV
129 |         bucket_limits = util.GEOM_DRUGS_BUCKET_LIMITS
130 | 
131 |     else:
132 |         raise ValueError(f"Unknown dataset {args.dataset}")
133 | 
134 |     n_bond_types = 5
135 |     transform = partial(util.mol_transform, vocab=vocab, n_bonds=n_bond_types, coord_std=coord_std)
136 | 
137 |     # Fail early on an unknown split instead of hitting an unbound dataset_path below
138 |     if args.dataset_split not in ("train", "val", "test"):
139 |         raise ValueError(f"Unknown dataset split {args.dataset_split}")
140 | 
141 |     split_file = f"{args.dataset_split}.smol"
142 |     dataset_path = Path(args.data_path) / split_file
143 | 
144 |     dataset = GeometricDataset.load(dataset_path, transform=transform)
145 |     dataset = dataset.sample(args.n_molecules, replacement=True)
146 | 
147 |     type_mask_index = vocab.indices_from_tokens(["<MASK>"])[0] if hparams["val-type-interpolation"] == "mask" else None
148 |     bond_mask_index = None
149 | 
150 |     prior_sampler = GeometricNoiseSampler(
151 |         vocab.size,
152 |         n_bond_types,
153 |         coord_noise="gaussian",
154 |         type_noise=hparams["val-prior-type-noise"],
155 |         bond_noise=hparams["val-prior-bond-noise"],
156 |         scale_ot=hparams["val-prior-noise-scale-ot"],
157 |         zero_com=True,
158 |         type_mask_index=type_mask_index,
159 |         bond_mask_index=bond_mask_index,
160 |     )
161 |     eval_interpolant = GeometricInterpolant(
162 |         prior_sampler,
163 |         coord_interpolation="linear",
164 |         type_interpolation=hparams["val-type-interpolation"],
165 |         bond_interpolation=hparams["val-bond-interpolation"],
166 |         equivariant_ot=False,
167 |         batch_ot=False,
168 |     )
169 |     dm = GeometricInterpolantDM(
170 |         None,
171 |         None,
172 |         dataset,
173 |         args.batch_cost,
174 |         test_interpolant=eval_interpolant,
175 |         bucket_limits=bucket_limits,
176 |         bucket_cost_scale=args.bucket_cost_scale,
177 |         pad_to_bucket=False,
178 |     )
179 |     return dm
180 | 
181 | 
182 | def dm_from_ckpt(args, vocab):
183 |     checkpoint = torch.load(args.ckpt_path)
184 |     hparams = checkpoint["hyper_parameters"]
185 |     dm = build_dm(args, hparams, vocab)
186 |     return dm
187 | 
188 | 
189 | def evaluate(args, model, dm, metrics, stab_metrics):
190 |     results_list = []
191 |     for replicate_index in range(args.n_replicates):
192 |         print(f"Running replicate {replicate_index + 1} out of {args.n_replicates}")
193 |         molecules, _, stabilities = util.generate_molecules(
194 |             model, dm, args.integration_steps, args.ode_sampling_strategy, stabilities=True
195 |         )
196 | 
197 |         print("Calculating metrics...")
198 |         results = util.calc_metrics_(molecules, metrics, stab_metrics=stab_metrics, mol_stabs=stabilities)
199 |         results_list.append(results)
200 | 
201 |     results_dict = {key: [] for key in results_list[0].keys()}
202 |     for results in results_list:
203 |         for metric, value in results.items():
204 |             results_dict[metric].append(value.item())
205 | 
206 |     mean_results = {metric: np.mean(values) for metric, values in results_dict.items()}
207 |     std_results = {metric: np.std(values) for metric, values in results_dict.items()}
208 | 
209 |     return mean_results, std_results, results_dict
210 | 
211 | 
212 | def main(args):
213 |     print(f"Running evaluation script for {args.n_replicates} replicates with {args.n_molecules} molecules each...")
214 |     print(f"Using model stored at {args.ckpt_path}")
215 | 
216 |     if args.n_replicates < 1:
217 |         raise ValueError("n_replicates must be at least 1.")
218 | 
219 |     L.seed_everything(12345)
220 |     util.disable_lib_stdout()
221 |     util.configure_fs()
222 | 
223 |     print("Building model vocab...")
224 |     vocab = util.build_vocab()
225 |     print("Vocab complete.")
226 | 
227 |     print("Loading datamodule...")
228 |     dm = dm_from_ckpt(args, vocab)
229 |     print("Datamodule complete.")
230 | 
231 |     print("Loading model...")
232 |     model = load_model(args, vocab)
233 |     print("Model complete.")
234 | 
235 |     print("Initialising metrics...")
236 |     metrics, stab_metrics = util.init_metrics(args.data_path, model)
237 |     print("Metrics complete.")
238 | 
239 |     print("Running evaluation...")
240 |     avg_results, std_results, list_results = evaluate(args, model, dm, metrics, stab_metrics)
241 |     print("Evaluation complete.")
242 | 
243 |     util.print_results(avg_results, std_results=std_results)
244 | 
245 |     print("All replicate results...")
246 |     print(f"{'Metric':<22}Result")
247 |     print("-" * 30)
248 | 
249 |     for metric, results_list in list_results.items():
250 |         print(f"{metric:<22}{results_list}")
251 |     print()
252 | 
253 | 
254 | if __name__ == "__main__":
255 |     parser = argparse.ArgumentParser()
256 | 
257 |     parser.add_argument("--ckpt_path", type=str)
258 |     parser.add_argument("--data_path", type=str)
259 |     parser.add_argument("--dataset", type=str)
260 | 
261 |     parser.add_argument("--batch_cost", type=int, default=DEFAULT_BATCH_COST)
262 |     parser.add_argument("--dataset_split", type=str, default=DEFAULT_DATASET_SPLIT)
263 |     parser.add_argument("--n_molecules", type=int, default=DEFAULT_N_MOLECULES)
264 |     parser.add_argument("--n_replicates", type=int, default=DEFAULT_N_REPLICATES)
265 |     parser.add_argument("--integration_steps", type=int, default=DEFAULT_INTEGRATION_STEPS)
266 |     parser.add_argument("--cat_sampling_noise_level", type=float, default=DEFAULT_CAT_SAMPLING_NOISE_LEVEL)
267 |     parser.add_argument("--ode_sampling_strategy", type=str, default=DEFAULT_ODE_SAMPLING_STRATEGY)
268 | 
269 |     parser.add_argument("--bucket_cost_scale", type=str, default=DEFAULT_BUCKET_COST_SCALE)
270 | 
271 |     # Allow overriding for EGNN arch since some models were not saved with a value for n_layers
272 |     parser.add_argument("--n_layers", type=int, default=None)
273 | 
274 |     args = parser.parse_args()
275 |     main(args)
276 | 
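
The aggregation step at the end of `evaluate` (collect each metric across replicates, then report mean and standard deviation) can be sketched without NumPy. This is a hedged illustration: `aggregate_replicates` is a hypothetical helper, and the metric names and values below are placeholders, not results from the model.

```python
import statistics


def aggregate_replicates(results_list):
    """Collect per-metric values across replicates and return (means, stds, raw values)."""
    if not results_list:
        raise ValueError("At least one replicate is required.")
    per_metric = {key: [] for key in results_list[0]}
    for results in results_list:
        for metric, value in results.items():
            per_metric[metric].append(float(value))
    means = {m: statistics.fmean(v) for m, v in per_metric.items()}
    # np.std defaults to the population standard deviation (ddof=0); pstdev matches that
    stds = {m: statistics.pstdev(v) for m, v in per_metric.items()}
    return means, stds, per_metric


# Placeholder values for two replicates, for illustration only
replicates = [
    {"validity": 0.90, "uniqueness": 1.0},
    {"validity": 0.94, "uniqueness": 1.0},
]
means, stds, raw = aggregate_replicates(replicates)
```

With only a few replicates the population and sample standard deviations differ noticeably, so it is worth stating which one is reported alongside the results.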


--------------------------------------------------------------------------------
/semlaflow/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rssrwn/semla-flow/65d7106bc907e4d136deec8de7417386e3f9b10b/semlaflow/models/__init__.py


--------------------------------------------------------------------------------
/semlaflow/models/egnn.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Original EGNN implementation. Keep this mostly separate from our implementations so that it remains consistent with
  3 | the original version of the model.
  4 | """
  5 | 
  6 | import torch
  7 | 
  8 | import semlaflow.util.functional as smolF
  9 | from semlaflow.models.semla import MolecularGenerator
 10 | 
 11 | 
 12 | class VanillaEgnnLayer(torch.nn.Module):
 13 |     def __init__(self, d_model, in_edge_feats, d_pred_edge=None, norm=False, eps=1e-5):
 14 |         super().__init__()
 15 | 
 16 |         self.d_model = d_model
 17 |         self.in_edge_feats = in_edge_feats
 18 |         self.d_pred_edge = d_pred_edge
 19 |         self.norm = norm
 20 |         self.eps = eps
 21 | 
 22 |         input_feats = (d_model * 2) + in_edge_feats + 1
 23 |         phi_e_out = d_model if d_pred_edge is None else d_model + d_pred_edge
 24 | 
 25 |         self.phi_e = torch.nn.Sequential(
 26 |             torch.nn.Linear(input_feats, d_model), torch.nn.SiLU(), torch.nn.Linear(d_model, phi_e_out), torch.nn.SiLU()
 27 |         )
 28 | 
 29 |         self.phi_att = torch.nn.Sequential(torch.nn.Linear(d_model, 1), torch.nn.Sigmoid())
 30 | 
 31 |         self.phi_h = torch.nn.Sequential(
 32 |             torch.nn.Linear(d_model * 2, d_model), torch.nn.SiLU(), torch.nn.Linear(d_model, d_model)
 33 |         )
 34 | 
 35 |         self.phi_x = torch.nn.Sequential(
 36 |             torch.nn.Linear(input_feats, d_model),
 37 |             torch.nn.SiLU(),
 38 |             torch.nn.Linear(d_model, d_model),
 39 |             torch.nn.SiLU(),
 40 |             torch.nn.Linear(d_model, 1),
 41 |         )
 42 | 
 43 |         if norm:
 44 |             self.norm_layer = torch.nn.LayerNorm(d_model)
 45 | 
 46 |     def forward(self, coords, inv_feats, adj_matrix, atom_mask, edge_feats):
 47 |         """Pass data through the layer
 48 | 
 49 |         Args:
 50 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
 51 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
 52 |             adj_matrix (torch.Tensor): Adjacency matrix, shape [batch_size, n_atoms, n_atoms], 1 for connected
 53 |             atom_mask (torch.Tensor, Optional): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
 54 |             edge_feats (torch.Tensor, Optional): In edge features, shape [batch_size, n_nodes, n_nodes, d_edge]
 55 | 
 56 |         Returns:
 57 |             Tuple of the new node coordinates and new node features, plus edge predictions if d_pred_edge is set
 58 |         """
 59 | 
 60 |         atom_mask = atom_mask.unsqueeze(2)
 61 | 
 62 |         # Add distances to edge features, then compute messages
 63 |         sqrd_dists = smolF.calc_distances(coords, sqrd=True).unsqueeze(-1)
 64 |         edge_feats = torch.cat((edge_feats, sqrd_dists), dim=-1)
 65 | 
 66 |         edge_messages = self._compute_edge_messages(inv_feats, edge_feats)
 67 |         if self.d_pred_edge is not None:
 68 |             edge_pred = edge_messages[:, :, :, self.d_model :]
 69 |             edge_messages = edge_messages[:, :, :, : self.d_model]
 70 | 
 71 |         attentions = self.phi_att(edge_messages)
 72 |         edge_messages = attentions * edge_messages
 73 |         edge_messages = edge_messages * adj_matrix.unsqueeze(-1)
 74 | 
 75 |         # Compute new node features
 76 |         node_messages = edge_messages.sum(dim=2)
 77 |         in_feats = torch.cat((inv_feats, node_messages), dim=-1)
 78 |         out_node_feats = self.phi_h(in_feats)
 79 | 
 80 |         # Compute new node coords
 81 |         coord_updates = self._compute_coord_updates(coords, inv_feats, edge_feats, adj_matrix, atom_mask)
 82 |         out_coords = coords + coord_updates
 83 | 
 84 |         out_node_feats = out_node_feats * atom_mask
 85 |         out_coords = out_coords * atom_mask
 86 | 
 87 |         if self.norm:
 88 |             out_node_feats = self.norm_layer(out_node_feats)
 89 | 
 90 |         if self.d_pred_edge is not None:
 91 |             return out_coords, out_node_feats, edge_pred
 92 | 
 93 |         return out_coords, out_node_feats
 94 | 
 95 |     def _compute_edge_messages(self, node_feats, edge_feats):
 96 |         """Computes messages for all edges; attention is applied afterwards in forward
 97 | 
 98 |         Args:
 99 |             node_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
100 |             edge_feats (torch.Tensor, Optional): In edge features, shape [batch_size, n_nodes, n_nodes, d_edge]
101 | 
102 |         Returns:
103 |             (torch.Tensor) Message tensor, shape [batch_size, n_nodes, n_nodes, d_model]
104 |         """
105 | 
106 |         batch_size, n_nodes, _ = tuple(node_feats.shape)
107 | 
108 |         node_i = node_feats.unsqueeze(2).expand(batch_size, n_nodes, n_nodes, -1)
109 |         node_j = node_feats.unsqueeze(1).expand(batch_size, n_nodes, n_nodes, -1)
110 | 
111 |         in_feats = torch.cat((node_i, node_j, edge_feats), dim=3)
112 |         messages = self.phi_e(in_feats)
113 | 
114 |         return messages
115 | 
116 |     def _compute_coord_updates(self, coords, node_feats, edge_feats, adj_matrix, atom_mask):
117 |         """Computes coordinate updates by summing over edges with scalar attention
118 | 
119 |         Args:
120 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
121 |             node_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
122 |             edge_feats (torch.Tensor, Optional): In edge features, shape [batch_size, n_nodes, n_nodes, d_edge]
123 |             adj_matrix (torch.Tensor): Adjacency matrix, shape [batch_size, n_atoms, n_atoms], 1 for connected
124 |             atom_mask (torch.Tensor, Optional): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
125 | 
126 |         Returns:
127 |             (torch.Tensor) Coordinate updates, shape [batch_size, n_nodes, 3]
128 |         """
129 | 
130 |         batch_size, n_nodes, _ = tuple(node_feats.shape)
131 | 
132 |         node_i = node_feats.unsqueeze(2).expand(batch_size, n_nodes, n_nodes, -1)
133 |         node_j = node_feats.unsqueeze(1).expand(batch_size, n_nodes, n_nodes, -1)
134 | 
135 |         in_feats = torch.cat((node_i, node_j, edge_feats), dim=3)
136 |         edge_attn = self.phi_x(in_feats)
137 | 
138 |         # Compute a vector for each edge using the coord diff, edge attention score and a normaliser
139 |         # coord_diffs = coords[batch_index, edge_is, :] - coords[batch_index, edge_js, :]
140 |         coord_diffs = coords.unsqueeze(-2) - coords.unsqueeze(-3)
141 |         normalisers = torch.sqrt(torch.sum(coord_diffs * coord_diffs, dim=-1) + self.eps) + 1
142 |         weighted_edges = (coord_diffs * edge_attn) / normalisers.unsqueeze(-1)
143 |         weighted_edges = weighted_edges * adj_matrix.unsqueeze(-1)
144 | 
145 |         # Sum over all of a node's edges to get a coordinate update for that node
146 |         coord_updates = weighted_edges.sum(dim=2)
147 | 
148 |         # Take average over number of edges to reduce size of coord updates
149 |         # *** Not part of vanilla ***
150 |         num_nodes = atom_mask.sum(dim=1) + 1
151 |         coord_updates = coord_updates / num_nodes.view(-1, 1, 1)
152 | 
153 |         return coord_updates
154 | 
155 | 
156 | class VanillaEgnnDynamics(torch.nn.Module):
157 |     def __init__(self, d_model, n_layers, d_edge):
158 |         super().__init__()
159 | 
160 |         self.d_model = d_model
161 |         self.n_layers = n_layers
162 |         self.d_edge = d_edge
163 | 
164 |         in_edge_feats = 1
165 |         layers = [VanillaEgnnLayer(d_model, in_edge_feats, norm=True) for _ in range(n_layers - 2)]
166 | 
167 |         self.layers = torch.nn.ModuleList(layers)
168 |         self.enc_layer = VanillaEgnnLayer(d_model, in_edge_feats + d_edge, norm=True)
169 |         self.dec_layer = VanillaEgnnLayer(d_model, in_edge_feats, d_pred_edge=d_edge, norm=True)
170 | 
171 |     def forward(self, coords, inv_feats, adj_matrix, atom_mask=None, edge_feats=None):
172 |         """Generate molecular coordinates and atom features
173 | 
174 |         Args:
175 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
176 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
177 |             adj_matrix (torch.Tensor): Adjacency matrix, shape [batch_size, n_atoms, n_atoms], 1 for connected
178 |             atom_mask (torch.Tensor, Optional): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
179 |             edge_feats (torch.Tensor, Optional): In edge features, shape [batch_size, n_nodes, n_nodes, d_edge]
180 | 
181 |         Returns:
182 |             (coords, atom feats, edge feats)
183 |             All torch.Tensor, shapes:
184 |                 Coordinates [batch_size, n_atoms, 3],
185 |                 Atom feats [batch_size, n_atoms, d_model]
186 |                 Edge feats [batch_size, n_atoms, n_atoms, d_edge]
187 |         """
188 | 
189 |         # Compute initial distance between edges as our only edge feature
190 |         dist_feats = smolF.calc_distances(coords, sqrd=True).unsqueeze(-1)
191 | 
192 |         edge_feats = torch.cat((dist_feats, edge_feats), dim=-1)
193 |         edge_feats = edge_feats * adj_matrix.unsqueeze(-1)
194 | 
195 |         coords, inv_feats = self.enc_layer(coords, inv_feats, adj_matrix, atom_mask, edge_feats)
196 | 
197 |         # Update coords and node feats using the model EGNN layers
198 |         # Remove CoM from predicted coords before passing to next layer
199 |         for layer in self.layers:
200 |             coords = smolF.zero_com(coords, node_mask=atom_mask)
201 |             coords, inv_feats = layer(coords, inv_feats, adj_matrix, atom_mask, dist_feats)
202 | 
203 |         coords, inv_feats, pred_edges = self.dec_layer(coords, inv_feats, adj_matrix, atom_mask, dist_feats)
204 |         coords = smolF.zero_com(coords, node_mask=atom_mask)
205 | 
206 |         return coords, inv_feats, pred_edges
207 | 
208 | 
209 | class VanillaEgnnGenerator(MolecularGenerator):
210 |     def __init__(
211 |         self,
212 |         d_model,
213 |         n_layers,
214 |         vocab_size,
215 |         n_atom_feats,
216 |         d_edge,
217 |         n_edge_types,
218 |         self_cond=False,
219 |     ):
220 |         if self_cond:
221 |             raise NotImplementedError("Self conditioning not implemented for EGNN")
222 | 
223 |         hparams = {
224 |             "d_model": d_model,
225 |             "n_layers": n_layers,
226 |             "vocab_size": vocab_size,
227 |             "n_atom_feats": n_atom_feats,
228 |             "d_edge": d_edge,
229 |             "n_edge_types": n_edge_types,
230 |             "self_cond": self_cond,
231 |         }
232 | 
233 |         super().__init__(**hparams)
234 | 
235 |         self.feat_proj = torch.nn.Linear(n_atom_feats, d_model)
236 |         self.dynamics = VanillaEgnnDynamics(d_model, n_layers, d_edge)
237 | 
238 |         self.edge_in_proj = torch.nn.Sequential(
239 |             torch.nn.Linear(n_edge_types, d_edge), torch.nn.SiLU(inplace=False), torch.nn.Linear(d_edge, d_edge)
240 |         )
241 |         self.edge_out_proj = torch.nn.Sequential(
242 |             torch.nn.Linear(d_edge, d_edge), torch.nn.SiLU(inplace=False), torch.nn.Linear(d_edge, n_edge_types)
243 |         )
244 | 
245 |         self.classifier_head = torch.nn.Sequential(
246 |             torch.nn.Linear(d_model, d_model), torch.nn.SiLU(inplace=False), torch.nn.Linear(d_model, vocab_size)
247 |         )
248 |         self.charge_head = torch.nn.Sequential(
249 |             torch.nn.Linear(d_model, d_model), torch.nn.SiLU(inplace=False), torch.nn.Linear(d_model, 7)
250 |         )
251 | 
252 |     def forward(
253 |         self,
254 |         coords,
255 |         inv_feats,
256 |         edge_feats=None,
257 |         cond_coords=None,
258 |         cond_atomics=None,
259 |         cond_bonds=None,
260 |         atom_mask=None,
261 |     ):
262 |         """Predict molecular coordinates and atom types
263 | 
264 |         Args:
265 |             coords (torch.Tensor): Input coordinates, shape [batch_size, num_atoms, 3]
266 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, num_atoms, num_feats]
267 |             atom_mask (torch.Tensor, Optional): Mask for real and dummy atoms, shape [batch_size, num_atoms],
268 |                     1 for real atom 0 otherwise
269 | 
270 |         Returns:
271 |             (predicted coordinates, atom type logits, bond type logits, charge logits)
272 |             All torch.Tensor; coordinates [batch_size, num_atoms, 3], type logits [batch_size, num_atoms, vocab_size]
273 |         """
274 | 
275 |         if edge_feats is None:
276 |             raise ValueError("edge_feats must be provided.")
277 | 
278 |         atom_mask = torch.ones_like(coords[..., 0]) if atom_mask is None else atom_mask
279 |         adj_matrix = smolF.edges_from_nodes(coords, node_mask=atom_mask)
280 | 
281 |         atom_feats = self.feat_proj(inv_feats)
282 |         edge_feats = self.edge_in_proj(edge_feats.float())
283 | 
284 |         pred_coords, pred_feats, pred_bonds = self.dynamics(
285 |             coords, atom_feats, adj_matrix, atom_mask=atom_mask, edge_feats=edge_feats
286 |         )
287 | 
288 |         type_logits = self.classifier_head(pred_feats)
289 |         charge_logits = self.charge_head(pred_feats)
290 | 
291 |         pred_edges = pred_bonds + pred_bonds.transpose(1, 2)
292 |         edge_logits = self.edge_out_proj(pred_edges)
293 | 
294 |         return pred_coords, type_logits, edge_logits, charge_logits
295 | 
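
The coordinate update in `_compute_coord_updates` can be written out for a single molecule in plain Python. This is a sketch under simplifying assumptions, not the layer itself: `coord_updates` is a hypothetical helper, the per-edge weights are fixed numbers rather than outputs of `phi_x`, and the graph is treated as fully connected with no padding atoms.

```python
import math


def coord_updates(coords, weights, eps=1e-5):
    """Per atom i: sum_j w_ij * (x_i - x_j) / (sqrt(|x_i - x_j|^2 + eps) + 1), divided by n + 1."""
    n = len(coords)
    updates = []
    for i in range(n):
        total = [0.0, 0.0, 0.0]
        for j in range(n):
            diff = [a - b for a, b in zip(coords[i], coords[j])]
            norm = math.sqrt(sum(d * d for d in diff) + eps) + 1
            for k in range(3):
                total[k] += weights[i][j] * diff[k] / norm
        # Extra averaging over (n + 1), marked in the source as not part of vanilla EGNN
        updates.append([c / (n + 1) for c in total])
    return updates


# Toy molecule with three atoms and uniform off-diagonal weights (illustrative values)
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
weights = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
base = coord_updates(coords, weights)

# The update depends only on coordinate differences, so translating the input leaves it unchanged
shifted_coords = [[x + 10.0, y - 5.0, z + 3.0] for x, y, z in coords]
shifted = coord_updates(shifted_coords, weights)
```

Since the update is built from coordinate differences scaled by invariant weights, adding it to the input coordinates preserves translation equivariance, which is the property the layer relies on.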


--------------------------------------------------------------------------------
/semlaflow/models/eqgat.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import torch.nn.functional as F
  3 | 
  4 | import semlaflow.util.functional as smolF
  5 | from semlaflow.models.semla import CoordNorm, MolecularGenerator
  6 | 
  7 | 
  8 | def adj_to_attn_mask(adj_matrix, pos_inf=False):
  9 |     """Assumes adj_matrix is only 0s and 1s"""
 10 | 
 11 |     inf = float("inf") if pos_inf else float("-inf")
 12 |     attn_mask = torch.zeros_like(adj_matrix.float())
 13 |     attn_mask[adj_matrix == 0] = inf
 14 | 
 15 |     # Ensure nodes with no connections (fake nodes) don't have all -inf in the attn mask
 16 |     # Otherwise we would have problems when softmaxing
 17 |     n_nodes = adj_matrix.sum(dim=-1)
 18 |     attn_mask[n_nodes == 0] = 0.0
 19 | 
 20 |     return attn_mask
 21 | 
 22 | 
 23 | class GatedEquiUpdate(torch.nn.Module):
 24 |     def __init__(self, d_model, n_equi_feats, eps=1e-5):
 25 |         super().__init__()
 26 | 
 27 |         self.d_model = d_model
 28 |         self.n_equi_feats = n_equi_feats
 29 |         self.eps = eps
 30 | 
 31 |         self.equi_proj = torch.nn.Linear(n_equi_feats, 2 * n_equi_feats, bias=False)
 32 |         self.inv_proj = torch.nn.Linear(d_model + n_equi_feats, d_model + n_equi_feats)
 33 | 
 34 |     def forward(self, inv_feats, equi_feats):
 35 |         """Pass data through one layer of the model
 36 | 
 37 |         Args:
 38 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
 39 |             equi_feats (torch.Tensor): Equivariant atom features, shape [batch_size, n_atoms, n_equi_feats, 3]
 40 | 
 41 |         Returns:
 42 |             (atom feats, equi feats)
 43 |             All torch.Tensor, shapes:
 44 |                 Atom feats [batch_size, n_atoms, d_model]
 45 |                 Equi feats [batch_size, n_atoms, n_equi_feats, 3]
 46 |         """
 47 | 
 48 |         equi_feats_proj = self.equi_proj(equi_feats.transpose(2, 3)).transpose(2, 3)
 49 |         equi_feats_out = equi_feats_proj[:, :, : self.n_equi_feats, :]
 50 |         norms = torch.linalg.vector_norm(equi_feats_proj[:, :, self.n_equi_feats :, :], dim=-1) + self.eps
 51 | 
 52 |         inv_feats_cat = torch.cat((inv_feats, norms), dim=-1)
 53 |         inv_feats_proj = self.inv_proj(inv_feats_cat)
 54 |         inv_feats_out = inv_feats_proj[:, :, : self.d_model]
 55 |         inv_gate_feats = inv_feats_proj[:, :, self.d_model :]
 56 | 
 57 |         equi_feats_out = equi_feats_out * inv_gate_feats.unsqueeze(-1)
 58 | 
 59 |         return inv_feats_out, equi_feats_out
 60 | 
 61 | 
 62 | class EqgatLayer(torch.nn.Module):
 63 |     def __init__(self, d_model, n_equi_feats, d_edge, eps=1e-5):
 64 |         super().__init__()
 65 | 
 66 |         self.d_model = d_model
 67 |         self.n_equi_feats = n_equi_feats
 68 |         self.d_edge = d_edge
 69 |         self.eps = eps
 70 | 
 71 |         pairwise_input_feats = (2 * d_model) + d_edge + 4
 72 |         pairwise_output_feats = (2 * n_equi_feats) + d_model + d_edge + 1
 73 | 
 74 |         self.pairwise_mlp = torch.nn.Sequential(
 75 |             torch.nn.Linear(pairwise_input_feats, d_model),
 76 |             torch.nn.SiLU(),
 77 |             torch.nn.Linear(d_model, pairwise_output_feats),
 78 |         )
 79 | 
 80 |         self.node_proj = torch.nn.Linear(d_model, d_model, bias=False)
 81 |         self.equi_proj = torch.nn.Linear(n_equi_feats, n_equi_feats, bias=False)
 82 |         self.edge_in_proj = torch.nn.Linear(d_edge, d_edge, bias=False)
 83 |         self.edge_out_proj = torch.nn.Linear(d_edge, d_edge, bias=False)
 84 | 
 85 |         self.inv_norm = torch.nn.LayerNorm(d_model)
 86 |         self.coord_norm = CoordNorm(1)
 87 |         self.equi_norm = CoordNorm(n_equi_feats)
 88 | 
 89 |         self.gated_update = GatedEquiUpdate(d_model, n_equi_feats, eps=eps)
 90 | 
 91 |     def forward(self, coords, inv_feats, equi_feats, adj_matrix, atom_mask, edge_feats):
 92 |         """Pass data through one layer of the model
 93 | 
 94 |         Args:
 95 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
 96 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
 97 |             equi_feats (torch.Tensor): Equivariant atom features, shape [batch_size, n_atoms, n_equi_feats, 3]
 98 |             adj_matrix (torch.Tensor): Adjacency matrix, shape [batch_size, n_atoms, n_atoms], 1 for connected
 99 |             atom_mask (torch.Tensor): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
100 |             edge_feats (torch.Tensor): Input edge features, shape [batch_size, n_atoms, n_atoms, d_edge]
101 | 
102 |         Returns:
103 |             (coords, atom feats, equi feats, edge feats)
104 |             All torch.Tensor, shapes:
105 |                 Coordinates [batch_size, n_atoms, 3],
106 |                 Atom feats [batch_size, n_atoms, d_model]
107 |                 Equi feats [batch_size, n_atoms, n_equi_feats, 3]
108 |                 Edge feats [batch_size, n_atoms, n_atoms, d_edge]
109 |         """
110 | 
111 |         batch_size, n_nodes, _ = tuple(coords.shape)
112 | 
113 |         coord_norms = torch.linalg.vector_norm(coords, dim=-1).unsqueeze(-1)
114 |         atom_feats = torch.cat((inv_feats, coord_norms), dim=-1)
115 | 
116 |         node_i = atom_feats.unsqueeze(2).expand(batch_size, n_nodes, n_nodes, -1)
117 |         node_j = atom_feats.unsqueeze(1).expand(batch_size, n_nodes, n_nodes, -1)
118 | 
119 |         distances = smolF.calc_distances(coords).unsqueeze(-1)
120 |         dotprods = torch.bmm(coords, coords.transpose(1, 2)).unsqueeze(-1)
121 | 
122 |         proj_edge_feats = self.edge_in_proj(edge_feats)
123 |         atom_in_feats = torch.cat((node_i, node_j, proj_edge_feats, distances, dotprods), dim=3)
124 | 
125 |         c_start = self.n_equi_feats + self.d_model
126 |         d_start = self.d_model + (2 * self.n_equi_feats)
127 |         d_end = self.d_model + (2 * self.n_equi_feats) + self.d_edge
128 | 
129 |         pairwise_mlp_out = self.pairwise_mlp(atom_in_feats)
130 |         a = pairwise_mlp_out[:, :, :, : self.d_model]
131 |         b = pairwise_mlp_out[:, :, :, self.d_model : c_start]
132 |         c = pairwise_mlp_out[:, :, :, c_start:d_start]
133 |         d = pairwise_mlp_out[:, :, :, d_start:d_end]
134 |         s = pairwise_mlp_out[:, :, :, d_end : d_end + 1]
135 | 
136 |         # Compute attention weights
137 |         attn_mask = adj_to_attn_mask(adj_matrix)
138 |         messages = a + attn_mask.unsqueeze(3)
139 |         attentions = torch.softmax(messages, dim=2)
140 | 
141 |         # Apply attention to projected node features
142 |         proj_atom_feats = self.node_proj(inv_feats)
143 |         scaled_attentions = proj_atom_feats.unsqueeze(2) * attentions
144 |         node_feats_out = inv_feats + scaled_attentions.sum(dim=2)
145 | 
146 |         # Apply edge update
147 |         edge_out = self.edge_out_proj(F.silu(edge_feats + d))
148 | 
149 |         # Apply equi feat update
150 |         vector_dist = coords.unsqueeze(2) - coords.unsqueeze(1)
151 |         vector_dist_norm = torch.linalg.vector_norm(vector_dist, dim=-1)
152 |         x_ij = vector_dist / (vector_dist_norm.unsqueeze(-1) + self.eps)
153 | 
154 |         n_atoms = atom_mask.sum(dim=-1) + self.eps
155 | 
156 |         # x_b_outer shape [B, N, N, F, 3]
157 |         # c_outer shape [B, N, N, F, 3]
158 |         # equi_feats shape [B, N, F, 3]
159 |         x_b_outer = x_ij.unsqueeze(-2) * b.unsqueeze(-1)
160 |         c_outer = torch.ones((1, 1, 1, 1, 3), device=c.device) * c.unsqueeze(-1)
161 |         equi_feats_proj = self.equi_proj(equi_feats.transpose(2, 3)).transpose(2, 3)
162 |         equi_feats_mult = equi_feats_proj.unsqueeze(2) * c_outer
163 |         equi_feats_update = (x_b_outer + equi_feats_mult).sum(dim=2)
164 |         equi_feats_out = equi_feats + (equi_feats_update / n_atoms.view(-1, 1, 1, 1))
165 | 
166 |         # Apply coord update
167 |         coord_pairwise_updates = s * x_ij
168 |         coords_out = coords + (coord_pairwise_updates.sum(dim=2) / n_atoms.view(-1, 1, 1))
169 | 
170 |         node_feats_out = self.inv_norm(node_feats_out)
171 |         coords_out = self.coord_norm(coords_out.unsqueeze(1), atom_mask.unsqueeze(1)).squeeze(1)
172 | 
173 |         equi_atom_mask = atom_mask.unsqueeze(1).repeat(1, self.n_equi_feats, 1)
174 |         equi_feats_out = self.equi_norm(equi_feats_out.transpose(1, 2), equi_atom_mask).transpose(1, 2)
175 | 
176 |         inv_update, equi_update = self.gated_update(node_feats_out, equi_feats_out)
177 |         node_feats_out = (node_feats_out + inv_update) * atom_mask.unsqueeze(-1)
178 |         equi_feats_out = equi_feats_out + equi_update
179 | 
180 |         return coords_out, node_feats_out, equi_feats_out, edge_out
181 | 
182 | 
183 | class EqgatPredictionHead(torch.nn.Module):
184 |     def __init__(self, d_model, n_equi_feats, d_edge, vocab_size, n_edge_types, n_charges):
185 |         super().__init__()
186 | 
187 |         self.d_model = d_model
188 |         self.n_equi_feats = n_equi_feats
189 | 
190 |         self.inv_proj = torch.nn.Sequential(torch.nn.Linear(d_model, d_model), torch.nn.SiLU())
191 |         self.edge_feat_proj = torch.nn.Linear(d_edge, d_edge)
192 |         self.equi_proj = torch.nn.Linear(n_equi_feats, 1, bias=False)
193 |         self.atom_proj = torch.nn.Linear(d_model, vocab_size)
194 |         self.charge_proj = torch.nn.Linear(d_model, n_charges)
195 | 
196 |         edge_in_feats = (d_model * 2) + d_edge + 1
197 |         self.bond_proj = torch.nn.Sequential(
198 |             torch.nn.Linear(edge_in_feats, d_edge), torch.nn.SiLU(), torch.nn.Linear(d_edge, n_edge_types)
199 |         )
200 | 
201 |     def forward(self, coords, inv_feats, equi_feats, adj_matrix, atom_mask, edge_feats):
202 |         """Predict final atom types, charges, bonds and coordinates
203 | 
204 |         Args:
205 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
206 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
207 |             equi_feats (torch.Tensor): Equivariant atom features, shape [batch_size, n_atoms, n_equi_feats, 3]
208 |             adj_matrix (torch.Tensor): Adjacency matrix, shape [batch_size, n_atoms, n_atoms], 1 for connected
209 |             atom_mask (torch.Tensor): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
210 |             edge_feats (torch.Tensor): Input edge features, shape [batch_size, n_atoms, n_atoms, d_edge]
211 | 
212 |         Returns:
213 |             (coords, atom logits, bond logits, charge logits)
214 |             All torch.Tensor, shapes:
215 |                 Coordinates [batch_size, n_atoms, 3],
216 |                 Atom logits [batch_size, n_atoms, vocab_size]
217 |                 Bond logits [batch_size, n_atoms, n_atoms, n_bond_types]
218 |                 Charge logits [batch_size, n_atoms, n_charges]
219 |         """
220 | 
221 |         batch_size, n_nodes, _ = tuple(coords.shape)
222 | 
223 |         equi_feats_proj = self.equi_proj(equi_feats.transpose(2, 3)).squeeze(-1)
224 |         coords_out = coords + equi_feats_proj
225 | 
226 |         edge_feats = edge_feats * adj_matrix.unsqueeze(-1)
227 |         edge_feats_in = edge_feats + edge_feats.transpose(1, 2)
228 |         edge_feats_proj = self.edge_feat_proj(edge_feats_in)
229 | 
230 |         node_feats = self.inv_proj(inv_feats)
231 |         node_feats_start = node_feats.unsqueeze(2).expand(batch_size, n_nodes, n_nodes, -1)
232 |         node_feats_end = node_feats.unsqueeze(1).expand(batch_size, n_nodes, n_nodes, -1)
233 |         node_pairs = torch.cat((node_feats_start, node_feats_end), dim=-1)
234 | 
235 |         distances = smolF.calc_distances(coords_out).unsqueeze(-1)
236 |         pairwise_feats = torch.cat((node_pairs, edge_feats_proj, distances), dim=-1)
237 |         bond_logits = self.bond_proj(pairwise_feats)
238 | 
239 |         atom_logits = self.atom_proj(node_feats)
240 |         charge_logits = self.charge_proj(node_feats)
241 | 
242 |         return coords_out, atom_logits, bond_logits, charge_logits
243 | 
244 | 
245 | class EqgatDynamics(torch.nn.Module):
246 |     def __init__(self, d_model, n_layers, n_equi_feats, d_edge, eps=1e-5):
247 |         super().__init__()
248 | 
249 |         layers = [EqgatLayer(d_model, n_equi_feats, d_edge, eps=eps) for _ in range(n_layers)]
250 |         self.layers = torch.nn.ModuleList(layers)
251 | 
252 |     def forward(self, coords, inv_feats, equi_feats, adj_matrix, atom_mask, edge_feats):
253 |         """Generate molecular coordinates and atom features
254 | 
255 |         Args:
256 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
257 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, d_model]
258 |             equi_feats (torch.Tensor): Equivariant atom features, shape [batch_size, n_atoms, n_equi_feats, 3]
259 |             adj_matrix (torch.Tensor): Adjacency matrix, shape [batch_size, n_atoms, n_atoms], 1 for connected
260 |             atom_mask (torch.Tensor): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
261 |             edge_feats (torch.Tensor): Input edge features, shape [batch_size, n_atoms, n_atoms, d_edge]
262 | 
263 |         Returns:
264 |             (coords, atom feats, equi feats, edge feats)
265 |             All torch.Tensor, shapes:
266 |                 Coordinates [batch_size, n_atoms, 3],
267 |                 Atom feats [batch_size, n_atoms, d_model]
268 |                 Equi feats [batch_size, n_atoms, n_equi_feats, 3]
269 |                 Edge feats [batch_size, n_atoms, n_atoms, d_edge]
270 |         """
271 | 
272 |         for layer in self.layers:
273 |             coords, inv_feats, equi_feats, edge_feats = layer(
274 |                 coords, inv_feats, equi_feats, adj_matrix, atom_mask, edge_feats
275 |             )
276 | 
277 |         return coords, inv_feats, equi_feats, edge_feats
278 | 
279 | 
280 | class EqgatGenerator(MolecularGenerator):
281 |     def __init__(
282 |         self,
283 |         d_model,
284 |         n_layers,
285 |         n_equi_feats,
286 |         vocab_size,
287 |         n_atom_feats,
288 |         d_edge,
289 |         n_edge_types,
290 |     ):
291 | 
292 |         hparams = {
293 |             "d_model": d_model,
294 |             "n_layers": n_layers,
295 |             "n_equi_feats": n_equi_feats,
296 |             "vocab_size": vocab_size,
297 |             "n_atom_feats": n_atom_feats,
298 |             "d_edge": d_edge,
299 |             "n_edge_types": n_edge_types,
300 |         }
301 | 
302 |         super().__init__(**hparams)
303 | 
304 |         self.d_model = d_model
305 |         self.n_equi_feats = n_equi_feats
306 | 
307 |         n_charges = 7
308 | 
309 |         self.feat_proj = torch.nn.Linear(n_atom_feats, d_model)
310 |         self.edge_in_proj = torch.nn.Sequential(
311 |             torch.nn.Linear(n_edge_types, d_edge), torch.nn.SiLU(inplace=False), torch.nn.Linear(d_edge, d_edge)
312 |         )
313 | 
314 |         self.dynamics = EqgatDynamics(d_model, n_layers, n_equi_feats, d_edge)
315 |         self.pred_head = EqgatPredictionHead(d_model, n_equi_feats, d_edge, vocab_size, n_edge_types, n_charges)
316 | 
317 |     def forward(
318 |         self,
319 |         coords,
320 |         inv_feats,
321 |         edge_feats=None,
322 |         cond_coords=None,
323 |         cond_atomics=None,
324 |         cond_bonds=None,
325 |         atom_mask=None,
326 |     ):
327 |         """Predict molecular coordinates and atom types
328 | 
329 |         Args:
330 |             coords (torch.Tensor): Input coordinates, shape [batch_size, n_atoms, 3]
331 |             inv_feats (torch.Tensor): Invariant atom features, shape [batch_size, n_atoms, n_feats]
332 |             edge_feats (torch.Tensor): Input edge features, shape [batch_size, n_atoms, n_atoms, n_edge_types]
333 |             atom_mask (torch.Tensor): Mask for fake atoms, shape [batch_size, n_atoms], 1 for real atoms
334 | 
335 |         Returns:
336 |             (predicted coordinates, atom type logits, bond logits, charge logits)
337 |             All torch.Tensor, shapes:
338 |                 Coordinates: [batch_size, n_atoms, 3]
339 |                 Type logits: [batch_size, n_atoms, vocab_size],
340 |                 Bond logits: [batch_size, n_atoms, n_atoms, n_edge_types]
341 |                 Charge logits: [batch_size, n_atoms, 7]
342 |         """
343 | 
344 |         atom_mask = torch.ones_like(coords[..., 0]) if atom_mask is None else atom_mask
345 |         adj_matrix = smolF.edges_from_nodes(coords, node_mask=atom_mask)
346 | 
347 |         inv_feats_proj = self.feat_proj(inv_feats)
348 |         edge_feats_proj = self.edge_in_proj(edge_feats.float())
349 | 
350 |         equi_feats = torch.zeros_like(coords.unsqueeze(2)).repeat(1, 1, self.n_equi_feats, 1)
351 | 
352 |         out = self.dynamics(coords, inv_feats_proj, equi_feats, adj_matrix, atom_mask, edge_feats_proj)
353 |         coords, atom_feats, equi_feats, edge_feats = out
354 | 
355 |         pred = self.pred_head(coords, atom_feats, equi_feats, adj_matrix, atom_mask, edge_feats)
356 |         return pred
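`adj_to_attn_mask` in this file resets the mask of any row with no connections to zero, because a softmax over a row that is entirely `-inf` yields NaN. The sketch below reproduces that idea for a single row in plain Python; `masked_softmax` is a hypothetical helper written for illustration, not the repository's implementation.

```python
import math


def masked_softmax(scores, connected):
    """Softmax over `scores`, ignoring entries where `connected` is 0.

    A row with no connections falls back to an all-zero mask, mirroring the
    reset in adj_to_attn_mask, so the softmax runs over the raw scores
    instead of producing NaN.
    """
    mask = [0.0 if c else float("-inf") for c in connected]
    if not any(connected):
        mask = [0.0] * len(scores)

    shifted = [s + m for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# A real atom connected to neighbours 0 and 2 but not 1
real_row = masked_softmax([1.0, 2.0, 3.0], [1, 0, 1])

# A padded (fake) atom with no connections at all
fake_row = masked_softmax([1.0, 2.0, 3.0], [0, 0, 0])
```

The masked neighbour receives zero attention in `real_row`, while `fake_row` is still a valid probability distribution. The outputs for fake atoms are masked out later in the layer.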
357 | 
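The coordinate update in `EqgatLayer.forward` moves each atom along unit direction vectors to the other atoms, weighted by a scalar `s` computed from rotation-invariant inputs. Such an update commutes with rotations. The following sketch checks this numerically in plain Python, under simplifying assumptions: the weights are fixed numbers rather than MLP outputs, the sum is divided by the total number of points rather than the masked atom count, and the subsequent `CoordNorm` step is omitted. `coord_update` and `rotate_z` are hypothetical helpers.

```python
import math


def coord_update(coords, weights, eps=1e-5):
    """Shift each point along weighted unit directions to every other point.

    coords: list of [x, y, z]; weights[i][j]: invariant scalar for pair (i, j).
    """
    n = len(coords)
    out = []
    for i in range(n):
        shift = [0.0, 0.0, 0.0]
        for j in range(n):
            diff = [a - b for a, b in zip(coords[i], coords[j])]
            norm = math.sqrt(sum(d * d for d in diff))
            for k in range(3):
                shift[k] += weights[i][j] * diff[k] / (norm + eps)
        out.append([c + s / n for c, s in zip(coords[i], shift)])
    return out


def rotate_z(coords, angle):
    """Rotate every point about the z-axis by `angle` radians."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [[cos_a * x - sin_a * y, sin_a * x + cos_a * y, z] for x, y, z in coords]


coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
weights = [[0.0, 0.5, -0.2], [0.3, 0.0, 0.1], [0.4, -0.6, 0.0]]

update_then_rotate = rotate_z(coord_update(coords, weights), 0.7)
rotate_then_update = coord_update(rotate_z(coords, 0.7), weights)

# Largest coordinate-wise difference between the two orders of operation
max_gap = max(
    abs(a - b)
    for p, q in zip(update_then_rotate, rotate_then_update)
    for a, b in zip(p, q)
)
```

`max_gap` should sit at floating-point rounding level, since pairwise distances are unchanged by the rotation and the direction vectors rotate together with the points.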


--------------------------------------------------------------------------------
/semlaflow/predict.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Script for generating molecules using a trained model and saving them.
  3 | 
  4 | Note that the script currently does not save the molecules in batches - all of the molecules are generated and then
  5 | all saved together in one Smol batch. If generating many molecules ensure you have enough memory to store them.
  6 | """
  7 | 
  8 | import argparse
  9 | import os
 10 | from functools import partial
 11 | from pathlib import Path
 12 | 
 13 | import lightning as L
 14 | import torch
 15 | from rdkit import Chem
 16 | 
 17 | import semlaflow.scriptutil as util
 18 | from semlaflow.data.datamodules import GeometricInterpolantDM
 19 | from semlaflow.data.datasets import GeometricDataset
 20 | from semlaflow.data.interpolate import GeometricInterpolant, GeometricNoiseSampler
 21 | from semlaflow.models.fm import Integrator, MolecularCFM
 22 | from semlaflow.models.semla import EquiInvDynamics, SemlaGenerator
 23 | from semlaflow.util.molrepr import GeometricMolBatch
 24 | 
 25 | # Default script arguments
 26 | DEFAULT_SAVE_FILE = "predictions.smol"
 27 | DEFAULT_DATASET_SPLIT = "train"
 28 | DEFAULT_N_MOLECULES = 5000
 29 | DEFAULT_BATCH_COST = 8192
 30 | DEFAULT_BUCKET_COST_SCALE = "linear"
 31 | DEFAULT_INTEGRATION_STEPS = 100
 32 | DEFAULT_CAT_SAMPLING_NOISE_LEVEL = 1
 33 | DEFAULT_ODE_SAMPLING_STRATEGY = "log"
 34 | 
 35 | 
 36 | def load_model(args, vocab):
 37 |     checkpoint = torch.load(args.ckpt_path)
 38 |     hparams = checkpoint["hyper_parameters"]
 39 | 
 40 |     hparams["compile_model"] = False
 41 |     hparams["integration-steps"] = args.integration_steps
 42 |     hparams["sampling_strategy"] = args.ode_sampling_strategy
 43 | 
 44 |     n_bond_types = util.get_n_bond_types(hparams["integration-type-strategy"])
 45 | 
 46 |     # Set default arch to semla if nothing has been saved
 47 |     if hparams.get("architecture") is None:
 48 |         hparams["architecture"] = "semla"
 49 | 
 50 |     if hparams["architecture"] == "semla":
 51 |         dynamics = EquiInvDynamics(
 52 |             hparams["d_model"],
 53 |             hparams["d_message"],
 54 |             hparams["n_coord_sets"],
 55 |             hparams["n_layers"],
 56 |             n_attn_heads=hparams["n_attn_heads"],
 57 |             d_message_hidden=hparams["d_message_hidden"],
 58 |             d_edge=hparams["d_edge"],
 59 |             self_cond=hparams["self_cond"],
 60 |             coord_norm=hparams["coord_norm"],
 61 |         )
 62 |         egnn_gen = SemlaGenerator(
 63 |             hparams["d_model"],
 64 |             dynamics,
 65 |             vocab.size,
 66 |             hparams["n_atom_feats"],
 67 |             d_edge=hparams["d_edge"],
 68 |             n_edge_types=n_bond_types,
 69 |             self_cond=hparams["self_cond"],
 70 |             size_emb=hparams["size_emb"],
 71 |             max_atoms=hparams["max_atoms"]
 72 |         )
 73 | 
 74 |     elif hparams["architecture"] == "eqgat":
 75 |         from semlaflow.models.eqgat import EqgatGenerator
 76 | 
 77 |         egnn_gen = EqgatGenerator(
 78 |             hparams["d_model"],
 79 |             hparams["n_layers"],
 80 |             hparams["n_equi_feats"],
 81 |             vocab.size,
 82 |             hparams["n_atom_feats"],
 83 |             hparams["d_edge"],
 84 |             hparams["n_edge_types"]
 85 |         )
 86 | 
 87 |     elif hparams["architecture"] == "egnn":
 88 |         from semlaflow.models.egnn import VanillaEgnnGenerator
 89 | 
 90 |         n_layers = args.n_layers if hparams.get("n_layers") is None else hparams["n_layers"]
 91 |         if n_layers is None:
 92 |             raise ValueError("No hparam for n_layers was saved, use script arg to provide n_layers")
 93 | 
 94 |         egnn_gen = VanillaEgnnGenerator(
 95 |             hparams["d_model"],
 96 |             n_layers,
 97 |             vocab.size,
 98 |             hparams["n_atom_feats"],
 99 |             d_edge=hparams["d_edge"],
100 |             n_edge_types=n_bond_types
101 |         )
102 | 
103 |     else:
104 |         raise ValueError(f"Unknown architecture hyperparameter {hparams['architecture']}.")
105 | 
106 |     type_mask_index = (
107 |         vocab.indices_from_tokens(["<MASK>"])[0] if hparams["train-type-interpolation"] == "mask" else None
108 |     )
109 |     bond_mask_index = None
110 | 
111 |     integrator = Integrator(
112 |         args.integration_steps,
113 |         type_strategy=hparams["integration-type-strategy"],
114 |         bond_strategy=hparams["integration-bond-strategy"],
115 |         type_mask_index=type_mask_index,
116 |         bond_mask_index=bond_mask_index,
117 |         cat_noise_level=args.cat_sampling_noise_level,
118 |     )
119 |     fm_model = MolecularCFM.load_from_checkpoint(
120 |         args.ckpt_path,
121 |         gen=egnn_gen,
122 |         vocab=vocab,
123 |         integrator=integrator,
124 |         type_mask_index=type_mask_index,
125 |         bond_mask_index=bond_mask_index,
126 |         **hparams,
127 |     )
128 |     return fm_model
129 | 
130 | 
131 | def build_dm(args, hparams, vocab):
132 |     if args.dataset == "qm9":
133 |         coord_std = util.QM9_COORDS_STD_DEV
134 |         bucket_limits = util.QM9_BUCKET_LIMITS
135 | 
136 |     elif args.dataset == "geom-drugs":
137 |         coord_std = util.GEOM_COORDS_STD_DEV
138 |         bucket_limits = util.GEOM_DRUGS_BUCKET_LIMITS
139 | 
140 |     else:
141 |         raise ValueError(f"Unknown dataset {args.dataset}")
142 |  
143 |     n_bond_types = 5
144 |     transform = partial(util.mol_transform, vocab=vocab, n_bonds=n_bond_types, coord_std=coord_std)
145 | 
146 |     # Map the requested split to its file and fail early on an unknown split name
147 |     split_files = {"train": "train.smol", "val": "val.smol", "test": "test.smol"}
148 |     if args.dataset_split not in split_files:
149 |         raise ValueError(f"Unknown dataset split {args.dataset_split}")
150 | 
151 |     dataset_path = Path(args.data_path) / split_files[args.dataset_split]
152 | 
153 |     dataset = GeometricDataset.load(dataset_path, transform=transform)
154 |     dataset = dataset.sample(args.n_molecules, replacement=True)
155 | 
156 |     type_mask_index = vocab.indices_from_tokens(["<MASK>"])[0] if hparams["val-type-interpolation"] == "mask" else None
157 |     bond_mask_index = None
158 | 
159 |     prior_sampler = GeometricNoiseSampler(
160 |         vocab.size,
161 |         n_bond_types,
162 |         coord_noise="gaussian",
163 |         type_noise=hparams["val-prior-type-noise"],
164 |         bond_noise=hparams["val-prior-bond-noise"],
165 |         scale_ot=hparams["val-prior-noise-scale-ot"],
166 |         zero_com=True,
167 |         type_mask_index=type_mask_index,
168 |         bond_mask_index=bond_mask_index
169 |     )
170 |     eval_interpolant = GeometricInterpolant(
171 |         prior_sampler,
172 |         coord_interpolation="linear",
173 |         type_interpolation=hparams["val-type-interpolation"],
174 |         bond_interpolation=hparams["val-bond-interpolation"],
175 |         equivariant_ot=False,
176 |         batch_ot=False
177 |     )
178 |     dm = GeometricInterpolantDM(
179 |         None,
180 |         None,
181 |         dataset,
182 |         args.batch_cost,
183 |         test_interpolant=eval_interpolant,
184 |         bucket_limits=bucket_limits,
185 |         bucket_cost_scale=args.bucket_cost_scale,
186 |         pad_to_bucket=False
187 |     )
188 |     return dm
189 | 
190 | 
191 | def dm_from_ckpt(args, vocab):
192 |     checkpoint = torch.load(args.ckpt_path)
193 |     hparams = checkpoint["hyper_parameters"]
194 |     dm = build_dm(args, hparams, vocab)
195 |     return dm
196 | 
197 | 
198 | def generate_smol_mols(output, model):
199 |     coords = output["coords"]
200 |     atom_dists = output["atomics"]
201 |     bond_dists = output["bonds"]
202 |     charge_dists = output["charges"]
203 |     masks = output["mask"]
204 | 
205 |     mols = model.builder.smol_from_tensors(coords, atom_dists, masks, bond_dists=bond_dists, charge_dists=charge_dists)
206 |     return mols
207 | 
208 | 
209 | def save_raw_smol(args, raw_outputs, model):
210 |     # Generate GeometricMols and then combine into one GeometricMolBatch
211 |     mol_lists = [generate_smol_mols(output, model) for output in raw_outputs]
212 |     mols = [mol for mol_list in mol_lists for mol in mol_list]
213 |     batch = GeometricMolBatch.from_list(mols)
214 | 
215 |     save_path = Path(args.save_dir) / args.save_file
216 |     batch_bytes = batch.to_bytes()
217 |     save_path.write_bytes(batch_bytes)
218 | 
219 | 
220 | def save_rdkit_sdf(args, mols):
221 |     path = os.path.join(args.save_dir, args.save_file) + ".sdf"
222 |     writer = Chem.SDWriter(path)
223 |     for m in mols:
224 |         if m is not None:
225 |             writer.write(m)
226 |     writer.close()
227 | 
228 | 
229 | def main(args):
230 |     print(f"Running prediction script for {args.n_molecules} molecules...")
231 |     print(f"Using model stored at {args.ckpt_path}")
232 | 
233 |     L.seed_everything(12345)
234 |     util.disable_lib_stdout()
235 |     util.configure_fs()
236 | 
237 |     print("Building model vocab...")
238 |     vocab = util.build_vocab()
239 |     print("Vocab complete.")
240 | 
241 |     print("Loading datamodule...")
242 |     dm = dm_from_ckpt(args, vocab)
243 |     print("Datamodule complete.")
244 | 
245 |     print("Loading model...")
246 |     model = load_model(args, vocab)
247 |     print("Model complete.")
248 | 
249 |     print("Initialising metrics...")
250 |     metrics, _ = util.init_metrics(args.data_path, model)
251 |     print("Metrics complete.")
252 | 
253 |     print("Running generation...")
254 |     molecules, raw_outputs = util.generate_molecules(model, dm, args.integration_steps, args.ode_sampling_strategy)
255 |     print("Generation complete.")
256 | 
257 |     print(f"Saving predictions to {args.save_dir}/{args.save_file}")
258 |     save_rdkit_sdf(args, molecules)
259 |     save_raw_smol(args, raw_outputs, model)
260 |     print("Complete.")
261 | 
262 |     print("Calculating generative metrics...")
263 |     results = util.calc_metrics_(molecules, metrics)
264 |     util.print_results(results)
265 |     print("Generation script complete!")
266 | 
267 | 
268 | if __name__ == "__main__":
269 |     parser = argparse.ArgumentParser()
270 | 
271 |     parser.add_argument("--ckpt_path", type=str)
272 |     parser.add_argument("--data_path", type=str)
273 |     parser.add_argument("--dataset", type=str)
274 |     parser.add_argument("--save_dir", type=str)
275 |     parser.add_argument("--save_file", type=str, default=DEFAULT_SAVE_FILE)
276 | 
277 |     parser.add_argument("--batch_cost", type=int, default=DEFAULT_BATCH_COST)
278 |     parser.add_argument("--dataset_split", type=str, default=DEFAULT_DATASET_SPLIT)
279 |     parser.add_argument("--n_molecules", type=int, default=DEFAULT_N_MOLECULES)
280 |     parser.add_argument("--integration_steps", type=int, default=DEFAULT_INTEGRATION_STEPS)
281 |     parser.add_argument("--cat_sampling_noise_level", type=int, default=DEFAULT_CAT_SAMPLING_NOISE_LEVEL)
282 |     parser.add_argument("--ode_sampling_strategy", type=str, default=DEFAULT_ODE_SAMPLING_STRATEGY)
283 |     parser.add_argument("--n_layers", type=int, default=None)
284 |     parser.add_argument("--bucket_cost_scale", type=str, default=DEFAULT_BUCKET_COST_SCALE)
285 | 
286 |     args = parser.parse_args()
287 |     main(args)
288 | 
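The prediction script above is driven entirely by command-line arguments. As a rough illustration of how the defaults resolve, the sketch below rebuilds a subset of the parser and parses an example argument list. The paths and the dataset folder layout are hypothetical placeholders, and `build_parser` is not a function in the repository.

```python
import argparse


def build_parser():
    """Rebuild a subset of the prediction script's arguments for illustration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ckpt_path", type=str)
    parser.add_argument("--data_path", type=str)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--save_dir", type=str)
    parser.add_argument("--save_file", type=str, default="predictions.smol")
    parser.add_argument("--dataset_split", type=str, default="train")
    parser.add_argument("--n_molecules", type=int, default=5000)
    parser.add_argument("--integration_steps", type=int, default=100)
    return parser


# Hypothetical paths; substitute your own checkpoint and dataset locations
argv = [
    "--ckpt_path", "checkpoints/last.ckpt",
    "--data_path", "datasets/qm9/smol",
    "--dataset", "qm9",
    "--save_dir", "output",
    "--n_molecules", "100",
]
args = build_parser().parse_args(argv)
print(args.dataset, args.n_molecules, args.integration_steps, args.save_file)
```

Arguments that are not supplied fall back to the module-level defaults, so only the checkpoint, data path, dataset name and save directory need to be given explicitly.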


--------------------------------------------------------------------------------
/semlaflow/preprocess.py:
--------------------------------------------------------------------------------
 1 | """Preprocessing script for Geom Drugs only; QM9 is preprocessed in the QM9 notebook"""
 2 | 
 3 | import argparse
 4 | import pickle
 5 | from pathlib import Path
 6 | 
 7 | from semlaflow.util.molrepr import GeometricMol, GeometricMolBatch
 8 | 
 9 | DEFAULT_RAW_DATA_FOLDER = "raw"
10 | DEFUALT_SAVE_DATA_FOLDER = "smol"
11 | 
12 | 
13 | RAW_TRAIN_FILE = "train_data.pickle"
14 | RAW_VAL_FILE = "val_data.pickle"
15 | RAW_TEST_FILE = "test_data.pickle"
16 | 
17 | SAVE_TRAIN_FILE = "train.smol"
18 | SAVE_VAL_FILE = "val.smol"
19 | SAVE_TEST_FILE = "test.smol"
20 | 
21 | 
22 | def read_from_file(filepath):
23 |     raw_bytes = filepath.read_bytes()
24 |     return pickle.loads(raw_bytes)
25 | 
26 | 
27 | def raw_to_smol_mol(raw_mol):
28 |     _, mols = raw_mol
29 |     smol_mol = GeometricMol.from_rdkit(mols[0])
30 |     return smol_mol
31 | 
32 | 
33 | def raw_to_smol_batch(raw_data):
34 |     smol_mols = [raw_to_smol_mol(raw_mol) for raw_mol in raw_data]
35 |     batch = GeometricMolBatch.from_list(smol_mols)
36 |     return batch
37 | 
38 | 
39 | def process_dataset(raw_filepath, save_filepath):
40 |     raw_dataset = read_from_file(raw_filepath)
41 |     smol_batch = raw_to_smol_batch(raw_dataset)
42 |     dataset_bytes = smol_batch.to_bytes()
43 |     save_filepath.write_bytes(dataset_bytes)
44 | 
45 | 
46 | def main(args):
47 |     data_path = Path(args.data_path)
48 | 
49 |     raw_data_path = data_path / args.raw_data_folder
50 |     raw_train_path = raw_data_path / RAW_TRAIN_FILE
51 |     raw_val_path = raw_data_path / RAW_VAL_FILE
52 |     raw_test_path = raw_data_path / RAW_TEST_FILE
53 | 
54 |     assert raw_train_path.exists()
55 |     assert raw_val_path.exists()
56 |     assert raw_test_path.exists()
57 | 
58 |     save_data_path = data_path / args.save_data_folder
59 |     save_data_path.mkdir(parents=True, exist_ok=True)
60 |     save_train_path = save_data_path / SAVE_TRAIN_FILE
61 |     save_val_path = save_data_path / SAVE_VAL_FILE
62 |     save_test_path = save_data_path / SAVE_TEST_FILE
63 | 
64 |     print("Processing train dataset...")
65 |     process_dataset(raw_train_path, save_train_path)
66 |     print("Train dataset complete.")
67 | 
68 |     print("Processing val dataset...")
69 |     process_dataset(raw_val_path, save_val_path)
70 |     print("Val dataset complete.")
71 | 
72 |     print("Processing test dataset...")
73 |     process_dataset(raw_test_path, save_test_path)
74 |     print("Test dataset complete.")
75 | 
76 | 
77 | if __name__ == "__main__":
78 |     parser = argparse.ArgumentParser()
79 | 
80 |     parser.add_argument("--data_path", type=str)
81 |     parser.add_argument("--raw_data_folder", type=str, default=DEFAULT_RAW_DATA_FOLDER)
82 |     parser.add_argument("--save_data_folder", type=str, default=DEFUALT_SAVE_DATA_FOLDER)
83 | 
84 |     args = parser.parse_args()
85 |     main(args)
86 | 
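The preprocessing script above reads a pickled list of pairs and keeps only the first entry of each pair's second element (`mols[0]`). The sketch below illustrates that data flow with the standard library only. It assumes the first element of each pair is some identifier, which the script itself ignores, and uses placeholder strings where the real data holds RDKit molecules. `first_conformers` is a hypothetical helper.

```python
import pickle
import tempfile
from pathlib import Path


def first_conformers(raw_data):
    """Keep the first conformer of each (identifier, conformers) pair."""
    return [conformers[0] for _, conformers in raw_data]


# Placeholder strings stand in for RDKit molecules in this sketch
raw_data = [
    ("CCO", ["CCO-conf0", "CCO-conf1"]),
    ("c1ccccc1", ["benzene-conf0"]),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = Path(tmp_dir) / "raw" / "train_data.pickle"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_bytes(pickle.dumps(raw_data))

    # Same read pattern as read_from_file: raw bytes in, unpickled object out
    loaded = pickle.loads(raw_path.read_bytes())

selected = first_conformers(loaded)
print(selected)
```

In the real script each selected molecule is converted with `GeometricMol.from_rdkit` and the resulting list is serialised as one `GeometricMolBatch`.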


--------------------------------------------------------------------------------
/semlaflow/scriptutil.py:
--------------------------------------------------------------------------------
  1 | """Util file for Equinv scripts"""
  2 | 
  3 | import math
  4 | import resource
  5 | from pathlib import Path
  6 | 
  7 | import numpy as np
  8 | import torch
  9 | from openbabel import pybel
 10 | from rdkit import RDLogger
 11 | from torchmetrics import MetricCollection
 12 | from tqdm import tqdm
 13 | 
 14 | import semlaflow.util.functional as smolF
 15 | import semlaflow.util.metrics as Metrics
 16 | import semlaflow.util.rdkit as smolRD
 17 | from semlaflow.data.datasets import GeometricDataset
 18 | from semlaflow.util.tokeniser import Vocabulary
 19 | 
 20 | # Declarations to be used in scripts
 21 | QM9_COORDS_STD_DEV = 1.723299503326416
 22 | GEOM_COORDS_STD_DEV = 2.407038688659668
 23 | 
 24 | QM9_BUCKET_LIMITS = [12, 16, 18, 20, 22, 24, 30]
 25 | GEOM_DRUGS_BUCKET_LIMITS = [24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 72, 96, 192]
 26 | 
 27 | PROJECT_PREFIX = "equinv"
 28 | BOND_MASK_INDEX = 5
 29 | COMPILER_CACHE_SIZE = 128
 30 | 
 31 | 
 32 | def disable_lib_stdout():
 33 |     pybel.ob.obErrorLog.StopLogging()
 34 |     RDLogger.DisableLog("rdApp.*")
 35 | 
 36 | 
 37 | # Need to ensure the limits are large enough when using OT since lots of preprocessing needs to be done on the batches
 38 | # OT seems to cause a problem when there are not enough allowed open FDs
 39 | def configure_fs(limit=4096):
 40 |     """
 41 |     Try to increase the limit on open file descriptors.
 42 |     If that is not possible, fall back to torch's file_system sharing strategy.
 43 |     """
 44 | 
 45 |     n_file_resource = resource.RLIMIT_NOFILE
 46 |     soft_limit, hard_limit = resource.getrlimit(n_file_resource)
 47 | 
 48 |     print(f"Current limits (soft, hard): {(soft_limit, hard_limit)}")
 49 | 
 50 |     if limit > soft_limit:
 51 |         try:
 52 |             print(f"Attempting to increase open file limit to {limit}...")
 53 |             resource.setrlimit(n_file_resource, (limit, hard_limit))
 54 |             print("Limit changed successfully!")
 55 | 
 56 |         except Exception:
 57 |             print("Limit change unsuccessful. Using torch file_system file sharing strategy instead.")
 58 | 
 59 |             import torch.multiprocessing
 60 | 
 61 |             torch.multiprocessing.set_sharing_strategy("file_system")
 62 | 
 63 |     else:
 64 |         print("Open file limit already sufficiently large.")
 65 | 
 66 | 
 67 | # Applies the following transformations to a molecule:
 68 | # 1. Scales coordinate values by 1 / coord_std (so that they have roughly unit standard deviation)
 69 | # 2. Applies a random rotation to the coordinates
 70 | # 3. Removes the centre of mass of the molecule
 71 | # 4. Creates a one-hot vector for the atomic numbers of each atom
 72 | # 5. Creates a one-hot vector for the bond type for every possible bond
 73 | # 6. Encodes charges as non-negative numbers according to encoding map
 74 | def mol_transform(molecule, vocab, n_bonds, coord_std):
 75 |     rotation = tuple(np.random.rand(3) * np.pi * 2)
 76 |     molecule = molecule.scale(1.0 / coord_std).rotate(rotation).zero_com()
 77 | 
 78 |     atomic_nums = [int(atomic) for atomic in molecule.atomics.tolist()]
 79 |     tokens = [smolRD.PT.symbol_from_atomic(atomic) for atomic in atomic_nums]
 80 |     one_hot_atomics = torch.tensor(vocab.indices_from_tokens(tokens, one_hot=True))
 81 | 
 82 |     bond_types = smolF.one_hot_encode_tensor(molecule.bond_types, n_bonds)
 83 | 
 84 |     charge_idxs = [smolRD.CHARGE_IDX_MAP[charge] for charge in molecule.charges.tolist()]
 85 |     charge_idxs = torch.tensor(charge_idxs)
 86 | 
 87 |     transformed = molecule._copy_with(atomics=one_hot_atomics, bond_types=bond_types, charges=charge_idxs)
 88 |     return transformed
 89 | 
 90 | 
 91 | # When training a distilled model atom types and bonds are already distributions over categoricals
 92 | def distill_transform(molecule, coord_std):
 93 |     rotation = tuple(np.random.rand(3) * np.pi * 2)
 94 |     molecule = molecule.scale(1.0 / coord_std).rotate(rotation).zero_com()
 95 | 
 96 |     charge_idxs = [smolRD.CHARGE_IDX_MAP[charge] for charge in molecule.charges.tolist()]
 97 |     charge_idxs = torch.tensor(charge_idxs)
 98 | 
 99 |     transformed = molecule._copy_with(charges=charge_idxs)
100 |     return transformed
101 | 
102 | 
103 | def get_n_bond_types(cat_strategy):
104 |     n_bond_types = len(smolRD.BOND_IDX_MAP.keys()) + 1
105 |     n_bond_types = n_bond_types + 1 if cat_strategy == "mask" else n_bond_types
106 |     return n_bond_types
107 | 
108 | 
109 | def build_vocab():
110 |     # Need to make sure PAD has index 0
111 |     special_tokens = ["<PAD>", "<MASK>"]
112 |     core_atoms = ["H", "C", "N", "O", "F", "P", "S", "Cl"]
113 |     other_atoms = ["Br", "B", "Al", "Si", "As", "I", "Hg", "Bi"]
114 |     tokens = special_tokens + core_atoms + other_atoms
115 |     return Vocabulary(tokens)
116 | 
117 | 
118 | # TODO support multi gpus
119 | def calc_train_steps(dm, epochs, acc_batches):
120 |     dm.setup("train")
121 |     steps_per_epoch = math.ceil(len(dm.train_dataloader()) / acc_batches)
122 |     return steps_per_epoch * epochs
123 | 
124 | 
125 | def init_metrics(data_path, model):
126 |     # Load the train data separately from the DM, just to access the list of train SMILES
127 |     train_path = Path(data_path) / "train.smol"
128 |     train_dataset = GeometricDataset.load(train_path)
129 |     train_smiles = [mol.str_id for mol in train_dataset]
130 | 
131 |     print("Creating RDKit mols from training SMILES...")
132 |     train_mols = model.builder.mols_from_smiles(train_smiles, explicit_hs=True)
133 |     train_mols = [mol for mol in train_mols if mol is not None]
134 | 
135 |     metrics = {
136 |         "validity": Metrics.Validity(),
137 |         "connected-validity": Metrics.Validity(connected=True),
138 |         "uniqueness": Metrics.Uniqueness(),
139 |         "novelty": Metrics.Novelty(train_mols),
140 |         "energy-validity": Metrics.EnergyValidity(),
141 |         "opt-energy-validity": Metrics.EnergyValidity(optimise=True),
142 |         "energy": Metrics.AverageEnergy(),
143 |         "energy-per-atom": Metrics.AverageEnergy(per_atom=True),
144 |         "strain": Metrics.AverageStrainEnergy(),
145 |         "strain-per-atom": Metrics.AverageStrainEnergy(per_atom=True),
146 |         "opt-rmsd": Metrics.AverageOptRmsd(),
147 |     }
148 |     stability_metrics = {"atom-stability": Metrics.AtomStability(), "molecule-stability": Metrics.MoleculeStability()}
149 | 
150 |     metrics = MetricCollection(metrics, compute_groups=False)
151 |     stability_metrics = MetricCollection(stability_metrics, compute_groups=False)
152 | 
153 |     return metrics, stability_metrics
154 | 
155 | 
156 | def generate_molecules(model, dm, steps, strategy, stabilities=False):
157 |     test_dl = dm.test_dataloader()
158 |     model.eval()
159 |     cuda_model = model.to("cuda")
160 | 
161 |     outputs = []
162 |     for batch in tqdm(test_dl):
163 |         batch = {k: v.cuda() for k, v in batch[0].items()}
164 |         output = cuda_model._generate(batch, steps, strategy)
165 |         outputs.append(output)
166 | 
167 |     molecules = [cuda_model._generate_mols(output) for output in outputs]
168 |     molecules = [mol for mol_list in molecules for mol in mol_list]
169 | 
170 |     if not stabilities:
171 |         return molecules, outputs
172 | 
173 |     stabilities = [cuda_model._generate_stabilities(output) for output in outputs]
174 |     stabilities = [mol_stab for mol_stabs in stabilities for mol_stab in mol_stabs]
175 |     return molecules, outputs, stabilities
176 | 
177 | 
178 | def calc_metrics_(rdkit_mols, metrics, stab_metrics=None, mol_stabs=None):
179 |     metrics.reset()
180 |     metrics.update(rdkit_mols)
181 |     results = metrics.compute()
182 | 
183 |     if stab_metrics is None:
184 |         return results
185 | 
186 |     stab_metrics.reset()
187 |     stab_metrics.update(mol_stabs)
188 |     stab_results = stab_metrics.compute()
189 | 
190 |     results = {**results, **stab_results}
191 |     return results
192 | 
193 | 
194 | def print_results(results, std_results=None):
195 |     print()
196 |     print(f"{'Metric':<22}Result")
197 |     print("-" * 30)
198 | 
199 |     for metric, value in results.items():
200 |         result_str = f"{metric:<22}{value:.5f}"
201 |         if std_results is not None:
202 |             std = std_results[metric]
203 |             result_str = f"{result_str} +- {std:.7f}"
204 | 
205 |         print(result_str)
206 |     print()
207 | 
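
The coordinate steps described above `mol_transform` (scale by `1 / coord_std`, apply a random rotation, remove the centre) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `rotation_matrix` and `transform_coords` are hypothetical helpers, not part of semlaflow; the x-then-y-then-z rotation convention is assumed and may differ from the one used by `molecule.rotate`; and the sketch subtracts the unweighted centroid, which may not match what `zero_com` does.

```python
import math
import random


def rotation_matrix(rx, ry, rz):
    """Rotation about x, then y, then z (convention assumed for this sketch)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    # R = Rz @ Ry @ Rx written out explicitly
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def transform_coords(coords, coord_std, rng):
    """Scale, randomly rotate and centre a list of [x, y, z] coordinates."""
    # Three random angles in [0, 2*pi), mirroring np.random.rand(3) * np.pi * 2
    angles = [rng.random() * 2 * math.pi for _ in range(3)]
    rot = rotation_matrix(*angles)

    scaled = [[c / coord_std for c in atom] for atom in coords]
    rotated = [[sum(rot[i][j] * atom[j] for j in range(3)) for i in range(3)] for atom in scaled]

    # Subtract the unweighted centroid (an assumption; not mass-weighted)
    n_atoms = len(rotated)
    centre = [sum(atom[i] for atom in rotated) / n_atoms for i in range(3)]
    return [[atom[i] - centre[i] for i in range(3)] for atom in rotated]


if __name__ == "__main__":
    coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.0]]
    print(transform_coords(coords, 2.0, random.Random(0)))
```

Rotation and centring preserve interatomic distances, so only the scaling step changes the geometry: every distance is divided by `coord_std`.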


--------------------------------------------------------------------------------
/semlaflow/train.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | from functools import partial
  3 | from pathlib import Path
  4 | 
  5 | import lightning as L
  6 | import torch
  7 | from lightning.pytorch.callbacks import LearningRateMonitor, ModelCheckpoint
  8 | from lightning.pytorch.loggers import WandbLogger
  9 | 
 10 | import semlaflow.scriptutil as util
 11 | from semlaflow.data.datamodules import GeometricInterpolantDM
 12 | from semlaflow.data.datasets import GeometricDataset
 13 | from semlaflow.data.interpolate import GeometricInterpolant, GeometricNoiseSampler
 14 | from semlaflow.models.fm import Integrator, MolecularCFM
 15 | from semlaflow.models.semla import EquiInvDynamics, SemlaGenerator
 16 | 
 17 | DEFAULT_DATASET = "geom-drugs"
 18 | DEFAULT_ARCH = "semla"
 19 | 
 20 | DEFAULT_D_MODEL = 384
 21 | DEFAULT_N_LAYERS = 12
 22 | DEFAULT_D_MESSAGE = 128
 23 | DEFAULT_D_EDGE = 128
 24 | DEFAULT_N_COORD_SETS = 64
 25 | DEFAULT_N_ATTN_HEADS = 32
 26 | DEFAULT_D_MESSAGE_HIDDEN = 128
 27 | DEFAULT_COORD_NORM = "length"
 28 | DEFAULT_SIZE_EMB = 64
 29 | 
 30 | DEFAULT_MAX_ATOMS = 256
 31 | 
 32 | DEFAULT_EPOCHS = 200
 33 | DEFAULT_LR = 0.0003
 34 | DEFAULT_BATCH_COST = 4096
 35 | DEFAULT_ACC_BATCHES = 1
 36 | DEFAULT_GRADIENT_CLIP_VAL = 1.0
 37 | DEFAULT_TYPE_LOSS_WEIGHT = 0.2
 38 | DEFAULT_BOND_LOSS_WEIGHT = 1.0
 39 | DEFAULT_CHARGE_LOSS_WEIGHT = 1.0
 40 | DEFAULT_CATEGORICAL_STRATEGY = "uniform-sample"
 41 | DEFAULT_LR_SCHEDULE = "constant"
 42 | DEFAULT_WARM_UP_STEPS = 10000
 43 | DEFAULT_BUCKET_COST_SCALE = "linear"
 44 | 
 45 | DEFAULT_N_VALIDATION_MOLS = 2000
 46 | DEFAULT_VAL_CHECK_EPOCHS = 10
 47 | DEFAULT_NUM_INFERENCE_STEPS = 100
 48 | DEFAULT_CAT_SAMPLING_NOISE_LEVEL = 1
 49 | DEFAULT_COORD_NOISE_STD_DEV = 0.2
 50 | DEFAULT_TYPE_DIST_TEMP = 1.0
 51 | DEFAULT_TIME_ALPHA = 2.0
 52 | DEFAULT_TIME_BETA = 1.0
 53 | DEFAULT_OPTIMAL_TRANSPORT = "equivariant"
 54 | 
 55 | 
 56 | def build_model(args, dm, vocab):
 57 |     # Collect hyperparameters from the args and the datamodule, pass these into the model to be saved
 58 |     hparams = {
 59 |         "epochs": args.epochs,
 60 |         "gradient_clip_val": args.gradient_clip_val,
 61 |         "dataset": args.dataset,
 62 |         "precision": "32",
 63 |         "architecture": args.arch,
 64 |         **dm.hparams,
 65 |     }
 66 | 
 67 |     # Add 1 for the time (0 <= t <= 1 for flow matching)
 68 |     n_atom_feats = vocab.size + 1
 69 |     n_bond_types = util.get_n_bond_types(args.categorical_strategy)
 70 | 
 71 |     if args.arch == "semla":
 72 |         dynamics = EquiInvDynamics(
 73 |             args.d_model,
 74 |             args.d_message,
 75 |             args.n_coord_sets,
 76 |             args.n_layers,
 77 |             n_attn_heads=args.n_attn_heads,
 78 |             d_message_hidden=args.d_message_hidden,
 79 |             d_edge=args.d_edge,
 80 |             bond_refine=True,
 81 |             self_cond=args.self_condition,
 82 |             coord_norm=args.coord_norm,
 83 |         )
 84 |         egnn_gen = SemlaGenerator(
 85 |             args.d_model,
 86 |             dynamics,
 87 |             vocab.size,
 88 |             n_atom_feats,
 89 |             d_edge=args.d_edge,
 90 |             n_edge_types=n_bond_types,
 91 |             self_cond=args.self_condition,
 92 |             size_emb=args.size_emb,
 93 |             max_atoms=args.max_atoms,
 94 |         )
 95 | 
 96 |     elif args.arch == "eqgat":
 97 |         from semlaflow.models.eqgat import EqgatGenerator
 98 | 
 99 |         # Hardcode for now since we only need one model size
100 |         d_model_eqgat = 256
101 |         n_equi_feats_eqgat = 256
102 |         n_layers_eqgat = 12
103 |         d_edge_eqgat = 128
104 | 
105 |         egnn_gen = EqgatGenerator(
106 |             d_model_eqgat, n_layers_eqgat, n_equi_feats_eqgat, vocab.size, n_atom_feats, d_edge_eqgat, n_bond_types
107 |         )
108 | 
109 |     elif args.arch == "egnn":
110 |         from semlaflow.models.egnn import VanillaEgnnGenerator
111 | 
112 |         egnn_gen = VanillaEgnnGenerator(
113 |             args.d_model, args.n_layers, vocab.size, n_atom_feats, d_edge=args.d_edge, n_edge_types=n_bond_types
114 |         )
115 | 
116 |     else:
117 |         raise ValueError(f"Unknown architecture '{args.arch}'; known: `semla`, `eqgat` or `egnn`")
118 | 
119 |     if args.dataset == "qm9":
120 |         coord_scale = util.QM9_COORDS_STD_DEV
121 |     elif args.dataset == "geom-drugs":
122 |         coord_scale = util.GEOM_COORDS_STD_DEV
123 |     else:
124 |         raise ValueError(f"Unknown dataset {args.dataset}")
125 | 
126 |     type_mask_index = None
127 |     bond_mask_index = None
128 | 
129 |     if args.categorical_strategy == "mask":
130 |         type_mask_index = vocab.indices_from_tokens(["<MASK>"])[0]
131 |         bond_mask_index = util.BOND_MASK_INDEX
132 |         train_strategy = "mask"
133 |         sampling_strategy = "mask"
134 | 
135 |     elif args.categorical_strategy == "uniform-sample":
136 |         train_strategy = "ce"
137 |         sampling_strategy = "uniform-sample"
138 | 
139 |     elif args.categorical_strategy == "dirichlet":
140 |         train_strategy = "ce"
141 |         sampling_strategy = "dirichlet"
142 | 
143 |     else:
144 |         raise ValueError(
145 |             f"Interpolation '{args.categorical_strategy}' is not supported. "
146 |             + "Supported are: `mask`, `uniform-sample` and `dirichlet`"
147 |         )
148 | 
149 |     train_steps = util.calc_train_steps(dm, args.epochs, args.acc_batches)
150 |     train_smiles = None if args.trial_run else [mols.str_id for mols in dm.train_dataset]
151 | 
152 |     print(f"Total training steps {train_steps}")
153 | 
154 |     integrator = Integrator(
155 |         args.num_inference_steps,
156 |         type_strategy=sampling_strategy,
157 |         bond_strategy=sampling_strategy,
158 |         cat_noise_level=args.cat_sampling_noise_level,
159 |         type_mask_index=type_mask_index,
160 |         bond_mask_index=bond_mask_index,
161 |     )
162 | 
163 |     fm_model = MolecularCFM(
164 |         egnn_gen,
165 |         vocab,
166 |         args.lr,
167 |         integrator,
168 |         coord_scale=coord_scale,
169 |         type_strategy=train_strategy,
170 |         bond_strategy=train_strategy,
171 |         type_loss_weight=args.type_loss_weight,
172 |         bond_loss_weight=args.bond_loss_weight,
173 |         charge_loss_weight=args.charge_loss_weight,
174 |         pairwise_metrics=False,
175 |         use_ema=args.use_ema,
176 |         compile_model=False,
177 |         self_condition=args.self_condition,
178 |         distill=False,
179 |         lr_schedule=args.lr_schedule,
180 |         warm_up_steps=args.warm_up_steps,
181 |         total_steps=train_steps,
182 |         train_smiles=train_smiles,
183 |         type_mask_index=type_mask_index,
184 |         bond_mask_index=bond_mask_index,
185 |         **hparams,
186 |     )
187 |     return fm_model
188 | 
189 | 
190 | def build_dm(args, vocab):
191 |     if args.dataset == "qm9":
192 |         coord_std = util.QM9_COORDS_STD_DEV
193 |         padded_sizes = util.QM9_BUCKET_LIMITS
194 | 
195 |     elif args.dataset == "geom-drugs":
196 |         coord_std = util.GEOM_COORDS_STD_DEV
197 |         padded_sizes = util.GEOM_DRUGS_BUCKET_LIMITS
198 | 
199 |     else:
200 |         raise ValueError(f"Unknown dataset {args.dataset}. Available datasets are `qm9` and `geom-drugs`.")
201 | 
202 |     data_path = Path(args.data_path)
203 | 
204 |     n_bond_types = util.get_n_bond_types(args.categorical_strategy)
205 |     transform = partial(util.mol_transform, vocab=vocab, n_bonds=n_bond_types, coord_std=coord_std)
206 | 
207 |     # Load generated dataset with different transform fn if we are distilling a model
208 |     # if args.distill:
209 |     #     distill_transform = partial(util.distill_transform, coord_std=coord_std)
210 |     #     train_dataset = GeometricDataset.load(data_path / "distill.smol", transform=distill_transform)
211 |     # else:
212 |     #     train_dataset = GeometricDataset.load(data_path / "train.smol", transform=transform)
213 | 
214 |     train_dataset = GeometricDataset.load(data_path / "train.smol", transform=transform)
215 |     val_dataset = GeometricDataset.load(data_path / "val.smol", transform=transform)
216 |     val_dataset = val_dataset.sample(args.n_validation_mols)
217 | 
218 |     type_mask_index = None
219 |     bond_mask_index = None
220 | 
221 |     if args.categorical_strategy == "mask":
222 |         type_mask_index = vocab.indices_from_tokens(["<MASK>"])[0]
223 |         bond_mask_index = util.BOND_MASK_INDEX
224 |         categorical_interpolation = "unmask"
225 |         categorical_noise = "mask"
226 | 
227 |     elif args.categorical_strategy == "uniform-sample":
228 |         categorical_interpolation = "unmask"
229 |         categorical_noise = "uniform-sample"
230 | 
231 |     elif args.categorical_strategy == "dirichlet":
232 |         categorical_interpolation = "dirichlet"
233 |         categorical_noise = "uniform-dist"
234 | 
235 |     else:
236 |         raise ValueError(
237 |             f"Interpolation '{args.categorical_strategy}' is not supported. "
238 |             + "Supported are: `mask`, `uniform-sample` and `dirichlet`"
239 |         )
240 | 
241 |     scale_ot = False
242 |     batch_ot = False
243 |     equivariant_ot = False
244 | 
245 |     if args.optimal_transport == "batch":
246 |         batch_ot = True
247 |     elif args.optimal_transport == "equivariant":
248 |         equivariant_ot = True
249 |     elif args.optimal_transport == "scale":
250 |         scale_ot = True
251 |         equivariant_ot = True
252 |     elif args.optimal_transport not in ["None", "none", None]:
253 |         raise ValueError(
254 |             f"Unknown value for optimal_transport '{args.optimal_transport}'. "
255 |             + "Accepted values: `batch`, `equivariant`, `scale` and `none`."
256 |         )
257 | 
258 |     # train_fixed_time = 0.5 if args.distill else None
259 |     train_fixed_time = None
260 | 
261 |     prior_sampler = GeometricNoiseSampler(
262 |         vocab.size,
263 |         n_bond_types,
264 |         coord_noise="gaussian",
265 |         type_noise=categorical_noise,
266 |         bond_noise=categorical_noise,
267 |         scale_ot=scale_ot,
268 |         zero_com=True,
269 |         type_mask_index=type_mask_index,
270 |         bond_mask_index=bond_mask_index,
271 |     )
272 |     train_interpolant = GeometricInterpolant(
273 |         prior_sampler,
274 |         coord_interpolation="linear",
275 |         type_interpolation=categorical_interpolation,
276 |         bond_interpolation=categorical_interpolation,
277 |         coord_noise_std=args.coord_noise_std_dev,
278 |         type_dist_temp=args.type_dist_temp,
279 |         equivariant_ot=equivariant_ot,
280 |         batch_ot=batch_ot,
281 |         time_alpha=args.time_alpha,
282 |         time_beta=args.time_beta,
283 |         fixed_time=train_fixed_time,
284 |     )
285 |     eval_interpolant = GeometricInterpolant(
286 |         prior_sampler,
287 |         coord_interpolation="linear",
288 |         type_interpolation=categorical_interpolation,
289 |         bond_interpolation=categorical_interpolation,
290 |         equivariant_ot=False,
291 |         batch_ot=False,
292 |         fixed_time=0.9,
293 |     )
294 | 
295 |     dm = GeometricInterpolantDM(
296 |         train_dataset,
297 |         val_dataset,
298 |         None,
299 |         args.batch_cost,
300 |         train_interpolant=train_interpolant,
301 |         val_interpolant=eval_interpolant,
302 |         test_interpolant=None,
303 |         bucket_limits=padded_sizes,
304 |         bucket_cost_scale=args.bucket_cost_scale,
305 |         pad_to_bucket=False,
306 |     )
307 |     return dm
308 | 
309 | 
310 | def build_trainer(args):
311 |     epochs = 1 if args.trial_run else args.epochs
312 |     log_steps = 1 if args.trial_run else 50
313 |     val_check_epochs = 1 if args.trial_run else args.val_check_epochs
314 | 
315 |     project_name = f"{util.PROJECT_PREFIX}-{args.dataset}"
316 |     print("Using precision '32'")
317 | 
318 |     logger = WandbLogger(project=project_name, save_dir="wandb", log_model=True)
319 |     lr_monitor = LearningRateMonitor(logging_interval="step")
320 |     checkpointing = ModelCheckpoint(every_n_epochs=val_check_epochs, monitor="val-validity", mode="max", save_last=True)
321 | 
322 |     # No logger if doing a trial run
323 |     logger = None if args.trial_run else logger
324 | 
325 |     trainer = L.Trainer(
326 |         min_epochs=epochs,
327 |         max_epochs=epochs,
328 |         logger=logger,
329 |         log_every_n_steps=log_steps,
330 |         accumulate_grad_batches=args.acc_batches,
331 |         gradient_clip_val=args.gradient_clip_val,
332 |         check_val_every_n_epoch=val_check_epochs,
333 |         callbacks=[lr_monitor, checkpointing],
334 |         precision="32",
335 |     )
336 |     return trainer
337 | 
338 | 
339 | def main(args):
340 |     # Set some useful torch properties
341 |     # Float32 precision should only affect computation on A100 and should in theory be a lot faster than the default
342 |     # Increasing the cache size is required since the model will be compiled separately for each bucket
343 |     torch.set_float32_matmul_precision("high")
344 |     # torch._dynamo.config.cache_size_limit = util.COMPILER_CACHE_SIZE
345 |     # print(f"Set torch compiler cache size to {torch._dynamo.config.cache_size_limit}")
346 | 
347 |     L.seed_everything(12345)
348 |     util.disable_lib_stdout()
349 |     util.configure_fs()
350 | 
351 |     print("Building model vocab...")
352 |     vocab = util.build_vocab()
353 |     print("Vocab complete.")
354 | 
355 |     print("Loading datamodule...")
356 |     dm = build_dm(args, vocab)
357 |     print("Datamodule complete.")
358 | 
359 |     print("Building equinv model...")
360 |     model = build_model(args, dm, vocab)
361 |     print("Model complete.")
362 | 
363 |     trainer = build_trainer(args)
364 | 
365 |     print("Fitting datamodule to model...")
366 |     trainer.fit(model, datamodule=dm)
367 |     print("Training complete.")
368 | 
369 | 
370 | if __name__ == "__main__":
371 |     parser = argparse.ArgumentParser()
372 | 
373 |     # Setup args
374 |     parser.add_argument("--data_path", type=str)
375 |     parser.add_argument("--dataset", type=str, default=DEFAULT_DATASET)
376 |     parser.add_argument("--trial_run", action="store_true")
377 | 
378 |     # Model args
379 |     parser.add_argument("--d_model", type=int, default=DEFAULT_D_MODEL)
380 |     parser.add_argument("--n_layers", type=int, default=DEFAULT_N_LAYERS)
381 |     parser.add_argument("--d_message", type=int, default=DEFAULT_D_MESSAGE)
382 |     parser.add_argument("--d_edge", type=int, default=DEFAULT_D_EDGE)
383 |     parser.add_argument("--n_coord_sets", type=int, default=DEFAULT_N_COORD_SETS)
384 |     parser.add_argument("--n_attn_heads", type=int, default=DEFAULT_N_ATTN_HEADS)
385 |     parser.add_argument("--d_message_hidden", type=int, default=DEFAULT_D_MESSAGE_HIDDEN)
386 |     parser.add_argument("--coord_norm", type=str, default=DEFAULT_COORD_NORM)
387 |     parser.add_argument("--size_emb", type=int, default=DEFAULT_SIZE_EMB)
388 |     parser.add_argument("--max_atoms", type=int, default=DEFAULT_MAX_ATOMS)
389 |     parser.add_argument("--arch", type=str, default=DEFAULT_ARCH)
390 | 
391 |     # Training args
392 |     parser.add_argument("--epochs", type=int, default=DEFAULT_EPOCHS)
393 |     parser.add_argument("--lr", type=float, default=DEFAULT_LR)
394 |     parser.add_argument("--batch_cost", type=int, default=DEFAULT_BATCH_COST)
395 |     parser.add_argument("--acc_batches", type=int, default=DEFAULT_ACC_BATCHES)
396 |     parser.add_argument("--gradient_clip_val", type=float, default=DEFAULT_GRADIENT_CLIP_VAL)
397 |     parser.add_argument("--type_loss_weight", type=float, default=DEFAULT_TYPE_LOSS_WEIGHT)
398 |     parser.add_argument("--bond_loss_weight", type=float, default=DEFAULT_BOND_LOSS_WEIGHT)
399 |     parser.add_argument("--charge_loss_weight", type=float, default=DEFAULT_CHARGE_LOSS_WEIGHT)
400 |     parser.add_argument("--categorical_strategy", type=str, default=DEFAULT_CATEGORICAL_STRATEGY)
401 |     parser.add_argument("--lr_schedule", type=str, default=DEFAULT_LR_SCHEDULE)
402 |     parser.add_argument("--warm_up_steps", type=int, default=DEFAULT_WARM_UP_STEPS)
403 |     parser.add_argument("--bucket_cost_scale", type=str, default=DEFAULT_BUCKET_COST_SCALE)
404 |     parser.add_argument("--no_ema", action="store_false", dest="use_ema")
405 |     parser.add_argument("--self_condition", action="store_true")
406 |     # parser.add_argument("--mixed_precision", action="store_true")
407 |     # parser.add_argument("--compile_model", action="store_true")
408 |     # parser.add_argument("--distill", action="store_true")
409 | 
410 |     # Flow matching and sampling args
411 |     parser.add_argument("--val_check_epochs", type=int, default=DEFAULT_VAL_CHECK_EPOCHS)
412 |     parser.add_argument("--n_validation_mols", type=int, default=DEFAULT_N_VALIDATION_MOLS)
413 |     parser.add_argument("--num_inference_steps", type=int, default=DEFAULT_NUM_INFERENCE_STEPS)
414 |     parser.add_argument("--cat_sampling_noise_level", type=int, default=DEFAULT_CAT_SAMPLING_NOISE_LEVEL)
415 |     parser.add_argument("--coord_noise_std_dev", type=float, default=DEFAULT_COORD_NOISE_STD_DEV)
416 |     parser.add_argument("--type_dist_temp", type=float, default=DEFAULT_TYPE_DIST_TEMP)
417 |     parser.add_argument("--time_alpha", type=float, default=DEFAULT_TIME_ALPHA)
418 |     parser.add_argument("--time_beta", type=float, default=DEFAULT_TIME_BETA)
419 |     parser.add_argument("--optimal_transport", type=str, default=DEFAULT_OPTIMAL_TRANSPORT)
420 | 
421 |     parser.set_defaults(
422 |         trial_run=False,
423 |         use_ema=True,
424 |         self_condition=True,
425 |         # compile_model=False,
426 |         # mixed_precision=False,
427 |         # distill=False
428 |     )
429 | 
430 |     args = parser.parse_args()
431 |     main(args)
432 | 
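
`build_model` and `build_dm` each branch on `--categorical_strategy` and must stay in agreement with one another. As a reading aid, the mapping they implement can be written as a single lookup table. This is a sketch only: `CategoricalConfig`, `CATEGORICAL_CONFIGS` and `resolve_categorical_strategy` are hypothetical names that do not exist in the repository; the string values are copied from the two if/elif chains above.

```python
from typing import NamedTuple


class CategoricalConfig(NamedTuple):
    train_strategy: str     # passed to MolecularCFM as type/bond strategy
    sampling_strategy: str  # passed to Integrator as type/bond strategy
    interpolation: str      # passed to GeometricInterpolant
    noise: str              # passed to GeometricNoiseSampler
    uses_mask_index: bool   # whether type/bond mask indices are set


# Values copied from the if/elif chains in build_model and build_dm
CATEGORICAL_CONFIGS = {
    "mask": CategoricalConfig("mask", "mask", "unmask", "mask", True),
    "uniform-sample": CategoricalConfig("ce", "uniform-sample", "unmask", "uniform-sample", False),
    "dirichlet": CategoricalConfig("ce", "dirichlet", "dirichlet", "uniform-dist", False),
}


def resolve_categorical_strategy(name):
    """Look up the settings for a strategy name, raising on unknown names."""
    try:
        return CATEGORICAL_CONFIGS[name]
    except KeyError:
        supported = ", ".join(f"`{key}`" for key in CATEGORICAL_CONFIGS)
        raise ValueError(f"Interpolation '{name}' is not supported. Supported are: {supported}") from None


if __name__ == "__main__":
    print(resolve_categorical_strategy("uniform-sample"))
```

Keeping the settings in one table would mean the two builders cannot drift apart, at the cost of an extra indirection when reading either function.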


--------------------------------------------------------------------------------
/semlaflow/util/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rssrwn/semla-flow/65d7106bc907e4d136deec8de7417386e3f9b10b/semlaflow/util/__init__.py


--------------------------------------------------------------------------------
/semlaflow/util/functional.py:
--------------------------------------------------------------------------------
  1 | from typing import Union
  2 | 
  3 | import torch
  4 | from scipy.spatial.transform import Rotation
  5 | 
  6 | _T = torch.Tensor
  7 | TupleRot = tuple[float, float, float]
  8 | 
  9 | 
 10 | # *************************************************************************************************
 11 | # ********************************** Tensor Util Functions ****************************************
 12 | # *************************************************************************************************
 13 | 
 14 | 
 15 | def pad_tensors(tensors: list[_T], pad_dim: int = 0) -> _T:
 16 |     """Pad a list of tensors with zeros
 17 | 
 18 |     All dimensions other than pad_dim must have the same shape. A single tensor is returned with the batch dimension
 19 |     first, where the batch dimension is the length of the tensors list.
 20 | 
 21 |     Args:
 22 |         tensors (list[torch.Tensor]): List of tensors
 23 |         pad_dim (int): Dimension on tensors to pad. All other dimensions must be the same size.
 24 | 
 25 |     Returns:
 26 |         torch.Tensor: Batched, padded tensor, if pad_dim is 0 then shape [B, L, *] where L is length of longest tensor.
 27 |     """
 28 | 
 29 |     if pad_dim != 0:
 30 |         # TODO
 31 |         raise NotImplementedError()
 32 | 
 33 |     padded = torch.nn.utils.rnn.pad_sequence(tensors, batch_first=True)
 34 |     return padded
 35 | 
 36 | 
 37 | # TODO replace with tensor version below
 38 | def one_hot_encode(indices: list[int], vocab_size: int) -> _T:
 39 |     """Create one-hot encodings from a list of indices
 40 | 
 41 |     Args:
 42 |         indices (list[int]): List of indices into one-hot vectors
 43 |         vocab_size (int): Length of returned vectors
 44 | 
 45 |     Returns:
 46 |         torch.Tensor: One-hot encoded vectors, shape [L, vocab_size] where L is length of indices list
 47 |     """
 48 | 
 49 |     one_hots = torch.zeros((len(indices), vocab_size), dtype=torch.int64)
 50 | 
 51 |     for batch_idx, vocab_idx in enumerate(indices):
 52 |         one_hots[batch_idx, vocab_idx] = 1
 53 | 
 54 |     return one_hots
 55 | 
 56 | 
 57 | # TODO test
 58 | def one_hot_encode_tensor(indices: _T, vocab_size: int) -> _T:
 59 |     """Create one-hot encodings from indices
 60 | 
 61 |     Args:
 62 |         indices (torch.Tensor): Indices into one-hot vectors, shape [*, L]
 63 |         vocab_size (int): Length of returned vectors
 64 | 
 65 |     Returns:
 66 |         torch.Tensor: One-hot encoded vectors, shape [*, L, vocab_size]
 67 |     """
 68 | 
 69 |     one_hot_shape = (*indices.shape, vocab_size)
 70 |     one_hots = torch.zeros(one_hot_shape, dtype=torch.int64, device=indices.device)
 71 |     one_hots.scatter_(-1, indices.unsqueeze(-1), 1)
 72 |     return one_hots
 73 | 
 74 | 
 75 | def pairwise_concat(t: _T) -> _T:
 76 |     """Concatenates two representations from all possible pairings in dimension 1
 77 | 
 78 |     Computes all possible pairs of indices into dimension 1 and concatenates whatever representation they have in
 79 |     higher dimensions. Note that all higher dimensions will be flattened. The output will have its shape for
 80 |     dimension 1 duplicated in dimension 2.
 81 | 
 82 |     Example:
 83 |     Input shape [100, 16, 128]
 84 |     Output shape [100, 16, 16, 256]
 85 |     """
 86 | 
 87 |     idx_pairs = torch.cartesian_prod(*((torch.arange(t.shape[1]),) * 2))
 88 |     output = t[:, idx_pairs].view(t.shape[0], t.shape[1], t.shape[1], -1)
 89 |     return output
 90 | 
 91 | 
 92 | def segment_sum(data, segment_ids, num_segments):
 93 |     """Computes the sum of data elements that are in each segment
 94 | 
 95 |     The inputs must have shapes that look like the following:
 96 |     data [batch_size, seq_length, num_features]
 97 |     segment_ids [batch_size, seq_length], must contain integers
 98 | 
 99 |     Then the output will have the following shape:
100 |     output [batch_size, num_segments, num_features]
101 |     """
102 | 
103 |     err_msg = "data and segment_ids must have the same shape in the first two dimensions"
104 |     assert data.shape[0:2] == segment_ids.shape[0:2], err_msg
105 | 
106 |     result_shape = (data.shape[0], num_segments, data.shape[2])
107 |     result = data.new_full(result_shape, 0)
108 |     segment_ids = segment_ids.unsqueeze(-1).expand(-1, -1, data.shape[2])
109 |     result.scatter_add_(1, segment_ids, data)
110 |     return result
111 | 
112 | 
113 | # *************************************************************************************************
114 | # ******************************* Functions for handling edges ************************************
115 | # *************************************************************************************************
116 | 
117 | 
118 | def adj_from_node_mask(node_mask, self_connect=False):
119 |     """Creates an edge mask from a node mask, assuming all real nodes are fully connected (self-connections optional)
120 | 
121 |     Args:
122 |         node_mask (torch.Tensor): Node mask tensor, shape [batch_size, num_nodes], 1 for real node 0 otherwise
123 |         self_connect (bool): Whether to include self connections in the adjacency
124 | 
125 |     Returns:
126 |         torch.Tensor: Adjacency tensor, shape [batch_size, num_nodes, num_nodes], 1 for real edge 0 otherwise
127 |     """
128 | 
129 |     num_nodes = node_mask.size()[1]
130 | 
131 |     # Matrix mult gives us an outer product on the node mask, which is an edge mask
132 |     mask = node_mask.float()
133 |     adjacency = torch.bmm(mask.unsqueeze(2), mask.unsqueeze(1))
134 |     adjacency = adjacency.long()
135 | 
136 |     # Set diagonal connections
137 |     node_idxs = torch.arange(num_nodes)
138 |     self_mask = node_mask if self_connect else torch.zeros_like(node_mask)
139 |     adjacency[:, node_idxs, node_idxs] = self_mask
140 | 
141 |     return adjacency
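
As an illustration of the outer-product trick in adj_from_node_mask, the sketch below builds the adjacency for a single hypothetical molecule with two real nodes and one padding node. It mirrors the self_connect=False case only and is not taken from the repository.

```python
import torch

# Hypothetical mask: two real nodes followed by one padding node
node_mask = torch.tensor([[1, 1, 0]])

# Outer product of the mask with itself marks pairs where both nodes are real
mask = node_mask.float()
adjacency = torch.bmm(mask.unsqueeze(2), mask.unsqueeze(1)).long()

# Remove self-connections by zeroing the diagonal
idx = torch.arange(node_mask.size(1))
adjacency[:, idx, idx] = 0

print(adjacency)
```

Rows and columns belonging to the padding node stay all-zero, so no edge ever touches a fake node.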
142 | 
143 | 
144 | def _pad_edges(edges, max_edges, value=0):
145 |     """Add fake edges to an edge tensor so that the shape matches max_edges
146 | 
147 |     Args:
148 |         edges (torch.Tensor): Unbatched edge tensor, shape [num_edges, 2], each element is a node index for the edge
149 |         max_edges (int): The number of edges the output tensor should have
150 |         value (int): Padding value, default 0
151 | 
152 |     Returns:
153 |         (torch.Tensor, torch.Tensor): Tuple of padded edge tensor and padding mask. Shapes [max_edges, 2] for edge
154 |                 tensor and [max_edges] for mask. Mask is one for pad elements, 0 otherwise.
155 |     """
156 | 
157 |     num_edges = edges.size(0)
158 |     mask_kwargs = {"dtype": torch.int64, "device": edges.device}
159 | 
160 |     if num_edges > max_edges:
161 |         raise ValueError("Number of edges in edge tensor to be padded cannot be greater than max_edges.")
162 | 
163 |     add_edges = max_edges - num_edges
164 | 
165 |     if add_edges == 0:
166 |         pad_mask = torch.zeros(num_edges, **mask_kwargs)
167 |         return edges, pad_mask
168 | 
169 |     pad = (0, 0, 0, add_edges)
170 |     padded = torch.nn.functional.pad(edges, pad, mode="constant", value=value)
171 | 
172 |     zeros_mask = torch.zeros(num_edges, **mask_kwargs)
173 |     ones_mask = torch.ones(add_edges, **mask_kwargs)
174 |     pad_mask = torch.cat((zeros_mask, ones_mask), dim=0)
175 | 
176 |     return padded, pad_mask
177 | 
178 | 
179 | # TODO change callers to use bonds_from_adj
180 | def edges_from_adj(adj_matrix):
181 |     """Flatten an adjacency matrix into a 1D edge representation
182 | 
183 |     Args:
184 |         adj_matrix (torch.Tensor): Batched adjacency matrix, shape [batch_size, num_nodes, num_nodes]. It can contain
185 |                 any non-zero integer for connected nodes but must be 0 for unconnected nodes.
186 | 
187 |     Returns:
188 |         A tuple of the edges and the edge mask tensor. The edges are themselves a two-tuple of node index tensors,
189 |         each of shape [batch_size, max_num_edges]. The mask has the same shape and contains 1 for real edges, 0 otherwise.
190 |     """
191 | 
192 |     adj_ones = torch.zeros_like(adj_matrix).int()
193 |     adj_ones[adj_matrix != 0] = 1
194 | 
195 |     # Pad each batch element by a separate amount so that they can all be packed into a tensor
196 |     # It might be possible to do this in batch form without iterating, but for now this will do
197 |     num_edges = adj_ones.sum(dim=(1, 2)).tolist()
198 |     edge_tuples = list(adj_matrix.nonzero()[:, 1:].split(num_edges))
199 |     padded = [_pad_edges(edges, max(num_edges), value=0) for edges in edge_tuples]
200 | 
201 |     # Unravel the padded tuples and stack them into batches
202 |     edge_tuples_padded, pad_masks = tuple(zip(*padded))
203 |     edges = torch.stack(edge_tuples_padded).long()
204 |     edges = (edges[:, :, 0], edges[:, :, 1])
205 |     edge_mask = (torch.stack(pad_masks) == 0).long()
206 |     return edges, edge_mask
207 | 
208 | 
209 | # TODO test and merge with edges_from_adj
210 | def bonds_from_adj(adj_matrix, lower_tri=True):
211 |     """Flatten an adjacency matrix into a 1D edge representation
212 | 
213 |     Args:
214 |         adj_matrix (torch.Tensor): Adjacency matrix, can be batched or not, shape [batch_size, num_nodes, num_nodes].
215 |             Each item in the matrix corresponds to the bond type and will be placed into index 2 on dim 1 in bonds.
216 |         lower_tri (bool): Whether to only consider bonds which sit in the lower triangular of adj_matrix.
217 | 
218 |     Returns:
219 |         A bond list tensor, shape [batch_size, num_bonds, 3]. If an item is a padding bond index 2 on the last
220 |             dimension will be 0.
221 |     """
222 | 
223 |     batched = True
224 |     if len(adj_matrix.shape) == 2:
225 |         adj_matrix = adj_matrix.unsqueeze(0)
226 |         batched = False
227 | 
228 |     if lower_tri:
229 |         adj_matrix = torch.tril(adj_matrix, diagonal=-1)
230 | 
231 |     bonds = []
232 |     for adj in list(adj_matrix):
233 |         bond_indices = adj.nonzero()
234 |         bond_types = adj[bond_indices[:, 0], bond_indices[:, 1]]
235 |         bond_list = torch.cat((bond_indices, bond_types.unsqueeze(-1)), dim=-1)
236 |         bonds.append(bond_list)
237 | 
238 |     # Bonds will be padded with 0s so the bond type will tell whether the bond is real or not
239 |     bonds = pad_tensors(bonds, pad_dim=0)
240 |     if not batched:
241 |         bonds = bonds.squeeze(0)
242 | 
243 |     return bonds
244 | 
245 | 
246 | def adj_from_edges(edge_indices: _T, edge_types: _T, n_nodes: int, symmetric: bool = False):
247 |     """Create adjacency matrix from a list of edge indices and types
248 | 
249 |     If an edge pair appears multiple times with different edge types, the adj element for that edge is undefined.
250 | 
251 |     Args:
252 |         edge_indices (torch.Tensor): Edge list tensor, shape [n_edges, 2]. Pairs of (from_idx, to_idx).
253 |         edge_types (torch.Tensor): Edge types, shape either [n_edges] or [n_edges, edge_types].
254 |         n_nodes (int): Number of nodes in the adjacency matrix. This must be greater than the max node index in edges.
255 |         symmetric (bool): Whether edges are considered symmetric. If True the adjacency matrix will also be symmetric,
256 |                 otherwise only the exact node indices within edges will be used to create the adjacency.
257 | 
258 |     Returns:
259 |         torch.Tensor: Adjacency matrix tensor, shape [n_nodes, n_nodes] or
260 |                 [n_nodes, n_nodes, edge_types] if distributions over edge types are provided.
261 |     """
262 | 
263 |     assert len(edge_indices.shape) == 2
264 |     assert edge_indices.shape[0] == edge_types.shape[0]
265 |     assert edge_indices.size(1) == 2
266 | 
267 |     adj_dist = len(edge_types.shape) == 2
268 | 
269 |     edge_indices = edge_indices.long()
270 |     edge_types = edge_types.float() if adj_dist else edge_types.long()
271 | 
272 |     if adj_dist:
273 |         shape = (n_nodes, n_nodes, edge_types.size(-1))
274 |         adj = torch.zeros(shape, device=edge_indices.device, dtype=torch.float)
275 | 
276 |     else:
277 |         shape = (n_nodes, n_nodes)
278 |         adj = torch.zeros(shape, device=edge_indices.device, dtype=torch.long)
279 | 
280 |     from_indices = edge_indices[:, 0]
281 |     to_indices = edge_indices[:, 1]
282 | 
283 |     adj[from_indices, to_indices] = edge_types
284 |     if symmetric:
285 |         adj[to_indices, from_indices] = edge_types
286 | 
287 |     return adj
288 | 
289 | 
290 | def edges_from_nodes(coords, k=None, node_mask=None, edge_format="adjacency"):
291 |     """Construct edges from node coords
292 | 
293 |     Connects a node to its k nearest nodes. If k is None then connects each node to all its neighbours. A node is
294 |     never connected to itself.
295 | 
296 |     Args:
297 |         coords (torch.Tensor): Node coords, shape [batch_size, num_nodes, 3]
298 |         k (int): Number of neighbours to connect each node to, None means connect to all nodes except itself
299 |         node_mask (torch.Tensor): Node mask, shape [batch_size, num_nodes], 1 for real nodes 0 otherwise
300 |         edge_format (str): Edge format, should be either 'adjacency' or 'list'
301 | 
302 |     Returns:
303 |         If format is 'adjacency' this returns an adjacency matrix, shape [batch_size, num_nodes, num_nodes] which
304 |         contains 1 for connected nodes and 0 otherwise. Note that if a value for k is provided the adjacency matrix
305 |         may not be symmetric and should always be used s.t. 'from nodes' are in dim 1 and 'to nodes' are in dim 2.
306 | 
307 |         If format is 'list' this returns the tuple (edges, edge mask), edges is also a two-tuple of tensors, each of
308 |         shape [batch_size, num_edges], specifying node indices for each edge. The edge mask has shape
309 |         [batch_size, num_edges] and contains 1 for 'real' edges and 0 otherwise.
310 |     """
311 | 
312 |     if edge_format not in ["adjacency", "list"]:
313 |         raise ValueError(f"Unrecognised edge format '{edge_format}'")
314 | 
315 |     adj_format = edge_format == "adjacency"
316 |     batch_size, num_nodes, _ = coords.size()
317 | 
318 |     # If node mask is None all nodes are real
319 |     if node_mask is None:
320 |         node_mask = torch.ones((batch_size, num_nodes), device=coords.device, dtype=torch.int64)
321 | 
322 |     adj_matrix = adj_from_node_mask(node_mask)
323 | 
324 |     if k is not None:
325 |         # Find k closest nodes for each node
326 |         dists = calc_distances(coords)
327 |         dists[adj_matrix == 0] = float("inf")
328 |         _, best_idxs = dists.topk(k, dim=2, largest=False)
329 | 
330 |         # Adjust adj matrix to only have k connections per node
331 |         k_adj_matrix = torch.zeros_like(adj_matrix)
332 |         batch_idxs = torch.arange(batch_size).view(-1, 1, 1).expand(-1, num_nodes, k)
333 |         node_idxs = torch.arange(num_nodes).view(1, -1, 1).expand(batch_size, -1, k)
334 |         k_adj_matrix[batch_idxs, node_idxs, best_idxs] = 1
335 | 
336 |         # Ensure that there are no connections to fake nodes
337 |         k_adj_matrix[adj_matrix == 0] = 0
338 |         adj_matrix = k_adj_matrix
339 | 
340 |     if adj_format:
341 |         return adj_matrix
342 | 
343 |     edges, edge_mask = edges_from_adj(adj_matrix)
344 |     return edges, edge_mask
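
The note above that a k-nearest-neighbour adjacency may not be symmetric can be seen on three collinear points. This is a standalone sketch under stated assumptions: it uses torch.cdist in place of the repository's calc_distances, skips node masking, and the coordinates are hypothetical.

```python
import torch

# Hypothetical coords: three points on the x-axis at 0, 1 and 3
coords = torch.tensor([[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]])
k = 1

# Pairwise distances, shape [batch_size, num_nodes, num_nodes]
dists = torch.cdist(coords, coords)

# Exclude self-connections by making each node infinitely far from itself
eye = torch.eye(coords.size(1), dtype=torch.bool).unsqueeze(0)
dists = dists.masked_fill(eye, float("inf"))

# Indices of the k closest nodes for every 'from node'
_, nearest = dists.topk(k, dim=2, largest=False)

# 'from nodes' index dim 1 and 'to nodes' index dim 2
adj = torch.zeros(dists.shape, dtype=torch.long)
adj.scatter_(2, nearest, 1)

print(adj)
```

The point at x=3 selects the point at x=1 as its nearest neighbour, but that point selects x=0, so the edge exists in one direction only.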
345 | 
346 | 
347 | def gather_edge_features(pairwise_feats, adj_matrix):
348 |     """Gather edge features for each node from pairwise features using the adjacency matrix
349 | 
350 |     All 'from nodes' (dimension 1 on the adj matrix) must have the same number of edges to 'to nodes'. Practically
351 |     this means that the number of non-zero elements in dimension 2 of the adjacency matrix must always be the same.
352 | 
353 |     Args:
354 |         pairwise_feats (torch.Tensor): Pairwise features tensor, shape [batch_size, num_nodes, num_nodes, num_feats]
355 |         adj_matrix (torch.Tensor): Batched adjacency matrix, shape [batch_size, num_nodes, num_nodes]. It can contain
356 |                 any non-zero integer for connected nodes but must be 0 for unconnected nodes.
357 | 
358 |     Returns:
359 |         torch.Tensor: Dense feature matrix, shape [batch_size, num_nodes, edges_per_node, num_feats]
360 |     """
361 | 
362 |     # In case some of the connections don't use 1, create a 1s adjacency matrix
363 |     adj_ones = torch.zeros_like(adj_matrix).int()
364 |     adj_ones[adj_matrix != 0] = 1
365 | 
366 |     num_neighbours = adj_ones.sum(dim=2)
367 |     feats_per_node = num_neighbours[0, 0].item()
368 | 
369 |     assert (num_neighbours == feats_per_node).all(), "All nodes must have the same number of connections"
370 | 
371 |     if len(pairwise_feats.size()) == 3:
372 |         batch_size, num_nodes, _ = pairwise_feats.size()
373 |         pairwise_feats = pairwise_feats.unsqueeze(3)
374 | 
375 |     elif len(pairwise_feats.size()) == 4:
376 |         batch_size, num_nodes, _, _ = pairwise_feats.size()
377 | 
378 |     # nonzero() orders indices lexicographically with the last index changing the fastest, so we can reshape the
379 |     # indices into a dense form with nodes along the outer axis and features along the inner
380 |     gather_idxs = adj_ones.nonzero()[:, 2].reshape((batch_size, num_nodes, feats_per_node))
381 |     batch_idxs = torch.arange(batch_size).view(-1, 1, 1)
382 |     node_idxs = torch.arange(num_nodes).view(1, -1, 1)
383 |     dense_feats = pairwise_feats[batch_idxs, node_idxs, gather_idxs, :]
384 |     if dense_feats.size(-1) == 1:
385 |         return dense_feats.squeeze(-1)
386 | 
387 |     return dense_feats
388 | 
389 | 
390 | # *************************************************************************************************
391 | # ********************************* Geometric Util Functions **************************************
392 | # *************************************************************************************************
393 | 
394 | 
395 | # TODO rename? Maybe also merge with inter_distances
396 | # TODO test unbatched and coord sets inputs
397 | def calc_distances(coords, edges=None, sqrd=False, eps=1e-6):
398 |     """Computes distances between connected nodes
399 | 
400 |     Takes an optional edges argument. If edges is None this will calculate distances between all nodes and return the
401 |     distances in a batched square matrix [batch_size, num_nodes, num_nodes]. If edges is provided the distances are
402 |     returned for each edge in a batched 1D format [batch_size, num_edges].
403 | 
404 |     Args:
405 |         coords (torch.Tensor): Coordinate tensor, shape [batch_size, num_nodes, 3]
406 |         edges (tuple): Two-tuple of connected node indices, each tensor has shape [batch_size, num_edges]
407 |         sqrd (bool): Whether to return the squared distances
408 |         eps (float): Epsilon to add before taking the square root for numerical stability in the gradients
409 | 
410 |     Returns:
411 |         torch.Tensor: Distances tensor, the shape depends on whether edges is provided (see above).
412 |     """
413 | 
414 |     # TODO add checks
415 | 
416 |     # Create fake batch dim if unbatched
417 |     unbatched = False
418 |     if len(coords.size()) == 2:
419 |         coords = coords.unsqueeze(0)
420 |         unbatched = True
421 | 
422 |     if edges is None:
423 |         coord_diffs = coords.unsqueeze(-2) - coords.unsqueeze(-3)
424 |         sqrd_dists = torch.sum(coord_diffs * coord_diffs, dim=-1)
425 | 
426 |     else:
427 |         edge_is, edge_js = edges
428 |         batch_index = torch.arange(coords.size(0)).unsqueeze(1)
429 |         coord_diffs = coords[batch_index, edge_js, :] - coords[batch_index, edge_is, :]
430 |         sqrd_dists = torch.sum(coord_diffs * coord_diffs, dim=2)
431 | 
432 |     sqrd_dists = sqrd_dists.squeeze(0) if unbatched else sqrd_dists
433 | 
434 |     if sqrd:
435 |         return sqrd_dists
436 | 
437 |     return torch.sqrt(sqrd_dists + eps)
438 | 
439 | 
440 | def inter_distances(coords1, coords2, sqrd=False, eps=1e-6):
441 |     # TODO add checks and doc
442 | 
443 |     # Create fake batch dim if unbatched
444 |     unbatched = False
445 |     if len(coords1.size()) == 2:
446 |         coords1 = coords1.unsqueeze(0)
447 |         coords2 = coords2.unsqueeze(0)
448 |         unbatched = True
449 | 
450 |     coord_diffs = coords1.unsqueeze(2) - coords2.unsqueeze(1)
451 |     sqrd_dists = torch.sum(coord_diffs * coord_diffs, dim=3)
452 |     sqrd_dists = sqrd_dists.squeeze(0) if unbatched else sqrd_dists
453 | 
454 |     if sqrd:
455 |         return sqrd_dists
456 | 
457 |     return torch.sqrt(sqrd_dists + eps)
458 | 
459 | 
460 | def calc_com(coords, node_mask=None):
461 |     """Calculates the centre of mass of a pointcloud
462 | 
463 |     Args:
464 |         coords (torch.Tensor): Coordinate tensor, shape [*, num_nodes, 3]
465 |         node_mask (torch.Tensor): Mask for points, shape [*, num_nodes], 1 for real node, 0 otherwise
466 | 
467 |     Returns:
468 |         torch.Tensor: CoM of pointclouds with imaginary nodes excluded, shape [*, 1, 3]
469 |     """
470 | 
471 |     node_mask = torch.ones_like(coords[..., 0]) if node_mask is None else node_mask
472 | 
473 |     assert node_mask.shape == coords[..., 0].shape
474 | 
475 |     num_nodes = node_mask.sum(dim=-1)
476 |     real_coords = coords * node_mask.unsqueeze(-1)
477 |     com = real_coords.sum(dim=-2) / num_nodes.unsqueeze(-1)
478 |     return com.unsqueeze(-2)
479 | 
480 | 
481 | def zero_com(coords, node_mask=None):
482 |     """Sets the centre of mass for a batch of pointclouds to zero for each pointcloud
483 | 
484 |     Args:
485 |         coords (torch.Tensor): Coordinate tensor, shape [*, num_nodes, 3]
486 |         node_mask (torch.Tensor): Mask for points, shape [*, num_nodes], 1 for real node, 0 otherwise
487 | 
488 |     Returns:
489 |         torch.Tensor: CoM-free coordinates, where imaginary nodes are excluded from CoM calculation
490 |     """
491 | 
492 |     com = calc_com(coords, node_mask=node_mask)
493 |     shifted = coords - com
494 |     return shifted
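
A small worked example of the masked centre-of-mass calculation may help. The sketch below repeats the arithmetic of calc_com and zero_com inline on hypothetical values and is not part of the repository.

```python
import torch

# Hypothetical pointcloud: two real nodes and one padding node with junk coords
coords = torch.tensor([[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [9.0, 9.0, 9.0]]])
node_mask = torch.tensor([[1.0, 1.0, 0.0]])

# Mean over real nodes only: mask out padding, then divide by the real node count
num_nodes = node_mask.sum(dim=-1)
com = (coords * node_mask.unsqueeze(-1)).sum(dim=-2) / num_nodes.unsqueeze(-1)

# Subtract the CoM from every node, padding included
shifted = coords - com.unsqueeze(-2)

print(com)      # centre of the two real nodes
print(shifted)  # real nodes are now centred on the origin
```

The padding node is shifted too, so callers that need zeros at padded positions would have to re-apply the mask afterwards.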
495 | 
496 | 
497 | def standardise_coords(coords, node_mask=None):
498 |     """Convert coords into a standard normal distribution
499 | 
500 |     This rescales the coords by the (biased) standard deviation computed over all real nodes in the batch. Note that
501 |     the centre of mass is not removed here, so call zero_com first if zero-mean coords are required.
502 | 
503 |     Args:
504 |         coords (torch.Tensor):  Coordinate tensor, shape [batch_size, num_nodes, 3]
505 |         node_mask (torch.Tensor): Mask for points, shape [batch_size, num_nodes], 1 for real node, 0 otherwise
506 | 
507 |     Returns:
508 |         Tuple[torch.Tensor, float]: The standardised coords and the standard deviation of the original coords
509 |     """
510 | 
511 |     if node_mask is None:
512 |         node_mask = torch.ones_like(coords)[:, :, 0]
513 | 
514 |     coord_idxs = node_mask.nonzero()
515 |     real_coords = coords[coord_idxs[:, 0], coord_idxs[:, 1], :]
516 | 
517 |     variance = torch.var(real_coords, correction=0)
518 |     std_dev = torch.sqrt(variance)
519 | 
520 |     result = (coords / std_dev) * node_mask.unsqueeze(2)
521 |     return result, std_dev.item()
522 | 
523 | 
524 | def rotate(coords: torch.Tensor, rotation: Union[Rotation, TupleRot]):
525 |     """Rotate coordinates for a single molecule
526 | 
527 |     Args:
528 |         coords (torch.Tensor): Unbatched coordinate tensor, shape [num_atoms, 3]
529 |         rotation (Union[Rotation, Tuple[float, float, float]]): Can be either a scipy Rotation object or a tuple of
530 |                 rotation values in radians, (x, y, z). These are treated as extrinsic rotations. See the scipy docs
531 |                 (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.transform.Rotation.html) for info.
532 | 
533 |     Returns:
534 |         torch.Tensor: Rotated coordinates
535 |     """
536 | 
537 |     if not isinstance(rotation, Rotation):
538 |         rotation = Rotation.from_euler("xyz", rotation)
539 | 
540 |     device = coords.device
541 |     coords = coords.cpu().numpy()
542 | 
543 |     rotated = rotation.apply(coords)
544 |     return torch.tensor(rotated, device=device)
545 | 
546 | 
547 | def cartesian_to_spherical(coords):
548 |     sqrd_dists = (coords * coords).sum(dim=-1)
549 |     radii = torch.sqrt(sqrd_dists)
550 |     inclination = torch.acos(coords[..., 2] / radii).unsqueeze(2)
551 |     azimuth = torch.atan2(coords[..., 1], coords[..., 0]).unsqueeze(2)
552 |     return torch.cat((radii.unsqueeze(2), inclination, azimuth), dim=-1)
553 | 
554 | 
555 | # *************************************************************************************************
556 | # ************************************** Util Classes *********************************************
557 | # *************************************************************************************************
558 | 
559 | 
560 | class SparseFeatures:
561 |     def __init__(self, dense, idxs):
562 |         assert len(dense.size()) == 3
563 |         assert dense.size() == idxs.size()
564 | 
565 |         batch_size, num_nodes, num_feats = dense.size()
566 | 
567 |         self.bs = batch_size
568 |         self.num_nodes = num_nodes
569 |         self.num_feats = num_feats
570 | 
571 |         self._dense = dense
572 |         self._idxs = idxs
573 | 
574 |     @staticmethod
575 |     def from_sparse(sparse_feats, adj_matrix, feats_per_node):
576 |         err_msg = "adj_matrix must have feats_per_node ones in each row"
577 |         assert sparse_feats.size() == adj_matrix.size(), "sparse_feats and adj_matrix must have the same shape"
578 |         assert adj_matrix.size()[1] == adj_matrix.size()[2], "adj_matrix must be square"
579 |         assert (adj_matrix.sum(dim=2) == feats_per_node).all().item(), err_msg
580 | 
581 |         batch_size, num_nodes, _ = adj_matrix.size()
582 |         feat_idxs = adj_matrix.nonzero()[:, 2].reshape((batch_size, num_nodes, feats_per_node))
583 |         dense_feats = torch.gather(sparse_feats, 2, feat_idxs)
584 |         return SparseFeatures(dense_feats, feat_idxs)
585 | 
586 |     @staticmethod
587 |     def from_dense(dense_feats, idxs):
588 |         return SparseFeatures(dense_feats, idxs)
589 | 
590 |     def to_tensor(self):
591 |         sparse_matrix = torch.zeros((self.bs, self.num_nodes, self.num_nodes), device=self._dense.device)
592 |         sparse_matrix.scatter_(2, self._idxs, self._dense)
593 |         return sparse_matrix
594 | 
595 |     def mult(self, other):
596 |         if isinstance(other, (int, float)):
597 |             return self.from_dense(self._dense * other, self._idxs)
598 | 
599 |         if not torch.is_tensor(other):
600 |             raise TypeError("Object to multiply by must be an int, float or torch.Tensor")
601 | 
602 |         assert other.size() == (self.bs, self.num_nodes, self.num_nodes)
603 | 
604 |         other_dense = torch.gather(other, 2, self._idxs)
605 |         return self.from_dense(self._dense * other_dense, self._idxs)
606 | 
607 |     def matmul(self, other):
608 |         if not torch.is_tensor(other):
609 |             raise TypeError("Object to multiply by must be a torch.Tensor")
610 | 
611 |         assert tuple(other.size()[:2]) == (self.bs, self.num_nodes)
612 | 
613 |         # There doesn't seem to be an efficient implementation of sparse batched matmul available atm, so just do
614 |         # regular matmul instead. We will still get some speed benefit from having lots of zeros.
615 |         tensor = self.to_tensor()
616 |         return torch.bmm(tensor, other)
617 | 
618 |     def softmax(self):
619 |         dense_softmax = torch.softmax(self._dense, dim=2)
620 |         return self.from_dense(dense_softmax, self._idxs)
621 | 
622 |     def dropout(self, p, train=False):
623 |         dense_dropout = torch.dropout(self._dense, p, train=train)
624 |         return self.from_dense(dense_dropout, self._idxs)
625 | 
626 |     def add(self, other):
627 |         """Add a matrix only at elements which are not sparse in self"""
628 | 
629 |         assert len(other.size()) == 3
630 | 
631 |         other_dense = torch.gather(other, 2, self._idxs)
632 |         return self.from_dense(self._dense + other_dense, self._idxs)
633 | 
634 |     def sum(self, dim=None):
635 |         if dim == 1:
636 |             return self.to_tensor().sum(dim=1)
637 | 
638 |         return self._dense.sum(dim=dim)
639 | 


--------------------------------------------------------------------------------
/semlaflow/util/metrics.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | from concurrent.futures import ProcessPoolExecutor
  3 | 
  4 | import torch
  5 | from rdkit import Chem
  6 | from torchmetrics import Metric
  7 | 
  8 | import semlaflow.util.rdkit as smolRD
  9 | 
 10 | ALLOWED_VALENCIES = {
 11 |     "H": {0: 1, 1: 0, -1: 0},
 12 |     "C": {0: [3, 4], 1: 3, -1: 3},
 13 |     "N": {0: [2, 3], 1: [2, 3, 4], -1: 2},  # In QM9, N+ seems to be present in the form NH+ and NH2+
 14 |     "O": {0: 2, 1: 3, -1: 1},
 15 |     "F": {0: 1, -1: 0},
 16 |     "B": 3,
 17 |     "Al": 3,
 18 |     "Si": 4,
 19 |     "P": {0: [3, 5], 1: 4},
 20 |     "S": {0: [2, 6], 1: [2, 3], 2: 4, 3: 5, -1: 3},
 21 |     "Cl": 1,
 22 |     "As": 3,
 23 |     "Br": {0: 1, 1: 2},
 24 |     "I": 1,
 25 |     "Hg": [1, 2],
 26 |     "Bi": [3, 5],
 27 |     "Se": [2, 4, 6],
 28 | }
 29 | 
 30 | 
 31 | def calc_atom_stabilities(mol):
 32 |     stabilities = []
 33 | 
 34 |     for atom in mol.GetAtoms():
 35 |         atom_type = atom.GetSymbol()
 36 |         valence = atom.GetExplicitValence()
 37 |         charge = atom.GetFormalCharge()
 38 | 
 39 |         if atom_type not in ALLOWED_VALENCIES:
 40 |             stabilities.append(False)
 41 |             continue
 42 | 
 43 |         allowed = ALLOWED_VALENCIES[atom_type]
 44 |         atom_stable = _is_valid_valence(valence, allowed, charge)
 45 |         stabilities.append(atom_stable)
 46 | 
 47 |     return stabilities
 48 | 
 49 | 
 50 | def _is_valid_valence(valence, allowed, charge):
 51 |     if isinstance(allowed, int):
 52 |         valid = allowed == valence
 53 | 
 54 |     elif isinstance(allowed, list):
 55 |         valid = valence in allowed
 56 | 
 57 |     elif isinstance(allowed, dict):
 58 |         allowed = allowed.get(charge)
 59 |         if allowed is None:
 60 |             return False
 61 | 
 62 |         valid = _is_valid_valence(valence, allowed, charge)
 63 | 
 64 |     return valid
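
The valence table mixes three shapes of entry: a single int, a list of ints, and a dict keyed by formal charge. The sketch below shows how such a lookup resolves. It uses a hypothetical three-element subset of the table and a hypothetical, non-recursive helper rather than the repository's function.

```python
# Hypothetical subset of the valence table, covering all three entry shapes
ALLOWED = {
    "N": {0: [2, 3], 1: [2, 3, 4], -1: 2},  # dict keyed by formal charge
    "Cl": 1,                                # single allowed valence
    "Hg": [1, 2],                           # list of allowed valences
}


def is_valid_valence(valence, allowed, charge):
    """Hypothetical helper: resolve the charge level first, then compare."""
    if isinstance(allowed, dict):
        allowed = allowed.get(charge)
        if allowed is None:
            # No entry for this charge state means the atom is not stable
            return False

    if isinstance(allowed, list):
        return valence in allowed

    return valence == allowed


print(is_valid_valence(4, ALLOWED["N"], 1))   # N+ with four bonds
print(is_valid_valence(4, ALLOWED["N"], 0))   # neutral N with four bonds
print(is_valid_valence(2, ALLOWED["N"], 2))   # charge state missing from the table
```

A charge that is missing from an element's dict is treated as unstable, which is stricter than falling back to the neutral entry.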
 65 | 
 66 | 
 67 | def _is_valid_float(num):
 68 |     return num is not None and num == num and abs(num) != float("inf")  # num == num is False for NaN
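
One pitfall when screening energies is that NaN cannot be detected with a membership test, because NaN never compares equal to anything, including another NaN. The sketch below demonstrates this and shows one possible alternative based on math.isfinite; the helper name is hypothetical.

```python
import math


def is_finite_number(num):
    """Hypothetical helper: True only for values that are real and finite."""
    return num is not None and math.isfinite(num)


nan = float("nan")

# A freshly created NaN is a different object from the NaN in the list and
# also compares unequal to it, so the membership test does not find it
found_by_membership = nan in [None, float("inf"), float("-inf"), float("nan")]

print(found_by_membership)
print(is_finite_number(nan), is_finite_number(float("inf")), is_finite_number(1.5))
```

math.isfinite rejects NaN and both infinities in one call, so only the None case needs separate handling.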
 69 | 
 70 | 
 71 | class GenerativeMetric(Metric):
 72 |     # TODO add metric attributes - see torchmetrics doc
 73 | 
 74 |     def __init__(self, **kwargs):
 75 |         # Pass extra kwargs (defined in Metric class) to parent
 76 |         super().__init__(**kwargs)
 77 | 
 78 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
 79 |         raise NotImplementedError()
 80 | 
 81 |     def compute(self) -> torch.Tensor:
 82 |         raise NotImplementedError()
 83 | 
 84 | 
 85 | class PairMetric(Metric):
 86 |     def __init__(self, **kwargs):
 87 |         # Pass extra kwargs (defined in Metric class) to parent
 88 |         super().__init__(**kwargs)
 89 | 
 90 |     def update(self, predicted: list[Chem.rdchem.Mol], actual: list[Chem.rdchem.Mol]) -> None:
 91 |         raise NotImplementedError()
 92 | 
 93 |     def compute(self) -> torch.Tensor:
 94 |         raise NotImplementedError()
 95 | 
 96 | 
 97 | class AtomStability(Metric):
 98 |     def __init__(self, **kwargs):
 99 |         super().__init__(**kwargs)
100 | 
101 |         self.add_state("atom_stable", default=torch.tensor(0), dist_reduce_fx="sum")
102 |         self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
103 | 
104 |     def update(self, stabilities: list[list[bool]]) -> None:
105 |         all_atom_stables = [atom_stable for atom_stbs in stabilities for atom_stable in atom_stbs]
106 |         self.atom_stable += sum(all_atom_stables)
107 |         self.total += len(all_atom_stables)
108 | 
109 |     def compute(self) -> torch.Tensor:
110 |         return self.atom_stable.float() / self.total
111 | 
112 | 
113 | class MoleculeStability(Metric):
114 |     def __init__(self, **kwargs):
115 |         super().__init__(**kwargs)
116 | 
117 |         self.add_state("mol_stable", default=torch.tensor(0), dist_reduce_fx="sum")
118 |         self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
119 | 
120 |     def update(self, stabilities: list[list[bool]]) -> None:
121 |         mol_stables = [sum(atom_stbs) == len(atom_stbs) for atom_stbs in stabilities]
122 |         self.mol_stable += sum(mol_stables)
123 |         self.total += len(mol_stables)
124 | 
125 |     def compute(self) -> torch.Tensor:
126 |         return self.mol_stable.float() / self.total
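
Atom stability and molecule stability are computed from the same nested list of per-atom flags but aggregate it differently. The plain-Python sketch below reproduces both ratios without torchmetrics; the input flags are hypothetical.

```python
# Hypothetical per-atom stability flags for two molecules
stabilities = [[True, True], [True, False, True]]

# Atom stability: fraction of stable atoms over all atoms in all molecules
all_atoms = [flag for mol_flags in stabilities for flag in mol_flags]
atom_stability = sum(all_atoms) / len(all_atoms)

# Molecule stability: fraction of molecules in which every atom is stable
mol_flags = [all(flags) for flags in stabilities]
mol_stability = sum(mol_flags) / len(mol_flags)

print(atom_stability, mol_stability)
```

A single unstable atom costs one atom in the first ratio but a whole molecule in the second, so molecule stability can never exceed atom stability when every molecule has the same number of atoms, and is usually the stricter number in practice.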
127 | 
128 | 
129 | class Validity(GenerativeMetric):
130 |     def __init__(self, connected=False, **kwargs):
131 |         super().__init__(**kwargs)
132 |         self.connected = connected
133 | 
134 |         self.add_state("valid", default=torch.tensor(0), dist_reduce_fx="sum")
135 |         self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
136 | 
137 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
138 |         is_valid = [smolRD.mol_is_valid(mol, connected=self.connected) for mol in mols if mol is not None]
139 |         self.valid += sum(is_valid)
140 |         self.total += len(mols)
141 | 
142 |     def compute(self) -> torch.Tensor:
143 |         return self.valid.float() / self.total
144 | 
145 | 
146 | # TODO I don't think this will work with DDP
147 | class Uniqueness(GenerativeMetric):
148 |     """Note: only tracks uniqueness of molecules which can be converted into SMILES"""
149 | 
150 |     def __init__(self, **kwargs):
151 |         super().__init__(**kwargs)
152 |         self.valid_smiles = []
153 | 
154 |     def reset(self):
155 |         self.valid_smiles = []
156 | 
157 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
158 |         smiles = [smolRD.smiles_from_mol(mol, canonical=True) for mol in mols if mol is not None]
159 |         valid_smiles = [smi for smi in smiles if smi is not None]
160 |         self.valid_smiles.extend(valid_smiles)
161 | 
162 |     def compute(self) -> torch.Tensor:
163 |         num_unique = len(set(self.valid_smiles))
164 |         uniqueness = torch.tensor(num_unique) / len(self.valid_smiles)
165 |         return uniqueness
166 | 
167 | 
168 | class Novelty(GenerativeMetric):
169 |     def __init__(self, existing_mols: list[Chem.rdchem.Mol], **kwargs):
170 |         super().__init__(**kwargs)
171 | 
172 |         n_workers = min(8, len(os.sched_getaffinity(0)))
173 |         executor = ProcessPoolExecutor(max_workers=n_workers)
174 | 
175 |         futures = [executor.submit(smolRD.smiles_from_mol, mol, canonical=True) for mol in existing_mols]
176 |         smiles = [future.result() for future in futures]
177 |         smiles = [smi for smi in smiles if smi is not None]
178 | 
179 |         executor.shutdown()
180 | 
181 |         self.smiles = set(smiles)
182 | 
183 |         self.add_state("novel", default=torch.tensor(0), dist_reduce_fx="sum")
184 |         self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
185 | 
186 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
187 |         smiles = [smolRD.smiles_from_mol(mol, canonical=True) for mol in mols if mol is not None]
188 |         valid_smiles = [smi for smi in smiles if smi is not None]
189 |         novel = [smi not in self.smiles for smi in valid_smiles]
190 | 
191 |         self.novel += sum(novel)
192 |         self.total += len(novel)
193 | 
194 |     def compute(self) -> torch.Tensor:
195 |         return self.novel.float() / self.total
196 | 
197 | 
198 | class EnergyValidity(GenerativeMetric):
199 |     def __init__(self, optimise=False, **kwargs):
200 |         super().__init__(**kwargs)
201 | 
202 |         self.optimise = optimise
203 | 
204 |         self.add_state("n_valid", default=torch.tensor(0), dist_reduce_fx="sum")
205 |         self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
206 | 
207 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
208 |         num_mols = len(mols)
209 | 
210 |         if self.optimise:
211 |             mols = [smolRD.optimise_mol(mol) for mol in mols if mol is not None]
212 | 
213 |         energies = [smolRD.calc_energy(mol) for mol in mols if mol is not None]
214 |         valid_energies = [energy for energy in energies if _is_valid_float(energy)]
215 | 
216 |         self.n_valid += len(valid_energies)
217 |         self.total += num_mols
218 | 
219 |     def compute(self) -> torch.Tensor:
220 |         return self.n_valid.float() / self.total
221 | 
222 | 
223 | class AverageEnergy(GenerativeMetric):
224 |     """Average energy for molecules for which energy can be calculated
225 | 
226 |     Note that the energy cannot be calculated for some molecules (specifically invalid ones) and the pose optimisation
227 |     is not guaranteed to succeed. Molecules for which the energy cannot be calculated do not count towards the metric.
228 | 
229 |     This metric doesn't require that input molecules have been sanitised by RDKit, however, it is usually a good idea
230 |     to do this anyway to ensure that all of the required molecular and atom properties are calculated and stored.
231 |     """
232 | 
233 |     def __init__(self, optimise=False, per_atom=False, **kwargs):
234 |         super().__init__(**kwargs)
235 | 
236 |         self.optimise = optimise
237 |         self.per_atom = per_atom
238 | 
239 |         self.add_state("energy", default=torch.tensor(0.0), dist_reduce_fx="sum")
240 |         self.add_state("n_valid_energies", default=torch.tensor(0), dist_reduce_fx="sum")
241 | 
242 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
243 |         if self.optimise:
244 |             mols = [smolRD.optimise_mol(mol) for mol in mols if mol is not None]
245 | 
246 |         energies = [smolRD.calc_energy(mol, per_atom=self.per_atom) for mol in mols if mol is not None]
247 |         valid_energies = [energy for energy in energies if _is_valid_float(energy)]
248 | 
249 |         self.energy += sum(valid_energies)
250 |         self.n_valid_energies += len(valid_energies)
251 | 
252 |     def compute(self) -> torch.Tensor:
253 |         return self.energy / self.n_valid_energies
254 | 
255 | 
256 | class AverageStrainEnergy(GenerativeMetric):
257 |     """
258 |     The strain energy is the energy difference between a molecule's pose and its optimised pose. Estimated using RDKit.
259 |     Only calculated when all of the following are true:
260 |     1. The molecule is valid and an energy can be calculated
261 |     2. The pose optimisation succeeds
262 |     3. The energy can be calculated for the optimised pose
263 | 
264 |     Note that molecules which do not meet these criteria will not count towards the metric and can therefore give
265 |     unexpected results. Use the EnergyValidity metric with the optimise flag set to True to track the proportion of
266 |     molecules for which this metric can be calculated.
267 | 
268 |     This metric doesn't require that input molecules have been sanitised by RDKit, however, it is usually a good idea
269 |     to do this anyway to ensure that all of the required molecular and atom properties are calculated and stored.
270 |     """
271 | 
272 |     def __init__(self, per_atom=False, **kwargs):
273 |         super().__init__(**kwargs)
274 | 
275 |         self.per_atom = per_atom
276 | 
277 |         self.add_state("total_energy_diff", default=torch.tensor(0.0), dist_reduce_fx="sum")
278 |         self.add_state("n_valid", default=torch.tensor(0), dist_reduce_fx="sum")
279 | 
280 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
281 |         opt_mols = [(idx, smolRD.optimise_mol(mol)) for idx, mol in enumerate(mols) if mol is not None]
282 |         energies = [(idx, smolRD.calc_energy(mol, per_atom=self.per_atom)) for idx, mol in opt_mols if mol is not None]
283 |         valids = [(idx, energy) for idx, energy in energies if energy is not None]
284 | 
285 |         if len(valids) == 0:
286 |             return
287 | 
288 |         valid_indices, valid_energies = tuple(zip(*valids))
289 |         original_energies = [smolRD.calc_energy(mols[idx], per_atom=self.per_atom) for idx in valid_indices]
290 |         energy_diffs = [orig - opt for orig, opt in zip(original_energies, valid_energies) if orig is not None]
291 | 
292 |         self.total_energy_diff += sum(energy_diffs)
293 |         self.n_valid += len(energy_diffs)
294 | 
295 |     def compute(self) -> torch.Tensor:
296 |         return self.total_energy_diff / self.n_valid
297 | 
298 | 
299 | class AverageOptRmsd(GenerativeMetric):
300 |     """
301 |     Average RMSD between a molecule and its optimised pose. Only calculated when all of the following are true:
302 |     1. The molecule is valid
303 |     2. The pose optimisation succeeds
304 | 
305 |     Note that molecules which do not meet these criteria will not count towards the metric and can therefore give
306 |     unexpected results.
307 |     """
308 | 
309 |     def __init__(self, **kwargs):
310 |         super().__init__(**kwargs)
311 | 
312 |         self.add_state("total_rmsd", default=torch.tensor(0.0), dist_reduce_fx="sum")
313 |         self.add_state("n_valid", default=torch.tensor(0), dist_reduce_fx="sum")
314 | 
315 |     def update(self, mols: list[Chem.rdchem.Mol]) -> None:
316 |         valids = [(idx, smolRD.optimise_mol(mol)) for idx, mol in enumerate(mols) if mol is not None]
317 |         valids = [(idx, mol) for idx, mol in valids if mol is not None]
318 | 
319 |         if len(valids) == 0:
320 |             return
321 | 
322 |         valid_indices, opt_mols = tuple(zip(*valids))
323 |         original_mols = [mols[idx] for idx in valid_indices]
324 |         rmsds = [smolRD.conf_distance(mol1, mol2) for mol1, mol2 in zip(original_mols, opt_mols)]
325 | 
326 |         self.total_rmsd += sum(rmsds)
327 |         self.n_valid += len(rmsds)
328 | 
329 |     def compute(self) -> torch.Tensor:
330 |         return self.total_rmsd / self.n_valid
331 | 
332 | 
333 | class MolecularAccuracy(PairMetric):
334 |     def __init__(self, **kwargs):
335 |         super().__init__(**kwargs)
336 | 
337 |         self.add_state("n_correct", default=torch.tensor(0), dist_reduce_fx="sum")
338 |         self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
339 | 
340 |     def update(self, predicted: list[Chem.rdchem.Mol], actual: list[Chem.rdchem.Mol]) -> None:
341 |         predicted_smiles = [smolRD.smiles_from_mol(pred, canonical=True) for pred in predicted]
342 |         actual_smiles = [smolRD.smiles_from_mol(act, canonical=True) for act in actual]
343 |         matches = [pred == act for pred, act in zip(predicted_smiles, actual_smiles) if act is not None]
344 | 
345 |         self.n_correct += sum(matches)
346 |         self.total += len(matches)
347 | 
348 |     def compute(self) -> torch.Tensor:
349 |         return self.n_correct.float() / self.total
350 | 
351 | 
352 | class MolecularPairRMSD(PairMetric):
353 |     def __init__(self, **kwargs):
354 |         super().__init__(**kwargs)
355 | 
356 |         self.add_state("total_rmsd", default=torch.tensor(0.0), dist_reduce_fx="sum")
357 |         self.add_state("n_valid", default=torch.tensor(0), dist_reduce_fx="sum")
358 | 
359 |     def update(self, predicted: list[Chem.rdchem.Mol], actual: list[Chem.rdchem.Mol]) -> None:
360 |         valid_pairs = [(pred, act) for pred, act in zip(predicted, actual) if pred is not None and act is not None]
361 |         rmsds = [smolRD.conf_distance(pred, act) for pred, act in valid_pairs]
362 |         rmsds = [rmsd for rmsd in rmsds if rmsd is not None]
363 | 
364 |         self.total_rmsd += sum(rmsds)
365 |         self.n_valid += len(rmsds)
366 | 
367 |     def compute(self) -> torch.Tensor:
368 |         return self.total_rmsd / self.n_valid
369 | 
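The metrics in this file share one accumulation pattern: `update` adds per-batch counts to running state, and `compute` divides at the end. The following is a minimal sketch of that pattern for novelty, not the repository's implementation. `NoveltyCounter` is a hypothetical name, plain Python ints stand in for the torch state tensors, and SMILES strings are passed in directly rather than being derived from RDKit molecules.

```python
from typing import Optional


class NoveltyCounter:
    """Hypothetical stand-in for the Novelty metric, using plain ints as state."""

    def __init__(self, existing_smiles: list[str]):
        # Set membership makes each novelty check O(1) on average
        self.known = set(existing_smiles)
        self.novel = 0
        self.total = 0

    def update(self, smiles: list[Optional[str]]) -> None:
        # Entries which are None (failed SMILES generation) do not count towards the total
        valid = [smi for smi in smiles if smi is not None]
        self.novel += sum(smi not in self.known for smi in valid)
        self.total += len(valid)

    def compute(self) -> float:
        # Assumption: return NaN explicitly when nothing valid has been seen
        return self.novel / self.total if self.total > 0 else float("nan")


counter = NoveltyCounter(["CCO", "c1ccccc1"])
counter.update(["CCO", "CCN", None])
counter.update(["CC(=O)O"])
novelty = counter.compute()
```

Because the state is just two running sums, batches can be processed in any order, and per-process counts can be summed before dividing. The `dist_reduce_fx="sum"` arguments in the classes above suggest the same approach.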


--------------------------------------------------------------------------------
/semlaflow/util/rdkit.py:
--------------------------------------------------------------------------------
  1 | import threading
  2 | from typing import Optional, Union
  3 | 
  4 | import numpy as np
  5 | from openbabel import pybel
  6 | from rdkit import Chem
  7 | from rdkit.Chem import AllChem
  8 | from scipy.spatial.transform import Rotation
  9 | 
 10 | ArrT = np.ndarray
 11 | 
 12 | 
 13 | # *************************************************************************************************
 14 | # ************************************ Periodic Table class ***************************************
 15 | # *************************************************************************************************
 16 | 
 17 | 
 18 | class PeriodicTable:
 19 |     """Singleton class wrapper for the RDKit periodic table providing a neater interface"""
 20 | 
 21 |     _instance = None
 22 | 
 23 |     def __new__(cls):
 24 |         if cls._instance is None:
 25 |             cls._instance = super().__new__(cls)
 26 | 
 27 |         return cls._instance
 28 | 
 29 |     def __init__(self):
 30 |         self._table = Chem.GetPeriodicTable()
 31 | 
 32 |         # Just to be certain that periodic table lookups are thread safe
 33 |         self._pt_lock = threading.Lock()
 34 | 
 35 |     def atomic_from_symbol(self, symbol: str) -> int:
 36 |         with self._pt_lock:
 37 |             symbol = symbol.upper() if len(symbol) == 1 else symbol
 38 |             atomic = self._table.GetAtomicNumber(symbol)
 39 | 
 40 |         return atomic
 41 | 
 42 |     def symbol_from_atomic(self, atomic_num: int) -> str:
 43 |         with self._pt_lock:
 44 |             token = self._table.GetElementSymbol(atomic_num)
 45 | 
 46 |         return token
 47 | 
 48 |     def valence(self, atom: Union[str, int]) -> int:
 49 |         with self._pt_lock:
 50 |             valence = self._table.GetDefaultValence(atom)
 51 | 
 52 |         return valence
 53 | 
 54 | 
 55 | # *************************************************************************************************
 56 | # ************************************* Global Declarations ***************************************
 57 | # *************************************************************************************************
 58 | 
 59 | 
 60 | PT = PeriodicTable()
 61 | 
 62 | IDX_BOND_MAP = {1: Chem.BondType.SINGLE, 2: Chem.BondType.DOUBLE, 3: Chem.BondType.TRIPLE, 4: Chem.BondType.AROMATIC}
 63 | BOND_IDX_MAP = {bond: idx for idx, bond in IDX_BOND_MAP.items()}
 64 | 
 65 | IDX_CHARGE_MAP = {0: 0, 1: 1, 2: 2, 3: 3, 4: -1, 5: -2, 6: -3}
 66 | CHARGE_IDX_MAP = {charge: idx for idx, charge in IDX_CHARGE_MAP.items()}
 67 | 
 68 | 
 69 | # *************************************************************************************************
 70 | # *************************************** Util Functions ******************************************
 71 | # *************************************************************************************************
 72 | 
 73 | # TODO merge these with check functions in other files
 74 | 
 75 | 
 76 | def _check_shape_len(arr, allowed, name="object"):
 77 |     num_dims = len(arr.shape)
 78 |     allowed = [allowed] if isinstance(allowed, int) else allowed
 79 |     if num_dims not in allowed:
 80 |         raise RuntimeError(f"Number of dimensions of {name} must be in {str(allowed)}, got {num_dims}")
 81 | 
 82 | 
 83 | def _check_dim_shape(arr, dim, allowed, name="object"):
 84 |     shape = arr.shape[dim]
 85 |     allowed = [allowed] if isinstance(allowed, int) else allowed
 86 |     if shape not in allowed:
 87 |         raise RuntimeError(f"Shape of {name} for dim {dim} must be in {allowed}, got {shape}")
 88 | 
 89 | 
 90 | # *************************************************************************************************
 91 | # ************************************* External Functions ****************************************
 92 | # *************************************************************************************************
 93 | 
 94 | 
 95 | def mol_is_valid(mol: Chem.rdchem.Mol, with_hs: bool = True, connected: bool = True) -> bool:
 96 |     """Whether the mol can be sanitised and, optionally, whether it's fully connected
 97 | 
 98 |     Args:
 99 |         mol (Chem.Mol): RDKit molecule to check
100 |         with_hs (bool): Whether to check validity including hydrogens (if they are in the input mol), default True
101 |         connected (bool): Whether to also assert that the mol must not have disconnected atoms, default True
102 | 
103 |     Returns:
104 |         bool: Whether the mol is valid
105 |     """
106 | 
107 |     if mol is None:
108 |         return False
109 | 
110 |     mol_copy = Chem.Mol(mol)
111 |     if not with_hs:
112 |         mol_copy = Chem.RemoveAllHs(mol_copy)
113 | 
114 |     try:
115 |         AllChem.SanitizeMol(mol_copy)
116 |     except Exception:
117 |         return False
118 | 
119 |     n_frags = len(AllChem.GetMolFrags(mol_copy))
120 |     if connected and n_frags != 1:
121 |         return False
122 | 
123 |     return True
124 | 
125 | 
126 | def calc_energy(mol: Chem.rdchem.Mol, per_atom: bool = False) -> Optional[float]:
127 |     """Calculate the energy for an RDKit molecule using the MMFF forcefield
128 | 
129 |     The energy is only calculated for the first (0th index) conformer within the molecule. The molecule is copied so
130 |     the original is not modified.
131 | 
132 |     Args:
133 |         mol (Chem.Mol): RDKit molecule
134 |         per_atom (bool): Whether to normalise by number of atoms in mol, default False
135 | 
136 |     Returns:
137 |         float: Energy of the molecule or None if the energy could not be calculated
138 |     """
139 | 
140 |     mol_copy = Chem.Mol(mol)
141 | 
142 |     try:
143 |         mmff_props = AllChem.MMFFGetMoleculeProperties(mol_copy, mmffVariant="MMFF94")
144 |         ff = AllChem.MMFFGetMoleculeForceField(mol_copy, mmff_props, confId=0)
145 |         energy = ff.CalcEnergy()
146 |         energy = energy / mol.GetNumAtoms() if per_atom else energy
147 |     except Exception:
148 |         energy = None
149 | 
150 |     return energy
151 | 
152 | 
153 | def optimise_mol(mol: Chem.rdchem.Mol, max_iters: int = 1000) -> Optional[Chem.rdchem.Mol]:
154 |     """Optimise the conformation of an RDKit molecule
155 | 
156 |     Only the first (0th index) conformer within the molecule is optimised. The molecule is copied so the original
157 |     is not modified.
158 | 
159 |     Args:
160 |         mol (Chem.Mol): RDKit molecule
161 |         max_iters (int): Max iterations for the conformer optimisation algorithm
162 | 
163 |     Returns:
164 |         Chem.Mol: Optimised molecule, or None if the optimiser raised an exception. The MMFF return code is not
165 |                 checked, so a molecule which has not converged within max_iters is still returned
166 |     """
167 | 
168 |     mol_copy = Chem.Mol(mol)
169 |     try:
170 |         AllChem.MMFFOptimizeMolecule(mol_copy, maxIters=max_iters)
171 |     except Exception:
172 |         return None
173 | 
174 |     return mol_copy
175 | 
176 | 
177 | def conf_distance(mol1: Chem.rdchem.Mol, mol2: Chem.rdchem.Mol, fix_order: bool = True) -> float:
178 |     """Approximately align two molecules and then calculate RMSD between them
179 | 
180 |     Alignment and distance are calculated only between the default conformers of each molecule.
181 | 
182 |     Args:
183 |         mol1 (Chem.Mol): First molecule to align
184 |         mol2 (Chem.Mol): Second molecule to align
185 |         fix_order (bool): Whether to fix the atom order of the molecules
186 | 
187 |     Returns:
188 |         float: RMSD between molecules after approximate alignment
189 |     """
190 | 
191 |     assert len(mol1.GetAtoms()) == len(mol2.GetAtoms())
192 | 
193 |     if not fix_order:
194 |         raise NotImplementedError()
195 | 
196 |     coords1 = np.array(mol1.GetConformer().GetPositions())
197 |     coords2 = np.array(mol2.GetConformer().GetPositions())
198 | 
199 |     # Firstly, centre both molecules
200 |     coords1 = coords1 - (coords1.sum(axis=0) / coords1.shape[0])
201 |     coords2 = coords2 - (coords2.sum(axis=0) / coords2.shape[0])
202 | 
203 |     # Find the best rotation alignment between the centred mols
204 |     rotation, _ = Rotation.align_vectors(coords1, coords2)
205 |     aligned_coords2 = rotation.apply(coords2)
206 | 
207 |     sqrd_dists = (coords1 - aligned_coords2) ** 2
208 |     rmsd = np.sqrt(sqrd_dists.sum(axis=1).mean())
209 |     return rmsd
210 | 
211 | 
212 | # TODO could allow more args
213 | def smiles_from_mol(mol: Chem.rdchem.Mol, canonical: bool = True, explicit_hs: bool = False) -> Union[str, None]:
214 |     """Create a SMILES string from a molecule
215 | 
216 |     Args:
217 |         mol (Chem.Mol): RDKit molecule object
218 |         canonical (bool): Whether to create a canonical SMILES, default True
219 |         explicit_hs (bool): Whether to embed hydrogens in the mol before creating a SMILES, default False. If True
220 |                 this will create a new mol with all hydrogens embedded. Note that the SMILES created by doing this
221 |                 is not necessarily the same as creating a SMILES showing implicit hydrogens.
222 | 
223 |     Returns:
224 |         str: SMILES string which could be None if the SMILES generation failed
225 |     """
226 | 
227 |     if mol is None:
228 |         return None
229 | 
230 |     if explicit_hs:
231 |         mol = Chem.AddHs(mol)
232 | 
233 |     try:
234 |         smiles = Chem.MolToSmiles(mol, canonical=canonical)
235 |     except Exception:
236 |         smiles = None
237 | 
238 |     return smiles
239 | 
240 | 
241 | def mol_from_smiles(smiles: str, explicit_hs: bool = False) -> Union[Chem.rdchem.Mol, None]:
242 |     """Create a RDKit molecule from a SMILES string
243 | 
244 |     Args:
245 |         smiles (str): SMILES string
246 |         explicit_hs (bool): Whether to embed explicit hydrogens into the mol
247 | 
248 |     Returns:
249 |         Chem.Mol: RDKit molecule object or None if one cannot be created from the SMILES
250 |     """
251 | 
252 |     if smiles is None:
253 |         return None
254 | 
255 |     try:
256 |         mol = Chem.MolFromSmiles(smiles)
257 |         mol = Chem.AddHs(mol) if explicit_hs else mol
258 |     except Exception:
259 |         mol = None
260 | 
261 |     return mol
262 | 
263 | 
264 | def mol_from_atoms(
265 |     coords: ArrT, tokens: list[str], bonds: Optional[ArrT] = None, charges: Optional[ArrT] = None, sanitise=True
266 | ):
267 |     """Create RDKit mol from atom coords and atom tokens (and optionally bonds)
268 | 
269 |     If any of the atom tokens are not valid atoms (do not exist on the periodic table), None will be returned.
270 | 
271 |     If bonds are not provided this function will create a partial molecule using the atomics and coordinates and then
272 |     infer the bonds based on the coordinates using OpenBabel. Otherwise the bonds are added to the molecule as they
273 |     are given in the bond array.
274 | 
275 |     If bonds are provided they must not contain any duplicates.
276 | 
277 |     If charges are not provided they are assumed to be 0 for all atoms.
278 | 
279 |     Args:
280 |         coords (np.ndarray): Coordinate tensor, shape [n_atoms, 3]
281 |         tokens (list[str]): Atom symbols (e.g. "C" or "Cl"), length must be n_atoms
282 |         bonds (np.ndarray, optional): Bond indices and types, shape [n_bonds, 3]
283 |         charges (np.ndarray, optional): Charge for each atom, shape [n_atoms]
284 |         sanitise (bool): Whether to apply RDKit sanitization to the molecule, default True
285 | 
286 |     Returns:
287 |         Chem.rdchem.Mol: RDKit molecule or None if one cannot be created
288 |     """
289 | 
290 |     _check_shape_len(coords, 2, "coords")
291 |     _check_dim_shape(coords, 1, 3, "coords")
292 | 
293 |     if coords.shape[0] != len(tokens):
294 |         raise ValueError("coords and tokens must have the same number of atoms.")
295 | 
296 |     if bonds is not None:
297 |         _check_shape_len(bonds, 2, "bonds")
298 |         _check_dim_shape(bonds, 1, 3, "bonds")
299 | 
300 |     if charges is not None:
301 |         _check_shape_len(charges, 1, "charges")
302 |         _check_dim_shape(charges, 0, len(tokens), "charges")
303 | 
304 |     try:
305 |         atomics = [PT.atomic_from_symbol(token) for token in tokens]
306 |     except Exception:
307 |         return None
308 | 
309 |     charges = charges.tolist() if charges is not None else [0] * len(tokens)
310 | 
311 |     # Add atom types and charges
312 |     mol = Chem.EditableMol(Chem.Mol())
313 |     for idx, atomic in enumerate(atomics):
314 |         atom = Chem.Atom(atomic)
315 |         atom.SetFormalCharge(charges[idx])
316 |         mol.AddAtom(atom)
317 | 
318 |     # Add 3D coords
319 |     conf = Chem.Conformer(coords.shape[0])
320 |     for idx, coord in enumerate(coords.tolist()):
321 |         conf.SetAtomPosition(idx, coord)
322 | 
323 |     mol = mol.GetMol()
324 |     mol.AddConformer(conf)
325 | 
326 |     if bonds is None:
327 |         return _infer_bonds(mol)
328 | 
329 |     # Add bonds if they have been provided
330 |     mol = Chem.EditableMol(mol)
331 |     for bond in bonds.astype(np.int32).tolist():
332 |         start, end, b_type = bond
333 | 
334 |         if b_type not in IDX_BOND_MAP:
335 |             return None
336 | 
337 |         # Don't add self connections
338 |         if start != end:
339 |             b_type = IDX_BOND_MAP[b_type]
340 |             mol.AddBond(start, end, b_type)
341 | 
342 |     try:
343 |         mol = mol.GetMol()
344 |         for atom in mol.GetAtoms():
345 |             atom.UpdatePropertyCache(strict=False)
346 |     except Exception:
347 |         return None
348 | 
349 |     if sanitise:
350 |         try:
351 |             Chem.SanitizeMol(mol)
352 |         except Exception:
353 |             return None
354 | 
355 |     return mol
356 | 
357 | 
358 | def _infer_bonds(mol: Chem.rdchem.Mol):
359 |     coords = mol.GetConformer().GetPositions().tolist()
360 |     coord_strs = ["\t".join([f"{c:.6f}" for c in cs]) for cs in coords]
361 |     atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
362 | 
363 |     xyz_str_header = f"{str(mol.GetNumAtoms())}\n\n"
364 |     xyz_strs = [f"{str(atom)}\t{coord_str}" for coord_str, atom in zip(coord_strs, atom_symbols)]
365 |     xyz_str = xyz_str_header + "\n".join(xyz_strs)
366 | 
367 |     try:
368 |         pybel_mol = pybel.readstring("xyz", xyz_str)
369 |     except Exception:
370 |         pybel_mol = None
371 | 
372 |     if pybel_mol is None:
373 |         return None
374 | 
375 |     mol_str = pybel_mol.write("mol")
376 |     mol = Chem.MolFromMolBlock(mol_str, removeHs=False, sanitize=True)
377 |     return mol
378 | 
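When no bonds are supplied, `_infer_bonds` serialises the atoms to an XYZ string and lets OpenBabel perceive the bonds from the geometry. The sketch below reproduces only the string-building step so the intermediate format is easy to inspect. `build_xyz_block` is a hypothetical helper that does not exist in the repository, and the water coordinates are illustrative values.

```python
def build_xyz_block(symbols: list[str], coords: list[list[float]]) -> str:
    """Hypothetical helper mirroring the XYZ string assembled inside _infer_bonds."""
    if len(symbols) != len(coords):
        raise ValueError("symbols and coords must have the same length")

    # XYZ layout: atom count, a comment line (left empty here), then one line per atom
    header = f"{len(symbols)}\n\n"

    atom_lines = []
    for symbol, xyz in zip(symbols, coords):
        coord_str = "\t".join(f"{c:.6f}" for c in xyz)
        atom_lines.append(f"{symbol}\t{coord_str}")

    return header + "\n".join(atom_lines)


# Illustrative geometry only; not taken from any dataset
water_xyz = build_xyz_block(
    ["O", "H", "H"],
    [[0.0, 0.0, 0.0], [0.9572, 0.0, 0.0], [-0.2400, 0.9266, 0.0]],
)
```

In the function above, the resulting string is passed to `pybel.readstring("xyz", ...)`. Since the XYZ format carries no bond or charge information, whatever bonding comes back is OpenBabel's inference from interatomic distances.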


--------------------------------------------------------------------------------
/semlaflow/util/tokeniser.py:
--------------------------------------------------------------------------------
  1 | from __future__ import annotations
  2 | 
  3 | import pickle
  4 | import threading
  5 | from abc import ABC, abstractmethod
  6 | from typing import Optional, Union
  7 | 
  8 | import semlaflow.util.functional as smolF
  9 | 
 10 | indicesT = Union[list[int], list[list[int]]]
 11 | 
 12 | 
 13 | PICKLE_PROTOCOL = 4
 14 | 
 15 | 
 16 | # *** Util functions ***
 17 | 
 18 | 
 19 | def _check_unique(obj_list, name="objects"):
 20 |     if len(obj_list) != len(set(obj_list)):
 21 |         raise RuntimeError(f"{name} cannot contain duplicates")
 22 | 
 23 | 
 24 | def _check_type_all(obj_list, exp_type, name="list"):
 25 |     for obj in obj_list:
 26 |         if not isinstance(obj, exp_type):
 27 |             raise TypeError(f"all objects in {name} must be instances of {exp_type}")
 28 | 
 29 | 
 30 | # *** Tokeniser Interface ***
 31 | 
 32 | 
 33 | class Tokeniser(ABC):
 34 |     """Interface for tokeniser classes"""
 35 | 
 36 |     @abstractmethod
 37 |     def tokenise(self, sentences: list[str]) -> Union[list[str], list[int]]:
 38 |         pass
 39 | 
 40 |     @classmethod
 41 |     @abstractmethod
 42 |     def from_vocabulary(cls, vocab: Vocabulary) -> Tokeniser:
 43 |         pass
 44 | 
 45 | 
 46 | # *** Tokeniser Implementations ***
 47 | 
 48 | # TODO
 49 | 
 50 | 
 51 | # *** Vocabulary Implementations ***
 52 | 
 53 | 
 54 | class Vocabulary:
 55 |     """Vocabulary class which maps tokens <--> indices"""
 56 | 
 57 |     def __init__(self, tokens: list[str]):
 58 |         _check_unique(tokens, "tokens list")
 59 | 
 60 |         token_idx_map = {token: idx for idx, token in enumerate(tokens)}
 61 |         idx_token_map = {idx: token for idx, token in enumerate(tokens)}
 62 | 
 63 |         self.token_idx_map = token_idx_map
 64 |         self.idx_token_map = idx_token_map
 65 | 
 66 |         # Just to be certain that vocab objects are thread safe
 67 |         self._vocab_lock = threading.Lock()
 68 | 
 69 |         # So that we can save this object without assuming the above dictionaries are ordered
 70 |         self._tokens = tokens
 71 | 
 72 |     @property
 73 |     def size(self) -> int:
 74 |         return len(self)
 75 | 
 76 |     def __len__(self) -> int:
 77 |         with self._vocab_lock:
 78 |             length = len(self.token_idx_map)
 79 | 
 80 |         return length
 81 | 
 82 |     def contains(self, token: str) -> bool:
 83 |         with self._vocab_lock:
 84 |             contains = token in self.token_idx_map
 85 | 
 86 |         return contains
 87 | 
 88 |     def tokens_from_indices(self, indices: list[int]) -> list[str]:
 89 |         _check_type_all(indices, int, "indices list")
 90 |         with self._vocab_lock:
 91 |             tokens = [self.idx_token_map[idx] for idx in indices]
 92 | 
 93 |         return tokens
 94 | 
 95 |     def indices_from_tokens(self, tokens: list[str], one_hot: Optional[bool] = False) -> indicesT:
 96 |         _check_type_all(tokens, str, "tokens list")
 97 | 
 98 |         with self._vocab_lock:
 99 |             indices = [self.token_idx_map[token] for token in tokens]
100 | 
101 |         if not one_hot:
102 |             return indices
103 | 
104 |         one_hots = smolF.one_hot_encode(indices, len(self)).tolist()
105 |         return one_hots
106 | 
107 |     def to_bytes(self) -> bytes:
108 |         with self._vocab_lock:
109 |             obj_bytes = pickle.dumps(self._tokens, protocol=PICKLE_PROTOCOL)
110 | 
111 |         return obj_bytes
112 | 
113 |     @staticmethod
114 |     def from_bytes(data: bytes) -> Vocabulary:
115 |         tokens = pickle.loads(data)
116 |         return Vocabulary(tokens)
117 | 
118 | 
119 | # class AtomVocabulary(Vocabulary):
120 | #     """Vocabulary which only uses atoms and allows converting from atomic numbers"""
121 | 
122 | #     # TODO add more atoms?
123 | #     ATOMIC_MAP = {
124 | #         1: "H",
125 | #         6: "C",
126 | #         7: "N",
127 | #         8: "O",
128 | #         9: "F",
129 | #         15: "P",
130 | #         16: "S",
131 | #         17: "Cl",
132 | #         35: "Br"
133 | #     }
134 | 
135 | #     def __init__(self, extra_atom_map: dict[int, str]):
136 | #         all_atomics = list(AtomVocabulary.ATOMIC_MAP.keys()) + list(extra_atom_map.keys())
137 | #         all_tokens = list(AtomVocabulary.ATOMIC_MAP.values()) + list(extra_atom_map.values())
138 | 
139 | #         _check_unique(all_atomics, "extended atomic numbers list")
140 | #         _check_unique(all_tokens, "extended tokens list")
141 | 
142 | #         extended_map = dict(list(AtomVocabulary.ATOMIC_MAP.items()) + list(extra_atom_map.items()))
143 | 
144 | #         atomic_tokens = list(extended_map.items())
145 | #         sorted_atomic_tokens = sorted(atomic_tokens, key=lambda a_t: a_t[0])
146 | #         sorted_atomics, sorted_tokens = tuple(zip(*sorted_atomic_tokens))
147 | 
148 | #         atomic_token_map = dict(zip(sorted_atomics, sorted_tokens))
149 | #         token_atomic_map = dict(zip(sorted_tokens, sorted_atomics))
150 | 
151 | #         self.atomic_token_map = atomic_token_map
152 | #         self.token_atomic_map = token_atomic_map
153 | 
154 | #         super().__init__(sorted_tokens)
155 | 
156 | #     def tokens_from_atomic(self, atomics: list[int]) -> list[str]:
157 | #         _check_type_all(atomics, int, "atomic numbers list")
158 | 
159 | #         with self._vocab_lock:
160 | #             tokens = [self.atomic_token_map[atom] for atom in atomics]
161 | 
162 | #         return tokens
163 | 
164 | #     def atomic_from_tokens(self, tokens: list[str]) -> list[int]:
165 | #         _check_type_all(tokens, str, "tokens list")
166 | 
167 | #         with self._vocab_lock:
168 | #             atomics = [self.token_atomic_map[token] for token in tokens]
169 | 
170 | #         return atomics
171 | 
172 | #     def indices_from_atomic(self, atomics: list[int], one_hot: Optional[bool] = False) -> indicesT:
173 | #         tokens = self.tokens_from_atomic(atomics)
174 | #         indices = self.indices_from_tokens(tokens, one_hot=one_hot)
175 | #         return indices
176 | 
177 | #     def atomic_from_indices(self, indices: list[int]) -> list[int]:
178 | #         tokens = self.tokens_from_indices(indices)
179 | #         atomics = self.atomic_from_tokens(tokens)
180 | #         return atomics
181 | 
182 | #     def to_bytes(self) -> bytes:
183 | #         with self._vocab_lock:
184 | #             obj_bytes = pickle.dumps(self.atomic_token_map, protocol=PICKLE_PROTOCOL)
185 | 
186 | #         return obj_bytes
187 | 
188 | #     @staticmethod
189 | #     def from_bytes(data: bytes) -> AtomVocabulary:
190 | #         full_map = pickle.loads(data)
191 | #         atomic_nums = set(AtomVocabulary.ATOMIC_MAP.keys())
192 | #         extra_map = {atom: token for atom, token in full_map.items() if atom not in atomic_nums}
193 | #         return AtomVocabulary(extra_map)
194 | 
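`Vocabulary` serialises only its ordered token list and rebuilds both lookup maps on load, so the saved bytes do not depend on dictionary ordering. The following sketch shows that round trip in isolation. `MiniVocabulary` is a hypothetical, lock-free reduction written for illustration, and the `"<PAD>"` token is an assumed example rather than something the repository defines here.

```python
import pickle


class MiniVocabulary:
    """Hypothetical, lock-free reduction of Vocabulary for illustration only."""

    def __init__(self, tokens: list[str]):
        if len(tokens) != len(set(tokens)):
            raise RuntimeError("tokens list cannot contain duplicates")

        self._tokens = list(tokens)
        self.token_idx_map = {token: idx for idx, token in enumerate(tokens)}
        self.idx_token_map = {idx: token for idx, token in enumerate(tokens)}

    def indices_from_tokens(self, tokens: list[str]) -> list[int]:
        return [self.token_idx_map[token] for token in tokens]

    def tokens_from_indices(self, indices: list[int]) -> list[str]:
        return [self.idx_token_map[idx] for idx in indices]

    def to_bytes(self) -> bytes:
        # Only the ordered token list is pickled; both maps are rebuilt on load
        return pickle.dumps(self._tokens, protocol=4)

    @staticmethod
    def from_bytes(data: bytes) -> "MiniVocabulary":
        return MiniVocabulary(pickle.loads(data))


vocab = MiniVocabulary(["<PAD>", "C", "N", "O"])
restored = MiniVocabulary.from_bytes(vocab.to_bytes())
indices = restored.indices_from_tokens(["O", "C"])
tokens = restored.tokens_from_indices(indices)
```

As with any pickle-based format, `from_bytes` should only be given data from a trusted source, since unpickling can execute arbitrary code.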


--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rssrwn/semla-flow/65d7106bc907e4d136deec8de7417386e3f9b10b/tests/__init__.py


--------------------------------------------------------------------------------
/tests/functional.py:
--------------------------------------------------------------------------------
  1 | import unittest
  2 | 
  3 | import numpy as np
  4 | import torch
  5 | from scipy.spatial.transform import Rotation
  6 | 
  7 | import semlaflow.util.functional as smolF
  8 | 
  9 | 
 10 | class TensorFnsTests(unittest.TestCase):
 11 |     def test_pairwise_concat_creates_stacked_pairs(self):
 12 |         vec_size = 4
 13 | 
 14 |         t = torch.rand((3, 2, vec_size))
 15 |         pairwise = smolF.pairwise_concat(t)
 16 | 
 17 |         expected_shape = (3, 2, 2, 2 * vec_size)
 18 |         first_vec = t[0, 0, :]
 19 |         second_vec = t[0, 1, :]
 20 | 
 21 |         self.assertEqual(expected_shape, pairwise.shape)
 22 | 
 23 |         self.assertTrue(torch.equal(first_vec, pairwise[0, 0, 0, :vec_size]))
 24 |         self.assertTrue(torch.equal(first_vec, pairwise[0, 0, 1, :vec_size]))
 25 | 
 26 |         self.assertTrue(torch.equal(second_vec, pairwise[0, 0, 1, vec_size:]))
 27 |         self.assertTrue(torch.equal(second_vec, pairwise[0, 1, 1, vec_size:]))
 28 | 
 29 |         self.assertTrue(torch.equal(first_vec, pairwise[0, 1, 0, vec_size:]))
 30 |         self.assertTrue(torch.equal(second_vec, pairwise[0, 1, 0, :vec_size]))
 31 | 
 32 |     def test_segment_sum_adds_feats_for_segments(self):
 33 |         batch_size = 2
 34 |         seq_len = 5
 35 |         num_feats = 4
 36 |         num_segments = 3
 37 | 
 38 |         t1 = torch.rand((seq_len, num_feats))
 39 |         t2 = torch.rand((seq_len, num_feats))
 40 |         data = torch.stack((t1, t2))
 41 |         segment_ids = torch.tensor([[0, 1, 1, 0, 2], [2, 2, 2, 0, 0]])
 42 | 
 43 |         expected_shape = (batch_size, num_segments, num_feats)
 44 | 
 45 |         exp_b0_s0 = t1[0] + t1[3]
 46 |         exp_b0_s1 = t1[1] + t1[2]
 47 |         exp_b0_s2 = t1[4]
 48 | 
 49 |         exp_b1_s0 = t2[3] + t2[4]
 50 |         exp_b1_s1 = torch.zeros(num_feats)
 51 |         exp_b1_s2 = t2[0] + t2[1] + t2[2]
 52 | 
 53 |         segment_sums = smolF.segment_sum(data, segment_ids, num_segments)
 54 | 
 55 |         self.assertEqual(expected_shape, segment_sums.shape)
 56 | 
 57 |         self.assertTrue(torch.equal(exp_b0_s0, segment_sums[0, 0]))
 58 |         self.assertTrue(torch.equal(exp_b0_s1, segment_sums[0, 1]))
 59 |         self.assertTrue(torch.equal(exp_b0_s2, segment_sums[0, 2]))
 60 | 
 61 |         self.assertTrue(torch.equal(exp_b1_s0, segment_sums[1, 0]))
 62 |         self.assertTrue(torch.equal(exp_b1_s1, segment_sums[1, 1]))
 63 |         self.assertTrue(torch.equal(exp_b1_s2, segment_sums[1, 2]))
 64 | 
 65 | 
 66 | class EdgeFnsTests(unittest.TestCase):
 67 |     def test_adj_from_node_mask_correct_adj(self):
 68 |         num_nodes = 4
 69 | 
 70 |         t1_nodes = torch.tensor([1, 1, 1, 1])
 71 |         t2_nodes = torch.tensor([1, 1, 1, 0])
 72 |         t3_nodes = torch.tensor([0, 0, 0, 0])
 73 |         node_mask = torch.stack((t1_nodes, t2_nodes, t3_nodes))
 74 | 
 75 |         exp_shape = (3, num_nodes, num_nodes)
 76 |         exp_type = torch.long
 77 | 
 78 |         b0_exp = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
 79 |         b1_exp = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
 80 |         b2_exp = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
 81 | 
 82 |         adjacency = smolF.adj_from_node_mask(node_mask)
 83 | 
 84 |         self.assertEqual(exp_shape, adjacency.shape)
 85 |         self.assertEqual(exp_type, adjacency.dtype)
 86 | 
 87 |         self.assertEqual(b0_exp, adjacency[0].tolist())
 88 |         self.assertEqual(b1_exp, adjacency[1].tolist())
 89 |         self.assertEqual(b2_exp, adjacency[2].tolist())
 90 | 
 91 |     def test_edges_from_adj_correct_edges(self):
 92 |         t1 = torch.tensor([[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0], [2, -1, 0, 0]])
 93 |         t2 = torch.tensor([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 1]])
 94 |         adjacency = torch.stack((t1, t2))
 95 | 
 96 |         exp_shape = (2, 8)
 97 |         exp_type = torch.long
 98 | 
 99 |         exp_edges_i_b0 = [0, 0, 0, 0, 1, 1, 3, 3]
100 |         exp_edges_j_b0 = [0, 1, 2, 3, 0, 2, 0, 1]
101 | 
102 |         exp_edges_i_b1 = [1, 3, 0, 0, 0, 0, 0, 0]
103 |         exp_edges_j_b1 = [3, 3, 0, 0, 0, 0, 0, 0]
104 | 
105 |         exp_mask_b0 = [1, 1, 1, 1, 1, 1, 1, 1]
106 |         exp_mask_b1 = [1, 1, 0, 0, 0, 0, 0, 0]
107 | 
108 |         (edge_is, edge_js), edge_mask = smolF.edges_from_adj(adjacency)
109 | 
110 |         # Check shapes
111 |         self.assertEqual(exp_shape, edge_is.shape)
112 |         self.assertEqual(exp_shape, edge_js.shape)
113 |         self.assertEqual(exp_shape, edge_mask.shape)
114 | 
115 |         # Check types
116 |         self.assertEqual(exp_type, edge_is.dtype)
117 |         self.assertEqual(exp_type, edge_js.dtype)
118 |         self.assertEqual(exp_type, edge_mask.dtype)
119 | 
120 |         # Check edge indices
121 |         self.assertEqual(exp_edges_i_b0, edge_is[0].tolist())
122 |         self.assertEqual(exp_edges_j_b0, edge_js[0].tolist())
123 |         self.assertEqual(exp_edges_i_b1, edge_is[1].tolist())
124 |         self.assertEqual(exp_edges_j_b1, edge_js[1].tolist())
125 | 
126 |         # Check mask
127 |         self.assertEqual(exp_mask_b0, edge_mask[0].tolist())
128 |         self.assertEqual(exp_mask_b1, edge_mask[1].tolist())
129 | 
130 |     def test_adj_from_edges_correct_adj(self):
131 |         num_nodes = 4
132 | 
133 |         edges = torch.tensor([[0, 0, 1], [0, 2, 2], [1, 0, 1], [1, 3, 0], [2, 2, 3], [3, 1, 1]])
134 | 
135 |         exp_shape = (num_nodes, num_nodes)
136 |         exp_type = torch.long
137 | 
138 |         exp_adj = [[1, 0, 2, 0], [1, 0, 0, 0], [0, 0, 3, 0], [0, 1, 0, 0]]
139 | 
140 |         edge_indices = edges[:, :2]
141 |         edge_types = edges[:, 2]
142 | 
143 |         adjacency = smolF.adj_from_edges(edge_indices, edge_types, num_nodes)
144 | 
145 |         self.assertEqual(exp_shape, adjacency.shape)
146 |         self.assertEqual(exp_type, adjacency.dtype)
147 | 
148 |         self.assertEqual(exp_adj, adjacency.tolist())
149 | 
150 |     def test_edges_from_nodes_fully_connected(self):
151 |         num_nodes = 4
152 | 
153 |         coords_b0 = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 3.0, 1.0], [4.0, -2.0, -3.0]])
154 |         coords_b1 = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [3.0, 4.0, 5.0], [-1.0, -5.0, 2.0]])
155 |         coords = torch.stack((coords_b0, coords_b1))
156 | 
157 |         mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
158 | 
159 |         exp_shape = (2, num_nodes, num_nodes)
160 |         exp_type = torch.long
161 | 
162 |         exp_adj_b0 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
163 |         exp_adj_b1 = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
164 | 
165 |         adjacency = smolF.edges_from_nodes(coords, node_mask=mask)
166 | 
167 |         self.assertEqual(exp_shape, adjacency.shape)
168 |         self.assertEqual(exp_type, adjacency.dtype)
169 | 
170 |         self.assertEqual(exp_adj_b0, adjacency[0].tolist())
171 |         self.assertEqual(exp_adj_b1, adjacency[1].tolist())
172 | 
173 |     def test_edges_from_nodes_correct_neighbours(self):
174 |         num_nodes = 4
175 | 
176 |         coords_b0 = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 3.0, 1.0], [4.0, -2.0, -3.0]])
177 |         coords_b1 = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [3.0, 4.0, 5.0], [-1.0, -5.0, 2.0]])
178 |         coords = torch.stack((coords_b0, coords_b1))
179 | 
180 |         mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
181 | 
182 |         exp_shape = (2, num_nodes, num_nodes)
183 |         exp_type = torch.long
184 | 
185 |         exp_adj_b0 = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
186 |         exp_adj_b1 = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
187 | 
188 |         adjacency = smolF.edges_from_nodes(coords, k=2, node_mask=mask)
189 | 
190 |         self.assertEqual(exp_shape, adjacency.shape)
191 |         self.assertEqual(exp_type, adjacency.dtype)
192 | 
193 |         self.assertEqual(exp_adj_b0, adjacency[0].tolist())
194 |         self.assertEqual(exp_adj_b1, adjacency[1].tolist())
195 | 
196 | 
197 | class SparseFnsTests(unittest.TestCase):
198 |     def test_gather_edge_features(self):
199 |         feats_b0 = torch.tensor(
200 |             [
201 |                 [[0.5, 1.0], [0.1, -0.5], [5.0, -2.0], [-0.1, 0.8]],
202 |                 [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
203 |                 [[9.0, 10.0], [11.0, 12.0], [13.0, 14.0], [15.0, 16.0]],
204 |                 [[0.6, -0.2], [0.5, -2.0], [-7.0, 4.0], [5.0, 6.0]],
205 |             ]
206 |         )
207 |         feats_b1 = torch.tensor(
208 |             [
209 |                 [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
210 |                 [[-1.0, -2.0], [-2.0, -3.0], [-3.0, -4.0], [-4.0, -5.0]],
211 |                 [[0.6, 0.9], [0.3, 0.2], [0.1, -0.7], [-0.5, 0.9]],
212 |                 [[1.5, -2.8], [6.3, 2.9], [5.8, 9.1], [0.4, -3.7]],
213 |             ]
214 |         )
215 |         feats = torch.stack((feats_b0, feats_b1))
216 | 
217 |         adj_1 = torch.tensor(
218 |             [
219 |                 [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
220 |                 [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]],
221 |             ]
222 |         ).long()
223 | 
224 |         adj_2 = torch.tensor(
225 |             [
226 |                 [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 0]],
227 |                 [[0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]],
228 |             ]
229 |         ).long()
230 | 
231 |         exp_feats_1_b0 = [[[0.5, 1.0]], [[3.0, 4.0]], [[13.0, 14.0]], [[5.0, 6.0]]]
232 |         exp_feats_1_b1 = [[[0.7, 0.8]], [[-4.0, -5.0]], [[-0.5, 0.9]], [[0.4, -3.7]]]
233 |         exp_feats_1 = [exp_feats_1_b0, exp_feats_1_b1]
234 | 
235 |         exp_feats_2_b0 = [
236 |             [[0.1, -0.5], [5.0, -2.0]],
237 |             [[1.0, 2.0], [7.0, 8.0]],
238 |             [[11.0, 12.0], [15.0, 16.0]],
239 |             [[0.6, -0.2], [0.5, -2.0]],
240 |         ]
241 |         exp_feats_2_b1 = [
242 |             [[0.3, 0.4], [0.7, 0.8]],
243 |             [[-1.0, -2.0], [-3.0, -4.0]],
244 |             [[0.6, 0.9], [0.3, 0.2]],
245 |             [[6.3, 2.9], [5.8, 9.1]],
246 |         ]
247 |         exp_feats_2 = [exp_feats_2_b0, exp_feats_2_b1]
248 | 
249 |         gathered_1 = smolF.gather_edge_features(feats, adj_1)
250 |         gathered_2 = smolF.gather_edge_features(feats, adj_2)
251 | 
252 |         np.testing.assert_almost_equal(exp_feats_1, gathered_1.tolist(), decimal=5)
253 |         np.testing.assert_almost_equal(exp_feats_2, gathered_2.tolist(), decimal=5)
254 | 
255 | 
256 | class GeometryFnsTests(unittest.TestCase):
257 |     def test_calc_distances_without_edges(self):
258 |         num_nodes = 4
259 | 
260 |         coords_b0 = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [-1.0, 0.0, 1.0], [5.0, -1.0, -2.0]])
261 |         coords_b1 = torch.tensor([[0.5, 1.0, -0.25], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
262 |         coords = torch.stack((coords_b0, coords_b1))
263 | 
264 |         exp_shape = (2, num_nodes, num_nodes)
265 |         exp_type = torch.float
266 | 
267 |         exp_b0 = [[0.0, 0.0, 5.0, 29.0], [0.0, 0.0, 5.0, 29.0], [5.0, 5.0, 0.0, 46.0], [29.0, 29.0, 46.0, 0.0]]
268 |         exp_b1 = [
269 |             [0.0, 1.8125, 1.3125, 1.3125],
270 |             [1.8125, 0.0, 3.0, 3.0],
271 |             [1.3125, 3.0, 0.0, 0.0],
272 |             [1.3125, 3.0, 0.0, 0.0],
273 |         ]
274 | 
275 |         sqrd_dists = smolF.calc_distances(coords, sqrd=True)
276 |         dists = torch.sqrt(sqrd_dists)
277 | 
278 |         self.assertEqual(exp_shape, sqrd_dists.shape)
279 |         self.assertEqual(exp_type, sqrd_dists.dtype)
280 | 
281 |         np.testing.assert_almost_equal(exp_b0, sqrd_dists[0].tolist(), decimal=5)
282 |         np.testing.assert_almost_equal(exp_b1, sqrd_dists[1].tolist(), decimal=5)
283 | 
284 |         np.testing.assert_almost_equal(np.sqrt(exp_b0).tolist(), dists[0].tolist(), decimal=5)
285 |         np.testing.assert_almost_equal(np.sqrt(exp_b1).tolist(), dists[1].tolist(), decimal=5)
286 | 
287 |     def test_calc_distances_from_edges(self):
288 |         num_edges = 8
289 | 
290 |         coords_b0 = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [-1.0, 0.0, 1.0], [5.0, -1.0, -2.0]])
291 |         coords_b1 = torch.tensor([[0.5, 1.0, -0.25], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
292 |         coords = torch.stack((coords_b0, coords_b1))
293 | 
294 |         edge_is = torch.tensor([[0, 0, 0, 0, 1, 2, 2, 2], [0, 0, 0, 1, 1, 0, 0, 0]])
295 |         edge_js = torch.tensor([[0, 1, 2, 3, 0, 2, 0, 0], [0, 1, 2, 2, 2, 0, 0, 0]])
296 |         edges = (edge_is, edge_js)
297 | 
298 |         exp_shape = (2, num_edges)
299 |         exp_type = torch.float
300 | 
301 |         exp_b0 = [0.0, 0.0, 5.0, 29.0, 0.0, 0.0, 5.0, 5.0]
302 |         exp_b1 = [0.0, 1.8125, 1.3125, 3.0, 3.0, 0.0, 0.0, 0.0]
303 | 
304 |         sqrd_dists = smolF.calc_distances(coords, edges=edges, sqrd=True)
305 |         dists = torch.sqrt(sqrd_dists)
306 | 
307 |         self.assertEqual(exp_shape, sqrd_dists.shape)
308 |         self.assertEqual(exp_type, sqrd_dists.dtype)
309 | 
310 |         np.testing.assert_almost_equal(exp_b0, sqrd_dists[0].tolist(), decimal=5)
311 |         np.testing.assert_almost_equal(exp_b1, sqrd_dists[1].tolist(), decimal=5)
312 | 
313 |         np.testing.assert_almost_equal(np.sqrt(exp_b0).tolist(), dists[0].tolist(), decimal=5)
314 |         np.testing.assert_almost_equal(np.sqrt(exp_b1).tolist(), dists[1].tolist(), decimal=5)
315 | 
316 |     def test_calc_com_correct_centre(self):
317 |         coords_b0 = torch.tensor([[1.0, 1.0, 1.0], [2.0, -2.0, 0.0], [-4.0, 2.0, 2.0], [3.0, -5.0, -5.0]])
318 |         coords_b1 = torch.tensor([[1.0, 1.0, 1.0], [2.0, -2.0, 0.0], [-4.0, 2.0, 1.0], [3.0, -5.0, -5.0]])
319 |         coords_b2 = torch.tensor([[1.0, 1.0, 1.0], [2.0, -2.0, 0.0], [-4.0, 2.0, 1.0], [3.0, -5.0, -5.0]])
320 |         coords = torch.stack((coords_b0, coords_b1, coords_b2))
321 |         mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]])
322 | 
323 |         exp_shape = (3, 1, 3)
324 |         exp_type = torch.float
325 | 
326 |         exp_com_b0 = [0.5, -1.0, -0.5]
327 |         exp_com_b1 = [1.5, -0.5, 0.5]
328 |         exp_com_b2 = [np.nan, np.nan, np.nan]
329 | 
330 |         com = smolF.calc_com(coords, node_mask=mask)
331 | 
332 |         self.assertEqual(exp_shape, com.shape)
333 |         self.assertEqual(exp_type, com.dtype)
334 | 
335 |         self.assertEqual(exp_com_b0, com[0, 0, :].tolist())
336 |         self.assertEqual(exp_com_b1, com[1, 0, :].tolist())
337 |         np.testing.assert_equal(exp_com_b2, com[2, 0, :].tolist())
338 | 
339 |     def test_rotate_rotates_all_coords_correctly(self):
340 |         coords = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 2.0, 0.5]])
341 | 
342 |         rot1 = [np.pi / 2, 0.0, 0.0]
343 |         rot2 = [0.0, np.pi / 2, 0.0]
344 |         rot3 = [0.0, 0.0, np.pi / 2]
345 |         rot4 = [np.pi / 2, np.pi, np.pi / 2]
346 |         rot5 = [-np.pi / 2, 0.0, 2 * np.pi]
347 | 
348 |         exp_coords_1 = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, -1.0, 0.0], [-1.0, -0.5, 2.0]]
349 |         exp_coords_2 = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 2.0, 1.0]]
350 |         exp_coords_3 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [-2.0, -1.0, 0.5]]
351 |         exp_coords_4 = [[0.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0], [0.5, 1.0, -2.0]]
352 |         exp_coords_5 = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [-1.0, 0.5, -2.0]]
353 | 
354 |         rotated_1 = smolF.rotate(coords, rot1)
355 |         rotated_2 = smolF.rotate(coords, rot2)
356 |         rotated_3 = smolF.rotate(coords, rot3)
357 |         rotated_4 = smolF.rotate(coords, rot4)
358 |         rotated_5 = smolF.rotate(coords, rot5)
359 | 
360 |         np.testing.assert_almost_equal(exp_coords_1, rotated_1.tolist(), decimal=5)
361 |         np.testing.assert_almost_equal(exp_coords_2, rotated_2.tolist(), decimal=5)
362 |         np.testing.assert_almost_equal(exp_coords_3, rotated_3.tolist(), decimal=5)
363 |         np.testing.assert_almost_equal(exp_coords_4, rotated_4.tolist(), decimal=5)
364 |         np.testing.assert_almost_equal(exp_coords_5, rotated_5.tolist(), decimal=5)
365 | 
366 |     def test_rotate_agrees_with_scipy(self):
367 |         coords = torch.rand((10, 3))
368 | 
369 |         angles_1 = (np.random.rand(3) * np.pi * 2).tolist()
370 |         angles_2 = (np.random.rand(3) * np.pi * 2).tolist()
371 |         angles_3 = (np.random.rand(3) * np.pi * 2).tolist()
372 | 
373 |         rot1 = Rotation.from_euler("xyz", angles_1)
374 |         rot2 = Rotation.from_euler("xyz", angles_2)
375 |         rot3 = Rotation.from_euler("xyz", angles_3)
376 | 
377 |         exp_coords_1 = rot1.apply(coords.tolist())
378 |         exp_coords_2 = rot2.apply(coords.tolist())
379 |         exp_coords_3 = rot3.apply(coords.tolist())
380 | 
381 |         rotated_1 = smolF.rotate(coords, angles_1)
382 |         rotated_2 = smolF.rotate(coords, angles_2)
383 |         rotated_3 = smolF.rotate(coords, angles_3)
384 | 
385 |         np.testing.assert_almost_equal(exp_coords_1, rotated_1.tolist(), decimal=5)
386 |         np.testing.assert_almost_equal(exp_coords_2, rotated_2.tolist(), decimal=5)
387 |         np.testing.assert_almost_equal(exp_coords_3, rotated_3.tolist(), decimal=5)
388 | 
389 | 
390 | if __name__ == "__main__":
391 |     unittest.main()
392 | 


--------------------------------------------------------------------------------