├── .gitignore
├── LICENSE
├── README.md
├── environment.yaml
├── extra_requirements.txt
├── notebooks
    ├── qm9.ipynb
    └── visualization_demo.ipynb
├── semlaflow
    ├── __init__.py
    ├── data
    │   ├── __init__.py
    │   ├── datamodules.py
    │   ├── datasets.py
    │   ├── interpolate.py
    │   └── util.py
    ├── evaluate.py
    ├── models
    │   ├── __init__.py
    │   ├── egnn.py
    │   ├── eqgat.py
    │   ├── fm.py
    │   └── semla.py
    ├── predict.py
    ├── preprocess.py
    ├── scriptutil.py
    ├── train.py
    └── util
    │   ├── __init__.py
    │   ├── functional.py
    │   ├── metrics.py
    │   ├── molrepr.py
    │   ├── rdkit.py
    │   └── tokeniser.py
└── tests
    ├── __init__.py
    └── functional.py


/.gitignore:
--------------------------------------------------------------------------------
# Basic Gitignore file

# User specific paths
datasets/
wandb/
output/

# Editors
.vscode/

# Jupyter notebook checkpoints
notebooks/.ipynb_checkpoints/

# Python cache files
__pycache__/
*/__pycache__/
**/__pycache__/
*.pyc

# Log files
molproc_logs/
lightning_logs/
notebooks/lightning_logs/
wandb/
nohup.out

# Slurm submission scripts and logs
subslurm/
slurm-*

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Ross Irwin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SemlaFlow - Efficient Molecular Generation with Flow Matching and Semla

This project introduces Semla, a novel equivariant attention-based message-passing architecture for molecular design and dynamics tasks. We train a molecular generation model, SemlaFlow, using flow matching with optimal transport to generate realistic 3D molecular structures.


## Installation

All of the code was run using a mamba/conda environment. You can of course use a different environment manager; all core requirements are contained in the `environment.yaml` file. Using mamba/conda you can recreate the environment as follows:
1. `mamba env create --file environment.yaml`
2. `mamba activate semlaflow`

For development (and to run the notebooks) you will also need to install the extra requirements:
3. `pip install -r extra_requirements.txt`


## Datasets

For ease of use we have provided the processed data files in a Google Drive folder [here](https://drive.google.com/drive/folders/1rHi5JzN05bsGRGQUcWRmDu-Ilfoa9EAT?usp=sharing). Copy the folder called `smol` from the QM9 or GEOM Drugs folders and point to the `smol` folder when running the scripts. For example, pass `--data_path path/to/data/qm9/smol` to the script you wish to run.
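
As a minimal sketch (not part of the repository), a small helper could check that a path really points at a `smol` folder before it is passed as `--data_path`. The function names `resolve_smol_dir` and `data_path_args` are hypothetical and exist only for this illustration; only the `--data_path` flag and the `smol` folder name come from the text above.

```python
from pathlib import Path


def resolve_smol_dir(data_path: str) -> Path:
    """Return the resolved path if it is an existing folder named `smol`.

    Hypothetical helper for illustration; it is not part of SemlaFlow.
    """
    path = Path(data_path).expanduser().resolve()
    # The README asks for the `smol` folder itself, not its parent.
    if path.name != "smol":
        raise ValueError(f"expected a folder named 'smol', got '{path.name}'")
    if not path.is_dir():
        raise FileNotFoundError(f"no such directory: {path}")
    return path


def data_path_args(data_path: str) -> list[str]:
    """Build the `--data_path` argument pair after validating the folder."""
    return ["--data_path", str(resolve_smol_dir(data_path))]
```

Validating the folder up front turns a mistyped path into an immediate, readable error instead of a failure partway through data loading.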


### Data Prep

We copied the code from MiDi (https://github.com/cvignac/MiDi) to download the QM9 dataset and create the data splits. The `notebooks/qm9.ipynb` notebook contains the code to do this, and to create the _Smol_ internal dataset representation used for training.

For GEOM Drugs we also follow the URLs provided in the MiDi repo. GEOM Drugs is preprocessed using the `preprocess.py` script. GEOM Drugs URLs from MiDi are as follows:
* train: https://drive.switch.ch/index.php/s/UauSNgSMUPQdZ9v
* validation: https://drive.switch.ch/index.php/s/YNW5UriYEeVCDnL
* test: https://drive.switch.ch/index.php/s/GQW9ok7mPInPcIo


## Running

Once you have created and activated the environment successfully, you can run the code.

### Scripts

We provide four scripts in the repository:
* `preprocess` - Used for preprocessing larger datasets into the internal representation used by the model for training
* `train` - Trains a MolFlow model on preprocessed data
* `evaluate` - Evaluates a trained model and prints the results
* `predict` - Runs the sampling for a trained model and saves the generated molecules

Each script can be run as follows (where `