├── selfies ├── utils │ ├── __init__.py │ ├── selfies_utils.py │ ├── matching_utils.py │ └── encoding_utils.py ├── exceptions.py ├── compatibility.py ├── constants.py ├── __init__.py ├── grammar_rules.py ├── bond_constraints.py ├── decoder.py ├── encoder.py └── mol_graph.py ├── original_code_from_paper ├── gan │ ├── GAN_smiles │ │ ├── README.md │ │ ├── one_hot_converter.py │ │ └── GAN.py │ ├── GAN_selfies │ │ ├── README.md │ │ ├── translate.py │ │ ├── one_hot_converter.py │ │ └── GAN.py │ └── README.md ├── vae │ ├── VAE_dependencies │ │ ├── README.md │ │ ├── Datasets │ │ │ ├── README.md │ │ │ └── QM9 │ │ │ │ └── README.md │ │ ├── GrammarVAE_grammar.py │ │ ├── data_loader.py │ │ └── GrammarVAE_codes.py │ ├── settings.yml │ ├── settingsSELFIES.yml │ ├── settingsDeepSMILES.yml │ ├── settingsGrammarVAE.yml │ ├── settingsSMILES.yml │ └── README.md ├── README.md ├── bitflips_in_paper_Fig3.txt ├── bitflip_from_mdma.py └── environment.yml ├── docs ├── requirements.txt ├── README.md ├── Makefile ├── make.bat └── source │ ├── index.rst │ ├── conf.py │ └── selfies.rst ├── examples ├── vae_example │ ├── datasets │ │ └── README.md │ ├── settings.yml │ ├── README.md │ └── data_loader.py ├── VAE_LS_Validity.png ├── selfies_indice.png └── workshop2021 │ └── SELFIES_working_groups.pdf ├── tox.ini ├── .readthedocs.yml ├── tests ├── conftest.py ├── test_sets │ └── custom_cases.csv ├── run_on_large_dataset.py ├── test_on_datasets.py ├── test_selfies_utils.py ├── test_selfies.py └── test_specific_cases.py ├── .github └── workflows │ └── ci.yml ├── setup.py ├── .gitignore ├── CHANGELOG.md ├── LICENSE └── README.md /selfies/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_smiles/README.md: -------------------------------------------------------------------------------- 1 | GAN for SMILES 2 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_selfies/README.md: -------------------------------------------------------------------------------- 1 | GAN for SELFIES 2 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/README.md: -------------------------------------------------------------------------------- 1 | additional data files 2 | -------------------------------------------------------------------------------- /docs/requirements.txt: -------------------------------------------------------------------------------- 1 | nbsphinx 2 | sphinx-autodoc-typehints 3 | sphinx-rtd-theme 4 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/Datasets/README.md: -------------------------------------------------------------------------------- 1 | datasets (only qm9) 2 | -------------------------------------------------------------------------------- /examples/vae_example/datasets/README.md: -------------------------------------------------------------------------------- 1 | Dataset files with molecules represented in SMILES. 
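For reference, a minimal sketch of how such a dataset file might be loaded and converted to SELFIES is shown below. It is only a sketch: it assumes one SMILES string per line and the ``datasets/0SelectedSMILES_QM9.txt`` path referenced in ``settings.yml``; adjust the parsing if your file carries a header or an index column.

```python
# Minimal loading sketch (assumed layout: one SMILES string per line).
import selfies as sf

with open("datasets/0SelectedSMILES_QM9.txt") as f:
    smiles_list = [line.strip() for line in f if line.strip()]

# Convert every SMILES string into its SELFIES counterpart.
selfies_list = [sf.encoder(s) for s in smiles_list]
print(f"converted {len(selfies_list)} molecules, e.g. {selfies_list[0]}")
```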
2 | -------------------------------------------------------------------------------- /examples/VAE_LS_Validity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aspuru-guzik-group/selfies/HEAD/examples/VAE_LS_Validity.png -------------------------------------------------------------------------------- /examples/selfies_indice.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aspuru-guzik-group/selfies/HEAD/examples/selfies_indice.png -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/Datasets/QM9/README.md: -------------------------------------------------------------------------------- 1 | QM9 representation in SMILES, DeepSMILES, SELFIES, GrammarVAE 2 | -------------------------------------------------------------------------------- /examples/workshop2021/SELFIES_working_groups.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aspuru-guzik-group/selfies/HEAD/examples/workshop2021/SELFIES_working_groups.pdf -------------------------------------------------------------------------------- /original_code_from_paper/README.md: -------------------------------------------------------------------------------- 1 | This directory contains the original code from the SELFIES 2 | [paper](https://arxiv.org/abs/1905.13741). The required dependencies to run 3 | this code are listed in ``environment.yml``. Note that this code ran on 4 | an old version of SELFIES (``v0.1.1``). 5 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py37,py38,py39,py310 3 | requires = tox-conda 4 | 5 | [testenv] 6 | setenv = 7 | CONDA_DLL_SEARCH_MODIFICATION_ENABLE = 1 8 | whitelist_externals = python 9 | 10 | [testenv:py{37,38,39,310}] 11 | conda_deps = 12 | pytest 13 | rdkit 14 | conda_channels = 15 | conda-forge 16 | commands = pytest --basetemp="{envtmpdir}" {posargs} 17 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | To build the documentation, please install 2 | * the [sphinx-autodoc-typehints](https://github.com/agronholm/sphinx-autodoc-typehints) extension 3 | * the [Read the Docs Sphinx theme](https://github.com/readthedocs/sphinx_rtd_theme) 4 | * the [nbsphinx](https://github.com/spatialaudio/nbsphinx) extension 5 | 6 | and then run 7 | ``` 8 | python -m sphinx source build 9 | ``` 10 | in this directory. 
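The same build can also be invoked from Python instead of the shell. The snippet below is only a sketch: it assumes Sphinx >= 1.7 (which provides ``sphinx.cmd.build``), that the extensions above are installed, and that it is run from this ``docs/`` directory, writing HTML output to ``build/``.

```python
# Programmatic equivalent of `python -m sphinx source build`.
from sphinx.cmd.build import main as sphinx_build

exit_code = sphinx_build(["source", "build"])  # builds HTML by default
print("sphinx-build finished with exit code", exit_code)
```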
11 | -------------------------------------------------------------------------------- /examples/vae_example/settings.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 100 3 | smiles_file: datasets/0SelectedSMILES_QM9.txt 4 | type_of_encoding: 0 5 | 6 | decoder: 7 | latent_dimension: 50 8 | gru_neurons_num: 100 9 | gru_stack_size: 1 10 | 11 | encoder: 12 | layer_1d: 100 13 | layer_2d: 100 14 | layer_3d: 100 15 | latent_dimension: 50 16 | 17 | training: 18 | KLD_alpha: 1.0e-05 19 | lr_enc: 0.0001 20 | lr_dec: 0.0001 21 | num_epochs: 5000 22 | sample_num: 1000 23 | -------------------------------------------------------------------------------- /.readthedocs.yml: -------------------------------------------------------------------------------- 1 | # Read the Docs configuration file 2 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details 3 | 4 | # Required 5 | version: 2 6 | 7 | # Build documentation in the docs/ directory with Sphinx 8 | sphinx: 9 | configuration: docs/source/conf.py 10 | 11 | # Optionally build your docs in additional formats such as PDF 12 | formats: 13 | - pdf 14 | 15 | # Optionally set the version of Python and requirements required to build your docs 16 | python: 17 | version: "3.8" 18 | install: 19 | - requirements: docs/requirements.txt 20 | -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | 4 | def pytest_addoption(parser): 5 | parser.addoption( 6 | "--trials", action="store", default=10000, 7 | help="number of trials for random tests" 8 | ) 9 | parser.addoption( 10 | "--dataset_samples", action="store", default=10000, 11 | help="number of samples to test from the data sets" 12 | ) 13 | 14 | 15 | @pytest.fixture 16 | def trials(request): 17 | return int(request.config.getoption("--trials")) 18 | 19 | 20 | @pytest.fixture 21 | def dataset_samples(request): 22 | return int(request.config.getoption("--dataset_samples")) 23 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settings.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/2RGSMILES_QM9.txt 5 | type_of_encoding: 2 6 | decoder: 7 | gru_neurons_num: 100 8 | gru_stack_size: 1 9 | latent_dimension: 25 10 | encoder: 11 | latent_dimension: 25 12 | layer_1d: 100 13 | layer_2d: 50 14 | layer_3d: 25 15 | training: 16 | KLD_alpha: 3.9914730047188435e-06 17 | checkpoint: true 18 | latent_dimension: 25 19 | lr_dec: 0.002427375401692564 20 | lr_enc: 0.002427375401692564 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsSELFIES.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/2RGSMILES_QM9.txt 5 | type_of_encoding: 2 6 | decoder: 7 | gru_neurons_num: 100 8 | gru_stack_size: 1 9 | latent_dimension: 25 10 | encoder: 11 | latent_dimension: 25 12 | layer_1d: 100 13 | layer_2d: 50 14 | layer_3d: 25 15 | training: 16 | KLD_alpha: 3.9914730047188435e-06 17 | checkpoint: true 18 | latent_dimension: 25 19 | lr_dec: 
0.002427375401692564 20 | lr_enc: 0.002427375401692564 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsDeepSMILES.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/1DeepSMILES_QM9.txt 5 | type_of_encoding: 1 6 | decoder: 7 | gru_neurons_num: 10 8 | gru_stack_size: 5 9 | latent_dimension: 25 10 | encoder: 11 | latent_dimension: 25 12 | layer_1d: 2000 13 | layer_2d: 1000 14 | layer_3d: 500 15 | training: 16 | KLD_alpha: 0.07233630964775388 17 | checkpoint: true 18 | latent_dimension: 25 19 | lr_dec: 0.00022284476453107537 20 | lr_enc: 0.00022284476453107537 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsGrammarVAE.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/3GrammarVAE_QM9.txt 5 | type_of_encoding: 3 6 | decoder: 7 | gru_neurons_num: 100 8 | gru_stack_size: 1 9 | latent_dimension: 300 10 | encoder: 11 | latent_dimension: 300 12 | layer_1d: 100 13 | layer_2d: 50 14 | layer_3d: 500 15 | training: 16 | KLD_alpha: 0.00019199633293733154 17 | checkpoint: true 18 | latent_dimension: 300 19 | lr_dec: 8.804286872231203e-05 20 | lr_enc: 8.804286872231203e-05 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsSMILES.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies\Datasets\QM9\0SelectedSMILES_QM9.txt 5 | type_of_encoding: 0 6 | 7 | decoder: 8 | gru_neurons_num: 10 9 | gru_stack_size: 5 10 | latent_dimension: 300 11 | 12 | encoder: 13 | latent_dimension: 300 14 | layer_1d: 100 15 | layer_2d: 1000 16 | layer_3d: 25 17 | 18 | training: 19 | KLD_alpha: 0.0032309168486524178 20 | checkpoint: true 21 | latent_dimension: 300 22 | lr_dec: 6.98450538191299e-05 23 | lr_enc: 6.98450538191299e-05 24 | num_epochs: 5000 25 | sample_num: 1000 26 | tensorBoard_graphing: false 27 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = source 9 | BUILDDIR = build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: CI Tests 2 | 3 | on: 4 | push: 5 | branches: [master] 6 | pull_request: 7 | branches: [master] 8 | 9 | jobs: 10 | 11 | build: 12 | 13 | runs-on: ubuntu-latest 14 | strategy: 15 | matrix: 16 | python-version: [3.9] 17 | 18 | steps: 19 | - uses: actions/checkout@v2 20 | - name: Set up Python ${{ matrix.python-version }} 21 | uses: actions/setup-python@v1 22 | with: 23 | python-version: ${{ matrix.python-version }} 24 | - name: Install basic dependencies 25 | run: | 26 | python -m pip install --upgrade pip 27 | - name: Test with tox 28 | run: | 29 | pip install -e . 30 | pip install tox 31 | tox 32 | -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=source 11 | set BUILDDIR=build 12 | 13 | if "%1" == "" goto help 14 | 15 | %SPHINXBUILD% >NUL 2>NUL 16 | if errorlevel 9009 ( 17 | echo. 18 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 19 | echo.installed, then set the SPHINXBUILD environment variable to point 20 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 21 | echo.may add the Sphinx directory to PATH. 22 | echo. 23 | echo.If you don't have Sphinx installed, grab it from 24 | echo.http://sphinx-doc.org/ 25 | exit /b 1 26 | ) 27 | 28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 29 | goto end 30 | 31 | :help 32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 33 | 34 | :end 35 | popd 36 | -------------------------------------------------------------------------------- /selfies/exceptions.py: -------------------------------------------------------------------------------- 1 | class SMILESParserError(ValueError): 2 | """Exception raised when a SMILES fails to be parsed. 3 | """ 4 | 5 | def __init__(self, smiles, reason="N/A", idx=-1): 6 | self.smiles = smiles 7 | self.idx = idx 8 | self.reason = reason 9 | 10 | def __str__(self): 11 | err_msg = "\n" \ 12 | "\tSMILES: {smiles}\n" \ 13 | "\t {pointer}\n" \ 14 | "\tIndex: {index}\n" \ 15 | "\tReason: {reason}" 16 | 17 | return err_msg.format( 18 | smiles=self.smiles, 19 | pointer=(" " * self.idx + "^"), 20 | index=self.idx, 21 | reason=self.reason 22 | ) 23 | 24 | 25 | class EncoderError(Exception): 26 | """Exception raised by :func:`selfies.encoder`. 27 | """ 28 | 29 | pass 30 | 31 | 32 | class DecoderError(Exception): 33 | """Exception raised by :func:`selfies.decoder`. 34 | """ 35 | 36 | pass 37 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/README.md: -------------------------------------------------------------------------------- 1 | ## SELFIES Example: Variational Auto Encoder (VAE) for chemistry 2 | comparing SMILES, DeepSMILES, GrammarVAE and SELFIES representation using reconstruction quality, diversity and latent space validity as metrics of interest 3 | - v0.1.0 -- 04. August 2019 4 | 5 | ### information: 6 | This is the original code used to generate the data in our paper. 
7 | 8 | It used a hand-written SELFIES encoding (Table 2 of the paper), and cannot easily be adapted to other situations. If you want to use the VAE, please see the VAE example in examples/vae_example (chemistry_vae.py). 9 | 10 | That code is connected with selfies.encoder/selfies.decoder, and can be applied to general datasets. For more documentation, please look there. 11 | 12 | 13 | settings*.yml 14 | these files contain the settings used for the best models described in the paper 15 | 16 | 17 | 18 | For comments, bug reports or feature ideas, please send an email to 19 | mario.krenn@utoronto.ca and alan@aspuru.com 20 | 21 | -------------------------------------------------------------------------------- /examples/vae_example/README.md: -------------------------------------------------------------------------------- 1 | # SELFIES Example: Variational Autoencoder (VAE) for Chemistry 2 | 3 | An implementation of a variational autoencoder that runs on both SMILES and 4 | SELFIES. Included is code that compares the SMILES and SELFIES representations 5 | for a VAE using reconstruction quality, diversity, and latent space validity 6 | as metrics of interest. 7 | 8 | ## Dependencies 9 | Dependencies are ``pytorch``, ``rdkit``, and ``pyyaml``, which can be installed 10 | using Conda. 11 | 12 | ## Files 13 | 14 | * ``chemistry_vae.py``: the main file; contains the model definitions, 15 | the data processing, and the training. 16 | * ``settings.yml``: a file containing the hyperparameters of the 17 | model and the training. Also configures the VAE to run on either SMILES 18 | or SELFIES. 19 | * ``data_loader.py``: contains helper methods that convert SMILES and SELFIES 20 | into integer-encoded or one-hot encoded vectors. 21 | 22 | ### Tested with: 23 | - Python 3.7 24 | 25 | CPU and GPU supported 26 | 27 | For comments, bug reports or feature ideas, please send an email to 28 | mario.krenn@utoronto.ca and alan@aspuru.com 29 | 30 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/README.md: -------------------------------------------------------------------------------- 1 | # Package requirements for running the code: 2 | - Pytorch 3 | - TensorBoardX 4 | - rdkit 5 | - numpy 6 | - matplotlib 7 | 8 | 9 | # File Navigator: 10 | 11 | GAN_selfies: 12 | - 2RGSMILES_QM9.txt : Dearomatized QM9 dataset (SELFIES representation, using the alphabet described in the main text; symbols shortened for simplicity) 13 | - GAN.py: Code for running the generative adversarial network 14 | - one_hot_converter.py: Code for creating one-hot encodings of molecular strings 15 | - adjusted_selfies_fcts.py: SMILES to SELFIES conversion file 16 | - GPlus2S.py: SMILES to SELFIES conversion file 17 | - translate.py: General helper functions file 18 | 19 | GAN_smiles: 20 | - GAN.py: Code for running the generative adversarial network 21 | - one_hot_converter.py: Code for creating one-hot encodings of molecular strings 22 | - smiles_qm9.txt: Dearomatized QM9 dataset (SMILES representation) 23 | 24 | # How to Run the Code: 25 | Step 1: cd into either 'GAN_selfies' or 'GAN_smiles', depending on which molecular representation you would like to run the GAN on. 26 | 27 | Step 2: run GAN.py. 
28 | The code will automatically detect the availability of a GPU on your device, and run multiple models with different hyperparameters 29 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import setuptools 4 | 5 | with open("README.md", "r") as fh: 6 | long_description = fh.read() 7 | 8 | setuptools.setup( 9 | name="selfies", 10 | version="2.2.0", 11 | author="Mario Krenn, Alston Lo, Robert Pollice and many other contributors", 12 | author_email="mario.krenn@mpl.mpg.de, alan@aspuru.com", 13 | description="SELFIES (SELF-referencIng Embedded Strings) is a " 14 | "general-purpose, sequence-based, robust representation of " 15 | "semantically constrained graphs.", 16 | long_description=long_description, 17 | long_description_content_type="text/markdown", 18 | url="https://github.com/aspuru-guzik-group/selfies", 19 | packages=setuptools.find_packages(), 20 | classifiers=[ 21 | "Programming Language :: Python :: 3", 22 | "Programming Language :: Python :: 3.7", 23 | "Programming Language :: Python :: 3.8", 24 | "Programming Language :: Python :: 3.9", 25 | "Programming Language :: Python :: 3.10", 26 | "Programming Language :: Python :: 3 :: Only", 27 | "License :: OSI Approved :: Apache Software License", 28 | "Operating System :: OS Independent", 29 | ], 30 | python_requires='>=3.7' 31 | ) 32 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | .. selfies documentation master file, created by 2 | sphinx-quickstart on Sun Jun 14 23:40:28 2020. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to the SELFIES documentation! 7 | ===================================== 8 | 9 | SELFIES (SELF-referencIng Embedded Strings) is a 100% robust molecular string 10 | representation. A main objective is to use SELFIES as direct input into 11 | machine learning models, in particular in generative models, for the 12 | generation of outputs with guaranteed validity. 13 | 14 | This library is intended to be light-weight and easy to use. 15 | 16 | For explanation of the underlying principle (formal grammar) and experiments, 17 | please see the `original paper`_. 18 | 19 | For comments, bug reports or feature ideas, please use github issues 20 | or send an email to mario.krenn@utoronto.ca and alan@aspuru.com. 21 | 22 | .. _original paper: https://doi.org/10.1088/2632-2153/aba947 23 | 24 | Installation 25 | ############ 26 | 27 | Install SELFIES in the command line using pip: 28 | 29 | .. code-block:: 30 | 31 | $ pip install selfies 32 | 33 | 34 | .. 
toctree:: 35 | :maxdepth: 2 36 | :caption: Contents 37 | 38 | derivation 39 | tutorial.ipynb 40 | selfies 41 | 42 | Indices and tables 43 | ================== 44 | 45 | * :ref:`genindex` 46 | * :ref:`modindex` 47 | * :ref:`search` 48 | -------------------------------------------------------------------------------- /selfies/compatibility.py: -------------------------------------------------------------------------------- 1 | from selfies.utils.smiles_utils import atom_to_smiles, smiles_to_atom 2 | 3 | 4 | def modernize_symbol(symbol): 5 | """Converts a SELFIES symbol from 0: 50 | print(ii,': ',new_smiles) -------------------------------------------------------------------------------- /original_code_from_paper/bitflips_in_paper_Fig3.txt: -------------------------------------------------------------------------------- 1 | smiles: 2 | 1mutation 3 | CNC(C)CC1=CC=CNC(=C1)OCO2 : no 4 | CNC(C)CC1=CC=C2C(=C1FOCO2 : no 5 | CFC(C)CC1=CC=C2C(=C1)OCO2 : no 6 | 7 | 2mutation 8 | CNC(C)OC1=CC=C2C(=C1COCO2 : no 9 | CNC(C)CC1=CCOCCC(=C1)OCO2 : no 10 | CNC(C)#C1=CC=C2C(=C1)OCON : no 11 | 12 | 3mutation 13 | CNC(C)CC1=CCCC2#F=C1)OCO2 : no 14 | C=C(C#CC1=CC=C2C(=CN)OCO2 : no 15 | CNO(C)CC1=CC=C2C(#F1)OCO2 : no 16 | 17 | 18 | selfies 19 | 1mutation 20 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][C][Branch1_2][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CNC(C)CC1=CC=C2C(C1)OCO2 : yes 21 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][C][Branch1_3][=O][=C][=N][#N][O][C][O][Ring1][#N] - CNC(C)CC=CC=CC(=C=NN)OCO : yes 22 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][N][=C][C][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CNC(C)CC1=CN=C2C(=C1)OCO2 : yes 23 | 24 | CNC(C)CC1=CC=C2C(C1)OCO2 25 | CNC(C)CC=CC=CC(=C=NN)OCO 26 | CNC(C)CC1=CN=C2C(=C1)OCO2 27 | 28 | 2mutation 29 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][C][Branch1_3][Branch1_2][Branch1_3][Ring1][#N][O][C][O][Ring1][#N] - CNC(C)CC=CC=C1C(=NOCO1) : yes 30 | [C][N][C][Branch1_3][epsilon][C][C][=N][=C][C][=C][C][Branch1_3][=O][=C][Ring1][F][O][C][O][Ring1][#N] - CNC(C)C=NCC1=C2C(=C1)OCO2 : yes 31 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][Ring1][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - C1NC(C)CC=CC=C1O : yes 32 | 33 | CNC(C)CC=CC=C1C(=NOCO1) 34 | CNC(C)C=NCC1=C2C(=C1)OCO2 35 | C1NC(C)CC=CC=C1O 36 | 37 | 38 | 39 | 3mutation 40 | [C][N][C][Branch1_3][=C][C][C][C][=C][C][=C][Branch1_3][Branch1_3][=O][=C][Ring1][#N][=N][C][O][Ring1][#N] - CNC(CCC1=CC=C2(O))C1=NCO2 : yes 41 | [C][N][#C][Branch1_3][epsilon][C][C][#N][=C][C][=C][C][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CN=C(C)C#N : yes 42 | [C][N][=N][Branch1_3][epsilon][C][C][C][=N][Branch1_3][=C][C][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CN=NC1CC=NC(=C1)OCO : yes -------------------------------------------------------------------------------- /selfies/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | SELFIES: a robust representation of semantically constrained graphs with an 5 | example application in chemistry. 6 | 7 | SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, 8 | sequence-based, robust representation of semantically constrained graphs. 9 | It is based on a Chomsky type-2 grammar, augmented with two self-referencing 10 | functions. A main objective is to use SELFIES as direct input into machine 11 | learning models, in particular in generative models, for the generation of 12 | outputs with high validity. 
13 | 14 | The code presented here is a concrete application of SELFIES in chemistry, for 15 | the robust representation of molecules. 16 | 17 | Typical usage example: 18 | import selfies as sf 19 | 20 | benzene = "C1=CC=CC=C1" 21 | benzene_selfies = sf.encoder(benzene) 22 | benzene_smiles = sf.decoder(benzene_selfies) 23 | 24 | For comments, bug reports or feature ideas, please send an email to 25 | mario.krenn@utoronto.ca and alan@aspuru.com. 26 | """ 27 | 28 | __version__ = "2.2.0" 29 | 30 | __all__ = [ 31 | "encoder", 32 | "decoder", 33 | "get_preset_constraints", 34 | "get_semantic_robust_alphabet", 35 | "get_semantic_constraints", 36 | "set_semantic_constraints", 37 | "len_selfies", 38 | "split_selfies", 39 | "get_alphabet_from_selfies", 40 | "selfies_to_encoding", 41 | "batch_selfies_to_flat_hot", 42 | "encoding_to_selfies", 43 | "batch_flat_hot_to_selfies", 44 | "EncoderError", 45 | "DecoderError" 46 | ] 47 | 48 | from .bond_constraints import ( 49 | get_preset_constraints, 50 | get_semantic_constraints, 51 | get_semantic_robust_alphabet, 52 | set_semantic_constraints 53 | ) 54 | from .decoder import decoder 55 | from .encoder import encoder 56 | from .exceptions import DecoderError, EncoderError 57 | from .utils.encoding_utils import ( 58 | batch_flat_hot_to_selfies, 59 | batch_selfies_to_flat_hot, 60 | encoding_to_selfies, 61 | selfies_to_encoding 62 | ) 63 | from .utils.selfies_utils import ( 64 | get_alphabet_from_selfies, 65 | len_selfies, 66 | split_selfies 67 | ) 68 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for the Sphinx documentation builder. 2 | # 3 | # This file only contains a selection of the most common options. For a full 4 | # list see the documentation: 5 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 6 | 7 | # -- Path setup -------------------------------------------------------------- 8 | 9 | # If extensions (or modules to document with autodoc) are in another directory, 10 | # add these directories to sys.path here. If the directory is relative to the 11 | # documentation root, use os.path.abspath to make it absolute, like shown here. 12 | # 13 | import os 14 | import sys 15 | 16 | sys.path.insert(0, os.path.abspath("../..")) 17 | 18 | # -- Project information ----------------------------------------------------- 19 | 20 | project = "selfies" 21 | copyright = "2020, Mario Krenn" 22 | author = "Mario Krenn" 23 | 24 | # The full version, including alpha/beta/rc tags 25 | release = "2.0.0" 26 | 27 | # -- General configuration --------------------------------------------------- 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named "sphinx.ext.*") or your custom 31 | # ones. 32 | extensions = [ 33 | "sphinx.ext.autodoc", 34 | "sphinx_autodoc_typehints", 35 | "sphinx.ext.autosummary", 36 | "sphinx_rtd_theme", 37 | "nbsphinx", 38 | "sphinx.ext.mathjax", 39 | "sphinx.ext.viewcode" 40 | ] 41 | 42 | # Add any paths that contain templates here, relative to this directory. 43 | templates_path = ["_templates"] 44 | 45 | # List of patterns, relative to source directory, that match files and 46 | # directories to ignore when looking for source files. 47 | # This pattern also affects html_static_path and html_extra_path. 
48 | exclude_patterns = [] 49 | 50 | # -- Options for HTML output ------------------------------------------------- 51 | 52 | # The theme to use for HTML and HTML Help pages. See the documentation for 53 | # a list of builtin themes. 54 | # 55 | html_theme = "sphinx_rtd_theme" 56 | 57 | # Add any paths that contain custom static files (such as style sheets) here, 58 | # relative to this directory. They are copied after the builtin static files, 59 | # so a file named "default.css" will overwrite the builtin "default.css". 60 | html_static_path = [] 61 | -------------------------------------------------------------------------------- /selfies/utils/selfies_utils.py: -------------------------------------------------------------------------------- 1 | from typing import Iterable, Iterator, Set 2 | 3 | 4 | def len_selfies(selfies: str) -> int: 5 | """Returns the number of symbols in a given SELFIES string. 6 | 7 | :param selfies: a SELFIES string. 8 | :return: the symbol length of the SELFIES string. 9 | 10 | :Example: 11 | 12 | >>> import selfies as sf 13 | >>> sf.len_selfies("[C][=C][F].[C]") 14 | 5 15 | """ 16 | 17 | return selfies.count("[") + selfies.count(".") 18 | 19 | 20 | def split_selfies(selfies: str) -> Iterator[str]: 21 | """Tokenizes a SELFIES string into its individual symbols. 22 | 23 | :param selfies: a SELFIES string. 24 | :return: the symbols of the SELFIES string one-by-one with order preserved. 25 | 26 | :Example: 27 | 28 | >>> import selfies as sf 29 | >>> list(sf.split_selfies("[C][=C][F].[C]")) 30 | ['[C]', '[=C]', '[F]', '.', '[C]'] 31 | """ 32 | 33 | left_idx = selfies.find("[") 34 | 35 | while 0 <= left_idx < len(selfies): 36 | right_idx = selfies.find("]", left_idx + 1) 37 | if right_idx == -1: 38 | raise ValueError("malformed SELFIES string, hanging '[' bracket") 39 | 40 | next_symbol = selfies[left_idx: right_idx + 1] 41 | yield next_symbol 42 | 43 | left_idx = right_idx + 1 44 | if selfies[left_idx: left_idx + 1] == ".": 45 | yield "." 46 | left_idx += 1 47 | 48 | 49 | def get_alphabet_from_selfies(selfies_iter: Iterable[str]) -> Set[str]: 50 | """Constructs an alphabet from an iterable of SELFIES strings. 51 | 52 | The returned alphabet is the set of all symbols that appear in the 53 | SELFIES strings from the input iterable, minus the dot ``.`` symbol. 54 | 55 | :param selfies_iter: an iterable of SELFIES strings. 56 | :return: an alphabet of SELFIES symbols, built from the input iterable. 57 | 58 | :Example: 59 | 60 | >>> import selfies as sf 61 | >>> selfies_list = ["[C][F][O]", "[C].[O]", "[F][F]"] 62 | >>> alphabet = sf.get_alphabet_from_selfies(selfies_list) 63 | >>> sorted(list(alphabet)) 64 | ['[C]', '[F]', '[O]'] 65 | """ 66 | 67 | alphabet = set() 68 | for s in selfies_iter: 69 | for symbol in split_selfies(s): 70 | alphabet.add(symbol) 71 | alphabet.discard(".") 72 | return alphabet 73 | -------------------------------------------------------------------------------- /examples/vae_example/data_loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is to encode SMILES and SELFIES into one-hot encodings 3 | """ 4 | 5 | import numpy as np 6 | 7 | import selfies as sf 8 | 9 | 10 | def smile_to_hot(smile, largest_smile_len, alphabet): 11 | """Go from a single smile string to a one-hot encoding. 
12 | """ 13 | 14 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 15 | 16 | # pad with ' ' 17 | smile += ' ' * (largest_smile_len - len(smile)) 18 | 19 | # integer encode input smile 20 | integer_encoded = [char_to_int[char] for char in smile] 21 | 22 | # one hot-encode input smile 23 | onehot_encoded = list() 24 | for value in integer_encoded: 25 | letter = [0 for _ in range(len(alphabet))] 26 | letter[value] = 1 27 | onehot_encoded.append(letter) 28 | return integer_encoded, np.array(onehot_encoded) 29 | 30 | 31 | def multiple_smile_to_hot(smiles_list, largest_molecule_len, alphabet): 32 | """Convert a list of smile strings to a one-hot encoding 33 | 34 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 35 | """ 36 | 37 | hot_list = [] 38 | for s in smiles_list: 39 | _, onehot_encoded = smile_to_hot(s, largest_molecule_len, alphabet) 40 | hot_list.append(onehot_encoded) 41 | return np.array(hot_list) 42 | 43 | 44 | def selfies_to_hot(selfie, largest_selfie_len, alphabet): 45 | """Go from a single selfies string to a one-hot encoding. 46 | """ 47 | 48 | symbol_to_int = dict((c, i) for i, c in enumerate(alphabet)) 49 | 50 | # pad with [nop] 51 | selfie += '[nop]' * (largest_selfie_len - sf.len_selfies(selfie)) 52 | 53 | # integer encode 54 | symbol_list = sf.split_selfies(selfie) 55 | integer_encoded = [symbol_to_int[symbol] for symbol in symbol_list] 56 | 57 | # one hot-encode the integer encoded selfie 58 | onehot_encoded = list() 59 | for index in integer_encoded: 60 | letter = [0] * len(alphabet) 61 | letter[index] = 1 62 | onehot_encoded.append(letter) 63 | 64 | return integer_encoded, np.array(onehot_encoded) 65 | 66 | 67 | def multiple_selfies_to_hot(selfies_list, largest_molecule_len, alphabet): 68 | """Convert a list of selfies strings to a one-hot encoding 69 | """ 70 | 71 | hot_list = [] 72 | for s in selfies_list: 73 | _, onehot_encoded = selfies_to_hot(s, largest_molecule_len, alphabet) 74 | hot_list.append(onehot_encoded) 75 | return np.array(hot_list) 76 | -------------------------------------------------------------------------------- /tests/run_on_large_dataset.py: -------------------------------------------------------------------------------- 1 | """Script for testing selfies against large datasets. 
2 | """ 3 | 4 | import argparse 5 | import pathlib 6 | 7 | import pandas as pd 8 | from rdkit import Chem 9 | from tqdm import tqdm 10 | 11 | import selfies as sf 12 | 13 | parser = argparse.ArgumentParser() 14 | parser.add_argument("--data_path", type=str, default="version.smi.gz") 15 | parser.add_argument("--col_name", type=str, default="isosmiles") 16 | parser.add_argument("--sep", type=str, default=r"\s+") 17 | parser.add_argument("--start_from", type=int, default=0) 18 | args = parser.parse_args() 19 | 20 | TEST_DIR = pathlib.Path(__file__).parent 21 | TEST_SET_PATH = TEST_DIR / "test_sets" / args.data_path 22 | ERROR_LOG_DIR = TEST_DIR / "error_logs" 23 | ERROR_LOG_DIR.mkdir(exist_ok=True, parents=True) 24 | 25 | 26 | def make_reader(): 27 | return pd.read_csv(TEST_SET_PATH, sep=args.sep, chunksize=10000) 28 | 29 | 30 | def roundtrip_translation(): 31 | sf.set_semantic_constraints("hypervalent") 32 | 33 | n_entries = 0 34 | for chunk in make_reader(): 35 | n_entries += len(chunk) 36 | pbar = tqdm(total=n_entries) 37 | 38 | reader = make_reader() 39 | error_log = open(ERROR_LOG_DIR / f"{TEST_SET_PATH.stem}.txt", "a+") 40 | 41 | curr_idx = 0 42 | for chunk_idx, chunk in enumerate(reader): 43 | for in_smiles in chunk[args.col_name]: 44 | pbar.update(1) 45 | curr_idx += 1 46 | if curr_idx < args.start_from: 47 | continue 48 | 49 | in_smiles = in_smiles.strip() 50 | 51 | mol = Chem.MolFromSmiles(in_smiles, sanitize=True) 52 | if (mol is None) or ("*" in in_smiles): 53 | continue 54 | 55 | try: 56 | selfies = sf.encoder(in_smiles, strict=True) 57 | out_smiles = sf.decoder(selfies) 58 | except (sf.EncoderError, sf.DecoderError): 59 | error_log.write(in_smiles + "\n") 60 | tqdm.write(in_smiles) 61 | continue 62 | 63 | if not is_same_mol(in_smiles, out_smiles): 64 | error_log.write(in_smiles + "\n") 65 | tqdm.write(in_smiles) 66 | 67 | error_log.close() 68 | 69 | 70 | def is_same_mol(smiles1, smiles2): 71 | try: 72 | can_smiles1 = Chem.CanonSmiles(smiles1) 73 | can_smiles2 = Chem.CanonSmiles(smiles2) 74 | return can_smiles1 == can_smiles2 75 | except Exception: 76 | return False 77 | 78 | 79 | if __name__ == "__main__": 80 | roundtrip_translation() 81 | -------------------------------------------------------------------------------- /tests/test_on_datasets.py: -------------------------------------------------------------------------------- 1 | import faulthandler 2 | import pathlib 3 | import random 4 | 5 | import pandas as pd 6 | import pytest 7 | from rdkit import Chem 8 | 9 | import selfies as sf 10 | 11 | faulthandler.enable() 12 | 13 | TEST_SET_DIR = pathlib.Path(__file__).parent / "test_sets" 14 | ERROR_LOG_DIR = pathlib.Path(__file__).parent / "error_logs" 15 | ERROR_LOG_DIR.mkdir(exist_ok=True, parents=True) 16 | 17 | datasets = list(TEST_SET_DIR.glob("**/*.csv")) 18 | 19 | 20 | @pytest.mark.parametrize("test_path", datasets) 21 | def test_roundtrip_translation(test_path, dataset_samples): 22 | """Tests SMILES -> SELFIES -> SMILES translation on various datasets. 
23 | """ 24 | 25 | # very relaxed constraints 26 | constraints = sf.get_preset_constraints("hypervalent") 27 | constraints.update({"P": 7, "P-1": 8, "P+1": 6, "?": 12}) 28 | sf.set_semantic_constraints(constraints) 29 | 30 | error_path = ERROR_LOG_DIR / "{}.csv".format(test_path.stem) 31 | with open(error_path, "w+") as error_log: 32 | error_log.write("In, Out\n") 33 | 34 | error_data = [] 35 | error_found = False 36 | 37 | n_lines = sum(1 for _ in open(test_path)) - 1 38 | n_keep = dataset_samples if (0 < dataset_samples <= n_lines) else n_lines 39 | skip = random.sample(range(1, n_lines + 1), n_lines - n_keep) 40 | reader = pd.read_csv(test_path, chunksize=10000, header=0, skiprows=skip) 41 | 42 | for chunk in reader: 43 | 44 | for in_smiles in chunk["smiles"]: 45 | in_smiles = in_smiles.strip() 46 | 47 | mol = Chem.MolFromSmiles(in_smiles, sanitize=True) 48 | if (mol is None) or ("*" in in_smiles): 49 | continue 50 | 51 | try: 52 | selfies = sf.encoder(in_smiles, strict=True) 53 | out_smiles = sf.decoder(selfies) 54 | except (sf.EncoderError, sf.DecoderError): 55 | error_data.append((in_smiles, "")) 56 | continue 57 | 58 | if not is_same_mol(in_smiles, out_smiles): 59 | error_data.append((in_smiles, out_smiles)) 60 | 61 | with open(error_path, "a") as error_log: 62 | for entry in error_data: 63 | error_log.write(",".join(entry) + "\n") 64 | 65 | error_found = error_found or error_data 66 | error_data = [] 67 | 68 | sf.set_semantic_constraints() # restore constraints 69 | 70 | assert not error_found 71 | 72 | 73 | def is_same_mol(smiles1, smiles2): 74 | try: 75 | can_smiles1 = Chem.CanonSmiles(smiles1) 76 | can_smiles2 = Chem.CanonSmiles(smiles2) 77 | return can_smiles1 == can_smiles2 78 | except Exception: 79 | return False 80 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 98 | __pypackages__/ 99 | 100 | # Celery stuff 101 | celerybeat-schedule 102 | celerybeat.pid 103 | 104 | # SageMath parsed files 105 | *.sage.py 106 | 107 | # Environments 108 | .env 109 | .venv 110 | env/ 111 | venv/ 112 | ENV/ 113 | env.bak/ 114 | venv.bak/ 115 | 116 | # Spyder project settings 117 | .spyderproject 118 | .spyproject 119 | 120 | # Rope project settings 121 | .ropeproject 122 | 123 | # mkdocs documentation 124 | /site 125 | 126 | # mypy 127 | .mypy_cache/ 128 | .dmypy.json 129 | dmypy.json 130 | 131 | # Pyre type checker 132 | .pyre/ 133 | 134 | # pytype static type analyzer 135 | .pytype/ 136 | 137 | # Cython debug symbols 138 | cython_debug/ 139 | 140 | # IntelliJ 141 | .idea/ 142 | 143 | # Project-specific files 144 | error_logs/ 145 | *.dat 146 | sandbox 147 | version.smi.gz 148 | -------------------------------------------------------------------------------- /docs/source/selfies.rst: -------------------------------------------------------------------------------- 1 | API Reference 2 | ================== 3 | 4 | .. currentmodule:: selfies 5 | 6 | Core Functions 7 | ------------------------ 8 | .. autofunction:: encoder 9 | .. autofunction:: decoder 10 | 11 | Customization Functions 12 | ------------------------ 13 | 14 | The SELFIES grammar is derived dynamically from a set of semantic constraints, 15 | which assign bonding capacities to various atoms. 16 | By default, :mod:`selfies` operates under the following constraints: 17 | 18 | .. 
table:: 19 | :align: center 20 | 21 | +-----------+------------------------------+ 22 | | Max Bonds | Atom(s) | 23 | +===========+==============================+ 24 | | 1 | F, Cl, Br, I | 25 | +-----------+------------------------------+ 26 | | 2 | O | 27 | +-----------+------------------------------+ 28 | | 3 | B, N | 29 | +-----------+------------------------------+ 30 | | 4 | C | 31 | +-----------+------------------------------+ 32 | | 5 | P | 33 | +-----------+------------------------------+ 34 | | 6 | S | 35 | +-----------+------------------------------+ 36 | | 8 | All other atoms | 37 | +-----------+------------------------------+ 38 | 39 | The +1 and -1 charged versions of O, N, C, S, and P are also constrained, 40 | where a +1 increases the bonding capacity of the neutral atom by 1, 41 | and a -1 decreases the bonding capacity of the neutral atom by 1. 42 | For example, N+1 has a bonding capacity of :math:`3 + 1 = 4`, 43 | and N-1 has a bonding capacity of :math:`3 - 1 = 2`. The charged versions 44 | B+1 and B-1 are constrained to a capacity of 2 and 4 bonds, respectively. 45 | 46 | However, the default constraints are inadequate for SMILES strings that violate them. For 47 | example, nitrobenzene ``O=N(=O)C1=CC=CC=C1`` has a nitrogen with 6 bonds and 48 | the chlorate anion ``O=Cl(=O)[O-]`` has a chlorine with 5 bonds - these 49 | SMILES strings *cannot* be represented by SELFIES strings under the default constraints. 50 | Additionally, users may want to specify their own custom constraints. Thus, we 51 | provide the following methods for configuring the semantic constraints 52 | of :mod:`selfies`. 53 | 54 | .. warning:: 55 | 56 | SELFIES strings may be translated differently under different semantic constraints. 57 | Therefore, if custom semantic constraints are used, it is recommended to report 58 | them for reproducibility reasons. 59 | 60 | .. autofunction:: get_preset_constraints 61 | .. autofunction:: get_semantic_constraints 62 | .. autofunction:: set_semantic_constraints 63 | 64 | 65 | Utility Functions 66 | ------------------------ 67 | .. autofunction:: len_selfies 68 | .. autofunction:: split_selfies 69 | .. autofunction:: get_alphabet_from_selfies 70 | .. autofunction:: get_semantic_robust_alphabet 71 | .. autofunction:: selfies_to_encoding 72 | .. autofunction:: encoding_to_selfies 73 | .. autofunction:: batch_selfies_to_flat_hot 74 | .. autofunction:: batch_flat_hot_to_selfies 75 | 76 | 77 | Exceptions 78 | ------------------------ 79 | .. autoexception:: EncoderError 80 | .. 
autoexception:: DecoderError 81 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/GrammarVAE_grammar.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import numpy as np 3 | import six 4 | import pdb 5 | 6 | # the zinc grammar 7 | gram = """smiles -> chain 8 | atom -> bracket_atom 9 | atom -> aliphatic_organic 10 | atom -> aromatic_organic 11 | aliphatic_organic -> 'B' 12 | aliphatic_organic -> 'C' 13 | aliphatic_organic -> 'N' 14 | aliphatic_organic -> 'O' 15 | aliphatic_organic -> 'S' 16 | aliphatic_organic -> 'P' 17 | aliphatic_organic -> 'F' 18 | aliphatic_organic -> 'I' 19 | aliphatic_organic -> 'Cl' 20 | aliphatic_organic -> 'Br' 21 | aromatic_organic -> 'c' 22 | aromatic_organic -> 'n' 23 | aromatic_organic -> 'o' 24 | aromatic_organic -> 's' 25 | bracket_atom -> '[' BAI ']' 26 | BAI -> isotope symbol BAC 27 | BAI -> symbol BAC 28 | BAI -> isotope symbol 29 | BAI -> symbol 30 | BAC -> chiral BAH 31 | BAC -> BAH 32 | BAC -> chiral 33 | BAH -> hcount BACH 34 | BAH -> BACH 35 | BAH -> hcount 36 | BACH -> charge class 37 | BACH -> charge 38 | BACH -> class 39 | symbol -> aliphatic_organic 40 | symbol -> aromatic_organic 41 | isotope -> DIGIT 42 | isotope -> DIGIT DIGIT 43 | isotope -> DIGIT DIGIT DIGIT 44 | DIGIT -> '1' 45 | DIGIT -> '2' 46 | DIGIT -> '3' 47 | DIGIT -> '4' 48 | DIGIT -> '5' 49 | DIGIT -> '6' 50 | DIGIT -> '7' 51 | DIGIT -> '8' 52 | chiral -> '@' 53 | chiral -> '@@' 54 | hcount -> 'H' 55 | hcount -> 'H' DIGIT 56 | charge -> '-' 57 | charge -> '-' DIGIT 58 | charge -> '-' DIGIT DIGIT 59 | charge -> '+' 60 | charge -> '+' DIGIT 61 | charge -> '+' DIGIT DIGIT 62 | bond -> '-' 63 | bond -> '=' 64 | bond -> '#' 65 | bond -> '/' 66 | bond -> '\\' 67 | ringbond -> DIGIT 68 | ringbond -> bond DIGIT 69 | branched_atom -> atom 70 | branched_atom -> atom RB 71 | branched_atom -> atom BB 72 | branched_atom -> atom RB BB 73 | RB -> RB ringbond 74 | RB -> ringbond 75 | BB -> BB branch 76 | BB -> branch 77 | branch -> '(' chain ')' 78 | branch -> '(' bond chain ')' 79 | chain -> branched_atom 80 | chain -> chain branched_atom 81 | chain -> chain bond branched_atom 82 | Nothing -> None""" 83 | 84 | # form the CFG and get the start symbol 85 | GCFG = nltk.CFG.fromstring(gram) 86 | start_index = GCFG.productions()[0].lhs() 87 | 88 | # collect all lhs symbols, and the unique set of them 89 | all_lhs = [a.lhs().symbol() for a in GCFG.productions()] 90 | lhs_list = [] 91 | for a in all_lhs: 92 | if a not in lhs_list: 93 | lhs_list.append(a) 94 | 95 | D = len(GCFG.productions()) 96 | 97 | # this map tells us the rhs symbol indices for each production rule 98 | rhs_map = [None]*D 99 | count = 0 100 | for a in GCFG.productions(): 101 | rhs_map[count] = [] 102 | for b in a.rhs(): 103 | if not isinstance(b,six.string_types): 104 | s = b.symbol() 105 | rhs_map[count].extend(list(np.where(np.array(lhs_list) == s)[0])) 106 | count = count + 1 107 | 108 | masks = np.zeros((len(lhs_list),D)) 109 | count = 0 110 | 111 | # this tells us for each lhs symbol which productions rules should be masked 112 | for sym in lhs_list: 113 | is_in = np.array([a == sym for a in all_lhs], dtype=int).reshape(1,-1) 114 | masks[count] = is_in 115 | count = count + 1 116 | 117 | # this tells us the indices where the masks are equal to 1 118 | index_array = [] 119 | for i in range(masks.shape[1]): 120 | index_array.append(np.where(masks[:,i]==1)[0][0]) 121 | ind_of_ind = 
np.array(index_array) 122 | 123 | max_rhs = max([len(l) for l in rhs_map]) 124 | 125 | # rules 29 and 31 aren't used in the zinc data so we 126 | # 0 their masks so they can never be selected 127 | masks[:,29] = 0 128 | masks[:,31] = 0 129 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/data_loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is meant to go from various representations to 1HOT and back 3 | """ 4 | import numpy as np 5 | from GrammarVAE_codes import to_one_hot, prods_to_eq 6 | import GrammarVAE_grammar as zinc_grammar 7 | 8 | def unique_chars_iterator(smile): 9 | """ 10 | """ 11 | atoms = [] 12 | for i in range(len(smile)): 13 | atoms.append(smile[i]) 14 | return atoms 15 | 16 | 17 | 18 | def grammar_one_hot_to_smile(one_hot_ls): 19 | _grammar = zinc_grammar 20 | _productions = _grammar.GCFG.productions() 21 | 22 | # This is the generated grammar sequence 23 | grammar_seq = [[_productions[one_hot_ls[index,t].argmax()] 24 | for t in range(one_hot_ls.shape[1])] 25 | for index in range(one_hot_ls.shape[0])] 26 | #print(grammar_seq) 27 | smile = [prods_to_eq(prods) for prods in grammar_seq] 28 | 29 | return grammar_seq, smile 30 | 31 | 32 | def smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding): 33 | """ 34 | Go from a single smile string to a one-hot encoding. 35 | """ 36 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 37 | # integer encode input smile 38 | if type_of_encoding==0: 39 | for _ in range(largest_smile_len-len(smile)): 40 | smile+=' ' 41 | elif type_of_encoding==1: 42 | for _ in range(largest_smile_len-len(smile)): 43 | smile+=' ' 44 | elif type_of_encoding==2: 45 | for _ in range(largest_smile_len-len(smile)): 46 | smile+='A' 47 | 48 | integer_encoded = [char_to_int[char] for char in unique_chars_iterator(smile)] 49 | 50 | 51 | # one hot-encode input smile 52 | onehot_encoded = list() 53 | for value in integer_encoded: 54 | letter = [0 for _ in range(len(alphabet))] 55 | letter[value] = 1 56 | onehot_encoded.append(letter) 57 | return integer_encoded, np.array(onehot_encoded) 58 | 59 | 60 | def multiple_smile_to_hot(smiles_list, largest_smile_len, alphabet, type_of_encoding): 61 | """ 62 | Convert a list of smile strings to a one-hot encoding 63 | 64 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 65 | """ 66 | hot_list = [] 67 | for smile in smiles_list: 68 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding) 69 | hot_list.append(onehot_encoded) 70 | return np.array(hot_list) 71 | 72 | 73 | def hot_to_smile(onehot_encoded,alphabet): 74 | """ 75 | Go from one-hot encoding to smile string 76 | """ 77 | # From one-hot to integer encoding 78 | integer_encoded = onehot_encoded.argmax(1) 79 | 80 | int_to_char = dict((i, c) for i, c in enumerate(alphabet)) 81 | 82 | # integer encoding to smile 83 | regen_smile = "".join(int_to_char[x] for x in integer_encoded) 84 | regen_smile = regen_smile.strip() 85 | return regen_smile 86 | 87 | 88 | def check_conversion_bijection(smiles_list, largest_smile_len): 89 | """ 90 | This function should be called to check successful conversion to and from 91 | one-hot on a data set. 
92 | """ 93 | for i, smile in enumerate(smiles_list): 94 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len) 95 | regen_smile = hot_to_smile(onehot_encoded) 96 | # print('Original: ', smile, ' shape: ', len(smile)) 97 | # print('REcon: ', regen_smile , ' shape: ', len(regen_smile)) 98 | # return 99 | if smile != regen_smile: 100 | print('Filed conversion for: ', smile, ' @index: ', i) 101 | break 102 | print('All conditions passed!') 103 | 104 | -------------------------------------------------------------------------------- /selfies/utils/matching_utils.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | import itertools 3 | from collections import deque 4 | from typing import List, Optional 5 | 6 | 7 | def find_perfect_matching(graph: List[List[int]]) -> Optional[List[int]]: 8 | """Finds a perfect matching for an undirected graph (without self-loops). 9 | 10 | :param graph: an adjacency list representing the input graph. 11 | :return: a list representing a perfect matching, where j is the i-th 12 | element if nodes i and j are matched. Returns None, if the graph cannot 13 | be perfectly matched. 14 | """ 15 | 16 | # start with a maximal matching for efficiency 17 | matching = _greedy_matching(graph) 18 | 19 | unmatched = set(i for i in range(len(graph)) if matching[i] is None) 20 | while unmatched: 21 | 22 | # find augmenting path which starts at root 23 | root = unmatched.pop() 24 | path = _find_augmenting_path(graph, root, matching) 25 | 26 | if path is None: 27 | return None 28 | else: 29 | _flip_augmenting_path(matching, path) 30 | unmatched.discard(path[0]) 31 | unmatched.discard(path[-1]) 32 | 33 | return matching 34 | 35 | 36 | def _greedy_matching(graph): 37 | matching = [None] * len(graph) 38 | free_degrees = [len(graph[i]) for i in range(len(graph))] 39 | # free_degrees[i] = number of unmatched neighbors for node i 40 | 41 | # prioritize nodes with fewer unmatched neighbors 42 | node_pqueue = [(free_degrees[i], i) for i in range(len(graph))] 43 | heapq.heapify(node_pqueue) 44 | 45 | while node_pqueue: 46 | _, node = heapq.heappop(node_pqueue) 47 | 48 | if (matching[node] is not None) or (free_degrees[node] == 0): 49 | continue # node cannot be matched 50 | 51 | # match node with first unmatched neighbor 52 | mate = next(i for i in graph[node] if matching[i] is None) 53 | matching[node] = mate 54 | matching[mate] = node 55 | 56 | for adj in itertools.chain(graph[node], graph[mate]): 57 | free_degrees[adj] -= 1 58 | if (matching[adj] is None) and (free_degrees[adj] > 0): 59 | heapq.heappush(node_pqueue, (free_degrees[adj], adj)) 60 | 61 | return matching 62 | 63 | 64 | def _find_augmenting_path(graph, root, matching): 65 | assert matching[root] is None 66 | 67 | # run modified BFS to find path from root to unmatched node 68 | other_end = None 69 | node_queue = deque([root]) 70 | 71 | # parent BFS tree - None indicates an unvisited node 72 | parents = [None] * len(graph) 73 | parents[root] = [None, None] 74 | 75 | while node_queue: 76 | node = node_queue.popleft() 77 | 78 | for adj in graph[node]: 79 | if matching[adj] is None: # unmatched node 80 | if adj != root: # augmenting path found! 81 | parents[adj] = [node, adj] 82 | other_end = adj 83 | break 84 | else: 85 | adj_mate = matching[adj] 86 | if parents[adj_mate] is None: # adj_mate not visited 87 | parents[adj_mate] = [node, adj] 88 | node_queue.append(adj_mate) 89 | 90 | if other_end is not None: 91 | break # augmenting path found! 
92 | 93 | if other_end is None: 94 | return None 95 | else: 96 | path = [] 97 | node = other_end 98 | while node != root: 99 | path.append(parents[node][1]) 100 | path.append(parents[node][0]) 101 | node = parents[node][0] 102 | return path 103 | 104 | 105 | def _flip_augmenting_path(matching, path): 106 | for i in range(0, len(path), 2): 107 | a, b = path[i], path[i + 1] 108 | matching[a] = b 109 | matching[b] = a 110 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_selfies/one_hot_converter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Fri Jul 26 10:12:09 2019 5 | 6 | @author: akshat 7 | """ 8 | import numpy as np 9 | 10 | def unique_chars_iterator(smile): 11 | """ 12 | Iterate over the characters of a smile string. 13 | Note that 'Cl' & 'Br' are considered as one character 14 | """ 15 | atoms = [] 16 | for i in range(len(smile)): 17 | atoms.append(smile[i]) 18 | return atoms 19 | 20 | def smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding): 21 | """ 22 | Go from a single smile string to a one-hot encoding. 23 | """ 24 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 25 | # print('ENCODING: ', char_to_int) 26 | # integer encode input smile 27 | if type_of_encoding==0: 28 | for _ in range(largest_smile_len-len(smile)): 29 | smile+=' ' 30 | elif type_of_encoding==1: 31 | for _ in range(largest_smile_len-len(smile)): 32 | smile+=' ' 33 | elif type_of_encoding==2: 34 | for _ in range(largest_smile_len-len(smile)): 35 | smile+='A' 36 | 37 | integer_encoded = [char_to_int[char] for char in unique_chars_iterator(smile)] 38 | 39 | 40 | # one hot-encode input smile 41 | onehot_encoded = list() 42 | for value in integer_encoded: 43 | letter = [0 for _ in range(len(alphabet))] 44 | letter[value] = 1 45 | onehot_encoded.append(letter) 46 | return integer_encoded, np.array(onehot_encoded) 47 | 48 | 49 | def multiple_smile_to_hot(smiles_list, largest_smile_len, alphabet, type_of_encoding): 50 | """ 51 | Convert a list of smile strings to a one-hot encoding 52 | 53 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 54 | """ 55 | hot_list = [] 56 | for smile in smiles_list: 57 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding) 58 | hot_list.append(onehot_encoded) 59 | return np.array(hot_list) 60 | 61 | 62 | def hot_to_smile(onehot_encoded, alphabet): 63 | """ 64 | Go from one-hot encoding to smile string 65 | """ 66 | # From one-hot to integer encoding 67 | integer_encoded = onehot_encoded.argmax(1) 68 | # print('integer_encoded ', integer_encoded) 69 | 70 | int_to_char = dict((i, c) for i, c in enumerate(alphabet)) 71 | # print('DECODING: ', int_to_char) 72 | # integer encoding to smile 73 | regen_smile = "".join(int_to_char[x] for x in integer_encoded) 74 | regen_smile = regen_smile.strip() 75 | return regen_smile 76 | 77 | 78 | def check_conversion_bijection(smiles_list, largest_smile_len, alphabet): 79 | """ 80 | This function should be called to check successful conversion to and from 81 | one-hot on a data set. 
82 | """ 83 | for i, smile in enumerate(smiles_list): 84 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding=0) 85 | regen_smile = hot_to_smile(onehot_encoded, alphabet) 86 | # print('Original: ', smile, ' shape: ', len(smile)) 87 | # print('REcon: ', regen_smile , ' shape: ', len(regen_smile)) 88 | if smile != regen_smile: 89 | print('Filed conversion for: ', smile, ' @index: ', i) 90 | raise Exception('FAILEDDDD!!!') 91 | print('All conditions passed!') 92 | 93 | 94 | 95 | #with open('smiles_qm9.txt') as f: 96 | # content = f.readlines() 97 | #content = content[1:] 98 | #content = [x.strip() for x in content] 99 | #A = [x.split(',')[1] for x in content] 100 | # 101 | #alphabets = ['N', '1', '(', '#', 'C', '3', '5', 'O', '2', 'F', '=', '4', ')', ' '] 102 | # 103 | #data = multiple_smile_to_hot(A, len(max(A, key=len)), alphabets, 0) 104 | 105 | #check_conversion_bijection(smiles_list=A, largest_smile_len=len(max(A, key=len)), alphabet=alphabets) 106 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_smiles/one_hot_converter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Fri Jul 26 10:12:09 2019 5 | 6 | @author: akshat 7 | """ 8 | import numpy as np 9 | 10 | def unique_chars_iterator(smile): 11 | """ 12 | Iterate over the characters of a smile string. 13 | Note that 'Cl' & 'Br' are considered as one character 14 | """ 15 | atoms = [] 16 | for i in range(len(smile)): 17 | atoms.append(smile[i]) 18 | return atoms 19 | 20 | def smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding): 21 | """ 22 | Go from a single smile string to a one-hot encoding. 
23 | """ 24 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 25 | # print('ENCODING: ', char_to_int) 26 | # integer encode input smile 27 | if type_of_encoding==0: 28 | for _ in range(largest_smile_len-len(smile)): 29 | smile+=' ' 30 | elif type_of_encoding==1: 31 | for _ in range(largest_smile_len-len(smile)): 32 | smile+=' ' 33 | elif type_of_encoding==2: 34 | for _ in range(largest_smile_len-len(smile)): 35 | smile+='A' 36 | 37 | integer_encoded = [char_to_int[char] for char in unique_chars_iterator(smile)] 38 | 39 | 40 | # one hot-encode input smile 41 | onehot_encoded = list() 42 | for value in integer_encoded: 43 | letter = [0 for _ in range(len(alphabet))] 44 | letter[value] = 1 45 | onehot_encoded.append(letter) 46 | return integer_encoded, np.array(onehot_encoded) 47 | 48 | 49 | def multiple_smile_to_hot(smiles_list, largest_smile_len, alphabet, type_of_encoding): 50 | """ 51 | Convert a list of smile strings to a one-hot encoding 52 | 53 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 54 | """ 55 | hot_list = [] 56 | for smile in smiles_list: 57 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding) 58 | hot_list.append(onehot_encoded) 59 | return np.array(hot_list) 60 | 61 | 62 | def hot_to_smile(onehot_encoded, alphabet): 63 | """ 64 | Go from one-hot encoding to smile string 65 | """ 66 | # From one-hot to integer encoding 67 | integer_encoded = onehot_encoded.argmax(1) 68 | # print('integer_encoded ', integer_encoded) 69 | 70 | int_to_char = dict((i, c) for i, c in enumerate(alphabet)) 71 | # print('DECODING: ', int_to_char) 72 | # integer encoding to smile 73 | regen_smile = "".join(int_to_char[x] for x in integer_encoded) 74 | regen_smile = regen_smile.strip() 75 | return regen_smile 76 | 77 | 78 | def check_conversion_bijection(smiles_list, largest_smile_len, alphabet): 79 | """ 80 | This function should be called to check successful conversion to and from 81 | one-hot on a data set. 
82 | """ 83 | for i, smile in enumerate(smiles_list): 84 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding=0) 85 | regen_smile = hot_to_smile(onehot_encoded, alphabet) 86 | # print('Original: ', smile, ' shape: ', len(smile)) 87 | # print('REcon: ', regen_smile , ' shape: ', len(regen_smile)) 88 | if smile != regen_smile: 89 | print('Filed conversion for: ', smile, ' @index: ', i) 90 | raise Exception('FAILEDDDD!!!') 91 | print('All conditions passed!') 92 | 93 | 94 | 95 | #with open('smiles_qm9.txt') as f: 96 | # content = f.readlines() 97 | #content = content[1:] 98 | #content = [x.strip() for x in content] 99 | #A = [x.split(',')[1] for x in content] 100 | # 101 | #alphabets = ['N', '1', '(', '#', 'C', '3', '5', 'O', '2', 'F', '=', '4', ')', ' '] 102 | # 103 | #data = multiple_smile_to_hot(A, len(max(A, key=len)), alphabets, 0) 104 | 105 | #check_conversion_bijection(smiles_list=A, largest_smile_len=len(max(A, key=len)), alphabet=alphabets) 106 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/GrammarVAE_codes.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import pdb 3 | #import zinc_grammar 4 | import numpy as np 5 | import h5py 6 | #import molecule_vae 7 | 8 | import GrammarVAE_grammar as zinc_grammar 9 | 10 | #f = open('data/250k_rndm_zinc_drugs_clean.smi','r') 11 | #L = [] 12 | # 13 | #count = -1 14 | #for line in f: 15 | # line = line.strip() 16 | # L.append(line) # The zinc data set 17 | #f.close() 18 | # 19 | NCHARS = len(zinc_grammar.GCFG.productions()) 20 | 21 | 22 | 23 | def prods_to_eq(prods): 24 | seq = [prods[0].lhs()] 25 | for prod in prods: 26 | if str(prod.lhs()) == 'Nothing': 27 | break 28 | for ix, s in enumerate(seq): 29 | if s == prod.lhs(): 30 | seq = seq[:ix] + list(prod.rhs()) + seq[ix+1:] 31 | break 32 | try: 33 | return ''.join(seq) 34 | except: 35 | return '' 36 | 37 | 38 | def get_zinc_tokenizer(cfg): 39 | long_tokens = [a for a in list(cfg._lexical_index.keys()) if len(a) > 1] 40 | replacements = ['$','%','^'] # ,'&'] 41 | assert len(long_tokens) == len(replacements) 42 | for token in replacements: 43 | assert token not in cfg._lexical_index 44 | 45 | def tokenize(smiles): 46 | for i, token in enumerate(long_tokens): 47 | smiles = smiles.replace(token, replacements[i]) 48 | tokens = [] 49 | for token in smiles: 50 | try: 51 | ix = replacements.index(token) 52 | tokens.append(long_tokens[ix]) 53 | except: 54 | tokens.append(token) 55 | return tokens 56 | 57 | return tokenize 58 | 59 | 60 | def to_one_hot(smiles, MaxNumSymbols, check=True): 61 | """ Encode a list of smiles strings to one-hot vectors """ 62 | assert type(smiles) == list 63 | prod_map = {} 64 | for ix, prod in enumerate(zinc_grammar.GCFG.productions()): 65 | prod_map[prod] = ix 66 | tokenize = get_zinc_tokenizer(zinc_grammar.GCFG) 67 | tokens = list(map(tokenize, smiles)) 68 | parser = nltk.ChartParser(zinc_grammar.GCFG) 69 | parse_trees = [next(parser.parse(t)) for t in tokens] 70 | productions_seq = [tree.productions() for tree in parse_trees] 71 | 72 | #if check: 73 | # print(productions_seq) 74 | 75 | indices = [np.array([prod_map[prod] for prod in entry], dtype=int) for entry in productions_seq] 76 | one_hot = np.zeros((len(indices), MaxNumSymbols, NCHARS), dtype=np.float32) 77 | for i in range(len(indices)): 78 | num_productions = len(indices[i]) 79 | one_hot[i][np.arange(num_productions),indices[i]] = 1. 
80 | one_hot[i][np.arange(num_productions, MaxNumSymbols),-1] = 1. 81 | return one_hot 82 | 83 | 84 | 85 | def SizeOneHot(smiles, check=True): 86 | """ Encode a list of smiles strings to one-hot vectors """ 87 | assert type(smiles) == list 88 | prod_map = {} 89 | for ix, prod in enumerate(zinc_grammar.GCFG.productions()): 90 | prod_map[prod] = ix 91 | tokenize = get_zinc_tokenizer(zinc_grammar.GCFG) 92 | tokens = list(map(tokenize, smiles)) 93 | parser = nltk.ChartParser(zinc_grammar.GCFG) 94 | parse_trees = [next(parser.parse(t)) for t in tokens] 95 | productions_seq = [tree.productions() for tree in parse_trees] 96 | 97 | indices = [np.array([prod_map[prod] for prod in entry], dtype=int) for entry in productions_seq] 98 | return len(indices[0]) 99 | 100 | 101 | # SINGLE EXAMPLE 102 | #smile = [L[0]] 103 | ##smile = ['C'] 104 | #one_hot_single = to_one_hot(smile, ) 105 | #print(one_hot_single.shape) 106 | #print(one_hot_single) 107 | 108 | 109 | # GOING THROUGH ALL OF ZINC.... 110 | 111 | #OH = np.zeros((len(L),MAX_LEN,NCHARS)) 112 | #for i in range(0, len(L), 100): 113 | # print('Processing: i=[' + str(i) + ':' + str(i+100) + ']') 114 | # onehot = to_one_hot(L[i:i+100], False) 115 | # OH[i:i+100,:,:] = onehot 116 | # 117 | #h5f = h5py.File('zinc_grammar_dataset.h5','w') 118 | #h5f.create_dataset('data', data=OH) 119 | #h5f.close() 120 | -------------------------------------------------------------------------------- /tests/test_selfies_utils.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | import selfies as sf 4 | 5 | 6 | class Entry: 7 | 8 | def __init__(self, selfies, symbols, label, one_hot): 9 | self.selfies = selfies 10 | self.symbols = symbols 11 | self.label = label 12 | self.one_hot = one_hot 13 | 14 | 15 | @pytest.fixture() 16 | def dataset(): 17 | stoi = {"[nop]": 0, "[O]": 1, ".": 2, "[C]": 3, "[F]": 4} 18 | itos = {i: c for c, i in stoi.items()} 19 | pad_to_len = 4 20 | 21 | entries = [ 22 | Entry(selfies="", 23 | symbols=[], 24 | label=[0, 0, 0, 0], 25 | one_hot=[[1, 0, 0, 0, 0], 26 | [1, 0, 0, 0, 0], 27 | [1, 0, 0, 0, 0], 28 | [1, 0, 0, 0, 0]]), 29 | Entry(selfies="[C][C][C]", 30 | symbols=["[C]", "[C]", "[C]"], 31 | label=[3, 3, 3, 0], 32 | one_hot=[[0, 0, 0, 1, 0], 33 | [0, 0, 0, 1, 0], 34 | [0, 0, 0, 1, 0], 35 | [1, 0, 0, 0, 0]]), 36 | Entry(selfies="[C].[C]", 37 | symbols=["[C]", ".", "[C]"], 38 | label=[3, 2, 3, 0], 39 | one_hot=[[0, 0, 0, 1, 0], 40 | [0, 0, 1, 0, 0], 41 | [0, 0, 0, 1, 0], 42 | [1, 0, 0, 0, 0]]), 43 | Entry(selfies="[C][O][C][F]", 44 | symbols=["[C]", "[O]", "[C]", "[F]"], 45 | label=[3, 1, 3, 4], 46 | one_hot=[[0, 0, 0, 1, 0], 47 | [0, 1, 0, 0, 0], 48 | [0, 0, 0, 1, 0], 49 | [0, 0, 0, 0, 1]]), 50 | Entry(selfies="[C][O][C]", 51 | symbols=["[C]", "[O]", "[C]"], 52 | label=[3, 1, 3, 0], 53 | one_hot=[[0, 0, 0, 1, 0], 54 | [0, 1, 0, 0, 0], 55 | [0, 0, 0, 1, 0], 56 | [1, 0, 0, 0, 0]]) 57 | ] 58 | 59 | return entries, (stoi, itos, pad_to_len) 60 | 61 | 62 | @pytest.fixture() 63 | def dataset_flat_hots(dataset): 64 | flat_hots = [] 65 | for entry in dataset[0]: 66 | hot = [elm for vec in entry.one_hot for elm in vec] 67 | flat_hots.append(hot) 68 | return flat_hots 69 | 70 | 71 | def test_len_selfies(dataset): 72 | for entry in dataset[0]: 73 | assert sf.len_selfies(entry.selfies) == len(entry.symbols) 74 | 75 | 76 | def test_split_selfies(dataset): 77 | for entry in dataset[0]: 78 | assert list(sf.split_selfies(entry.selfies)) == entry.symbols 79 | 80 | 81 | def 
test_get_alphabet_from_selfies(dataset): 82 | entries, (vocab_stoi, _, _) = dataset 83 | 84 | selfies = [entry.selfies for entry in entries] 85 | alphabet = sf.get_alphabet_from_selfies(selfies) 86 | alphabet.add("[nop]") 87 | alphabet.add(".") 88 | 89 | assert alphabet == set(vocab_stoi.keys()) 90 | 91 | 92 | def test_selfies_to_encoding(dataset): 93 | entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset 94 | 95 | for entry in entries: 96 | label, one_hot = sf.selfies_to_encoding( 97 | entry.selfies, vocab_stoi, pad_to_len, "both" 98 | ) 99 | 100 | assert label == entry.label 101 | assert one_hot == entry.one_hot 102 | 103 | # recover original selfies 104 | selfies = sf.encoding_to_selfies(label, vocab_itos, "label") 105 | selfies = selfies.replace("[nop]", "") 106 | assert selfies == entry.selfies 107 | 108 | selfies = sf.encoding_to_selfies(one_hot, vocab_itos, "one_hot") 109 | selfies = selfies.replace("[nop]", "") 110 | assert selfies == entry.selfies 111 | 112 | 113 | def test_selfies_to_flat_hot(dataset, dataset_flat_hots): 114 | entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset 115 | 116 | batch = [entry.selfies for entry in entries] 117 | flat_hots = sf.batch_selfies_to_flat_hot(batch, vocab_stoi, pad_to_len) 118 | 119 | assert flat_hots == dataset_flat_hots 120 | 121 | # recover original selfies 122 | recovered = sf.batch_flat_hot_to_selfies(flat_hots, vocab_itos) 123 | assert batch == [s.replace("[nop]", "") for s in recovered] 124 | -------------------------------------------------------------------------------- /original_code_from_paper/bitflip_from_mdma.py: -------------------------------------------------------------------------------- 1 | # 2 | # Self-Referencing Embedded Strings (SELFIES): 3 | # A 100% robust molecular string representation 4 | # (https://arxiv.org/abs/1905.13741) 5 | # by Mario Krenn, Florian Haese, AkshatKumar Nigam, 6 | # Pascal Friederich, Alán Aspuru-Guzik 7 | # 8 | # Demo of Rubustness of SMILES and SELFIES and DeepSMILES 9 | # 10 | # Generates 1000 cases of 1, 2 or 3 mutations of small bio-molecule (MDMA). 
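# Running the script directly (e.g. `python bitflip_from_mdma.py`) prints, for each
# representation (SMILES, SELFIES, DeepSMILES), the fraction of mutated strings that
# still correspond to a valid molecule.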
11 | # The alphabets are those that can translate the QM0 dataset (and have been used in all experiments in the paper) 12 | # 13 | # 14 | # questions/remarks: mario.krenn@utoronto.ca or alan@aspuru.com 15 | # 16 | # 11.03.2020 17 | # 18 | # Requirements: RDKit 19 | # selfies (pip install selfies) 20 | # DeepSMILES (pip install --upgrade deepsmiles) 21 | # 22 | # 23 | 24 | 25 | from rdkit.Chem import MolFromSmiles 26 | from rdkit import rdBase 27 | 28 | from random import randint 29 | from selfies import encoder, decoder 30 | 31 | import deepsmiles 32 | 33 | rdBase.DisableLog('rdApp.error') 34 | 35 | 36 | def IsCorrectSMILES(smiles): 37 | if len(smiles)==0: 38 | resMol=None 39 | else: 40 | try: 41 | resMol=MolFromSmiles(smiles, sanitize=True) 42 | except Exception: 43 | resMol=None 44 | 45 | if resMol==None: 46 | return 0 47 | else: 48 | return 1 49 | 50 | 51 | def tokenize_selfies(selfies): 52 | location=selfies.find(']') 53 | all_tokens=[] 54 | while location>=0: 55 | all_tokens.append(selfies[0:location+1]) 56 | selfies=selfies[location+1:] 57 | location=selfies.find(']') 58 | 59 | return all_tokens 60 | 61 | 62 | 63 | def detokenize_selfies(selfies_list): 64 | selfies='' 65 | for ii in range(len(selfies_list)): 66 | selfies=selfies+selfies_list[ii] 67 | 68 | return selfies 69 | 70 | mdma='CNC(C)CC1=CC=C2C(=C1)OCO2' 71 | smiles_symbols='FONC()=#12345' # with this alphabet, the whole QM9 db can be translated (except of ions and stereochemistry) 72 | 73 | print('\n\n\n') 74 | print('SMILES: ',mdma,'\n') 75 | 76 | num_repeat=1000 77 | for c_num_of_mut in range(3): 78 | single_mut_err=0 79 | for c_muts in range(num_repeat): 80 | 81 | new_mdma=mdma 82 | for ii in range(c_num_of_mut+1): 83 | mol_idx=randint(0,len(new_mdma)-1) 84 | symbol_idx=randint(0,len(smiles_symbols)-1) 85 | 86 | new_mdma=new_mdma[0:mol_idx]+smiles_symbols[symbol_idx]+new_mdma[mol_idx+1:] 87 | 88 | res_new=IsCorrectSMILES(new_mdma) 89 | if res_new==0: 90 | single_mut_err=single_mut_err+1 91 | 92 | print(c_num_of_mut+1, 'mutations with SMILES. Correct: ', num_repeat-single_mut_err, '/', num_repeat, '=', 1-single_mut_err/num_repeat) 93 | 94 | 95 | 96 | # SELFIES code 97 | mdma_selfies=encoder(mdma) 98 | print('\n\n\n') 99 | print('SELFIES: ',mdma_selfies,'\n') 100 | mdma_selfies_tok=tokenize_selfies(mdma_selfies) 101 | selfies_symbols=['[epsilon]','[Ring1]','[Ring2]','[Branch1_1]','[Branch1_2]','[Branch1_3]','[F]','[O]','[=O]','[N]','[=N]','[#N]','[C]','[=C]','[#C]']; #with this alphabet, the whole QM9 db can be translated (except of ions and stereochemistry) 102 | 103 | num_repeat=1000 104 | for c_num_of_mut in range(3): 105 | single_mut_err=0 106 | for c_muts in range(num_repeat): 107 | 108 | new_mdma=mdma_selfies_tok 109 | for ii in range(c_num_of_mut+1): 110 | mol_idx=randint(0,len(new_mdma)-1) 111 | symbol_idx=randint(0,len(selfies_symbols)-1) 112 | 113 | new_mdma_str=detokenize_selfies(new_mdma[0:mol_idx]) 114 | new_mdma_str=new_mdma_str+selfies_symbols[symbol_idx] 115 | new_mdma_str=new_mdma_str+detokenize_selfies(new_mdma[mol_idx+1:]) 116 | new_mdma=tokenize_selfies(new_mdma_str) 117 | 118 | mutated_selfies=detokenize_selfies(new_mdma) 119 | mutated_smiles=decoder(mutated_selfies) 120 | res_new=IsCorrectSMILES(mutated_smiles) 121 | 122 | if res_new==0: 123 | single_mut_err=single_mut_err+1 124 | 125 | if c_muts>0 and c_muts%1000==0: 126 | print('Iteration: ', c_muts, '/', num_repeat) 127 | 128 | 129 | print(c_num_of_mut+1, 'mutations with SELFIES. 
Correct: ', num_repeat-single_mut_err, '/', num_repeat, '=', 1-single_mut_err/num_repeat) 130 | 131 | 132 | 133 | # DeepSMILES code 134 | deepsmiles_symbols='FONC)=#3456789' # with this alphabet, the whole QM9 db can be translated (except of ions and stereochemistry) 135 | converter = deepsmiles.Converter(rings=True, branches=True) 136 | 137 | mdma_deepsmiles=converter.encode(mdma) 138 | print('\n\n\n') 139 | print('DeepSMILES: ',mdma_deepsmiles,'\n') 140 | 141 | num_repeat=1000 142 | for c_num_of_mut in range(3): 143 | single_mut_err=0 144 | for c_muts in range(num_repeat): 145 | 146 | new_mdma=mdma_deepsmiles 147 | for ii in range(c_num_of_mut+1): 148 | mol_idx=randint(0,len(new_mdma)-1) 149 | symbol_idx=randint(0,len(deepsmiles_symbols)-1) 150 | 151 | new_mdma=new_mdma[0:mol_idx]+deepsmiles_symbols[symbol_idx]+new_mdma[mol_idx+1:] 152 | 153 | try: 154 | mutated_smiles=converter.decode(new_mdma) 155 | except Exception: 156 | mutated_smiles='err' 157 | 158 | res_new=IsCorrectSMILES(mutated_smiles) 159 | if res_new==0: 160 | single_mut_err=single_mut_err+1 161 | 162 | print(c_num_of_mut+1, 'mutations with DeepSMILES. Correct: ', num_repeat-single_mut_err, '/', num_repeat, '=', 1-single_mut_err/num_repeat) 163 | -------------------------------------------------------------------------------- /tests/test_selfies.py: -------------------------------------------------------------------------------- 1 | import faulthandler 2 | import random 3 | 4 | import pytest 5 | from rdkit.Chem import MolFromSmiles 6 | 7 | import selfies as sf 8 | 9 | faulthandler.enable() 10 | 11 | 12 | @pytest.fixture() 13 | def max_selfies_len(): 14 | return 1000 15 | 16 | 17 | @pytest.fixture() 18 | def large_alphabet(): 19 | alphabet = sf.get_semantic_robust_alphabet() 20 | alphabet.update([ 21 | "[#Br]", "[#Branch1]", "[#Branch2]", "[#Branch3]", "[#C@@H1]", 22 | "[#C@@]", "[#C@H1]", "[#C@]", "[#C]", "[#Cl]", "[#F]", "[#H]", "[#I]", 23 | "[#NH1]", "[#N]", "[#O]", "[#P]", "[#Ring1]", "[#Ring2]", "[#Ring3]", 24 | "[#S]", "[/Br]", "[/C@@H1]", "[/C@@]", "[/C@H1]", "[/C@]", "[/C]", 25 | "[/Cl]", "[/F]", "[/H]", "[/I]", "[/NH1]", "[/N]", "[/O]", "[/P]", 26 | "[/S]", "[=Br]", "[=Branch1]", "[=Branch2]", "[=Branch3]", "[=C@@H1]", 27 | "[=C@@]", "[=C@H1]", "[=C@]", "[=C]", "[=Cl]", "[=F]", "[=H]", "[=I]", 28 | "[=NH1]", "[=N]", "[=O]", "[=P]", "[=Ring1]", "[=Ring2]", "[=Ring3]", 29 | "[=S]", "[Br]", "[Branch1]", "[Branch2]", "[Branch3]", "[C@@H1]", 30 | "[C@@]", "[C@H1]", "[C@]", "[C]", "[Cl]", "[F]", "[H]", "[I]", "[NH1]", 31 | "[N]", "[O]", "[P]", "[Ring1]", "[Ring2]", "[Ring3]", "[S]", "[\\Br]", 32 | "[\\C@@H1]", "[\\C@@]", "[\\C@H1]", "[\\C@]", "[\\C]", "[\\Cl]", 33 | "[\\F]", "[\\H]", "[\\I]", "[\\NH1]", "[\\N]", "[\\O]", "[\\P]", 34 | "[\\S]", "[nop]" 35 | ]) 36 | return list(alphabet) 37 | 38 | 39 | def test_random_selfies_decoder(trials, max_selfies_len, large_alphabet): 40 | """Tests that SELFIES that are generated by randomly stringing together 41 | symbols from the SELFIES alphabet are decoded into valid SMILES. 
42 | """ 43 | 44 | alphabet = tuple(large_alphabet) 45 | 46 | for _ in range(trials): 47 | 48 | # create random SELFIES and decode 49 | rand_len = random.randint(1, max_selfies_len) 50 | rand_selfies = "".join(random_choices(alphabet, k=rand_len)) 51 | smiles = sf.decoder(rand_selfies) 52 | 53 | # check if SMILES is valid 54 | try: 55 | is_valid = MolFromSmiles(smiles, sanitize=True) is not None 56 | except Exception: 57 | is_valid = False 58 | 59 | err_msg = "SMILES: {}\n\t SELFIES: {}".format(smiles, rand_selfies) 60 | assert is_valid, err_msg 61 | 62 | 63 | def test_nop_symbol_decoder(max_selfies_len, large_alphabet): 64 | """Tests that the '[nop]' symbol is always skipped over. 65 | """ 66 | 67 | alphabet = list(large_alphabet) 68 | alphabet.remove("[nop]") 69 | 70 | for _ in range(100): 71 | 72 | # create random SELFIES with and without [nop] 73 | rand_len = random.randint(1, max_selfies_len) 74 | rand_mol = random_choices(alphabet, k=rand_len) 75 | rand_mol.extend(["[nop]"] * (max_selfies_len - rand_len)) 76 | random.shuffle(rand_mol) 77 | 78 | with_nops = "".join(rand_mol) 79 | without_nops = with_nops.replace("[nop]", "") 80 | 81 | assert sf.decoder(with_nops) == sf.decoder(without_nops) 82 | 83 | 84 | def test_get_semantic_constraints(): 85 | constraints = sf.get_semantic_constraints() 86 | assert constraints is not sf.get_semantic_constraints() # not alias 87 | assert "?" in constraints 88 | 89 | 90 | def test_change_constraints_cache_clear(): 91 | alphabet = sf.get_semantic_robust_alphabet() 92 | assert alphabet == sf.get_semantic_robust_alphabet() 93 | assert sf.decoder("[C][#C]") == "C#C" 94 | 95 | new_constraints = sf.get_semantic_constraints() 96 | new_constraints["C"] = 1 97 | sf.set_semantic_constraints(new_constraints) 98 | 99 | new_alphabet = sf.get_semantic_robust_alphabet() 100 | assert new_alphabet != alphabet 101 | assert sf.decoder("[C][#C]") == "CC" 102 | 103 | sf.set_semantic_constraints() # re-set alphabet 104 | 105 | 106 | def test_invalid_or_unsupported_smiles_encoder(): 107 | malformed_smiles = [ 108 | "", 109 | "(", 110 | "C(Cl)(Cl)CC[13C", 111 | "C(CCCOC", 112 | "C=(CCOC", 113 | "CCCC)", 114 | "C1CCCCC", 115 | "C(F)(F)(F)(F)(F)F", # violates bond constraints 116 | "C=C1=CCCCCC1", # violates bond constraints 117 | "CC*CC", # uses wildcard 118 | "C$C", # uses $ bond 119 | "S[As@TB1](F)(Cl)(Br)N", # unrecognized chirality, 120 | "SOMETHINGWRONGHERE", 121 | "1243124124", 122 | ] 123 | 124 | for smiles in malformed_smiles: 125 | with pytest.raises(sf.EncoderError): 126 | sf.encoder(smiles) 127 | 128 | 129 | def test_malformed_selfies_decoder(): 130 | with pytest.raises(sf.DecoderError): 131 | sf.decoder("[O][=C][O][C][C][C][C][O][N][Branch2_3") 132 | 133 | 134 | def random_choices(population, k): # random.choices was new in Python v3.6 135 | return [random.choice(population) for _ in range(k)] 136 | 137 | 138 | def test_decoder_attribution(): 139 | sm, am = sf.decoder( 140 | "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]", attribute=True) 141 | # check that P lined up 142 | for ta in am: 143 | if ta.token == 'P': 144 | for a in ta.attribution: 145 | if a.token == '[P]': 146 | return 147 | raise ValueError('Failed to find P in attribution map') 148 | 149 | 150 | def test_encoder_attribution(): 151 | smiles = "C1([O-])C=CC=C1Cl" 152 | indices = [0, 3, 3, 3, 5, 7, 8, 10, None, None, 12] 153 | s, am = sf.encoder(smiles, attribute=True) 154 | for i, ta in enumerate(am): 155 | if ta.attribution: 156 | assert indices[i] == ta.attribution[0].index, \ 157 | f'found 
{ta[1]}; should be {indices[i]}' 158 | if ta.token == '[Cl]': 159 | assert 'Cl' in [ 160 | a.token for a in ta.attribution],\ 161 | 'Failed to find Cl in attribution map' 162 | -------------------------------------------------------------------------------- /selfies/grammar_rules.py: -------------------------------------------------------------------------------- 1 | import functools 2 | import itertools 3 | import re 4 | from typing import Any, List, Optional, Tuple 5 | 6 | from selfies.constants import ( 7 | ELEMENTS, 8 | INDEX_ALPHABET, 9 | INDEX_CODE, 10 | ORGANIC_SUBSET 11 | ) 12 | from selfies.mol_graph import Atom 13 | from selfies.utils.smiles_utils import smiles_to_bond 14 | 15 | 16 | def process_atom_symbol(symbol: str) -> Optional[Tuple[Any, Atom]]: 17 | try: 18 | output = _PROCESS_ATOM_CACHE[symbol] 19 | except KeyError: 20 | output = _process_atom_selfies_no_cache(symbol) 21 | if output is None: 22 | return None 23 | _PROCESS_ATOM_CACHE[symbol] = output 24 | 25 | bond_info, atom_fac = output 26 | atom = atom_fac() 27 | if atom.bonding_capacity < 0: 28 | return None # too many Hs (e.g. [CH9] 29 | return bond_info, atom 30 | 31 | 32 | def process_branch_symbol(symbol: str) -> Optional[Tuple[int, int]]: 33 | try: 34 | return _PROCESS_BRANCH_CACHE[symbol] 35 | except KeyError: 36 | return None 37 | 38 | 39 | def process_ring_symbol(symbol: str) -> Optional[Tuple[int, int, Any]]: 40 | try: 41 | return _PROCESS_RING_CACHE[symbol] 42 | except KeyError: 43 | return None 44 | 45 | 46 | def next_atom_state( 47 | bond_order: int, bond_cap: int, state: int 48 | ) -> Tuple[int, Optional[int]]: 49 | if state == 0: 50 | bond_order = 0 51 | 52 | bond_order = min(bond_order, state, bond_cap) 53 | bonds_left = bond_cap - bond_order 54 | next_state = None if (bonds_left == 0) else bonds_left 55 | return bond_order, next_state 56 | 57 | 58 | def next_branch_state( 59 | branch_type: int, state: int 60 | ) -> Tuple[int, Optional[int]]: 61 | assert 1 <= branch_type <= 3 62 | assert state > 1 63 | 64 | branch_init_state = min(state - 1, branch_type) 65 | next_state = state - branch_init_state 66 | return branch_init_state, next_state 67 | 68 | 69 | def next_ring_state( 70 | ring_type: int, state: int 71 | ) -> Tuple[int, Optional[int]]: 72 | assert state > 0 73 | 74 | bond_order = min(ring_type, state) 75 | bonds_left = state - bond_order 76 | next_state = None if (bonds_left == 0) else bonds_left 77 | return bond_order, next_state 78 | 79 | 80 | def get_index_from_selfies(*symbols: List[str]) -> int: 81 | index = 0 82 | for i, c in enumerate(reversed(symbols)): 83 | index += INDEX_CODE.get(c, 0) * (len(INDEX_CODE) ** i) 84 | return index 85 | 86 | 87 | def get_selfies_from_index(index: int) -> List[str]: 88 | if index < 0: 89 | raise IndexError() 90 | elif index == 0: 91 | return [INDEX_ALPHABET[0]] 92 | 93 | symbols = [] 94 | base = len(INDEX_ALPHABET) 95 | while index: 96 | symbols.append(INDEX_ALPHABET[index % base]) 97 | index //= base 98 | return symbols[::-1] 99 | 100 | 101 | # ============================================================================= 102 | # Caches (for computational speed) 103 | # ============================================================================= 104 | 105 | 106 | SELFIES_ATOM_PATTERN = re.compile( 107 | r"^[\[]" # opening square bracket [ 108 | r"([=#/\\]?)" # bond char 109 | r"(\d*)" # isotope number (optional, e.g. 
123, 26) 110 | r"([A-Z][a-z]?)" # element symbol 111 | r"([@]{0,2})" # chiral_tag (optional, only @ and @@ supported) 112 | r"((?:[H]\d)?)" # H count (optional, e.g. H1, H3) 113 | r"((?:[+-][1-9]+)?)" # charge (optional, e.g. +1) 114 | r"[]]$" # closing square bracket ] 115 | ) 116 | 117 | 118 | def _process_atom_selfies_no_cache(symbol): 119 | m = SELFIES_ATOM_PATTERN.match(symbol) 120 | if m is None: 121 | return None 122 | bond_char, isotope, element, chirality, h_count, charge = m.groups() 123 | 124 | if symbol[1 + len(bond_char):-1] in ORGANIC_SUBSET: 125 | atom_fac = functools.partial(Atom, element=element, is_aromatic=False) 126 | return smiles_to_bond(bond_char), atom_fac 127 | 128 | isotope = None if (isotope == "") else int(isotope) 129 | if element not in ELEMENTS: 130 | return None 131 | chirality = None if (chirality == "") else chirality 132 | 133 | s = h_count 134 | if s == "": 135 | h_count = 0 136 | else: 137 | h_count = int(s[1:]) 138 | 139 | s = charge 140 | if s == "": 141 | charge = 0 142 | else: 143 | charge = int(s[1:]) 144 | charge *= 1 if (s[0] == "+") else -1 145 | 146 | atom_fac = functools.partial( 147 | Atom, 148 | element=element, 149 | is_aromatic=False, 150 | isotope=isotope, 151 | chirality=chirality, 152 | h_count=h_count, 153 | charge=charge 154 | ) 155 | 156 | return smiles_to_bond(bond_char), atom_fac 157 | 158 | 159 | def _build_atom_cache(): 160 | cache = dict() 161 | common_symbols = [ 162 | "[#C+1]", "[#C-1]", "[#C]", "[#N+1]", "[#N]", "[#O+1]", "[#P+1]", 163 | "[#P-1]", "[#P]", "[#S+1]", "[#S-1]", "[#S]", "[=C+1]", "[=C-1]", 164 | "[=C]", "[=N+1]", "[=N-1]", "[=N]", "[=O+1]", "[=O]", "[=P+1]", 165 | "[=P-1]", "[=P]", "[=S+1]", "[=S-1]", "[=S]", "[Br]", "[C+1]", "[C-1]", 166 | "[C]", "[Cl]", "[F]", "[H]", "[I]", "[N+1]", "[N-1]", "[N]", "[O+1]", 167 | "[O-1]", "[O]", "[P+1]", "[P-1]", "[P]", "[S+1]", "[S-1]", "[S]" 168 | ] 169 | 170 | for symbol in common_symbols: 171 | cache[symbol] = _process_atom_selfies_no_cache(symbol) 172 | return cache 173 | 174 | 175 | def _build_branch_cache(): 176 | cache = dict() 177 | for L in range(1, 4): 178 | for bond_char in ["", "=", "#"]: 179 | symbol = "[{}Branch{}]".format(bond_char, L) 180 | cache[symbol] = (smiles_to_bond(bond_char)[0], L) 181 | return cache 182 | 183 | 184 | def _build_ring_cache(): 185 | cache = dict() 186 | for L in range(1, 4): 187 | # [RingL], [=RingL], [#RingL] 188 | for bond_char in ["", "=", "#"]: 189 | symbol = "[{}Ring{}]".format(bond_char, L) 190 | order, stereo = smiles_to_bond(bond_char) 191 | cache[symbol] = (order, L, (stereo, stereo)) 192 | 193 | # [-/RingL], [\/RingL], [\-RingL], ... 
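        # each (lchar, rchar) pair supplies the stereo bond characters at the two
        # ends of the ring-closure bond; the plain ("-", "-") pair is skipped below
        # because it is equivalent to the [RingL] symbol built above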
194 | for lchar, rchar in itertools.product(["-", "/", "\\"], repeat=2): 195 | if lchar == rchar == "-": 196 | continue 197 | symbol = "[{}{}Ring{}]".format(lchar, rchar, L) 198 | order, lstereo = smiles_to_bond(lchar) 199 | order, rstereo = smiles_to_bond(rchar) 200 | cache[symbol] = (order, L, (lstereo, rstereo)) 201 | return cache 202 | 203 | 204 | _PROCESS_ATOM_CACHE = _build_atom_cache() 205 | 206 | _PROCESS_BRANCH_CACHE = _build_branch_cache() 207 | 208 | _PROCESS_RING_CACHE = _build_ring_cache() 209 | -------------------------------------------------------------------------------- /selfies/bond_constraints.py: -------------------------------------------------------------------------------- 1 | import functools 2 | from itertools import product 3 | from typing import Dict, Set, Union 4 | 5 | from selfies.constants import ELEMENTS, INDEX_ALPHABET 6 | 7 | _DEFAULT_CONSTRAINTS = { 8 | "H": 1, "F": 1, "Cl": 1, "Br": 1, "I": 1, 9 | "B": 3, "B+1": 2, "B-1": 4, 10 | "O": 2, "O+1": 3, "O-1": 1, 11 | "N": 3, "N+1": 4, "N-1": 2, 12 | "C": 4, "C+1": 3, "C-1": 3, 13 | "P": 5, "P+1": 4, "P-1": 6, 14 | "S": 6, "S+1": 5, "S-1": 5, 15 | "?": 8 16 | } 17 | 18 | _PRESET_CONSTRAINTS = { 19 | "default": dict(_DEFAULT_CONSTRAINTS), 20 | "octet_rule": dict(_DEFAULT_CONSTRAINTS), 21 | "hypervalent": dict(_DEFAULT_CONSTRAINTS) 22 | } 23 | _PRESET_CONSTRAINTS["octet_rule"].update( 24 | {"S": 2, "S+1": 3, "S-1": 1, "P": 3, "P+1": 4, "P-1": 2} 25 | ) 26 | _PRESET_CONSTRAINTS["hypervalent"].update( 27 | {"Cl": 7, "Br": 7, "I": 7, "N": 5} 28 | ) 29 | 30 | _current_constraints = _PRESET_CONSTRAINTS["default"] 31 | 32 | 33 | def get_preset_constraints(name: str) -> Dict[str, int]: 34 | """Returns the preset semantic constraints with the given name. 35 | 36 | Besides the aforementioned default constraints, :mod:`selfies` offers 37 | other preset constraints for convenience; namely, constraints that 38 | enforce the `octet rule `_ 39 | and constraints that accommodate `hypervalent molecules 40 | `_. 41 | 42 | The differences between these constraints can be summarized as follows: 43 | 44 | .. table:: 45 | :align: center 46 | :widths: auto 47 | 48 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 49 | | | Cl, Br, I | N | P | P+1 | P-1 | S | S+1 | S-1 | 50 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 51 | | ``default`` | 1 | 3 | 5 | 4 | 6 | 6 | 5 | 5 | 52 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 53 | | ``octet_rule`` | 1 | 3 | 3 | 4 | 2 | 2 | 3 | 1 | 54 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 55 | | ``hypervalent`` | 7 | 5 | 5 | 6 | 4 | 6 | 7 | 5 | 56 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 57 | 58 | :param name: the preset name: ``default`` or ``octet_rule`` or 59 | ``hypervalent``. 60 | :return: the preset constraints with the specified name, represented 61 | as a dictionary which maps atoms (the keys) to their bonding capacities 62 | (the values). 63 | """ 64 | 65 | if name not in _PRESET_CONSTRAINTS: 66 | raise ValueError("unrecognized preset name '{}'".format(name)) 67 | return dict(_PRESET_CONSTRAINTS[name]) 68 | 69 | 70 | def get_semantic_constraints() -> Dict[str, int]: 71 | """Returns the semantic constraints that :mod:`selfies` is currently 72 | operating on. 73 | 74 | :return: the current semantic constraints, represented as a dictionary 75 | which maps atoms (the keys) to their bonding capacities (the values). 
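
    :Example:

    >>> import selfies as sf
    >>> constraints = sf.get_semantic_constraints()
    >>> constraints["I"]    # under the default ("default" preset) constraints
    1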
76 | """ 77 | 78 | global _current_constraints 79 | return dict(_current_constraints) 80 | 81 | 82 | def set_semantic_constraints( 83 | bond_constraints: Union[str, Dict[str, int]] = "default" 84 | ) -> None: 85 | """Updates the semantic constraints that :mod:`selfies` operates on. 86 | 87 | If the input is a string, the new constraints are taken to be 88 | the preset named ``bond_constraints`` 89 | (see :func:`selfies.get_preset_constraints`). 90 | 91 | Otherwise, the input is a dictionary representing the new constraints. 92 | This dictionary maps atoms (the keys) to non-negative bonding 93 | capacities (the values); the atoms are specified by strings 94 | of the form ``E`` or ``E+C`` or ``E-C``, 95 | where ``E`` is an element symbol and ``C`` is a positive integer. 96 | For example, one may have: 97 | 98 | * ``bond_constraints["I-1"] = 0`` 99 | * ``bond_constraints["C"] = 4`` 100 | 101 | This dictionary must also contain the special ``?`` key, which indicates 102 | the bond capacities of all atoms that are not explicitly listed 103 | in the dictionary. 104 | 105 | :param bond_constraints: the name of a preset, or a dictionary 106 | representing the new semantic constraints. 107 | :return: ``None``. 108 | """ 109 | 110 | global _current_constraints 111 | 112 | if isinstance(bond_constraints, str): 113 | _current_constraints = get_preset_constraints(bond_constraints) 114 | 115 | elif isinstance(bond_constraints, dict): 116 | 117 | # error checking 118 | if "?" not in bond_constraints: 119 | raise ValueError("bond_constraints missing '?' as a key") 120 | 121 | for key, value in bond_constraints.items(): 122 | 123 | # error checking for keys 124 | j = max(key.find("+"), key.find("-")) 125 | if key == "?": 126 | valid = True 127 | elif j == -1: 128 | valid = (key in ELEMENTS) 129 | else: 130 | valid = (key[:j] in ELEMENTS) and key[j + 1:].isnumeric() 131 | if not valid: 132 | err_msg = "invalid key '{}' in bond_constraints".format(key) 133 | raise ValueError(err_msg) 134 | 135 | # error checking for values 136 | if not (isinstance(value, int) and value >= 0): 137 | err_msg = "invalid value at " \ 138 | "bond_constraints['{}'] = {}".format(key, value) 139 | raise ValueError(err_msg) 140 | 141 | _current_constraints = dict(bond_constraints) 142 | 143 | else: 144 | raise ValueError("bond_constraints must be a str or dict") 145 | 146 | # clear cache since we changed alphabet 147 | get_semantic_robust_alphabet.cache_clear() 148 | get_bonding_capacity.cache_clear() 149 | 150 | 151 | @functools.lru_cache() 152 | def get_semantic_robust_alphabet() -> Set[str]: 153 | """Returns a subset of all SELFIES symbols that are constrained 154 | by :mod:`selfies` under the current semantic constraints. 155 | 156 | :return: a subset of all SELFIES symbols that are semantically constrained. 
157 | """ 158 | 159 | alphabet_subset = set() 160 | bonds = {"": 1, "=": 2, "#": 3} 161 | 162 | # add atomic symbols 163 | for (a, c), (b, m) in product(_current_constraints.items(), bonds.items()): 164 | if (m > c) or (a == "?"): 165 | continue 166 | symbol = "[{}{}]".format(b, a) 167 | alphabet_subset.add(symbol) 168 | 169 | # add branch and ring symbols 170 | for i in range(1, 4): 171 | alphabet_subset.add("[Ring{}]".format(i)) 172 | alphabet_subset.add("[=Ring{}]".format(i)) 173 | alphabet_subset.add("[Branch{}]".format(i)) 174 | alphabet_subset.add("[=Branch{}]".format(i)) 175 | alphabet_subset.add("[#Branch{}]".format(i)) 176 | 177 | alphabet_subset.update(INDEX_ALPHABET) 178 | 179 | return alphabet_subset 180 | 181 | 182 | @functools.lru_cache() 183 | def get_bonding_capacity(element: str, charge: int) -> int: 184 | """Returns the bonding capacity of a given atom, under the current 185 | semantic constraints. 186 | 187 | :param element: the element of the input atom. 188 | :param charge: the charge of the input atom. 189 | :return: the bonding capacity of the input atom. 190 | """ 191 | 192 | key = element 193 | if charge != 0: 194 | key += "{:+}".format(charge) 195 | 196 | if key in _current_constraints: 197 | return _current_constraints[key] 198 | else: 199 | return _current_constraints["?"] 200 | -------------------------------------------------------------------------------- /selfies/utils/encoding_utils.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Tuple, Union 2 | 3 | from selfies.utils.selfies_utils import len_selfies, split_selfies 4 | 5 | 6 | def selfies_to_encoding( 7 | selfies: str, 8 | vocab_stoi: Dict[str, int], 9 | pad_to_len: int = -1, 10 | enc_type: str = 'both' 11 | ) -> Union[List[int], List[List[int]], Tuple[List[int], List[List[int]]]]: 12 | """Converts a SELFIES string into its label (integer) 13 | and/or one-hot encoding. 14 | 15 | A label encoded output will be a list of shape ``(L,)`` and a 16 | one-hot encoded output will be a 2D list of shape ``(L, len(vocab_stoi))``, 17 | where ``L`` is the symbol length of the SELFIES string. Optionally, 18 | the SELFIES string can be padded before it is encoded. 19 | 20 | :param selfies: the SELFIES string to be encoded. 21 | :param vocab_stoi: a dictionary that maps SELFIES symbols to indices, 22 | which must be non-negative and contiguous, starting from 0. 23 | If the SELFIES string is to be padded, then the special padding symbol 24 | ``[nop]`` must also be a key in this dictionary. 25 | :param pad_to_len: the length that the SELFIES string string is padded to. 26 | If this value is less than or equal to the symbol length of the 27 | SELFIES string, then no padding is added. Defaults to ``-1``. 28 | :param enc_type: the type of encoding of the output: 29 | ``label`` or ``one_hot`` or ``both``. 30 | If this value is ``both``, then a tuple of the label and one-hot 31 | encodings is returned. Defaults to ``both``. 32 | :return: the label encoded and/or one-hot encoded SELFIES string. 
33 | 34 | :Example: 35 | 36 | >>> import selfies as sf 37 | >>> sf.selfies_to_encoding("[C][F]", {"[C]": 0, "[F]": 1}) 38 | ([0, 1], [[1, 0], [0, 1]]) 39 | """ 40 | 41 | # some error checking 42 | if enc_type not in ("label", "one_hot", "both"): 43 | raise ValueError("enc_type must be in ('label', 'one_hot', 'both')") 44 | 45 | # pad with [nop] 46 | if pad_to_len > len_selfies(selfies): 47 | selfies += "[nop]" * (pad_to_len - len_selfies(selfies)) 48 | 49 | # integer encode 50 | integer_encoded = [] 51 | for char in split_selfies(selfies): 52 | if (char == ".") and ("." not in vocab_stoi): 53 | raise KeyError( 54 | "The SELFIES string contains two unconnected molecules " 55 | "(given by the '.' character), but vocab_stoi does not " 56 | "contain the '.' key. Please add it to the vocabulary " 57 | "or separate the molecules." 58 | ) 59 | 60 | integer_encoded.append(vocab_stoi[char]) 61 | 62 | if enc_type == "label": 63 | return integer_encoded 64 | 65 | # one-hot encode 66 | one_hot_encoded = list() 67 | for index in integer_encoded: 68 | letter = [0] * len(vocab_stoi) 69 | letter[index] = 1 70 | one_hot_encoded.append(letter) 71 | 72 | if enc_type == "one_hot": 73 | return one_hot_encoded 74 | return integer_encoded, one_hot_encoded 75 | 76 | 77 | def encoding_to_selfies( 78 | encoding: Union[List[int], List[List[int]]], 79 | vocab_itos: Dict[int, str], 80 | enc_type: str, 81 | ) -> str: 82 | """Converts a label (integer) or one-hot encoding into a SELFIES string. 83 | 84 | If the input is label encoded, then a list of shape ``(L,)`` is 85 | expected; and if the input is one-hot encoded, then a 2D list of 86 | shape ``(L, len(vocab_itos))`` is expected. 87 | 88 | :param encoding: a label or one-hot encoding. 89 | :param vocab_itos: a dictionary that maps indices to SELFIES symbols. 90 | The indices of this dictionary must be non-negative and contiguous, 91 | starting from 0. 92 | :param enc_type: the type of encoding of the input: 93 | ``label`` or ``one_hot``. 94 | :return: the SELFIES string represented by the input encoding. 95 | 96 | :Example: 97 | 98 | >>> import selfies as sf 99 | >>> one_hot = [[0, 1, 0], [0, 0, 1], [1, 0, 0]] 100 | >>> vocab_itos = {0: "[nop]", 1: "[C]", 2: "[F]"} 101 | >>> sf.encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot") 102 | '[C][F][nop]' 103 | """ 104 | 105 | if enc_type not in ("label", "one_hot"): 106 | raise ValueError("enc_type must be in ('label', 'one_hot')") 107 | 108 | if enc_type == "one_hot": # Get integer encoding 109 | integer_encoded = [] 110 | for row in encoding: 111 | integer_encoded.append(row.index(1)) 112 | else: 113 | integer_encoded = encoding 114 | 115 | # Integer encoding -> SELFIES 116 | char_list = [vocab_itos[i] for i in integer_encoded] 117 | selfies = "".join(char_list) 118 | 119 | return selfies 120 | 121 | 122 | def batch_selfies_to_flat_hot( 123 | selfies_batch: List[str], 124 | vocab_stoi: Dict[str, int], 125 | pad_to_len: int = -1, 126 | ) -> List[List[int]]: 127 | """Converts a list of SELFIES strings into its list of flattened 128 | one-hot encodings. 129 | 130 | Each SELFIES string in the input list is one-hot encoded 131 | (and then flattened) using :func:`selfies.selfies_to_encoding`, with 132 | ``vocab_stoi`` and ``pad_to_len`` being passed in as arguments. 133 | 134 | :param selfies_batch: the list of SELFIES strings to be encoded. 135 | :param vocab_stoi: a dictionary that maps SELFIES symbols to indices. 136 | :param pad_to_len: the length that each SELFIES string in the input list 137 | is padded to. 
Defaults to ``-1``. 138 | :return: the flattened one-hot encodings of the input list. 139 | 140 | :Example: 141 | 142 | >>> import selfies as sf 143 | >>> batch = ["[C]", "[C][C]"] 144 | >>> vocab_stoi = {"[nop]": 0, "[C]": 1} 145 | >>> sf.batch_selfies_to_flat_hot(batch, vocab_stoi, 2) 146 | [[0, 1, 1, 0], [0, 1, 0, 1]] 147 | """ 148 | 149 | hot_list = list() 150 | 151 | for selfies in selfies_batch: 152 | one_hot = selfies_to_encoding(selfies, vocab_stoi, pad_to_len, 153 | enc_type="one_hot") 154 | flattened = [elem for vec in one_hot for elem in vec] 155 | hot_list.append(flattened) 156 | 157 | return hot_list 158 | 159 | 160 | def batch_flat_hot_to_selfies( 161 | one_hot_batch: List[List[int]], 162 | vocab_itos: Dict[int, str], 163 | ) -> List[str]: 164 | """Converts a list of flattened one-hot encodings into a list 165 | of SELFIES strings. 166 | 167 | Each encoding in the input list is unflattened and then decoded using 168 | :func:`selfies.encoding_to_selfies`, with ``vocab_itos`` being passed in 169 | as an argument. 170 | 171 | :param one_hot_batch: a list of flattened one-hot encodings. Each 172 | encoding must be a list of length divisible by ``len(vocab_itos)``. 173 | :param vocab_itos: a dictionary that maps indices to SELFIES symbols. 174 | :return: the list of SELFIES strings represented by the input encodings. 175 | 176 | :Example: 177 | 178 | >>> import selfies as sf 179 | >>> batch = [[0, 1, 1, 0], [0, 1, 0, 1]] 180 | >>> vocab_itos = {0: "[nop]", 1: "[C]"} 181 | >>> sf.batch_flat_hot_to_selfies(batch, vocab_itos) 182 | ['[C][nop]', '[C][C]'] 183 | """ 184 | 185 | selfies_list = [] 186 | 187 | for flat_one_hot in one_hot_batch: 188 | 189 | # Reshape to an L x M array where each column represents an alphabet 190 | # entry and each row is a position in the selfies 191 | one_hot = [] 192 | 193 | M = len(vocab_itos) 194 | if len(flat_one_hot) % M != 0: 195 | raise ValueError("size of vector in one_hot_batch not divisible " 196 | "by the length of the vocabulary.") 197 | L = len(flat_one_hot) // M 198 | 199 | for i in range(L): 200 | one_hot.append(flat_one_hot[M * i: M * (i + 1)]) 201 | 202 | selfies = encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot") 203 | selfies_list.append(selfies) 204 | 205 | return selfies_list 206 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## v2.2.0 - 15.01.2025 4 | - Fixed bug in the kekulization of molecules with radicals (thanks Olabisi-Aishat-Bello for reporting, thanks Robert Pollice for fixing) 5 | - Fixed constraints for validity of molecules with changed C, P or S, to align with validity-definition of RDKit. 6 | 7 | ## v2.1.2 - 15.07.2024 8 | - Fixed recursion bug for very long molecules (thanks haydn-jones) 9 | - Added warning when dot-symbol (".") exists in peculiar cases (thanks vandrw) 10 | 11 | ## v2.1.1 - 14.07.2022 12 | - Fixed index bug in attribution 13 | 14 | ## v2.1.0 - 17.05.2022 15 | 16 | ### Changed: 17 | - Dropped support for Python 3.5-3.6 and will continue to support only current Python versions. 18 | 19 | ### Added: 20 | - optional attribution to map encoder/decoder output string back to input string (Issue #48, #79) 21 | 22 | ## v2.0.0 - 21.10.2021 23 | 24 | ### Changed: 25 | - Improved SMILES parsing (by using adjacencey lists internally), with tighter error handling 26 | (e.g. issues #62 and #60). 
27 | - Faster and improved kekulization algorithm (issue #55 fixed).
28 | - Support for symbols that are constrained to 0 bonds (e.g., `[CH4]`) or >8 bonds
29 |   (users can now specify custom bond constraints with over 8 bonds).
30 | - New `strict=True` flag to `selfies.encoder`, which raises an error if the input
31 |   SMILES violates the current bond constraints. `True` by default, can be `False` for speed-up (if
32 |   SMILES are guaranteed to be correct).
33 | - Added bond constraints for B (max. 3 bonds) to the default and preset constraints.
34 | - Updated the syntax of SELFIES symbols to be cleaner and more readable.
35 |     - Removing `expl` from atomic symbols, e.g., `[C@@Hexpl]` becomes `[C@@H]`
36 |     - Cleaner branch symbols, e.g., `[BranchL_2]` becomes `[=BranchL]`
37 |     - Cleaner ring symbols, e.g., `[Expl=RingL]` becomes `[=RingL]`
38 |     - If you want to use the old symbols, use the `compatible=True` flag to `selfies.decoder`,
39 |       e.g., `sf.decoder('[C][C][Expl=Ring1]',compatible=True)` (not recommended!)
40 | - More logically consistent behaviour of `[Ring]` symbols.
41 | - Standardized SELFIES alphabet, i.e., no two symbols stand for the same atom/ion (issue #58), e.g.,
42 |   `[N+1]` and `[N+]` are equivalent now.
43 | - Indexing symbols are now included in the alphabet returned by `selfies.get_semantic_robust_alphabet`.
44 | 
45 | ### Removed
46 | - Removed `constraints` flag from `selfies.decoder`; please use `selfies.set_semantic_constraints()`
47 |   and pass in `"hypervalent"` or `"octet_rule"` instead.
48 | - Removed `print_error` flag in `selfies.encoder` and `selfies.decoder`,
49 |   which now raise errors `selfies.EncoderError` and `selfies.DecoderError`, respectively.
50 | 
51 | ### Bug Fixes
52 | - Potential chirality inversion of atoms making ring bonds (e.g. ``[C@@H]12CCC2CCNC1``):
53 |   fixed by inverting their chirality in ``selfies.encoder`` such that they are decoded with
54 |   the original chirality preserved.
55 | - Failure to represent mismatching stereochemical specifications at ring bonds
56 |   (e.g. ``F/C=C/C/C=C\C``): fixed by adding new ring symbols (e.g. ``[-/RingL]``, ``[\/RingL]``, etc.).
57 | 
58 | ---
59 | 
60 | ## v1.0.4 - 23.04.2021
61 | ### Added:
62 | * decoder option for relaxed hypervalence rules, `decoder(...,constraints='hypervalent')`
63 | * decoder option for strict octet rules, `decoder(...,constraints='octet_rule')`
64 | ### Bug Fix:
65 | * Fixed constraint for Phosphorus
66 | 
67 | ---
68 | 
69 | ## v1.0.3 - 13.01.2021
70 | ### Added:
71 | * Support for aromatic Si and Al (not officially supported by Daylight SMILES, but RDKit supports it and examples exist in PubChem).
72 | 
73 | ---
74 | 
75 | ## v1.0.2 - 14.10.2020
76 | ### Added:
77 | * Support for aromatic Te and triple bonds.
78 | * Built-in SELFIES to one-hot encoding, and one-hot encoding to SELFIES
79 | 
80 | ### Changed:
81 | * Added default semantic constraints for charged atoms (single positive/negative charge of `[C]`, `[N]`, `[O]`, `[S]`, `[P]`)
82 | * Raised the bond capacity of `P` to 7 bonds (from 5 bonds).
83 | 
84 | ### Bug Fixes:
85 | * Fixed bug: `selfies.decoder` did not terminate for malformed SELFIES
86 |   that are missing the closed bracket `']'`.
87 | 
88 | ---
89 | 
90 | ## v1.0.1 - 25.08.2020
91 | ### Changed:
92 | * Code is now compatible with python >= 3.5.
93 | * More descriptive error messages.
94 | 
95 | ### Bug Fixes:
96 | * Minor bug fixes in the encoder for SMILES ending in branches (e.g. `C(Cl)(F)`),
97 |   and SMILES with ring numbers between branches (e.g.
`C(Cl)1(Br)CCCC1`) 98 | * Minor bug fix with ring ordering in decoder (e.g. `C1CC2CCC12` vs `C1CC2CCC21`). 99 | 100 | --- 101 | 102 | ## v1.0.0 - 17.08.2020: 103 | ### Added: 104 | * Added semantic handling of aromaticity / delocalization (by kekulizing SMILES with aromatic symbols before 105 | they are translated into SELFIES by `selfies.encoder`). 106 | * Added semantic handling of charged species (e.g. `[CH+]1CCC1`). 107 | * Added semantic handling of radical species (`[CH]1CCC1`) or any species with explicit hydrogens (e.g. `CC[CH2]`). 108 | * Added semantic handling of isotopes (e.g. `[14CH2]=C` or `[235U]`). 109 | * Improved semantic handling of explicit atom symbols in square brackets, e.g. Carbene (`[C]=C`). 110 | * Improved semantic handling of chirality (e.g. `O=C[Co@@](F)(Cl)(Br)(I)S`). 111 | * Improved semantic handling of double-bond configuration (e.g. `F/C=C/C=C/C`). 112 | * Added new functions to the library, such as `selfies.len_selfies` and 113 | `selfies.split_selfies`. 114 | * Added advanced-user functions to the library to customize the SELFIES semantic constraints, e.g. 115 | `selfies.set_semantic_constraints`. Allows to encode for instance diborane, `[BH2]1[H][BH2][H]1`. 116 | * Introduced new padding `[nop]` (no operation) symbol. 117 | 118 | ### Changed: 119 | * Optimized the indexing alphabet (it is base-16 now). 120 | * Optimized the behaviours of rings and branches to fix an issue with specific non-standard molecules that could not be translated. 121 | * Changed behaviour of Ring/Branch, such that states `X9991-X9993` are not necessary anymore. 122 | * Significantly improved encoding and decoding algorithms, it is much faster now. 123 | 124 | --- 125 | 126 | ## v0.2.4 - 01.10.2019: 127 | ### Added: 128 | * Function ``get_alphabet()`` which returns a list of 29 selfies symbols 129 | whose arbitrary combination produce >99.99% valid molecules. 130 | 131 | ### Bug Fixes: 132 | * Fixed bug which happens when three rings start at one node, and two of 133 | them form a double ring. 134 | * Enabled rings with sizes of up to 8000 SELFIES symbols. 135 | * Bug fix for tiny ring to RDKit syntax conversion, spanning multiple 136 | branches. 137 | 138 | We thank Kevin Ryan (LeanAndMean@github), Theophile Gaudin and Andrew Brereton 139 | for suggestions and bug reports. 140 | 141 | --- 142 | 143 | ## v0.2.2 - 19.09.2019: 144 | 145 | ### Added: 146 | * Enabled ``[C@],[C@H],[C@@],[C@@H],[H]`` to use in a semantic 147 | constrained way. 148 | 149 | We thank Andrew Brereton for suggestions and bug reports. 150 | 151 | --- 152 | 153 | ## v0.2.1 - 02.09.2019: 154 | 155 | ### Added: 156 | * Decoder: added optional argument to restrict nitrogen to 3 bonds. 157 | ``decoder(...,N_restrict=False)`` to allow for more bonds; 158 | standard: ``N_restrict=True``. 159 | * Decoder: added optional argument make ring-function bi-local 160 | (i.e. confirms bond number at target). 161 | ``decoder(...,bilocal_ring_function=False)`` to not allow bi-local ring 162 | function; standard: ``bilocal_ring_function=True``. The bi-local ring 163 | function will allow validity of >99.99% of random molecules. 164 | * Decoder: made double-bond ring RDKit syntax conform. 165 | * Decoder: added state X5 and X6 for having five and six bonds free. 166 | 167 | ### Bug Fixes: 168 | * Decoder + Encoder: allowing for explicit brackets for organic atoms, for 169 | instance ``[I]``. 170 | * Encoder: explicit single/double bond for non-canonical SMILES input 171 | issue fixed. 
172 | * Decoder: bug fix for ``[Branch*]`` in state X1.
173 | 
174 | We thank Benjamin Sanchez-Lengeling, Theophile Gaudin and Zhenpeng Yao
175 | for suggestions and bug reports.
176 | 
177 | ---
178 | 
179 | ## v0.1.1 - 04.06.2019:
180 | * initial release
181 | 
--------------------------------------------------------------------------------
/selfies/decoder.py:
--------------------------------------------------------------------------------
1 | import warnings
2 | from typing import List, Union, Tuple
3 | 
4 | from selfies.compatibility import modernize_symbol
5 | from selfies.exceptions import DecoderError
6 | from selfies.grammar_rules import (
7 |     get_index_from_selfies,
8 |     next_atom_state,
9 |     next_branch_state,
10 |     next_ring_state,
11 |     process_atom_symbol,
12 |     process_branch_symbol,
13 |     process_ring_symbol
14 | )
15 | from selfies.mol_graph import MolecularGraph, Attribution
16 | from selfies.utils.selfies_utils import split_selfies
17 | from selfies.utils.smiles_utils import mol_to_smiles
18 | 
19 | 
20 | def decoder(
21 |         selfies: str,
22 |         compatible: bool = False,
23 |         attribute: bool = False) ->\
24 |         Union[str, Tuple[str, List[Tuple[str, List[Tuple[int, str]]]]]]:
25 |     """Translates a SELFIES string into its corresponding SMILES string.
26 | 
27 |     This translation is deterministic but depends on the current semantic
28 |     constraints. The output SMILES string is guaranteed to be syntactically
29 |     correct and guaranteed to represent a molecule that obeys the
30 |     semantic constraints.
31 | 
32 |     :param selfies: the SELFIES string to be translated.
33 |     :param compatible: if ``True``, this function will accept SELFIES strings
34 |         containing deprecated symbols from previous releases. However, this
35 |         function may behave differently than in previous major releases,
36 |         and should not be treated as backward compatible.
37 |         Defaults to ``False``.
38 |     :param attribute: if ``True``, an attribution map connecting selfies
39 |         tokens to smiles tokens is output.
40 |     :return: a SMILES string derived from the input SELFIES string.
41 |     :raises DecoderError: if the input SELFIES string is malformed.
42 | 
43 |     :Example:
44 | 
45 |     >>> import selfies as sf
46 |     >>> sf.decoder('[C][=C][F]')
47 |     'C=CF'
48 |     """
49 | 
50 |     if compatible:
51 |         msg = "\nselfies.decoder() may behave differently than in previous " \
52 |               "major releases. We recommend using SELFIES that are up to date."
53 | warnings.warn(msg, stacklevel=2) 54 | 55 | mol = MolecularGraph(attributable=attribute) 56 | 57 | rings = [] 58 | attribution_index = 0 59 | for s in selfies.split("."): 60 | n = _derive_mol_from_symbols( 61 | symbol_iter=enumerate(_tokenize_selfies(s, compatible)), 62 | mol=mol, 63 | selfies=selfies, 64 | max_derive=float("inf"), 65 | init_state=0, 66 | root_atom=None, 67 | rings=rings, 68 | attribute_stack=[] if attribute else None, 69 | attribution_index=attribution_index 70 | ) 71 | attribution_index += n 72 | _form_rings_bilocally(mol, rings) 73 | return mol_to_smiles(mol, attribute) 74 | 75 | 76 | def _tokenize_selfies(selfies, compatible): 77 | if isinstance(selfies, str): 78 | symbol_iter = split_selfies(selfies) 79 | elif isinstance(selfies, list): 80 | symbol_iter = selfies 81 | else: 82 | raise ValueError() # should not happen 83 | 84 | try: 85 | for symbol in symbol_iter: 86 | if symbol == "[nop]": 87 | continue 88 | if compatible: 89 | symbol = modernize_symbol(symbol) 90 | yield symbol 91 | except ValueError as err: 92 | raise DecoderError(str(err)) from None 93 | 94 | 95 | def _derive_mol_from_symbols( 96 | symbol_iter, mol, selfies, max_derive, 97 | init_state, root_atom, rings, attribute_stack, attribution_index 98 | ): 99 | n_derived = 0 100 | state = init_state 101 | prev_atom = root_atom 102 | 103 | while (state is not None) and (n_derived < max_derive): 104 | 105 | try: # retrieve next symbol 106 | index, symbol = next(symbol_iter) 107 | n_derived += 1 108 | except StopIteration: 109 | break 110 | 111 | # Case 1: Branch symbol (e.g. [Branch1]) 112 | if "ch" == symbol[-4:-2]: 113 | 114 | output = process_branch_symbol(symbol) 115 | if output is None: 116 | _raise_decoder_error(selfies, symbol) 117 | btype, n = output 118 | 119 | if state <= 1: 120 | next_state = state 121 | else: 122 | binit_state, next_state = next_branch_state(btype, state) 123 | 124 | Q = _read_index_from_selfies(symbol_iter, n_symbols=n) 125 | n_derived += n + _derive_mol_from_symbols( 126 | symbol_iter, mol, selfies, (Q + 1), 127 | init_state=binit_state, root_atom=prev_atom, rings=rings, 128 | attribute_stack=attribute_stack + 129 | [Attribution(index + attribution_index, symbol) 130 | ] if attribute_stack is not None else None, 131 | attribution_index=attribution_index 132 | ) 133 | 134 | # Case 2: Ring symbol (e.g. [Ring2]) 135 | elif "ng" == symbol[-4:-2]: 136 | 137 | output = process_ring_symbol(symbol) 138 | if output is None: 139 | _raise_decoder_error(selfies, symbol) 140 | ring_type, n, stereo = output 141 | 142 | if state == 0: 143 | next_state = state 144 | else: 145 | ring_order, next_state = next_ring_state(ring_type, state) 146 | bond_info = (ring_order, stereo) 147 | 148 | Q = _read_index_from_selfies(symbol_iter, n_symbols=n) 149 | n_derived += n 150 | lidx = max(0, prev_atom.index - (Q + 1)) 151 | rings.append((mol.get_atom(lidx), prev_atom, bond_info)) 152 | 153 | # Case 3: [epsilon] 154 | elif "eps" in symbol: 155 | next_state = 0 if (state == 0) else None 156 | 157 | # Case 4: regular symbol (e.g. 
[N], [=C], [F]) 158 | else: 159 | 160 | output = process_atom_symbol(symbol) 161 | if output is None: 162 | _raise_decoder_error(selfies, symbol) 163 | (bond_order, stereo), atom = output 164 | cap = atom.bonding_capacity 165 | 166 | bond_order, next_state = next_atom_state(bond_order, cap, state) 167 | if bond_order == 0: 168 | if state == 0: 169 | o = mol.add_atom(atom, True) 170 | mol.add_attribution( 171 | o, attribute_stack + 172 | [Attribution(index + attribution_index, symbol)] 173 | if attribute_stack is not None else None) 174 | else: 175 | o = mol.add_atom(atom) 176 | mol.add_attribution( 177 | o, attribute_stack + 178 | [Attribution(index + attribution_index, symbol)] 179 | if attribute_stack is not None else None) 180 | src, dst = prev_atom.index, atom.index 181 | o = mol.add_bond(src=src, dst=dst, 182 | order=bond_order, stereo=stereo) 183 | mol.add_attribution( 184 | o, attribute_stack + 185 | [Attribution(index + attribution_index, symbol)] 186 | if attribute_stack is not None else None) 187 | prev_atom = atom 188 | 189 | if next_state is None: 190 | break 191 | state = next_state 192 | 193 | while n_derived < max_derive: # consume remaining tokens 194 | try: 195 | next(symbol_iter) 196 | n_derived += 1 197 | except StopIteration: 198 | break 199 | 200 | return n_derived 201 | 202 | 203 | def _raise_decoder_error(selfies, invalid_symbol): 204 | err_msg = "invalid symbol '{}'\n\tSELFIES: {}".format( 205 | invalid_symbol, selfies 206 | ) 207 | raise DecoderError(err_msg) 208 | 209 | 210 | def _read_index_from_selfies(symbol_iter, n_symbols): 211 | index_symbols = [] 212 | for _ in range(n_symbols): 213 | try: 214 | index_symbols.append(next(symbol_iter)[-1]) 215 | except StopIteration: 216 | index_symbols.append(None) 217 | return get_index_from_selfies(*index_symbols) 218 | 219 | 220 | def _form_rings_bilocally(mol, rings): 221 | rings_made = [0] * len(mol) 222 | 223 | for latom, ratom, bond_info in rings: 224 | lidx, ridx = latom.index, ratom.index 225 | 226 | if lidx == ridx: # ring to the same atom forbidden 227 | continue 228 | 229 | order, (lstereo, rstereo) = bond_info 230 | lfree = latom.bonding_capacity - mol.get_bond_count(lidx) 231 | rfree = ratom.bonding_capacity - mol.get_bond_count(ridx) 232 | 233 | if lfree <= 0 or rfree <= 0: 234 | continue # no room for ring bond 235 | order = min(order, lfree, rfree) 236 | 237 | if mol.has_bond(a=lidx, b=ridx): 238 | bond = mol.get_dirbond(src=lidx, dst=ridx) 239 | new_order = min(order + bond.order, 3) 240 | mol.update_bond_order(a=lidx, b=ridx, new_order=new_order) 241 | 242 | else: 243 | mol.add_ring_bond( 244 | a=lidx, a_stereo=lstereo, a_pos=rings_made[lidx], 245 | b=ridx, b_stereo=rstereo, b_pos=rings_made[ridx], 246 | order=order 247 | ) 248 | rings_made[lidx] += 1 249 | rings_made[ridx] += 1 250 | -------------------------------------------------------------------------------- /original_code_from_paper/environment.yml: -------------------------------------------------------------------------------- 1 | name: base 2 | channels: 3 | - defaults 4 | - conda-forge 5 | - pytorch 6 | dependencies: 7 | - _ipyw_jlab_nb_ext_conf=0.1.0=py37_0 8 | - alabaster=0.7.12=py37_0 9 | - anaconda=2018.12=py37_0 10 | - anaconda-client=1.7.2=py37_0 11 | - anaconda-navigator=1.9.6=py37_0 12 | - anaconda-project=0.8.2=py37_0 13 | - asn1crypto=0.24.0=py37_0 14 | - astroid=2.1.0=py37_0 15 | - astropy=3.1=py37he774522_0 16 | - atomicwrites=1.2.1=py37_0 17 | - attrs=18.2.0=py37h28b3542_0 18 | - babel=2.6.0=py37_0 19 | - backcall=0.1.0=py37_0 20 | 
- backports=1.0=py37_1 21 | - backports.os=0.1.1=py37_0 22 | - backports.shutil_get_terminal_size=1.0.0=py37_2 23 | - beautifulsoup4=4.6.3=py37_0 24 | - bitarray=0.8.3=py37hfa6e2cd_0 25 | - bkcharts=0.2=py37_0 26 | - blas=1.0=mkl 27 | - blaze=0.11.3=py37_0 28 | - bleach=3.0.2=py37_0 29 | - blosc=1.14.4=he51fdeb_0 30 | - bokeh=1.0.2=py37_0 31 | - boost=1.68.0=py37hf75dd32_1001 32 | - boost-cpp=1.68.0=h6a4c333_1000 33 | - boto=2.49.0=py37_0 34 | - bottleneck=1.2.1=py37h452e1ab_1 35 | - bzip2=1.0.6=hfa6e2cd_5 36 | - ca-certificates=2018.03.07=0 37 | - cairo=1.16.0=hc1b38c8_1000 38 | - certifi=2018.11.29=py37_0 39 | - cffi=1.11.5=py37h74b6da3_1 40 | - chardet=3.0.4=py37_1 41 | - click=7.0=py37_0 42 | - cloudpickle=0.6.1=py37_0 43 | - clyent=1.2.2=py37_1 44 | - colorama=0.4.1=py37_0 45 | - comtypes=1.1.7=py37_0 46 | - conda=4.6.4=py37_0 47 | - conda-build=3.17.6=py37_0 48 | - conda-env=2.6.0=1 49 | - conda-verify=3.1.1=py37_0 50 | - console_shortcut=0.1.1=3 51 | - contextlib2=0.5.5=py37_0 52 | - cryptography=2.4.2=py37h7a1dbc1_0 53 | - curl=7.63.0=h2a8f88b_1000 54 | - cycler=0.10.0=py37_0 55 | - cython=0.29.2=py37ha925a31_0 56 | - cytoolz=0.9.0.1=py37hfa6e2cd_1 57 | - dask=1.0.0=py37_0 58 | - dask-core=1.0.0=py37_0 59 | - datashape=0.5.4=py37_1 60 | - decorator=4.3.0=py37_0 61 | - defusedxml=0.5.0=py37_1 62 | - distributed=1.25.1=py37_0 63 | - docutils=0.14=py37_0 64 | - entrypoints=0.2.3=py37_2 65 | - et_xmlfile=1.0.1=py37_0 66 | - fastcache=1.0.2=py37hfa6e2cd_2 67 | - filelock=3.0.10=py37_0 68 | - flask=1.0.2=py37_1 69 | - flask-cors=3.0.7=py37_0 70 | - freetype=2.9.1=ha9979f8_1 71 | - future=0.17.1=py37_0 72 | - get_terminal_size=1.0.0=h38e98db_0 73 | - gevent=1.3.7=py37he774522_1 74 | - glob2=0.6=py37_1 75 | - greenlet=0.4.15=py37hfa6e2cd_0 76 | - h5py=2.8.0=py37h3bdd7fb_2 77 | - hdf5=1.10.2=hac2f561_1 78 | - heapdict=1.0.0=py37_2 79 | - html5lib=1.0.1=py37_0 80 | - icc_rt=2019.0.0=h0cc432a_1 81 | - icu=58.2=ha66f8fd_1 82 | - idna=2.8=py37_0 83 | - imageio=2.4.1=py37_0 84 | - imagesize=1.1.0=py37_0 85 | - importlib_metadata=0.6=py37_0 86 | - intel-openmp=2019.1=144 87 | - ipykernel=5.1.0=py37h39e3cac_0 88 | - ipython=7.2.0=py37h39e3cac_0 89 | - ipython_genutils=0.2.0=py37_0 90 | - ipywidgets=7.4.2=py37_0 91 | - isort=4.3.4=py37_0 92 | - itsdangerous=1.1.0=py37_0 93 | - jdcal=1.4=py37_0 94 | - jedi=0.13.2=py37_0 95 | - jinja2=2.10=py37_0 96 | - jpeg=9b=hb83a4c4_2 97 | - jsonschema=2.6.0=py37_0 98 | - jupyter=1.0.0=py37_7 99 | - jupyter_client=5.2.4=py37_0 100 | - jupyter_console=6.0.0=py37_0 101 | - jupyter_core=4.4.0=py37_0 102 | - jupyterlab=0.35.3=py37_0 103 | - jupyterlab_server=0.2.0=py37_0 104 | - keyring=17.0.0=py37_0 105 | - kiwisolver=1.0.1=py37h6538335_0 106 | - krb5=1.16.1=hc04afaa_7 107 | - lazy-object-proxy=1.3.1=py37hfa6e2cd_2 108 | - libarchive=3.3.3=h0643e63_5 109 | - libcurl=7.63.0=h2a8f88b_1000 110 | - libiconv=1.15=h1df5818_7 111 | - libpng=1.6.35=h2a8f88b_0 112 | - libprotobuf=3.6.1=h1a1b453_1000 113 | - libsodium=1.0.16=h9d3ae62_0 114 | - libssh2=1.8.0=h7a1dbc1_4 115 | - libtiff=4.0.9=h36446d0_2 116 | - libxml2=2.9.8=hadb2253_1 117 | - libxslt=1.1.32=hf6f1972_0 118 | - llvmlite=0.26.0=py37ha925a31_0 119 | - locket=0.2.0=py37_1 120 | - lxml=4.2.5=py37hef2cd61_0 121 | - lz4-c=1.8.1.2=h2fa13f4_0 122 | - lzo=2.10=h6df0209_2 123 | - m2w64-gcc-libgfortran=5.3.0=6 124 | - m2w64-gcc-libs=5.3.0=7 125 | - m2w64-gcc-libs-core=5.3.0=7 126 | - m2w64-gmp=6.1.0=2 127 | - m2w64-libwinpthread-git=5.0.0.4634.697f757=2 128 | - markupsafe=1.1.0=py37he774522_0 129 | - mccabe=0.6.1=py37_1 130 
| - menuinst=1.4.14=py37hfa6e2cd_0 131 | - mistune=0.8.4=py37he774522_0 132 | - mkl=2019.1=144 133 | - mkl-service=1.1.2=py37hb782905_5 134 | - mkl_fft=1.0.6=py37h6288b17_0 135 | - mkl_random=1.0.2=py37h343c172_0 136 | - more-itertools=4.3.0=py37_0 137 | - mpmath=1.1.0=py37_0 138 | - msgpack-python=0.5.6=py37he980bc4_1 139 | - msys2-conda-epoch=20160418=1 140 | - multipledispatch=0.6.0=py37_0 141 | - navigator-updater=0.2.1=py37_0 142 | - nbconvert=5.4.0=py37_1 143 | - nbformat=4.4.0=py37_0 144 | - networkx=2.2=py37_1 145 | - ninja=1.8.2=py37he980bc4_1 146 | - nltk=3.4=py37_1 147 | - nose=1.3.7=py37_2 148 | - notebook=5.7.4=py37_0 149 | - numba=0.41.0=py37hf9181ef_0 150 | - numexpr=2.6.8=py37hdce8814_0 151 | - numpydoc=0.8.0=py37_0 152 | - odo=0.5.1=py37_0 153 | - olefile=0.46=py37_0 154 | - openpyxl=2.5.12=py37_0 155 | - openssl=1.1.1a=he774522_0 156 | - packaging=18.0=py37_0 157 | - pandas=0.23.4=py37h830ac7b_0 158 | - pandoc=1.19.2.1=hb2460c7_1 159 | - pandocfilters=1.4.2=py37_1 160 | - parso=0.3.1=py37_0 161 | - partd=0.3.9=py37_0 162 | - path.py=11.5.0=py37_0 163 | - pathlib2=2.3.3=py37_0 164 | - patsy=0.5.1=py37_0 165 | - pep8=1.7.1=py37_0 166 | - pickleshare=0.7.5=py37_0 167 | - pillow=5.3.0=py37hdc69c19_0 168 | - pip=18.1=py37_0 169 | - pixman=0.34.0=hcef7cb0_3 170 | - pkginfo=1.4.2=py37_1 171 | - pluggy=0.8.0=py37_0 172 | - ply=3.11=py37_0 173 | - prometheus_client=0.5.0=py37_0 174 | - prompt_toolkit=2.0.7=py37_0 175 | - protobuf=3.6.1=py37he025d50_1001 176 | - psutil=5.4.8=py37he774522_0 177 | - py=1.7.0=py37_0 178 | - pycairo=1.18.0=py37h63da52a_0 179 | - pycodestyle=2.4.0=py37_0 180 | - pycosat=0.6.3=py37hfa6e2cd_0 181 | - pycparser=2.19=py37_0 182 | - pycrypto=2.6.1=py37hfa6e2cd_9 183 | - pycurl=7.43.0.2=py37h7a1dbc1_0 184 | - pyflakes=2.0.0=py37_0 185 | - pygments=2.3.1=py37_0 186 | - pylint=2.2.2=py37_0 187 | - pyodbc=4.0.25=py37ha925a31_0 188 | - pyopenssl=18.0.0=py37_0 189 | - pyparsing=2.3.0=py37_0 190 | - pyqt=5.9.2=py37h6538335_2 191 | - pysocks=1.6.8=py37_0 192 | - pytables=3.4.4=py37he6f6034_0 193 | - pytest=4.0.2=py37_0 194 | - pytest-arraydiff=0.3=py37h39e3cac_0 195 | - pytest-astropy=0.5.0=py37_0 196 | - pytest-doctestplus=0.2.0=py37_0 197 | - pytest-openfiles=0.3.1=py37_0 198 | - pytest-remotedata=0.3.1=py37_0 199 | - python=3.7.1=h8c8aaf0_6 200 | - python-dateutil=2.7.5=py37_0 201 | - python-libarchive-c=2.8=py37_6 202 | - pytorch-cpu=1.0.1=py3.7_cpu_1 203 | - pytz=2018.7=py37_0 204 | - pywavelets=1.0.1=py37h8c2d366_0 205 | - pywin32=223=py37hfa6e2cd_1 206 | - pywinpty=0.5.5=py37_1000 207 | - pyyaml=3.13=py37hfa6e2cd_0 208 | - pyzmq=17.1.2=py37hfa6e2cd_0 209 | - qt=5.9.7=vc14h73c81de_0 210 | - qtawesome=0.5.3=py37_0 211 | - qtconsole=4.4.3=py37_0 212 | - qtpy=1.5.2=py37_0 213 | - rdkit=2018.09.1=py37h3020836_1001 214 | - requests=2.21.0=py37_0 215 | - rope=0.11.0=py37_0 216 | - ruamel_yaml=0.15.46=py37hfa6e2cd_0 217 | - scikit-image=0.14.1=py37ha925a31_0 218 | - scikit-learn=0.20.1=py37h343c172_0 219 | - scipy=1.1.0=py37h29ff71c_2 220 | - seaborn=0.9.0=py37_0 221 | - send2trash=1.5.0=py37_0 222 | - setuptools=40.6.3=py37_0 223 | - simplegeneric=0.8.1=py37_2 224 | - singledispatch=3.4.0.3=py37_0 225 | - sip=4.19.8=py37h6538335_0 226 | - six=1.12.0=py37_0 227 | - snappy=1.1.7=h777316e_3 228 | - snowballstemmer=1.2.1=py37_0 229 | - sortedcollections=1.0.1=py37_0 230 | - sortedcontainers=2.1.0=py37_0 231 | - sphinx=1.8.2=py37_0 232 | - sphinxcontrib=1.0=py37_1 233 | - sphinxcontrib-websupport=1.1.0=py37_1 234 | - spyder=3.3.2=py37_0 235 | - 
spyder-kernels=0.3.0=py37_0 236 | - sqlalchemy=1.2.15=py37he774522_0 237 | - sqlite=3.26.0=he774522_0 238 | - statsmodels=0.9.0=py37h452e1ab_0 239 | - sympy=1.3=py37_0 240 | - tblib=1.3.2=py37_0 241 | - tensorboardx=1.6=py_0 242 | - terminado=0.8.1=py37_1 243 | - testpath=0.4.2=py37_0 244 | - tk=8.6.8=hfa6e2cd_0 245 | - toolz=0.9.0=py37_0 246 | - torchvision-cpu=0.2.1=py_2 247 | - tornado=5.1.1=py37hfa6e2cd_0 248 | - tqdm=4.28.1=py37h28b3542_0 249 | - traitlets=4.3.2=py37_0 250 | - unicodecsv=0.14.1=py37_0 251 | - urllib3=1.24.1=py37_0 252 | - vc=14.1=h0510ff6_4 253 | - vs2015_runtime=14.15.26706=h3a45250_0 254 | - wcwidth=0.1.7=py37_0 255 | - webencodings=0.5.1=py37_1 256 | - werkzeug=0.14.1=py37_0 257 | - wheel=0.32.3=py37_0 258 | - widgetsnbextension=3.4.2=py37_0 259 | - win_inet_pton=1.0.1=py37_1 260 | - win_unicode_console=0.5=py37_0 261 | - wincertstore=0.2=py37_0 262 | - winpty=0.4.3=4 263 | - wrapt=1.10.11=py37hfa6e2cd_2 264 | - xlrd=1.2.0=py37_0 265 | - xlsxwriter=1.1.2=py37_0 266 | - xlwings=0.15.1=py37_0 267 | - xlwt=1.3.0=py37_0 268 | - xz=5.2.4=h2fa13f4_4 269 | - yaml=0.1.7=hc54c509_2 270 | - zeromq=4.2.5=he025d50_1 271 | - zict=0.1.3=py37_0 272 | - zlib=1.2.11=h62dcd97_3 273 | - zstd=1.3.7=h508b16e_0 274 | - pip: 275 | - absl-py==0.7.0 276 | - astor==0.7.1 277 | - deepsmiles==1.0.1 278 | - gast==0.2.2 279 | - grpcio==1.18.0 280 | - keras-applications==1.0.7 281 | - keras-preprocessing==1.0.9 282 | - markdown==3.0.1 283 | - matplotlib==3.0.3 284 | - mock==2.0.0 285 | - numpy==1.16.1 286 | - pbr==5.1.2 287 | - selfies==0.1.1 288 | - tensorboard==1.12.2 289 | - tensorflow==1.13.0rc2 290 | - tensorflow-estimator==1.13.0rc0 291 | - termcolor==1.1.0 292 | prefix: C:\ProgramData\Anaconda3 293 | 294 | -------------------------------------------------------------------------------- /selfies/encoder.py: -------------------------------------------------------------------------------- 1 | from selfies.exceptions import EncoderError, SMILESParserError 2 | from selfies.grammar_rules import get_selfies_from_index 3 | from selfies.utils.smiles_utils import ( 4 | atom_to_smiles, 5 | bond_to_smiles, 6 | smiles_to_mol 7 | ) 8 | 9 | from selfies.mol_graph import AttributionMap 10 | 11 | 12 | def encoder(smiles: str, strict: bool = True, attribute: bool = False) -> str: 13 | """Translates a SMILES string into its corresponding SELFIES string. 14 | 15 | This translation is deterministic and does not depend on the 16 | current semantic constraints. Additionally, it preserves the atom order 17 | of the input SMILES string; thus, one could generate randomized SELFIES 18 | strings by generating randomized SMILES strings, and then translating them. 19 | 20 | By nature of SELFIES, it is impossible to represent molecules that 21 | violate the current semantic constraints as SELFIES strings. 22 | Thus, we provide the ``strict`` flag to guard against such cases. If 23 | ``strict=True``, then this function will raise a 24 | :class:`selfies.EncoderError` if the input SMILES string represents 25 | a molecule that violates the semantic constraints. If 26 | ``strict=False``, then this function will not raise any error; however, 27 | calling :func:`selfies.decoder` on a SELFIES string generated this 28 | way will *not* be guaranteed to recover a SMILES string representing 29 | the original molecule. 30 | 31 | :param smiles: the SMILES string to be translated. It is recommended to 32 | use RDKit to check that the strings passed into this function 33 | are valid SMILES strings. 
34 | :param strict: if ``True``, this function will check that the 35 | input SMILES string obeys the semantic constraints. 36 | Defaults to ``True``. 37 | :param attribute: if an attribution should be returned 38 | :return: a SELFIES string translated from the input SMILES string if 39 | attribute is ``False``, otherwise a tuple is returned of 40 | SELFIES string and attribution list. 41 | :raises EncoderError: if the input SMILES string is invalid, 42 | cannot be kekulized, or violates the semantic constraints with 43 | ``strict=True``. 44 | 45 | :Example: 46 | 47 | >>> import selfies as sf 48 | >>> sf.encoder("C=CF") 49 | '[C][=C][F]' 50 | 51 | .. note:: This function does not currently support SMILES with: 52 | 53 | * The wildcard symbol ``*``. 54 | * The quadruple bond symbol ``$``. 55 | * Chirality specifications other than ``@`` and ``@@``. 56 | * Ring bonds across a dot symbol (e.g. ``c1cc([O-].[Na+])ccc1``) or 57 | ring bonds between atoms that are over 4000 atoms apart. 58 | 59 | Although SELFIES does not have aromatic symbols, this function 60 | *does* support aromatic SMILES strings by internally kekulizing them 61 | before translation. 62 | """ 63 | 64 | try: 65 | mol = smiles_to_mol(smiles, attributable=attribute) 66 | except SMILESParserError as err: 67 | err_msg = "failed to parse input\n\tSMILES: {}".format(smiles) 68 | raise EncoderError(err_msg) from err 69 | 70 | if not mol.kekulize(): 71 | err_msg = "kekulization failed\n\tSMILES: {}".format(smiles) 72 | raise EncoderError(err_msg) 73 | 74 | if strict: 75 | _check_bond_constraints(mol, smiles) 76 | 77 | # invert chirality of atoms where necessary, 78 | # such that they are restored when the SELFIES is decoded 79 | for atom in mol.get_atoms(): 80 | if ((atom.chirality is not None) 81 | and mol.has_out_ring_bond(atom.index) 82 | and _should_invert_chirality(mol, atom)): 83 | atom.invert_chirality() 84 | 85 | fragments = [] 86 | attribution_maps = [] 87 | attribution_index = 0 88 | for root in mol.get_roots(): 89 | derived = list(_fragment_to_selfies( 90 | mol, None, root, attribution_maps, attribution_index)) 91 | attribution_index += len(derived) 92 | fragments.append("".join(derived)) 93 | # trim attribution map of empty tokens 94 | attribution_maps = [a for a in attribution_maps if a.token] 95 | result = ".".join(fragments), attribution_maps 96 | return result if attribute else result[0] 97 | 98 | 99 | def _check_bond_constraints(mol, smiles): 100 | errors = [] 101 | 102 | for atom in mol.get_atoms(): 103 | bond_cap = atom.bonding_capacity 104 | bond_count = mol.get_bond_count(atom.index) 105 | if bond_count > bond_cap: 106 | errors.append((atom_to_smiles(atom), bond_count, bond_cap)) 107 | 108 | if errors: 109 | err_msg = "input violates the currently-set semantic constraints\n" \ 110 | "\tSMILES: {}\n" \ 111 | "\tErrors:\n".format(smiles) 112 | for e in errors: 113 | err_msg += "\t[{:} with {} bond(s) - " \ 114 | "a max. of {} bond(s) was specified]\n".format(*e) 115 | raise EncoderError(err_msg) 116 | 117 | 118 | def _should_invert_chirality(mol, atom): 119 | out_bonds = mol.get_out_dirbonds(atom.index) 120 | 121 | # 1. rings whose right number are bonded to this atom (e.g. ...1...X1) 122 | # 2. rings whose left number are bonded to this atom (e.g. X1...1...) 123 | # 3. branches and other (e.g. X(...)...) 
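    # Illustrative note on the parity check at the end of this function
    # (hypothetical ordering, not taken from the original source): if the
    # reordered bonds correspond to perm = [2, 0, 1], there are two
    # inversions (2 > 0 and 2 > 1), an even count, so chirality is kept;
    # perm = [1, 0, 2] has one inversion, an odd count, so it is inverted.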
124 | partition = [[], [], []] 125 | for i, bond in enumerate(out_bonds): 126 | if not bond.ring_bond: 127 | partition[2].append(i) 128 | elif bond.src < bond.dst: 129 | partition[1].append(i) 130 | else: 131 | partition[0].append(i) 132 | partition[1].sort(key=lambda x: out_bonds[x].dst) 133 | 134 | # construct permutation 135 | perm = partition[0] + partition[1] + partition[2] 136 | count = 0 137 | for i in range(len(perm)): 138 | for j in range(i + 1, len(perm)): 139 | if perm[i] > perm[j]: 140 | count += 1 141 | return count % 2 != 0 # if odd permutation, should invert chirality 142 | 143 | 144 | def _fragment_to_selfies(mol, bond_into_root, root, 145 | attribution_maps, attribution_index=0): 146 | derived = [] 147 | 148 | bond_into_curr, curr = bond_into_root, root 149 | while True: 150 | curr_atom = mol.get_atom(curr) 151 | token = _atom_to_selfies(bond_into_curr, curr_atom) 152 | derived.append(token) 153 | 154 | attribution_maps.append(AttributionMap( 155 | len(derived) - 1 + attribution_index, 156 | token, mol.get_attribution(curr_atom))) 157 | 158 | out_bonds = mol.get_out_dirbonds(curr) 159 | for i, bond in enumerate(out_bonds): 160 | 161 | if bond.ring_bond: 162 | if bond.src < bond.dst: 163 | continue 164 | 165 | rev_bond = mol.get_dirbond(src=bond.dst, dst=bond.src) 166 | ring_len = bond.src - bond.dst 167 | Q_as_symbols = get_selfies_from_index(ring_len - 1) 168 | ring_symbol = "[{}Ring{}]".format( 169 | _ring_bonds_to_selfies(rev_bond, bond), 170 | len(Q_as_symbols) 171 | ) 172 | 173 | derived.append(ring_symbol) 174 | attribution_maps.append(AttributionMap( 175 | len(derived) - 1 + attribution_index, 176 | ring_symbol, mol.get_attribution(bond))) 177 | for symbol in Q_as_symbols: 178 | derived.append(symbol) 179 | attribution_maps.append(AttributionMap( 180 | len(derived) - 1 + attribution_index, 181 | symbol, mol.get_attribution(bond))) 182 | 183 | elif i == len(out_bonds) - 1: 184 | bond_into_curr, curr = bond, bond.dst 185 | 186 | else: 187 | # start, end are so we can go back and 188 | # correct offset from branch symbol in 189 | # branch tokens 190 | start = len(attribution_maps) 191 | branch = _fragment_to_selfies( 192 | mol, bond, bond.dst, attribution_maps, len(derived)) 193 | Q_as_symbols = get_selfies_from_index(len(branch) - 1) 194 | branch_symbol = "[{}Branch{}]".format( 195 | _bond_to_selfies(bond, show_stereo=False), 196 | len(Q_as_symbols) 197 | ) 198 | end = len(attribution_maps) 199 | 200 | derived.append(branch_symbol) 201 | for symbol in Q_as_symbols: 202 | derived.append(symbol) 203 | attribution_maps.append(AttributionMap( 204 | len(derived) - 1 + attribution_index, 205 | symbol, mol.get_attribution(bond))) 206 | 207 | # account for branch symbol because it is inserted after 208 | for j in range(start, end): 209 | attribution_maps[j].index += len(Q_as_symbols) + 1 210 | attribution_maps.append(AttributionMap( 211 | len(derived) - 1 + attribution_index, 212 | branch_symbol, mol.get_attribution(bond))) 213 | 214 | derived.extend(branch) 215 | 216 | # end of chain 217 | if (not out_bonds) or out_bonds[-1].ring_bond: 218 | break 219 | return derived 220 | 221 | 222 | def _bond_to_selfies(bond, show_stereo=True): 223 | if not show_stereo and (bond.order == 1): 224 | return "" 225 | return bond_to_smiles(bond) 226 | 227 | 228 | def _ring_bonds_to_selfies(lbond, rbond): 229 | assert lbond.order == rbond.order 230 | 231 | if (lbond.order != 1) or all(b.stereo is None for b in (lbond, rbond)): 232 | return _bond_to_selfies(lbond, show_stereo=False) 233 | else: 
234 | bond_char = "-" if (lbond.stereo is None) else lbond.stereo 235 | bond_char += "-" if (rbond.stereo is None) else rbond.stereo 236 | return bond_char 237 | 238 | 239 | def _atom_to_selfies(bond, atom): 240 | assert not atom.is_aromatic 241 | bond_char = "" if (bond is None) else _bond_to_selfies(bond) 242 | return "[{}{}]".format(bond_char, atom_to_smiles(atom, brackets=False)) 243 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2019 Mario Krenn 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /selfies/mol_graph.py: -------------------------------------------------------------------------------- 1 | import functools 2 | import itertools 3 | from typing import List, Optional, Union 4 | from dataclasses import dataclass, field 5 | 6 | from selfies.bond_constraints import get_bonding_capacity 7 | from selfies.constants import AROMATIC_VALENCES, VALENCE_ELECTRONS 8 | from selfies.utils.matching_utils import find_perfect_matching 9 | 10 | 11 | @dataclass 12 | class Attribution: 13 | """A dataclass that contains token string and its index. 14 | """ 15 | #: token index 16 | index: int 17 | #: token string 18 | token: str 19 | 20 | 21 | @dataclass 22 | class AttributionMap: 23 | """A mapping from input to single output token showing which 24 | input tokens created the output token. 25 | """ 26 | #: Index of output token 27 | index: int 28 | #: Output token 29 | token: str 30 | #: List of input tokens that created the output token 31 | attribution: List[Attribution] = field(default_factory=list) 32 | 33 | 34 | class Atom: 35 | """An atom with associated specifications (e.g. charge, chirality). 36 | """ 37 | 38 | def __init__( 39 | self, 40 | element: str, 41 | is_aromatic: bool, 42 | isotope: Optional[int] = None, 43 | chirality: Optional[str] = None, 44 | h_count: Optional[int] = None, 45 | charge: int = 0 46 | ): 47 | self.index = None 48 | self.element = element 49 | self.is_aromatic = is_aromatic 50 | self.isotope = isotope 51 | self.chirality = chirality 52 | self.h_count = h_count 53 | self.charge = charge 54 | 55 | @property 56 | @functools.lru_cache() 57 | def bonding_capacity(self): 58 | bond_cap = get_bonding_capacity(self.element, self.charge) 59 | bond_cap -= 0 if (self.h_count is None) else self.h_count 60 | return bond_cap 61 | 62 | def invert_chirality(self) -> None: 63 | if self.chirality == "@": 64 | self.chirality = "@@" 65 | elif self.chirality == "@@": 66 | self.chirality = "@" 67 | 68 | 69 | class DirectedBond: 70 | """A bond that contains directional information. 71 | """ 72 | 73 | def __init__( 74 | self, 75 | src: int, 76 | dst: int, 77 | order: Union[int, float], 78 | stereo: Optional[str], 79 | ring_bond: bool 80 | ): 81 | self.src = src 82 | self.dst = dst 83 | self.order = order 84 | self.stereo = stereo 85 | self.ring_bond = ring_bond 86 | 87 | 88 | class MolecularGraph: 89 | """A molecular graph. 90 | 91 | Molecules can be viewed as weighted undirected graphs. However, SMILES 92 | and SELFIES strings are more naturally represented as weighted directed 93 | graphs, where the direction of the edges specifies the order of atoms 94 | and bonds in the string. 
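
    A minimal usage sketch (illustrative only; it assumes the ``Atom`` class
    and the ``add_*`` methods defined in this module, and builds a two-atom
    C-O graph):

    >>> g = MolecularGraph()
    >>> c = g.add_atom(Atom("C", is_aromatic=False), mark_root=True)
    >>> o = g.add_atom(Atom("O", is_aromatic=False))
    >>> _ = g.add_bond(src=c.index, dst=o.index, order=1, stereo=None)
    >>> len(g), g.get_bond_count(c.index)
    (2, 1)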
95 | """ 96 | 97 | def __init__(self, attributable=False): 98 | self._roots = list() # stores root atoms, where traversal begins 99 | self._atoms = list() # stores atoms in this graph 100 | self._bond_dict = dict() # stores all bonds in this graph 101 | self._adj_list = list() # adjacency list, representing this graph 102 | self._bond_counts = list() # stores number of bonds an atom has made 103 | self._ring_bond_flags = list() # stores if an atom makes a ring bond 104 | self._delocal_subgraph = dict() # delocalization subgraph 105 | self._attribution = dict() # attribution of each atom/bond 106 | self._attributable = attributable 107 | 108 | def __len__(self): 109 | return len(self._atoms) 110 | 111 | def has_bond(self, a: int, b: int) -> bool: 112 | if a > b: 113 | a, b = b, a 114 | return (a, b) in self._bond_dict 115 | 116 | def has_out_ring_bond(self, src: int) -> bool: 117 | return self._ring_bond_flags[src] 118 | 119 | def get_attribution( 120 | self, 121 | o: Union[DirectedBond, Atom] 122 | ) -> List[Attribution]: 123 | if self._attributable and o in self._attribution: 124 | return self._attribution[o] 125 | return None 126 | 127 | def get_roots(self) -> List[int]: 128 | return self._roots 129 | 130 | def get_atom(self, idx: int) -> Atom: 131 | return self._atoms[idx] 132 | 133 | def get_atoms(self) -> List[Atom]: 134 | return self._atoms 135 | 136 | def get_dirbond(self, src, dst) -> DirectedBond: 137 | return self._bond_dict[(src, dst)] 138 | 139 | def get_out_dirbonds(self, src: int) -> List[DirectedBond]: 140 | return self._adj_list[src] 141 | 142 | def get_bond_count(self, idx: int) -> int: 143 | return self._bond_counts[idx] 144 | 145 | def add_atom(self, atom: Atom, mark_root: bool = False) -> Atom: 146 | atom.index = len(self) 147 | 148 | if mark_root: 149 | self._roots.append(atom.index) 150 | self._atoms.append(atom) 151 | self._adj_list.append(list()) 152 | self._bond_counts.append(0) 153 | self._ring_bond_flags.append(False) 154 | if atom.is_aromatic: 155 | self._delocal_subgraph[atom.index] = list() 156 | return atom 157 | 158 | def add_attribution( 159 | self, 160 | o: Union[DirectedBond, Atom], 161 | attr: List[Attribution] 162 | ) -> None: 163 | if self._attributable: 164 | if o in self._attribution: 165 | self._attribution[o].extend(attr) 166 | else: 167 | self._attribution[o] = attr 168 | 169 | def add_bond( 170 | self, src: int, dst: int, 171 | order: Union[int, float], stereo: str 172 | ) -> DirectedBond: 173 | assert src < dst 174 | 175 | bond = DirectedBond(src, dst, order, stereo, False) 176 | self._add_bond_at_loc(bond, -1) 177 | self._bond_counts[src] += order 178 | self._bond_counts[dst] += order 179 | 180 | if order == 1.5: 181 | self._delocal_subgraph.setdefault(src, []).append(dst) 182 | self._delocal_subgraph.setdefault(dst, []).append(src) 183 | return bond 184 | 185 | def add_placeholder_bond(self, src: int) -> int: 186 | out_edges = self._adj_list[src] 187 | out_edges.append(None) 188 | return len(out_edges) - 1 189 | 190 | def add_ring_bond( 191 | self, a: int, b: int, 192 | order: Union[int, float], 193 | a_stereo: Optional[str], b_stereo: Optional[str], 194 | a_pos: int = -1, b_pos: int = -1 195 | ) -> None: 196 | a_bond = DirectedBond(a, b, order, a_stereo, True) 197 | b_bond = DirectedBond(b, a, order, b_stereo, True) 198 | self._add_bond_at_loc(a_bond, a_pos) 199 | self._add_bond_at_loc(b_bond, b_pos) 200 | self._bond_counts[a] += order 201 | self._bond_counts[b] += order 202 | self._ring_bond_flags[a] = True 203 | self._ring_bond_flags[b] = 
True 204 | 205 | if order == 1.5: 206 | self._delocal_subgraph.setdefault(a, []).append(b) 207 | self._delocal_subgraph.setdefault(b, []).append(a) 208 | 209 | def update_bond_order( 210 | self, a: int, b: int, 211 | new_order: Union[int, float] 212 | ) -> None: 213 | assert 1 <= new_order <= 3 214 | 215 | if a > b: 216 | a, b = b, a # swap so that a < b 217 | a_to_b = self._bond_dict[(a, b)] # prev step guarantees existence 218 | if new_order == a_to_b.order: 219 | return 220 | elif a_to_b.ring_bond: 221 | b_to_a = self._bond_dict[(b, a)] 222 | bonds = (a_to_b, b_to_a) 223 | else: 224 | bonds = (a_to_b,) 225 | 226 | old_order = bonds[0].order 227 | for bond in bonds: 228 | bond.order = new_order 229 | self._bond_counts[a] += (new_order - old_order) 230 | self._bond_counts[b] += (new_order - old_order) 231 | 232 | def _add_bond_at_loc(self, bond, pos): 233 | self._bond_dict[(bond.src, bond.dst)] = bond 234 | 235 | out_edges = self._adj_list[bond.src] 236 | if (pos == -1) or (pos == len(out_edges)): 237 | out_edges.append(bond) 238 | elif out_edges[pos] is None: 239 | out_edges[pos] = bond 240 | else: 241 | out_edges.insert(pos, bond) 242 | 243 | def is_kekulized(self) -> bool: 244 | return not self._delocal_subgraph 245 | 246 | def kekulize(self) -> bool: 247 | # Algorithm based on Depth-First article by Richard L. Apodaca 248 | # Reference: 249 | # https://depth-first.com/articles/2020/02/10/ 250 | # a-comprehensive-treatment-of-aromaticity-in-the-smiles-language/ 251 | 252 | if self.is_kekulized(): 253 | return True 254 | 255 | ds = self._delocal_subgraph 256 | kept_nodes = set(itertools.filterfalse(self._prune_from_ds, ds)) 257 | 258 | # relabel kept DS nodes to be 0, 1, 2, ... 259 | label_to_node = list(sorted(kept_nodes)) 260 | node_to_label = {v: i for i, v in enumerate(label_to_node)} 261 | 262 | # pruned and relabelled DS 263 | pruned_ds = [list() for _ in range(len(kept_nodes))] 264 | for node in kept_nodes: 265 | label = node_to_label[node] 266 | for adj in filter(lambda v: v in kept_nodes, ds[node]): 267 | pruned_ds[label].append(node_to_label[adj]) 268 | 269 | matching = find_perfect_matching(pruned_ds) 270 | if matching is None: 271 | return False 272 | 273 | # de-aromatize and then make double bonds 274 | for node in ds: 275 | for adj in ds[node]: 276 | self.update_bond_order(node, adj, new_order=1) 277 | self._atoms[node].is_aromatic = False 278 | self._bond_counts[node] = int(self._bond_counts[node]) 279 | 280 | for matched_labels in enumerate(matching): 281 | matched_nodes = tuple(label_to_node[i] for i in matched_labels) 282 | self.update_bond_order(*matched_nodes, new_order=2) 283 | 284 | self._delocal_subgraph = dict() # clear DS 285 | return True 286 | 287 | def _prune_from_ds(self, node): 288 | adj_nodes = self._delocal_subgraph[node] 289 | if not adj_nodes: 290 | return True # aromatic atom with no aromatic bonds 291 | 292 | atom = self._atoms[node] 293 | valences = AROMATIC_VALENCES[atom.element] 294 | 295 | # each bond in DS has order 1.5 - we treat them as single bonds 296 | used_electrons = int(self._bond_counts[node] - 0.5 * len(adj_nodes)) 297 | 298 | if atom.h_count is None: # account for implicit Hs 299 | assert atom.charge == 0 300 | return any(used_electrons == v for v in valences) 301 | else: 302 | valence = valences[-1] - atom.charge 303 | used_electrons += atom.h_count 304 | 305 | # count the total number of bound electrons of each atom 306 | bound_electrons = (max(0, atom.charge) + atom.h_count 307 | + int(self._bond_counts[node]) 308 | + int(2 * 
(self._bond_counts[node] % 1))) 309 | 310 | # calculate the number of unpaired electrons of each atom 311 | radical_electrons = (max(0, VALENCE_ELECTRONS[atom.element] 312 | - bound_electrons) % 2) 313 | 314 | # unpaired electrons do not contribute to the aromatic system 315 | free_electrons = valence - used_electrons - radical_electrons 316 | 317 | if any(used_electrons == v - atom.charge for v in valences): 318 | return True 319 | else: 320 | return not ((free_electrons >= 0) and (free_electrons % 2 != 0)) -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_smiles/GAN.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Akshat 3 | """ 4 | import torch 5 | import random 6 | import torch.utils.data 7 | from torch import nn, optim 8 | from torch.autograd.variable import Variable 9 | import os 10 | import torch.nn.functional as F 11 | from rdkit import Chem 12 | from rdkit.Chem import Draw 13 | from one_hot_converter import multiple_smile_to_hot, hot_to_smile, check_conversion_bijection 14 | import numpy as np 15 | import matplotlib.pyplot as plt 16 | from tensorboardX import SummaryWriter 17 | writer = SummaryWriter() 18 | 19 | random.seed(4001) 20 | 21 | def load_data(cut_off=None): 22 | ''' 23 | Ensuring correct Bijection: 24 | check_conversion_bijection(smiles_list=A, largest_smile_len=len(max(A, key=len)), alphabet=alphabets) 25 | ''' 26 | with open('smiles_qm9.txt') as f: 27 | content = f.readlines() 28 | content = content[1:] 29 | content = [x.strip() for x in content] 30 | A = [x.split(',')[1] for x in content] 31 | if cut_off is not None: 32 | A = A[0:cut_off] 33 | return A 34 | 35 | alphabets = ['C', '1', '(', '#', 'N', '3', '5', 'O', '2', 'F', '=', '4', ')', ' '] 36 | 37 | 38 | 39 | # Read in the QM9 dataset 40 | A = load_data(cut_off=None) 41 | data = multiple_smile_to_hot(A, len(max(A, key=len)), alphabets, 0) 42 | print('Data shape: ', data.shape) 43 | one_hot_len_comb = data.shape[1]*data.shape[2] 44 | data = data.reshape(( data.shape[0], one_hot_len_comb)) 45 | data = torch.tensor(data, dtype=torch.float) 46 | 47 | data_loader = torch.utils.data.DataLoader(data, batch_size=1024, shuffle=True) 48 | num_batches = len(data_loader) 49 | print('DATA Acquired!') 50 | 51 | 52 | def get_canon_smi_ls(smiles_ls): 53 | ''' 54 | return a list of canonical smiles in smiles_ls 55 | ''' 56 | canon_ls = [Chem.MolToSmiles(Chem.MolFromSmiles(smi), isomericSmiles=False, canonical=True) for smi in smiles_ls] 57 | return canon_ls 58 | 59 | def _make_dir(directory): 60 | os.makedirs(directory) 61 | 62 | def save_models(generator, discriminator, epoch, dir_name): 63 | out_dir = './{}/saved_models/{}'.format(dir_name, epoch) 64 | _make_dir(out_dir) 65 | torch.save(generator, '{}/G'.format(out_dir)) 66 | torch.save(discriminator, '{}/D'.format(out_dir)) 67 | 68 | def display_status(epoch, num_epochs, n_batch, num_batches, d_error, g_error, d_pred_real, d_pred_fake): 69 | 70 | # var_class = torch.autograd.variable.Variable 71 | if isinstance(d_error, torch.autograd.Variable): 72 | d_error = d_error.data.cpu().numpy() 73 | if isinstance(g_error, torch.autograd.Variable): 74 | g_error = g_error.data.cpu().numpy() 75 | if isinstance(d_pred_real, torch.autograd.Variable): 76 | d_pred_real = d_pred_real.data 77 | if isinstance(d_pred_fake, torch.autograd.Variable): 78 | d_pred_fake = d_pred_fake.data 79 | 80 | 81 | print('Epoch: [{}/{}], Batch Num: [{}/{}]'.format( 82 | epoch,num_epochs, 
n_batch, num_batches) 83 | ) 84 | print('Discriminator Loss: {:.4f}, Generator Loss: {:.4f}'.format(d_error, g_error)) 85 | print('D(x): {:.4f}, D(G(z)): {:.4f}'.format(d_pred_real.mean(), d_pred_fake.mean())) 86 | writer.add_scalar('D(x)', d_pred_real.mean(), epoch) 87 | writer.add_scalar('D(G(z)', d_pred_fake.mean(), epoch) 88 | 89 | 90 | class DiscriminatorNet(torch.nn.Module): 91 | """ 92 | A three hidden-layer discriminative neural network 93 | """ 94 | def __init__(self, drop_rate, layer_2_size): 95 | super(DiscriminatorNet, self).__init__() 96 | n_features = one_hot_len_comb 97 | n_out = 1 98 | 99 | self.hidden0 = nn.Sequential( 100 | nn.Linear(n_features, layer_2_size), 101 | nn.LeakyReLU(0.2), 102 | nn.Dropout(drop_rate) 103 | ) 104 | 105 | self.out = nn.Sequential( 106 | torch.nn.Linear(layer_2_size, n_out), 107 | nn.Sigmoid() 108 | ) 109 | 110 | def forward(self, x): 111 | x = self.hidden0(x) 112 | x = self.out(x) 113 | return x 114 | 115 | 116 | class GeneratorNet(torch.nn.Module): 117 | """ 118 | A three hidden-layer generative neural network 119 | """ 120 | def __init__(self, prior_lv_size, layer_interm_size): 121 | super(GeneratorNet, self).__init__() 122 | n_out = one_hot_len_comb 123 | 124 | self.hidden0 = nn.Sequential( 125 | nn.Linear(prior_lv_size, layer_interm_size), 126 | nn.LeakyReLU(0.2) 127 | ) 128 | 129 | self.out = nn.Sequential( 130 | nn.Linear(layer_interm_size, n_out), 131 | nn.Sigmoid() # Help predict 0 or 1 132 | ) 133 | 134 | def forward(self, x): 135 | x = self.hidden0(x) 136 | x = self.out(x) 137 | return x 138 | 139 | def noise(size, G_start_layer_size): 140 | ''' 141 | Standard nois, which acts as input to the GAN generator . 142 | ''' 143 | n = Variable(torch.randn(size, G_start_layer_size)) 144 | if torch.cuda.is_available(): return n.cuda() 145 | return n 146 | 147 | 148 | def train_discriminator(optimizer, real_data, fake_data, discriminator, criterion): 149 | optimizer.zero_grad() 150 | 151 | # 1.1 Train on Real Data 152 | prediction_real = discriminator(real_data) 153 | y_real = Variable(torch.ones(prediction_real.shape[0], 1)) 154 | if torch.cuda.is_available(): 155 | D_real_loss = criterion(prediction_real, y_real.cuda()) 156 | else: 157 | D_real_loss = criterion(prediction_real, y_real) 158 | 159 | # 1.2 Train on Fake Data 160 | prediction_fake = discriminator(fake_data) 161 | y_fake = Variable(torch.zeros(prediction_fake.shape[0], 1)) 162 | if torch.cuda.is_available(): 163 | D_fake_loss = criterion(prediction_fake, y_fake.cuda()) 164 | else: 165 | D_fake_loss = criterion(prediction_fake, y_fake) 166 | 167 | D_loss = D_real_loss + D_fake_loss 168 | D_loss.backward() 169 | optimizer.step() 170 | 171 | # Return error 172 | return D_real_loss + D_fake_loss, prediction_real, prediction_fake, discriminator 173 | 174 | 175 | def train_generator(optimizer, fake_data, criterion, discriminator): 176 | optimizer.zero_grad() 177 | prediction = discriminator(fake_data) 178 | y = Variable(torch.ones(prediction.shape[0], 1)) 179 | if torch.cuda.is_available(): 180 | G_loss = criterion(prediction, y.cuda(0)) 181 | else: 182 | G_loss = criterion(prediction, y) 183 | G_loss.backward() 184 | 185 | optimizer.step() 186 | return G_loss.data.item(), discriminator 187 | 188 | 189 | def train_gan(lr_disc, lr_genr, prior_lv_size, layer_interm_size_G, discr_dropout, discr_layer_2_size, num_epochs, dir_name, num_unique): 190 | ''' 191 | All the hyper parameters are to be added as parameters to this function!!! 
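
    Illustrative call (hypothetical values, chosen from the sampling ranges
    in the __main__ block below rather than from any original run):

        train_gan(lr_disc=1e-5, lr_genr=5e-5,
                  prior_lv_size=100, layer_interm_size_G=1000,
                  discr_dropout=0.3, discr_layer_2_size=50,
                  num_epochs=1500, dir_name='0', num_unique=[])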
192 | ''' 193 | criterion = nn.BCELoss() 194 | discriminator = DiscriminatorNet(drop_rate=discr_dropout, layer_2_size=discr_layer_2_size) 195 | generator = GeneratorNet(prior_lv_size=prior_lv_size, layer_interm_size=layer_interm_size_G) 196 | criterion = nn.BCELoss() 197 | if torch.cuda.is_available(): 198 | discriminator.cuda() 199 | generator.cuda() 200 | 201 | 202 | # Optimizers (notice the use of 'discriminator'<-Object class) 203 | d_optimizer = optim.Adam(discriminator.parameters(), lr=lr_disc) 204 | g_optimizer = optim.Adam(generator.parameters(), lr=lr_genr) # 1: 3e-4; 2: 1e-5 205 | 206 | for epoch in range(num_epochs+1): 207 | torch.cuda.empty_cache() 208 | print('Epoch: ', epoch) 209 | # batch_num real_data 210 | for n_batch, real_batch in enumerate(data_loader): 211 | 212 | # 1. Train Discriminator 213 | real_data = Variable(real_batch) 214 | 215 | if torch.cuda.is_available(): 216 | real_data = real_data.cuda() 217 | 218 | # Generate fake data 219 | fake_data = generator(noise(real_data.size(0), prior_lv_size)).detach() 220 | 221 | # Train D 222 | d_error, d_pred_real, d_pred_fake, discriminator = train_discriminator(d_optimizer, real_data, fake_data, discriminator, criterion) 223 | 224 | 225 | # 2. Train Generator 226 | # Generate fake data 227 | fake_data = generator(noise(real_batch.size(0), prior_lv_size)) 228 | 229 | # Train G 230 | g_error, discriminator = train_generator(g_optimizer, fake_data, criterion, discriminator) 231 | 232 | 233 | # Log onto a graph 234 | writer.add_scalar('D_error', d_error, epoch * num_batches + n_batch) 235 | writer.add_scalar('G_error', g_error, epoch * num_batches + n_batch) 236 | 237 | if epoch % 10 == 0 and epoch > 0: 238 | generator = generator.eval() 239 | # Display complete training stats 240 | display_status( 241 | epoch, num_epochs, n_batch, num_batches, 242 | d_error, g_error, d_pred_real, d_pred_fake 243 | ) 244 | 245 | smiles_ls = [] 246 | print('Sampling....') 247 | for _ in range(10000): 248 | # Display a random molecule (for sanity) (make sure in eval mode!) 
249 | T = generator(noise(1, prior_lv_size)).cpu().detach().numpy().flatten() # Just chose one molecule 250 | T = T.reshape((22, 14)) 251 | 252 | smile = hot_to_smile(T, alphabets) 253 | if ' ' in smile: 254 | smile = smile[:smile.find(' ')] 255 | mol = Chem.MolFromSmiles(smile) 256 | if mol is not None: 257 | smiles_ls.append(smile) 258 | print('unique molecules: ', len(set(get_canon_smi_ls(smiles_ls)))) 259 | writer.add_scalar('Num Smiles', len(set(get_canon_smi_ls(smiles_ls))), epoch) 260 | 261 | # Write TensorBoard curv on to a text file 262 | f = open('{}/smiles_curve.txt'.format(dir_name), 'a+') 263 | f.write(str(len(set(get_canon_smi_ls(smiles_ls))))+ '\n') 264 | f.close() 265 | 266 | # Save the TensorBoard models 267 | save_models(generator, discriminator, epoch, dir_name) 268 | num_unique.append(len(set(get_canon_smi_ls(smiles_ls)))) 269 | 270 | 271 | if epoch > 100: 272 | A = np.array(num_unique) 273 | stopping_criteria = A.max() - ( 6 * ( np.sqrt(A.max()))) 274 | if stopping_criteria < 0: 275 | stopping_criteria = 0 276 | print('Stopping Criteria: ', stopping_criteria) 277 | print('Array A: ', A) 278 | if num_unique[-1] <= stopping_criteria or max(A) < 50: 279 | # Write the stopping epoch onto a text file 280 | f = open('{}/stoping_epoch.txt'.format(dir_name), 'a+') 281 | f.write('Early stopping Epoch: ' +str(epoch)+ '\n') 282 | f.close() 283 | return 284 | 285 | 286 | print(set(smiles_ls)) 287 | generator = generator.train() 288 | 289 | 290 | 291 | if __name__ == '__main__': 292 | 293 | for model_iter in range(100): 294 | dir_name = str(model_iter) # Directory for saving all the data 295 | 296 | os.makedirs(dir_name) 297 | num_unique = [] 298 | 299 | ## HYPERPARAMETER SELECTION! 300 | 301 | num_epochs = 1500 302 | 303 | # Let me select the learning rates: 304 | lr_disc = 10 ** random.uniform(-7, -4) 305 | lr_genr = 10 ** random.uniform(-7, -4) 306 | 307 | # Generator 308 | prior_lv_size = random.randint(50, 300) 309 | layer_interm_size_G = random.randint(300, 3000) 310 | 311 | # Discriminator 312 | discr_dropout = random.uniform(0, 0.8) 313 | discr_layer_2_size = random.randint(5, 100) 314 | 315 | # Save all the selected hyperparamters: 316 | f = open('{}/hyperparams.txt'.format(dir_name), 'a+') 317 | f.write('lr Discr: ' + str(lr_disc) + '\n') 318 | f.write('lr Gener: ' + str(lr_genr)+ '\n') 319 | f.write('Gener Sampling layer size: ' + str(prior_lv_size)+ '\n') 320 | f.write('Gener middle layer size: ' + str(layer_interm_size_G)+ '\n') 321 | f.write('Discr dropout: ' + str(discr_dropout)+ '\n') 322 | f.write('Discr middle layer size: ' + str(discr_layer_2_size)+ '\n') 323 | f.close() 324 | 325 | train_gan(lr_disc, lr_genr, prior_lv_size, layer_interm_size_G, discr_dropout, discr_layer_2_size, num_epochs, dir_name, num_unique) 326 | torch.cuda.empty_cache() 327 | 328 | 329 | 330 | 331 | 332 | -------------------------------------------------------------------------------- /tests/test_specific_cases.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | import selfies as sf 4 | 5 | 6 | def decode_eq(selfies, smiles): 7 | s = sf.decoder(selfies) 8 | return s == smiles 9 | 10 | 11 | def roundtrip_eq(smiles_in, smiles_out): 12 | sel = sf.encoder(smiles_in) 13 | smi = sf.decoder(sel) 14 | return smi == smiles_out 15 | 16 | 17 | def test_branch_and_ring_at_state_X0(): 18 | """Tests SELFIES with branches and rings at state X0 (i.e. at the 19 | very beginning of a SELFIES). These symbols should be skipped. 
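    For example, ``[Ring3][C][S][C][O]`` should decode as if the leading
    ring symbol were absent, i.e. to the same SMILES as ``[C][S][C][O]``
    ("CSCO"), as the cases below check.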
20 | """ 21 | 22 | assert decode_eq("[Branch3][C][S][C][O]", "CSCO") 23 | assert decode_eq("[Ring3][C][S][C][O]", "CSCO") 24 | assert decode_eq("[Branch1][Ring1][Ring3][C][S][C][O]", "CSCO") 25 | 26 | 27 | def test_branch_at_state_X1(): 28 | """Test SELFIES with branches at state X1 (i.e. at an atom that 29 | can only make one bond. In this case, the branch symbol should be skipped. 30 | """ 31 | 32 | assert decode_eq("[C][C][O][Branch1][C][I]", "CCOCI") 33 | assert decode_eq("[C][C][C][O][#Branch3][C][I]", "CCCOCI") 34 | 35 | 36 | def test_branch_and_ring_decrement_state(): 37 | """Tests that the branch and ring symbols properly decrement the 38 | derivation state. 39 | """ 40 | 41 | assert decode_eq("[C][C][C][Ring1][Ring1][#C]", "C1CC1=C") 42 | assert decode_eq("[C][=C][C][C][#Ring1][Ring1][#C]", "C=C1CC1") 43 | assert decode_eq("[C][O][C][C][=Ring1][Ring1][#C]", "COCCC") 44 | 45 | assert decode_eq("[C][=C][Branch1][C][=C][#C]", "C=C(C)C") 46 | 47 | 48 | def test_branch_at_end_of_selfies(): 49 | """Test SELFIES that have a branch symbol as its very last symbol. 50 | """ 51 | 52 | assert decode_eq("[C][C][C][C][Branch1]", "CCCC") 53 | assert decode_eq("[C][C][C][C][#Branch3]", "CCCC") 54 | 55 | 56 | def test_ring_at_end_of_selfies(): 57 | """Test SELFIES that have a ring symbol as its very last symbol. 58 | """ 59 | 60 | assert decode_eq("[C][C][C][C][C][Ring1]", "CCCC=C") 61 | assert decode_eq("[C][C][C][C][C][Ring3]", "CCCC=C") 62 | 63 | 64 | def test_branch_with_no_atoms(): 65 | """Test SELFIES that have a branch, but the branch has no atoms in it. 66 | Such branches should not be made in the outputted SMILES. 67 | """ 68 | 69 | s = "[C][Branch1][Ring2][Branch1][Branch1][Branch1][F]" 70 | assert decode_eq(s, "CF") 71 | 72 | s = "[C][Branch1][Ring2][Ring1][Ring1][Branch1][F]" 73 | assert decode_eq(s, "CF") 74 | 75 | s = "[C][=Branch1][Ring2][Branch1][C][Cl][F]" 76 | assert decode_eq(s, "C(Cl)F") 77 | 78 | # special case: #Branch3 takes Q_1, Q_2 = [O] and Q_3 = ''. However, 79 | # there are no more symbols in the branch. 80 | assert decode_eq("[C][C][C][C][#Branch3][O][O]", "CCCC") 81 | 82 | 83 | def test_oversized_branch(): 84 | """Test SELFIES that have a branch, with Q larger than the length 85 | of the SELFIES 86 | """ 87 | 88 | assert decode_eq("[C][Branch2][O][O][C][C][S][F][C]", "CCCSF") 89 | assert decode_eq("[C][#Branch2][O][O][#C][C][S][F]", "C#CCSF") 90 | 91 | 92 | def test_oversized_ring(): 93 | """Test SELFIES that have a ring, with Q so large that the (Q + 1)-th 94 | previously derived atom does not exist. 95 | """ 96 | 97 | assert decode_eq("[C][C][C][C][Ring1][O]", "C1CCC1") 98 | assert decode_eq("[C][C][C][C][Ring2][O][C]", "C1CCC1") 99 | 100 | # special case: Ring2 takes Q_1 = [O] and Q_2 = '', leading to 101 | # Q = 9 * 16 + 0 (i.e. an oversized ring) 102 | assert decode_eq("[C][C][C][C][Ring2][O]", "C1CCC1") 103 | 104 | # special case: ring between 1st atom and 1st atom should not be formed 105 | assert decode_eq("[C][Ring1][O]", "C") 106 | 107 | 108 | def test_branch_at_beginning_of_branch(): 109 | """Test SELFIES that have a branch immediately at the start of a branch. 
110 | """ 111 | 112 | # [C@]((Br)Cl)F 113 | s = "[C@][=Branch1][Branch1][Branch1][C][Br][Cl][F]" 114 | assert decode_eq(s, "[C@](Br)(Cl)F") 115 | 116 | # [C@](((Br)Cl)I)F 117 | s = "[C@][#Branch1][Branch2][=Branch1][Branch1][Branch1][C][Br][Cl][I][F]" 118 | assert decode_eq(s, "[C@](Br)(Cl)(I)F") 119 | 120 | # [C@]((Br)(Cl)I)F 121 | s = "[C@][#Branch1][Branch2][Branch1][C][Br][Branch1][C][Cl][I][F]" 122 | assert decode_eq(s, "[C@](Br)(Cl)(I)F") 123 | 124 | 125 | def test_ring_at_beginning_of_branch(): 126 | """Test SELFIES that have a ring immediately at the start of a branch. 127 | """ 128 | 129 | # CC1CCC(1CCl)F 130 | s = "[C][C][C][C][C][=Branch1][Branch1][Ring1][Ring2][C][Cl][F]" 131 | assert decode_eq(s, "CC1CCC1(CCl)F") 132 | 133 | # CC1CCS(Br)(1CCl)F 134 | s = "[C][C][C][C][S][Branch1][C][Br]" \ 135 | "[=Branch1][Branch1][Ring1][Ring2][C][Cl][F]" 136 | assert decode_eq(s, "CC1CCS1(Br)(CCl)F") 137 | 138 | 139 | def test_branch_and_ring_at_beginning_of_branch(): 140 | """Test SELFIES that have a branch and ring immediately at the start 141 | of a branch. 142 | """ 143 | 144 | # CC1CCCS((Br)1Cl)F 145 | s = "[C][C][C][C][C][S][#Branch1][#Branch1][Branch1][C][Br]" \ 146 | "[Ring1][Branch1][Cl][F]" 147 | assert decode_eq(s, "CC1CCCS1(Br)(Cl)F") 148 | 149 | # CC1CCCS(1(Br)Cl)F 150 | s = "[C][C][C][C][C][S][#Branch1][#Branch1][Ring1][Branch1]" \ 151 | "[Branch1][C][Br][Cl][F]" 152 | assert decode_eq(s, "CC1CCCS1(Br)(Cl)F") 153 | 154 | 155 | def test_ring_immediately_following_branch(): 156 | """Test SELFIES that have a ring immediately following after a branch. 157 | """ 158 | 159 | # CCC1CCCC(OCO)1 160 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][Ring1][Branch1]" 161 | assert decode_eq(s, "CCC1CCCC1OCO") 162 | 163 | # CCC1CCCC(OCO)(F)1 164 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \ 165 | "[Branch1][C][F][Ring1][Branch1]" 166 | assert decode_eq(s, "CCC1CCCC1(OCO)F") 167 | 168 | 169 | def test_ring_after_branch(): 170 | """Tests SELFIES that have a ring following a branch, but not 171 | immediately after a branch. 172 | """ 173 | 174 | # CCCCCCC1(OCO)1 175 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][C][Ring1][Branch1]" 176 | assert decode_eq(s, "CCCCCCC(OCO)=C") 177 | 178 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \ 179 | "[Branch1][C][F][C][C][Ring1][=Branch2]" 180 | assert decode_eq(s, "CCCCC1CC(OCO)(F)CC1") 181 | 182 | 183 | def test_ring_on_top_of_existing_bond(): 184 | """Tests SELFIES with rings between two atoms that are already bonded 185 | in the main scaffold. 186 | """ 187 | 188 | # C1C1, C1C=1, C1C#1, ... 189 | assert decode_eq("[C][C][Ring1][C]", "C=C") 190 | assert decode_eq("[C][/C][Ring1][C]", "C=C") 191 | assert decode_eq("[C][C][=Ring1][C]", "C#C") 192 | assert decode_eq("[C][C][#Ring1][C]", "C#C") 193 | 194 | 195 | def test_consecutive_rings(): 196 | """Test SELFIES which have multiple consecutive rings. 
197 | """ 198 | 199 | s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2]" 200 | assert decode_eq(s, "C=1CCC=1") # 1 + 1 201 | 202 | s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2][Ring1][Ring2]" 203 | assert decode_eq(s, "C#1CCC#1") # 1 + 1 + 1 204 | 205 | s = "[C][C][C][C][=Ring1][Ring2][Ring1][Ring2]" 206 | assert decode_eq(s, "C#1CCC#1") # 2 + 1 207 | 208 | s = "[C][C][C][C][Ring1][Ring2][=Ring1][Ring2]" 209 | assert decode_eq(s, "C#1CCC#1") # 1 + 2 210 | 211 | # consecutive rings that exceed bond constraints 212 | s = "[C][C][C][C][#Ring1][Ring2][=Ring1][Ring2]" 213 | assert decode_eq(s, "C#1CCC#1") # 3 + 2 214 | 215 | s = "[C][C][C][C][=Ring1][Ring2][#Ring1][Ring2]" 216 | assert decode_eq(s, "C#1CCC#1") # 2 + 3 217 | 218 | s = "[C][C][C][C][=Ring1][Ring2][=Ring1][Ring2]" 219 | assert decode_eq(s, "C#1CCC#1") # 2 + 2 220 | 221 | # consecutive rings with stereochemical single bond 222 | s = "[C][C][C][C][\\/Ring1][Ring2]" 223 | assert decode_eq(s, "C\\1CCC/1") 224 | 225 | s = "[C][C][C][C][\\/Ring1][Ring2][Ring1][Ring2]" 226 | assert decode_eq(s, "C=1CCC=1") 227 | 228 | 229 | def test_unconstrained_symbols(): 230 | """Tests SELFIES with symbols that are not semantically constrained. 231 | """ 232 | 233 | f_branch = "[Branch1][C][F]" 234 | s = "[Xe-2]" + (f_branch * 8) 235 | assert decode_eq(s, "[Xe-2](F)(F)(F)(F)(F)(F)(F)CF") 236 | 237 | # change default semantic constraints 238 | constraints = sf.get_semantic_constraints() 239 | constraints["?"] = 2 240 | sf.set_semantic_constraints(constraints) 241 | 242 | assert decode_eq(s, "[Xe-2](F)CF") 243 | 244 | sf.set_semantic_constraints() 245 | 246 | 247 | def test_isotope_symbols(): 248 | """Tests that SELFIES symbols with isotope specifications are 249 | constrained properly. 250 | """ 251 | 252 | s = "[13C][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]" 253 | assert decode_eq(s, "[13C](Cl)(F)(Br)CI") 254 | 255 | assert decode_eq("[C][36Cl][C]", "C[36Cl]") 256 | 257 | 258 | def test_chiral_symbols(): 259 | """Tests that SELFIES symbols with chirality specifications are 260 | constrained properly. 261 | """ 262 | 263 | s = "[C@@][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]" 264 | assert decode_eq(s, "[C@@](Cl)(F)(Br)CI") 265 | 266 | s = "[C@H1][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br]" 267 | assert decode_eq(s, "[C@H1](Cl)(F)CBr") 268 | 269 | 270 | def test_explicit_hydrogen_symbols(): 271 | """Tests that SELFIES symbols with explicit hydrogen specifications 272 | are constrained properly. 273 | """ 274 | 275 | assert decode_eq("[CH1][Branch1][C][Cl][#C]", "[CH1](Cl)=C") 276 | assert decode_eq("[CH3][=C]", "[CH3]C") 277 | 278 | assert decode_eq("[CH4][C][C]", "[CH4]") 279 | assert decode_eq("[C][C][C][CH4]", "CCC") 280 | assert decode_eq("[C][Branch1][Ring2][C][=CH4][C][=C]", "C(C)=C") 281 | 282 | with pytest.raises(sf.DecoderError): 283 | sf.decoder("[C][C][CH5]") 284 | with pytest.raises(sf.DecoderError): 285 | sf.decoder("[C][C][C][OH9]") 286 | 287 | 288 | def test_charged_symbols(): 289 | """Tests that SELFIES symbols with charges are constrained properly. 290 | """ 291 | 292 | constraints = sf.get_semantic_constraints() 293 | constraints["Sn+4"] = 1 294 | constraints["O-2"] = 2 295 | sf.set_semantic_constraints(constraints) 296 | 297 | # the following molecules don't make sense, but we use them to test 298 | # selfies. 
Hence, we can't verify them with RDKit 299 | assert decode_eq("[Sn+4][=C]", "[Sn+4]C") 300 | assert decode_eq("[O-2][#C]", "[O-2]=C") 301 | 302 | # mixing many symbol types 303 | assert decode_eq("[17O@@H1-2][#C]", "[17O@@H1-2]C") 304 | 305 | sf.set_semantic_constraints() 306 | 307 | 308 | def test_standardized_alphabet(): 309 | """Tests that equivalent SMILES atom symbols are translated into the 310 | same SELFIES atom symbol. 311 | """ 312 | 313 | assert sf.encoder("[C][O][N][P][F]") == "[CH0][OH0][NH0][PH0][FH0]" 314 | assert sf.encoder("[Fe][Si]") == "[Fe][Si]" 315 | assert sf.encoder("[Fe++][Fe+2]") == "[Fe+2][Fe+2]" 316 | assert sf.encoder("[CH][CH1]") == "[CH1][CH1]" 317 | 318 | 319 | def test_old_symbols(): 320 | """Tests backward compatibility of SELFIES with old ( 0: 244 | generator = generator.eval() 245 | # Display complete training stats 246 | display_status( 247 | epoch, num_epochs, n_batch, num_batches, 248 | d_error, g_error, d_pred_real, d_pred_fake 249 | ) 250 | 251 | smiles_ls = [] 252 | print('Sampling....') 253 | for _ in range(10000): 254 | 255 | T = generator(noise(1, prior_lv_size)).cpu().detach().numpy().flatten() # Just chose one molecule 256 | T = T.reshape((21, 14)) 257 | 258 | smile = hot_to_smile(T, alphabets) 259 | 260 | if ' ' in smile: 261 | smile = smile[:smile.find(' ')] 262 | 263 | # Convert SELFIE to SMILE here 264 | smile = IncludeRingsForSMILES(GrammarPlusToSMILES(smile,'X0')) 265 | print(smile) 266 | mol = Chem.MolFromSmiles(smile) 267 | if mol is not None: 268 | smiles_ls.append(smile) 269 | print('unique molecules: ', len(set(get_canon_smi_ls(smiles_ls)))) 270 | writer.add_scalar('Num Smiles', len(set(get_canon_smi_ls(smiles_ls))), epoch) 271 | 272 | # Write TensorBoard curv on to a text file 273 | f = open('{}/smiles_curve.txt'.format(dir_name), 'a+') 274 | f.write(str(len(set(get_canon_smi_ls(smiles_ls))))+ '\n') 275 | f.close() 276 | 277 | # Save the TensorBoard models 278 | save_models(generator, discriminator, epoch, dir_name) 279 | 280 | num_unique.append(len(set(get_canon_smi_ls(smiles_ls)))) 281 | 282 | 283 | # Early stopping criteria for the model to stop training 284 | if epoch > 100: 285 | A = np.array(num_unique) 286 | stopping_criteria = A.max() - ( 6 * ( np.sqrt(A.max()))) 287 | if stopping_criteria < 0: 288 | stopping_criteria = 0 289 | print('Stopping Criteria: ', stopping_criteria) 290 | print('Array A: ', A) 291 | if num_unique[-1] <= stopping_criteria or max(A) < 50: 292 | # Write the stopping epoch onto a text file 293 | f = open('{}/stoping_epoch.txt'.format(dir_name), 'a+') 294 | f.write('Early stopping Epoch: ' +str(epoch)+ '\n') 295 | f.close() 296 | return 297 | 298 | print(set(smiles_ls)) 299 | generator = generator.train() 300 | 301 | 302 | 303 | if __name__ == '__main__': 304 | 305 | num_model = 1 # Number of GAN models to be run (each will have different - randomly initialized hyperparms) 306 | for model_iter in range(100): 307 | dir_name = str(model_iter) # Directory for saving all the data 308 | 309 | os.makedirs(dir_name) 310 | num_unique = [] 311 | 312 | ## HYPERPARAMETER SELECTION! 
313 | 314 | num_epochs = 1500 315 | 316 | # Let me select the learning rates: 317 | lr_disc = 10 ** random.uniform(-7, -4) 318 | lr_genr = 10 ** random.uniform(-7, -4) 319 | 320 | # Generator 321 | prior_lv_size = random.randint(50, 300) 322 | layer_interm_size_G = random.randint(300, 3000) 323 | 324 | # Discriminator 325 | discr_dropout = random.uniform(0, 0.8) 326 | discr_layer_2_size = random.randint(5, 100) 327 | 328 | 329 | # Save all the selected hyperparamters: 330 | f = open('{}/hyperparams.txt'.format(dir_name), 'a+') 331 | f.write('lr Discr: ' + str(lr_disc) + '\n') 332 | f.write('lr Gener: ' + str(lr_genr)+ '\n') 333 | f.write('Gener Sampling layer size: ' + str(prior_lv_size)+ '\n') 334 | f.write('Gener middle layer size: ' + str(layer_interm_size_G)+ '\n') 335 | f.write('Discr dropout: ' + str(discr_dropout)+ '\n') 336 | f.write('Discr middle layer size: ' + str(discr_layer_2_size)+ '\n') 337 | f.close() 338 | 339 | train_gan(lr_disc, lr_genr, prior_lv_size, layer_interm_size_G, discr_dropout, discr_layer_2_size, num_epochs, dir_name, num_unique) 340 | 341 | 342 | 343 | 344 | 345 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SELFIES 2 | 3 | [![GitHub release](https://img.shields.io/github/release/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/releases/) 4 | ![versions](https://img.shields.io/pypi/pyversions/selfies.svg) 5 | [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) 6 | [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/commit-activity) 7 | [![GitHub issues](https://img.shields.io/github/issues/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/issues/) 8 | [![Documentation Status](https://readthedocs.org/projects/selfiesv2/badge/?version=latest)](http://selfiesv2.readthedocs.io/?badge=latest) 9 | [![GitHub contributors](https://img.shields.io/github/contributors/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/contributors/) 10 | 11 | 12 | **Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation**\ 13 | _Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik_\ 14 | [*Machine Learning: Science and Technology* **1**, 045024 (2020)](https://iopscience.iop.org/article/10.1088/2632-2153/aba947), [extensive blog post January 2021](https://aspuru.substack.com/p/molecular-graph-representations-and).\ 15 | [Talk on youtube about SELFIES](https://www.youtube.com/watch?v=CaIyUmfGXDk).\ 16 | [A community paper with 31 authors on SELFIES and the future of molecular string representations](https://arxiv.org/abs/2204.00056).\ 17 | [Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\ 18 | **[Code-Paper in February 2023](https://pubs.rsc.org/en/content/articlelanding/2023/DD/D3DD00044C)**\ 19 | [SELFIES in Wolfram Mathematica](https://resources.wolframcloud.com/PacletRepository/resources/WolframChemistry/Selfies/) (since Dec 2023)\ 20 | Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ 21 | Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\ 22 | Chemistry Advisor: [Robert 
Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ) 23 | 24 | --- 25 | 26 | A main objective is to use SELFIES as a direct input to machine learning models, 27 | in particular generative models, for the generation of molecular graphs 28 | that are syntactically and semantically valid. 29 | 30 |

31 | SELFIES validity in a VAE latent space 32 |

33 | 34 | ## Installation 35 | Use pip to install ``selfies``. 36 | 37 | ```bash 38 | pip install selfies 39 | ``` 40 | 41 | To check if the correct version of ``selfies`` is installed, use 42 | the following pip command. 43 | 44 | ```bash 45 | pip show selfies 46 | ``` 47 | 48 | To upgrade to the latest release of ``selfies`` if you are using an 49 | older version, use the following pip command. Please see the 50 | [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) 51 | to review the changes between versions of `selfies`, before upgrading: 52 | 53 | ```bash 54 | pip install selfies --upgrade 55 | ``` 56 | 57 | 58 | ## Usage 59 | 60 | ### Overview 61 | 62 | Please refer to the [documentation in our code-paper](https://pubs.rsc.org/en/content/articlelanding/2023/DD/D3DD00044C), 63 | which contains a thorough tutorial for getting started with ``selfies`` 64 | and detailed descriptions of the functions 65 | that ``selfies`` provides. We summarize some key functions below. 66 | 67 | | Function | Description | 68 | | ------------------------------------- | ----------------------------------------------------------------- | 69 | | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | 70 | | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | 71 | | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | 72 | | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | 73 | | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | 74 | | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | 75 | | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | 76 | | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | 77 | 78 | 79 | ### Examples 80 | 81 | #### Translation between SELFIES and SMILES representations: 82 | 83 | ```python 84 | import selfies as sf 85 | 86 | benzene = "c1ccccc1" 87 | 88 | # SMILES -> SELFIES -> SMILES translation 89 | try: 90 | benzene_sf = sf.encoder(benzene) # [C][=C][C][=C][C][=C][Ring1][=Branch1] 91 | benzene_smi = sf.decoder(benzene_sf) # C1=CC=CC=C1 92 | except sf.EncoderError: 93 | pass # sf.encoder error! 94 | except sf.DecoderError: 95 | pass # sf.decoder error! 96 | 97 | len_benzene = sf.len_selfies(benzene_sf) # 8 98 | 99 | symbols_benzene = list(sf.split_selfies(benzene_sf)) 100 | # ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]'] 101 | ``` 102 | 103 | #### Very simple creation of random valid molecules: 104 | A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by [Rajarshi Guha](https://twitter.com/rguha/status/1543601839983284224): 105 | 106 | ```python 107 | import selfies as sf 108 | import random 109 | 110 | alphabet=sf.get_semantic_robust_alphabet() # Gets the alphabet of robust symbols 111 | rnd_selfies=''.join(random.sample(list(alphabet), 9)) 112 | rnd_smiles=sf.decoder(rnd_selfies) 113 | print(rnd_smiles) 114 | ``` 115 | These simple lines gives crazy molecules, but all are valid. Can be used as a start for more advanced filtering techniques or for machine learning models. 
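
To illustrate the filtering idea mentioned above, here is a minimal sketch (not part of the ``selfies`` API): it samples a batch of random SELFIES, decodes them, and keeps only the decoded molecules that RDKit parses and that stay below a size threshold. The ``max_heavy_atoms`` value and the sample sizes are arbitrary choices for illustration.

```python
import random

import selfies as sf
from rdkit import Chem

# Sketch only: sample random SELFIES and keep the small decoded molecules.
# max_heavy_atoms and the loop/sample sizes are arbitrary illustrative values.
alphabet = list(sf.get_semantic_robust_alphabet())
max_heavy_atoms = 20

kept = []
for _ in range(100):
    rnd_selfies = "".join(random.sample(alphabet, 9))
    rnd_smiles = sf.decoder(rnd_selfies)    # decoding always yields a valid SMILES
    mol = Chem.MolFromSmiles(rnd_smiles)
    if mol is not None and mol.GetNumHeavyAtoms() <= max_heavy_atoms:
        kept.append(Chem.MolToSmiles(mol))  # store the canonical SMILES

print(len(kept), "molecules kept after filtering")
```

Any other property calculator (e.g. molecular weight or a synthetic-accessibility score) can be swapped in as the filter in the same way.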
116 | 117 | #### Integer and one-hot encoding SELFIES: 118 | 119 | In this example, we first build an alphabet from a dataset of SELFIES strings, 120 | and then convert a SELFIES string into its padded encoding. Note that we use the 121 | ``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) 122 | symbol to pad our SELFIES, which is a special SELFIES symbol that is always 123 | ignored and skipped over by ``selfies.decoder``, making it a useful 124 | padding character. 125 | 126 | ```python 127 | import selfies as sf 128 | 129 | dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"] 130 | alphabet = sf.get_alphabet_from_selfies(dataset) 131 | alphabet.add("[nop]") # [nop] is a special padding symbol 132 | alphabet = list(sorted(alphabet)) # ['[=O]', '[C]', '[F]', '[O]', '[nop]'] 133 | 134 | pad_to_len = max(sf.len_selfies(s) for s in dataset) # 5 135 | symbol_to_idx = {s: i for i, s in enumerate(alphabet)} 136 | 137 | dimethyl_ether = dataset[0] # [C][O][C] 138 | 139 | label, one_hot = sf.selfies_to_encoding( 140 | selfies=dimethyl_ether, 141 | vocab_stoi=symbol_to_idx, 142 | pad_to_len=pad_to_len, 143 | enc_type="both" 144 | ) 145 | # label = [1, 3, 1, 4, 4] 146 | # one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]] 147 | ``` 148 | 149 | #### Customizing SELFIES: 150 | 151 | In this example, we relax the semantic constraints of ``selfies`` to allow 152 | for hypervalences (caution: hypervalence rules are much less understood 153 | than octet rules. Some molecules containing hypervalences are important, 154 | but generally, it is not known which molecules are stable and reasonable). 155 | 156 | ```python 157 | import selfies as sf 158 | 159 | hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid 160 | standard_derived_smi = sf.decoder(hypervalent_sf) 161 | # OI (the default constraints for I allows for only 1 bond) 162 | 163 | sf.set_semantic_constraints("hypervalent") 164 | relaxed_derived_smi = sf.decoder(hypervalent_sf) 165 | # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) 166 | ``` 167 | 168 | #### Explaining Translation: 169 | 170 | You can get an "attribution" list that traces the connection between input and output tokens. For example let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens. 
171 | 172 | ```python 173 | selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]" 174 | smiles, attr = sf.decoder( 175 | selfies, attribute=True) 176 | print('SELFIES', selfies) 177 | print('SMILES', smiles) 178 | print('Attribution:') 179 | for smiles_token in attr: 180 | print(smiles_token) 181 | 182 | # output 183 | SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] 184 | SMILES C1NC(P)CC1 185 | Attribution: 186 | AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')]) 187 | AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')]) 188 | AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')]) 189 | AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')]) 190 | AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')]) 191 | AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')]) 192 | ``` 193 | 194 | ``attr`` is a list of `AttributionMap`s containing the output token, its index, and input tokens that led to it. For example, the ``P`` appearing in the output SMILES at that location is a result of both the ``[Branch1]`` token at position 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control of tracking the translation (like tracking rings), you can access attributions in the underlying molecular graph with ``get_attribution``. 195 | 196 | ### More Usages and Examples 197 | 198 | * More examples can be found in the ``examples/`` directory, including a 199 | [variational autoencoder that runs on the SELFIES](https://github.com/aspuru-guzik-group/selfies/tree/master/examples/vae_example) language. 200 | * This [ICLR2020 paper](https://arxiv.org/abs/1909.11655) used SELFIES in a 201 | genetic algorithm to achieve state-of-the-art performance for inverse design, 202 | with the [code here](https://github.com/aspuru-guzik-group/GA). 203 | * SELFIES allows for [highly efficient exploration and interpolation of the chemical space](https://chemrxiv.org/articles/preprint/Beyond_Generative_Models_Superfast_Traversal_Optimization_Novelty_Exploration_and_Discovery_STONED_Algorithm_for_Molecules_using_SELFIES/13383266), with a [deterministic algorithms, see code](https://github.com/aspuru-guzik-group/stoned-selfies). 204 | * We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea). 205 | * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). 206 | * Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling. 
207 | * As an improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. 208 | 209 | ## Tests 210 | `selfies` uses `pytest` with `tox` as its testing framework. 211 | All tests can be found in the `tests/` directory. To run the test suite for 212 | SELFIES, install ``tox`` and run: 213 | 214 | ```bash 215 | tox -- --trials=10000 --dataset_samples=10000 216 | ``` 217 | 218 | By default, `selfies` is tested against a random subset 219 | (of size ``dataset_samples=10000``) on various datasets: 220 | 221 | * 130K molecules from [QM9](https://www.nature.com/articles/sdata201422) 222 | * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database) 223 | * 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307) 224 | * 160K+ molecules from various [MoleculeNet](https://moleculenet.org/datasets-1) datasets 225 | 226 | In the first releases, we also tested the 36M+ molecules from the [eMolecules Database](https://downloads.emolecules.com/free/2024-12-01/). 227 | 228 | 229 | ## Version History 230 | See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md). 231 | 232 | ## Credits 233 | 234 | We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin, 235 | HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan), 236 | Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports, 237 | and Robert Pollice for chemistry advice. 238 | 239 | ## License 240 | 241 | [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/) 242 | --------------------------------------------------------------------------------