├── selfies ├── utils │ ├── __init__.py │ ├── selfies_utils.py │ ├── matching_utils.py │ └── encoding_utils.py ├── exceptions.py ├── compatibility.py ├── constants.py ├── __init__.py ├── grammar_rules.py ├── bond_constraints.py ├── decoder.py ├── encoder.py └── mol_graph.py ├── original_code_from_paper ├── gan │ ├── GAN_smiles │ │ ├── README.md │ │ ├── one_hot_converter.py │ │ └── GAN.py │ ├── GAN_selfies │ │ ├── README.md │ │ ├── translate.py │ │ ├── one_hot_converter.py │ │ └── GAN.py │ └── README.md ├── vae │ ├── VAE_dependencies │ │ ├── README.md │ │ ├── Datasets │ │ │ ├── README.md │ │ │ └── QM9 │ │ │ │ └── README.md │ │ ├── GrammarVAE_grammar.py │ │ ├── data_loader.py │ │ └── GrammarVAE_codes.py │ ├── settings.yml │ ├── settingsSELFIES.yml │ ├── settingsDeepSMILES.yml │ ├── settingsGrammarVAE.yml │ ├── settingsSMILES.yml │ └── README.md ├── README.md ├── bitflips_in_paper_Fig3.txt ├── bitflip_from_mdma.py └── environment.yml ├── docs ├── requirements.txt ├── README.md ├── Makefile ├── make.bat └── source │ ├── index.rst │ ├── conf.py │ └── selfies.rst ├── examples ├── vae_example │ ├── datasets │ │ └── README.md │ ├── settings.yml │ ├── README.md │ └── data_loader.py ├── VAE_LS_Validity.png ├── selfies_indice.png └── workshop2021 │ └── SELFIES_working_groups.pdf ├── tox.ini ├── .readthedocs.yml ├── tests ├── conftest.py ├── test_sets │ └── custom_cases.csv ├── run_on_large_dataset.py ├── test_on_datasets.py ├── test_selfies_utils.py ├── test_selfies.py └── test_specific_cases.py ├── .github └── workflows │ └── ci.yml ├── setup.py ├── .gitignore ├── CHANGELOG.md ├── LICENSE └── README.md /selfies/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_smiles/README.md: -------------------------------------------------------------------------------- 1 | GAN for SMILES 2 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_selfies/README.md: -------------------------------------------------------------------------------- 1 | GAN for SELFIES 2 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/README.md: -------------------------------------------------------------------------------- 1 | additional data files 2 | -------------------------------------------------------------------------------- /docs/requirements.txt: -------------------------------------------------------------------------------- 1 | nbsphinx 2 | sphinx-autodoc-typehints 3 | sphinx-rtd-theme 4 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/Datasets/README.md: -------------------------------------------------------------------------------- 1 | datasets (only qm9) 2 | -------------------------------------------------------------------------------- /examples/vae_example/datasets/README.md: -------------------------------------------------------------------------------- 1 | Dataset files with molecules represented in SMILES. 
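For reference, a minimal sketch of how such a dataset file might be loaded and converted to SELFIES is shown below. It is only a sketch: it assumes one SMILES string per line and the ``datasets/0SelectedSMILES_QM9.txt`` path referenced in ``settings.yml``; adjust the parsing if your file carries a header or an index column.

```python
# Minimal loading sketch (assumed layout: one SMILES string per line).
import selfies as sf

with open("datasets/0SelectedSMILES_QM9.txt") as f:
    smiles_list = [line.strip() for line in f if line.strip()]

# Convert every SMILES string into its SELFIES counterpart.
selfies_list = [sf.encoder(s) for s in smiles_list]
print(f"converted {len(selfies_list)} molecules, e.g. {selfies_list[0]}")
```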
2 | -------------------------------------------------------------------------------- /examples/VAE_LS_Validity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aspuru-guzik-group/selfies/HEAD/examples/VAE_LS_Validity.png -------------------------------------------------------------------------------- /examples/selfies_indice.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aspuru-guzik-group/selfies/HEAD/examples/selfies_indice.png -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/Datasets/QM9/README.md: -------------------------------------------------------------------------------- 1 | QM9 representation in SMILES, DeepSMILES, SELFIES, GrammarVAE 2 | -------------------------------------------------------------------------------- /examples/workshop2021/SELFIES_working_groups.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aspuru-guzik-group/selfies/HEAD/examples/workshop2021/SELFIES_working_groups.pdf -------------------------------------------------------------------------------- /original_code_from_paper/README.md: -------------------------------------------------------------------------------- 1 | This directory contains the original code from the SELFIES 2 | [paper](https://arxiv.org/abs/1905.13741). The required dependencies to run 3 | this code are listed in ``environment.yml``. Note that this code ran on 4 | an old version of SELFIES (``v0.1.1``). 5 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py37,py38,py39,py310 3 | requires = tox-conda 4 | 5 | [testenv] 6 | setenv = 7 | CONDA_DLL_SEARCH_MODIFICATION_ENABLE = 1 8 | whitelist_externals = python 9 | 10 | [testenv:py{37,38,39,310}] 11 | conda_deps = 12 | pytest 13 | rdkit 14 | conda_channels = 15 | conda-forge 16 | commands = pytest --basetemp="{envtmpdir}" {posargs} 17 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | To build the documentation, please install 2 | * the [sphinx-autodoc-typehints](https://github.com/agronholm/sphinx-autodoc-typehints) extension 3 | * the [Read the Docs Sphinx theme](https://github.com/readthedocs/sphinx_rtd_theme) 4 | * the [nbsphinx](https://github.com/spatialaudio/nbsphinx) extension 5 | 6 | and then run 7 | ``` 8 | python -m sphinx source build 9 | ``` 10 | in this directory. 
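The same build can also be invoked from Python instead of the shell. The snippet below is only a sketch: it assumes Sphinx >= 1.7 (which provides ``sphinx.cmd.build``), that the extensions above are installed, and that it is run from this ``docs/`` directory, writing HTML output to ``build/``.

```python
# Programmatic equivalent of `python -m sphinx source build`.
from sphinx.cmd.build import main as sphinx_build

exit_code = sphinx_build(["source", "build"])  # builds HTML by default
print("sphinx-build finished with exit code", exit_code)
```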
11 | -------------------------------------------------------------------------------- /examples/vae_example/settings.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 100 3 | smiles_file: datasets/0SelectedSMILES_QM9.txt 4 | type_of_encoding: 0 5 | 6 | decoder: 7 | latent_dimension: 50 8 | gru_neurons_num: 100 9 | gru_stack_size: 1 10 | 11 | encoder: 12 | layer_1d: 100 13 | layer_2d: 100 14 | layer_3d: 100 15 | latent_dimension: 50 16 | 17 | training: 18 | KLD_alpha: 1.0e-05 19 | lr_enc: 0.0001 20 | lr_dec: 0.0001 21 | num_epochs: 5000 22 | sample_num: 1000 23 | -------------------------------------------------------------------------------- /.readthedocs.yml: -------------------------------------------------------------------------------- 1 | # Read the Docs configuration file 2 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details 3 | 4 | # Required 5 | version: 2 6 | 7 | # Build documentation in the docs/ directory with Sphinx 8 | sphinx: 9 | configuration: docs/source/conf.py 10 | 11 | # Optionally build your docs in additional formats such as PDF 12 | formats: 13 | - pdf 14 | 15 | # Optionally set the version of Python and requirements required to build your docs 16 | python: 17 | version: "3.8" 18 | install: 19 | - requirements: docs/requirements.txt 20 | -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | 4 | def pytest_addoption(parser): 5 | parser.addoption( 6 | "--trials", action="store", default=10000, 7 | help="number of trials for random tests" 8 | ) 9 | parser.addoption( 10 | "--dataset_samples", action="store", default=10000, 11 | help="number of samples to test from the data sets" 12 | ) 13 | 14 | 15 | @pytest.fixture 16 | def trials(request): 17 | return int(request.config.getoption("--trials")) 18 | 19 | 20 | @pytest.fixture 21 | def dataset_samples(request): 22 | return int(request.config.getoption("--dataset_samples")) 23 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settings.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/2RGSMILES_QM9.txt 5 | type_of_encoding: 2 6 | decoder: 7 | gru_neurons_num: 100 8 | gru_stack_size: 1 9 | latent_dimension: 25 10 | encoder: 11 | latent_dimension: 25 12 | layer_1d: 100 13 | layer_2d: 50 14 | layer_3d: 25 15 | training: 16 | KLD_alpha: 3.9914730047188435e-06 17 | checkpoint: true 18 | latent_dimension: 25 19 | lr_dec: 0.002427375401692564 20 | lr_enc: 0.002427375401692564 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsSELFIES.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/2RGSMILES_QM9.txt 5 | type_of_encoding: 2 6 | decoder: 7 | gru_neurons_num: 100 8 | gru_stack_size: 1 9 | latent_dimension: 25 10 | encoder: 11 | latent_dimension: 25 12 | layer_1d: 100 13 | layer_2d: 50 14 | layer_3d: 25 15 | training: 16 | KLD_alpha: 3.9914730047188435e-06 17 | checkpoint: true 18 | latent_dimension: 25 19 | lr_dec: 
0.002427375401692564 20 | lr_enc: 0.002427375401692564 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsDeepSMILES.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/1DeepSMILES_QM9.txt 5 | type_of_encoding: 1 6 | decoder: 7 | gru_neurons_num: 10 8 | gru_stack_size: 5 9 | latent_dimension: 25 10 | encoder: 11 | latent_dimension: 25 12 | layer_1d: 2000 13 | layer_2d: 1000 14 | layer_3d: 500 15 | training: 16 | KLD_alpha: 0.07233630964775388 17 | checkpoint: true 18 | latent_dimension: 25 19 | lr_dec: 0.00022284476453107537 20 | lr_enc: 0.00022284476453107537 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsGrammarVAE.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies/Datasets/QM9/3GrammarVAE_QM9.txt 5 | type_of_encoding: 3 6 | decoder: 7 | gru_neurons_num: 100 8 | gru_stack_size: 1 9 | latent_dimension: 300 10 | encoder: 11 | latent_dimension: 300 12 | layer_1d: 100 13 | layer_2d: 50 14 | layer_3d: 500 15 | training: 16 | KLD_alpha: 0.00019199633293733154 17 | checkpoint: true 18 | latent_dimension: 300 19 | lr_dec: 8.804286872231203e-05 20 | lr_enc: 8.804286872231203e-05 21 | num_epochs: 5000 22 | sample_num: 1000 23 | tensorBoard_graphing: false 24 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/settingsSMILES.yml: -------------------------------------------------------------------------------- 1 | data: 2 | batch_size: 190 3 | cuda_device: 0 4 | smiles_file: VAE_dependencies\Datasets\QM9\0SelectedSMILES_QM9.txt 5 | type_of_encoding: 0 6 | 7 | decoder: 8 | gru_neurons_num: 10 9 | gru_stack_size: 5 10 | latent_dimension: 300 11 | 12 | encoder: 13 | latent_dimension: 300 14 | layer_1d: 100 15 | layer_2d: 1000 16 | layer_3d: 25 17 | 18 | training: 19 | KLD_alpha: 0.0032309168486524178 20 | checkpoint: true 21 | latent_dimension: 300 22 | lr_dec: 6.98450538191299e-05 23 | lr_enc: 6.98450538191299e-05 24 | num_epochs: 5000 25 | sample_num: 1000 26 | tensorBoard_graphing: false 27 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = source 9 | BUILDDIR = build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: CI Tests 2 | 3 | on: 4 | push: 5 | branches: [master] 6 | pull_request: 7 | branches: [master] 8 | 9 | jobs: 10 | 11 | build: 12 | 13 | runs-on: ubuntu-latest 14 | strategy: 15 | matrix: 16 | python-version: [3.9] 17 | 18 | steps: 19 | - uses: actions/checkout@v2 20 | - name: Set up Python ${{ matrix.python-version }} 21 | uses: actions/setup-python@v1 22 | with: 23 | python-version: ${{ matrix.python-version }} 24 | - name: Install basic dependencies 25 | run: | 26 | python -m pip install --upgrade pip 27 | - name: Test with tox 28 | run: | 29 | pip install -e . 30 | pip install tox 31 | tox 32 | -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=source 11 | set BUILDDIR=build 12 | 13 | if "%1" == "" goto help 14 | 15 | %SPHINXBUILD% >NUL 2>NUL 16 | if errorlevel 9009 ( 17 | echo. 18 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 19 | echo.installed, then set the SPHINXBUILD environment variable to point 20 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 21 | echo.may add the Sphinx directory to PATH. 22 | echo. 23 | echo.If you don't have Sphinx installed, grab it from 24 | echo.http://sphinx-doc.org/ 25 | exit /b 1 26 | ) 27 | 28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 29 | goto end 30 | 31 | :help 32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 33 | 34 | :end 35 | popd 36 | -------------------------------------------------------------------------------- /selfies/exceptions.py: -------------------------------------------------------------------------------- 1 | class SMILESParserError(ValueError): 2 | """Exception raised when a SMILES fails to be parsed. 3 | """ 4 | 5 | def __init__(self, smiles, reason="N/A", idx=-1): 6 | self.smiles = smiles 7 | self.idx = idx 8 | self.reason = reason 9 | 10 | def __str__(self): 11 | err_msg = "\n" \ 12 | "\tSMILES: {smiles}\n" \ 13 | "\t {pointer}\n" \ 14 | "\tIndex: {index}\n" \ 15 | "\tReason: {reason}" 16 | 17 | return err_msg.format( 18 | smiles=self.smiles, 19 | pointer=(" " * self.idx + "^"), 20 | index=self.idx, 21 | reason=self.reason 22 | ) 23 | 24 | 25 | class EncoderError(Exception): 26 | """Exception raised by :func:`selfies.encoder`. 27 | """ 28 | 29 | pass 30 | 31 | 32 | class DecoderError(Exception): 33 | """Exception raised by :func:`selfies.decoder`. 34 | """ 35 | 36 | pass 37 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/README.md: -------------------------------------------------------------------------------- 1 | ## SELFIES Example: Variational Auto Encoder (VAE) for chemistry 2 | comparing SMILES, DeepSMILES, GrammarVAE and SELFIES representation using reconstruction quality, diversity and latent space validity as metrics of interest 3 | - v0.1.0 -- 04. August 2019 4 | 5 | ### information: 6 | This is the original code used to generate the data in our paper. 
7 | 8 | It used a hand-written SELFIES encoding (Table 2 of the paper), and cannot easily be adapted to other situations. If you want to use the VAE, please see the VAE example in examples/vae_example (chemistry_vae.py). 9 | 10 | That code is connected with selfies.encoder/selfies.decoder, and can be applied to general datasets. For more documentation, please look there. 11 | 12 | 13 | settings*.yml 14 | these files contain the settings used for the best models described in the paper 15 | 16 | 17 | 18 | For comments, bug reports or feature ideas, please send an email to 19 | mario.krenn@utoronto.ca and alan@aspuru.com 20 | 21 | -------------------------------------------------------------------------------- /examples/vae_example/README.md: -------------------------------------------------------------------------------- 1 | # SELFIES Example: Variational Autoencoder (VAE) for Chemistry 2 | 3 | An implementation of a variational autoencoder that runs on both SMILES and 4 | SELFIES. Included is code that compares the SMILES and SELFIES representations 5 | for a VAE using reconstruction quality, diversity, and latent space validity 6 | as metrics of interest. 7 | 8 | ## Dependencies 9 | Dependencies are ``pytorch``, ``rdkit``, and ``pyyaml``, which can be installed 10 | using Conda. 11 | 12 | ## Files 13 | 14 | * ``chemistry_vae.py``: the main file; contains the model definitions, 15 | the data processing, and the training. 16 | * ``settings.yml``: a file containing the hyperparameters of the 17 | model and the training. Also configures the VAE to run on either SMILES 18 | or SELFIES. 19 | * ``data_loader.py``: contains helper methods that convert SMILES and SELFIES 20 | into integer-encoded or one-hot encoded vectors. 21 | 22 | ### Tested with: 23 | - Python 3.7 24 | 25 | CPU and GPU supported 26 | 27 | For comments, bug reports or feature ideas, please send an email to 28 | mario.krenn@utoronto.ca and alan@aspuru.com 29 | 30 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/README.md: -------------------------------------------------------------------------------- 1 | # Package requirements for running the code: 2 | - Pytorch 3 | - TensorBoardX 4 | - rdkit 5 | - numpy 6 | - matplotlib 7 | 8 | 9 | # File Navigator: 10 | 11 | GAN_selfies: 12 | - 2RGSMILES_QM9.txt : Dearomatized QM9 dataset (SELFIES representation, using the alphabet described in the main text; symbols shortened for simplicity) 13 | - GAN.py: Code for running the generative adversarial network 14 | - one_hot_converter.py: Code for creating one-hot encodings of molecular strings 15 | - adjusted_selfies_fcts.py: SMILES to SELFIES conversion file 16 | - GPlus2S.py: SMILES to SELFIES conversion file 17 | - translate.py: General helper functions file 18 | 19 | GAN_smiles: 20 | - GAN.py: Code for running the generative adversarial network 21 | - one_hot_converter.py: Code for creating one-hot encodings of molecular strings 22 | - smiles_qm9.txt: Dearomatized QM9 dataset (SMILES representation) 23 | 24 | # How to Run the Code: 25 | Step 1: cd into either 'GAN_selfies' or 'GAN_smiles', depending on which molecular representation you would like to run the GAN on. 26 | 27 | Step 2: run GAN.py. 
28 | The code will automatically detect the availability of a GPU on your device, and run multiple models with different hyperparameters 29 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import setuptools 4 | 5 | with open("README.md", "r") as fh: 6 | long_description = fh.read() 7 | 8 | setuptools.setup( 9 | name="selfies", 10 | version="2.2.0", 11 | author="Mario Krenn, Alston Lo, Robert Pollice and many other contributors", 12 | author_email="mario.krenn@mpl.mpg.de, alan@aspuru.com", 13 | description="SELFIES (SELF-referencIng Embedded Strings) is a " 14 | "general-purpose, sequence-based, robust representation of " 15 | "semantically constrained graphs.", 16 | long_description=long_description, 17 | long_description_content_type="text/markdown", 18 | url="https://github.com/aspuru-guzik-group/selfies", 19 | packages=setuptools.find_packages(), 20 | classifiers=[ 21 | "Programming Language :: Python :: 3", 22 | "Programming Language :: Python :: 3.7", 23 | "Programming Language :: Python :: 3.8", 24 | "Programming Language :: Python :: 3.9", 25 | "Programming Language :: Python :: 3.10", 26 | "Programming Language :: Python :: 3 :: Only", 27 | "License :: OSI Approved :: Apache Software License", 28 | "Operating System :: OS Independent", 29 | ], 30 | python_requires='>=3.7' 31 | ) 32 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | .. selfies documentation master file, created by 2 | sphinx-quickstart on Sun Jun 14 23:40:28 2020. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to the SELFIES documentation! 7 | ===================================== 8 | 9 | SELFIES (SELF-referencIng Embedded Strings) is a 100% robust molecular string 10 | representation. A main objective is to use SELFIES as direct input into 11 | machine learning models, in particular in generative models, for the 12 | generation of outputs with guaranteed validity. 13 | 14 | This library is intended to be light-weight and easy to use. 15 | 16 | For explanation of the underlying principle (formal grammar) and experiments, 17 | please see the `original paper`_. 18 | 19 | For comments, bug reports or feature ideas, please use github issues 20 | or send an email to mario.krenn@utoronto.ca and alan@aspuru.com. 21 | 22 | .. _original paper: https://doi.org/10.1088/2632-2153/aba947 23 | 24 | Installation 25 | ############ 26 | 27 | Install SELFIES in the command line using pip: 28 | 29 | .. code-block:: 30 | 31 | $ pip install selfies 32 | 33 | 34 | .. 
toctree:: 35 | :maxdepth: 2 36 | :caption: Contents 37 | 38 | derivation 39 | tutorial.ipynb 40 | selfies 41 | 42 | Indices and tables 43 | ================== 44 | 45 | * :ref:`genindex` 46 | * :ref:`modindex` 47 | * :ref:`search` 48 | -------------------------------------------------------------------------------- /selfies/compatibility.py: -------------------------------------------------------------------------------- 1 | from selfies.utils.smiles_utils import atom_to_smiles, smiles_to_atom 2 | 3 | 4 | def modernize_symbol(symbol): 5 | """Converts a SELFIES symbol from 0: 50 | print(ii,': ',new_smiles) -------------------------------------------------------------------------------- /original_code_from_paper/bitflips_in_paper_Fig3.txt: -------------------------------------------------------------------------------- 1 | smiles: 2 | 1mutation 3 | CNC(C)CC1=CC=CNC(=C1)OCO2 : no 4 | CNC(C)CC1=CC=C2C(=C1FOCO2 : no 5 | CFC(C)CC1=CC=C2C(=C1)OCO2 : no 6 | 7 | 2mutation 8 | CNC(C)OC1=CC=C2C(=C1COCO2 : no 9 | CNC(C)CC1=CCOCCC(=C1)OCO2 : no 10 | CNC(C)#C1=CC=C2C(=C1)OCON : no 11 | 12 | 3mutation 13 | CNC(C)CC1=CCCC2#F=C1)OCO2 : no 14 | C=C(C#CC1=CC=C2C(=CN)OCO2 : no 15 | CNO(C)CC1=CC=C2C(#F1)OCO2 : no 16 | 17 | 18 | selfies 19 | 1mutation 20 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][C][Branch1_2][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CNC(C)CC1=CC=C2C(C1)OCO2 : yes 21 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][C][Branch1_3][=O][=C][=N][#N][O][C][O][Ring1][#N] - CNC(C)CC=CC=CC(=C=NN)OCO : yes 22 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][N][=C][C][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CNC(C)CC1=CN=C2C(=C1)OCO2 : yes 23 | 24 | CNC(C)CC1=CC=C2C(C1)OCO2 25 | CNC(C)CC=CC=CC(=C=NN)OCO 26 | CNC(C)CC1=CN=C2C(=C1)OCO2 27 | 28 | 2mutation 29 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][C][Branch1_3][Branch1_2][Branch1_3][Ring1][#N][O][C][O][Ring1][#N] - CNC(C)CC=CC=C1C(=NOCO1) : yes 30 | [C][N][C][Branch1_3][epsilon][C][C][=N][=C][C][=C][C][Branch1_3][=O][=C][Ring1][F][O][C][O][Ring1][#N] - CNC(C)C=NCC1=C2C(=C1)OCO2 : yes 31 | [C][N][C][Branch1_3][epsilon][C][C][C][=C][C][=C][Ring1][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - C1NC(C)CC=CC=C1O : yes 32 | 33 | CNC(C)CC=CC=C1C(=NOCO1) 34 | CNC(C)C=NCC1=C2C(=C1)OCO2 35 | C1NC(C)CC=CC=C1O 36 | 37 | 38 | 39 | 3mutation 40 | [C][N][C][Branch1_3][=C][C][C][C][=C][C][=C][Branch1_3][Branch1_3][=O][=C][Ring1][#N][=N][C][O][Ring1][#N] - CNC(CCC1=CC=C2(O))C1=NCO2 : yes 41 | [C][N][#C][Branch1_3][epsilon][C][C][#N][=C][C][=C][C][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CN=C(C)C#N : yes 42 | [C][N][=N][Branch1_3][epsilon][C][C][C][=N][Branch1_3][=C][C][Branch1_3][=O][=C][Ring1][#N][O][C][O][Ring1][#N] - CN=NC1CC=NC(=C1)OCO : yes -------------------------------------------------------------------------------- /selfies/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | SELFIES: a robust representation of semantically constrained graphs with an 5 | example application in chemistry. 6 | 7 | SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, 8 | sequence-based, robust representation of semantically constrained graphs. 9 | It is based on a Chomsky type-2 grammar, augmented with two self-referencing 10 | functions. A main objective is to use SELFIES as direct input into machine 11 | learning models, in particular in generative models, for the generation of 12 | outputs with high validity. 
13 | 14 | The code presented here is a concrete application of SELFIES in chemistry, for 15 | the robust representation of molecules. 16 | 17 | Typical usage example: 18 | import selfies as sf 19 | 20 | benzene = "C1=CC=CC=C1" 21 | benzene_selfies = sf.encoder(benzene) 22 | benzene_smiles = sf.decoder(benzene_selfies) 23 | 24 | For comments, bug reports or feature ideas, please send an email to 25 | mario.krenn@utoronto.ca and alan@aspuru.com. 26 | """ 27 | 28 | __version__ = "2.2.0" 29 | 30 | __all__ = [ 31 | "encoder", 32 | "decoder", 33 | "get_preset_constraints", 34 | "get_semantic_robust_alphabet", 35 | "get_semantic_constraints", 36 | "set_semantic_constraints", 37 | "len_selfies", 38 | "split_selfies", 39 | "get_alphabet_from_selfies", 40 | "selfies_to_encoding", 41 | "batch_selfies_to_flat_hot", 42 | "encoding_to_selfies", 43 | "batch_flat_hot_to_selfies", 44 | "EncoderError", 45 | "DecoderError" 46 | ] 47 | 48 | from .bond_constraints import ( 49 | get_preset_constraints, 50 | get_semantic_constraints, 51 | get_semantic_robust_alphabet, 52 | set_semantic_constraints 53 | ) 54 | from .decoder import decoder 55 | from .encoder import encoder 56 | from .exceptions import DecoderError, EncoderError 57 | from .utils.encoding_utils import ( 58 | batch_flat_hot_to_selfies, 59 | batch_selfies_to_flat_hot, 60 | encoding_to_selfies, 61 | selfies_to_encoding 62 | ) 63 | from .utils.selfies_utils import ( 64 | get_alphabet_from_selfies, 65 | len_selfies, 66 | split_selfies 67 | ) 68 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for the Sphinx documentation builder. 2 | # 3 | # This file only contains a selection of the most common options. For a full 4 | # list see the documentation: 5 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 6 | 7 | # -- Path setup -------------------------------------------------------------- 8 | 9 | # If extensions (or modules to document with autodoc) are in another directory, 10 | # add these directories to sys.path here. If the directory is relative to the 11 | # documentation root, use os.path.abspath to make it absolute, like shown here. 12 | # 13 | import os 14 | import sys 15 | 16 | sys.path.insert(0, os.path.abspath("../..")) 17 | 18 | # -- Project information ----------------------------------------------------- 19 | 20 | project = "selfies" 21 | copyright = "2020, Mario Krenn" 22 | author = "Mario Krenn" 23 | 24 | # The full version, including alpha/beta/rc tags 25 | release = "2.0.0" 26 | 27 | # -- General configuration --------------------------------------------------- 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named "sphinx.ext.*") or your custom 31 | # ones. 32 | extensions = [ 33 | "sphinx.ext.autodoc", 34 | "sphinx_autodoc_typehints", 35 | "sphinx.ext.autosummary", 36 | "sphinx_rtd_theme", 37 | "nbsphinx", 38 | "sphinx.ext.mathjax", 39 | "sphinx.ext.viewcode" 40 | ] 41 | 42 | # Add any paths that contain templates here, relative to this directory. 43 | templates_path = ["_templates"] 44 | 45 | # List of patterns, relative to source directory, that match files and 46 | # directories to ignore when looking for source files. 47 | # This pattern also affects html_static_path and html_extra_path. 
48 | exclude_patterns = [] 49 | 50 | # -- Options for HTML output ------------------------------------------------- 51 | 52 | # The theme to use for HTML and HTML Help pages. See the documentation for 53 | # a list of builtin themes. 54 | # 55 | html_theme = "sphinx_rtd_theme" 56 | 57 | # Add any paths that contain custom static files (such as style sheets) here, 58 | # relative to this directory. They are copied after the builtin static files, 59 | # so a file named "default.css" will overwrite the builtin "default.css". 60 | html_static_path = [] 61 | -------------------------------------------------------------------------------- /selfies/utils/selfies_utils.py: -------------------------------------------------------------------------------- 1 | from typing import Iterable, Iterator, Set 2 | 3 | 4 | def len_selfies(selfies: str) -> int: 5 | """Returns the number of symbols in a given SELFIES string. 6 | 7 | :param selfies: a SELFIES string. 8 | :return: the symbol length of the SELFIES string. 9 | 10 | :Example: 11 | 12 | >>> import selfies as sf 13 | >>> sf.len_selfies("[C][=C][F].[C]") 14 | 5 15 | """ 16 | 17 | return selfies.count("[") + selfies.count(".") 18 | 19 | 20 | def split_selfies(selfies: str) -> Iterator[str]: 21 | """Tokenizes a SELFIES string into its individual symbols. 22 | 23 | :param selfies: a SELFIES string. 24 | :return: the symbols of the SELFIES string one-by-one with order preserved. 25 | 26 | :Example: 27 | 28 | >>> import selfies as sf 29 | >>> list(sf.split_selfies("[C][=C][F].[C]")) 30 | ['[C]', '[=C]', '[F]', '.', '[C]'] 31 | """ 32 | 33 | left_idx = selfies.find("[") 34 | 35 | while 0 <= left_idx < len(selfies): 36 | right_idx = selfies.find("]", left_idx + 1) 37 | if right_idx == -1: 38 | raise ValueError("malformed SELFIES string, hanging '[' bracket") 39 | 40 | next_symbol = selfies[left_idx: right_idx + 1] 41 | yield next_symbol 42 | 43 | left_idx = right_idx + 1 44 | if selfies[left_idx: left_idx + 1] == ".": 45 | yield "." 46 | left_idx += 1 47 | 48 | 49 | def get_alphabet_from_selfies(selfies_iter: Iterable[str]) -> Set[str]: 50 | """Constructs an alphabet from an iterable of SELFIES strings. 51 | 52 | The returned alphabet is the set of all symbols that appear in the 53 | SELFIES strings from the input iterable, minus the dot ``.`` symbol. 54 | 55 | :param selfies_iter: an iterable of SELFIES strings. 56 | :return: an alphabet of SELFIES symbols, built from the input iterable. 57 | 58 | :Example: 59 | 60 | >>> import selfies as sf 61 | >>> selfies_list = ["[C][F][O]", "[C].[O]", "[F][F]"] 62 | >>> alphabet = sf.get_alphabet_from_selfies(selfies_list) 63 | >>> sorted(list(alphabet)) 64 | ['[C]', '[F]', '[O]'] 65 | """ 66 | 67 | alphabet = set() 68 | for s in selfies_iter: 69 | for symbol in split_selfies(s): 70 | alphabet.add(symbol) 71 | alphabet.discard(".") 72 | return alphabet 73 | -------------------------------------------------------------------------------- /examples/vae_example/data_loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is to encode SMILES and SELFIES into one-hot encodings 3 | """ 4 | 5 | import numpy as np 6 | 7 | import selfies as sf 8 | 9 | 10 | def smile_to_hot(smile, largest_smile_len, alphabet): 11 | """Go from a single smile string to a one-hot encoding. 
12 | """ 13 | 14 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 15 | 16 | # pad with ' ' 17 | smile += ' ' * (largest_smile_len - len(smile)) 18 | 19 | # integer encode input smile 20 | integer_encoded = [char_to_int[char] for char in smile] 21 | 22 | # one hot-encode input smile 23 | onehot_encoded = list() 24 | for value in integer_encoded: 25 | letter = [0 for _ in range(len(alphabet))] 26 | letter[value] = 1 27 | onehot_encoded.append(letter) 28 | return integer_encoded, np.array(onehot_encoded) 29 | 30 | 31 | def multiple_smile_to_hot(smiles_list, largest_molecule_len, alphabet): 32 | """Convert a list of smile strings to a one-hot encoding 33 | 34 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 35 | """ 36 | 37 | hot_list = [] 38 | for s in smiles_list: 39 | _, onehot_encoded = smile_to_hot(s, largest_molecule_len, alphabet) 40 | hot_list.append(onehot_encoded) 41 | return np.array(hot_list) 42 | 43 | 44 | def selfies_to_hot(selfie, largest_selfie_len, alphabet): 45 | """Go from a single selfies string to a one-hot encoding. 46 | """ 47 | 48 | symbol_to_int = dict((c, i) for i, c in enumerate(alphabet)) 49 | 50 | # pad with [nop] 51 | selfie += '[nop]' * (largest_selfie_len - sf.len_selfies(selfie)) 52 | 53 | # integer encode 54 | symbol_list = sf.split_selfies(selfie) 55 | integer_encoded = [symbol_to_int[symbol] for symbol in symbol_list] 56 | 57 | # one hot-encode the integer encoded selfie 58 | onehot_encoded = list() 59 | for index in integer_encoded: 60 | letter = [0] * len(alphabet) 61 | letter[index] = 1 62 | onehot_encoded.append(letter) 63 | 64 | return integer_encoded, np.array(onehot_encoded) 65 | 66 | 67 | def multiple_selfies_to_hot(selfies_list, largest_molecule_len, alphabet): 68 | """Convert a list of selfies strings to a one-hot encoding 69 | """ 70 | 71 | hot_list = [] 72 | for s in selfies_list: 73 | _, onehot_encoded = selfies_to_hot(s, largest_molecule_len, alphabet) 74 | hot_list.append(onehot_encoded) 75 | return np.array(hot_list) 76 | -------------------------------------------------------------------------------- /tests/run_on_large_dataset.py: -------------------------------------------------------------------------------- 1 | """Script for testing selfies against large datasets. 
2 | """ 3 | 4 | import argparse 5 | import pathlib 6 | 7 | import pandas as pd 8 | from rdkit import Chem 9 | from tqdm import tqdm 10 | 11 | import selfies as sf 12 | 13 | parser = argparse.ArgumentParser() 14 | parser.add_argument("--data_path", type=str, default="version.smi.gz") 15 | parser.add_argument("--col_name", type=str, default="isosmiles") 16 | parser.add_argument("--sep", type=str, default=r"\s+") 17 | parser.add_argument("--start_from", type=int, default=0) 18 | args = parser.parse_args() 19 | 20 | TEST_DIR = pathlib.Path(__file__).parent 21 | TEST_SET_PATH = TEST_DIR / "test_sets" / args.data_path 22 | ERROR_LOG_DIR = TEST_DIR / "error_logs" 23 | ERROR_LOG_DIR.mkdir(exist_ok=True, parents=True) 24 | 25 | 26 | def make_reader(): 27 | return pd.read_csv(TEST_SET_PATH, sep=args.sep, chunksize=10000) 28 | 29 | 30 | def roundtrip_translation(): 31 | sf.set_semantic_constraints("hypervalent") 32 | 33 | n_entries = 0 34 | for chunk in make_reader(): 35 | n_entries += len(chunk) 36 | pbar = tqdm(total=n_entries) 37 | 38 | reader = make_reader() 39 | error_log = open(ERROR_LOG_DIR / f"{TEST_SET_PATH.stem}.txt", "a+") 40 | 41 | curr_idx = 0 42 | for chunk_idx, chunk in enumerate(reader): 43 | for in_smiles in chunk[args.col_name]: 44 | pbar.update(1) 45 | curr_idx += 1 46 | if curr_idx < args.start_from: 47 | continue 48 | 49 | in_smiles = in_smiles.strip() 50 | 51 | mol = Chem.MolFromSmiles(in_smiles, sanitize=True) 52 | if (mol is None) or ("*" in in_smiles): 53 | continue 54 | 55 | try: 56 | selfies = sf.encoder(in_smiles, strict=True) 57 | out_smiles = sf.decoder(selfies) 58 | except (sf.EncoderError, sf.DecoderError): 59 | error_log.write(in_smiles + "\n") 60 | tqdm.write(in_smiles) 61 | continue 62 | 63 | if not is_same_mol(in_smiles, out_smiles): 64 | error_log.write(in_smiles + "\n") 65 | tqdm.write(in_smiles) 66 | 67 | error_log.close() 68 | 69 | 70 | def is_same_mol(smiles1, smiles2): 71 | try: 72 | can_smiles1 = Chem.CanonSmiles(smiles1) 73 | can_smiles2 = Chem.CanonSmiles(smiles2) 74 | return can_smiles1 == can_smiles2 75 | except Exception: 76 | return False 77 | 78 | 79 | if __name__ == "__main__": 80 | roundtrip_translation() 81 | -------------------------------------------------------------------------------- /tests/test_on_datasets.py: -------------------------------------------------------------------------------- 1 | import faulthandler 2 | import pathlib 3 | import random 4 | 5 | import pandas as pd 6 | import pytest 7 | from rdkit import Chem 8 | 9 | import selfies as sf 10 | 11 | faulthandler.enable() 12 | 13 | TEST_SET_DIR = pathlib.Path(__file__).parent / "test_sets" 14 | ERROR_LOG_DIR = pathlib.Path(__file__).parent / "error_logs" 15 | ERROR_LOG_DIR.mkdir(exist_ok=True, parents=True) 16 | 17 | datasets = list(TEST_SET_DIR.glob("**/*.csv")) 18 | 19 | 20 | @pytest.mark.parametrize("test_path", datasets) 21 | def test_roundtrip_translation(test_path, dataset_samples): 22 | """Tests SMILES -> SELFIES -> SMILES translation on various datasets. 
23 | """ 24 | 25 | # very relaxed constraints 26 | constraints = sf.get_preset_constraints("hypervalent") 27 | constraints.update({"P": 7, "P-1": 8, "P+1": 6, "?": 12}) 28 | sf.set_semantic_constraints(constraints) 29 | 30 | error_path = ERROR_LOG_DIR / "{}.csv".format(test_path.stem) 31 | with open(error_path, "w+") as error_log: 32 | error_log.write("In, Out\n") 33 | 34 | error_data = [] 35 | error_found = False 36 | 37 | n_lines = sum(1 for _ in open(test_path)) - 1 38 | n_keep = dataset_samples if (0 < dataset_samples <= n_lines) else n_lines 39 | skip = random.sample(range(1, n_lines + 1), n_lines - n_keep) 40 | reader = pd.read_csv(test_path, chunksize=10000, header=0, skiprows=skip) 41 | 42 | for chunk in reader: 43 | 44 | for in_smiles in chunk["smiles"]: 45 | in_smiles = in_smiles.strip() 46 | 47 | mol = Chem.MolFromSmiles(in_smiles, sanitize=True) 48 | if (mol is None) or ("*" in in_smiles): 49 | continue 50 | 51 | try: 52 | selfies = sf.encoder(in_smiles, strict=True) 53 | out_smiles = sf.decoder(selfies) 54 | except (sf.EncoderError, sf.DecoderError): 55 | error_data.append((in_smiles, "")) 56 | continue 57 | 58 | if not is_same_mol(in_smiles, out_smiles): 59 | error_data.append((in_smiles, out_smiles)) 60 | 61 | with open(error_path, "a") as error_log: 62 | for entry in error_data: 63 | error_log.write(",".join(entry) + "\n") 64 | 65 | error_found = error_found or error_data 66 | error_data = [] 67 | 68 | sf.set_semantic_constraints() # restore constraints 69 | 70 | assert not error_found 71 | 72 | 73 | def is_same_mol(smiles1, smiles2): 74 | try: 75 | can_smiles1 = Chem.CanonSmiles(smiles1) 76 | can_smiles2 = Chem.CanonSmiles(smiles2) 77 | return can_smiles1 == can_smiles2 78 | except Exception: 79 | return False 80 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 98 | __pypackages__/ 99 | 100 | # Celery stuff 101 | celerybeat-schedule 102 | celerybeat.pid 103 | 104 | # SageMath parsed files 105 | *.sage.py 106 | 107 | # Environments 108 | .env 109 | .venv 110 | env/ 111 | venv/ 112 | ENV/ 113 | env.bak/ 114 | venv.bak/ 115 | 116 | # Spyder project settings 117 | .spyderproject 118 | .spyproject 119 | 120 | # Rope project settings 121 | .ropeproject 122 | 123 | # mkdocs documentation 124 | /site 125 | 126 | # mypy 127 | .mypy_cache/ 128 | .dmypy.json 129 | dmypy.json 130 | 131 | # Pyre type checker 132 | .pyre/ 133 | 134 | # pytype static type analyzer 135 | .pytype/ 136 | 137 | # Cython debug symbols 138 | cython_debug/ 139 | 140 | # IntelliJ 141 | .idea/ 142 | 143 | # Project-specific files 144 | error_logs/ 145 | *.dat 146 | sandbox 147 | version.smi.gz 148 | -------------------------------------------------------------------------------- /docs/source/selfies.rst: -------------------------------------------------------------------------------- 1 | API Reference 2 | ================== 3 | 4 | .. currentmodule:: selfies 5 | 6 | Core Functions 7 | ------------------------ 8 | .. autofunction:: encoder 9 | .. autofunction:: decoder 10 | 11 | Customization Functions 12 | ------------------------ 13 | 14 | The SELFIES grammar is derived dynamically from a set of semantic constraints, 15 | which assign bonding capacities to various atoms. 16 | By default, :mod:`selfies` operates under the following constraints: 17 | 18 | .. 
table:: 19 | :align: center 20 | 21 | +-----------+------------------------------+ 22 | | Max Bonds | Atom(s) | 23 | +===========+==============================+ 24 | | 1 | F, Cl, Br, I | 25 | +-----------+------------------------------+ 26 | | 2 | O | 27 | +-----------+------------------------------+ 28 | | 3 | B, N | 29 | +-----------+------------------------------+ 30 | | 4 | C | 31 | +-----------+------------------------------+ 32 | | 5 | P | 33 | +-----------+------------------------------+ 34 | | 6 | S | 35 | +-----------+------------------------------+ 36 | | 8 | All other atoms | 37 | +-----------+------------------------------+ 38 | 39 | The +1 and -1 charged versions of O, N, C, S, and P are also constrained, 40 | where a +1 increases the bonding capacity of the neutral atom by 1, 41 | and a -1 decreases the bonding capacity of the neutral atom by 1. 42 | For example, N+1 has a bonding capacity of :math:`3 + 1 = 4`, 43 | and N-1 has a bonding capacity of :math:`3 - 1 = 2`. The charged versions 44 | B+1 and B-1 are constrained to a capacity of 2 and 4 bonds, respectively. 45 | 46 | However, the default constraints are inadequate for SMILES strings that violate them. For 47 | example, nitrobenzene ``O=N(=O)C1=CC=CC=C1`` has a nitrogen with 6 bonds and 48 | the chlorate anion ``O=Cl(=O)[O-]`` has a chlorine with 5 bonds - these 49 | SMILES strings *cannot* be represented by SELFIES strings under the default constraints. 50 | Additionally, users may want to specify their own custom constraints. Thus, we 51 | provide the following methods for configuring the semantic constraints 52 | of :mod:`selfies`. 53 | 54 | .. warning:: 55 | 56 | SELFIES strings may be translated differently under different semantic constraints. 57 | Therefore, if custom semantic constraints are used, it is recommended to report 58 | them for reproducibility reasons. 59 | 60 | .. autofunction:: get_preset_constraints 61 | .. autofunction:: get_semantic_constraints 62 | .. autofunction:: set_semantic_constraints 63 | 64 | 65 | Utility Functions 66 | ------------------------ 67 | .. autofunction:: len_selfies 68 | .. autofunction:: split_selfies 69 | .. autofunction:: get_alphabet_from_selfies 70 | .. autofunction:: get_semantic_robust_alphabet 71 | .. autofunction:: selfies_to_encoding 72 | .. autofunction:: encoding_to_selfies 73 | .. autofunction:: batch_selfies_to_flat_hot 74 | .. autofunction:: batch_flat_hot_to_selfies 75 | 76 | 77 | Exceptions 78 | ------------------------ 79 | .. autoexception:: EncoderError 80 | .. 
autoexception:: DecoderError 81 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/GrammarVAE_grammar.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import numpy as np 3 | import six 4 | import pdb 5 | 6 | # the zinc grammar 7 | gram = """smiles -> chain 8 | atom -> bracket_atom 9 | atom -> aliphatic_organic 10 | atom -> aromatic_organic 11 | aliphatic_organic -> 'B' 12 | aliphatic_organic -> 'C' 13 | aliphatic_organic -> 'N' 14 | aliphatic_organic -> 'O' 15 | aliphatic_organic -> 'S' 16 | aliphatic_organic -> 'P' 17 | aliphatic_organic -> 'F' 18 | aliphatic_organic -> 'I' 19 | aliphatic_organic -> 'Cl' 20 | aliphatic_organic -> 'Br' 21 | aromatic_organic -> 'c' 22 | aromatic_organic -> 'n' 23 | aromatic_organic -> 'o' 24 | aromatic_organic -> 's' 25 | bracket_atom -> '[' BAI ']' 26 | BAI -> isotope symbol BAC 27 | BAI -> symbol BAC 28 | BAI -> isotope symbol 29 | BAI -> symbol 30 | BAC -> chiral BAH 31 | BAC -> BAH 32 | BAC -> chiral 33 | BAH -> hcount BACH 34 | BAH -> BACH 35 | BAH -> hcount 36 | BACH -> charge class 37 | BACH -> charge 38 | BACH -> class 39 | symbol -> aliphatic_organic 40 | symbol -> aromatic_organic 41 | isotope -> DIGIT 42 | isotope -> DIGIT DIGIT 43 | isotope -> DIGIT DIGIT DIGIT 44 | DIGIT -> '1' 45 | DIGIT -> '2' 46 | DIGIT -> '3' 47 | DIGIT -> '4' 48 | DIGIT -> '5' 49 | DIGIT -> '6' 50 | DIGIT -> '7' 51 | DIGIT -> '8' 52 | chiral -> '@' 53 | chiral -> '@@' 54 | hcount -> 'H' 55 | hcount -> 'H' DIGIT 56 | charge -> '-' 57 | charge -> '-' DIGIT 58 | charge -> '-' DIGIT DIGIT 59 | charge -> '+' 60 | charge -> '+' DIGIT 61 | charge -> '+' DIGIT DIGIT 62 | bond -> '-' 63 | bond -> '=' 64 | bond -> '#' 65 | bond -> '/' 66 | bond -> '\\' 67 | ringbond -> DIGIT 68 | ringbond -> bond DIGIT 69 | branched_atom -> atom 70 | branched_atom -> atom RB 71 | branched_atom -> atom BB 72 | branched_atom -> atom RB BB 73 | RB -> RB ringbond 74 | RB -> ringbond 75 | BB -> BB branch 76 | BB -> branch 77 | branch -> '(' chain ')' 78 | branch -> '(' bond chain ')' 79 | chain -> branched_atom 80 | chain -> chain branched_atom 81 | chain -> chain bond branched_atom 82 | Nothing -> None""" 83 | 84 | # form the CFG and get the start symbol 85 | GCFG = nltk.CFG.fromstring(gram) 86 | start_index = GCFG.productions()[0].lhs() 87 | 88 | # collect all lhs symbols, and the unique set of them 89 | all_lhs = [a.lhs().symbol() for a in GCFG.productions()] 90 | lhs_list = [] 91 | for a in all_lhs: 92 | if a not in lhs_list: 93 | lhs_list.append(a) 94 | 95 | D = len(GCFG.productions()) 96 | 97 | # this map tells us the rhs symbol indices for each production rule 98 | rhs_map = [None]*D 99 | count = 0 100 | for a in GCFG.productions(): 101 | rhs_map[count] = [] 102 | for b in a.rhs(): 103 | if not isinstance(b,six.string_types): 104 | s = b.symbol() 105 | rhs_map[count].extend(list(np.where(np.array(lhs_list) == s)[0])) 106 | count = count + 1 107 | 108 | masks = np.zeros((len(lhs_list),D)) 109 | count = 0 110 | 111 | # this tells us for each lhs symbol which productions rules should be masked 112 | for sym in lhs_list: 113 | is_in = np.array([a == sym for a in all_lhs], dtype=int).reshape(1,-1) 114 | masks[count] = is_in 115 | count = count + 1 116 | 117 | # this tells us the indices where the masks are equal to 1 118 | index_array = [] 119 | for i in range(masks.shape[1]): 120 | index_array.append(np.where(masks[:,i]==1)[0][0]) 121 | ind_of_ind = 
np.array(index_array) 122 | 123 | max_rhs = max([len(l) for l in rhs_map]) 124 | 125 | # rules 29 and 31 aren't used in the zinc data so we 126 | # 0 their masks so they can never be selected 127 | masks[:,29] = 0 128 | masks[:,31] = 0 129 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/data_loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is meant to go from various representations to 1HOT and back 3 | """ 4 | import numpy as np 5 | from GrammarVAE_codes import to_one_hot, prods_to_eq 6 | import GrammarVAE_grammar as zinc_grammar 7 | 8 | def unique_chars_iterator(smile): 9 | """ 10 | """ 11 | atoms = [] 12 | for i in range(len(smile)): 13 | atoms.append(smile[i]) 14 | return atoms 15 | 16 | 17 | 18 | def grammar_one_hot_to_smile(one_hot_ls): 19 | _grammar = zinc_grammar 20 | _productions = _grammar.GCFG.productions() 21 | 22 | # This is the generated grammar sequence 23 | grammar_seq = [[_productions[one_hot_ls[index,t].argmax()] 24 | for t in range(one_hot_ls.shape[1])] 25 | for index in range(one_hot_ls.shape[0])] 26 | #print(grammar_seq) 27 | smile = [prods_to_eq(prods) for prods in grammar_seq] 28 | 29 | return grammar_seq, smile 30 | 31 | 32 | def smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding): 33 | """ 34 | Go from a single smile string to a one-hot encoding. 35 | """ 36 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 37 | # integer encode input smile 38 | if type_of_encoding==0: 39 | for _ in range(largest_smile_len-len(smile)): 40 | smile+=' ' 41 | elif type_of_encoding==1: 42 | for _ in range(largest_smile_len-len(smile)): 43 | smile+=' ' 44 | elif type_of_encoding==2: 45 | for _ in range(largest_smile_len-len(smile)): 46 | smile+='A' 47 | 48 | integer_encoded = [char_to_int[char] for char in unique_chars_iterator(smile)] 49 | 50 | 51 | # one hot-encode input smile 52 | onehot_encoded = list() 53 | for value in integer_encoded: 54 | letter = [0 for _ in range(len(alphabet))] 55 | letter[value] = 1 56 | onehot_encoded.append(letter) 57 | return integer_encoded, np.array(onehot_encoded) 58 | 59 | 60 | def multiple_smile_to_hot(smiles_list, largest_smile_len, alphabet, type_of_encoding): 61 | """ 62 | Convert a list of smile strings to a one-hot encoding 63 | 64 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 65 | """ 66 | hot_list = [] 67 | for smile in smiles_list: 68 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding) 69 | hot_list.append(onehot_encoded) 70 | return np.array(hot_list) 71 | 72 | 73 | def hot_to_smile(onehot_encoded,alphabet): 74 | """ 75 | Go from one-hot encoding to smile string 76 | """ 77 | # From one-hot to integer encoding 78 | integer_encoded = onehot_encoded.argmax(1) 79 | 80 | int_to_char = dict((i, c) for i, c in enumerate(alphabet)) 81 | 82 | # integer encoding to smile 83 | regen_smile = "".join(int_to_char[x] for x in integer_encoded) 84 | regen_smile = regen_smile.strip() 85 | return regen_smile 86 | 87 | 88 | def check_conversion_bijection(smiles_list, largest_smile_len): 89 | """ 90 | This function should be called to check successful conversion to and from 91 | one-hot on a data set. 
92 | """ 93 | for i, smile in enumerate(smiles_list): 94 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len) 95 | regen_smile = hot_to_smile(onehot_encoded) 96 | # print('Original: ', smile, ' shape: ', len(smile)) 97 | # print('REcon: ', regen_smile , ' shape: ', len(regen_smile)) 98 | # return 99 | if smile != regen_smile: 100 | print('Filed conversion for: ', smile, ' @index: ', i) 101 | break 102 | print('All conditions passed!') 103 | 104 | -------------------------------------------------------------------------------- /selfies/utils/matching_utils.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | import itertools 3 | from collections import deque 4 | from typing import List, Optional 5 | 6 | 7 | def find_perfect_matching(graph: List[List[int]]) -> Optional[List[int]]: 8 | """Finds a perfect matching for an undirected graph (without self-loops). 9 | 10 | :param graph: an adjacency list representing the input graph. 11 | :return: a list representing a perfect matching, where j is the i-th 12 | element if nodes i and j are matched. Returns None, if the graph cannot 13 | be perfectly matched. 14 | """ 15 | 16 | # start with a maximal matching for efficiency 17 | matching = _greedy_matching(graph) 18 | 19 | unmatched = set(i for i in range(len(graph)) if matching[i] is None) 20 | while unmatched: 21 | 22 | # find augmenting path which starts at root 23 | root = unmatched.pop() 24 | path = _find_augmenting_path(graph, root, matching) 25 | 26 | if path is None: 27 | return None 28 | else: 29 | _flip_augmenting_path(matching, path) 30 | unmatched.discard(path[0]) 31 | unmatched.discard(path[-1]) 32 | 33 | return matching 34 | 35 | 36 | def _greedy_matching(graph): 37 | matching = [None] * len(graph) 38 | free_degrees = [len(graph[i]) for i in range(len(graph))] 39 | # free_degrees[i] = number of unmatched neighbors for node i 40 | 41 | # prioritize nodes with fewer unmatched neighbors 42 | node_pqueue = [(free_degrees[i], i) for i in range(len(graph))] 43 | heapq.heapify(node_pqueue) 44 | 45 | while node_pqueue: 46 | _, node = heapq.heappop(node_pqueue) 47 | 48 | if (matching[node] is not None) or (free_degrees[node] == 0): 49 | continue # node cannot be matched 50 | 51 | # match node with first unmatched neighbor 52 | mate = next(i for i in graph[node] if matching[i] is None) 53 | matching[node] = mate 54 | matching[mate] = node 55 | 56 | for adj in itertools.chain(graph[node], graph[mate]): 57 | free_degrees[adj] -= 1 58 | if (matching[adj] is None) and (free_degrees[adj] > 0): 59 | heapq.heappush(node_pqueue, (free_degrees[adj], adj)) 60 | 61 | return matching 62 | 63 | 64 | def _find_augmenting_path(graph, root, matching): 65 | assert matching[root] is None 66 | 67 | # run modified BFS to find path from root to unmatched node 68 | other_end = None 69 | node_queue = deque([root]) 70 | 71 | # parent BFS tree - None indicates an unvisited node 72 | parents = [None] * len(graph) 73 | parents[root] = [None, None] 74 | 75 | while node_queue: 76 | node = node_queue.popleft() 77 | 78 | for adj in graph[node]: 79 | if matching[adj] is None: # unmatched node 80 | if adj != root: # augmenting path found! 81 | parents[adj] = [node, adj] 82 | other_end = adj 83 | break 84 | else: 85 | adj_mate = matching[adj] 86 | if parents[adj_mate] is None: # adj_mate not visited 87 | parents[adj_mate] = [node, adj] 88 | node_queue.append(adj_mate) 89 | 90 | if other_end is not None: 91 | break # augmenting path found! 
92 | 93 | if other_end is None: 94 | return None 95 | else: 96 | path = [] 97 | node = other_end 98 | while node != root: 99 | path.append(parents[node][1]) 100 | path.append(parents[node][0]) 101 | node = parents[node][0] 102 | return path 103 | 104 | 105 | def _flip_augmenting_path(matching, path): 106 | for i in range(0, len(path), 2): 107 | a, b = path[i], path[i + 1] 108 | matching[a] = b 109 | matching[b] = a 110 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_selfies/one_hot_converter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Fri Jul 26 10:12:09 2019 5 | 6 | @author: akshat 7 | """ 8 | import numpy as np 9 | 10 | def unique_chars_iterator(smile): 11 | """ 12 | Iterate over the characters of a smile string. 13 | Note that 'Cl' & 'Br' are considered as one character 14 | """ 15 | atoms = [] 16 | for i in range(len(smile)): 17 | atoms.append(smile[i]) 18 | return atoms 19 | 20 | def smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding): 21 | """ 22 | Go from a single smile string to a one-hot encoding. 23 | """ 24 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 25 | # print('ENCODING: ', char_to_int) 26 | # integer encode input smile 27 | if type_of_encoding==0: 28 | for _ in range(largest_smile_len-len(smile)): 29 | smile+=' ' 30 | elif type_of_encoding==1: 31 | for _ in range(largest_smile_len-len(smile)): 32 | smile+=' ' 33 | elif type_of_encoding==2: 34 | for _ in range(largest_smile_len-len(smile)): 35 | smile+='A' 36 | 37 | integer_encoded = [char_to_int[char] for char in unique_chars_iterator(smile)] 38 | 39 | 40 | # one hot-encode input smile 41 | onehot_encoded = list() 42 | for value in integer_encoded: 43 | letter = [0 for _ in range(len(alphabet))] 44 | letter[value] = 1 45 | onehot_encoded.append(letter) 46 | return integer_encoded, np.array(onehot_encoded) 47 | 48 | 49 | def multiple_smile_to_hot(smiles_list, largest_smile_len, alphabet, type_of_encoding): 50 | """ 51 | Convert a list of smile strings to a one-hot encoding 52 | 53 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 54 | """ 55 | hot_list = [] 56 | for smile in smiles_list: 57 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding) 58 | hot_list.append(onehot_encoded) 59 | return np.array(hot_list) 60 | 61 | 62 | def hot_to_smile(onehot_encoded, alphabet): 63 | """ 64 | Go from one-hot encoding to smile string 65 | """ 66 | # From one-hot to integer encoding 67 | integer_encoded = onehot_encoded.argmax(1) 68 | # print('integer_encoded ', integer_encoded) 69 | 70 | int_to_char = dict((i, c) for i, c in enumerate(alphabet)) 71 | # print('DECODING: ', int_to_char) 72 | # integer encoding to smile 73 | regen_smile = "".join(int_to_char[x] for x in integer_encoded) 74 | regen_smile = regen_smile.strip() 75 | return regen_smile 76 | 77 | 78 | def check_conversion_bijection(smiles_list, largest_smile_len, alphabet): 79 | """ 80 | This function should be called to check successful conversion to and from 81 | one-hot on a data set. 
82 | """ 83 | for i, smile in enumerate(smiles_list): 84 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding=0) 85 | regen_smile = hot_to_smile(onehot_encoded, alphabet) 86 | # print('Original: ', smile, ' shape: ', len(smile)) 87 | # print('REcon: ', regen_smile , ' shape: ', len(regen_smile)) 88 | if smile != regen_smile: 89 | print('Filed conversion for: ', smile, ' @index: ', i) 90 | raise Exception('FAILEDDDD!!!') 91 | print('All conditions passed!') 92 | 93 | 94 | 95 | #with open('smiles_qm9.txt') as f: 96 | # content = f.readlines() 97 | #content = content[1:] 98 | #content = [x.strip() for x in content] 99 | #A = [x.split(',')[1] for x in content] 100 | # 101 | #alphabets = ['N', '1', '(', '#', 'C', '3', '5', 'O', '2', 'F', '=', '4', ')', ' '] 102 | # 103 | #data = multiple_smile_to_hot(A, len(max(A, key=len)), alphabets, 0) 104 | 105 | #check_conversion_bijection(smiles_list=A, largest_smile_len=len(max(A, key=len)), alphabet=alphabets) 106 | -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_smiles/one_hot_converter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Fri Jul 26 10:12:09 2019 5 | 6 | @author: akshat 7 | """ 8 | import numpy as np 9 | 10 | def unique_chars_iterator(smile): 11 | """ 12 | Iterate over the characters of a smile string. 13 | Note that 'Cl' & 'Br' are considered as one character 14 | """ 15 | atoms = [] 16 | for i in range(len(smile)): 17 | atoms.append(smile[i]) 18 | return atoms 19 | 20 | def smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding): 21 | """ 22 | Go from a single smile string to a one-hot encoding. 
23 | """ 24 | char_to_int = dict((c, i) for i, c in enumerate(alphabet)) 25 | # print('ENCODING: ', char_to_int) 26 | # integer encode input smile 27 | if type_of_encoding==0: 28 | for _ in range(largest_smile_len-len(smile)): 29 | smile+=' ' 30 | elif type_of_encoding==1: 31 | for _ in range(largest_smile_len-len(smile)): 32 | smile+=' ' 33 | elif type_of_encoding==2: 34 | for _ in range(largest_smile_len-len(smile)): 35 | smile+='A' 36 | 37 | integer_encoded = [char_to_int[char] for char in unique_chars_iterator(smile)] 38 | 39 | 40 | # one hot-encode input smile 41 | onehot_encoded = list() 42 | for value in integer_encoded: 43 | letter = [0 for _ in range(len(alphabet))] 44 | letter[value] = 1 45 | onehot_encoded.append(letter) 46 | return integer_encoded, np.array(onehot_encoded) 47 | 48 | 49 | def multiple_smile_to_hot(smiles_list, largest_smile_len, alphabet, type_of_encoding): 50 | """ 51 | Convert a list of smile strings to a one-hot encoding 52 | 53 | Returned shape (num_smiles x len_of_largest_smile x len_smile_encoding) 54 | """ 55 | hot_list = [] 56 | for smile in smiles_list: 57 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding) 58 | hot_list.append(onehot_encoded) 59 | return np.array(hot_list) 60 | 61 | 62 | def hot_to_smile(onehot_encoded, alphabet): 63 | """ 64 | Go from one-hot encoding to smile string 65 | """ 66 | # From one-hot to integer encoding 67 | integer_encoded = onehot_encoded.argmax(1) 68 | # print('integer_encoded ', integer_encoded) 69 | 70 | int_to_char = dict((i, c) for i, c in enumerate(alphabet)) 71 | # print('DECODING: ', int_to_char) 72 | # integer encoding to smile 73 | regen_smile = "".join(int_to_char[x] for x in integer_encoded) 74 | regen_smile = regen_smile.strip() 75 | return regen_smile 76 | 77 | 78 | def check_conversion_bijection(smiles_list, largest_smile_len, alphabet): 79 | """ 80 | This function should be called to check successful conversion to and from 81 | one-hot on a data set. 
82 | """ 83 | for i, smile in enumerate(smiles_list): 84 | _, onehot_encoded = smile_to_hot(smile, largest_smile_len, alphabet, type_of_encoding=0) 85 | regen_smile = hot_to_smile(onehot_encoded, alphabet) 86 | # print('Original: ', smile, ' shape: ', len(smile)) 87 | # print('REcon: ', regen_smile , ' shape: ', len(regen_smile)) 88 | if smile != regen_smile: 89 | print('Filed conversion for: ', smile, ' @index: ', i) 90 | raise Exception('FAILEDDDD!!!') 91 | print('All conditions passed!') 92 | 93 | 94 | 95 | #with open('smiles_qm9.txt') as f: 96 | # content = f.readlines() 97 | #content = content[1:] 98 | #content = [x.strip() for x in content] 99 | #A = [x.split(',')[1] for x in content] 100 | # 101 | #alphabets = ['N', '1', '(', '#', 'C', '3', '5', 'O', '2', 'F', '=', '4', ')', ' '] 102 | # 103 | #data = multiple_smile_to_hot(A, len(max(A, key=len)), alphabets, 0) 104 | 105 | #check_conversion_bijection(smiles_list=A, largest_smile_len=len(max(A, key=len)), alphabet=alphabets) 106 | -------------------------------------------------------------------------------- /original_code_from_paper/vae/VAE_dependencies/GrammarVAE_codes.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import pdb 3 | #import zinc_grammar 4 | import numpy as np 5 | import h5py 6 | #import molecule_vae 7 | 8 | import GrammarVAE_grammar as zinc_grammar 9 | 10 | #f = open('data/250k_rndm_zinc_drugs_clean.smi','r') 11 | #L = [] 12 | # 13 | #count = -1 14 | #for line in f: 15 | # line = line.strip() 16 | # L.append(line) # The zinc data set 17 | #f.close() 18 | # 19 | NCHARS = len(zinc_grammar.GCFG.productions()) 20 | 21 | 22 | 23 | def prods_to_eq(prods): 24 | seq = [prods[0].lhs()] 25 | for prod in prods: 26 | if str(prod.lhs()) == 'Nothing': 27 | break 28 | for ix, s in enumerate(seq): 29 | if s == prod.lhs(): 30 | seq = seq[:ix] + list(prod.rhs()) + seq[ix+1:] 31 | break 32 | try: 33 | return ''.join(seq) 34 | except: 35 | return '' 36 | 37 | 38 | def get_zinc_tokenizer(cfg): 39 | long_tokens = [a for a in list(cfg._lexical_index.keys()) if len(a) > 1] 40 | replacements = ['$','%','^'] # ,'&'] 41 | assert len(long_tokens) == len(replacements) 42 | for token in replacements: 43 | assert token not in cfg._lexical_index 44 | 45 | def tokenize(smiles): 46 | for i, token in enumerate(long_tokens): 47 | smiles = smiles.replace(token, replacements[i]) 48 | tokens = [] 49 | for token in smiles: 50 | try: 51 | ix = replacements.index(token) 52 | tokens.append(long_tokens[ix]) 53 | except: 54 | tokens.append(token) 55 | return tokens 56 | 57 | return tokenize 58 | 59 | 60 | def to_one_hot(smiles, MaxNumSymbols, check=True): 61 | """ Encode a list of smiles strings to one-hot vectors """ 62 | assert type(smiles) == list 63 | prod_map = {} 64 | for ix, prod in enumerate(zinc_grammar.GCFG.productions()): 65 | prod_map[prod] = ix 66 | tokenize = get_zinc_tokenizer(zinc_grammar.GCFG) 67 | tokens = list(map(tokenize, smiles)) 68 | parser = nltk.ChartParser(zinc_grammar.GCFG) 69 | parse_trees = [next(parser.parse(t)) for t in tokens] 70 | productions_seq = [tree.productions() for tree in parse_trees] 71 | 72 | #if check: 73 | # print(productions_seq) 74 | 75 | indices = [np.array([prod_map[prod] for prod in entry], dtype=int) for entry in productions_seq] 76 | one_hot = np.zeros((len(indices), MaxNumSymbols, NCHARS), dtype=np.float32) 77 | for i in range(len(indices)): 78 | num_productions = len(indices[i]) 79 | one_hot[i][np.arange(num_productions),indices[i]] = 1. 
80 | one_hot[i][np.arange(num_productions, MaxNumSymbols),-1] = 1. 81 | return one_hot 82 | 83 | 84 | 85 | def SizeOneHot(smiles, check=True): 86 | """ Encode a list of smiles strings to one-hot vectors """ 87 | assert type(smiles) == list 88 | prod_map = {} 89 | for ix, prod in enumerate(zinc_grammar.GCFG.productions()): 90 | prod_map[prod] = ix 91 | tokenize = get_zinc_tokenizer(zinc_grammar.GCFG) 92 | tokens = list(map(tokenize, smiles)) 93 | parser = nltk.ChartParser(zinc_grammar.GCFG) 94 | parse_trees = [next(parser.parse(t)) for t in tokens] 95 | productions_seq = [tree.productions() for tree in parse_trees] 96 | 97 | indices = [np.array([prod_map[prod] for prod in entry], dtype=int) for entry in productions_seq] 98 | return len(indices[0]) 99 | 100 | 101 | # SINGLE EXAMPLE 102 | #smile = [L[0]] 103 | ##smile = ['C'] 104 | #one_hot_single = to_one_hot(smile, ) 105 | #print(one_hot_single.shape) 106 | #print(one_hot_single) 107 | 108 | 109 | # GOING THROUGH ALL OF ZINC.... 110 | 111 | #OH = np.zeros((len(L),MAX_LEN,NCHARS)) 112 | #for i in range(0, len(L), 100): 113 | # print('Processing: i=[' + str(i) + ':' + str(i+100) + ']') 114 | # onehot = to_one_hot(L[i:i+100], False) 115 | # OH[i:i+100,:,:] = onehot 116 | # 117 | #h5f = h5py.File('zinc_grammar_dataset.h5','w') 118 | #h5f.create_dataset('data', data=OH) 119 | #h5f.close() 120 | -------------------------------------------------------------------------------- /tests/test_selfies_utils.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | import selfies as sf 4 | 5 | 6 | class Entry: 7 | 8 | def __init__(self, selfies, symbols, label, one_hot): 9 | self.selfies = selfies 10 | self.symbols = symbols 11 | self.label = label 12 | self.one_hot = one_hot 13 | 14 | 15 | @pytest.fixture() 16 | def dataset(): 17 | stoi = {"[nop]": 0, "[O]": 1, ".": 2, "[C]": 3, "[F]": 4} 18 | itos = {i: c for c, i in stoi.items()} 19 | pad_to_len = 4 20 | 21 | entries = [ 22 | Entry(selfies="", 23 | symbols=[], 24 | label=[0, 0, 0, 0], 25 | one_hot=[[1, 0, 0, 0, 0], 26 | [1, 0, 0, 0, 0], 27 | [1, 0, 0, 0, 0], 28 | [1, 0, 0, 0, 0]]), 29 | Entry(selfies="[C][C][C]", 30 | symbols=["[C]", "[C]", "[C]"], 31 | label=[3, 3, 3, 0], 32 | one_hot=[[0, 0, 0, 1, 0], 33 | [0, 0, 0, 1, 0], 34 | [0, 0, 0, 1, 0], 35 | [1, 0, 0, 0, 0]]), 36 | Entry(selfies="[C].[C]", 37 | symbols=["[C]", ".", "[C]"], 38 | label=[3, 2, 3, 0], 39 | one_hot=[[0, 0, 0, 1, 0], 40 | [0, 0, 1, 0, 0], 41 | [0, 0, 0, 1, 0], 42 | [1, 0, 0, 0, 0]]), 43 | Entry(selfies="[C][O][C][F]", 44 | symbols=["[C]", "[O]", "[C]", "[F]"], 45 | label=[3, 1, 3, 4], 46 | one_hot=[[0, 0, 0, 1, 0], 47 | [0, 1, 0, 0, 0], 48 | [0, 0, 0, 1, 0], 49 | [0, 0, 0, 0, 1]]), 50 | Entry(selfies="[C][O][C]", 51 | symbols=["[C]", "[O]", "[C]"], 52 | label=[3, 1, 3, 0], 53 | one_hot=[[0, 0, 0, 1, 0], 54 | [0, 1, 0, 0, 0], 55 | [0, 0, 0, 1, 0], 56 | [1, 0, 0, 0, 0]]) 57 | ] 58 | 59 | return entries, (stoi, itos, pad_to_len) 60 | 61 | 62 | @pytest.fixture() 63 | def dataset_flat_hots(dataset): 64 | flat_hots = [] 65 | for entry in dataset[0]: 66 | hot = [elm for vec in entry.one_hot for elm in vec] 67 | flat_hots.append(hot) 68 | return flat_hots 69 | 70 | 71 | def test_len_selfies(dataset): 72 | for entry in dataset[0]: 73 | assert sf.len_selfies(entry.selfies) == len(entry.symbols) 74 | 75 | 76 | def test_split_selfies(dataset): 77 | for entry in dataset[0]: 78 | assert list(sf.split_selfies(entry.selfies)) == entry.symbols 79 | 80 | 81 | def 
test_get_alphabet_from_selfies(dataset): 82 | entries, (vocab_stoi, _, _) = dataset 83 | 84 | selfies = [entry.selfies for entry in entries] 85 | alphabet = sf.get_alphabet_from_selfies(selfies) 86 | alphabet.add("[nop]") 87 | alphabet.add(".") 88 | 89 | assert alphabet == set(vocab_stoi.keys()) 90 | 91 | 92 | def test_selfies_to_encoding(dataset): 93 | entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset 94 | 95 | for entry in entries: 96 | label, one_hot = sf.selfies_to_encoding( 97 | entry.selfies, vocab_stoi, pad_to_len, "both" 98 | ) 99 | 100 | assert label == entry.label 101 | assert one_hot == entry.one_hot 102 | 103 | # recover original selfies 104 | selfies = sf.encoding_to_selfies(label, vocab_itos, "label") 105 | selfies = selfies.replace("[nop]", "") 106 | assert selfies == entry.selfies 107 | 108 | selfies = sf.encoding_to_selfies(one_hot, vocab_itos, "one_hot") 109 | selfies = selfies.replace("[nop]", "") 110 | assert selfies == entry.selfies 111 | 112 | 113 | def test_selfies_to_flat_hot(dataset, dataset_flat_hots): 114 | entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset 115 | 116 | batch = [entry.selfies for entry in entries] 117 | flat_hots = sf.batch_selfies_to_flat_hot(batch, vocab_stoi, pad_to_len) 118 | 119 | assert flat_hots == dataset_flat_hots 120 | 121 | # recover original selfies 122 | recovered = sf.batch_flat_hot_to_selfies(flat_hots, vocab_itos) 123 | assert batch == [s.replace("[nop]", "") for s in recovered] 124 | -------------------------------------------------------------------------------- /original_code_from_paper/bitflip_from_mdma.py: -------------------------------------------------------------------------------- 1 | # 2 | # Self-Referencing Embedded Strings (SELFIES): 3 | # A 100% robust molecular string representation 4 | # (https://arxiv.org/abs/1905.13741) 5 | # by Mario Krenn, Florian Haese, AkshatKumar Nigam, 6 | # Pascal Friederich, Alán Aspuru-Guzik 7 | # 8 | # Demo of Rubustness of SMILES and SELFIES and DeepSMILES 9 | # 10 | # Generates 1000 cases of 1, 2 or 3 mutations of small bio-molecule (MDMA). 
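# Running the script directly (e.g. `python bitflip_from_mdma.py`) prints, for each
# representation (SMILES, SELFIES, DeepSMILES), the fraction of mutated strings that
# still correspond to a valid molecule.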
11 | # The alphabets are those that can translate the QM0 dataset (and have been used in all experiments in the paper) 12 | # 13 | # 14 | # questions/remarks: mario.krenn@utoronto.ca or alan@aspuru.com 15 | # 16 | # 11.03.2020 17 | # 18 | # Requirements: RDKit 19 | # selfies (pip install selfies) 20 | # DeepSMILES (pip install --upgrade deepsmiles) 21 | # 22 | # 23 | 24 | 25 | from rdkit.Chem import MolFromSmiles 26 | from rdkit import rdBase 27 | 28 | from random import randint 29 | from selfies import encoder, decoder 30 | 31 | import deepsmiles 32 | 33 | rdBase.DisableLog('rdApp.error') 34 | 35 | 36 | def IsCorrectSMILES(smiles): 37 | if len(smiles)==0: 38 | resMol=None 39 | else: 40 | try: 41 | resMol=MolFromSmiles(smiles, sanitize=True) 42 | except Exception: 43 | resMol=None 44 | 45 | if resMol==None: 46 | return 0 47 | else: 48 | return 1 49 | 50 | 51 | def tokenize_selfies(selfies): 52 | location=selfies.find(']') 53 | all_tokens=[] 54 | while location>=0: 55 | all_tokens.append(selfies[0:location+1]) 56 | selfies=selfies[location+1:] 57 | location=selfies.find(']') 58 | 59 | return all_tokens 60 | 61 | 62 | 63 | def detokenize_selfies(selfies_list): 64 | selfies='' 65 | for ii in range(len(selfies_list)): 66 | selfies=selfies+selfies_list[ii] 67 | 68 | return selfies 69 | 70 | mdma='CNC(C)CC1=CC=C2C(=C1)OCO2' 71 | smiles_symbols='FONC()=#12345' # with this alphabet, the whole QM9 db can be translated (except of ions and stereochemistry) 72 | 73 | print('\n\n\n') 74 | print('SMILES: ',mdma,'\n') 75 | 76 | num_repeat=1000 77 | for c_num_of_mut in range(3): 78 | single_mut_err=0 79 | for c_muts in range(num_repeat): 80 | 81 | new_mdma=mdma 82 | for ii in range(c_num_of_mut+1): 83 | mol_idx=randint(0,len(new_mdma)-1) 84 | symbol_idx=randint(0,len(smiles_symbols)-1) 85 | 86 | new_mdma=new_mdma[0:mol_idx]+smiles_symbols[symbol_idx]+new_mdma[mol_idx+1:] 87 | 88 | res_new=IsCorrectSMILES(new_mdma) 89 | if res_new==0: 90 | single_mut_err=single_mut_err+1 91 | 92 | print(c_num_of_mut+1, 'mutations with SMILES. Correct: ', num_repeat-single_mut_err, '/', num_repeat, '=', 1-single_mut_err/num_repeat) 93 | 94 | 95 | 96 | # SELFIES code 97 | mdma_selfies=encoder(mdma) 98 | print('\n\n\n') 99 | print('SELFIES: ',mdma_selfies,'\n') 100 | mdma_selfies_tok=tokenize_selfies(mdma_selfies) 101 | selfies_symbols=['[epsilon]','[Ring1]','[Ring2]','[Branch1_1]','[Branch1_2]','[Branch1_3]','[F]','[O]','[=O]','[N]','[=N]','[#N]','[C]','[=C]','[#C]']; #with this alphabet, the whole QM9 db can be translated (except of ions and stereochemistry) 102 | 103 | num_repeat=1000 104 | for c_num_of_mut in range(3): 105 | single_mut_err=0 106 | for c_muts in range(num_repeat): 107 | 108 | new_mdma=mdma_selfies_tok 109 | for ii in range(c_num_of_mut+1): 110 | mol_idx=randint(0,len(new_mdma)-1) 111 | symbol_idx=randint(0,len(selfies_symbols)-1) 112 | 113 | new_mdma_str=detokenize_selfies(new_mdma[0:mol_idx]) 114 | new_mdma_str=new_mdma_str+selfies_symbols[symbol_idx] 115 | new_mdma_str=new_mdma_str+detokenize_selfies(new_mdma[mol_idx+1:]) 116 | new_mdma=tokenize_selfies(new_mdma_str) 117 | 118 | mutated_selfies=detokenize_selfies(new_mdma) 119 | mutated_smiles=decoder(mutated_selfies) 120 | res_new=IsCorrectSMILES(mutated_smiles) 121 | 122 | if res_new==0: 123 | single_mut_err=single_mut_err+1 124 | 125 | if c_muts>0 and c_muts%1000==0: 126 | print('Iteration: ', c_muts, '/', num_repeat) 127 | 128 | 129 | print(c_num_of_mut+1, 'mutations with SELFIES. 
Correct: ', num_repeat-single_mut_err, '/', num_repeat, '=', 1-single_mut_err/num_repeat) 130 | 131 | 132 | 133 | # DeepSMILES code 134 | deepsmiles_symbols='FONC)=#3456789' # with this alphabet, the whole QM9 db can be translated (except of ions and stereochemistry) 135 | converter = deepsmiles.Converter(rings=True, branches=True) 136 | 137 | mdma_deepsmiles=converter.encode(mdma) 138 | print('\n\n\n') 139 | print('DeepSMILES: ',mdma_deepsmiles,'\n') 140 | 141 | num_repeat=1000 142 | for c_num_of_mut in range(3): 143 | single_mut_err=0 144 | for c_muts in range(num_repeat): 145 | 146 | new_mdma=mdma_deepsmiles 147 | for ii in range(c_num_of_mut+1): 148 | mol_idx=randint(0,len(new_mdma)-1) 149 | symbol_idx=randint(0,len(deepsmiles_symbols)-1) 150 | 151 | new_mdma=new_mdma[0:mol_idx]+deepsmiles_symbols[symbol_idx]+new_mdma[mol_idx+1:] 152 | 153 | try: 154 | mutated_smiles=converter.decode(new_mdma) 155 | except Exception: 156 | mutated_smiles='err' 157 | 158 | res_new=IsCorrectSMILES(mutated_smiles) 159 | if res_new==0: 160 | single_mut_err=single_mut_err+1 161 | 162 | print(c_num_of_mut+1, 'mutations with DeepSMILES. Correct: ', num_repeat-single_mut_err, '/', num_repeat, '=', 1-single_mut_err/num_repeat) 163 | -------------------------------------------------------------------------------- /tests/test_selfies.py: -------------------------------------------------------------------------------- 1 | import faulthandler 2 | import random 3 | 4 | import pytest 5 | from rdkit.Chem import MolFromSmiles 6 | 7 | import selfies as sf 8 | 9 | faulthandler.enable() 10 | 11 | 12 | @pytest.fixture() 13 | def max_selfies_len(): 14 | return 1000 15 | 16 | 17 | @pytest.fixture() 18 | def large_alphabet(): 19 | alphabet = sf.get_semantic_robust_alphabet() 20 | alphabet.update([ 21 | "[#Br]", "[#Branch1]", "[#Branch2]", "[#Branch3]", "[#C@@H1]", 22 | "[#C@@]", "[#C@H1]", "[#C@]", "[#C]", "[#Cl]", "[#F]", "[#H]", "[#I]", 23 | "[#NH1]", "[#N]", "[#O]", "[#P]", "[#Ring1]", "[#Ring2]", "[#Ring3]", 24 | "[#S]", "[/Br]", "[/C@@H1]", "[/C@@]", "[/C@H1]", "[/C@]", "[/C]", 25 | "[/Cl]", "[/F]", "[/H]", "[/I]", "[/NH1]", "[/N]", "[/O]", "[/P]", 26 | "[/S]", "[=Br]", "[=Branch1]", "[=Branch2]", "[=Branch3]", "[=C@@H1]", 27 | "[=C@@]", "[=C@H1]", "[=C@]", "[=C]", "[=Cl]", "[=F]", "[=H]", "[=I]", 28 | "[=NH1]", "[=N]", "[=O]", "[=P]", "[=Ring1]", "[=Ring2]", "[=Ring3]", 29 | "[=S]", "[Br]", "[Branch1]", "[Branch2]", "[Branch3]", "[C@@H1]", 30 | "[C@@]", "[C@H1]", "[C@]", "[C]", "[Cl]", "[F]", "[H]", "[I]", "[NH1]", 31 | "[N]", "[O]", "[P]", "[Ring1]", "[Ring2]", "[Ring3]", "[S]", "[\\Br]", 32 | "[\\C@@H1]", "[\\C@@]", "[\\C@H1]", "[\\C@]", "[\\C]", "[\\Cl]", 33 | "[\\F]", "[\\H]", "[\\I]", "[\\NH1]", "[\\N]", "[\\O]", "[\\P]", 34 | "[\\S]", "[nop]" 35 | ]) 36 | return list(alphabet) 37 | 38 | 39 | def test_random_selfies_decoder(trials, max_selfies_len, large_alphabet): 40 | """Tests that SELFIES that are generated by randomly stringing together 41 | symbols from the SELFIES alphabet are decoded into valid SMILES. 
42 | """ 43 | 44 | alphabet = tuple(large_alphabet) 45 | 46 | for _ in range(trials): 47 | 48 | # create random SELFIES and decode 49 | rand_len = random.randint(1, max_selfies_len) 50 | rand_selfies = "".join(random_choices(alphabet, k=rand_len)) 51 | smiles = sf.decoder(rand_selfies) 52 | 53 | # check if SMILES is valid 54 | try: 55 | is_valid = MolFromSmiles(smiles, sanitize=True) is not None 56 | except Exception: 57 | is_valid = False 58 | 59 | err_msg = "SMILES: {}\n\t SELFIES: {}".format(smiles, rand_selfies) 60 | assert is_valid, err_msg 61 | 62 | 63 | def test_nop_symbol_decoder(max_selfies_len, large_alphabet): 64 | """Tests that the '[nop]' symbol is always skipped over. 65 | """ 66 | 67 | alphabet = list(large_alphabet) 68 | alphabet.remove("[nop]") 69 | 70 | for _ in range(100): 71 | 72 | # create random SELFIES with and without [nop] 73 | rand_len = random.randint(1, max_selfies_len) 74 | rand_mol = random_choices(alphabet, k=rand_len) 75 | rand_mol.extend(["[nop]"] * (max_selfies_len - rand_len)) 76 | random.shuffle(rand_mol) 77 | 78 | with_nops = "".join(rand_mol) 79 | without_nops = with_nops.replace("[nop]", "") 80 | 81 | assert sf.decoder(with_nops) == sf.decoder(without_nops) 82 | 83 | 84 | def test_get_semantic_constraints(): 85 | constraints = sf.get_semantic_constraints() 86 | assert constraints is not sf.get_semantic_constraints() # not alias 87 | assert "?" in constraints 88 | 89 | 90 | def test_change_constraints_cache_clear(): 91 | alphabet = sf.get_semantic_robust_alphabet() 92 | assert alphabet == sf.get_semantic_robust_alphabet() 93 | assert sf.decoder("[C][#C]") == "C#C" 94 | 95 | new_constraints = sf.get_semantic_constraints() 96 | new_constraints["C"] = 1 97 | sf.set_semantic_constraints(new_constraints) 98 | 99 | new_alphabet = sf.get_semantic_robust_alphabet() 100 | assert new_alphabet != alphabet 101 | assert sf.decoder("[C][#C]") == "CC" 102 | 103 | sf.set_semantic_constraints() # re-set alphabet 104 | 105 | 106 | def test_invalid_or_unsupported_smiles_encoder(): 107 | malformed_smiles = [ 108 | "", 109 | "(", 110 | "C(Cl)(Cl)CC[13C", 111 | "C(CCCOC", 112 | "C=(CCOC", 113 | "CCCC)", 114 | "C1CCCCC", 115 | "C(F)(F)(F)(F)(F)F", # violates bond constraints 116 | "C=C1=CCCCCC1", # violates bond constraints 117 | "CC*CC", # uses wildcard 118 | "C$C", # uses $ bond 119 | "S[As@TB1](F)(Cl)(Br)N", # unrecognized chirality, 120 | "SOMETHINGWRONGHERE", 121 | "1243124124", 122 | ] 123 | 124 | for smiles in malformed_smiles: 125 | with pytest.raises(sf.EncoderError): 126 | sf.encoder(smiles) 127 | 128 | 129 | def test_malformed_selfies_decoder(): 130 | with pytest.raises(sf.DecoderError): 131 | sf.decoder("[O][=C][O][C][C][C][C][O][N][Branch2_3") 132 | 133 | 134 | def random_choices(population, k): # random.choices was new in Python v3.6 135 | return [random.choice(population) for _ in range(k)] 136 | 137 | 138 | def test_decoder_attribution(): 139 | sm, am = sf.decoder( 140 | "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]", attribute=True) 141 | # check that P lined up 142 | for ta in am: 143 | if ta.token == 'P': 144 | for a in ta.attribution: 145 | if a.token == '[P]': 146 | return 147 | raise ValueError('Failed to find P in attribution map') 148 | 149 | 150 | def test_encoder_attribution(): 151 | smiles = "C1([O-])C=CC=C1Cl" 152 | indices = [0, 3, 3, 3, 5, 7, 8, 10, None, None, 12] 153 | s, am = sf.encoder(smiles, attribute=True) 154 | for i, ta in enumerate(am): 155 | if ta.attribution: 156 | assert indices[i] == ta.attribution[0].index, \ 157 | f'found 
{ta[1]}; should be {indices[i]}' 158 | if ta.token == '[Cl]': 159 | assert 'Cl' in [ 160 | a.token for a in ta.attribution],\ 161 | 'Failed to find Cl in attribution map' 162 | -------------------------------------------------------------------------------- /selfies/grammar_rules.py: -------------------------------------------------------------------------------- 1 | import functools 2 | import itertools 3 | import re 4 | from typing import Any, List, Optional, Tuple 5 | 6 | from selfies.constants import ( 7 | ELEMENTS, 8 | INDEX_ALPHABET, 9 | INDEX_CODE, 10 | ORGANIC_SUBSET 11 | ) 12 | from selfies.mol_graph import Atom 13 | from selfies.utils.smiles_utils import smiles_to_bond 14 | 15 | 16 | def process_atom_symbol(symbol: str) -> Optional[Tuple[Any, Atom]]: 17 | try: 18 | output = _PROCESS_ATOM_CACHE[symbol] 19 | except KeyError: 20 | output = _process_atom_selfies_no_cache(symbol) 21 | if output is None: 22 | return None 23 | _PROCESS_ATOM_CACHE[symbol] = output 24 | 25 | bond_info, atom_fac = output 26 | atom = atom_fac() 27 | if atom.bonding_capacity < 0: 28 | return None # too many Hs (e.g. [CH9] 29 | return bond_info, atom 30 | 31 | 32 | def process_branch_symbol(symbol: str) -> Optional[Tuple[int, int]]: 33 | try: 34 | return _PROCESS_BRANCH_CACHE[symbol] 35 | except KeyError: 36 | return None 37 | 38 | 39 | def process_ring_symbol(symbol: str) -> Optional[Tuple[int, int, Any]]: 40 | try: 41 | return _PROCESS_RING_CACHE[symbol] 42 | except KeyError: 43 | return None 44 | 45 | 46 | def next_atom_state( 47 | bond_order: int, bond_cap: int, state: int 48 | ) -> Tuple[int, Optional[int]]: 49 | if state == 0: 50 | bond_order = 0 51 | 52 | bond_order = min(bond_order, state, bond_cap) 53 | bonds_left = bond_cap - bond_order 54 | next_state = None if (bonds_left == 0) else bonds_left 55 | return bond_order, next_state 56 | 57 | 58 | def next_branch_state( 59 | branch_type: int, state: int 60 | ) -> Tuple[int, Optional[int]]: 61 | assert 1 <= branch_type <= 3 62 | assert state > 1 63 | 64 | branch_init_state = min(state - 1, branch_type) 65 | next_state = state - branch_init_state 66 | return branch_init_state, next_state 67 | 68 | 69 | def next_ring_state( 70 | ring_type: int, state: int 71 | ) -> Tuple[int, Optional[int]]: 72 | assert state > 0 73 | 74 | bond_order = min(ring_type, state) 75 | bonds_left = state - bond_order 76 | next_state = None if (bonds_left == 0) else bonds_left 77 | return bond_order, next_state 78 | 79 | 80 | def get_index_from_selfies(*symbols: List[str]) -> int: 81 | index = 0 82 | for i, c in enumerate(reversed(symbols)): 83 | index += INDEX_CODE.get(c, 0) * (len(INDEX_CODE) ** i) 84 | return index 85 | 86 | 87 | def get_selfies_from_index(index: int) -> List[str]: 88 | if index < 0: 89 | raise IndexError() 90 | elif index == 0: 91 | return [INDEX_ALPHABET[0]] 92 | 93 | symbols = [] 94 | base = len(INDEX_ALPHABET) 95 | while index: 96 | symbols.append(INDEX_ALPHABET[index % base]) 97 | index //= base 98 | return symbols[::-1] 99 | 100 | 101 | # ============================================================================= 102 | # Caches (for computational speed) 103 | # ============================================================================= 104 | 105 | 106 | SELFIES_ATOM_PATTERN = re.compile( 107 | r"^[\[]" # opening square bracket [ 108 | r"([=#/\\]?)" # bond char 109 | r"(\d*)" # isotope number (optional, e.g. 
123, 26) 110 | r"([A-Z][a-z]?)" # element symbol 111 | r"([@]{0,2})" # chiral_tag (optional, only @ and @@ supported) 112 | r"((?:[H]\d)?)" # H count (optional, e.g. H1, H3) 113 | r"((?:[+-][1-9]+)?)" # charge (optional, e.g. +1) 114 | r"[]]$" # closing square bracket ] 115 | ) 116 | 117 | 118 | def _process_atom_selfies_no_cache(symbol): 119 | m = SELFIES_ATOM_PATTERN.match(symbol) 120 | if m is None: 121 | return None 122 | bond_char, isotope, element, chirality, h_count, charge = m.groups() 123 | 124 | if symbol[1 + len(bond_char):-1] in ORGANIC_SUBSET: 125 | atom_fac = functools.partial(Atom, element=element, is_aromatic=False) 126 | return smiles_to_bond(bond_char), atom_fac 127 | 128 | isotope = None if (isotope == "") else int(isotope) 129 | if element not in ELEMENTS: 130 | return None 131 | chirality = None if (chirality == "") else chirality 132 | 133 | s = h_count 134 | if s == "": 135 | h_count = 0 136 | else: 137 | h_count = int(s[1:]) 138 | 139 | s = charge 140 | if s == "": 141 | charge = 0 142 | else: 143 | charge = int(s[1:]) 144 | charge *= 1 if (s[0] == "+") else -1 145 | 146 | atom_fac = functools.partial( 147 | Atom, 148 | element=element, 149 | is_aromatic=False, 150 | isotope=isotope, 151 | chirality=chirality, 152 | h_count=h_count, 153 | charge=charge 154 | ) 155 | 156 | return smiles_to_bond(bond_char), atom_fac 157 | 158 | 159 | def _build_atom_cache(): 160 | cache = dict() 161 | common_symbols = [ 162 | "[#C+1]", "[#C-1]", "[#C]", "[#N+1]", "[#N]", "[#O+1]", "[#P+1]", 163 | "[#P-1]", "[#P]", "[#S+1]", "[#S-1]", "[#S]", "[=C+1]", "[=C-1]", 164 | "[=C]", "[=N+1]", "[=N-1]", "[=N]", "[=O+1]", "[=O]", "[=P+1]", 165 | "[=P-1]", "[=P]", "[=S+1]", "[=S-1]", "[=S]", "[Br]", "[C+1]", "[C-1]", 166 | "[C]", "[Cl]", "[F]", "[H]", "[I]", "[N+1]", "[N-1]", "[N]", "[O+1]", 167 | "[O-1]", "[O]", "[P+1]", "[P-1]", "[P]", "[S+1]", "[S-1]", "[S]" 168 | ] 169 | 170 | for symbol in common_symbols: 171 | cache[symbol] = _process_atom_selfies_no_cache(symbol) 172 | return cache 173 | 174 | 175 | def _build_branch_cache(): 176 | cache = dict() 177 | for L in range(1, 4): 178 | for bond_char in ["", "=", "#"]: 179 | symbol = "[{}Branch{}]".format(bond_char, L) 180 | cache[symbol] = (smiles_to_bond(bond_char)[0], L) 181 | return cache 182 | 183 | 184 | def _build_ring_cache(): 185 | cache = dict() 186 | for L in range(1, 4): 187 | # [RingL], [=RingL], [#RingL] 188 | for bond_char in ["", "=", "#"]: 189 | symbol = "[{}Ring{}]".format(bond_char, L) 190 | order, stereo = smiles_to_bond(bond_char) 191 | cache[symbol] = (order, L, (stereo, stereo)) 192 | 193 | # [-/RingL], [\/RingL], [\-RingL], ... 
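        # each (lchar, rchar) pair supplies the stereo bond characters at the two
        # ends of the ring-closure bond; the plain ("-", "-") pair is skipped below
        # because it is equivalent to the [RingL] symbol built above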
194 | for lchar, rchar in itertools.product(["-", "/", "\\"], repeat=2): 195 | if lchar == rchar == "-": 196 | continue 197 | symbol = "[{}{}Ring{}]".format(lchar, rchar, L) 198 | order, lstereo = smiles_to_bond(lchar) 199 | order, rstereo = smiles_to_bond(rchar) 200 | cache[symbol] = (order, L, (lstereo, rstereo)) 201 | return cache 202 | 203 | 204 | _PROCESS_ATOM_CACHE = _build_atom_cache() 205 | 206 | _PROCESS_BRANCH_CACHE = _build_branch_cache() 207 | 208 | _PROCESS_RING_CACHE = _build_ring_cache() 209 | -------------------------------------------------------------------------------- /selfies/bond_constraints.py: -------------------------------------------------------------------------------- 1 | import functools 2 | from itertools import product 3 | from typing import Dict, Set, Union 4 | 5 | from selfies.constants import ELEMENTS, INDEX_ALPHABET 6 | 7 | _DEFAULT_CONSTRAINTS = { 8 | "H": 1, "F": 1, "Cl": 1, "Br": 1, "I": 1, 9 | "B": 3, "B+1": 2, "B-1": 4, 10 | "O": 2, "O+1": 3, "O-1": 1, 11 | "N": 3, "N+1": 4, "N-1": 2, 12 | "C": 4, "C+1": 3, "C-1": 3, 13 | "P": 5, "P+1": 4, "P-1": 6, 14 | "S": 6, "S+1": 5, "S-1": 5, 15 | "?": 8 16 | } 17 | 18 | _PRESET_CONSTRAINTS = { 19 | "default": dict(_DEFAULT_CONSTRAINTS), 20 | "octet_rule": dict(_DEFAULT_CONSTRAINTS), 21 | "hypervalent": dict(_DEFAULT_CONSTRAINTS) 22 | } 23 | _PRESET_CONSTRAINTS["octet_rule"].update( 24 | {"S": 2, "S+1": 3, "S-1": 1, "P": 3, "P+1": 4, "P-1": 2} 25 | ) 26 | _PRESET_CONSTRAINTS["hypervalent"].update( 27 | {"Cl": 7, "Br": 7, "I": 7, "N": 5} 28 | ) 29 | 30 | _current_constraints = _PRESET_CONSTRAINTS["default"] 31 | 32 | 33 | def get_preset_constraints(name: str) -> Dict[str, int]: 34 | """Returns the preset semantic constraints with the given name. 35 | 36 | Besides the aforementioned default constraints, :mod:`selfies` offers 37 | other preset constraints for convenience; namely, constraints that 38 | enforce the `octet rule `_ 39 | and constraints that accommodate `hypervalent molecules 40 | `_. 41 | 42 | The differences between these constraints can be summarized as follows: 43 | 44 | .. table:: 45 | :align: center 46 | :widths: auto 47 | 48 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 49 | | | Cl, Br, I | N | P | P+1 | P-1 | S | S+1 | S-1 | 50 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 51 | | ``default`` | 1 | 3 | 5 | 4 | 6 | 6 | 5 | 5 | 52 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 53 | | ``octet_rule`` | 1 | 3 | 3 | 4 | 2 | 2 | 3 | 1 | 54 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 55 | | ``hypervalent`` | 7 | 5 | 5 | 6 | 4 | 6 | 7 | 5 | 56 | +-----------------+-----------+---+---+-----+-----+---+-----+-----+ 57 | 58 | :param name: the preset name: ``default`` or ``octet_rule`` or 59 | ``hypervalent``. 60 | :return: the preset constraints with the specified name, represented 61 | as a dictionary which maps atoms (the keys) to their bonding capacities 62 | (the values). 63 | """ 64 | 65 | if name not in _PRESET_CONSTRAINTS: 66 | raise ValueError("unrecognized preset name '{}'".format(name)) 67 | return dict(_PRESET_CONSTRAINTS[name]) 68 | 69 | 70 | def get_semantic_constraints() -> Dict[str, int]: 71 | """Returns the semantic constraints that :mod:`selfies` is currently 72 | operating on. 73 | 74 | :return: the current semantic constraints, represented as a dictionary 75 | which maps atoms (the keys) to their bonding capacities (the values). 
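
    :Example:

    >>> import selfies as sf
    >>> constraints = sf.get_semantic_constraints()
    >>> constraints["I"]    # under the default ("default" preset) constraints
    1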
76 | """ 77 | 78 | global _current_constraints 79 | return dict(_current_constraints) 80 | 81 | 82 | def set_semantic_constraints( 83 | bond_constraints: Union[str, Dict[str, int]] = "default" 84 | ) -> None: 85 | """Updates the semantic constraints that :mod:`selfies` operates on. 86 | 87 | If the input is a string, the new constraints are taken to be 88 | the preset named ``bond_constraints`` 89 | (see :func:`selfies.get_preset_constraints`). 90 | 91 | Otherwise, the input is a dictionary representing the new constraints. 92 | This dictionary maps atoms (the keys) to non-negative bonding 93 | capacities (the values); the atoms are specified by strings 94 | of the form ``E`` or ``E+C`` or ``E-C``, 95 | where ``E`` is an element symbol and ``C`` is a positive integer. 96 | For example, one may have: 97 | 98 | * ``bond_constraints["I-1"] = 0`` 99 | * ``bond_constraints["C"] = 4`` 100 | 101 | This dictionary must also contain the special ``?`` key, which indicates 102 | the bond capacities of all atoms that are not explicitly listed 103 | in the dictionary. 104 | 105 | :param bond_constraints: the name of a preset, or a dictionary 106 | representing the new semantic constraints. 107 | :return: ``None``. 108 | """ 109 | 110 | global _current_constraints 111 | 112 | if isinstance(bond_constraints, str): 113 | _current_constraints = get_preset_constraints(bond_constraints) 114 | 115 | elif isinstance(bond_constraints, dict): 116 | 117 | # error checking 118 | if "?" not in bond_constraints: 119 | raise ValueError("bond_constraints missing '?' as a key") 120 | 121 | for key, value in bond_constraints.items(): 122 | 123 | # error checking for keys 124 | j = max(key.find("+"), key.find("-")) 125 | if key == "?": 126 | valid = True 127 | elif j == -1: 128 | valid = (key in ELEMENTS) 129 | else: 130 | valid = (key[:j] in ELEMENTS) and key[j + 1:].isnumeric() 131 | if not valid: 132 | err_msg = "invalid key '{}' in bond_constraints".format(key) 133 | raise ValueError(err_msg) 134 | 135 | # error checking for values 136 | if not (isinstance(value, int) and value >= 0): 137 | err_msg = "invalid value at " \ 138 | "bond_constraints['{}'] = {}".format(key, value) 139 | raise ValueError(err_msg) 140 | 141 | _current_constraints = dict(bond_constraints) 142 | 143 | else: 144 | raise ValueError("bond_constraints must be a str or dict") 145 | 146 | # clear cache since we changed alphabet 147 | get_semantic_robust_alphabet.cache_clear() 148 | get_bonding_capacity.cache_clear() 149 | 150 | 151 | @functools.lru_cache() 152 | def get_semantic_robust_alphabet() -> Set[str]: 153 | """Returns a subset of all SELFIES symbols that are constrained 154 | by :mod:`selfies` under the current semantic constraints. 155 | 156 | :return: a subset of all SELFIES symbols that are semantically constrained. 
157 | """ 158 | 159 | alphabet_subset = set() 160 | bonds = {"": 1, "=": 2, "#": 3} 161 | 162 | # add atomic symbols 163 | for (a, c), (b, m) in product(_current_constraints.items(), bonds.items()): 164 | if (m > c) or (a == "?"): 165 | continue 166 | symbol = "[{}{}]".format(b, a) 167 | alphabet_subset.add(symbol) 168 | 169 | # add branch and ring symbols 170 | for i in range(1, 4): 171 | alphabet_subset.add("[Ring{}]".format(i)) 172 | alphabet_subset.add("[=Ring{}]".format(i)) 173 | alphabet_subset.add("[Branch{}]".format(i)) 174 | alphabet_subset.add("[=Branch{}]".format(i)) 175 | alphabet_subset.add("[#Branch{}]".format(i)) 176 | 177 | alphabet_subset.update(INDEX_ALPHABET) 178 | 179 | return alphabet_subset 180 | 181 | 182 | @functools.lru_cache() 183 | def get_bonding_capacity(element: str, charge: int) -> int: 184 | """Returns the bonding capacity of a given atom, under the current 185 | semantic constraints. 186 | 187 | :param element: the element of the input atom. 188 | :param charge: the charge of the input atom. 189 | :return: the bonding capacity of the input atom. 190 | """ 191 | 192 | key = element 193 | if charge != 0: 194 | key += "{:+}".format(charge) 195 | 196 | if key in _current_constraints: 197 | return _current_constraints[key] 198 | else: 199 | return _current_constraints["?"] 200 | -------------------------------------------------------------------------------- /selfies/utils/encoding_utils.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Tuple, Union 2 | 3 | from selfies.utils.selfies_utils import len_selfies, split_selfies 4 | 5 | 6 | def selfies_to_encoding( 7 | selfies: str, 8 | vocab_stoi: Dict[str, int], 9 | pad_to_len: int = -1, 10 | enc_type: str = 'both' 11 | ) -> Union[List[int], List[List[int]], Tuple[List[int], List[List[int]]]]: 12 | """Converts a SELFIES string into its label (integer) 13 | and/or one-hot encoding. 14 | 15 | A label encoded output will be a list of shape ``(L,)`` and a 16 | one-hot encoded output will be a 2D list of shape ``(L, len(vocab_stoi))``, 17 | where ``L`` is the symbol length of the SELFIES string. Optionally, 18 | the SELFIES string can be padded before it is encoded. 19 | 20 | :param selfies: the SELFIES string to be encoded. 21 | :param vocab_stoi: a dictionary that maps SELFIES symbols to indices, 22 | which must be non-negative and contiguous, starting from 0. 23 | If the SELFIES string is to be padded, then the special padding symbol 24 | ``[nop]`` must also be a key in this dictionary. 25 | :param pad_to_len: the length that the SELFIES string string is padded to. 26 | If this value is less than or equal to the symbol length of the 27 | SELFIES string, then no padding is added. Defaults to ``-1``. 28 | :param enc_type: the type of encoding of the output: 29 | ``label`` or ``one_hot`` or ``both``. 30 | If this value is ``both``, then a tuple of the label and one-hot 31 | encodings is returned. Defaults to ``both``. 32 | :return: the label encoded and/or one-hot encoded SELFIES string. 
33 | 34 | :Example: 35 | 36 | >>> import selfies as sf 37 | >>> sf.selfies_to_encoding("[C][F]", {"[C]": 0, "[F]": 1}) 38 | ([0, 1], [[1, 0], [0, 1]]) 39 | """ 40 | 41 | # some error checking 42 | if enc_type not in ("label", "one_hot", "both"): 43 | raise ValueError("enc_type must be in ('label', 'one_hot', 'both')") 44 | 45 | # pad with [nop] 46 | if pad_to_len > len_selfies(selfies): 47 | selfies += "[nop]" * (pad_to_len - len_selfies(selfies)) 48 | 49 | # integer encode 50 | integer_encoded = [] 51 | for char in split_selfies(selfies): 52 | if (char == ".") and ("." not in vocab_stoi): 53 | raise KeyError( 54 | "The SELFIES string contains two unconnected molecules " 55 | "(given by the '.' character), but vocab_stoi does not " 56 | "contain the '.' key. Please add it to the vocabulary " 57 | "or separate the molecules." 58 | ) 59 | 60 | integer_encoded.append(vocab_stoi[char]) 61 | 62 | if enc_type == "label": 63 | return integer_encoded 64 | 65 | # one-hot encode 66 | one_hot_encoded = list() 67 | for index in integer_encoded: 68 | letter = [0] * len(vocab_stoi) 69 | letter[index] = 1 70 | one_hot_encoded.append(letter) 71 | 72 | if enc_type == "one_hot": 73 | return one_hot_encoded 74 | return integer_encoded, one_hot_encoded 75 | 76 | 77 | def encoding_to_selfies( 78 | encoding: Union[List[int], List[List[int]]], 79 | vocab_itos: Dict[int, str], 80 | enc_type: str, 81 | ) -> str: 82 | """Converts a label (integer) or one-hot encoding into a SELFIES string. 83 | 84 | If the input is label encoded, then a list of shape ``(L,)`` is 85 | expected; and if the input is one-hot encoded, then a 2D list of 86 | shape ``(L, len(vocab_itos))`` is expected. 87 | 88 | :param encoding: a label or one-hot encoding. 89 | :param vocab_itos: a dictionary that maps indices to SELFIES symbols. 90 | The indices of this dictionary must be non-negative and contiguous, 91 | starting from 0. 92 | :param enc_type: the type of encoding of the input: 93 | ``label`` or ``one_hot``. 94 | :return: the SELFIES string represented by the input encoding. 95 | 96 | :Example: 97 | 98 | >>> import selfies as sf 99 | >>> one_hot = [[0, 1, 0], [0, 0, 1], [1, 0, 0]] 100 | >>> vocab_itos = {0: "[nop]", 1: "[C]", 2: "[F]"} 101 | >>> sf.encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot") 102 | '[C][F][nop]' 103 | """ 104 | 105 | if enc_type not in ("label", "one_hot"): 106 | raise ValueError("enc_type must be in ('label', 'one_hot')") 107 | 108 | if enc_type == "one_hot": # Get integer encoding 109 | integer_encoded = [] 110 | for row in encoding: 111 | integer_encoded.append(row.index(1)) 112 | else: 113 | integer_encoded = encoding 114 | 115 | # Integer encoding -> SELFIES 116 | char_list = [vocab_itos[i] for i in integer_encoded] 117 | selfies = "".join(char_list) 118 | 119 | return selfies 120 | 121 | 122 | def batch_selfies_to_flat_hot( 123 | selfies_batch: List[str], 124 | vocab_stoi: Dict[str, int], 125 | pad_to_len: int = -1, 126 | ) -> List[List[int]]: 127 | """Converts a list of SELFIES strings into its list of flattened 128 | one-hot encodings. 129 | 130 | Each SELFIES string in the input list is one-hot encoded 131 | (and then flattened) using :func:`selfies.selfies_to_encoding`, with 132 | ``vocab_stoi`` and ``pad_to_len`` being passed in as arguments. 133 | 134 | :param selfies_batch: the list of SELFIES strings to be encoded. 135 | :param vocab_stoi: a dictionary that maps SELFIES symbols to indices. 136 | :param pad_to_len: the length that each SELFIES string in the input list 137 | is padded to. 
Defaults to ``-1``. 138 | :return: the flattened one-hot encodings of the input list. 139 | 140 | :Example: 141 | 142 | >>> import selfies as sf 143 | >>> batch = ["[C]", "[C][C]"] 144 | >>> vocab_stoi = {"[nop]": 0, "[C]": 1} 145 | >>> sf.batch_selfies_to_flat_hot(batch, vocab_stoi, 2) 146 | [[0, 1, 1, 0], [0, 1, 0, 1]] 147 | """ 148 | 149 | hot_list = list() 150 | 151 | for selfies in selfies_batch: 152 | one_hot = selfies_to_encoding(selfies, vocab_stoi, pad_to_len, 153 | enc_type="one_hot") 154 | flattened = [elem for vec in one_hot for elem in vec] 155 | hot_list.append(flattened) 156 | 157 | return hot_list 158 | 159 | 160 | def batch_flat_hot_to_selfies( 161 | one_hot_batch: List[List[int]], 162 | vocab_itos: Dict[int, str], 163 | ) -> List[str]: 164 | """Converts a list of flattened one-hot encodings into a list 165 | of SELFIES strings. 166 | 167 | Each encoding in the input list is unflattened and then decoded using 168 | :func:`selfies.encoding_to_selfies`, with ``vocab_itos`` being passed in 169 | as an argument. 170 | 171 | :param one_hot_batch: a list of flattened one-hot encodings. Each 172 | encoding must be a list of length divisible by ``len(vocab_itos)``. 173 | :param vocab_itos: a dictionary that maps indices to SELFIES symbols. 174 | :return: the list of SELFIES strings represented by the input encodings. 175 | 176 | :Example: 177 | 178 | >>> import selfies as sf 179 | >>> batch = [[0, 1, 1, 0], [0, 1, 0, 1]] 180 | >>> vocab_itos = {0: "[nop]", 1: "[C]"} 181 | >>> sf.batch_flat_hot_to_selfies(batch, vocab_itos) 182 | ['[C][nop]', '[C][C]'] 183 | """ 184 | 185 | selfies_list = [] 186 | 187 | for flat_one_hot in one_hot_batch: 188 | 189 | # Reshape to an L x M array where each column represents an alphabet 190 | # entry and each row is a position in the selfies 191 | one_hot = [] 192 | 193 | M = len(vocab_itos) 194 | if len(flat_one_hot) % M != 0: 195 | raise ValueError("size of vector in one_hot_batch not divisible " 196 | "by the length of the vocabulary.") 197 | L = len(flat_one_hot) // M 198 | 199 | for i in range(L): 200 | one_hot.append(flat_one_hot[M * i: M * (i + 1)]) 201 | 202 | selfies = encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot") 203 | selfies_list.append(selfies) 204 | 205 | return selfies_list 206 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## v2.2.0 - 15.01.2025 4 | - Fixed bug in the kekulization of molecules with radicals (thanks Olabisi-Aishat-Bello for reporting, thanks Robert Pollice for fixing) 5 | - Fixed constraints for validity of molecules with changed C, P or S, to align with validity-definition of RDKit. 6 | 7 | ## v2.1.2 - 15.07.2024 8 | - Fixed recursion bug for very long molecules (thanks haydn-jones) 9 | - Added warning when dot-symbol (".") exists in peculiar cases (thanks vandrw) 10 | 11 | ## v2.1.1 - 14.07.2022 12 | - Fixed index bug in attribution 13 | 14 | ## v2.1.0 - 17.05.2022 15 | 16 | ### Changed: 17 | - Dropped support for Python 3.5-3.6 and will continue to support only current Python versions. 18 | 19 | ### Added: 20 | - optional attribution to map encoder/decoder output string back to input string (Issue #48, #79) 21 | 22 | ## v2.0.0 - 21.10.2021 23 | 24 | ### Changed: 25 | - Improved SMILES parsing (by using adjacencey lists internally), with tighter error handling 26 | (e.g. issues #62 and #60). 
27 | - Faster and improved kekulization algorithm (issue #55 fixed).
28 | - Support for symbols that are constrained to 0 bonds (e.g., `[CH4]`) or >8 bonds
29 |   (users can now specify custom bond constraints with over 8 bonds).
30 | - New `strict=True` flag to `selfies.encoder`, which raises an error if the input
31 |   SMILES violates the current bond constraints. `True` by default, can be `False` for speed-up (if
32 |   SMILES are guaranteed to be correct).
33 | - Added bond constraints for B (max. 3 bonds) to the default and preset constraints.
34 | - Updated the syntax of SELFIES symbols to be cleaner and more readable.
35 |     - Removing `expl` from atomic symbols, e.g., `[C@@Hexpl]` becomes `[C@@H]`
36 |     - Cleaner branch symbols, e.g., `[BranchL_2]` becomes `[=BranchL]`
37 |     - Cleaner ring symbols, e.g., `[Expl=RingL]` becomes `[=RingL]`
38 |     - If you want to use the old symbols, use the `compatible=True` flag to `selfies.decoder`,
39 |       e.g., `sf.decoder('[C][C][Expl=Ring1]',compatible=True)` (not recommended!)
40 | - More logically consistent behaviour of `[Ring]` symbols.
41 | - Standardized SELFIES alphabet, i.e., no two symbols stand for the same atom/ion (issue #58), e.g.,
42 |   `[N+1]` and `[N+]` are equivalent now.
43 | - Indexing symbols are now included in the alphabet returned by `selfies.get_semantic_robust_alphabet`.
44 | 
45 | ### Removed
46 | - Removed `constraints` flag from `selfies.decoder`; please use `selfies.set_semantic_constraints()`
47 |   and pass in `"hypervalent"` or `"octet_rule"` instead.
48 | - Removed `print_error` flag in `selfies.encoder` and `selfies.decoder`,
49 |   which now raise errors `selfies.EncoderError` and `selfies.DecoderError`, respectively.
50 | 
51 | ### Bug Fixes
52 | - Potential chirality inversion of atoms making ring bonds (e.g. ``[C@@H]12CCC2CCNC1``):
53 |   fixed by inverting their chirality in ``selfies.encoder`` such that they are decoded with
54 |   the original chirality preserved.
55 | - Failure to represent mismatching stereochemical specifications at ring bonds
56 |   (e.g. ``F/C=C/C/C=C\C``): fixed by adding new ring symbols (e.g. ``[-/RingL]``, ``[\/RingL]``, etc.).
57 | 
58 | ---
59 | 
60 | ## v1.0.4 - 23.04.2021
61 | ### Added:
62 | * decoder option for relaxed hypervalence rules, `decoder(...,constraints='hypervalent')`
63 | * decoder option for strict octet rules, `decoder(...,constraints='octet_rule')`
64 | ### Bug Fix:
65 | * Fixed constraint for Phosphorus
66 | 
67 | ---
68 | 
69 | ## v1.0.3 - 13.01.2021
70 | ### Added:
71 | * Support for aromatic Si and Al (not officially supported by Daylight SMILES, but RDKit supports it and examples exist in PubChem).
72 | 
73 | ---
74 | 
75 | ## v1.0.2 - 14.10.2020
76 | ### Added:
77 | * Support for aromatic Te and triple bonds.
78 | * Built-in SELFIES to one-hot encoding, and one-hot encoding to SELFIES
79 | 
80 | ### Changed:
81 | * Added default semantic constraints for charged atoms (single positive/negative charge of `[C]`, `[N]`, `[O]`, `[S]`, `[P]`)
82 | * Raised the bond capacity of `P` to 7 bonds (from 5 bonds).
83 | 
84 | ### Bug Fixes:
85 | * Fixed bug: `selfies.decoder` did not terminate for malformed SELFIES
86 |   that are missing the closed bracket `']'`.
87 | 
88 | ---
89 | 
90 | ## v1.0.1 - 25.08.2020
91 | ### Changed:
92 | * Code is now compatible with python >= 3.5.
93 | * More descriptive error messages.
94 | 
95 | ### Bug Fixes:
96 | * Minor bug fixes in the encoder for SMILES ending in branches (e.g. `C(Cl)(F)`),
97 |   and SMILES with ring numbers between branches (e.g.
`C(Cl)1(Br)CCCC1`) 98 | * Minor bug fix with ring ordering in decoder (e.g. `C1CC2CCC12` vs `C1CC2CCC21`). 99 | 100 | --- 101 | 102 | ## v1.0.0 - 17.08.2020: 103 | ### Added: 104 | * Added semantic handling of aromaticity / delocalization (by kekulizing SMILES with aromatic symbols before 105 | they are translated into SELFIES by `selfies.encoder`). 106 | * Added semantic handling of charged species (e.g. `[CH+]1CCC1`). 107 | * Added semantic handling of radical species (`[CH]1CCC1`) or any species with explicit hydrogens (e.g. `CC[CH2]`). 108 | * Added semantic handling of isotopes (e.g. `[14CH2]=C` or `[235U]`). 109 | * Improved semantic handling of explicit atom symbols in square brackets, e.g. Carbene (`[C]=C`). 110 | * Improved semantic handling of chirality (e.g. `O=C[Co@@](F)(Cl)(Br)(I)S`). 111 | * Improved semantic handling of double-bond configuration (e.g. `F/C=C/C=C/C`). 112 | * Added new functions to the library, such as `selfies.len_selfies` and 113 | `selfies.split_selfies`. 114 | * Added advanced-user functions to the library to customize the SELFIES semantic constraints, e.g. 115 | `selfies.set_semantic_constraints`. Allows to encode for instance diborane, `[BH2]1[H][BH2][H]1`. 116 | * Introduced new padding `[nop]` (no operation) symbol. 117 | 118 | ### Changed: 119 | * Optimized the indexing alphabet (it is base-16 now). 120 | * Optimized the behaviours of rings and branches to fix an issue with specific non-standard molecules that could not be translated. 121 | * Changed behaviour of Ring/Branch, such that states `X9991-X9993` are not necessary anymore. 122 | * Significantly improved encoding and decoding algorithms, it is much faster now. 123 | 124 | --- 125 | 126 | ## v0.2.4 - 01.10.2019: 127 | ### Added: 128 | * Function ``get_alphabet()`` which returns a list of 29 selfies symbols 129 | whose arbitrary combination produce >99.99% valid molecules. 130 | 131 | ### Bug Fixes: 132 | * Fixed bug which happens when three rings start at one node, and two of 133 | them form a double ring. 134 | * Enabled rings with sizes of up to 8000 SELFIES symbols. 135 | * Bug fix for tiny ring to RDKit syntax conversion, spanning multiple 136 | branches. 137 | 138 | We thank Kevin Ryan (LeanAndMean@github), Theophile Gaudin and Andrew Brereton 139 | for suggestions and bug reports. 140 | 141 | --- 142 | 143 | ## v0.2.2 - 19.09.2019: 144 | 145 | ### Added: 146 | * Enabled ``[C@],[C@H],[C@@],[C@@H],[H]`` to use in a semantic 147 | constrained way. 148 | 149 | We thank Andrew Brereton for suggestions and bug reports. 150 | 151 | --- 152 | 153 | ## v0.2.1 - 02.09.2019: 154 | 155 | ### Added: 156 | * Decoder: added optional argument to restrict nitrogen to 3 bonds. 157 | ``decoder(...,N_restrict=False)`` to allow for more bonds; 158 | standard: ``N_restrict=True``. 159 | * Decoder: added optional argument make ring-function bi-local 160 | (i.e. confirms bond number at target). 161 | ``decoder(...,bilocal_ring_function=False)`` to not allow bi-local ring 162 | function; standard: ``bilocal_ring_function=True``. The bi-local ring 163 | function will allow validity of >99.99% of random molecules. 164 | * Decoder: made double-bond ring RDKit syntax conform. 165 | * Decoder: added state X5 and X6 for having five and six bonds free. 166 | 167 | ### Bug Fixes: 168 | * Decoder + Encoder: allowing for explicit brackets for organic atoms, for 169 | instance ``[I]``. 170 | * Encoder: explicit single/double bond for non-canonical SMILES input 171 | issue fixed. 
172 | * Decoder: bug fix for ``[Branch*]`` in state X1.
173 | 
174 | We thank Benjamin Sanchez-Lengeling, Theophile Gaudin and Zhenpeng Yao
175 | for suggestions and bug reports.
176 | 
177 | ---
178 | 
179 | ## v0.1.1 - 04.06.2019:
180 | * initial release
181 | 
--------------------------------------------------------------------------------
/selfies/decoder.py:
--------------------------------------------------------------------------------
1 | import warnings
2 | from typing import List, Union, Tuple
3 | 
4 | from selfies.compatibility import modernize_symbol
5 | from selfies.exceptions import DecoderError
6 | from selfies.grammar_rules import (
7 |     get_index_from_selfies,
8 |     next_atom_state,
9 |     next_branch_state,
10 |     next_ring_state,
11 |     process_atom_symbol,
12 |     process_branch_symbol,
13 |     process_ring_symbol
14 | )
15 | from selfies.mol_graph import MolecularGraph, Attribution
16 | from selfies.utils.selfies_utils import split_selfies
17 | from selfies.utils.smiles_utils import mol_to_smiles
18 | 
19 | 
20 | def decoder(
21 |         selfies: str,
22 |         compatible: bool = False,
23 |         attribute: bool = False) ->\
24 |         Union[str, Tuple[str, List[Tuple[str, List[Tuple[int, str]]]]]]:
25 |     """Translates a SELFIES string into its corresponding SMILES string.
26 | 
27 |     This translation is deterministic but depends on the current semantic
28 |     constraints. The output SMILES string is guaranteed to be syntactically
29 |     correct and guaranteed to represent a molecule that obeys the
30 |     semantic constraints.
31 | 
32 |     :param selfies: the SELFIES string to be translated.
33 |     :param compatible: if ``True``, this function will accept SELFIES strings
34 |         containing deprecated symbols from previous releases. However, this
35 |         function may behave differently than in previous major releases,
36 |         and should not be treated as backward compatible.
37 |         Defaults to ``False``.
38 |     :param attribute: if ``True``, an attribution map connecting selfies
39 |         tokens to smiles tokens is output.
40 |     :return: a SMILES string derived from the input SELFIES string.
41 |     :raises DecoderError: if the input SELFIES string is malformed.
42 | 
43 |     :Example:
44 | 
45 |     >>> import selfies as sf
46 |     >>> sf.decoder('[C][=C][F]')
47 |     'C=CF'
48 |     """
49 | 
50 |     if compatible:
51 |         msg = "\nselfies.decoder() may behave differently than in previous " \
52 |               "major releases. We recommend using SELFIES that are up to date."
53 | warnings.warn(msg, stacklevel=2) 54 | 55 | mol = MolecularGraph(attributable=attribute) 56 | 57 | rings = [] 58 | attribution_index = 0 59 | for s in selfies.split("."): 60 | n = _derive_mol_from_symbols( 61 | symbol_iter=enumerate(_tokenize_selfies(s, compatible)), 62 | mol=mol, 63 | selfies=selfies, 64 | max_derive=float("inf"), 65 | init_state=0, 66 | root_atom=None, 67 | rings=rings, 68 | attribute_stack=[] if attribute else None, 69 | attribution_index=attribution_index 70 | ) 71 | attribution_index += n 72 | _form_rings_bilocally(mol, rings) 73 | return mol_to_smiles(mol, attribute) 74 | 75 | 76 | def _tokenize_selfies(selfies, compatible): 77 | if isinstance(selfies, str): 78 | symbol_iter = split_selfies(selfies) 79 | elif isinstance(selfies, list): 80 | symbol_iter = selfies 81 | else: 82 | raise ValueError() # should not happen 83 | 84 | try: 85 | for symbol in symbol_iter: 86 | if symbol == "[nop]": 87 | continue 88 | if compatible: 89 | symbol = modernize_symbol(symbol) 90 | yield symbol 91 | except ValueError as err: 92 | raise DecoderError(str(err)) from None 93 | 94 | 95 | def _derive_mol_from_symbols( 96 | symbol_iter, mol, selfies, max_derive, 97 | init_state, root_atom, rings, attribute_stack, attribution_index 98 | ): 99 | n_derived = 0 100 | state = init_state 101 | prev_atom = root_atom 102 | 103 | while (state is not None) and (n_derived < max_derive): 104 | 105 | try: # retrieve next symbol 106 | index, symbol = next(symbol_iter) 107 | n_derived += 1 108 | except StopIteration: 109 | break 110 | 111 | # Case 1: Branch symbol (e.g. [Branch1]) 112 | if "ch" == symbol[-4:-2]: 113 | 114 | output = process_branch_symbol(symbol) 115 | if output is None: 116 | _raise_decoder_error(selfies, symbol) 117 | btype, n = output 118 | 119 | if state <= 1: 120 | next_state = state 121 | else: 122 | binit_state, next_state = next_branch_state(btype, state) 123 | 124 | Q = _read_index_from_selfies(symbol_iter, n_symbols=n) 125 | n_derived += n + _derive_mol_from_symbols( 126 | symbol_iter, mol, selfies, (Q + 1), 127 | init_state=binit_state, root_atom=prev_atom, rings=rings, 128 | attribute_stack=attribute_stack + 129 | [Attribution(index + attribution_index, symbol) 130 | ] if attribute_stack is not None else None, 131 | attribution_index=attribution_index 132 | ) 133 | 134 | # Case 2: Ring symbol (e.g. [Ring2]) 135 | elif "ng" == symbol[-4:-2]: 136 | 137 | output = process_ring_symbol(symbol) 138 | if output is None: 139 | _raise_decoder_error(selfies, symbol) 140 | ring_type, n, stereo = output 141 | 142 | if state == 0: 143 | next_state = state 144 | else: 145 | ring_order, next_state = next_ring_state(ring_type, state) 146 | bond_info = (ring_order, stereo) 147 | 148 | Q = _read_index_from_selfies(symbol_iter, n_symbols=n) 149 | n_derived += n 150 | lidx = max(0, prev_atom.index - (Q + 1)) 151 | rings.append((mol.get_atom(lidx), prev_atom, bond_info)) 152 | 153 | # Case 3: [epsilon] 154 | elif "eps" in symbol: 155 | next_state = 0 if (state == 0) else None 156 | 157 | # Case 4: regular symbol (e.g. 
[N], [=C], [F]) 158 | else: 159 | 160 | output = process_atom_symbol(symbol) 161 | if output is None: 162 | _raise_decoder_error(selfies, symbol) 163 | (bond_order, stereo), atom = output 164 | cap = atom.bonding_capacity 165 | 166 | bond_order, next_state = next_atom_state(bond_order, cap, state) 167 | if bond_order == 0: 168 | if state == 0: 169 | o = mol.add_atom(atom, True) 170 | mol.add_attribution( 171 | o, attribute_stack + 172 | [Attribution(index + attribution_index, symbol)] 173 | if attribute_stack is not None else None) 174 | else: 175 | o = mol.add_atom(atom) 176 | mol.add_attribution( 177 | o, attribute_stack + 178 | [Attribution(index + attribution_index, symbol)] 179 | if attribute_stack is not None else None) 180 | src, dst = prev_atom.index, atom.index 181 | o = mol.add_bond(src=src, dst=dst, 182 | order=bond_order, stereo=stereo) 183 | mol.add_attribution( 184 | o, attribute_stack + 185 | [Attribution(index + attribution_index, symbol)] 186 | if attribute_stack is not None else None) 187 | prev_atom = atom 188 | 189 | if next_state is None: 190 | break 191 | state = next_state 192 | 193 | while n_derived < max_derive: # consume remaining tokens 194 | try: 195 | next(symbol_iter) 196 | n_derived += 1 197 | except StopIteration: 198 | break 199 | 200 | return n_derived 201 | 202 | 203 | def _raise_decoder_error(selfies, invalid_symbol): 204 | err_msg = "invalid symbol '{}'\n\tSELFIES: {}".format( 205 | invalid_symbol, selfies 206 | ) 207 | raise DecoderError(err_msg) 208 | 209 | 210 | def _read_index_from_selfies(symbol_iter, n_symbols): 211 | index_symbols = [] 212 | for _ in range(n_symbols): 213 | try: 214 | index_symbols.append(next(symbol_iter)[-1]) 215 | except StopIteration: 216 | index_symbols.append(None) 217 | return get_index_from_selfies(*index_symbols) 218 | 219 | 220 | def _form_rings_bilocally(mol, rings): 221 | rings_made = [0] * len(mol) 222 | 223 | for latom, ratom, bond_info in rings: 224 | lidx, ridx = latom.index, ratom.index 225 | 226 | if lidx == ridx: # ring to the same atom forbidden 227 | continue 228 | 229 | order, (lstereo, rstereo) = bond_info 230 | lfree = latom.bonding_capacity - mol.get_bond_count(lidx) 231 | rfree = ratom.bonding_capacity - mol.get_bond_count(ridx) 232 | 233 | if lfree <= 0 or rfree <= 0: 234 | continue # no room for ring bond 235 | order = min(order, lfree, rfree) 236 | 237 | if mol.has_bond(a=lidx, b=ridx): 238 | bond = mol.get_dirbond(src=lidx, dst=ridx) 239 | new_order = min(order + bond.order, 3) 240 | mol.update_bond_order(a=lidx, b=ridx, new_order=new_order) 241 | 242 | else: 243 | mol.add_ring_bond( 244 | a=lidx, a_stereo=lstereo, a_pos=rings_made[lidx], 245 | b=ridx, b_stereo=rstereo, b_pos=rings_made[ridx], 246 | order=order 247 | ) 248 | rings_made[lidx] += 1 249 | rings_made[ridx] += 1 250 | -------------------------------------------------------------------------------- /original_code_from_paper/environment.yml: -------------------------------------------------------------------------------- 1 | name: base 2 | channels: 3 | - defaults 4 | - conda-forge 5 | - pytorch 6 | dependencies: 7 | - _ipyw_jlab_nb_ext_conf=0.1.0=py37_0 8 | - alabaster=0.7.12=py37_0 9 | - anaconda=2018.12=py37_0 10 | - anaconda-client=1.7.2=py37_0 11 | - anaconda-navigator=1.9.6=py37_0 12 | - anaconda-project=0.8.2=py37_0 13 | - asn1crypto=0.24.0=py37_0 14 | - astroid=2.1.0=py37_0 15 | - astropy=3.1=py37he774522_0 16 | - atomicwrites=1.2.1=py37_0 17 | - attrs=18.2.0=py37h28b3542_0 18 | - babel=2.6.0=py37_0 19 | - backcall=0.1.0=py37_0 20 | 
- backports=1.0=py37_1 21 | - backports.os=0.1.1=py37_0 22 | - backports.shutil_get_terminal_size=1.0.0=py37_2 23 | - beautifulsoup4=4.6.3=py37_0 24 | - bitarray=0.8.3=py37hfa6e2cd_0 25 | - bkcharts=0.2=py37_0 26 | - blas=1.0=mkl 27 | - blaze=0.11.3=py37_0 28 | - bleach=3.0.2=py37_0 29 | - blosc=1.14.4=he51fdeb_0 30 | - bokeh=1.0.2=py37_0 31 | - boost=1.68.0=py37hf75dd32_1001 32 | - boost-cpp=1.68.0=h6a4c333_1000 33 | - boto=2.49.0=py37_0 34 | - bottleneck=1.2.1=py37h452e1ab_1 35 | - bzip2=1.0.6=hfa6e2cd_5 36 | - ca-certificates=2018.03.07=0 37 | - cairo=1.16.0=hc1b38c8_1000 38 | - certifi=2018.11.29=py37_0 39 | - cffi=1.11.5=py37h74b6da3_1 40 | - chardet=3.0.4=py37_1 41 | - click=7.0=py37_0 42 | - cloudpickle=0.6.1=py37_0 43 | - clyent=1.2.2=py37_1 44 | - colorama=0.4.1=py37_0 45 | - comtypes=1.1.7=py37_0 46 | - conda=4.6.4=py37_0 47 | - conda-build=3.17.6=py37_0 48 | - conda-env=2.6.0=1 49 | - conda-verify=3.1.1=py37_0 50 | - console_shortcut=0.1.1=3 51 | - contextlib2=0.5.5=py37_0 52 | - cryptography=2.4.2=py37h7a1dbc1_0 53 | - curl=7.63.0=h2a8f88b_1000 54 | - cycler=0.10.0=py37_0 55 | - cython=0.29.2=py37ha925a31_0 56 | - cytoolz=0.9.0.1=py37hfa6e2cd_1 57 | - dask=1.0.0=py37_0 58 | - dask-core=1.0.0=py37_0 59 | - datashape=0.5.4=py37_1 60 | - decorator=4.3.0=py37_0 61 | - defusedxml=0.5.0=py37_1 62 | - distributed=1.25.1=py37_0 63 | - docutils=0.14=py37_0 64 | - entrypoints=0.2.3=py37_2 65 | - et_xmlfile=1.0.1=py37_0 66 | - fastcache=1.0.2=py37hfa6e2cd_2 67 | - filelock=3.0.10=py37_0 68 | - flask=1.0.2=py37_1 69 | - flask-cors=3.0.7=py37_0 70 | - freetype=2.9.1=ha9979f8_1 71 | - future=0.17.1=py37_0 72 | - get_terminal_size=1.0.0=h38e98db_0 73 | - gevent=1.3.7=py37he774522_1 74 | - glob2=0.6=py37_1 75 | - greenlet=0.4.15=py37hfa6e2cd_0 76 | - h5py=2.8.0=py37h3bdd7fb_2 77 | - hdf5=1.10.2=hac2f561_1 78 | - heapdict=1.0.0=py37_2 79 | - html5lib=1.0.1=py37_0 80 | - icc_rt=2019.0.0=h0cc432a_1 81 | - icu=58.2=ha66f8fd_1 82 | - idna=2.8=py37_0 83 | - imageio=2.4.1=py37_0 84 | - imagesize=1.1.0=py37_0 85 | - importlib_metadata=0.6=py37_0 86 | - intel-openmp=2019.1=144 87 | - ipykernel=5.1.0=py37h39e3cac_0 88 | - ipython=7.2.0=py37h39e3cac_0 89 | - ipython_genutils=0.2.0=py37_0 90 | - ipywidgets=7.4.2=py37_0 91 | - isort=4.3.4=py37_0 92 | - itsdangerous=1.1.0=py37_0 93 | - jdcal=1.4=py37_0 94 | - jedi=0.13.2=py37_0 95 | - jinja2=2.10=py37_0 96 | - jpeg=9b=hb83a4c4_2 97 | - jsonschema=2.6.0=py37_0 98 | - jupyter=1.0.0=py37_7 99 | - jupyter_client=5.2.4=py37_0 100 | - jupyter_console=6.0.0=py37_0 101 | - jupyter_core=4.4.0=py37_0 102 | - jupyterlab=0.35.3=py37_0 103 | - jupyterlab_server=0.2.0=py37_0 104 | - keyring=17.0.0=py37_0 105 | - kiwisolver=1.0.1=py37h6538335_0 106 | - krb5=1.16.1=hc04afaa_7 107 | - lazy-object-proxy=1.3.1=py37hfa6e2cd_2 108 | - libarchive=3.3.3=h0643e63_5 109 | - libcurl=7.63.0=h2a8f88b_1000 110 | - libiconv=1.15=h1df5818_7 111 | - libpng=1.6.35=h2a8f88b_0 112 | - libprotobuf=3.6.1=h1a1b453_1000 113 | - libsodium=1.0.16=h9d3ae62_0 114 | - libssh2=1.8.0=h7a1dbc1_4 115 | - libtiff=4.0.9=h36446d0_2 116 | - libxml2=2.9.8=hadb2253_1 117 | - libxslt=1.1.32=hf6f1972_0 118 | - llvmlite=0.26.0=py37ha925a31_0 119 | - locket=0.2.0=py37_1 120 | - lxml=4.2.5=py37hef2cd61_0 121 | - lz4-c=1.8.1.2=h2fa13f4_0 122 | - lzo=2.10=h6df0209_2 123 | - m2w64-gcc-libgfortran=5.3.0=6 124 | - m2w64-gcc-libs=5.3.0=7 125 | - m2w64-gcc-libs-core=5.3.0=7 126 | - m2w64-gmp=6.1.0=2 127 | - m2w64-libwinpthread-git=5.0.0.4634.697f757=2 128 | - markupsafe=1.1.0=py37he774522_0 129 | - mccabe=0.6.1=py37_1 130 
| - menuinst=1.4.14=py37hfa6e2cd_0 131 | - mistune=0.8.4=py37he774522_0 132 | - mkl=2019.1=144 133 | - mkl-service=1.1.2=py37hb782905_5 134 | - mkl_fft=1.0.6=py37h6288b17_0 135 | - mkl_random=1.0.2=py37h343c172_0 136 | - more-itertools=4.3.0=py37_0 137 | - mpmath=1.1.0=py37_0 138 | - msgpack-python=0.5.6=py37he980bc4_1 139 | - msys2-conda-epoch=20160418=1 140 | - multipledispatch=0.6.0=py37_0 141 | - navigator-updater=0.2.1=py37_0 142 | - nbconvert=5.4.0=py37_1 143 | - nbformat=4.4.0=py37_0 144 | - networkx=2.2=py37_1 145 | - ninja=1.8.2=py37he980bc4_1 146 | - nltk=3.4=py37_1 147 | - nose=1.3.7=py37_2 148 | - notebook=5.7.4=py37_0 149 | - numba=0.41.0=py37hf9181ef_0 150 | - numexpr=2.6.8=py37hdce8814_0 151 | - numpydoc=0.8.0=py37_0 152 | - odo=0.5.1=py37_0 153 | - olefile=0.46=py37_0 154 | - openpyxl=2.5.12=py37_0 155 | - openssl=1.1.1a=he774522_0 156 | - packaging=18.0=py37_0 157 | - pandas=0.23.4=py37h830ac7b_0 158 | - pandoc=1.19.2.1=hb2460c7_1 159 | - pandocfilters=1.4.2=py37_1 160 | - parso=0.3.1=py37_0 161 | - partd=0.3.9=py37_0 162 | - path.py=11.5.0=py37_0 163 | - pathlib2=2.3.3=py37_0 164 | - patsy=0.5.1=py37_0 165 | - pep8=1.7.1=py37_0 166 | - pickleshare=0.7.5=py37_0 167 | - pillow=5.3.0=py37hdc69c19_0 168 | - pip=18.1=py37_0 169 | - pixman=0.34.0=hcef7cb0_3 170 | - pkginfo=1.4.2=py37_1 171 | - pluggy=0.8.0=py37_0 172 | - ply=3.11=py37_0 173 | - prometheus_client=0.5.0=py37_0 174 | - prompt_toolkit=2.0.7=py37_0 175 | - protobuf=3.6.1=py37he025d50_1001 176 | - psutil=5.4.8=py37he774522_0 177 | - py=1.7.0=py37_0 178 | - pycairo=1.18.0=py37h63da52a_0 179 | - pycodestyle=2.4.0=py37_0 180 | - pycosat=0.6.3=py37hfa6e2cd_0 181 | - pycparser=2.19=py37_0 182 | - pycrypto=2.6.1=py37hfa6e2cd_9 183 | - pycurl=7.43.0.2=py37h7a1dbc1_0 184 | - pyflakes=2.0.0=py37_0 185 | - pygments=2.3.1=py37_0 186 | - pylint=2.2.2=py37_0 187 | - pyodbc=4.0.25=py37ha925a31_0 188 | - pyopenssl=18.0.0=py37_0 189 | - pyparsing=2.3.0=py37_0 190 | - pyqt=5.9.2=py37h6538335_2 191 | - pysocks=1.6.8=py37_0 192 | - pytables=3.4.4=py37he6f6034_0 193 | - pytest=4.0.2=py37_0 194 | - pytest-arraydiff=0.3=py37h39e3cac_0 195 | - pytest-astropy=0.5.0=py37_0 196 | - pytest-doctestplus=0.2.0=py37_0 197 | - pytest-openfiles=0.3.1=py37_0 198 | - pytest-remotedata=0.3.1=py37_0 199 | - python=3.7.1=h8c8aaf0_6 200 | - python-dateutil=2.7.5=py37_0 201 | - python-libarchive-c=2.8=py37_6 202 | - pytorch-cpu=1.0.1=py3.7_cpu_1 203 | - pytz=2018.7=py37_0 204 | - pywavelets=1.0.1=py37h8c2d366_0 205 | - pywin32=223=py37hfa6e2cd_1 206 | - pywinpty=0.5.5=py37_1000 207 | - pyyaml=3.13=py37hfa6e2cd_0 208 | - pyzmq=17.1.2=py37hfa6e2cd_0 209 | - qt=5.9.7=vc14h73c81de_0 210 | - qtawesome=0.5.3=py37_0 211 | - qtconsole=4.4.3=py37_0 212 | - qtpy=1.5.2=py37_0 213 | - rdkit=2018.09.1=py37h3020836_1001 214 | - requests=2.21.0=py37_0 215 | - rope=0.11.0=py37_0 216 | - ruamel_yaml=0.15.46=py37hfa6e2cd_0 217 | - scikit-image=0.14.1=py37ha925a31_0 218 | - scikit-learn=0.20.1=py37h343c172_0 219 | - scipy=1.1.0=py37h29ff71c_2 220 | - seaborn=0.9.0=py37_0 221 | - send2trash=1.5.0=py37_0 222 | - setuptools=40.6.3=py37_0 223 | - simplegeneric=0.8.1=py37_2 224 | - singledispatch=3.4.0.3=py37_0 225 | - sip=4.19.8=py37h6538335_0 226 | - six=1.12.0=py37_0 227 | - snappy=1.1.7=h777316e_3 228 | - snowballstemmer=1.2.1=py37_0 229 | - sortedcollections=1.0.1=py37_0 230 | - sortedcontainers=2.1.0=py37_0 231 | - sphinx=1.8.2=py37_0 232 | - sphinxcontrib=1.0=py37_1 233 | - sphinxcontrib-websupport=1.1.0=py37_1 234 | - spyder=3.3.2=py37_0 235 | - 
spyder-kernels=0.3.0=py37_0 236 | - sqlalchemy=1.2.15=py37he774522_0 237 | - sqlite=3.26.0=he774522_0 238 | - statsmodels=0.9.0=py37h452e1ab_0 239 | - sympy=1.3=py37_0 240 | - tblib=1.3.2=py37_0 241 | - tensorboardx=1.6=py_0 242 | - terminado=0.8.1=py37_1 243 | - testpath=0.4.2=py37_0 244 | - tk=8.6.8=hfa6e2cd_0 245 | - toolz=0.9.0=py37_0 246 | - torchvision-cpu=0.2.1=py_2 247 | - tornado=5.1.1=py37hfa6e2cd_0 248 | - tqdm=4.28.1=py37h28b3542_0 249 | - traitlets=4.3.2=py37_0 250 | - unicodecsv=0.14.1=py37_0 251 | - urllib3=1.24.1=py37_0 252 | - vc=14.1=h0510ff6_4 253 | - vs2015_runtime=14.15.26706=h3a45250_0 254 | - wcwidth=0.1.7=py37_0 255 | - webencodings=0.5.1=py37_1 256 | - werkzeug=0.14.1=py37_0 257 | - wheel=0.32.3=py37_0 258 | - widgetsnbextension=3.4.2=py37_0 259 | - win_inet_pton=1.0.1=py37_1 260 | - win_unicode_console=0.5=py37_0 261 | - wincertstore=0.2=py37_0 262 | - winpty=0.4.3=4 263 | - wrapt=1.10.11=py37hfa6e2cd_2 264 | - xlrd=1.2.0=py37_0 265 | - xlsxwriter=1.1.2=py37_0 266 | - xlwings=0.15.1=py37_0 267 | - xlwt=1.3.0=py37_0 268 | - xz=5.2.4=h2fa13f4_4 269 | - yaml=0.1.7=hc54c509_2 270 | - zeromq=4.2.5=he025d50_1 271 | - zict=0.1.3=py37_0 272 | - zlib=1.2.11=h62dcd97_3 273 | - zstd=1.3.7=h508b16e_0 274 | - pip: 275 | - absl-py==0.7.0 276 | - astor==0.7.1 277 | - deepsmiles==1.0.1 278 | - gast==0.2.2 279 | - grpcio==1.18.0 280 | - keras-applications==1.0.7 281 | - keras-preprocessing==1.0.9 282 | - markdown==3.0.1 283 | - matplotlib==3.0.3 284 | - mock==2.0.0 285 | - numpy==1.16.1 286 | - pbr==5.1.2 287 | - selfies==0.1.1 288 | - tensorboard==1.12.2 289 | - tensorflow==1.13.0rc2 290 | - tensorflow-estimator==1.13.0rc0 291 | - termcolor==1.1.0 292 | prefix: C:\ProgramData\Anaconda3 293 | 294 | -------------------------------------------------------------------------------- /selfies/encoder.py: -------------------------------------------------------------------------------- 1 | from selfies.exceptions import EncoderError, SMILESParserError 2 | from selfies.grammar_rules import get_selfies_from_index 3 | from selfies.utils.smiles_utils import ( 4 | atom_to_smiles, 5 | bond_to_smiles, 6 | smiles_to_mol 7 | ) 8 | 9 | from selfies.mol_graph import AttributionMap 10 | 11 | 12 | def encoder(smiles: str, strict: bool = True, attribute: bool = False) -> str: 13 | """Translates a SMILES string into its corresponding SELFIES string. 14 | 15 | This translation is deterministic and does not depend on the 16 | current semantic constraints. Additionally, it preserves the atom order 17 | of the input SMILES string; thus, one could generate randomized SELFIES 18 | strings by generating randomized SMILES strings, and then translating them. 19 | 20 | By nature of SELFIES, it is impossible to represent molecules that 21 | violate the current semantic constraints as SELFIES strings. 22 | Thus, we provide the ``strict`` flag to guard against such cases. If 23 | ``strict=True``, then this function will raise a 24 | :class:`selfies.EncoderError` if the input SMILES string represents 25 | a molecule that violates the semantic constraints. If 26 | ``strict=False``, then this function will not raise any error; however, 27 | calling :func:`selfies.decoder` on a SELFIES string generated this 28 | way will *not* be guaranteed to recover a SMILES string representing 29 | the original molecule. 30 | 31 | :param smiles: the SMILES string to be translated. It is recommended to 32 | use RDKit to check that the strings passed into this function 33 | are valid SMILES strings. 
34 | :param strict: if ``True``, this function will check that the 35 | input SMILES string obeys the semantic constraints. 36 | Defaults to ``True``. 37 | :param attribute: if an attribution should be returned 38 | :return: a SELFIES string translated from the input SMILES string if 39 | attribute is ``False``, otherwise a tuple is returned of 40 | SELFIES string and attribution list. 41 | :raises EncoderError: if the input SMILES string is invalid, 42 | cannot be kekulized, or violates the semantic constraints with 43 | ``strict=True``. 44 | 45 | :Example: 46 | 47 | >>> import selfies as sf 48 | >>> sf.encoder("C=CF") 49 | '[C][=C][F]' 50 | 51 | .. note:: This function does not currently support SMILES with: 52 | 53 | * The wildcard symbol ``*``. 54 | * The quadruple bond symbol ``$``. 55 | * Chirality specifications other than ``@`` and ``@@``. 56 | * Ring bonds across a dot symbol (e.g. ``c1cc([O-].[Na+])ccc1``) or 57 | ring bonds between atoms that are over 4000 atoms apart. 58 | 59 | Although SELFIES does not have aromatic symbols, this function 60 | *does* support aromatic SMILES strings by internally kekulizing them 61 | before translation. 62 | """ 63 | 64 | try: 65 | mol = smiles_to_mol(smiles, attributable=attribute) 66 | except SMILESParserError as err: 67 | err_msg = "failed to parse input\n\tSMILES: {}".format(smiles) 68 | raise EncoderError(err_msg) from err 69 | 70 | if not mol.kekulize(): 71 | err_msg = "kekulization failed\n\tSMILES: {}".format(smiles) 72 | raise EncoderError(err_msg) 73 | 74 | if strict: 75 | _check_bond_constraints(mol, smiles) 76 | 77 | # invert chirality of atoms where necessary, 78 | # such that they are restored when the SELFIES is decoded 79 | for atom in mol.get_atoms(): 80 | if ((atom.chirality is not None) 81 | and mol.has_out_ring_bond(atom.index) 82 | and _should_invert_chirality(mol, atom)): 83 | atom.invert_chirality() 84 | 85 | fragments = [] 86 | attribution_maps = [] 87 | attribution_index = 0 88 | for root in mol.get_roots(): 89 | derived = list(_fragment_to_selfies( 90 | mol, None, root, attribution_maps, attribution_index)) 91 | attribution_index += len(derived) 92 | fragments.append("".join(derived)) 93 | # trim attribution map of empty tokens 94 | attribution_maps = [a for a in attribution_maps if a.token] 95 | result = ".".join(fragments), attribution_maps 96 | return result if attribute else result[0] 97 | 98 | 99 | def _check_bond_constraints(mol, smiles): 100 | errors = [] 101 | 102 | for atom in mol.get_atoms(): 103 | bond_cap = atom.bonding_capacity 104 | bond_count = mol.get_bond_count(atom.index) 105 | if bond_count > bond_cap: 106 | errors.append((atom_to_smiles(atom), bond_count, bond_cap)) 107 | 108 | if errors: 109 | err_msg = "input violates the currently-set semantic constraints\n" \ 110 | "\tSMILES: {}\n" \ 111 | "\tErrors:\n".format(smiles) 112 | for e in errors: 113 | err_msg += "\t[{:} with {} bond(s) - " \ 114 | "a max. of {} bond(s) was specified]\n".format(*e) 115 | raise EncoderError(err_msg) 116 | 117 | 118 | def _should_invert_chirality(mol, atom): 119 | out_bonds = mol.get_out_dirbonds(atom.index) 120 | 121 | # 1. rings whose right number are bonded to this atom (e.g. ...1...X1) 122 | # 2. rings whose left number are bonded to this atom (e.g. X1...1...) 123 | # 3. branches and other (e.g. X(...)...) 
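    # Illustrative note on the parity check at the end of this function
    # (hypothetical ordering, not taken from the original source): if the
    # reordered bonds correspond to perm = [2, 0, 1], there are two
    # inversions (2 > 0 and 2 > 1), an even count, so chirality is kept;
    # perm = [1, 0, 2] has one inversion, an odd count, so it is inverted.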
124 | partition = [[], [], []] 125 | for i, bond in enumerate(out_bonds): 126 | if not bond.ring_bond: 127 | partition[2].append(i) 128 | elif bond.src < bond.dst: 129 | partition[1].append(i) 130 | else: 131 | partition[0].append(i) 132 | partition[1].sort(key=lambda x: out_bonds[x].dst) 133 | 134 | # construct permutation 135 | perm = partition[0] + partition[1] + partition[2] 136 | count = 0 137 | for i in range(len(perm)): 138 | for j in range(i + 1, len(perm)): 139 | if perm[i] > perm[j]: 140 | count += 1 141 | return count % 2 != 0 # if odd permutation, should invert chirality 142 | 143 | 144 | def _fragment_to_selfies(mol, bond_into_root, root, 145 | attribution_maps, attribution_index=0): 146 | derived = [] 147 | 148 | bond_into_curr, curr = bond_into_root, root 149 | while True: 150 | curr_atom = mol.get_atom(curr) 151 | token = _atom_to_selfies(bond_into_curr, curr_atom) 152 | derived.append(token) 153 | 154 | attribution_maps.append(AttributionMap( 155 | len(derived) - 1 + attribution_index, 156 | token, mol.get_attribution(curr_atom))) 157 | 158 | out_bonds = mol.get_out_dirbonds(curr) 159 | for i, bond in enumerate(out_bonds): 160 | 161 | if bond.ring_bond: 162 | if bond.src < bond.dst: 163 | continue 164 | 165 | rev_bond = mol.get_dirbond(src=bond.dst, dst=bond.src) 166 | ring_len = bond.src - bond.dst 167 | Q_as_symbols = get_selfies_from_index(ring_len - 1) 168 | ring_symbol = "[{}Ring{}]".format( 169 | _ring_bonds_to_selfies(rev_bond, bond), 170 | len(Q_as_symbols) 171 | ) 172 | 173 | derived.append(ring_symbol) 174 | attribution_maps.append(AttributionMap( 175 | len(derived) - 1 + attribution_index, 176 | ring_symbol, mol.get_attribution(bond))) 177 | for symbol in Q_as_symbols: 178 | derived.append(symbol) 179 | attribution_maps.append(AttributionMap( 180 | len(derived) - 1 + attribution_index, 181 | symbol, mol.get_attribution(bond))) 182 | 183 | elif i == len(out_bonds) - 1: 184 | bond_into_curr, curr = bond, bond.dst 185 | 186 | else: 187 | # start, end are so we can go back and 188 | # correct offset from branch symbol in 189 | # branch tokens 190 | start = len(attribution_maps) 191 | branch = _fragment_to_selfies( 192 | mol, bond, bond.dst, attribution_maps, len(derived)) 193 | Q_as_symbols = get_selfies_from_index(len(branch) - 1) 194 | branch_symbol = "[{}Branch{}]".format( 195 | _bond_to_selfies(bond, show_stereo=False), 196 | len(Q_as_symbols) 197 | ) 198 | end = len(attribution_maps) 199 | 200 | derived.append(branch_symbol) 201 | for symbol in Q_as_symbols: 202 | derived.append(symbol) 203 | attribution_maps.append(AttributionMap( 204 | len(derived) - 1 + attribution_index, 205 | symbol, mol.get_attribution(bond))) 206 | 207 | # account for branch symbol because it is inserted after 208 | for j in range(start, end): 209 | attribution_maps[j].index += len(Q_as_symbols) + 1 210 | attribution_maps.append(AttributionMap( 211 | len(derived) - 1 + attribution_index, 212 | branch_symbol, mol.get_attribution(bond))) 213 | 214 | derived.extend(branch) 215 | 216 | # end of chain 217 | if (not out_bonds) or out_bonds[-1].ring_bond: 218 | break 219 | return derived 220 | 221 | 222 | def _bond_to_selfies(bond, show_stereo=True): 223 | if not show_stereo and (bond.order == 1): 224 | return "" 225 | return bond_to_smiles(bond) 226 | 227 | 228 | def _ring_bonds_to_selfies(lbond, rbond): 229 | assert lbond.order == rbond.order 230 | 231 | if (lbond.order != 1) or all(b.stereo is None for b in (lbond, rbond)): 232 | return _bond_to_selfies(lbond, show_stereo=False) 233 | else: 
234 | bond_char = "-" if (lbond.stereo is None) else lbond.stereo 235 | bond_char += "-" if (rbond.stereo is None) else rbond.stereo 236 | return bond_char 237 | 238 | 239 | def _atom_to_selfies(bond, atom): 240 | assert not atom.is_aromatic 241 | bond_char = "" if (bond is None) else _bond_to_selfies(bond) 242 | return "[{}{}]".format(bond_char, atom_to_smiles(atom, brackets=False)) 243 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2019 Mario Krenn 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /selfies/mol_graph.py: -------------------------------------------------------------------------------- 1 | import functools 2 | import itertools 3 | from typing import List, Optional, Union 4 | from dataclasses import dataclass, field 5 | 6 | from selfies.bond_constraints import get_bonding_capacity 7 | from selfies.constants import AROMATIC_VALENCES, VALENCE_ELECTRONS 8 | from selfies.utils.matching_utils import find_perfect_matching 9 | 10 | 11 | @dataclass 12 | class Attribution: 13 | """A dataclass that contains token string and its index. 14 | """ 15 | #: token index 16 | index: int 17 | #: token string 18 | token: str 19 | 20 | 21 | @dataclass 22 | class AttributionMap: 23 | """A mapping from input to single output token showing which 24 | input tokens created the output token. 25 | """ 26 | #: Index of output token 27 | index: int 28 | #: Output token 29 | token: str 30 | #: List of input tokens that created the output token 31 | attribution: List[Attribution] = field(default_factory=list) 32 | 33 | 34 | class Atom: 35 | """An atom with associated specifications (e.g. charge, chirality). 36 | """ 37 | 38 | def __init__( 39 | self, 40 | element: str, 41 | is_aromatic: bool, 42 | isotope: Optional[int] = None, 43 | chirality: Optional[str] = None, 44 | h_count: Optional[int] = None, 45 | charge: int = 0 46 | ): 47 | self.index = None 48 | self.element = element 49 | self.is_aromatic = is_aromatic 50 | self.isotope = isotope 51 | self.chirality = chirality 52 | self.h_count = h_count 53 | self.charge = charge 54 | 55 | @property 56 | @functools.lru_cache() 57 | def bonding_capacity(self): 58 | bond_cap = get_bonding_capacity(self.element, self.charge) 59 | bond_cap -= 0 if (self.h_count is None) else self.h_count 60 | return bond_cap 61 | 62 | def invert_chirality(self) -> None: 63 | if self.chirality == "@": 64 | self.chirality = "@@" 65 | elif self.chirality == "@@": 66 | self.chirality = "@" 67 | 68 | 69 | class DirectedBond: 70 | """A bond that contains directional information. 71 | """ 72 | 73 | def __init__( 74 | self, 75 | src: int, 76 | dst: int, 77 | order: Union[int, float], 78 | stereo: Optional[str], 79 | ring_bond: bool 80 | ): 81 | self.src = src 82 | self.dst = dst 83 | self.order = order 84 | self.stereo = stereo 85 | self.ring_bond = ring_bond 86 | 87 | 88 | class MolecularGraph: 89 | """A molecular graph. 90 | 91 | Molecules can be viewed as weighted undirected graphs. However, SMILES 92 | and SELFIES strings are more naturally represented as weighted directed 93 | graphs, where the direction of the edges specifies the order of atoms 94 | and bonds in the string. 
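
    A minimal usage sketch (illustrative only; it assumes the ``Atom`` class
    and the ``add_*`` methods defined in this module, and builds a two-atom
    C-O graph):

    >>> g = MolecularGraph()
    >>> c = g.add_atom(Atom("C", is_aromatic=False), mark_root=True)
    >>> o = g.add_atom(Atom("O", is_aromatic=False))
    >>> _ = g.add_bond(src=c.index, dst=o.index, order=1, stereo=None)
    >>> len(g), g.get_bond_count(c.index)
    (2, 1)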
95 | """ 96 | 97 | def __init__(self, attributable=False): 98 | self._roots = list() # stores root atoms, where traversal begins 99 | self._atoms = list() # stores atoms in this graph 100 | self._bond_dict = dict() # stores all bonds in this graph 101 | self._adj_list = list() # adjacency list, representing this graph 102 | self._bond_counts = list() # stores number of bonds an atom has made 103 | self._ring_bond_flags = list() # stores if an atom makes a ring bond 104 | self._delocal_subgraph = dict() # delocalization subgraph 105 | self._attribution = dict() # attribution of each atom/bond 106 | self._attributable = attributable 107 | 108 | def __len__(self): 109 | return len(self._atoms) 110 | 111 | def has_bond(self, a: int, b: int) -> bool: 112 | if a > b: 113 | a, b = b, a 114 | return (a, b) in self._bond_dict 115 | 116 | def has_out_ring_bond(self, src: int) -> bool: 117 | return self._ring_bond_flags[src] 118 | 119 | def get_attribution( 120 | self, 121 | o: Union[DirectedBond, Atom] 122 | ) -> List[Attribution]: 123 | if self._attributable and o in self._attribution: 124 | return self._attribution[o] 125 | return None 126 | 127 | def get_roots(self) -> List[int]: 128 | return self._roots 129 | 130 | def get_atom(self, idx: int) -> Atom: 131 | return self._atoms[idx] 132 | 133 | def get_atoms(self) -> List[Atom]: 134 | return self._atoms 135 | 136 | def get_dirbond(self, src, dst) -> DirectedBond: 137 | return self._bond_dict[(src, dst)] 138 | 139 | def get_out_dirbonds(self, src: int) -> List[DirectedBond]: 140 | return self._adj_list[src] 141 | 142 | def get_bond_count(self, idx: int) -> int: 143 | return self._bond_counts[idx] 144 | 145 | def add_atom(self, atom: Atom, mark_root: bool = False) -> Atom: 146 | atom.index = len(self) 147 | 148 | if mark_root: 149 | self._roots.append(atom.index) 150 | self._atoms.append(atom) 151 | self._adj_list.append(list()) 152 | self._bond_counts.append(0) 153 | self._ring_bond_flags.append(False) 154 | if atom.is_aromatic: 155 | self._delocal_subgraph[atom.index] = list() 156 | return atom 157 | 158 | def add_attribution( 159 | self, 160 | o: Union[DirectedBond, Atom], 161 | attr: List[Attribution] 162 | ) -> None: 163 | if self._attributable: 164 | if o in self._attribution: 165 | self._attribution[o].extend(attr) 166 | else: 167 | self._attribution[o] = attr 168 | 169 | def add_bond( 170 | self, src: int, dst: int, 171 | order: Union[int, float], stereo: str 172 | ) -> DirectedBond: 173 | assert src < dst 174 | 175 | bond = DirectedBond(src, dst, order, stereo, False) 176 | self._add_bond_at_loc(bond, -1) 177 | self._bond_counts[src] += order 178 | self._bond_counts[dst] += order 179 | 180 | if order == 1.5: 181 | self._delocal_subgraph.setdefault(src, []).append(dst) 182 | self._delocal_subgraph.setdefault(dst, []).append(src) 183 | return bond 184 | 185 | def add_placeholder_bond(self, src: int) -> int: 186 | out_edges = self._adj_list[src] 187 | out_edges.append(None) 188 | return len(out_edges) - 1 189 | 190 | def add_ring_bond( 191 | self, a: int, b: int, 192 | order: Union[int, float], 193 | a_stereo: Optional[str], b_stereo: Optional[str], 194 | a_pos: int = -1, b_pos: int = -1 195 | ) -> None: 196 | a_bond = DirectedBond(a, b, order, a_stereo, True) 197 | b_bond = DirectedBond(b, a, order, b_stereo, True) 198 | self._add_bond_at_loc(a_bond, a_pos) 199 | self._add_bond_at_loc(b_bond, b_pos) 200 | self._bond_counts[a] += order 201 | self._bond_counts[b] += order 202 | self._ring_bond_flags[a] = True 203 | self._ring_bond_flags[b] = 
True 204 | 205 | if order == 1.5: 206 | self._delocal_subgraph.setdefault(a, []).append(b) 207 | self._delocal_subgraph.setdefault(b, []).append(a) 208 | 209 | def update_bond_order( 210 | self, a: int, b: int, 211 | new_order: Union[int, float] 212 | ) -> None: 213 | assert 1 <= new_order <= 3 214 | 215 | if a > b: 216 | a, b = b, a # swap so that a < b 217 | a_to_b = self._bond_dict[(a, b)] # prev step guarantees existence 218 | if new_order == a_to_b.order: 219 | return 220 | elif a_to_b.ring_bond: 221 | b_to_a = self._bond_dict[(b, a)] 222 | bonds = (a_to_b, b_to_a) 223 | else: 224 | bonds = (a_to_b,) 225 | 226 | old_order = bonds[0].order 227 | for bond in bonds: 228 | bond.order = new_order 229 | self._bond_counts[a] += (new_order - old_order) 230 | self._bond_counts[b] += (new_order - old_order) 231 | 232 | def _add_bond_at_loc(self, bond, pos): 233 | self._bond_dict[(bond.src, bond.dst)] = bond 234 | 235 | out_edges = self._adj_list[bond.src] 236 | if (pos == -1) or (pos == len(out_edges)): 237 | out_edges.append(bond) 238 | elif out_edges[pos] is None: 239 | out_edges[pos] = bond 240 | else: 241 | out_edges.insert(pos, bond) 242 | 243 | def is_kekulized(self) -> bool: 244 | return not self._delocal_subgraph 245 | 246 | def kekulize(self) -> bool: 247 | # Algorithm based on Depth-First article by Richard L. Apodaca 248 | # Reference: 249 | # https://depth-first.com/articles/2020/02/10/ 250 | # a-comprehensive-treatment-of-aromaticity-in-the-smiles-language/ 251 | 252 | if self.is_kekulized(): 253 | return True 254 | 255 | ds = self._delocal_subgraph 256 | kept_nodes = set(itertools.filterfalse(self._prune_from_ds, ds)) 257 | 258 | # relabel kept DS nodes to be 0, 1, 2, ... 259 | label_to_node = list(sorted(kept_nodes)) 260 | node_to_label = {v: i for i, v in enumerate(label_to_node)} 261 | 262 | # pruned and relabelled DS 263 | pruned_ds = [list() for _ in range(len(kept_nodes))] 264 | for node in kept_nodes: 265 | label = node_to_label[node] 266 | for adj in filter(lambda v: v in kept_nodes, ds[node]): 267 | pruned_ds[label].append(node_to_label[adj]) 268 | 269 | matching = find_perfect_matching(pruned_ds) 270 | if matching is None: 271 | return False 272 | 273 | # de-aromatize and then make double bonds 274 | for node in ds: 275 | for adj in ds[node]: 276 | self.update_bond_order(node, adj, new_order=1) 277 | self._atoms[node].is_aromatic = False 278 | self._bond_counts[node] = int(self._bond_counts[node]) 279 | 280 | for matched_labels in enumerate(matching): 281 | matched_nodes = tuple(label_to_node[i] for i in matched_labels) 282 | self.update_bond_order(*matched_nodes, new_order=2) 283 | 284 | self._delocal_subgraph = dict() # clear DS 285 | return True 286 | 287 | def _prune_from_ds(self, node): 288 | adj_nodes = self._delocal_subgraph[node] 289 | if not adj_nodes: 290 | return True # aromatic atom with no aromatic bonds 291 | 292 | atom = self._atoms[node] 293 | valences = AROMATIC_VALENCES[atom.element] 294 | 295 | # each bond in DS has order 1.5 - we treat them as single bonds 296 | used_electrons = int(self._bond_counts[node] - 0.5 * len(adj_nodes)) 297 | 298 | if atom.h_count is None: # account for implicit Hs 299 | assert atom.charge == 0 300 | return any(used_electrons == v for v in valences) 301 | else: 302 | valence = valences[-1] - atom.charge 303 | used_electrons += atom.h_count 304 | 305 | # count the total number of bound electrons of each atom 306 | bound_electrons = (max(0, atom.charge) + atom.h_count 307 | + int(self._bond_counts[node]) 308 | + int(2 * 
(self._bond_counts[node] % 1))) 309 | 310 | # calculate the number of unpaired electrons of each atom 311 | radical_electrons = (max(0, VALENCE_ELECTRONS[atom.element] 312 | - bound_electrons) % 2) 313 | 314 | # unpaired electrons do not contribute to the aromatic system 315 | free_electrons = valence - used_electrons - radical_electrons 316 | 317 | if any(used_electrons == v - atom.charge for v in valences): 318 | return True 319 | else: 320 | return not ((free_electrons >= 0) and (free_electrons % 2 != 0)) -------------------------------------------------------------------------------- /original_code_from_paper/gan/GAN_smiles/GAN.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Akshat 3 | """ 4 | import torch 5 | import random 6 | import torch.utils.data 7 | from torch import nn, optim 8 | from torch.autograd.variable import Variable 9 | import os 10 | import torch.nn.functional as F 11 | from rdkit import Chem 12 | from rdkit.Chem import Draw 13 | from one_hot_converter import multiple_smile_to_hot, hot_to_smile, check_conversion_bijection 14 | import numpy as np 15 | import matplotlib.pyplot as plt 16 | from tensorboardX import SummaryWriter 17 | writer = SummaryWriter() 18 | 19 | random.seed(4001) 20 | 21 | def load_data(cut_off=None): 22 | ''' 23 | Ensuring correct Bijection: 24 | check_conversion_bijection(smiles_list=A, largest_smile_len=len(max(A, key=len)), alphabet=alphabets) 25 | ''' 26 | with open('smiles_qm9.txt') as f: 27 | content = f.readlines() 28 | content = content[1:] 29 | content = [x.strip() for x in content] 30 | A = [x.split(',')[1] for x in content] 31 | if cut_off is not None: 32 | A = A[0:cut_off] 33 | return A 34 | 35 | alphabets = ['C', '1', '(', '#', 'N', '3', '5', 'O', '2', 'F', '=', '4', ')', ' '] 36 | 37 | 38 | 39 | # Read in the QM9 dataset 40 | A = load_data(cut_off=None) 41 | data = multiple_smile_to_hot(A, len(max(A, key=len)), alphabets, 0) 42 | print('Data shape: ', data.shape) 43 | one_hot_len_comb = data.shape[1]*data.shape[2] 44 | data = data.reshape(( data.shape[0], one_hot_len_comb)) 45 | data = torch.tensor(data, dtype=torch.float) 46 | 47 | data_loader = torch.utils.data.DataLoader(data, batch_size=1024, shuffle=True) 48 | num_batches = len(data_loader) 49 | print('DATA Acquired!') 50 | 51 | 52 | def get_canon_smi_ls(smiles_ls): 53 | ''' 54 | return a list of canonical smiles in smiles_ls 55 | ''' 56 | canon_ls = [Chem.MolToSmiles(Chem.MolFromSmiles(smi), isomericSmiles=False, canonical=True) for smi in smiles_ls] 57 | return canon_ls 58 | 59 | def _make_dir(directory): 60 | os.makedirs(directory) 61 | 62 | def save_models(generator, discriminator, epoch, dir_name): 63 | out_dir = './{}/saved_models/{}'.format(dir_name, epoch) 64 | _make_dir(out_dir) 65 | torch.save(generator, '{}/G'.format(out_dir)) 66 | torch.save(discriminator, '{}/D'.format(out_dir)) 67 | 68 | def display_status(epoch, num_epochs, n_batch, num_batches, d_error, g_error, d_pred_real, d_pred_fake): 69 | 70 | # var_class = torch.autograd.variable.Variable 71 | if isinstance(d_error, torch.autograd.Variable): 72 | d_error = d_error.data.cpu().numpy() 73 | if isinstance(g_error, torch.autograd.Variable): 74 | g_error = g_error.data.cpu().numpy() 75 | if isinstance(d_pred_real, torch.autograd.Variable): 76 | d_pred_real = d_pred_real.data 77 | if isinstance(d_pred_fake, torch.autograd.Variable): 78 | d_pred_fake = d_pred_fake.data 79 | 80 | 81 | print('Epoch: [{}/{}], Batch Num: [{}/{}]'.format( 82 | epoch,num_epochs, 
n_batch, num_batches) 83 | ) 84 | print('Discriminator Loss: {:.4f}, Generator Loss: {:.4f}'.format(d_error, g_error)) 85 | print('D(x): {:.4f}, D(G(z)): {:.4f}'.format(d_pred_real.mean(), d_pred_fake.mean())) 86 | writer.add_scalar('D(x)', d_pred_real.mean(), epoch) 87 | writer.add_scalar('D(G(z)', d_pred_fake.mean(), epoch) 88 | 89 | 90 | class DiscriminatorNet(torch.nn.Module): 91 | """ 92 | A three hidden-layer discriminative neural network 93 | """ 94 | def __init__(self, drop_rate, layer_2_size): 95 | super(DiscriminatorNet, self).__init__() 96 | n_features = one_hot_len_comb 97 | n_out = 1 98 | 99 | self.hidden0 = nn.Sequential( 100 | nn.Linear(n_features, layer_2_size), 101 | nn.LeakyReLU(0.2), 102 | nn.Dropout(drop_rate) 103 | ) 104 | 105 | self.out = nn.Sequential( 106 | torch.nn.Linear(layer_2_size, n_out), 107 | nn.Sigmoid() 108 | ) 109 | 110 | def forward(self, x): 111 | x = self.hidden0(x) 112 | x = self.out(x) 113 | return x 114 | 115 | 116 | class GeneratorNet(torch.nn.Module): 117 | """ 118 | A three hidden-layer generative neural network 119 | """ 120 | def __init__(self, prior_lv_size, layer_interm_size): 121 | super(GeneratorNet, self).__init__() 122 | n_out = one_hot_len_comb 123 | 124 | self.hidden0 = nn.Sequential( 125 | nn.Linear(prior_lv_size, layer_interm_size), 126 | nn.LeakyReLU(0.2) 127 | ) 128 | 129 | self.out = nn.Sequential( 130 | nn.Linear(layer_interm_size, n_out), 131 | nn.Sigmoid() # Help predict 0 or 1 132 | ) 133 | 134 | def forward(self, x): 135 | x = self.hidden0(x) 136 | x = self.out(x) 137 | return x 138 | 139 | def noise(size, G_start_layer_size): 140 | ''' 141 | Standard nois, which acts as input to the GAN generator . 142 | ''' 143 | n = Variable(torch.randn(size, G_start_layer_size)) 144 | if torch.cuda.is_available(): return n.cuda() 145 | return n 146 | 147 | 148 | def train_discriminator(optimizer, real_data, fake_data, discriminator, criterion): 149 | optimizer.zero_grad() 150 | 151 | # 1.1 Train on Real Data 152 | prediction_real = discriminator(real_data) 153 | y_real = Variable(torch.ones(prediction_real.shape[0], 1)) 154 | if torch.cuda.is_available(): 155 | D_real_loss = criterion(prediction_real, y_real.cuda()) 156 | else: 157 | D_real_loss = criterion(prediction_real, y_real) 158 | 159 | # 1.2 Train on Fake Data 160 | prediction_fake = discriminator(fake_data) 161 | y_fake = Variable(torch.zeros(prediction_fake.shape[0], 1)) 162 | if torch.cuda.is_available(): 163 | D_fake_loss = criterion(prediction_fake, y_fake.cuda()) 164 | else: 165 | D_fake_loss = criterion(prediction_fake, y_fake) 166 | 167 | D_loss = D_real_loss + D_fake_loss 168 | D_loss.backward() 169 | optimizer.step() 170 | 171 | # Return error 172 | return D_real_loss + D_fake_loss, prediction_real, prediction_fake, discriminator 173 | 174 | 175 | def train_generator(optimizer, fake_data, criterion, discriminator): 176 | optimizer.zero_grad() 177 | prediction = discriminator(fake_data) 178 | y = Variable(torch.ones(prediction.shape[0], 1)) 179 | if torch.cuda.is_available(): 180 | G_loss = criterion(prediction, y.cuda(0)) 181 | else: 182 | G_loss = criterion(prediction, y) 183 | G_loss.backward() 184 | 185 | optimizer.step() 186 | return G_loss.data.item(), discriminator 187 | 188 | 189 | def train_gan(lr_disc, lr_genr, prior_lv_size, layer_interm_size_G, discr_dropout, discr_layer_2_size, num_epochs, dir_name, num_unique): 190 | ''' 191 | All the hyper parameters are to be added as parameters to this function!!! 
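
    Illustrative call (hypothetical values, chosen from the sampling ranges
    in the __main__ block below rather than from any original run):

        train_gan(lr_disc=1e-5, lr_genr=5e-5,
                  prior_lv_size=100, layer_interm_size_G=1000,
                  discr_dropout=0.3, discr_layer_2_size=50,
                  num_epochs=1500, dir_name='0', num_unique=[])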
192 | ''' 193 | criterion = nn.BCELoss() 194 | discriminator = DiscriminatorNet(drop_rate=discr_dropout, layer_2_size=discr_layer_2_size) 195 | generator = GeneratorNet(prior_lv_size=prior_lv_size, layer_interm_size=layer_interm_size_G) 196 | criterion = nn.BCELoss() 197 | if torch.cuda.is_available(): 198 | discriminator.cuda() 199 | generator.cuda() 200 | 201 | 202 | # Optimizers (notice the use of 'discriminator'<-Object class) 203 | d_optimizer = optim.Adam(discriminator.parameters(), lr=lr_disc) 204 | g_optimizer = optim.Adam(generator.parameters(), lr=lr_genr) # 1: 3e-4; 2: 1e-5 205 | 206 | for epoch in range(num_epochs+1): 207 | torch.cuda.empty_cache() 208 | print('Epoch: ', epoch) 209 | # batch_num real_data 210 | for n_batch, real_batch in enumerate(data_loader): 211 | 212 | # 1. Train Discriminator 213 | real_data = Variable(real_batch) 214 | 215 | if torch.cuda.is_available(): 216 | real_data = real_data.cuda() 217 | 218 | # Generate fake data 219 | fake_data = generator(noise(real_data.size(0), prior_lv_size)).detach() 220 | 221 | # Train D 222 | d_error, d_pred_real, d_pred_fake, discriminator = train_discriminator(d_optimizer, real_data, fake_data, discriminator, criterion) 223 | 224 | 225 | # 2. Train Generator 226 | # Generate fake data 227 | fake_data = generator(noise(real_batch.size(0), prior_lv_size)) 228 | 229 | # Train G 230 | g_error, discriminator = train_generator(g_optimizer, fake_data, criterion, discriminator) 231 | 232 | 233 | # Log onto a graph 234 | writer.add_scalar('D_error', d_error, epoch * num_batches + n_batch) 235 | writer.add_scalar('G_error', g_error, epoch * num_batches + n_batch) 236 | 237 | if epoch % 10 == 0 and epoch > 0: 238 | generator = generator.eval() 239 | # Display complete training stats 240 | display_status( 241 | epoch, num_epochs, n_batch, num_batches, 242 | d_error, g_error, d_pred_real, d_pred_fake 243 | ) 244 | 245 | smiles_ls = [] 246 | print('Sampling....') 247 | for _ in range(10000): 248 | # Display a random molecule (for sanity) (make sure in eval mode!) 
249 | T = generator(noise(1, prior_lv_size)).cpu().detach().numpy().flatten() # Just chose one molecule 250 | T = T.reshape((22, 14)) 251 | 252 | smile = hot_to_smile(T, alphabets) 253 | if ' ' in smile: 254 | smile = smile[:smile.find(' ')] 255 | mol = Chem.MolFromSmiles(smile) 256 | if mol is not None: 257 | smiles_ls.append(smile) 258 | print('unique molecules: ', len(set(get_canon_smi_ls(smiles_ls)))) 259 | writer.add_scalar('Num Smiles', len(set(get_canon_smi_ls(smiles_ls))), epoch) 260 | 261 | # Write TensorBoard curv on to a text file 262 | f = open('{}/smiles_curve.txt'.format(dir_name), 'a+') 263 | f.write(str(len(set(get_canon_smi_ls(smiles_ls))))+ '\n') 264 | f.close() 265 | 266 | # Save the TensorBoard models 267 | save_models(generator, discriminator, epoch, dir_name) 268 | num_unique.append(len(set(get_canon_smi_ls(smiles_ls)))) 269 | 270 | 271 | if epoch > 100: 272 | A = np.array(num_unique) 273 | stopping_criteria = A.max() - ( 6 * ( np.sqrt(A.max()))) 274 | if stopping_criteria < 0: 275 | stopping_criteria = 0 276 | print('Stopping Criteria: ', stopping_criteria) 277 | print('Array A: ', A) 278 | if num_unique[-1] <= stopping_criteria or max(A) < 50: 279 | # Write the stopping epoch onto a text file 280 | f = open('{}/stoping_epoch.txt'.format(dir_name), 'a+') 281 | f.write('Early stopping Epoch: ' +str(epoch)+ '\n') 282 | f.close() 283 | return 284 | 285 | 286 | print(set(smiles_ls)) 287 | generator = generator.train() 288 | 289 | 290 | 291 | if __name__ == '__main__': 292 | 293 | for model_iter in range(100): 294 | dir_name = str(model_iter) # Directory for saving all the data 295 | 296 | os.makedirs(dir_name) 297 | num_unique = [] 298 | 299 | ## HYPERPARAMETER SELECTION! 300 | 301 | num_epochs = 1500 302 | 303 | # Let me select the learning rates: 304 | lr_disc = 10 ** random.uniform(-7, -4) 305 | lr_genr = 10 ** random.uniform(-7, -4) 306 | 307 | # Generator 308 | prior_lv_size = random.randint(50, 300) 309 | layer_interm_size_G = random.randint(300, 3000) 310 | 311 | # Discriminator 312 | discr_dropout = random.uniform(0, 0.8) 313 | discr_layer_2_size = random.randint(5, 100) 314 | 315 | # Save all the selected hyperparamters: 316 | f = open('{}/hyperparams.txt'.format(dir_name), 'a+') 317 | f.write('lr Discr: ' + str(lr_disc) + '\n') 318 | f.write('lr Gener: ' + str(lr_genr)+ '\n') 319 | f.write('Gener Sampling layer size: ' + str(prior_lv_size)+ '\n') 320 | f.write('Gener middle layer size: ' + str(layer_interm_size_G)+ '\n') 321 | f.write('Discr dropout: ' + str(discr_dropout)+ '\n') 322 | f.write('Discr middle layer size: ' + str(discr_layer_2_size)+ '\n') 323 | f.close() 324 | 325 | train_gan(lr_disc, lr_genr, prior_lv_size, layer_interm_size_G, discr_dropout, discr_layer_2_size, num_epochs, dir_name, num_unique) 326 | torch.cuda.empty_cache() 327 | 328 | 329 | 330 | 331 | 332 | -------------------------------------------------------------------------------- /tests/test_specific_cases.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | import selfies as sf 4 | 5 | 6 | def decode_eq(selfies, smiles): 7 | s = sf.decoder(selfies) 8 | return s == smiles 9 | 10 | 11 | def roundtrip_eq(smiles_in, smiles_out): 12 | sel = sf.encoder(smiles_in) 13 | smi = sf.decoder(sel) 14 | return smi == smiles_out 15 | 16 | 17 | def test_branch_and_ring_at_state_X0(): 18 | """Tests SELFIES with branches and rings at state X0 (i.e. at the 19 | very beginning of a SELFIES). These symbols should be skipped. 
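    For example, ``[Ring3][C][S][C][O]`` should decode as if the leading
    ring symbol were absent, i.e. to the same SMILES as ``[C][S][C][O]``
    ("CSCO"), as the cases below check.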
20 | """ 21 | 22 | assert decode_eq("[Branch3][C][S][C][O]", "CSCO") 23 | assert decode_eq("[Ring3][C][S][C][O]", "CSCO") 24 | assert decode_eq("[Branch1][Ring1][Ring3][C][S][C][O]", "CSCO") 25 | 26 | 27 | def test_branch_at_state_X1(): 28 | """Test SELFIES with branches at state X1 (i.e. at an atom that 29 | can only make one bond. In this case, the branch symbol should be skipped. 30 | """ 31 | 32 | assert decode_eq("[C][C][O][Branch1][C][I]", "CCOCI") 33 | assert decode_eq("[C][C][C][O][#Branch3][C][I]", "CCCOCI") 34 | 35 | 36 | def test_branch_and_ring_decrement_state(): 37 | """Tests that the branch and ring symbols properly decrement the 38 | derivation state. 39 | """ 40 | 41 | assert decode_eq("[C][C][C][Ring1][Ring1][#C]", "C1CC1=C") 42 | assert decode_eq("[C][=C][C][C][#Ring1][Ring1][#C]", "C=C1CC1") 43 | assert decode_eq("[C][O][C][C][=Ring1][Ring1][#C]", "COCCC") 44 | 45 | assert decode_eq("[C][=C][Branch1][C][=C][#C]", "C=C(C)C") 46 | 47 | 48 | def test_branch_at_end_of_selfies(): 49 | """Test SELFIES that have a branch symbol as its very last symbol. 50 | """ 51 | 52 | assert decode_eq("[C][C][C][C][Branch1]", "CCCC") 53 | assert decode_eq("[C][C][C][C][#Branch3]", "CCCC") 54 | 55 | 56 | def test_ring_at_end_of_selfies(): 57 | """Test SELFIES that have a ring symbol as its very last symbol. 58 | """ 59 | 60 | assert decode_eq("[C][C][C][C][C][Ring1]", "CCCC=C") 61 | assert decode_eq("[C][C][C][C][C][Ring3]", "CCCC=C") 62 | 63 | 64 | def test_branch_with_no_atoms(): 65 | """Test SELFIES that have a branch, but the branch has no atoms in it. 66 | Such branches should not be made in the outputted SMILES. 67 | """ 68 | 69 | s = "[C][Branch1][Ring2][Branch1][Branch1][Branch1][F]" 70 | assert decode_eq(s, "CF") 71 | 72 | s = "[C][Branch1][Ring2][Ring1][Ring1][Branch1][F]" 73 | assert decode_eq(s, "CF") 74 | 75 | s = "[C][=Branch1][Ring2][Branch1][C][Cl][F]" 76 | assert decode_eq(s, "C(Cl)F") 77 | 78 | # special case: #Branch3 takes Q_1, Q_2 = [O] and Q_3 = ''. However, 79 | # there are no more symbols in the branch. 80 | assert decode_eq("[C][C][C][C][#Branch3][O][O]", "CCCC") 81 | 82 | 83 | def test_oversized_branch(): 84 | """Test SELFIES that have a branch, with Q larger than the length 85 | of the SELFIES 86 | """ 87 | 88 | assert decode_eq("[C][Branch2][O][O][C][C][S][F][C]", "CCCSF") 89 | assert decode_eq("[C][#Branch2][O][O][#C][C][S][F]", "C#CCSF") 90 | 91 | 92 | def test_oversized_ring(): 93 | """Test SELFIES that have a ring, with Q so large that the (Q + 1)-th 94 | previously derived atom does not exist. 95 | """ 96 | 97 | assert decode_eq("[C][C][C][C][Ring1][O]", "C1CCC1") 98 | assert decode_eq("[C][C][C][C][Ring2][O][C]", "C1CCC1") 99 | 100 | # special case: Ring2 takes Q_1 = [O] and Q_2 = '', leading to 101 | # Q = 9 * 16 + 0 (i.e. an oversized ring) 102 | assert decode_eq("[C][C][C][C][Ring2][O]", "C1CCC1") 103 | 104 | # special case: ring between 1st atom and 1st atom should not be formed 105 | assert decode_eq("[C][Ring1][O]", "C") 106 | 107 | 108 | def test_branch_at_beginning_of_branch(): 109 | """Test SELFIES that have a branch immediately at the start of a branch. 
110 | """ 111 | 112 | # [C@]((Br)Cl)F 113 | s = "[C@][=Branch1][Branch1][Branch1][C][Br][Cl][F]" 114 | assert decode_eq(s, "[C@](Br)(Cl)F") 115 | 116 | # [C@](((Br)Cl)I)F 117 | s = "[C@][#Branch1][Branch2][=Branch1][Branch1][Branch1][C][Br][Cl][I][F]" 118 | assert decode_eq(s, "[C@](Br)(Cl)(I)F") 119 | 120 | # [C@]((Br)(Cl)I)F 121 | s = "[C@][#Branch1][Branch2][Branch1][C][Br][Branch1][C][Cl][I][F]" 122 | assert decode_eq(s, "[C@](Br)(Cl)(I)F") 123 | 124 | 125 | def test_ring_at_beginning_of_branch(): 126 | """Test SELFIES that have a ring immediately at the start of a branch. 127 | """ 128 | 129 | # CC1CCC(1CCl)F 130 | s = "[C][C][C][C][C][=Branch1][Branch1][Ring1][Ring2][C][Cl][F]" 131 | assert decode_eq(s, "CC1CCC1(CCl)F") 132 | 133 | # CC1CCS(Br)(1CCl)F 134 | s = "[C][C][C][C][S][Branch1][C][Br]" \ 135 | "[=Branch1][Branch1][Ring1][Ring2][C][Cl][F]" 136 | assert decode_eq(s, "CC1CCS1(Br)(CCl)F") 137 | 138 | 139 | def test_branch_and_ring_at_beginning_of_branch(): 140 | """Test SELFIES that have a branch and ring immediately at the start 141 | of a branch. 142 | """ 143 | 144 | # CC1CCCS((Br)1Cl)F 145 | s = "[C][C][C][C][C][S][#Branch1][#Branch1][Branch1][C][Br]" \ 146 | "[Ring1][Branch1][Cl][F]" 147 | assert decode_eq(s, "CC1CCCS1(Br)(Cl)F") 148 | 149 | # CC1CCCS(1(Br)Cl)F 150 | s = "[C][C][C][C][C][S][#Branch1][#Branch1][Ring1][Branch1]" \ 151 | "[Branch1][C][Br][Cl][F]" 152 | assert decode_eq(s, "CC1CCCS1(Br)(Cl)F") 153 | 154 | 155 | def test_ring_immediately_following_branch(): 156 | """Test SELFIES that have a ring immediately following after a branch. 157 | """ 158 | 159 | # CCC1CCCC(OCO)1 160 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][Ring1][Branch1]" 161 | assert decode_eq(s, "CCC1CCCC1OCO") 162 | 163 | # CCC1CCCC(OCO)(F)1 164 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \ 165 | "[Branch1][C][F][Ring1][Branch1]" 166 | assert decode_eq(s, "CCC1CCCC1(OCO)F") 167 | 168 | 169 | def test_ring_after_branch(): 170 | """Tests SELFIES that have a ring following a branch, but not 171 | immediately after a branch. 172 | """ 173 | 174 | # CCCCCCC1(OCO)1 175 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][C][Ring1][Branch1]" 176 | assert decode_eq(s, "CCCCCCC(OCO)=C") 177 | 178 | s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \ 179 | "[Branch1][C][F][C][C][Ring1][=Branch2]" 180 | assert decode_eq(s, "CCCCC1CC(OCO)(F)CC1") 181 | 182 | 183 | def test_ring_on_top_of_existing_bond(): 184 | """Tests SELFIES with rings between two atoms that are already bonded 185 | in the main scaffold. 186 | """ 187 | 188 | # C1C1, C1C=1, C1C#1, ... 189 | assert decode_eq("[C][C][Ring1][C]", "C=C") 190 | assert decode_eq("[C][/C][Ring1][C]", "C=C") 191 | assert decode_eq("[C][C][=Ring1][C]", "C#C") 192 | assert decode_eq("[C][C][#Ring1][C]", "C#C") 193 | 194 | 195 | def test_consecutive_rings(): 196 | """Test SELFIES which have multiple consecutive rings. 
197 | """ 198 | 199 | s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2]" 200 | assert decode_eq(s, "C=1CCC=1") # 1 + 1 201 | 202 | s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2][Ring1][Ring2]" 203 | assert decode_eq(s, "C#1CCC#1") # 1 + 1 + 1 204 | 205 | s = "[C][C][C][C][=Ring1][Ring2][Ring1][Ring2]" 206 | assert decode_eq(s, "C#1CCC#1") # 2 + 1 207 | 208 | s = "[C][C][C][C][Ring1][Ring2][=Ring1][Ring2]" 209 | assert decode_eq(s, "C#1CCC#1") # 1 + 2 210 | 211 | # consecutive rings that exceed bond constraints 212 | s = "[C][C][C][C][#Ring1][Ring2][=Ring1][Ring2]" 213 | assert decode_eq(s, "C#1CCC#1") # 3 + 2 214 | 215 | s = "[C][C][C][C][=Ring1][Ring2][#Ring1][Ring2]" 216 | assert decode_eq(s, "C#1CCC#1") # 2 + 3 217 | 218 | s = "[C][C][C][C][=Ring1][Ring2][=Ring1][Ring2]" 219 | assert decode_eq(s, "C#1CCC#1") # 2 + 2 220 | 221 | # consecutive rings with stereochemical single bond 222 | s = "[C][C][C][C][\\/Ring1][Ring2]" 223 | assert decode_eq(s, "C\\1CCC/1") 224 | 225 | s = "[C][C][C][C][\\/Ring1][Ring2][Ring1][Ring2]" 226 | assert decode_eq(s, "C=1CCC=1") 227 | 228 | 229 | def test_unconstrained_symbols(): 230 | """Tests SELFIES with symbols that are not semantically constrained. 231 | """ 232 | 233 | f_branch = "[Branch1][C][F]" 234 | s = "[Xe-2]" + (f_branch * 8) 235 | assert decode_eq(s, "[Xe-2](F)(F)(F)(F)(F)(F)(F)CF") 236 | 237 | # change default semantic constraints 238 | constraints = sf.get_semantic_constraints() 239 | constraints["?"] = 2 240 | sf.set_semantic_constraints(constraints) 241 | 242 | assert decode_eq(s, "[Xe-2](F)CF") 243 | 244 | sf.set_semantic_constraints() 245 | 246 | 247 | def test_isotope_symbols(): 248 | """Tests that SELFIES symbols with isotope specifications are 249 | constrained properly. 250 | """ 251 | 252 | s = "[13C][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]" 253 | assert decode_eq(s, "[13C](Cl)(F)(Br)CI") 254 | 255 | assert decode_eq("[C][36Cl][C]", "C[36Cl]") 256 | 257 | 258 | def test_chiral_symbols(): 259 | """Tests that SELFIES symbols with chirality specifications are 260 | constrained properly. 261 | """ 262 | 263 | s = "[C@@][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]" 264 | assert decode_eq(s, "[C@@](Cl)(F)(Br)CI") 265 | 266 | s = "[C@H1][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br]" 267 | assert decode_eq(s, "[C@H1](Cl)(F)CBr") 268 | 269 | 270 | def test_explicit_hydrogen_symbols(): 271 | """Tests that SELFIES symbols with explicit hydrogen specifications 272 | are constrained properly. 273 | """ 274 | 275 | assert decode_eq("[CH1][Branch1][C][Cl][#C]", "[CH1](Cl)=C") 276 | assert decode_eq("[CH3][=C]", "[CH3]C") 277 | 278 | assert decode_eq("[CH4][C][C]", "[CH4]") 279 | assert decode_eq("[C][C][C][CH4]", "CCC") 280 | assert decode_eq("[C][Branch1][Ring2][C][=CH4][C][=C]", "C(C)=C") 281 | 282 | with pytest.raises(sf.DecoderError): 283 | sf.decoder("[C][C][CH5]") 284 | with pytest.raises(sf.DecoderError): 285 | sf.decoder("[C][C][C][OH9]") 286 | 287 | 288 | def test_charged_symbols(): 289 | """Tests that SELFIES symbols with charges are constrained properly. 290 | """ 291 | 292 | constraints = sf.get_semantic_constraints() 293 | constraints["Sn+4"] = 1 294 | constraints["O-2"] = 2 295 | sf.set_semantic_constraints(constraints) 296 | 297 | # the following molecules don't make sense, but we use them to test 298 | # selfies. 
Hence, we can't verify them with RDKit 299 | assert decode_eq("[Sn+4][=C]", "[Sn+4]C") 300 | assert decode_eq("[O-2][#C]", "[O-2]=C") 301 | 302 | # mixing many symbol types 303 | assert decode_eq("[17O@@H1-2][#C]", "[17O@@H1-2]C") 304 | 305 | sf.set_semantic_constraints() 306 | 307 | 308 | def test_standardized_alphabet(): 309 | """Tests that equivalent SMILES atom symbols are translated into the 310 | same SELFIES atom symbol. 311 | """ 312 | 313 | assert sf.encoder("[C][O][N][P][F]") == "[CH0][OH0][NH0][PH0][FH0]" 314 | assert sf.encoder("[Fe][Si]") == "[Fe][Si]" 315 | assert sf.encoder("[Fe++][Fe+2]") == "[Fe+2][Fe+2]" 316 | assert sf.encoder("[CH][CH1]") == "[CH1][CH1]" 317 | 318 | 319 | def test_old_symbols(): 320 | """Tests backward compatibility of SELFIES with old ( 0: 244 | generator = generator.eval() 245 | # Display complete training stats 246 | display_status( 247 | epoch, num_epochs, n_batch, num_batches, 248 | d_error, g_error, d_pred_real, d_pred_fake 249 | ) 250 | 251 | smiles_ls = [] 252 | print('Sampling....') 253 | for _ in range(10000): 254 | 255 | T = generator(noise(1, prior_lv_size)).cpu().detach().numpy().flatten() # Just chose one molecule 256 | T = T.reshape((21, 14)) 257 | 258 | smile = hot_to_smile(T, alphabets) 259 | 260 | if ' ' in smile: 261 | smile = smile[:smile.find(' ')] 262 | 263 | # Convert SELFIE to SMILE here 264 | smile = IncludeRingsForSMILES(GrammarPlusToSMILES(smile,'X0')) 265 | print(smile) 266 | mol = Chem.MolFromSmiles(smile) 267 | if mol is not None: 268 | smiles_ls.append(smile) 269 | print('unique molecules: ', len(set(get_canon_smi_ls(smiles_ls)))) 270 | writer.add_scalar('Num Smiles', len(set(get_canon_smi_ls(smiles_ls))), epoch) 271 | 272 | # Write TensorBoard curv on to a text file 273 | f = open('{}/smiles_curve.txt'.format(dir_name), 'a+') 274 | f.write(str(len(set(get_canon_smi_ls(smiles_ls))))+ '\n') 275 | f.close() 276 | 277 | # Save the TensorBoard models 278 | save_models(generator, discriminator, epoch, dir_name) 279 | 280 | num_unique.append(len(set(get_canon_smi_ls(smiles_ls)))) 281 | 282 | 283 | # Early stopping criteria for the model to stop training 284 | if epoch > 100: 285 | A = np.array(num_unique) 286 | stopping_criteria = A.max() - ( 6 * ( np.sqrt(A.max()))) 287 | if stopping_criteria < 0: 288 | stopping_criteria = 0 289 | print('Stopping Criteria: ', stopping_criteria) 290 | print('Array A: ', A) 291 | if num_unique[-1] <= stopping_criteria or max(A) < 50: 292 | # Write the stopping epoch onto a text file 293 | f = open('{}/stoping_epoch.txt'.format(dir_name), 'a+') 294 | f.write('Early stopping Epoch: ' +str(epoch)+ '\n') 295 | f.close() 296 | return 297 | 298 | print(set(smiles_ls)) 299 | generator = generator.train() 300 | 301 | 302 | 303 | if __name__ == '__main__': 304 | 305 | num_model = 1 # Number of GAN models to be run (each will have different - randomly initialized hyperparms) 306 | for model_iter in range(100): 307 | dir_name = str(model_iter) # Directory for saving all the data 308 | 309 | os.makedirs(dir_name) 310 | num_unique = [] 311 | 312 | ## HYPERPARAMETER SELECTION! 
313 | 314 | num_epochs = 1500 315 | 316 | # Let me select the learning rates: 317 | lr_disc = 10 ** random.uniform(-7, -4) 318 | lr_genr = 10 ** random.uniform(-7, -4) 319 | 320 | # Generator 321 | prior_lv_size = random.randint(50, 300) 322 | layer_interm_size_G = random.randint(300, 3000) 323 | 324 | # Discriminator 325 | discr_dropout = random.uniform(0, 0.8) 326 | discr_layer_2_size = random.randint(5, 100) 327 | 328 | 329 | # Save all the selected hyperparamters: 330 | f = open('{}/hyperparams.txt'.format(dir_name), 'a+') 331 | f.write('lr Discr: ' + str(lr_disc) + '\n') 332 | f.write('lr Gener: ' + str(lr_genr)+ '\n') 333 | f.write('Gener Sampling layer size: ' + str(prior_lv_size)+ '\n') 334 | f.write('Gener middle layer size: ' + str(layer_interm_size_G)+ '\n') 335 | f.write('Discr dropout: ' + str(discr_dropout)+ '\n') 336 | f.write('Discr middle layer size: ' + str(discr_layer_2_size)+ '\n') 337 | f.close() 338 | 339 | train_gan(lr_disc, lr_genr, prior_lv_size, layer_interm_size_G, discr_dropout, discr_layer_2_size, num_epochs, dir_name, num_unique) 340 | 341 | 342 | 343 | 344 | 345 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SELFIES 2 | 3 | [![GitHub release](https://img.shields.io/github/release/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/releases/) 4 | ![versions](https://img.shields.io/pypi/pyversions/selfies.svg) 5 | [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) 6 | [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/commit-activity) 7 | [![GitHub issues](https://img.shields.io/github/issues/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/issues/) 8 | [![Documentation Status](https://readthedocs.org/projects/selfiesv2/badge/?version=latest)](http://selfiesv2.readthedocs.io/?badge=latest) 9 | [![GitHub contributors](https://img.shields.io/github/contributors/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/contributors/) 10 | 11 | 12 | **Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation**\ 13 | _Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik_\ 14 | [*Machine Learning: Science and Technology* **1**, 045024 (2020)](https://iopscience.iop.org/article/10.1088/2632-2153/aba947), [extensive blog post January 2021](https://aspuru.substack.com/p/molecular-graph-representations-and).\ 15 | [Talk on youtube about SELFIES](https://www.youtube.com/watch?v=CaIyUmfGXDk).\ 16 | [A community paper with 31 authors on SELFIES and the future of molecular string representations](https://arxiv.org/abs/2204.00056).\ 17 | [Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\ 18 | **[Code-Paper in February 2023](https://pubs.rsc.org/en/content/articlelanding/2023/DD/D3DD00044C)**\ 19 | [SELFIES in Wolfram Mathematica](https://resources.wolframcloud.com/PacletRepository/resources/WolframChemistry/Selfies/) (since Dec 2023)\ 20 | Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ 21 | Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\ 22 | Chemistry Advisor: [Robert 
Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ) 23 | 24 | --- 25 | 26 | A main objective is to use SELFIES as a direct input to machine learning models, 27 | in particular generative models, for the generation of molecular graphs 28 | that are syntactically and semantically valid. 29 | 30 |

31 | SELFIES validity in a VAE latent space 32 |

33 | 34 | ## Installation 35 | Use pip to install ``selfies``. 36 | 37 | ```bash 38 | pip install selfies 39 | ``` 40 | 41 | To check if the correct version of ``selfies`` is installed, use 42 | the following pip command. 43 | 44 | ```bash 45 | pip show selfies 46 | ``` 47 | 48 | To upgrade to the latest release of ``selfies`` if you are using an 49 | older version, use the following pip command. Please see the 50 | [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) 51 | to review the changes between versions of `selfies`, before upgrading: 52 | 53 | ```bash 54 | pip install selfies --upgrade 55 | ``` 56 | 57 | 58 | ## Usage 59 | 60 | ### Overview 61 | 62 | Please refer to the [documentation in our code-paper](https://pubs.rsc.org/en/content/articlelanding/2023/DD/D3DD00044C), 63 | which contains a thorough tutorial for getting started with ``selfies`` 64 | and detailed descriptions of the functions 65 | that ``selfies`` provides. We summarize some key functions below. 66 | 67 | | Function | Description | 68 | | ------------------------------------- | ----------------------------------------------------------------- | 69 | | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | 70 | | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | 71 | | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | 72 | | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | 73 | | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | 74 | | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | 75 | | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | 76 | | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | 77 | 78 | 79 | ### Examples 80 | 81 | #### Translation between SELFIES and SMILES representations: 82 | 83 | ```python 84 | import selfies as sf 85 | 86 | benzene = "c1ccccc1" 87 | 88 | # SMILES -> SELFIES -> SMILES translation 89 | try: 90 | benzene_sf = sf.encoder(benzene) # [C][=C][C][=C][C][=C][Ring1][=Branch1] 91 | benzene_smi = sf.decoder(benzene_sf) # C1=CC=CC=C1 92 | except sf.EncoderError: 93 | pass # sf.encoder error! 94 | except sf.DecoderError: 95 | pass # sf.decoder error! 96 | 97 | len_benzene = sf.len_selfies(benzene_sf) # 8 98 | 99 | symbols_benzene = list(sf.split_selfies(benzene_sf)) 100 | # ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]'] 101 | ``` 102 | 103 | #### Very simple creation of random valid molecules: 104 | A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by [Rajarshi Guha](https://twitter.com/rguha/status/1543601839983284224): 105 | 106 | ```python 107 | import selfies as sf 108 | import random 109 | 110 | alphabet=sf.get_semantic_robust_alphabet() # Gets the alphabet of robust symbols 111 | rnd_selfies=''.join(random.sample(list(alphabet), 9)) 112 | rnd_smiles=sf.decoder(rnd_selfies) 113 | print(rnd_smiles) 114 | ``` 115 | These simple lines gives crazy molecules, but all are valid. Can be used as a start for more advanced filtering techniques or for machine learning models. 
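
To illustrate the filtering idea mentioned above, here is a minimal sketch (not part of the ``selfies`` API): it samples a batch of random SELFIES, decodes them, and keeps only the decoded molecules that RDKit parses and that stay below a size threshold. The ``max_heavy_atoms`` value and the sample sizes are arbitrary choices for illustration.

```python
import random

import selfies as sf
from rdkit import Chem

# Sketch only: sample random SELFIES and keep the small decoded molecules.
# max_heavy_atoms and the loop/sample sizes are arbitrary illustrative values.
alphabet = list(sf.get_semantic_robust_alphabet())
max_heavy_atoms = 20

kept = []
for _ in range(100):
    rnd_selfies = "".join(random.sample(alphabet, 9))
    rnd_smiles = sf.decoder(rnd_selfies)    # decoding always yields a valid SMILES
    mol = Chem.MolFromSmiles(rnd_smiles)
    if mol is not None and mol.GetNumHeavyAtoms() <= max_heavy_atoms:
        kept.append(Chem.MolToSmiles(mol))  # store the canonical SMILES

print(len(kept), "molecules kept after filtering")
```

Any other property calculator (e.g. molecular weight or a synthetic-accessibility score) can be swapped in as the filter in the same way.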
116 | 117 | #### Integer and one-hot encoding SELFIES: 118 | 119 | In this example, we first build an alphabet from a dataset of SELFIES strings, 120 | and then convert a SELFIES string into its padded encoding. Note that we use the 121 | ``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) 122 | symbol to pad our SELFIES, which is a special SELFIES symbol that is always 123 | ignored and skipped over by ``selfies.decoder``, making it a useful 124 | padding character. 125 | 126 | ```python 127 | import selfies as sf 128 | 129 | dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"] 130 | alphabet = sf.get_alphabet_from_selfies(dataset) 131 | alphabet.add("[nop]") # [nop] is a special padding symbol 132 | alphabet = list(sorted(alphabet)) # ['[=O]', '[C]', '[F]', '[O]', '[nop]'] 133 | 134 | pad_to_len = max(sf.len_selfies(s) for s in dataset) # 5 135 | symbol_to_idx = {s: i for i, s in enumerate(alphabet)} 136 | 137 | dimethyl_ether = dataset[0] # [C][O][C] 138 | 139 | label, one_hot = sf.selfies_to_encoding( 140 | selfies=dimethyl_ether, 141 | vocab_stoi=symbol_to_idx, 142 | pad_to_len=pad_to_len, 143 | enc_type="both" 144 | ) 145 | # label = [1, 3, 1, 4, 4] 146 | # one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]] 147 | ``` 148 | 149 | #### Customizing SELFIES: 150 | 151 | In this example, we relax the semantic constraints of ``selfies`` to allow 152 | for hypervalences (caution: hypervalence rules are much less understood 153 | than octet rules. Some molecules containing hypervalences are important, 154 | but generally, it is not known which molecules are stable and reasonable). 155 | 156 | ```python 157 | import selfies as sf 158 | 159 | hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid 160 | standard_derived_smi = sf.decoder(hypervalent_sf) 161 | # OI (the default constraints for I allows for only 1 bond) 162 | 163 | sf.set_semantic_constraints("hypervalent") 164 | relaxed_derived_smi = sf.decoder(hypervalent_sf) 165 | # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) 166 | ``` 167 | 168 | #### Explaining Translation: 169 | 170 | You can get an "attribution" list that traces the connection between input and output tokens. For example let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens. 
171 | 172 | ```python 173 | selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]" 174 | smiles, attr = sf.decoder( 175 | selfies, attribute=True) 176 | print('SELFIES', selfies) 177 | print('SMILES', smiles) 178 | print('Attribution:') 179 | for smiles_token in attr: 180 | print(smiles_token) 181 | 182 | # output 183 | SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] 184 | SMILES C1NC(P)CC1 185 | Attribution: 186 | AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')]) 187 | AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')]) 188 | AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')]) 189 | AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')]) 190 | AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')]) 191 | AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')]) 192 | ``` 193 | 194 | ``attr`` is a list of `AttributionMap`s containing the output token, its index, and input tokens that led to it. For example, the ``P`` appearing in the output SMILES at that location is a result of both the ``[Branch1]`` token at position 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control of tracking the translation (like tracking rings), you can access attributions in the underlying molecular graph with ``get_attribution``. 195 | 196 | ### More Usages and Examples 197 | 198 | * More examples can be found in the ``examples/`` directory, including a 199 | [variational autoencoder that runs on the SELFIES](https://github.com/aspuru-guzik-group/selfies/tree/master/examples/vae_example) language. 200 | * This [ICLR2020 paper](https://arxiv.org/abs/1909.11655) used SELFIES in a 201 | genetic algorithm to achieve state-of-the-art performance for inverse design, 202 | with the [code here](https://github.com/aspuru-guzik-group/GA). 203 | * SELFIES allows for [highly efficient exploration and interpolation of the chemical space](https://chemrxiv.org/articles/preprint/Beyond_Generative_Models_Superfast_Traversal_Optimization_Novelty_Exploration_and_Discovery_STONED_Algorithm_for_Molecules_using_SELFIES/13383266), with a [deterministic algorithms, see code](https://github.com/aspuru-guzik-group/stoned-selfies). 204 | * We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea). 205 | * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). 206 | * Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling. 
207 | * As an improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. 208 | 209 | ## Tests 210 | `selfies` uses `pytest` with `tox` as its testing framework. 211 | All tests can be found in the `tests/` directory. To run the test suite for 212 | SELFIES, install ``tox`` and run: 213 | 214 | ```bash 215 | tox -- --trials=10000 --dataset_samples=10000 216 | ``` 217 | 218 | By default, `selfies` is tested against a random subset 219 | (of size ``dataset_samples=10000``) on various datasets: 220 | 221 | * 130K molecules from [QM9](https://www.nature.com/articles/sdata201422) 222 | * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database) 223 | * 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307) 224 | * 160K+ molecules from various [MoleculeNet](https://moleculenet.org/datasets-1) datasets 225 | 226 | In the first releases, we also tested the 36M+ molecules from the [eMolecules Database](https://downloads.emolecules.com/free/2024-12-01/). 227 | 228 | 229 | ## Version History 230 | See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md). 231 | 232 | ## Credits 233 | 234 | We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin, 235 | HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan), 236 | Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports, 237 | and Robert Pollice for chemistry advice. 238 | 239 | ## License 240 | 241 | [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/) 242 | --------------------------------------------------------------------------------