├── LICENSE ├── README.md ├── assets ├── docs │ ├── inference_guide.md │ └── retraining_guide.md └── toc.png ├── data ├── 115_rxns │ └── 115_rxn_templates.txt ├── 91_rxns │ └── 91_rxn_templates.txt └── smiles_vocab.txt ├── environment.yml ├── pyproject.toml ├── setup.py ├── steps ├── step_10_calc_embedding.py ├── step_11_generate_fpindex_smiles_tfidf.py ├── step_20_generate_reactions.py ├── step_30_0_benchmark_filter_raw_output.py ├── step_30_1_molport_raw_reconstruct.py ├── step_31_enamine_reconstruct.py └── step_32_combined_stats.py └── synllama ├── __init__.py ├── chem ├── __init__.py ├── base.py ├── fpindex.py ├── matrix.py ├── mol.py ├── reaction.py ├── smiles_tfidf.py └── stack.py └── llm ├── parallel_inference.py ├── sft └── synllama_sft.yml └── vars.py /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright ©2025 The Regents of the University of California (Regents). All Rights Reserved. Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 408, Berkeley, CA 94704-1362, otl@berkeley.edu. 2 | 3 | IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 4 | 5 | REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models 🧬 2 | [![License](https://img.shields.io/badge/License-UC%20Regents-blue.svg)](LICENSE) 3 | [![Arxiv](https://img.shields.io/badge/Arxiv-2503.12602-red.svg)](https://arxiv.org/abs/2503.12602) 4 | 5 | ## 📖 Overview 6 | ![SynLlama](assets/toc.png) 7 | SynLlama is a fine-tuned version of Meta's Llama 3 large language models that generates synthesizable analogs of small molecules by creating full synthetic pathways from commonly accessible building blocks and robust organic reaction templates. It offers a valuable tool for drug discovery, with strong performance in bottom-up synthesis, synthesizable analog generation, and hit expansion. 8 | 9 | ## 💡 Usage 10 | 11 | ### Prerequisites 12 | Ensure you have `conda` installed on your system. All additional dependencies will be managed via the `environment.yml` file. 13 | 14 | ### Installation 15 | To get started with SynLlama, follow these steps: 16 | ```bash 17 | git clone https://github.com/THGLab/SynLlama 18 | cd SynLlama 19 | conda env create -f environment.yml 20 | conda activate synllama 21 | pip install -e . 
22 | ``` 23 | 24 | ### Inference 25 | To perform inference with the trained SynLlama models, download them and the relevant files from [here](https://figshare.com/s/39a37d31cea2c190498d) and follow the instructions in the [Inference Guide](assets/docs/inference_guide.md). 26 | 27 | ### Retraining 28 | If you are interested in retraining the model, please refer to the [Retraining Guide](assets/docs/retraining_guide.md) for detailed instructions. 29 | 30 | ## 📄 License 31 | This project is licensed under a UC Regents license that permits educational, research, and not-for-profit use - see the [LICENSE](LICENSE) file for details. 32 | 33 | ## 🙏 Acknowledgments 34 | This project is built on top of the [ChemProjector Repo](https://github.com/luost26/ChemProjector). We thank the authors for building such a user-friendly GitHub repository! 35 | 36 | ## 📝 Citation 37 | If you use this code in your research, please cite: 38 | 39 | ```bibtex 40 | @misc{sun_synllama_2025, 41 | title = {SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models}, 42 | url = {http://arxiv.org/abs/2503.12602}, 43 | doi = {10.48550/arXiv.2503.12602}, 44 | publisher = {arXiv}, 45 | author = {Sun, Kunyang and Bagni, Dorian and Cavanagh, Joseph M. and Wang, Yingze and Sawyer, Jacob M. and Gritsevskiy, Andrew and Head-Gordon, Teresa}, 46 | month = mar, 47 | year = {2025} 48 | } 49 | ``` 50 | -------------------------------------------------------------------------------- /assets/docs/inference_guide.md: -------------------------------------------------------------------------------- 1 | ## 🔍 Inference Guide 2 | 3 | After downloading the trained models and relevant files from the [figshare link](https://figshare.com/s/39a37d31cea2c190498d), you can use the following examples to perform inference. 4 | 5 | In the downloaded folder, you will find the following folders under the `inference` sub-directory: 6 | 7 | - `model`: The trained SynLlama models. 8 | - `reconstruction`: The necessary reaction embeddings for the reconstruction algorithm. 9 | - `smiles`: The `.smi` files containing the SMILES strings used in the paper. 10 | 11 | If you want to perform **synthesis planning tasks**, please follow *all the steps below*. 12 | 13 | If you want to perform just the **synthesizable analog search** or **hit expansion** tasks, please *only* follow the steps of "🦙 LLM inference using the trained SynLlama models" and "📝 Reconstruction algorithm using exclusively Enamine BBs". 14 | 15 | ### 🦙 LLM inference using the trained SynLlama models 16 | 17 | In the `model` folder, you will find the following trained models: 18 | 19 | - `SynLlama-1B-2M-91rxns`: The trained model for SynLlama-1B-2M using RXN Set 1. 20 | - `SynLlama-1B-2M-115rxns`: The trained model for SynLlama-1B-2M using RXN Set 2. 21 | 22 | You can choose one of the models to perform inference. Here we use the `synllama-data/inference/model/SynLlama-1B-2M-91rxns` model and the file `synllama-data/inference/smiles/syn-planning/1k_chembl.smi` as an example. 23 | 24 | ```bash 25 | cd SynLlama 26 | python synllama/llm/parallel_inference.py \ 27 | --model_path synllama-data/inference/model/SynLlama-1B-2M-91rxns \ 28 | --smiles_path synllama-data/inference/smiles/syn-planning/1k_chembl.smi \ 29 | --save_path synllama-data/inference/results/temp/1k_chembl_synllama_91rxns.pkl \ 30 | --sample_mode greedy 31 | ``` 32 | 33 | This will generate a `.pkl` file containing the inference results in the `synllama-data/inference/results/temp` folder using the 'greedy' sampling mode. 
All the details of the `sample_mode` options can be found in the `synllama/llm/parallel_inference.py` file and the Supplementary Information. 34 | 35 | ### 🔄 SynLlama raw reconstruction and Molport search 36 | 37 | In the synthesis planning task, we report that SynLlama models can generate New BBs beyond the training Enamine BBs. To validate the purchasability of the generated BBs, we first sample and filter all the New BBs that are part of valid synthetic pathways, and then use Molport to search whether the New BBs are purchasable. 38 | 39 | To do so, we first need to filter the raw results for Molport search. Assuming that you have already generated the raw results with the `SynLlama-1B-2M-91rxns` model and saved them in the `synllama-data/inference/results/temp/1k_chembl_synllama_91rxns.pkl` file, you can use the following command to filter the raw results for Molport search. 40 | 41 | ```bash 42 | cd SynLlama 43 | python steps/step_30_0_benchmark_filter_raw_output.py \ 44 | --llama_folder synllama-data/inference/results/temp/ \ 45 | --rxn_mapping_path synllama-data/inference/reconstruction/91rxns/rxn_embeddings/reaction_smarts_map.pkl \ 46 | --fp_searcher_path synllama-data/inference/reconstruction/91rxns/processed/fpindex.pkl \ 47 | --raw_output_only 48 | ``` 49 | 50 | This will generate a `synllama_reconstruct` folder in the `synllama-data/inference/results/temp` folder, which contains the filtered results for Molport search. To conduct the Molport search, please follow this [link](https://www.molport.com/shop/swl-step-1) and paste all the SMILES strings in the `*_successful_bbs_not_in_enamine_1.txt` file into the List Search. Further details can be found in the Supplementary Information. 51 | 52 | Once finished, download the `.xls` file under the `Selected Items` column from the Molport List Search History page, and name it `*_molport_ls.xls`. Place this file in the `synllama-data/inference/results/temp/synllama_reconstruct` folder for the next filtering step, which outputs CSV files containing successful synthetic pathways with SynLlama raw outputs. 53 | 54 | ```bash 55 | cd SynLlama 56 | python steps/step_30_1_molport_raw_reconstruct.py \ 57 | --llama_folder synllama-data/inference/results/temp/ 58 | ``` 59 | 60 | ### 📝 Reconstruction algorithm using exclusively Enamine BBs 61 | 62 | In cases where SynLlama's raw outputs fail to generate a full synthetic pathway for the target molecule, or where analog generation is the desired task, we use the reconstruction algorithm with exclusively Enamine BBs to generate synthetic pathways for the target molecule and its close analogs. Using the same example as above, you can run the following command to generate the synthetic pathways for the target molecule and its close analogs. 63 | 64 | ```bash 65 | cd SynLlama 66 | python steps/step_31_enamine_reconstruct.py \ 67 | --llama_folder synllama-data/inference/results/temp/ \ 68 | --embedding_path synllama-data/inference/reconstruction/91rxns/rxn_embeddings \ 69 | --total_num_mols 1000 # change this to the total number of molecules you want to generate 70 | ``` 71 | 72 | Note that the `llama_folder` should be the folder that contains the `.pkl` file generated by the LLM inference step. If you are doing analog generation, you might want to increase `k` and `n_stacks` to 10 and 50, respectively, for a more thorough search (see the sketch below). Please refer to Supplementary Information Table S1 for more details. 
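For analog generation, the call might look like the following sketch. Note that the exact flag names for `k` and `n_stacks` are assumptions here; check `python steps/step_31_enamine_reconstruct.py --help` for the authoritative argument names.

```bash
cd SynLlama
# Hypothetical analog-generation run; --k and --n_stacks are assumed flag
# names for the k and n_stacks parameters mentioned above.
python steps/step_31_enamine_reconstruct.py \
    --llama_folder synllama-data/inference/results/temp/ \
    --embedding_path synllama-data/inference/reconstruction/91rxns/rxn_embeddings \
    --total_num_mols 1000 \
    --k 10 \
    --n_stacks 50
```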
73 | 74 | ### 🧩 Putting Everything Together 75 | 76 | If you are doing synthesis planning, you can use the following command to combine the raw reconstruction and Enamine reconstruction results. 77 | 78 | ```bash 79 | cd SynLlama 80 | python steps/step_32_combined_stats.py \ 81 | --llama_folder synllama-data/inference/results/temp/ \ 82 | --total_num_mols 1000 # change this to the total number of molecules you want to generate 83 | ``` 84 | 85 | This will generate a `combined_final_stats.csv` file in the `synllama-data/inference/results/temp` folder, which contains the combined statistics of the raw reconstruction and Enamine reconstruction results. 86 | 87 | 88 | 89 | -------------------------------------------------------------------------------- /assets/docs/retraining_guide.md: -------------------------------------------------------------------------------- 1 | ## 🔍 Retraining Guide 2 | 3 | This section provides a guide for retraining the SynLlama model. You will first need to generate your fine-tuning data by accessing Enamine BBs and then generating synthetic pathways. After that, you can perform supervised fine-tuning with the Axolotl package. 4 | 5 | ### 📦 Enamine Synthetic Pathway Generation 6 | 7 | **Step 1:** Since Enamine BBs are not publicly available, you will need to access them first. Please refer to the [Enamine BBs](https://enamine.net/building-blocks/building-blocks-catalog) and follow the necessary steps to create an account and download the BBs from the **US Stock**. After downloading the BBs, you can place the file under the `data/` directory and run the following command to prepare the BBs for the pathway generation. We still use the 91 reaction templates from the original SynLlama paper as an example: 8 | 9 | ```bash 10 | cd SynLlama 11 | python steps/step_10_calc_embedding.py \ 12 | --data_folder data/91_rxns \ 13 | --bb_path ENAMINE_FILE_PATH \ 14 | --rxn_template_path data/91_rxns/91_rxn_templates.txt \ 15 | --testing_data_path TESTING_DATA_PATH # replace ENAMINE_FILE_PATH with your downloaded BBs file path; omit --testing_data_path if you don't have a predefined testing .smi file 16 | ``` 17 | 18 | After this step, you will have a `data/91_rxns/processed` folder containing the reaction matrices for pathway generation. 19 | 20 | **Step 2: [Optional]** If you download the most recent Enamine BBs, you will have more than the 230k BBs specified in the paper. Therefore, you should recalculate all the reaction embeddings with your new BBs using the following command: 21 | 22 | ```bash 23 | python steps/step_11_generate_fpindex_smiles_tfidf.py \ 24 | --matrix_file data/91_rxns/processed/all/reaction_matrix.pkl \ 25 | --output_dir data/91_rxns/rxn_embeddings \ 26 | --token_list_path data/smiles_vocab.txt 27 | ``` 28 | 29 | After this step, you will have a `data/91_rxns/rxn_embeddings` folder containing the reaction embeddings for inference. In this case, you don't need to download the figshare data as specified in the [Inference Guide](inference_guide.md). 
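As a quick sanity check after Step 2, you can list the regenerated folder. `reaction_smarts_map.pkl` should be present, since later steps reference it explicitly; the other entries in the comment below are only illustrative guesses based on the script's name:

```bash
ls data/91_rxns/rxn_embeddings
# reaction_smarts_map.pkl          <- referenced by step_20 and the inference steps
# fingerprint-index / SMILES-TFIDF pickles  <- assumed outputs of step_11_generate_fpindex_smiles_tfidf.py
```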
30 | 31 | **Step 3:** Finally, you can generate your fine-tuning data in [alpaca format](https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/inst_tune.html#alpaca) with the following command: 32 | 33 | ```bash 34 | python steps/step_20_generate_reactions.py \ 35 | --matrix_path data/91_rxns/processed/train/reaction_matrix_train.pkl \ 36 | --rxn_mapping_path data/91_rxns/rxn_embeddings/reaction_smarts_map.pkl \ 37 | --num_reactions NUM_REACTIONS \ 38 | --name NAME # replace NUM_REACTIONS and NAME with your desired values; point --matrix_path at the testing/all reaction matrix file if needed 39 | ``` 40 | 41 | This step will generate a `data/NAME.jsonl` file containing the fine-tuning data, which will be used for the next step. 42 | 43 | ### 📦 Supervised Fine-Tuning (SFT) 44 | 45 | Here, we provide instructions to reproduce the fine-tuning results in the paper using a package called [Axolotl](https://github.com/axolotl-ai-cloud/axolotl). Axolotl is a user-friendly tool that simplifies the process of fine-tuning large language models. It provides: 46 | 47 | - Easy configuration through YAML files 48 | - Support for multiple model architectures 49 | - Efficient training with various optimization techniques 50 | - Comprehensive documentation and examples 51 | 52 | #### Installation 53 | 54 | For detailed instructions on fine-tuning the model, please refer to the [Axolotl repository](https://github.com/axolotl-ai-cloud/axolotl). We strongly recommend creating a separate conda environment for fine-tuning to avoid dependency conflicts. Please follow the installation and usage guides in the Axolotl repository to fine-tune the model for your specific needs. Make sure to activate your dedicated fine-tuning environment before proceeding to the following steps. 55 | 56 | #### Supervised Fine-Tuning 57 | 58 | Axolotl uses a configuration file to specify training parameters and data paths; we provide one as `synllama_sft.yml`. After generating your fine-tuning data following the previous steps, you'll need to update the provided [config file](../../synllama/llm/sft/synllama_sft.yml) with: 59 | 60 | - The path to your generated training data 61 | - The path to save the prepared dataset 62 | - The path to save the outputs 63 | - [Optional] The project name and run ID for logging ([wandb](https://wandb.ai/site)) 64 | 65 | Make sure to **review** and **modify** the provided [config file](../../synllama/llm/sft/synllama_sft.yml) according to your specific training requirements before proceeding with the fine-tuning process. 66 | 67 | **Step 1:** To preprocess the data before fine-tuning, run the following command: 68 | 69 | ```bash 70 | source activate axolotl # activate the fine-tuning environment 71 | CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.preprocess synllama_sft.yml 72 | ``` 73 | 74 | **Step 2:** To perform supervised fine-tuning with multiple GPUs, run the following command: 75 | 76 | ```bash 77 | source activate axolotl # activate the fine-tuning environment 78 | accelerate launch -m axolotl.cli.train synllama_sft.yml 79 | ``` 80 | 81 | **Step 3:** To merge the LoRA weights with the base model, run the following command: 82 | 83 | ```bash 84 | source activate axolotl # activate the fine-tuning environment 85 | python -m axolotl.cli.merge_lora synllama_sft.yml --lora_model_dir=CHANGE_TO_YOUR_OUTPUT_PATH 86 | ``` 87 | 88 | Once the merging is done, you can use the merged model for inference following the instructions in the [Inference Guide](inference_guide.md); a sketch of such a call is shown below. 
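As an example, the merged model can be dropped into the same inference command used in that guide. This is a minimal sketch, assuming Axolotl saved the merged weights to a `merged/` sub-directory of your output path (the exact location may vary between Axolotl versions), and using placeholder paths in the same style as the rest of this guide:

```bash
cd SynLlama
# MERGED_MODEL_PATH is a placeholder; point it at the directory written by
# axolotl.cli.merge_lora (often CHANGE_TO_YOUR_OUTPUT_PATH/merged).
python synllama/llm/parallel_inference.py \
    --model_path MERGED_MODEL_PATH \
    --smiles_path YOUR_SMILES_FILE.smi \
    --save_path YOUR_RESULTS_FOLDER/results.pkl \
    --sample_mode greedy
```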
-------------------------------------------------------------------------------- /assets/toc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/THGLab/SynLlama/5592cfc9d2338c6ebd7add7971c26b69e9aa1111/assets/toc.png -------------------------------------------------------------------------------- /data/115_rxns/115_rxn_templates.txt: -------------------------------------------------------------------------------- 1 | [#6:1][N:2]=[C:3]=[S:4].[F,Cl,Br,I][C:5][C;!$(C(=O)[N,O,S,F,Cl,Br,I]):6]=O.[NH2;$([N][#6]);!$([N]C=[O,S,N]):7]>>[#6:1][N:2]=[c:3]1[n:7][c:6][c:5][s:4]1 2 | [#6:1][N:2]=[C:3]=[S:4].[O:5]=[C:6]1[CH:7]=[C:8][C:9](=[O:10])[N:11]1.[NH2;$([N][#6]);!$([N]C=[O,S,N]):12]>>[#6:1][N:2]=[c:3]1[n:12][c:6]([O:5])[c:7]([C:8][C:9](=[O:10])[N:11])[s:4]1 3 | [NH2:1][NH:2][#6:3].[#6:4][CH:5]=O>>[#6:4][CH:5]=[N:1][N:2][#6:3] 4 | [$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):1][CH3,CH2&$([C]([#6])[#6]):2].[#6:4][CH:5]=O>>[*:1][C:2]=[CH:5][#6:4] 5 | [#6:1][N:2]=[C:3]=[S:4].[NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:5]>>[N:5][C:3](=[S:4])[N:2][#6:1] 6 | [#6:1][CH:2]=O.[c:3][C:4](=[O:5])[NH:6][CH2:7][C:8](=[O:9])[OH,O-]>>[#6:1][CH:2]=[C:7]1[C:8](=[O:9])[O:5][C:4]([c:3])=[N:6]1 7 | [N;$([NH](C=[O,S])[#6]),$([NH2](C=[O,S])):1][C:2](=[O,S:3])[N;$([NH](C=[O,S])[#6]),$([NH2](C=[O,S])):4].[c:5][CH:6]=O.[C;$([CH2](C=O)[#6]),$([CH3](C=O)):7][C:8](=O)[CH2:9][C:10](=[O:12])[$([O][C]),$([NH][C]),$([N]([C])[C]):11]>>[c:5][C:6]1[N:1][C:2](=[*:3])[N:4][C:8]([C:7])=[C:9]1[C:10](=[O:12])[*:11] 8 | [NH2;$(NC),$(Nc),$(NNC(=O)C):1].[C:3]=[C:4]-O[CH2][CH3]>>[C:3]=[C:4]-[NH:1] 9 | [NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:1].[#6:4][C:5](=[O:6])[OH,O-]>>[N:1][C:5](=[O:6])[#6:4] 10 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[#6:2][Cl,Br,I]>>[N:1][#6:2] 11 | [$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):1][CH2:2][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):3].[#6:4][CH:5]=O>>[*:1][C:2]([*:3])=[CH:5][#6:4] 12 | [NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);H1,H2:1].[#6:2][S:3](=[O:4])(=[O:5])[F,Cl,Br,I]>>[N:1][S:3](=[O:4])(=[O:5])[#6:2] 13 | [#6:1][NH:2][C:3](=[S:4])[NH:5][NH2:6].[#6:7][C:8](=O)[C:9]([#6:10])[Cl,Br,I]>>[#6:1][NH:2][C:3]1=[N:5]-[N:6]=[C:8]([#6:7])-[C:9]([#6:10])-[S:4]1 14 | [O]=[C;$([C]([C])[#6]),$([CH]([C])):1][C;H2,H3:2].[#6:3][C:4](=O)[c:5][c:6][NH2:7]>>[c:1]1[c:2][c:4]([#6:3])[c:5][c:6][n:7]1 15 | [#6:1][NH:2][NH2:3].O=[C;$(C(C)[#6]),$([CH][C]):4][C;H1,H2;!R:5][C;$(C(C)[#6]),$([CH][C]):6]=O>>[#6:1][n:2]1[n:3][c:6][c:5][c:4]1 16 | [#6:1][NH:2][NH2:3].O=[C;$(C(C)[#6]),$([CH][C]):4][#6;H1,H2;!R:5][#6:6][C;$(C(C)[#6]),$([CH][C]):7]=O>>[#6:1][#7H0+0:2]1[#7H0+0:3]=[#6:7][#6:6][#6:5]=[#6:4]1 17 | [#6,#1:1][NH:2][C:3](=[S:4])[NH2:5].[#6,H:6][N:7]1[C:8](=[O:9])[C:10]=[C:11][C:12]1=[O:13]>>[*:1][N:2]=[C:3]1[S:4][C:11]([C:10][C:8](=[O:9])[N:7][*:6])[C:12](=[O:13])[NH:5]1 18 | [#6:1][NH:2][NH2:3].O=[C:4][c:5]1[c:6][o:11][c:7][c:8][c:9]1=[O:10]>>[#6:1][N:2]1[N:3]=[C:4][C:5]([C:9](=[O:10])[c:8][c:7][OH:11])=[C:6]1 19 | [Cl,Br,I][CH2:1][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):2].O=[CH:3][c:4][c:5][OH:6]>>[*:2][c:1]1[c:3][c:4][c:5][O:6]1 20 | [#6:1][$(CO):2](=O)[O].[NH2:3][c:4][c:5][NH2,NH,SH,OH:6]>>[#6:1][c:2]1[nH0+0:3][c:4][c:5][*:6]1 21 | [#6:1][CH:2](=O).[NH2:3][c:4][c:5][NH2,NH,SH,OH:6]>>[#6:1][c:2]1[nH0+0:3][c:4][c:5][*:6]1 22 | [#6;!$(C=[S,O,N]):1][Cl,Br,I].[#6;!$(C=[S,O,N]):2][Cl,Br,I]>>[#6:1]S[#6:2] 23 | [OH;$(Oc):1].[C:2]1[O:4][C:3]1>>[O:1][C:2][C:3][OH:4] 24 | [OH;$(Oc):1].[C:4]=[C:5][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):6]>>[O:1][C:4][C:5][*:6] 25 | 
[NX3;!$(N[C]=[S,O,N]);H1,H2:1][c:2][c:3][C:4](=[O:5])[NX3;H1,H2:6].O=[CX3;!$(C(=O)[O,S,N,F,Cl,Br,I]):7]>>[N:1]1[c:2][c:3][C:4](=[O:5])[N:6][C:7]1 26 | [#6:1][NH2:2].([S:3]=[C:4]=[N;$([N][#6]):5].[#6;$([c][c]N=C=S),$([C][C]N=C=S),$([C]N=C=S):6][C:7](=[O:8])[O;$(O(C)C)])>>[NH:5][C:4](=[S:3])[N:2]([#6:1])[C:7](=[O:8])[#6:6] 27 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[NX3;$(N[#6]);!$(N[#7]);H1,H0:2][$([C][O][CH2][C](F)(F)F),$([C]n1cncc1)&!R:3](=[O:4])[O,n]>>[N:1][C:3](=[O:4])[N:2] 28 | [$(O[#6]),$([NH][#6]),$([#6]):1][C:2](=[O:3])[CH2:4][C:5](=[O:6])[$(O[#6]),$([NH][#6]),$([#6]):7].[C:8][C:9]([$([O][CH3]),$([O][CH2][CH3])])([$([O][CH3]),$([O][CH2][CH3])])[$([O][CH3]),$([O][CH2][CH3])].[#6,$([NH][C](=O)[CH3]);!$(C=[O,S,N]):10][NH2:11]>>[*:1][C:2](=[O:3])[C:4]([C:5](=[O:6])[*:7])=[C:9]([C:8])[NH:11][*:10] 29 | [#6:1][N:2]=[C:3]=[S:4].[c:5][CH:6]=O>>[c:5][C:6]=[C]1[S][C:3](=[S:4])[N:2]([#6:1])C1=O 30 | [N:1]#[C:2][#6:3].[NH2:4][c:5][c:6][C:7](=[O:8])[$(O(C=O)C)]>>[#6:3][C:2]1=[N:4][c:5][c:6][C:7](=[O:8])[N:1]1 31 | [C:1][NH2:2].([OH,$(O(C=O)C),Cl,Br,I][C;!R1:3](=[O:4])[CR1,$([c][c][NH][C]=[O]):5].[CR1,$([c][c][C]=[O]):6][NH:7][C;!R1:8](=[O:9])[OH,$(O(C=O)C),Cl,Br,I])>>[*:6][NH:7][C:8](=[O:9])[N:2]([C:1])[C:3](=[O:4])[*:5] 32 | [#6:1][C:2](=[O:3])[OH,O-:4].[Cl,Br,I][C;!$(C=[N,O,S]):5]>>[#6:1][C:2](=[O:3])[O:4][C:5] 33 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][Cl,$(OS(=O)(=O)[#6;!R])]>>[#6:1]S(=O)(=O)[#6:2] 34 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][Cl,$(OS(=O)(=O)[#6;!R])]>>[#6:1]S(=O)(=O)[#6:2] 35 | [#6:1][NH2:2].Cl[S:3](=[O:4])(=[O:5])[c:6][c:7][C:8](=[O:9])[O;$(O(C)C)]>>[C:1][N:2]1[S:3](=[O:4])(=[O:5])[c:6][c:7][C:8](=[O:9])1 36 | [NH2:1]-[c:2][c:3]-[C:4]#[N:5].[#6;!$(C=[O,S,N]):6][NH2:7]>>[#6:6][NH:7][cH0+0:4]1[nH0+0:5][cH0+0][nH0+0:1][c:2][c:3]1 37 | [NH2;$([N][#6]);!$([N]C=[O,S,N]):1].[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:2]>>[N:1][C](=[O])[N:2] 38 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][SH:3]>>[#6:1][S:3](=O)(=O)[#6:2] 39 | [NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:1].[NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:2]>>[N:1]C(=O)C(=O)[N:2] 40 | [N;$([N](=CS)[C]),$([NH]=CS):1]=[C:2]([S;!R]C)[NX3;$([NH](CS)[C]),$([NH2]CS):3].[C;!$(C=[N,O,S]):4][NH2:5]>>[C:4][N:5][C:2](=[N:1])[N:3] 41 | [$(C=O),$([N+](=O)[O-]):1][CH2:2][$(C=O),$([N+](=O)[O-]):3].FC(F)(F)CO[C:4](=[O:5])[NH:6][#6;!$(C=[O,S,N]):7]>>[*:1][CH:2]([*:3])[C:4](=[O:5])[NH:6][*:7] 42 | [NX3;$([NH](C=S)[C]),$([NH2]C=S):1][C:2](=S)[NX3;$([NH](C=S)[C]),$([NH2]C=S):3].[C;!$(C=[N,O,S]):4][NH2:5]>>[C:4][NH:5][C:2](=[N:1])[NH:3] 43 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][SH:3]>>[#6:1][S:3](=O)[#6:2] 44 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[#6:3][CX3;!$(C[!#6]);H0,H1:4]=[O]>>[N:1][C:4][#6:3] 45 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[#6:4][C:5](=[O:6])[O:7][C:8](=[O:9])[#6:10]>>([N:1][C:5](=[O:6])[#6:4].[OH:7][C:8](=[O:9])[#6:10]) 46 | [NH2:1][c:2][c:3][C:4](=[O:5])[O][C].[C:6][NH2:7]>>[NH:1]1[c:2][c:3][C:4](=[O:5])[N:7]([C:6])C1=O 47 | [#6:1]-[C:2]#[N:3].[#6:4]-[C:5](=O)[OH,O-]>>[#6:1]-[cH0+0:2]1[nH0+0][oH0+0][cH0+0:5]([#6:4])[nH0+0:3]1 48 | [OH:1]-[N:2]=[C:3]([#6:4])-[NH2:5]>>[#6:4]-[cH0+0:3]1[nH0+0:2][oH0+0:1][cH0+0](=O)[nH1+0:5]1 49 | [#6:1]-[N:2]=[C:3]=S.[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:5].[#6:7]-[C:8](=O)[NH:9][NH2:10]>>[#6:1]-[nH0+0:2]1[cH0+0:8](-[#6:7])[nH0+0:9][nH0+0:10][cH0+0:3](-[N:5])1 50 | 
[#6:1][C:2](=[O:3])[O;$(O[CH3]),$(O[CH2][CH3]),$(O[CH2][C](F)(F)F)].[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:4]>>[#6:1][C:2](=[O:3])[N:4] 51 | [c:1]-B(O)O.[Cl,Br,I][c:2]>>[c:1]-[c:2] 52 | [#6,#1:1][C:2](=[O:3])[NH:4][NH2:5].[#6,H:6][C:7](=O)[OH,O-]>>[*:1][c:2]1[nH0+0:4][nH0+0:5][c:7]([*:6])[oH0+0:3]1 53 | [NX3;$([NH]([#6])[C]),$([NH2][C]):1][C:2][C:3](=[O:4])O[C;!R].[C;!$(C=[N,O,S]):5][NH2:6]>>[N:1]1[C:2][C:3](=[O:4])[N:6]([C:5])C(=O)1 54 | [NX3;$([NH]([#6])[C]),$([NH2][C]):1][C:2][C:3](=[O:4])O[C;!R].[C;!$(C=[N,O,S]):5][NH:6][C:7](=[O:8])O[CH2]C(F)(F)F>>[N:1]1[C:2][C:3](=[O:4])[N:6]([C:5])[C:7](=[O:8])1 55 | [#6,#1:1][C:2](=O)[C:3]([#6,H:4])[Cl,Br,I]>>[*:1]-[cH0+0:2]1[nH0+0][cH0+0](-[NH2])[sH0+0][c:3]1[*:4] 56 | [NX3;$([NH]([#6])[C]),$([NH2][C]):1][C:2][C:3](=[O:4])O[C;!R].[C;!$(C=[N,O,S]):5][NH:6][C:7](=[O:8])O[CH2]C(F)(F)F>>[C:5][NH:6][C:7](=[O:8])[N:1][C:2][C:3](=[O:4])[OH] 57 | [#6;!$(C=[O,S,N]):1][OH,SH:2].[Cl,Br,I][#6;!$(C=[N,O,S]):3]>>[#6:1][O,S:2][#6:3] 58 | [#6:1][C:2](=[O:3])[F,Cl,Br].[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:4]>>[#6:1][C:2](=[O:3])[N:4] 59 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[C:2]1[O:4][C:3]1>>[#7:1][C:2][C:3][OH:4] 60 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[C:4]=[C:5][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):6]>>[N:1][C:4][C:5][*:6] 61 | [#6:1]-[N:5]=[N+:6]=[N-:7].[#6:2]-[C:3]#[CH:4]>>[#6:2][cH0+0:3]1[cH1+0:4][nH0+0:5]([#6:1])[nH0+0:6][nH0+0:7]1 62 | CO-[CH0+0:1]1=[NH0+0:2][CH2+0:3][C,O,N:4][C,O,N:5][CH2+0:6]1.[#6:7]-[C:8](=O)[NH:9][NH2:10]>>[#6:7]-[cH0+0:8]1[nH0+0:9][nH0+0:10][cH0+0:1]2[nH0+0:2]1[CH2+0:3][*:4][*:5][CH2+0:6]2 63 | [#6,#1:1][C:2](=[O:3])[NH:4][NH2:5].[#6,H:6][N:7]=[C:8]=S>>[*:1][c:2]1[nH0+0:4][nH0+0:5][c:8]([NH:7][*:6])[oH0+0:3]1 64 | [CH1:1]-[Cl,Br,I].[#6:2]-[C:3]#[CH:4]>>[#6:2][cH0+0:3]1[cH1+0:4][nH0+0]([CH1:1])[nH0+0][nH0+0]1 65 | [#6:1]-[N:5]=[N+:6]=[N-:7].[#6;!$(C=[O,S,N]):2]-[NH2:3]>>[#6:2]-[NH1:3][CH2][cH0+0]1[cH1+0][nH0+0:5]([#6:1])[nH0+0:6][nH0+0:7]1 66 | [#6,#7:1][C:2](=[S:3])[NH2:4].[F,Cl,Br,I][CH:5][C:6](=O)[C:7]#[C:8][Si](C)(C)C.[N-:9]=[N+:10]=[N:11][c:12]>>[*:1][c:2]1[s:3][c:5][c:6]([c:7]2[nHo+0:9][nHo+0:10][nHo+0:11]([c:12])[c:8]2)[n:4]1 67 | [#6;!$(C=[O,S,N]):1]-[NH2:2].[#6;!$(C=[O,S,N]):3]-[NH2:4]>>[#6:1][NH:2]c1ncnc2c1nc[nH0+0:4]2[#6:3] 68 | [C;!$(C=[N,O,S]):1][NH2:2].[C;!$(C=[N,O,S]):3][NH2:4].[C;!$(C=[N,O,S]):5][NH2:6]>>[C:1][N:2]=C([NH:6][C:5])[NH:4][C:3] 69 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H2:1].[#6:3][CX3;!$(C[!#6]);H0,H1:4]=[O].[Cl,Br,I][C,S;$([C,S]=[S,O,N]):5]>>[*:5][N:1][C:4][#6:3] 70 | ([NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1:1].[NX3;H0:2]C(=O)OC(C)(C)C).[#6:3][C:4](=[O:5])[OH,O-].[#6:6][C:7](=[O:8])[OH,O-]>>([N:1][C:4](=[O:5])[#6:3].[N:2][C:7](=[O:8])[#6:6]) 71 | [NH2:1][C:2](=S)[N:3]([#6;!$(C=[O,S,N,C]):4])[#6,H;!$(C=[O,S,N,C]):5].[#6:6]-[C:7](=O)[NH:8][NH2:9]>>[nH1+0:1]1[cH0+0:7](-[#6:6])[nH0+0:8][nH0+0:9][cH0+0:2](-[N:3]([#6,H:5])[#6:4])1 72 | [#6:1][C:2](=O)[#6:3].[#6:4][OH,nH,NH:5]>>[#6:4][*:5][C:2]([#6:1])([#6:3])C(=O)O 73 | [#6:1][C:2](=[O:3])[OH,O-:4].[OH][C;!$(C=[N,O,S]):5]>>[#6:1][C:2](=[O:3])[O:4][C:5] 74 | ([C;!$(C=[N,O,S]):1][NH:2][#6;!$([c][c][c][NH2]);$([#6][#6][NH2]),$([#6][#6][#6][NH2]):3].[#6;$([#6][#6][NH:2]),$([#6][#6][#6][NH:2]):4][NH2:5]).[C;!$(C=[N,O,S]):7][NH2:6]>>[#6:3][N:2]([C:1])C([NH:6][C:7])=[N:5][#6:4] 75 | CO[C:1](=[O:2])[C:3][NH2:4].CC(C)(C)OC(=O)[NH:5][C:6][C:7](=[O:8])[OH,O-]>>[C:1](=[O:2])1[C:3][NH:4][C:7](=[O:8])[C:6][NH:5]1 76 | 
[C;!$(C=[N,O,S]):1][NH2:2].[NH2:3][C;!$(C=[O,S,N]):4][C:5][C:6](=[O:7])[O;$(OC)]>>[C:1][N:2]1[C:6](=[O:7])[C:5][C:4][NH:3]C(=O)1 77 | [NH2:1][C:2](=[NH])-[#6,H:3].[#6,H:4]-[C:5](=O)[OH,O-].[#6,H:6][NH:7][NH2:8]>>[*:3]-[cH0+0:2]1[nH0+0:1][cH0+0:5](-[*:4])[nH0+0:7]([*:6])[nH0+0:8]1 78 | [#6,#1:1][C:2](=O)[NH:3][NH2:4].[#6,H:5][C:6](=[NH:7])O[CH3]>>[*:1]-[cH0+0:2]1[nH0+0:3][nH0+0:4][cH0+0:6]([*:5])[nH1+0:7]1 79 | [#6,#1:1][C:2](=O)[c:9]1[c:10][c:11][c:12][c:13][c:14]1[NH2:3].CC(C)(C)OC(=O)[NH:4][C:5]([#6,#1:6])[C:7](=[O:8])[OH,O-]>>[O:8]=[C:7]1[NH:3][c:14]2[c:13][c:12][c:11][c:10][c:9]2[CH0+0:2]([*:1])=[NH0+0:4][C:5]1[*:6] 80 | [CH3:1][C:2](=[O])[#6:4].[NH2:5][C:6](=[O:7])[N;$([NH](C=O)[#6]),$([NH2](C=O)):8].[OH,O-:9][c:10][c:11][CH:12]=O>>[#6:4][C:2]12[CH2:1][CH:12]([NH:5][C:6](=[O:7])[N:8]1)[c:11][c:10][O:9]2 81 | [C;!$(C=[N,O,S]):1][NH2:2].[C;!$(C=[N,O,S]):3][NH2:4].[C:5]=[C:6][C:7](=[O:8])[O;$(OC)]>>[C:1][N:2]1[C:5][C:6][C:7](=[O:8])[N:4]([C:3])C(=O)1 82 | [NH2:1][NH:2][#6:3].O=[CH:4][c:5][c:6][C:7](=[O:8])OC>>[#6:3][N:2]1[N:1]=[CH:4][c:5][c:6][C:7](=[O:8])1 83 | [NH2;$([N][#6]);!$([N]C=[O,S,N]):1].[OH,O-][C:2](=[O:3])[C:4][C:5](=[C;H1,H2:6])[C:7](=[O:8])[O:9]>>[N:1]1[C:2](=[O:3])[C:4][C:5]([C:7](=[O:8])[O:9])[C:6]1 84 | [#6:1][C:2]#[N:3]>>[#6:1][c:2]1[n:3][nH][n][n]1 85 | [N:1]C(=O)OC([CH3])([CH3])[CH3]>>[N:1] 86 | [C:1](=[O:2])[O:3][#6]>>[C:1](=[O:2])[O:3] 87 | [NH3+,NH2:1]-[C$(C(N)(C)(C)(C)),C$([CH](N)(C)(C)),C$([CH2](N)(C)):2]-[C$(C(c)(C)(C)(C)),C$([CH](c)(C)(C)),C$([CH2](c)(C)):3]-[c:4][cH1:5].[CH:6](-[#6:7])=[O]>>[#6:7]-[CH:6]1[N:1][C:2][C:3][c:4][c:5]1 88 | [C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):1]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1[C:2][C:3][C:4]=[C:5][C:6]1 89 | [C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):1]#[C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1=[C:2][C:3][C:4]=[C:5][C:6]1 90 | [CH3,$([CH2](C=O)[#6])&!R:4][CD3:2](=[O:3])[c:1][c:5][OH:6].[C;$(C1[#6][#6][N,C][#6][#6]1):7](=[OD1])>>[O:6]1[c:5][c:1][C:2](=[OD1:3])[C:4][C:7]1 91 | [C$(C(=O)(C)([CX4]))&!R,C$(C[H](=O)(C))&!R:1](=[O])[CH1,CH2:3][CH1,CH2:4][C$(C(=O)(C)([CX4]))&!R,C$(C[H](=O)(C))&!R:5]=[O].[N$([NH2,NH3+1]([CX4])):7]>>[c:5]1[c:4][c:3][c:1][n:7]1 92 | [NH1;$(N-c1ccccc1):1]([NH2])[c:5][cH1:4].[C;$(C([#6])[#6]):2](=[OD1])-[CH2;$(C([#6])[#6]);!$(C(C=O)C=O):3]>>[c:5]1[nH:1][c:2][c:3][c:4]1 93 | [OH:7][c:6]1[cH:1][c:2][c:3][c:4][c:5]1.[O$(O(C)([CX4]))][C:11](=[O:15])[CH,CH2:10][C:8]=[O]>>[C:8]1=[C:10][C:11](=[O:15])[O:7][c:6]2[c:5][c:4][c:3][c:2][c:1]12 94 | [*;Br,I;$(*c1ccccc1)][c:1][c:2][OH1,SH1,NH2:3].[CH1:5]#[C:4]>>[c:1]1[c:2][*:3][c:4][c:5]1 95 | [C$([C](O)([CX4])([CX4])([CX4])),C$([CH](O)([CX4])([CX4])),C$([CH2](O)([CX4]))][O][C$(C(=O)([CX4])),C$([CH](=O)):2]=[O:5].[C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):7]-[C$(C(=O)([CX4])),C$([CH](=O)):8]=[O:9]>>[C:7]([C:2]=[O:5])[C:8]=[O:9] 96 | 
[Br,I][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):1].[Br,I][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):2]>>[#6:1][#6:2] 97 | [C$([CH](=C)([CX4])),C$([CH2](=C)):2]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):3].[Br,I][C$([CX4]([Br,I])),c$([c]([Br,I])):4]>>[#6:4][C:2]=[C:3] 98 | [#6:1][C:2]#[N;D1].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=[O,S,N]):3]>>[#6:1][C:2](=O)[#6:3] 99 | [#6:1][C;H1,$([C]([#6])[#6]):2]=[OD1:3].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=[O,S,N]):4]>>[C:1][#6:2]([OH1:3])[#6:4] 100 | [c:1]B(O)O.[nH1;+0;r5;!$(n[#6]=[O,S,N]);!$(n~n~n);!$(n~n~c~n);!$(n~c~n~n):2]>>[c:1][n:2] 101 | [*:1][C:2]#[CH:3].[Cl,Br,I][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):4]>>[#6:4][C:3]#[C:2][*:1] 102 | [C$(C(C)([CX4])([CX4])([CX4])),C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):1][C:2]#[CH:3].[Cl,Br,I][C$(C(=O)([CX4])),C$([CH](=O)):5]=[O:6]>>[#6:1][C:2]#[C:3][C:5]=[O:6] 103 | [#6:1][C;H1,$([CH0]([#6])[#6]);!$(CC=O):2]=[OD1].[Cl,Br,I][C;H2;$(C[#6]);!$(CC[I,Br]);!$(CCO[CH3]):3]>>[C:1][C:2]=[C:3] 104 | [Cl,Br,I][c;$(c1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):1].[N;$(NC)&!$(N=*)&!$([N-])&!$(N#*)&!$([ND3])&!$([ND4])&!$([N][N])&!$(N[c,O])&!$(N[C,S]=[S,O,N]),H2&$(Nc1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):2]>>[c:1][N:2] 105 | [C;$(C([#6])[#6;!$([#6]Br)]):4](=[OD1])[CH;$(C([#6])[#6]):5]Br.[NH2:3][C;$(C(=N)(N)[#6,#7]):2]=[NH;D1:1]>>[c:4]1[c:5][nH:3][c:2][n:1]1 106 | [c;$(c1[c;$(c[C,S,N](=[OD1])[*;R0;!OH1])]cccc1):1][C;$(C(=O)[OH])].[c;$(c1aaccc1):2][Cl,Br,I]>>[c:1][c:2] 107 | [N;$(N[#6]):1]=[C;$(C=O):2].[N;$(N[#6]);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[O,N]);!$(N[C,S]=[S,O,N]):3]>>[N:1][C:2][N+0:3] 108 | [OH,O-][C$(C(=O)([OH,O-])([CX4])),C$([CH](=O)([OH,O-])):1]=[O:2]>>[Cl][C:1]=[O:2] 109 | [OH][$([CX4]),c:1]>>[Br][#6:1] 110 | [OH][$([CX4]),c:1]>>[Cl][#6:1] 111 | [OH,O-][S$(S([CX4])):1](=[O:2])=[O:3]>>[Cl][S:1](=[O:2])=[O:3] 112 | [OH+0,O-:1][C:2](=[O:3])[C$([CH]([CX4])),C$([CH2]):4]>>[#8:1][C:2](=[O:3])[C:4][Br] 113 | [OH+0,O-:1][C:2](=[O:3])[C$([CH]([CX4])),C$([CH2]):4]>>[#8:1][C:2](=[O:3])[C:4][Cl] 114 | [Cl,Br,I][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):1]>>[N]#[C][#6:1] 115 | [OH,NH2,NH3+][CH2:2][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):1]>>[#6:1][C:2]#[N] -------------------------------------------------------------------------------- /data/91_rxns/91_rxn_templates.txt: -------------------------------------------------------------------------------- 1 | [cH1:1]1:[c:2](-[CH2:7]-[CH2:8]-[NH2:9]):[c:3]:[c:4]:[c:5]:[c:6]:1.[#6:11]-[CH1;R0:10]=[OD1]>>[c:1]12:[c:2](-[CH2:7]-[CH2:8]-[NH1:9]-[C:10]-2(-[#6:11])):[c:3]:[c:4]:[c:5]:[c:6]:1 2 | [c;r6:1](-[NH1;$(N-[#6]):2]):[c;r6:3](-[NH2:4]).[#6:6]-[C;R0:5](=[OD1])-[#8;H1,$(O-[CH3])]>>[c:3]2:[c:1]:[n:2]:[c:5](-[#6:6]):[n:4]2 3 | [c;r6:1](-[NH1;$(N-[#6]):2]):[c;r6:3](-[NH2:4]).[#6:6]-[CH1;R0:5](=[OD1])>>[c:3]2:[c:1]:[n:2]:[c:5](-[#6:6]):[n:4]2 4 | [c;r6:1](-[SH1:2]):[c;r6:3](-[NH2:4]).[#6:6]-[CH1;R0:5](=[OD1])>>[c:3]2:[c:1]:[s:2]:[c:5](-[#6:6]):[n:4]2 5 | 
[c:1](-[OH1;$(Oc1ccccc1):2]):[c;r6:3](-[NH2:4]).[c:6]-[CH1;R0:5](=[OD1])>>[c:3]2:[c:1]:[o:2]:[c:5](-[c:6]):[n:4]2 6 | [c;r6:1](-[OH1:2]):[c;r6:3](-[NH2:4]).[#6:6]-[C;R0:5](=[OD1])-[OH1]>>[c:3]2:[c:1]:[o:2]:[c:5](-[#6:6]):[n:4]2 7 | [#6:6]-[C;R0:1](=[OD1])-[CH1;R0:5](-[#6:7])-[*;#17,#35,#53].[NH2:2]-[C:3]=[SD1:4]>>[c:1]2(-[#6:6]):[n:2]:[c:3]:[s:4][c:5]([#6:7]):2 8 | [c:1](-[C;$(C-c1ccccc1):2](=[OD1:3])-[OH1]):[c:4](-[NH2:5]).[N;!H0;!$(N-N);!$(N-C=N);!$(N(-C=O)-C=O):6]-[C;H1,$(C-[#6]):7]=[OD1]>>[c:4]2:[c:1]-[C:2](=[O:3])-[N:6]-[C:7]=[N:5]-2 9 | [CH0;$(C-[#6]):1]#[NH0:2]>>[C:1]1=[N:2]-N-N=N-1 10 | [CH0;$(C-[#6]):1]#[NH0:2].[C;A;!$(C=O):3]-[*;#17,#35,#53]>>[C:1]1=[N:2]-N(-[C:3])-N=N-1 11 | [CH0;$(C-[#6]):1]#[NH0:2].[C;A;!$(C=O):3]-[*;#17,#35,#53]>>[C:1]1=[N:2]-N=N-N-1(-[C:3]) 12 | [CH0;$(C-[#6]):1]#[CH1:2].[C;H1,H2;A;!$(C=O):3]-[*;#17,#35,#53,OH1]>>[C:1]1=[C:2]-N(-[C:3])-N=N-1 13 | [CH0;$(C-[#6]):1]#[CH1:2].[C;H1,H2;A;!$(C=O):3]-[*;#17,#35,#53,OH1]>>[C:1]1=[C:2]-N=NN(-[C:3])-1 14 | [CH0;$(C-[#6]):1]#[CH0;$(C-[#6]):2].[C;H1,H2;A;!$(C=O):3]-[*;#17,#35,#53,OH1]>>[C:1]1=[C:2]-N=NN(-[C:3])-1 15 | [CH0;$(C-[#6]):1]#[NH0:2].[NH2:3]-[NH1:4]-[CH0;$(C-[#6]);R0:5]=[OD1]>>[N:2]1-[C:1]=[N:3]-[N:4]-[C:5]=1 16 | [CH0;$(C-[#6]):1]#[NH0:2].[CH0;$(C-[#6]);R0:5](=[OD1])-[#8;H1,$(O-[CH3]),$(O-[CH2]-[CH3])]>>[N:2]1-[C:1]=N-N-[C:5]=1 17 | [c:1](-[C;$(C-c1ccccc1):2](=[OD1:3])-[CH3:4]):[c:5](-[OH1:6]).[C;$(C1-[CH2]-[CH2]-[N,C]-[CH2]-[CH2]-1):7](=[OD1])>>[O:6]1-[c:5]:[c:1]-[C:2](=[OD1:3])-[C:4]-[C:7]-1 18 | [c;r6:1](-[C;$(C=O):6]-[OH1]):[c;r6:2]-[C;H1,$(C-C):3]=[OD1].[NH2:4]-[NH1;$(N-[#6]);!$(NC=[O,S,N]):5]>>[c:1]1:[c:2]-[C:3]=[N:4]-[N:5]-[C:6]-1 19 | [C;$(C-c1ccccc1):1](=[OD1])-[C;D3;$(C-c1ccccc1):2]~[O;D1,H1].[CH1;$(C-c):3]=[OD1]>>[C:1]1-N=[C:3]-[NH1]-[C:2]=1 20 | [NH1;$(N-c1ccccc1):1](-[NH2])-[c:5]:[cH1:4].[C;$(C([#6])[#6]):2](=[OD1])-[CH2;$(C([#6])[#6]);!$(C(C=O)C=O):3]>>[C:5]1-[N:1]-[C:2]=[C:3]-[C:4]:1 21 | [NH2;$(N-c1ccccc1):1]-[c:2]:[c:3]-[CH1:4]=[OD1].[C;$(C([#6])[#6]):6](=[OD1])-[CH2;$(C([#6])[#6]);!$(C(C=O)C=O):5]>>[N:1]1-[c:2]:[c:3]-[C:4]=[C:5]-[C:6]:1 22 | [*;Br,I;$(*c1ccccc1)]-[c:1]:[c:2]-[OH1:3].[CH1:5]#[C;$(C-[#6]):4]>>[c:1]1:[c:2]-[O:3]-[C:4]=[C:5]-1 23 | [*;Br,I;$(*c1ccccc1)]-[c:1]:[c:2]-[SD2:3]-[CH3].[CH1:5]#[C;$(C-[#6]):4]>>[c:1]1:[c:2]-[S:3]-[C:4]=[C:5]-1 24 | [*;Br,I;$(*c1ccccc1)]-[c:1]:[c:2]-[NH2:3].[CH1:5]#[C;$(C-[#6]):4]>>[c:1]1:[c:2]-[N:3]-[C:4]=[C:5]-1 25 | [#6:6][C:5]#[#7;D1:4].[#6:1][C:2](=[OD1:3])[OH1]>>[#6:6][c:5]1[n:4][o:3][c:2]([#6:1])n1 26 | [#6;$([#6]~[#6]);!$([#6]=O):2][#8;H1:3].[Cl,Br,I][#6;H2;$([#6]~[#6]):4]>>[CH2:4][O:3][#6:2] 27 | [#6;H0;D3;$([#6](~[#6])~[#6]):1]B(O)O.[#6;H0;D3;$([#6](~[#6])~[#6]):2][Cl,Br,I]>>[#6:2][#6:1] 28 | [c;H1:3]1:[c:4]:[c:5]:[c;H1:6]:[c:7]2:[nH:8]:[c:9]:[c;H1:1]:[c:2]:1:2.O=[C:10]1[#6;H2:11][#6;H2:12][N:13][#6;H2:14][#6;H2:15]1>>[#6;H2:12]3[#6;H1:11]=[C:10]([c:1]1:[c:9]:[n:8]:[c:7]2:[c:6]:[c:5]:[c:4]:[c:3]:[c:2]:1:2)[#6;H2:15][#6;H2:14][N:13]3 29 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[NH1;$(N(C=O)C=O):2]>>[C:1][N:2] 30 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[OH1;$(Oc1ccccc1):2]>>[C:1][O:2] 31 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[NH1;$(N([#6])S(=O)=O):2]>>[C:1][N:2] 32 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7H1:2]1~[#7:3]~[#7:4]~[#7:5]~[#6:6]~1>>[C:1][#7:2]1:[#7:3]:[#7:4]:[#7:5]:[#6:6]:1 33 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7H1:2]1~[#7:3]~[#7:4]~[#7:5]~[#6:6]~1>>[#7H0:2]1:[#7:3]:[#7H0:4]([C:1]):[#7:5]:[#6:6]:1 34 | 
[C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7:2]1~[#7:3]~[#7H1:4]~[#7:5]~[#6:6]~1>>[C:1][#7H0:2]1:[#7:3]:[#7H0:4]:[#7:5]:[#6:6]:1 35 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7:2]1~[#7:3]~[#7H1:4]~[#7:5]~[#6:6]~1>>[#7:2]1:[#7:3]:[#7:4]([C:1]):[#7:5]:[#6:6]:1 36 | [#6;$(C=C-[#6]),$(c:c):1][Br,I].[Cl,Br,I][c:2]>>[c:2][#6:1] 37 | [#6:1][C:2]#[#7;D1].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=O):3]>>[#6:1][C:2](=O)[#6:3] 38 | [#6:1][C;H1,$([C]([#6])[#6]):2]=[OD1:3].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=O):4]>>[C:1][#6:2]([OH1:3])[#6:4] 39 | [S;$(S(=O)(=O)[C,N]):1][Cl].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[S:1][N+0:2] 40 | [c:1]B(O)O.[nH1;+0;r5;!$(n[#6]=[O,S,N]);!$(n~n~n);!$(n~n~c~n);!$(n~c~n~n):2]>>[c:1][n:2] 41 | [#6:3]-[C;H1,$([CH0](-[#6])[#6]);!$(CC=O):1]=[OD1].[Cl,Br,I][C;H2;$(C-[#6]);!$(CC[I,Br]);!$(CCO[CH3]):2]>>[C:3][C:1]=[C:2] 42 | [Cl,Br,I][c;$(c1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):1].[N;$(NC)&!$(N=*)&!$([N-])&!$(N#*)&!$([ND3])&!$([ND4])&!$(N[c,O])&!$(N[C,S]=[S,O,N]),H2&$(Nc1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):2]>>[c:1][N:2] 43 | [C;$(C([#6])[#6;!$([#6]Br)]):4](=[OD1])[CH;$(C([#6])[#6]):5]Br.[#7;H2:3][C;$(C(=N)(N)[c,#7]):2]=[#7;H1;D1:1]>>[C:4]1=[CH0:5][NH:3][C:2]=[N:1]1 44 | [c;$(c1[c;$(c[C,S,N](=[OD1])[*;R0;!OH1])]cccc1):1][C;$(C(=O)[O;H1])].[c;$(c1aaccc1):2][Cl,Br,I]>>[c:1][c:2] 45 | [c;!$(c1ccccc1);$(c1[n,c]c[n,c]c[n,c]1):1][Cl,F].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[c:1][N:2] 46 | [c;$(c1c(N(~O)~O)cccc1):1][Cl,F].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[c:1][N:2] 47 | [c;$(c1ccc(N(~O)~O)cc1):1][Cl,F].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[c:1][N:2] 48 | [N;$(N-[#6]):3]=[C;$(C=O):1].[N;$(N[#6]);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[O,N]);!$(N[C,S]=[S,O,N]):2]>>[N:3]-[C:1]-[N+0:2] 49 | [N;$(N-[#6]):3]=[C;$(C=S):1].[N;$(N[#6]);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[O,N]);!$(N[C,S]=[S,O,N]):2]>>[N:3]-[C:1]-[N+0:2] 50 | [$(C([CH2,CH3])),CH:10](=[O:11])-[NH+0:9]-[C$(C(N)(C)(C)(C)),C$([CH](N)(C)(C)),C$([CH2](N)(C)):8]-[C$(C(c)(C)(C)(C)),C$([CH](c)(C)(C)),C$([CH2](c)(C)):7]-[c:6]1[cH:1][c:2][c:3][c:4][c:5]1>>[C:10]-1=[N+0:9]-[C:8]-[C:7]-[c:6]2[c:5][c:4][c:3][c:2][c:1]-12 51 | [$(C([CH2,CH3])),CH:10](=[O:11])-[NH+0:9]-[C$([CH](N)(C)(C)),C$([CH2](N)(C)):8]-[C$([C](c)(C)(C)),C$([CH](c)(C)):7]([O$(OC),OH])-[c:6]1[cH:1][c:2][c:3][c:4][c:5]1>>[c:10]-1[n:9][c:8][c:7][c:6]2[c:5][c:4][c:3][c:2][c:1]-12 52 | [NH3+,NH2]-[C$(C(N)(C)(C)(C)),C$([CH](N)(C)(C)),C$([CH2](N)(C)):8]-[C$(C(c)(C)(C)(C)),C$([CH](c)(C)(C)),C$([CH2](c)(C)):7]-[c:6]1[c:1][c:2][nH:3][cH:5]1.[CH:10](-[CX4:12])=[O:11]>>[c,C:12]-[CH:10]-1-[N]-[C:8]-[C:7]-[c:6]2[c:1][c:2][nH:3][c:5]-12 53 | [NH2,NH3+1:8]-[c:5]1[cH:4][c:3][c:2][c:1][c:6]1.[Br:18][C$([CH2](C)(Br)),C$([CH](C)(C)(Br)):17]-[C:15](=[O:16])-[c:10]1[c:11][c:12][c:13][c:14][c:9]1>>[c:13]1[c:12][c:11][c:10]([c:9][c:14]1)-[c:15]1[c:17][c:4]2[c:3][c:2][c:1][c:6][c:5]2[nH+0:8]1 54 | [Cl:1][CH2:2]-[C$([CH](C)),C$(C(C)(C)):3]=[O:4].[OH:12]-[c:11]1[c:6][c:7][c:8][c:9][c:10]1-[CH:13]=[O:14]>>[C:3](=[O:4])-[c:2]1[c:13][c:10]2[c:9][c:8][c:7][c:6][c:11]2[o:12]1 55 | [NH2,NH3+]-[C$([CX4](N)([c,C])([c,C])([c,C])),C$([CH](N)([c,C])([c,C])),C$([CH2](N)([c,C])),C$([CH3](N)):2].[NH2:12]-[c:7]1[c:6][c:5][c:4][c:3][c:8]1-[C:9](-[OH,O-:11])=[O:10]>>[C:2]-[n+0]-1[c:13][n:12][c:7]2[c:6][c:5][c:4][c:3][c:8]2[c:9]-1=[O:10] 56 | 
[N$([NH2]([CX4])),N$([NH3+1]([CX4])):1].[O:5]-[C$([CH]([CX4])(C)(O)),C$([CH2]([CX4])(O)):3][C$(C([CX4])(=O)([CX4])),C$([CH]([CX4])(=O)):4]=[O:6]>[O:15]=[C:9]-1-[CH2:10]-[CH2:11]-[CH2:12]-[CH2:13]-[CH2:14]-1>[c:4]1[c:3][n+0:1][c:10]2-[C:11]-[C:12]-[C:13]-[C:14]-[c:9]12 57 | [C$(C(=O)([CX4])([CX4])),C$([CH](=O)([CX4])):2](=[O:6])-[C$([CH]([CX4])),C$([CH2]):3]-[C$(C(=O)([CX4])([CX4])),C$([CH](=O)([CX4])):4]=[O:7].[NH2:8]-[C:9](=[O:10])-[CH2:11][C:12]#[N:13]>>[OH:10]-[c:9]1[n:8][c:4][c:3][c:2][c:11]1[C:12]#[N:13] 58 | [C$(C(#C)([CX4])):2]#[C$(C(#C)([CX4])):1].[N$(N(~N)([CX4])):5]~[N]~[N]>>[c:2]1[c:1][n:5][n][n]1 59 | [C$(C(=C)([CX4])):2]=[C$(C(=C)([CX4])):1].[N$(N(~N)([CX4])):5]~[N]~[N]>>[C:2]1[C:1][N:5][N]=[N]1 60 | [C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):1]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1[C:2][C:3][C:4]=[C:5][C:6]1 61 | [C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):1]#[C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1=[C:2][C:3][C:4]=[C:5][C:6]1 62 | [NH2,NH3+:3]-[N$([NH](N)([CX4])):2].[C$([CH](C)(C)([CX4])),C$([CH2](C)(C)):6](-[C$(C(=O)(C)([CX4])),C$([CH](=O)(C)):5]=[O:9])-[C$(C(=O)(C)([CX4])),C$([CH](=O)(C)):7]=[O:10]>>[c:7]1[n:3][n:2][c:5][c:6]1 63 | [C$(C(=O)(C)([CX4])),C$(C[H](=O)(C)):1](=[O:2])-[$([CH](C)(C)([CX4])),$([CH2](C)(C)):3]-[$([CH](C)(C)([CX4])),$([CH2](C)(C)):4]-[C$(C(=O)(C)([CX4])),C$(C[H](=O)(C)):5]=[O:6].[N$([NH2,NH3+1]([CX4])):7]>>[c:5]1[c:4][c:3][c:1][n+0:7]1 64 | [CH:7](=[O:8])-[c:1]1[c:2][c:3][c:4][c:5][c:6]1.[O:24]=[C:23](-[C:22](=[O:25])-[c:15]1[c:10][c:11][c:12][c:13][c:14]1)-[c:20]1[c:21][c:16][c:17][c:18][c:19]1>[NH4].[O-]C(=O)C>[nH:27]-1[c:7]([n:26][c:23]([c:22]-1[c:15]1[c:10][c:11][c:12][c:13][c:14]1)-[c:20]1[c:21][c:16][c:17][c:18][c:19]1)-[c:1]1[c:2][c:3][c:4][c:5][c:6]1 65 | [OH:7]-[c:6]1[cH:1][c:2][c:3][c:4][c:5]1.[O$(O(C)([CX4])):12]-[C:11](=[O:15])-[C$([CH](C)(C)([CX4])),C$([CH2](C)(C)):10]-[C:8]=[O:16]>>[C:8]-1=[C:10]-[C:11](=[O:15])-[O]-[c:6]2[c:5][c:4][c:3][c:2][c:1]-12 66 | [O$(O(C)([CX4])):8][C:7](=[O:9])[CH:6][C:5][C:4][C:3][C:2]([O$(O(C)([CX4])):10])=[O:1]>>[O:8][C:7](=[O:9])[C:6]1[C:5][C:4][C:3][C:2]1=[O:1] 67 | [O$(O(C)([CX4])):8][C:7](=[O:9])[CH:6][C:5][C:11][C:4][C:3][C:2]([O$(O(C)([CX4])):10])=[O:1]>>[O:8][C:7](=[O:9])[C:6]1[C:5][C:11][C:4][C:3][C:2]1=[O:1] 68 | [Cl:9][C:7](=[O:8])-[c:3]1[c:2][c:1][c:6][c:5][c:4]1.[C$([CH2](C)([CX4])),C$([CH3](C)):18]-[C:16](=[O:17])-[c:14]1[c:13][c:12][c:11][c:10][c:15]1-[OH:19]>>[O:17]=[C:16]-1-[C:18]=[C:7](-[O:8]-[c:15]2[c:10][c:11][c:12][c:13][c:14]-12)-[c:3]1[c:2][c:1][c:6][c:5][c:4]1 69 | 
[C$(C(C)(=O)([CX4,OX2&H0])),C$(C(C)(#N)),N$([N+1](C)(=O)([O-1])):1][C$([CH]([C,N])([C,N])([CX4])),C$([CH2]([C,N])([C,N])):2][C$(C(C)(=O)([CX4,OX2&H0])),C$(C(C)(#N)),N$([N+1](C)(=O)([O-1])):3].[C$(C(C)(#N)),C$(C(C)([CX4,OX2&H0])([CX4,OX2&H0])([OX2&H0])),C$([CH](C)([CX4,OX2&H0])([OX2&H0])),C$([CH2](C)([OX2&H0])),C$(C(C)(=O)([OX2&H0])):6][CH:5]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):4]>>[C:6][C:5][C:4][C:2]([C:1])[C:3] 70 | [C$([C](O)([CX4])([CX4])([CX4])),C$([CH](O)([CX4])([CX4])),C$([CH2](O)([CX4])):4]-[O:3]-[C$(C(=O)([CX4])),C$([CH](=O)):2]=[O:5].[C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):7]-[C$(C(=O)([CX4])),C$([CH](=O)):8]=[O:9]>>[C:7](-[C:2]=[O:5])-[C:8]=[O:9] 71 | [Cl,OH,O-:3][C$(C(=O)([CX4,c])),C$([CH](=O)):2]=[O:4].[O$([OH]([CX4,c])),O$([OH]([CX4,c])([CX4,c])),S$([SH]([CX4,c])),S$([SH]([CX4,c])([CX4,c])):6]>>[*:6]-[C:2]=[O:4] 72 | [C$(C(=O)([CX4,c])([CX4,c])),C$([CH](=O)([CX4,c])):1]=[O:2].[N$([NH2,NH3+1]([CX4,c])),N$([NH]([CX4,c])([CX4,c])):3]>>[N+0:3][C:1] 73 | [Br:1][c$(c(Br)),n$(n(Br)),o$(o(Br)),C$([CH](Br)(=C)):2].[C$(C(B)([CX4])([CX4])([CX4])),C$([CH](B)([CX4])([CX4])),C$([CH2](B)([CX4])),C$([CH2](B)),C$(C(B)(=C)),c$(c(B)),o$(o(B)),n$(n(B)):3][B$(B([C,c,n,o])([OH,$(OC)])([OH,$(OC)])),B$([B-1]([C,c,n,o])(N)([OH,$(OC)])([OH,$(OC)])):4]>>[C,c,n,o:2][C,c,n,o:3] 74 | [Br,I:1][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):2].[Br,I:3][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):4]>>[C,c:2][C,c:4] 75 | [OH,O-]-[C$(C(=O)(O)([CX4,c])):2]=[O:3].[OH:8]-[C$([CH](O)([CX4,c])([CX4,c])),C$([CH2](O)([CX4,c])),C$([CH3](O)):6]>>[C:6][O]-[C:2]=[O:3] 76 | [C$([CH](=C)([CX4])),C$([CH2](=C)):2]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):3].[Br,I:7][C$([CX4]([Br,I])),c$([c]([Br,I])):4]>>[C,c:4][C:2]=[C:3] 77 | [Cl,OH,O-:3][C$(C(=O)([CX4,c])),C$([CH](=O)):2]=[O:4].[N$([NH2,NH3+1]([CX4,c])),N$([NH]([CX4,c])([CX4,c])):6]>>[N+0:6]-[C:2]=[O:4] 78 | [C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):1]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):2].[SH:4]-[CX4:5][Br,Cl,I]>>[C:1]-[C:2]-[S:4][C:5] 79 | [C$([C](=O)([CX4])),C$([CH](=O)):2](=[O:1])[OH,Cl,O-:6].[SH:4]-[CX4:5][Br,Cl,I]>>[CH2:2]-[S:4][C:5] 80 | [I:1][C$(C(I)([CX4,c])([CX4,c])([CX4,c])),C$([CH](I)([CX4,c])([CX4,c])),C$([CH2](I)([CX4,c])),C$([CH3](I)):2].[C$(C(=O)([Cl,OH,O-])([CX4,c])),C$([CH]([Cl,OH,O-])(=O)):3](=[O:6])[Cl,OH,O-:5]>>[C:2]-[C:3]=[O:6] 81 | [Cl:5][S$(S(=O)(=O)(Cl)([CX4])):2](=[O:3])=[O:4].[NH2+0,NH3+:6]-[C$(C(N)([CX4,c])([CX4,c])([CX4,c])),C$([CH](N)([CX4,c])([CX4,c])),C$([CH2](N)([CX4,c])),C$([CH3](N)),c$(c(N)):7]>>[C,c:7]-[NH+0:6][S:2](=[O:4])=[O:3] 82 | [*:1][C:2]#[CH:3].[Br,I:4][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):5]>>[C,c:5][C:3]#[C:2][*:1] 83 | [C$(C(C)([CX4])([CX4])([CX4])),C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):1][C:2]#[CH:3].[Br,I:4][C$(C(=O)([Br,I])([CX4])),C$([CH](=O)([Br,I])):5]=[O:6]>>[C:1][C:2]#[C:3][C:5]=[O:6] 84 | [OH,O-:4]-[C$(C(=O)([OH,O-])([CX4])),C$([CH](=O)([OH,O-])):2]=[O:3]>>[Cl:5][C:2]=[O:3] 85 | [OH:2]-[$([CX4]),c:1]>>[Br:3][C,c:1] 86 | [OH:2]-[$([CX4]),c:1]>>[Cl:3][C,c:1] 87 | [OH,O-:3][S$(S([CX4])):2](=[O:4])=[O:5]>>[Cl:6][S:2](=[O:5])=[O:4] 88 | 
[OH+0,O-:5]-[C:3](=[O:4])-[C$([CH]([CX4])),C$([CH2]):2]>>[OH+0,O-:5]-[C:3](=[O:4])-[C:2]([Br:6]) 89 | [OH+0,O-:5]-[C:3](=[O:4])-[C$([CH]([CX4])),C$([CH2]):2]>>[OH+0,O-:5]-[C:3](=[O:4])-[C:2]([Cl:6]) 90 | [Cl,I,Br:7][c:1]1[c:2][c:3][c:4][c:5][c:6]1>>[N:9]#[C:8][c:1]1[c:2][c:3][c:4][c:5][c:6]1 91 | [OH,NH2,NH3+:3]-[CH2:2]-[C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):1]>>[C,c:1][C:2]#[N:4] 92 | -------------------------------------------------------------------------------- /data/smiles_vocab.txt: -------------------------------------------------------------------------------- 1 | H 2 | He 3 | Li 4 | Be 5 | B 6 | C 7 | N 8 | O 9 | F 10 | Ne 11 | Na 12 | Mg 13 | Al 14 | Si 15 | P 16 | S 17 | Cl 18 | Ar 19 | K 20 | Ca 21 | Sc 22 | Ti 23 | V 24 | Cr 25 | Mn 26 | Fe 27 | Co 28 | Ni 29 | Cu 30 | Zn 31 | Ga 32 | Ge 33 | As 34 | Se 35 | Br 36 | Kr 37 | Rb 38 | Sr 39 | Y 40 | Zr 41 | Nb 42 | Mo 43 | Tc 44 | Ru 45 | Rh 46 | Pd 47 | Ag 48 | Cd 49 | In 50 | Sn 51 | Sb 52 | Te 53 | I 54 | Xe 55 | Cs 56 | Ba 57 | La 58 | Ce 59 | Pr 60 | Nd 61 | Pm 62 | Sm 63 | Eu 64 | Gd 65 | Tb 66 | Dy 67 | Ho 68 | Er 69 | Tm 70 | Yb 71 | Lu 72 | Hf 73 | Ta 74 | W 75 | Re 76 | Os 77 | Ir 78 | Pt 79 | Au 80 | Hg 81 | Tl 82 | Pb 83 | Bi 84 | Po 85 | At 86 | Rn 87 | Fr 88 | Ra 89 | Ac 90 | Th 91 | Pa 92 | U 93 | Np 94 | Pu 95 | Am 96 | Cm 97 | Bk 98 | Cf 99 | Es 100 | Fm 101 | Md 102 | No 103 | Lr 104 | Rf 105 | Db 106 | Sg 107 | Bh 108 | Hs 109 | Mt 110 | Ds 111 | Rg 112 | Cn 113 | Nh 114 | Fl 115 | Mc 116 | Lv 117 | Ts 118 | Og 119 | b 120 | c 121 | n 122 | o 123 | s 124 | p 125 | 0 126 | 1 127 | 2 128 | 3 129 | 4 130 | 5 131 | 6 132 | 7 133 | 8 134 | 9 135 | [ 136 | ] 137 | ( 138 | ) 139 | . 140 | = 141 | # 142 | - 143 | + 144 | \ 145 | / 146 | : 147 | ~ 148 | @ 149 | ? 
150 | > 151 | * 152 | $ 153 | % 154 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | channels: 2 | - pytorch 3 | - nvidia 4 | - conda-forge 5 | dependencies: 6 | - _libgcc_mutex=0.1 7 | - _openmp_mutex=4.5 8 | - aiohappyeyeballs=2.5.0 9 | - aiohttp=3.11.13 10 | - aiosignal=1.3.2 11 | - alsa-lib=1.2.13 12 | - antlr-python-runtime=4.9.3 13 | - aom=3.6.1 14 | - async-timeout=5.0.1 15 | - attr=2.5.1 16 | - attrs=25.1.0 17 | - aws-c-auth=0.7.22 18 | - aws-c-cal=0.6.15 19 | - aws-c-common=0.9.23 20 | - aws-c-compression=0.2.18 21 | - aws-c-event-stream=0.4.2 22 | - aws-c-http=0.8.2 23 | - aws-c-io=0.14.9 24 | - aws-c-mqtt=0.10.4 25 | - aws-c-s3=0.5.10 26 | - aws-c-sdkutils=0.1.16 27 | - aws-checksums=0.1.18 28 | - aws-crt-cpp=0.26.12 29 | - aws-sdk-cpp=1.11.329 30 | - azure-core-cpp=1.14.0 31 | - azure-identity-cpp=1.10.0 32 | - azure-storage-blobs-cpp=12.13.0 33 | - azure-storage-common-cpp=12.8.0 34 | - azure-storage-files-datalake-cpp=12.12.0 35 | - black=25.1.0 36 | - brotli=1.1.0 37 | - brotli-bin=1.1.0 38 | - brotli-python=1.1.0 39 | - bzip2=1.0.8 40 | - c-ares=1.34.4 41 | - ca-certificates=2025.1.31 42 | - cairo=1.18.2 43 | - certifi=2025.1.31 44 | - cffi=1.17.1 45 | - cfgv=3.3.1 46 | - chardet=5.2.0 47 | - charset-normalizer=3.4.1 48 | - click=8.1.8 49 | - colorama=0.4.6 50 | - contourpy=1.3.1 51 | - cpython=3.10.16 52 | - cryptography=44.0.2 53 | - cuda-cudart=12.1.105 54 | - cuda-cupti=12.1.105 55 | - cuda-libraries=12.1.0 56 | - cuda-nvrtc=12.1.105 57 | - cuda-nvtx=12.1.105 58 | - cuda-opencl=12.6.77 59 | - cuda-runtime=12.1.0 60 | - cuda-version=12.6 61 | - cudnn=8.9.7.29 62 | - cycler=0.12.1 63 | - cyrus-sasl=2.1.27 64 | - datasets=3.3.2 65 | - dbus=1.13.6 66 | - dill=0.3.8 67 | - distlib=0.3.9 68 | - double-conversion=3.3.0 69 | - einops=0.8.1 70 | - expat=2.6.4 71 | - ffmpeg=4.4.2 72 | - filelock=3.16.1 73 | - flake8=7.1.2 74 | - font-ttf-dejavu-sans-mono=2.37 75 | - font-ttf-inconsolata=3.000 76 | - font-ttf-source-code-pro=2.038 77 | - font-ttf-ubuntu=0.83 78 | - fontconfig=2.15.0 79 | - fonts-conda-ecosystem=1 80 | - fonts-conda-forge=1 81 | - fonttools=4.55.3 82 | - freetype=2.12.1 83 | - freetype-py=2.3.0 84 | - frozenlist=1.5.0 85 | - fsspec=2024.12.0 86 | - gettext=0.23.1 87 | - gettext-tools=0.23.1 88 | - gflags=2.2.2 89 | - giflib=5.2.2 90 | - gitdb=4.0.12 91 | - gitpython=3.1.44 92 | - glog=0.7.1 93 | - gmp=6.3.0 94 | - gmpy2=2.1.5 95 | - gnutls=3.7.9 96 | - graphite2=1.3.13 97 | - greenlet=3.1.1 98 | - h2=4.1.0 99 | - harfbuzz=10.2.0 100 | - hpack=4.0.0 101 | - huggingface_hub=0.29.1 102 | - hyperframe=6.0.1 103 | - icu=75.1 104 | - identify=2.6.5 105 | - idna=3.10 106 | - isort=6.0.1 107 | - jinja2=3.1.5 108 | - joblib=1.4.2 109 | - keyutils=1.6.1 110 | - kiwisolver=1.4.7 111 | - krb5=1.21.3 112 | - lame=3.100 113 | - lcms2=2.16 114 | - ld_impl_linux-64=2.43 115 | - lerc=4.0.0 116 | - libabseil=20240116.2 117 | - libarrow=16.1.0 118 | - libarrow-acero=16.1.0 119 | - libarrow-dataset=16.1.0 120 | - libarrow-substrait=16.1.0 121 | - libasprintf=0.23.1 122 | - libasprintf-devel=0.23.1 123 | - libblas=3.9.0 124 | - libboost=1.86.0 125 | - libboost-python=1.86.0 126 | - libbrotlicommon=1.1.0 127 | - libbrotlidec=1.1.0 128 | - libbrotlienc=1.1.0 129 | - libcap=2.71 130 | - libcblas=3.9.0 131 | - libclang-cpp19.1=19.1.7 132 | - libclang13=19.1.7 133 | - libcrc32c=1.1.2 134 | - libcublas=12.1.0.26 135 | - libcufft=11.0.2.4 136 | - 
libcufile=1.11.1.6 137 | - libcups=2.3.3 138 | - libcurand=10.3.7.77 139 | - libcurl=8.12.1 140 | - libcusolver=11.4.4.55 141 | - libcusparse=12.0.2.55 142 | - libdeflate=1.23 143 | - libdrm=2.4.124 144 | - libedit=3.1.20240808 145 | - libegl=1.7.0 146 | - libev=4.33 147 | - libevent=2.1.12 148 | - libexpat=2.6.4 149 | - libffi=3.4.2 150 | - libgcc=14.2.0 151 | - libgcc-ng=14.2.0 152 | - libgcrypt-lib=1.11.0 153 | - libgettextpo=0.23.1 154 | - libgettextpo-devel=0.23.1 155 | - libgfortran=14.2.0 156 | - libgfortran5=14.2.0 157 | - libgl=1.7.0 158 | - libglib=2.82.2 159 | - libglvnd=1.7.0 160 | - libglx=1.7.0 161 | - libgoogle-cloud=2.25.0 162 | - libgoogle-cloud-storage=2.25.0 163 | - libgpg-error=1.51 164 | - libgrpc=1.62.2 165 | - libhwloc=2.11.2 166 | - libiconv=1.17 167 | - libidn2=2.3.7 168 | - libjpeg-turbo=3.0.0 169 | - liblapack=3.9.0 170 | - libllvm19=19.1.7 171 | - liblzma=5.6.3 172 | - liblzma-devel=5.6.3 173 | - libmagma=2.8.0 174 | - libmagma_sparse=2.8.0 175 | - libnghttp2=1.64.0 176 | - libnl=3.11.0 177 | - libnpp=12.0.2.50 178 | - libnsl=2.0.1 179 | - libntlm=1.8 180 | - libnvjitlink=12.1.105 181 | - libnvjpeg=12.1.1.14 182 | - libopenblas=0.3.29 183 | - libopengl=1.7.0 184 | - libopentelemetry-cpp=1.16.1 185 | - libopentelemetry-cpp-headers=1.16.1 186 | - libparquet=16.1.0 187 | - libpciaccess=0.18 188 | - libpng=1.6.45 189 | - libpq=17.4 190 | - libprotobuf=4.25.3 191 | - librdkit=2024.09.6 192 | - libre2-11=2023.09.01 193 | - libsqlite=3.48.0 194 | - libssh2=1.11.1 195 | - libstdcxx=14.2.0 196 | - libstdcxx-ng=14.2.0 197 | - libsystemd0=256.9 198 | - libtasn1=4.20.0 199 | - libthrift=0.19.0 200 | - libtiff=4.7.0 201 | - libtorch=2.4.0 202 | - libudev1=257.2 203 | - libunistring=0.9.10 204 | - libutf8proc=2.8.0 205 | - libuuid=2.38.1 206 | - libuv=1.50.0 207 | - libva=2.22.0 208 | - libvpx=1.13.1 209 | - libwebp=1.5.0 210 | - libwebp-base=1.5.0 211 | - libxcb=1.17.0 212 | - libxcrypt=4.4.36 213 | - libxkbcommon=1.7.0 214 | - libxml2=2.13.5 215 | - libxslt=1.1.39 216 | - libzlib=1.3.1 217 | - llvm-openmp=19.1.7 218 | - lz4-c=1.9.4 219 | - markdown-it-py=3.0.0 220 | - markupsafe=3.0.2 221 | - matplotlib=3.10.1 222 | - matplotlib-base=3.10.1 223 | - mccabe=0.7.0 224 | - mdurl=0.1.2 225 | - mkl=2023.2.0 226 | - mpc=1.3.1 227 | - mpfr=4.2.1 228 | - mpmath=1.3.0 229 | - multidict=6.1.0 230 | - multiprocess=0.70.16 231 | - munkres=1.1.4 232 | - mypy=1.15.0 233 | - mypy_extensions=1.0.0 234 | - mysql-common=9.0.1 235 | - mysql-libs=9.0.1 236 | - nccl=2.25.1.1 237 | - ncurses=6.5 238 | - nettle=3.9.1 239 | - networkx=3.4.2 240 | - nlohmann_json=3.11.3 241 | - nodeenv=1.9.1 242 | - numpy=2.2.3 243 | - ocl-icd=2.3.2 244 | - omegaconf=2.3.0 245 | - openbabel=3.1.1 246 | - opencl-headers=2024.10.24 247 | - openh264=2.3.1 248 | - openjpeg=2.5.3 249 | - openldap=2.6.9 250 | - openssl=3.4.1 251 | - optree=0.14.1 252 | - orc=2.0.1 253 | - p11-kit=0.24.1 254 | - packaging=24.2 255 | - pandas=2.2.3 256 | - pathspec=0.12.1 257 | - patsy=1.0.1 258 | - pcre2=10.44 259 | - pillow=11.1.0 260 | - pip=24.3.1 261 | - pixman=0.44.2 262 | - platformdirs=4.3.6 263 | - pre-commit=4.1.0 264 | - prometheus-cpp=1.2.4 265 | - propcache=0.2.1 266 | - psutil=6.1.1 267 | - pthread-stubs=0.4 268 | - pyarrow=16.1.0 269 | - pyarrow-core=16.1.0 270 | - pybind11=2.13.6 271 | - pybind11-global=2.13.6 272 | - pycairo=1.27.0 273 | - pycodestyle=2.12.1 274 | - pycparser=2.22 275 | - pyflakes=3.2.0 276 | - pygments=2.19.1 277 | - pyparsing=3.2.1 278 | - pyside6=6.8.1 279 | - pysocks=1.7.1 280 | - python=3.10.16 281 | - 
python-dateutil=2.9.0.post0 282 | - python-tzdata=2024.2 283 | - python-xxhash=3.5.0 284 | - python_abi=3.10 285 | - pytorch=2.4.0 286 | - pytorch-cuda=12.1 287 | - pytorch-mutex=1.0 288 | - pytz=2024.1 289 | - pyyaml=6.0.2 290 | - qhull=2020.2 291 | - qt6-main=6.8.1 292 | - rdkit=2024.09.6 293 | - rdma-core=55.0 294 | - re2=2023.09.01 295 | - readline=8.2 296 | - regex=2024.11.6 297 | - reportlab=4.2.5 298 | - requests=2.32.3 299 | - rich=13.9.4 300 | - rlpycairo=0.2.0 301 | - s2n=1.4.16 302 | - safetensors=0.5.3 303 | - scikit-learn=1.6.1 304 | - scipy=1.15.1 305 | - seaborn=0.13.2 306 | - seaborn-base=0.13.2 307 | - setuptools=75.8.0 308 | - six=1.17.0 309 | - sleef=3.8 310 | - smmap=5.0.0 311 | - snappy=1.2.1 312 | - sqlalchemy=2.0.37 313 | - statsmodels=0.14.4 314 | - svt-av1=1.4.1 315 | - sympy=1.13.3 316 | - tbb=2021.13.0 317 | - threadpoolctl=3.5.0 318 | - tk=8.6.13 319 | - tomli=2.2.1 320 | - torchaudio=2.4.0 321 | - torchvision=0.19.0 322 | - tornado=6.4.2 323 | - tqdm=4.67.1 324 | - types-pyyaml=6.0.12.20241230 325 | - typing-extensions=4.12.2 326 | - typing_extensions=4.12.2 327 | - tzdata=2025a 328 | - ukkonen=1.0.1 329 | - unicodedata2=16.0.0 330 | - urllib3=2.3.0 331 | - virtualenv=20.29.1 332 | - wayland=1.23.1 333 | - wayland-protocols=1.41 334 | - wheel=0.45.1 335 | - x264=1!164.3095 336 | - x265=3.5 337 | - xcb-util=0.4.1 338 | - xcb-util-cursor=0.1.5 339 | - xcb-util-image=0.4.0 340 | - xcb-util-keysyms=0.4.1 341 | - xcb-util-renderutil=0.3.10 342 | - xcb-util-wm=0.4.2 343 | - xkeyboard-config=2.43 344 | - xlrd=2.0.1 345 | - xorg-libice=1.1.2 346 | - xorg-libsm=1.2.5 347 | - xorg-libx11=1.8.10 348 | - xorg-libxau=1.0.12 349 | - xorg-libxcomposite=0.4.6 350 | - xorg-libxcursor=1.2.3 351 | - xorg-libxdamage=1.1.6 352 | - xorg-libxdmcp=1.1.5 353 | - xorg-libxext=1.3.6 354 | - xorg-libxfixes=6.0.1 355 | - xorg-libxi=1.8.2 356 | - xorg-libxrandr=1.5.4 357 | - xorg-libxrender=0.9.12 358 | - xorg-libxtst=1.2.5 359 | - xorg-libxxf86vm=1.1.6 360 | - xxhash=0.8.3 361 | - xz=5.6.3 362 | - xz-gpl-tools=5.6.3 363 | - xz-tools=5.6.3 364 | - yaml=0.2.5 365 | - yarl=1.18.3 366 | - zlib=1.3.1 367 | - zstandard=0.23.0 368 | - zstd=1.5.6 369 | - pip: 370 | - accelerate==1.4.0 371 | - huggingface-hub==0.29.2 372 | - synllama==0.1.0 373 | - tokenizers==0.19.1 374 | - transformers==4.44.2 -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.black] 2 | line-length = 119 3 | target-version = ['py310'] 4 | 5 | [tool.isort] 6 | extra_standard_library = "typing_extensions,mypy,mypy_extensions" 7 | profile = "black" 8 | 9 | [tool.autoflake] 10 | remove-all-unused-imports = true 11 | expand-star-imports = true 12 | ignore-init-module-imports = true 13 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name="synllama", 5 | version="0.1.0", 6 | packages=["synllama"], 7 | ) -------------------------------------------------------------------------------- /steps/step_10_calc_embedding.py: -------------------------------------------------------------------------------- 1 | # Preprocessing step 1: Extract metadata and create fingerprint index and reactant reaction matrices 2 | import pathlib, shutil 3 | from sklearn.cluster import KMeans 4 | from synllama.chem.mol import read_mol_file, 
FingerprintOption
5 | from synllama.chem.fpindex import FingerprintIndex
6 | from synllama.chem.matrix import ReactantReactionMatrix, ReactionContainer
7 | from synllama.chem.reaction import read_reaction_file
8 | from synllama.chem.smiles_tfidf import SmilesSimilaritySearch
9 | import numpy as np
10 | import pickle, click
11 | 
12 | # As noted in the README, the Enamine data needs to be downloaded separately. If you want to use the default data,
13 | # please download the data from the following link: https://enamine.net/building-blocks/building-blocks-catalog
14 | 
15 | # If you want to request these exact files, please contact me at kysun@berkeley.edu or open an issue on the GitHub repo.
16 | 
17 | _default_sdf_path = pathlib.Path("data/Enamine_Rush-Delivery_Building_Blocks-US_253345cmpd_20250212.sdf")
18 | _default_reaction_path = pathlib.Path("data/91_rxns/91_rxn_templates.txt")
19 | _default_data_folder = pathlib.Path("data/91_rxns/")
20 | _default_testing_data_path = pathlib.Path("data/13k_unseen_enamine_bbs.smi")
21 | _random_state = 0
22 | np.random.seed(_random_state) # for reproducibility of the test set bb if no testing data is provided
23 | 
24 | @click.command()
25 | @click.option(
26 |     "--data_folder",
27 |     type=click.Path(path_type=pathlib.Path),
28 |     default=_default_data_folder,
29 |     help="Path to the data folder."
30 | )
31 | @click.option(
32 |     "--bb_path",
33 |     type=click.Path(exists=True, path_type=pathlib.Path),
34 |     default=_default_sdf_path,
35 |     help="Path to the input building blocks SDF file."
36 | )
37 | @click.option(
38 |     "--rxn_template_path",
39 |     type=click.Path(exists=True, path_type=pathlib.Path),
40 |     default=_default_reaction_path,
41 |     help="Path to the input reaction templates file."
42 | )
43 | @click.option(
44 |     "--testing_data_path",
45 |     type=click.Path(exists=True, path_type=pathlib.Path),
46 |     default=_default_testing_data_path,
47 |     help="Path to the testing data file."
48 | )
49 | 
50 | def run_all_preprocessing(data_folder, bb_path, rxn_template_path, testing_data_path = None):
51 |     processed_folder = data_folder / "processed"
52 |     processed_folder.mkdir(parents=True, exist_ok=True)
53 |     testing_folder = data_folder / "testing_data"
54 |     testing_folder.mkdir(parents=True, exist_ok=True)
55 |     molecules = list(read_mol_file(bb_path))
56 |     print(f"Generating fingerprints for {bb_path}")
57 |     generate_morgan_fingerprints(molecules, processed_folder / "fpindex.pkl", processed_folder / "enamine_metadata.csv")
58 |     # print(f"Generating smiles embedding for {bb_path}")
59 |     # generate_smiles_embedding(molecules, data_folder / "smiles_embedding.pkl", smiles_vocab_path)
60 |     if testing_data_path is None:
61 |         print("Clustering fingerprints")
62 |         knn_clustering(processed_folder / "fpindex.pkl", testing_folder, n_clusters=128, random_state=_random_state)
63 |     else:
64 |         shutil.copy(testing_data_path, testing_folder / "test_bb.smi")
65 |     print("Creating reactant reaction matrix cache")
66 |     create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "all" / "reaction_matrix.pkl")
67 |     create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "train" / "reaction_matrix_train.pkl", testing_folder / "test_bb.smi")
68 |     create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "test" / "reaction_matrix_test.pkl", testing_folder / "test_bb.smi", test_only=True)
69 | 
70 | def generate_morgan_fingerprints(molecules, out, meta_data_path):
71 |     """Generate Morgan fingerprints from the specified SDF file and save the FingerprintIndex."""
72 |     # Define the fingerprint option
73 |     fp_option = FingerprintOption.morgan_for_building_blocks()
74 |     if meta_data_path:
75 |         import csv
76 |         with open(meta_data_path, 'w', newline='') as csvfile:
77 |             fieldnames = ['SMILES'] + list(molecules[0].meta_info.keys())
78 |             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
79 |             writer.writeheader()
80 |             for mol in molecules:
81 |                 row = {'SMILES': mol.smiles}
82 |                 row.update(mol.meta_info)
83 |                 writer.writerow(row)
84 |     # Generate fingerprints
85 |     fp_index = FingerprintIndex(molecules, fp_option)
86 |     out.parent.mkdir(parents=True, exist_ok=True)
87 |     fp_index.save(out)
88 | 
89 | # don't need this for now because we are using reaction template-based smiles embedding
90 | def generate_smiles_embedding(molecules, out, smiles_vocab_path):
91 |     """Generate smiles embedding from the specified SDF file and save the SmilesSimilaritySearch."""
92 |     smiles_tokens = [line.strip() for line in open(smiles_vocab_path)]
93 |     smiles_searcher = SmilesSimilaritySearch(token_list=smiles_tokens)
94 |     smiles_searcher.fit(molecules, save_ngram=True)
95 |     out.parent.mkdir(parents=True, exist_ok=True)
96 |     smiles_searcher.save(out)
97 | 
98 | def knn_clustering(fp_index_path, out, n_clusters=128, random_state=_random_state):
99 |     """Cluster the FingerprintIndex with k-means and save the SMILES of the molecules in cluster 0 to the output file as the held-out test set."""
100 |     fp_index = FingerprintIndex.load(fp_index_path)
101 |     fp = fp_index._fp
102 |     kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
103 |     kmeans.fit(fp)
104 |     for i in range(n_clusters):
105 |         print(f"Cluster {i} has {np.sum(kmeans.labels_ == i)} molecules")
106 |         cluster_idx = np.where(kmeans.labels_ == i)[0]
107 |         cluster_smiles = [fp_index.molecules[j].smiles for j in cluster_idx]
108 |         if i == 0:
109 |             cluster_out_path = out / "test_bb.smi"
110 |             with
open(cluster_out_path, "w") as f: 111 | for smi in cluster_smiles: 112 | f.write(smi + "\n") 113 | 114 | def create_reactant_reaction_matrix_cache(molecules, reaction_path, cache_path, excl_path = None, test_only= False): 115 | """Create a reactant reaction matrix cache for reaction generation.""" 116 | rxns = ReactionContainer(read_reaction_file(reaction_path)) 117 | if test_only and excl_path is None: 118 | raise ValueError("test_only is True but excl_path is None") 119 | if test_only: 120 | mols = list(read_mol_file(excl_path)) 121 | else: 122 | mols = molecules 123 | if excl_path is not None: 124 | excl_mols = list(read_mol_file(excl_path)) 125 | excl_smiles = {m.smiles for m in excl_mols} 126 | mols = [m for m in mols if m.smiles not in excl_smiles] 127 | m = ReactantReactionMatrix(mols, rxns) 128 | cache_path.parent.mkdir(parents=True, exist_ok=True) 129 | m.save(cache_path) 130 | 131 | if __name__ == "__main__": 132 | run_all_preprocessing() 133 | # old_bbs = pathlib.Path("data/Enamine_Rush-Delivery_Building_Blocks-US_243540cmpd_20240806.sdf") 134 | # new_bbs = pathlib.Path("data/Enamine_Rush-Delivery_Building_Blocks-US_253345cmpd_20250212.sdf") 135 | # old_mols = list(read_mol_file(old_bbs)) 136 | # new_mols = list(read_mol_file(new_bbs)) 137 | # old_smiles = {m.smiles for m in old_mols} 138 | # new_smiles_list = [] 139 | # for mol in new_mols: 140 | # if mol.smiles not in old_smiles: 141 | # new_smiles_list.append(mol.smiles) 142 | # with open("data/new_test_smiles_list.smi", "w") as f: 143 | # for smi in new_smiles_list: 144 | # f.write(smi + "\n") 145 | # molecules = list(read_mol_file("data/new_test_smiles_list.smi")) 146 | # rxn_template_path = pathlib.Path("data/91_rxns/reaction_templates_hb.txt") 147 | # processed_folder = pathlib.Path("data/91_rxns/") 148 | # create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "test" / f"reaction_matrix_test_new_enamine.pkl","data/new_test_smiles_list.smi", test_only=True) -------------------------------------------------------------------------------- /steps/step_11_generate_fpindex_smiles_tfidf.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | from pathlib import Path 3 | import click 4 | from synllama.chem.fpindex import FingerprintIndex 5 | from synllama.chem.smiles_tfidf import SmilesSimilaritySearch 6 | from synllama.chem.mol import FingerprintOption 7 | from synllama.chem.matrix import ReactantReactionMatrix 8 | 9 | _default_matrix_file = "data/91_rxns/processed/all/reaction_matrix.pkl" 10 | _default_output_dir = "data/91_rxns/rxn_embeddings" 11 | _default_token_list_path = "data/smiles_vocab.txt" 12 | 13 | @click.command() 14 | @click.option("--matrix_file", type=click.Path(exists=True, path_type=Path), default=_default_matrix_file) 15 | @click.option("--output_dir", type=click.Path(path_type=Path), default=_default_output_dir) 16 | @click.option("--token_list_path", type=click.Path(exists=True, path_type=Path), default=_default_token_list_path) 17 | 18 | def main(matrix_file: Path, output_dir: Path, token_list_path: Path): 19 | # Load the reactant-reaction matrix 20 | with open(matrix_file, 'rb') as f: 21 | reactant_reaction_matrix: ReactantReactionMatrix = pickle.load(f) 22 | 23 | # Create output directory if it doesn't exist 24 | output_dir = Path(output_dir) 25 | output_dir.mkdir(parents=True, exist_ok=True) 26 | 27 | # Dictionary to map reaction index to reaction SMARTS 28 | reaction_smarts_map = {} 29 | 30 | # Iterate over each reaction 
31 | for reaction_idx, reaction in enumerate(reactant_reaction_matrix.reactions): 32 | # Find reactants that can participate in this reaction 33 | reactant_indices = reactant_reaction_matrix.matrix[:, reaction_idx].nonzero()[0] 34 | reactants = [reactant_reaction_matrix.reactants[i] for i in reactant_indices] 35 | 36 | # Generate FingerprintIndex 37 | fp_option = FingerprintOption.morgan_for_building_blocks() 38 | fp_index = FingerprintIndex(molecules=reactants, fp_option=fp_option) 39 | 40 | # Save FingerprintIndex 41 | fp_index_file = output_dir / f"fpindex_{reaction_idx}.pkl" 42 | fp_index.save(fp_index_file) 43 | 44 | # Generate SmilesSimilaritySearch 45 | smiles_search = SmilesSimilaritySearch(token_list_path=token_list_path) 46 | smiles_search.fit(molecules=reactants) 47 | 48 | # Save SmilesSimilaritySearch 49 | smiles_search_file = output_dir / f"smiles_tfidf_{reaction_idx}.pkl" 50 | smiles_search.save(smiles_search_file) 51 | 52 | # Map reaction index to reaction SMARTS 53 | reaction_smarts_map[reaction_idx] = (reaction.smarts, len(reactants)) # Assuming `reaction` has a `smarts` attribute 54 | print(f"Processed reaction {reaction_idx}: {len(reactants)} reactants") 55 | 56 | # Save the reaction SMARTS map 57 | smarts_map_file = output_dir / "reaction_smarts_map.pkl" 58 | with open(smarts_map_file, 'wb') as f: 59 | pickle.dump(reaction_smarts_map, f) 60 | 61 | print("All reactions processed and SMARTS map saved.") 62 | 63 | if __name__ == "__main__": 64 | main() 65 | -------------------------------------------------------------------------------- /steps/step_20_generate_reactions.py: -------------------------------------------------------------------------------- 1 | # Preprocessing step 2: Generate prompt-response pairs of reactions for LLM fine-tuning. 
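# A sketch of one JSONL record this script writes (field values are illustrative;
# the exact prompt/response wording comes from TEMPLATE, REACTION_BASE_MAX2/3, and
# BB_BASE in synllama/llm/vars.py, imported below):
#   {"input": "... SMILES string: <target SMILES>",
#    "output": "{\"reactions\": [{\"reaction_template\": \"...\", \"reactants\": [...], \"product\": \"...\"}, ...], \"building_blocks\": [...]}"}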
2 | 3 | import pickle, random, json, os 4 | from copy import deepcopy 5 | from pathlib import Path 6 | from tqdm import tqdm 7 | import click 8 | import multiprocessing as mp 9 | from joblib import Parallel, delayed 10 | from collections import defaultdict 11 | from synllama.chem.matrix import ReactantReactionMatrix 12 | from synllama.chem.stack import create_stack 13 | from synllama.chem.reaction import Reaction 14 | from synllama.llm.vars import TEMPLATE, BB_BASE, REACTION_BASE_MAX2, REACTION_BASE_MAX3 15 | 16 | def rebuild_response(synthesis_route, rxn_mapping, max_reactants = 3): 17 | if max_reactants == 2: 18 | reaction_base = REACTION_BASE_MAX2 19 | elif max_reactants == 3: 20 | reaction_base = REACTION_BASE_MAX3 21 | else: 22 | raise ValueError(f"Invalid number of reactants: {max_reactants}") 23 | 24 | synthesis = synthesis_route.replace("\\", "\\\\").split(";")[::-1] # fix json escaping 25 | target_smiles = synthesis[0] 26 | rxn_positions = [i for i, s in enumerate(synthesis) if s.startswith("R")] 27 | bb_list = [target_smiles] 28 | rxn_list = [] 29 | 30 | for j, rxn_pos in enumerate(rxn_positions): 31 | product = synthesis[rxn_pos-1] 32 | rxn_idx = j+1 33 | if j+1 < len(rxn_positions): 34 | reactants = synthesis[rxn_pos+1:rxn_positions[j+1]] 35 | else: 36 | reactants = synthesis[rxn_pos+1:] 37 | reactants_padded = reactants + [""] * (max_reactants - len(reactants)) 38 | rxn_copy = deepcopy(reaction_base) 39 | rxn_copy = rxn_copy.replace('PRODUCT', product) 40 | rxn_copy = rxn_copy.replace("RXN_TEMPLATE", rxn_mapping[int(synthesis[rxn_pos][1:].split("_")[0])]) 41 | rxn_copy = rxn_copy.replace("REACTION_NUM", str(rxn_idx)) 42 | rxn_copy = rxn_copy.replace("REACTANT1", reactants_padded[-1]) 43 | rxn_copy = rxn_copy.replace("REACTANT2", reactants_padded[-2]) 44 | if max_reactants == 3: 45 | rxn_copy = rxn_copy.replace("REACTANT3", reactants_padded[-3]) 46 | rxn_list.append(rxn_copy) 47 | bb_list.remove(product) 48 | bb_list.extend(reactants) 49 | 50 | bb_list_formatted = [deepcopy(BB_BASE).replace("Building_Block", bb) for bb in bb_list] 51 | 52 | template_copy = deepcopy(TEMPLATE) 53 | template_copy['input'] = template_copy['input'].replace("SMILES_STRING", target_smiles) 54 | template_copy['output'] = template_copy['output'].replace("REACTIONS", ", ".join(rxn_list)) 55 | template_copy['output'] = template_copy['output'].replace("BUILDING_BLOCKS", ", ".join(bb_list_formatted)) 56 | output_dict = json.loads(template_copy['output']) 57 | return template_copy 58 | 59 | def generate_reaction_data(matrix: ReactantReactionMatrix, rxn_mapping, rxn_count, init_stack_weighted_ratio, prob_u_fp, max_num_reactions=5, max_num_atoms=80): 60 | stack = create_stack( 61 | matrix, 62 | rxn_count, 63 | max_num_reactions=max_num_reactions, 64 | max_num_atoms=max_num_atoms, 65 | init_stack_weighted_ratio=init_stack_weighted_ratio, 66 | prob_u_fp=prob_u_fp, 67 | ) 68 | rebuilt_response = rebuild_response(stack.get_action_string(), rxn_mapping) 69 | return rebuilt_response 70 | 71 | def generate_reaction_chunk(matrix, rxn_mapping, rxn_count, num_reactions, init_stack_weighted_ratio, prob_u_fp, max_num_reactions=5, max_num_atoms=80): 72 | reactions_dict = defaultdict(int) 73 | all_reactions = [] 74 | while len(all_reactions) < num_reactions: 75 | try: 76 | stack = create_stack( 77 | matrix, 78 | rxn_count, 79 | max_num_reactions=max_num_reactions, 80 | max_num_atoms=max_num_atoms, 81 | init_stack_weighted_ratio=init_stack_weighted_ratio, 82 | prob_u_fp=prob_u_fp, 83 | ) 84 | 
all_reactions.append(rebuild_response(stack.get_action_string(), rxn_mapping)) 85 | rxns = [r for r in stack.get_action_string().split(";") if r.startswith("R")] 86 | for rxn in rxns: 87 | reactions_dict[rxn] += 1 88 | except Exception as e: 89 | continue 90 | print(sorted(reactions_dict.items(), key=lambda x: int(x[0][1:]))) 91 | return all_reactions 92 | 93 | @click.command() 94 | @click.option("--matrix_path", type=click.Path(exists=True, path_type=Path), required=True, default="data/91_rxns/processed/test/reaction_matrix_test.pkl") 95 | @click.option("--rxn_mapping_path", type=click.Path(exists=True, path_type=Path), required=True, default="data/91_rxns/rxn_embeddings/reaction_smarts_map.pkl") 96 | @click.option("--prob_u_fp", type=click.Path(exists=True, path_type=Path), default=None) 97 | @click.option("--init_stack_weighted_ratio", type=float, required=True, default=0.8) 98 | @click.option("--name", default=None) 99 | @click.option("--num_reactions", type=int, required=True, default=2000000) 100 | @click.option("--batch_size", type=int, required=True, default=1000) 101 | @click.option("--write_for_benchmark", is_flag=True) 102 | 103 | def main(matrix_path, rxn_mapping_path, prob_u_fp, num_reactions, init_stack_weighted_ratio=0.8, name=None, batch_size=1000, write_for_benchmark=False): 104 | matrix: ReactantReactionMatrix = ReactantReactionMatrix.load(matrix_path) 105 | reaction_smarts_dict = pickle.load(open(rxn_mapping_path, "rb")) 106 | rxn_mapping = {k: v[0] for k, v in reaction_smarts_dict.items()} 107 | rxn_count = {k: v[1] for k, v in reaction_smarts_dict.items()} 108 | prob_u_fp = prob_u_fp 109 | num_cores = mp.cpu_count() 110 | num_batches = num_reactions // batch_size // num_cores 111 | remainder = num_reactions - num_batches * batch_size * num_cores 112 | if name is None: 113 | name = f"{num_reactions/1000000:.1f}m_reactions" 114 | 115 | # check if the file exists 116 | if os.path.exists(f"data/{name}.jsonl"): 117 | print(f"File {name}.jsonl already exists. 
Deleting to start fresh...") 118 | os.remove(f"data/{name}.jsonl") 119 | 120 | for batch_num in range(num_batches): 121 | with tqdm(total=batch_size, desc=f"Generating reactions batch {batch_num+1} of {num_batches}") as pbar: 122 | with open(f"data/{name}.jsonl", "a") as f: 123 | results = Parallel(n_jobs=num_cores)( 124 | delayed(generate_reaction_chunk)(matrix, rxn_mapping, rxn_count, batch_size, init_stack_weighted_ratio, prob_u_fp) 125 | for _ in range(num_cores) 126 | ) 127 | for result in results: 128 | for r in result: 129 | json.dump(r, f) 130 | f.write("\n") 131 | 132 | if remainder > 0: 133 | with tqdm(total=remainder, desc=f"Generating reactions batch {num_batches+1} of {num_batches}") as pbar: 134 | results = Parallel(n_jobs=num_cores)( 135 | delayed(generate_reaction_chunk)(matrix, rxn_mapping, rxn_count, remainder // num_cores + 1, init_stack_weighted_ratio, prob_u_fp) 136 | for _ in range(num_cores) 137 | ) 138 | results = [r for rr in results for r in rr if r is not None] 139 | results = results[:remainder] 140 | with open(f"data/{name}.jsonl", "a") as f: 141 | for result in results: 142 | json.dump(result, f) 143 | f.write("\n") 144 | 145 | if write_for_benchmark: 146 | with open(f"data/{name}.jsonl", "r") as f: 147 | reactions = [json.loads(line) for line in f] 148 | reactions_dict = {r['input'].split("SMILES string:")[1].strip(): [json.loads(r['output'])] for r in reactions} 149 | with open(f"data/{name}_benchmark.pkl", "wb") as f: 150 | pickle.dump(reactions_dict, f) 151 | with open(f"data/{name}.smi", "w") as f: 152 | for r in reactions: 153 | f.write(r['input'].split("SMILES string:")[1].strip() + "\n") 154 | 155 | if __name__ == "__main__": 156 | mp.set_start_method("spawn") 157 | main() 158 | -------------------------------------------------------------------------------- /steps/step_30_0_benchmark_filter_raw_output.py: -------------------------------------------------------------------------------- 1 | import pickle, argparse, os, glob, csv 2 | import pandas as pd 3 | import numpy as np 4 | from tqdm import tqdm 5 | from collections import defaultdict 6 | from synllama.chem.reaction import Reaction 7 | from synllama.chem.fpindex import FingerprintIndex, compute_fingerprints 8 | from synllama.chem.mol import FingerprintOption, Molecule 9 | import multiprocessing as mp 10 | 11 | def arrange_reactants_and_react_synllama(template, reactant_mols): 12 | rxn = Reaction(template) 13 | if len(reactant_mols) != rxn.num_reactants: 14 | return None, False 15 | if len(reactant_mols) == 1: 16 | product = rxn(reactant_mols) 17 | if len(product) == 0: 18 | return None, False 19 | elif len(reactant_mols) == 2: 20 | product = [] 21 | product.extend(rxn([reactant_mols[0], reactant_mols[1]])) 22 | product.extend(rxn([reactant_mols[1], reactant_mols[0]])) 23 | if len(product) == 0: 24 | return None, False 25 | elif len(reactant_mols) == 3: 26 | product = [] 27 | product.extend(rxn([reactant_mols[0], reactant_mols[1], reactant_mols[2]])) 28 | product.extend(rxn([reactant_mols[0], reactant_mols[2], reactant_mols[1]])) 29 | product.extend(rxn([reactant_mols[1], reactant_mols[0], reactant_mols[2]])) 30 | product.extend(rxn([reactant_mols[1], reactant_mols[2], reactant_mols[0]])) 31 | product.extend(rxn([reactant_mols[2], reactant_mols[0], reactant_mols[1]])) 32 | product.extend(rxn([reactant_mols[2], reactant_mols[1], reactant_mols[0]])) 33 | if len(product) == 0: 34 | return None, False 35 | else: 36 | return None, False 37 | return product, True 38 | 39 | def 
filter_raw_output(llama_output, reaction_idx_map):
40 | 
41 |     successful_synthesis = defaultdict(list)
42 |     for product_smiles, example_data in tqdm(llama_output.items()):
43 |         if type(example_data) == str: continue
44 |         for output in example_data:
45 |             if type(output) == str: continue
46 |             try:
47 |                 assert 'reactions' in output and 'building_blocks' in output
48 |                 reactions = output['reactions']
49 |                 building_blocks = output['building_blocks']
50 |                 reactant_stack = []
51 |                 reaction_strings = []
52 |                 reactant_stack.append(product_smiles)
53 |                 reaction_strings.append(product_smiles)
54 | 
55 |                 for reaction in reactions:
56 |                     assert 'reaction_template' in reaction and 'reactants' in reaction and 'product' in reaction
57 |                     product = reaction['product']
58 |                     assert product in reactant_stack
59 |                     reactant_stack.remove(product)
60 |                     reaction_strings.remove(product)
61 |                     reaction_strings.append(product)
62 |                     template = reaction['reaction_template'].split('<rxn>')[1].split('</rxn>')[0]  # NOTE: the original tag literals were lost when angle-bracket markup was stripped from this listing; '<rxn>'/'</rxn>' are placeholders for the tags defined in synllama/llm/vars.py
63 |                     assert template in reaction_idx_map
64 |                     reaction_strings.append(f"R{reaction_idx_map[template]}")
65 |                     reactants = reaction['reactants']
66 |                     reactants = [reactant.split("<smiles>")[-1].split("</smiles>")[0] if "<smiles>" in reactant else reactant for reactant in reactants]  # '<smiles>'/'</smiles>' are likewise placeholder tags
67 |                     reactant_stack.extend(reactants)
68 |                     reactant_mols = []
69 |                     for reactant in reactants:
70 |                         if reactant == '': continue
71 |                         mol = Molecule(reactant, source="smiles")
72 |                         if not mol.is_valid:
73 |                             raise ValueError(f"Invalid molecule {reactant}")
74 |                         reactant_mols.append(mol)
75 |                         reaction_strings.append(reactant)
76 |                     product_mol = Molecule(product, source="smiles")
77 |                     if not product_mol.is_valid:
78 |                         raise ValueError(f"Invalid molecule {product}")
79 |                     product_from_rxn, matched = arrange_reactants_and_react_synllama(template, reactant_mols)
80 |                     assert matched
81 |                     product_from_rxn = [prod.csmiles for prod in product_from_rxn if prod is not None]
82 |                     assert product_mol.csmiles in product_from_rxn
83 | 
84 |                 bbs = []
85 |                 for bb in building_blocks:
86 |                     bb_clean = bb.split("<smiles>")[-1].split("</smiles>")[0]  # placeholder tags (see note above)
87 |                     assert bb_clean in reactant_stack
88 |                     reactant_stack.remove(bb_clean)
89 |                     bbs.append(bb_clean)
90 | 
91 |                 successful_synthesis[product_smiles].append({
92 |                     "reaction_strings": ";".join(reaction_strings[::-1]),
93 |                     "bbs": bbs,
94 |                 })
95 | 
96 |             except Exception as e:
97 |                 continue
98 | 
99 |     return successful_synthesis
100 | 
101 | def check_bb_in_enamine(args):
102 |     bb, fp_searcher = args
103 |     fingerprints = Molecule(bb).get_fingerprint(FingerprintOption.morgan_for_building_blocks(), as_bitvec=False)
104 |     fp_searched_results = fp_searcher.query_single(np.array([fingerprints]), k=10)
105 |     fp_searched_mols = [result.molecule for result in fp_searched_results]
106 |     return np.max([Molecule(bb).sim(mol, FingerprintOption.morgan_for_tanimoto_similarity()) for mol in fp_searched_mols])
107 | 
108 | def check_bbs_in_enamine_parallel(bbs, fp_searcher, num_cores):
109 | 
110 |     bb_mols = [Molecule(bb, source="smiles") for bb in bbs]
111 |     fingerprints = compute_fingerprints(bb_mols, FingerprintOption.morgan_for_building_blocks(), batch_size=1024)
112 |     fp_searched_results = fp_searcher.query(fingerprints, k=10)
113 |     bbs_similarity = []
114 |     for bb, result in zip(bbs, fp_searched_results):
115 |         bbs_similarity.append(np.max([Molecule(bb).sim(r.molecule, FingerprintOption.morgan_for_tanimoto_similarity()) for r in result]))
116 | 
117 |     # Create a dictionary with bb as the key and its similarity score as the value
118 |     bb_similarity_dict = {bb: similarity for bb, similarity in
zip(bbs, bbs_similarity)} 119 | return bb_similarity_dict 120 | 121 | def convert_smiles_dict(successful_synthesis, save_folder, file_name, fp_searcher, num_cores): 122 | 123 | if not os.path.exists(save_folder): 124 | os.makedirs(save_folder, exist_ok=True) 125 | # collect all bbs to search in enamine 126 | all_bbs = [] 127 | for _, value in successful_synthesis.items(): 128 | for v in value: 129 | all_bbs.extend(v['bbs']) 130 | all_bbs = list(set(all_bbs)) 131 | bb_similarity_dict = check_bbs_in_enamine_parallel(all_bbs, fp_searcher, num_cores) 132 | 133 | for _, value in successful_synthesis.items(): 134 | for v in value: 135 | v['bbs_similarity'] = [bb_similarity_dict[bb] for bb in v['bbs']] 136 | v['bbs_not_in_enamine'] = [bb for bb, sim in zip(v['bbs'], v['bbs_similarity']) if sim < 1] 137 | v['bbs_in_enamine'] = [bb for bb, sim in zip(v['bbs'], v['bbs_similarity']) if sim == 1] 138 | # save the successful_synthesis to a pickle file 139 | with open(os.path.join(save_folder, f"{file_name}_successful_synthesis.pkl"), "wb") as f: 140 | pickle.dump(successful_synthesis, f) 141 | 142 | all_bbs_not_in_enamine = [] 143 | all_bbs_in_enamine = [] 144 | for key, value in successful_synthesis.items(): 145 | for v in value: 146 | bbs_not_in_enamine = v['bbs_not_in_enamine'] 147 | all_bbs_not_in_enamine.extend(bbs_not_in_enamine) 148 | bbs_in_enamine = v['bbs_in_enamine'] 149 | all_bbs_in_enamine.extend(bbs_in_enamine) 150 | 151 | smiles_list = list(set(all_bbs_not_in_enamine)) 152 | file_count = 1 153 | for i in range(0, len(smiles_list), 10000): 154 | with open(os.path.join(save_folder, f"{file_name}_successful_bbs_not_in_enamine_{file_count}.txt"), "w") as f: 155 | f.write("\n".join(smiles_list[i:i+10000])) 156 | file_count += 1 157 | 158 | smiles_list = list(set(all_bbs_in_enamine)) 159 | file_count = 1 160 | for i in range(0, len(smiles_list), 10000): 161 | with open(os.path.join(save_folder, f"{file_name}_successful_bbs_in_enamine_{file_count}.txt"), "w") as f: 162 | f.write("\n".join(smiles_list[i:i+10000])) 163 | file_count += 1 164 | 165 | def calc_benchmark_rxn(llama_output, reaction_idx_map): 166 | # check the correctness in general with rdkit functions 167 | total_trials = len(llama_output) * len([k for k in llama_output.values() if type(k) != str][0]) 168 | successful_trials = 0 169 | template_obedience = defaultdict(int) 170 | reactant_matched = defaultdict(int) 171 | product_obedience = defaultdict(int) 172 | bb_obedience = defaultdict(list) 173 | total_reactions = 0 174 | successful_reactions = 0 175 | product_not_in_reactant_stack = 0 176 | invalid_smiles = [] 177 | total_molecules = 0 178 | failed_structured_output = [] 179 | template_no_rxn_tag = 0 180 | failed_cases = [] # template, reactants, product 181 | total_success_formats = defaultdict(int) 182 | total_success_reactions = defaultdict(int) 183 | 184 | for product_smiles, example_data in tqdm(llama_output.items()): 185 | if type(example_data) == str: 186 | print(example_data) 187 | continue 188 | format_success = True 189 | reaction_success = True 190 | for output in example_data: 191 | if 'reactions' not in output or 'building_blocks' not in output: 192 | failed_structured_output.append(output) 193 | format_success = False 194 | continue 195 | reactions = output['reactions'] 196 | building_blocks = output['building_blocks'] 197 | reactant_stack = [] 198 | reactant_stack.append(product_smiles) 199 | successful_trials += 1 200 | 201 | for reaction in reactions: 202 | # Extract the reaction template between tags 203 | if 
'reaction_template' not in reaction or 'reactants' not in reaction or 'product' not in reaction:
204 |                     successful_trials -= 1
205 |                     format_success = False
206 |                     failed_structured_output.append(reaction)
207 |                     break
208 |                 try:
209 |                     template = reaction['reaction_template'].split('<rxn>')[1].split('</rxn>')[0]  # NOTE: placeholder tags; the original literals were lost when angle-bracket markup was stripped from this listing (cf. the template_no_rxn_tag counter below)
210 |                 except Exception:
211 |                     template_no_rxn_tag += 1
212 |                     successful_trials -= 1
213 |                     format_success = False
214 |                     break
215 |                 total_reactions += 1
216 |                 if template not in reaction_idx_map:
217 |                     template_obedience['not_in_template'] += 1
218 |                     print(f"Template {template} not found in reaction_templates")
219 |                     failed_cases.append((reaction, "template"))
220 |                     format_success = False
221 |                     continue
222 |                 else:
223 |                     template_obedience[template] += 1
224 |                 # check if reactants can form the product through the reaction template
225 |                 reactants = reaction['reactants']
226 |                 reactants = [reactant.split("<smiles>")[-1].split("</smiles>")[0] if "<smiles>" in reactant else reactant for reactant in reactants]  # placeholder tags (see note above)
227 |                 reactant_stack.extend(reactants)
228 |                 product = reaction['product']
229 |                 if product not in reactant_stack:
230 |                     product_not_in_reactant_stack += 1
231 |                 else:
232 |                     reactant_stack.remove(product)
233 |                 reactant_mols = []
234 |                 total_molecules += len(reactants)
235 |                 for reactant in reactants:
236 |                     if not Molecule(reactant, source="smiles").is_valid:
237 |                         invalid_smiles.append(reactant)
238 |                     elif reactant == '':
239 |                         total_molecules -= 1
240 |                         continue
241 |                     else:
242 |                         reactant_mols.append(Molecule(reactant, source="smiles"))
243 |                 product_mol = Molecule(product, source="smiles")
244 |                 total_molecules += 1
245 |                 if not product_mol.is_valid:
246 |                     invalid_smiles.append(product)
247 |                 product_from_rxn, matched = arrange_reactants_and_react_synllama(template, reactant_mols)
248 |                 if template in reaction_idx_map:
249 |                     reactant_matched[template] += int(matched)
250 |                 if product_from_rxn is None:
251 |                     failed_cases.append((reaction, "reactants"))
252 |                     print(f"Reactants {reactants} cannot react through template {template}")
253 |                     reaction_success = False
254 |                     continue
255 |                 product_from_rxn = [prod.csmiles for prod in product_from_rxn]
256 |                 successful_reactions += 1
257 |                 if not product_mol.is_valid or product_mol.csmiles not in product_from_rxn:
258 |                     failed_cases.append((reaction, "product"))
259 |                     print(f"Product {product} not in product_from_rxn {product_from_rxn}")
260 |                     reaction_success = False
261 |                 else:
262 |                     product_obedience[template] += 1
263 | 
264 |             bb_count = 0
265 |             for bb in building_blocks:
266 |                 bb_clean = bb.split("<smiles>")[-1].split("</smiles>")[0]  # placeholder tags (see note above)
267 |                 if bb_clean not in reactant_stack:
268 |                     print(f"Building block {bb_clean} not found in reactant stack")
269 |                     format_success = False
270 |                 else:
271 |                     bb_count += 1
272 |             if len(building_blocks) > 0:
273 |                 bb_obedience[product_smiles].append(bb_count / len(building_blocks))
274 |             else:
275 |                 bb_obedience[product_smiles].append(1)
276 | 
277 |         bb_obedience[product_smiles] = sum(bb_obedience[product_smiles]) / len(bb_obedience[product_smiles]) if len(bb_obedience[product_smiles]) > 0 else 1
278 |         total_success_formats[product_smiles] = format_success
279 |         total_success_reactions[product_smiles] = reaction_success and format_success
280 | 
281 |     stats_rxn = {
282 |         "total_trials": total_trials,
283 |         "failed_structured_output": len(failed_structured_output),
284 |         "template_no_rxn_tag": template_no_rxn_tag,
285 |         "valid_responses": round(successful_trials / total_trials * 100, 2),
286 |         "valid_smiles": round((1 - len(invalid_smiles) / total_molecules) * 100, 2),
287 | 
"recycled_bbs": round(sum(bb_obedience.values()) / len(bb_obedience) * 100, 2), 288 | "template_memorization": round((1 - template_obedience['not_in_template'] / total_reactions) * 100, 2), 289 | "matched_reactants": round(successful_reactions / total_reactions * 100, 2), 290 | "good_products": round(sum(product_obedience.values()) / successful_reactions * 100, 2), 291 | "total_success_formats": round(sum(total_success_formats.values()) / len(llama_output) * 100, 2), 292 | "total_success_reactions": round(sum(total_success_reactions.values()) / len(llama_output) * 100, 2), 293 | } 294 | return stats_rxn, failed_structured_output 295 | 296 | def main(): 297 | parser = argparse.ArgumentParser() 298 | parser.add_argument("--llama_folder", type=str, default = "../synllama-data/results/table_2_syn_planning_91rxns") 299 | parser.add_argument("--save_folder", type=str, default=None) 300 | parser.add_argument("--raw_output_only", action="store_true") 301 | parser.add_argument("--benchmark_only", action="store_true") 302 | parser.add_argument("--rxn_mapping_path", type=str, default="../synllama-data/inference/reconstruction/91rxns/rxn_embeddings/reaction_smarts_map.pkl") 303 | parser.add_argument("--fp_searcher_path", type=str, default="../synllama-data/inference/reconstruction/91rxns/processed/fpindex.pkl") 304 | args = parser.parse_args() 305 | 306 | reaction_smarts_dict = pickle.load(open(args.rxn_mapping_path, "rb")) 307 | reaction_idx_map = {v[0]: k for k, v in reaction_smarts_dict.items()} 308 | if args.save_folder is None: 309 | args.save_folder = os.path.join(args.llama_folder, "synllama_reconstruct") 310 | os.makedirs(args.save_folder, exist_ok=True) 311 | fp_searcher = FingerprintIndex.load(args.fp_searcher_path) 312 | 313 | file_list = glob.glob(os.path.join(args.llama_folder, "*.pkl")) 314 | all_data = [] 315 | for file in file_list: 316 | file_name = file.split("/")[-1][:-4] 317 | with open(file, "rb") as f: 318 | llama_output = pickle.load(f) 319 | if not args.raw_output_only: 320 | stats_rxn, failed_cases = calc_benchmark_rxn(llama_output, reaction_idx_map) 321 | combined_stats = {**stats_rxn} 322 | combined_stats['file_name'] = file_name # Add the file name as an index 323 | # with open(f"{args.llama_folder}/failed_cases_{file_name}.pkl", "wb") as f: 324 | # pickle.dump(failed_cases, f) 325 | all_data.append(combined_stats) 326 | if not args.benchmark_only: 327 | successful_synthesis = filter_raw_output(llama_output, reaction_idx_map) 328 | if not args.raw_output_only: 329 | combined_stats['total_success_molecules'] = len(successful_synthesis) 330 | convert_smiles_dict(successful_synthesis, args.save_folder, file_name, fp_searcher, mp.cpu_count() // 2) 331 | 332 | if not args.raw_output_only: 333 | all_keys = ['file_name', 'total_trials', 'valid_responses', 'template_memorization', 'recycled_bbs', 'valid_smiles', 'matched_reactants', 'good_products', 'total_success_formats', 'total_success_reactions'] 334 | if not args.benchmark_only: 335 | all_keys.append('total_success_molecules') 336 | df = pd.DataFrame(all_data, columns=all_keys) 337 | df.sort_values(by="file_name", ascending=True, inplace=True) 338 | df.to_csv(os.path.join(args.save_folder, "llm_benchmark_stats.csv"), index=False) 339 | 340 | if __name__ == "__main__": 341 | main() 342 | 343 | -------------------------------------------------------------------------------- /steps/step_30_1_molport_raw_reconstruct.py: -------------------------------------------------------------------------------- 1 | # This script is used to 
reconstruct the raw output from MolPort
2 | 
3 | import os, sys, argparse, glob
4 | import pickle
5 | import pandas as pd
6 | from collections import defaultdict
7 | 
8 | # Once the raw output (*_bbs_1.txt & *_successful_synthesis.pkl) is generated, please first go to Molport (https://www.molport.com/shop/swl-step-1)
9 | # to do a "list search" for available building blocks and then run this script to reconstruct the raw output.
10 | 
11 | # Upon the completion of the Molport search, please download the list search results under the "Selected Items" column.
12 | # This file contains the list of building blocks that are available in the market based on the amount requested.
13 | 
14 | # Once the file is downloaded, please rename it to "*_molport_ls.xls" (the extension this script looks for) and put it in the same folder as the raw output files.
15 | # Then, run this script to reconstruct the raw output.
16 | 
17 | def extract_best_csv(total_reconstruction, save_path):
18 |     df = pd.DataFrame(columns = ["target","smiles","score","synthesis","num_steps","scf_sim","pharm2d_sim","rdkit_sim"])
19 |     for k, v in total_reconstruction.items():
20 |         item = v[0]
21 |         df = pd.concat([df, pd.DataFrame.from_dict({
22 |             "target": k,
23 |             "smiles": k,
24 |             "score": 1.0,
25 |             "synthesis": item['synthesis'],
26 |             "num_steps": sum(["R" in r for r in item['synthesis'].split(";")]),
27 |             "scf_sim": 1.0,
28 |             "pharm2d_sim": 1.0,
29 |             "rdkit_sim": 1.0
30 |         }, orient="index").T])
31 |     df.to_csv(save_path, index=False)
32 |     return df
33 | 
34 | def find_synllama_reconstruction(success_raw_syn_path, molport_ls_path):
35 |     successful_synthesis = pickle.load(open(success_raw_syn_path, "rb"))
36 |     search_results = pd.read_excel(molport_ls_path)
37 |     found_bbs = search_results['Search Criteria'].tolist()
38 |     enamine_reconstruction = defaultdict(list)
39 |     non_enamine_reconstruction = defaultdict(list)
40 |     total_reconstruction = defaultdict(list)
41 |     total_building_blocks = 0
42 |     non_enamine_building_blocks = 0
43 |     for k, v in successful_synthesis.items():
44 |         for item in v:
45 |             total_building_blocks += len(item['bbs'])
46 |             non_enamine_building_blocks += len(item['bbs_not_in_enamine'])
47 |             if all(bb in found_bbs for bb in item['bbs_not_in_enamine']):
48 |                 total_reconstruction[k].append({
49 |                     "bbs": item['bbs'],
50 |                     "synthesis": item['reaction_strings']
51 |                 })
52 |                 if len(item['bbs_not_in_enamine']) > 0:
53 |                     non_enamine_reconstruction[k].append({
54 |                         "bbs": item['bbs'],
55 |                         "synthesis": item['reaction_strings']
56 |                     })
57 |                 else:
58 |                     enamine_reconstruction[k].append({
59 |                         "bbs": item['bbs'],
60 |                         "synthesis": item['reaction_strings']
61 |                     })
62 |     enamine_save_path = success_raw_syn_path.replace("_successful_synthesis.pkl", "_enamine_synllama_reconstruct.csv")
63 |     extract_best_csv(enamine_reconstruction, enamine_save_path)
64 |     non_enamine_save_path = success_raw_syn_path.replace("_successful_synthesis.pkl", "_non_enamine_synllama_reconstruct.csv")
65 |     extract_best_csv(non_enamine_reconstruction, non_enamine_save_path)
66 |     all_synllama_reconstruct_path = success_raw_syn_path.replace("_successful_synthesis.pkl", "_all_synllama_reconstruct.csv")
67 |     extract_best_csv(total_reconstruction, all_synllama_reconstruct_path)
68 |     return len(set(total_reconstruction.keys())), len(set(enamine_reconstruction.keys())), len(set(non_enamine_reconstruction.keys())), total_building_blocks, non_enamine_building_blocks
69 | 
70 | def main():
71 |     parser = argparse.ArgumentParser()
72 |     parser.add_argument("--llama_folder", type=str, 
default="../synllama-data/results/table_2_syn_planning_91rxns") 73 | args = parser.parse_args() 74 | 75 | raw_output_folder = os.path.join(args.llama_folder, "synllama_reconstruct") 76 | raw_output_files = glob.glob(os.path.join(raw_output_folder, "*_successful_synthesis.pkl")) 77 | for raw_output_file in raw_output_files: 78 | molport_ls_path = raw_output_file.replace("_successful_synthesis.pkl", "_molport_ls.xls") 79 | synllama_reconstruct, enamine_synllama_reconstruct, non_enamine_synllama_reconstruct, total_building_blocks, non_enamine_building_blocks = find_synllama_reconstruction(raw_output_file, molport_ls_path) 80 | print(f"{raw_output_file} has {synllama_reconstruct} total successful syntheses") 81 | print(f"{raw_output_file} has {enamine_synllama_reconstruct} enamine successful syntheses") 82 | print(f"{raw_output_file} has {non_enamine_synllama_reconstruct} non-enamine successful syntheses") 83 | print(f"{raw_output_file} has {total_building_blocks} total building blocks") 84 | print(f"{raw_output_file} has {non_enamine_building_blocks} non-enamine building blocks") 85 | print(f"{raw_output_file} has {(1 - non_enamine_building_blocks / total_building_blocks)*100:.2f}% enamine building blocks") 86 | if __name__ == "__main__": 87 | main() 88 | -------------------------------------------------------------------------------- /steps/step_31_enamine_reconstruct.py: -------------------------------------------------------------------------------- 1 | import pickle, argparse, copy, glob, os 2 | import multiprocessing as mp 3 | import numpy as np 4 | import pandas as pd 5 | from tqdm import tqdm 6 | from rdkit import Chem 7 | 8 | from synllama.chem.smiles_tfidf import SmilesSimilaritySearch 9 | from synllama.chem.fpindex import FingerprintIndex 10 | from synllama.chem.mol import FingerprintOption, Molecule 11 | from synllama.chem.reaction import Reaction 12 | from synllama.chem.smiles_tfidf import find_closest_match, string_similarity 13 | from synllama.chem.stack import Stack 14 | 15 | def load_results(file_path): 16 | """Load the results from a pickle file.""" 17 | with open(file_path, "rb") as f: 18 | return pickle.load(f) 19 | 20 | def analyze_results(result_file_path, total_num_mols, top_n_rows = 1): 21 | """Perform analysis on the reconstruction results.""" 22 | file_name = result_file_path[:-4].split("/")[-1] 23 | results = load_results(result_file_path) 24 | max_similarity = [] 25 | total_failure_rate = [] 26 | total_reconstruction_rate = [] 27 | failure_rate_within_group = [] 28 | reconstruction_rate_within_group = [] 29 | scf_sim_all = [] 30 | pharm2d_sim_all = [] 31 | rdkit_sim_all = [] 32 | scf_sim_no_reconstruction = [] 33 | pharm2d_sim_no_reconstruction = [] 34 | rdkit_sim_no_reconstruction = [] 35 | average_number_of_steps = [] 36 | morgan_no_reconstruction = [] 37 | 38 | max_rows_df = pd.DataFrame() 39 | 40 | for df in results: 41 | # Calculate average maximum similarity 42 | max_similarity.append(df['score'].max()) 43 | max_row = df.loc[[df['score'].idxmax()]] if top_n_rows == 1 else df.drop_duplicates(subset=['smiles']).nlargest(top_n_rows, 'score') 44 | # remove response_num column 45 | if 'response_num' in max_row.columns: max_row = max_row.drop(columns=['response_num']) 46 | max_rows_df = pd.concat([max_rows_df, max_row]) 47 | failure_rate_within_group.append(df['score'].isna().mean()) 48 | total_failure_rate.append(all(df['score'].isna())) 49 | # Calculate reconstruction rate (where similarity == 1) 50 | reconstruction_rate_within_group.append((df['score'] == 
1).mean()) 51 | total_reconstruction_rate.append(any(df['score'] == 1)) 52 | scf_sim_all.append(max_row['scf_sim'].values[0] if not max_row['scf_sim'].isna().any() else np.nan) 53 | pharm2d_sim_all.append(max_row['pharm2d_sim'].values[0] if not max_row['pharm2d_sim'].isna().any() else np.nan) 54 | rdkit_sim_all.append(max_row['rdkit_sim'].values[0] if not max_row['rdkit_sim'].isna().any() else np.nan) 55 | synthesis_steps = max_row['num_steps'].values[0] 56 | average_number_of_steps.append(synthesis_steps) 57 | if df['score'].max() < 1: 58 | morgan_no_reconstruction.append(df['score'].max()) 59 | scf_sim_no_reconstruction.append(max_row['scf_sim'].values[0] if not max_row['scf_sim'].isna().any() else np.nan) 60 | pharm2d_sim_no_reconstruction.append(max_row['pharm2d_sim'].values[0] if not max_row['pharm2d_sim'].isna().any() else np.nan) 61 | rdkit_sim_no_reconstruction.append(max_row['rdkit_sim'].values[0] if not max_row['rdkit_sim'].isna().any() else np.nan) 62 | result_file_folder = os.path.dirname(result_file_path) 63 | max_rows_df.to_csv(os.path.join(result_file_folder, f"{file_name}_enamine_reconstruct.csv"), index=False) 64 | 65 | return { 66 | "file_name": file_name, 67 | "max_similarity": np.mean(max_similarity), 68 | "total_failure_rate %": round((1 - (len(results) - np.sum(total_failure_rate)) / total_num_mols) * 100, 2), 69 | "total_reconstruction_rate %": round((np.sum(total_reconstruction_rate) / total_num_mols) * 100, 2), 70 | "scf_sim_including_reconstruction": np.nanmean(scf_sim_all), 71 | "pharm2d_sim_including_reconstruction": np.nanmean(pharm2d_sim_all), 72 | "avg_rxn_steps": np.nanmean(average_number_of_steps), 73 | "morgan_no_reconstruction": np.nanmean(morgan_no_reconstruction), 74 | "scf_sim_no_reconstruction": np.nanmean(scf_sim_no_reconstruction), 75 | "pharm2d_sim_no_reconstruction": np.nanmean(pharm2d_sim_no_reconstruction), 76 | } 77 | 78 | def similarity_score(product_template, stack_prod_smiles): 79 | if not Chem.MolFromSmiles(product_template): 80 | return string_similarity(product_template, stack_prod_smiles) 81 | else: 82 | return Molecule(product_template).sim(Molecule(stack_prod_smiles), FingerprintOption.morgan_for_tanimoto_similarity()) 83 | 84 | def get_top_k_smiles(input_smiles, smiles_searcher, fp_searcher, k=10): 85 | """ 86 | get the top k smiles from the smiles_searcher and fp_searcher. 87 | 88 | Args: 89 | input_smiles (str): the smiles of the input molecule 90 | smiles_searcher (SmilesSimilaritySearch): the smiles searcher 91 | fp_searcher (FingerprintIndex): the fingerprint searcher 92 | k (int, optional): the number of top smiles to return. Defaults to 10. 
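    Returns:
        list[Molecule]: deduplicated candidate building blocks pooled from the
            string-similarity search and, when the input SMILES is valid, the
            fingerprint search as well.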
93 | """ 94 | # check if smiles is valid 95 | input_mol = Chem.MolFromSmiles(input_smiles) 96 | if input_mol is None: 97 | searched_smiles = smiles_searcher.query(input_smiles, k=k*2) 98 | results = [result.molecule.smiles for result in searched_smiles] 99 | result_mols = [Molecule(s, source="smiles") for s in results] 100 | else: 101 | searched_smiles = smiles_searcher.query(input_smiles, k=k) 102 | results = [result.molecule.smiles for result in searched_smiles] 103 | result_mols = [Molecule(s, source="smiles") for s in results] 104 | fingerprints = Molecule(input_smiles).get_fingerprint(FingerprintOption.morgan_for_building_blocks(), as_bitvec=False) 105 | fp_searched_results = fp_searcher.query_single(np.array([fingerprints]), k=k) 106 | results.extend([result.molecule.smiles for result in fp_searched_results]) 107 | result_mols.extend([Molecule(s, source="fp") for s in results]) 108 | return list(set(result_mols)) 109 | 110 | def match_two_reactants(reactant1_list, reactant2_list, rxn, continue_rxn = False): 111 | valid_combinations = [] 112 | for reactant1 in reactant1_list: 113 | for reactant2 in reactant2_list: 114 | reactant_combo1 = [reactant1, reactant2] 115 | reactant_combo2 = [reactant2, reactant1] 116 | if rxn(reactant_combo1) or rxn(reactant_combo2): 117 | if continue_rxn: 118 | valid_combinations.append(reactant2) 119 | else: 120 | valid_combinations.append(reactant_combo1) 121 | return valid_combinations 122 | 123 | def match_three_reactants(reactant1_list, reactant2_list, reactant3_list, rxn, continue_rxn = False): 124 | valid_combinations = [] 125 | for reactant1 in reactant1_list: 126 | for reactant2 in reactant2_list: 127 | for reactant3 in reactant3_list: 128 | reactant_combo1 = [reactant1, reactant2, reactant3] 129 | reactant_combo2 = [reactant1, reactant3, reactant2] 130 | reactant_combo3 = [reactant2, reactant1, reactant3] 131 | reactant_combo4 = [reactant2, reactant3, reactant1] 132 | reactant_combo5 = [reactant3, reactant1, reactant2] 133 | reactant_combo6 = [reactant3, reactant2, reactant1] 134 | if rxn(reactant_combo1) or rxn(reactant_combo2) or rxn(reactant_combo3) or rxn(reactant_combo4) or rxn(reactant_combo5) or rxn(reactant_combo6): 135 | if continue_rxn: 136 | valid_combinations.append([reactant2, reactant3]) 137 | else: 138 | valid_combinations.append(reactant_combo1) 139 | return valid_combinations 140 | 141 | def reconstruct_single_rxn(smiles_to_search, product_template, smiles_searcher, fp_searcher, template, rxn_idx, stacks = None, k=5, n_stacks=25, product_limit = 3): 142 | """ 143 | Reconstruct a single reaction from a list of building blocks and reactants. 144 | 145 | Args: 146 | smiles_list (list): a list of tuples of (smiles, is_building_block). 147 | product_template (str): the product template. 148 | smiles_searcher (SmilesSimilaritySearch): the smiles searcher. 149 | fp_searcher (FingerprintIndex): the fingerprint searcher. 150 | template (str): the reaction template. 151 | rxn_idx (int): the reaction index. 152 | stack (Stack): the stack to push the reactants. 153 | k (int, optional): the number of top smiles to return. Defaults to 10. 
154 | """ 155 | # check if reaction template is in the reaction_templates 156 | rxn = Reaction(template) 157 | new_stacks = [] 158 | if len(stacks) > 0 and len(stacks[0]) > 0: 159 | scores = [] 160 | for stack in stacks: 161 | prev_mol = list(stack.get_top()) 162 | # see how many reactants are needed 163 | if rxn.num_reactants == 1: 164 | assert len(smiles_to_search) == 0 165 | success = stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 166 | if success: 167 | new_stacks.append(stack) 168 | elif rxn.num_reactants == 2: 169 | assert len(smiles_to_search) == 1 170 | top_bbs_reactants = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k) 171 | valid_mols = match_two_reactants(prev_mol, top_bbs_reactants, rxn, continue_rxn = True) 172 | for mol in valid_mols: 173 | new_stack = copy.deepcopy(stack) 174 | new_stack.push_mol(mol, 0) 175 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 176 | if success: 177 | scores.append(similarity_score(product_template, new_stack[-1].smiles)) 178 | new_stacks.append(new_stack) 179 | elif rxn.num_reactants == 3: 180 | assert len(smiles_to_search) == 2 181 | top_bbs_reactants1 = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k) 182 | top_bbs_reactants2 = get_top_k_smiles(smiles_to_search[1], smiles_searcher, fp_searcher, k) 183 | valid_mols = match_three_reactants(prev_mol, top_bbs_reactants1, top_bbs_reactants2, rxn, continue_rxn = True) 184 | for mol1, mol2 in valid_mols: 185 | new_stack = copy.deepcopy(stack) 186 | new_stack.push_mol(mol1, 0) 187 | new_stack.push_mol(mol2, 0) 188 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 189 | if success: 190 | scores.append(similarity_score(product_template, new_stack[-1].smiles)) 191 | new_stacks.append(new_stack) 192 | else: 193 | scores = [] 194 | if rxn.num_reactants == 3: 195 | assert len(smiles_to_search) == 3 196 | top_bbs_reactants1 = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k // 2 + 1) 197 | top_bbs_reactants2 = get_top_k_smiles(smiles_to_search[1], smiles_searcher, fp_searcher, k // 2 + 1) 198 | top_bbs_reactants3 = get_top_k_smiles(smiles_to_search[2], smiles_searcher, fp_searcher, k // 2 + 1) 199 | valid_mols = match_three_reactants(top_bbs_reactants1, top_bbs_reactants2, top_bbs_reactants3, rxn, continue_rxn = False) 200 | for mol1, mol2, mol3 in valid_mols: 201 | new_stack = Stack() 202 | new_stack.push_mol(mol1, 0) 203 | new_stack.push_mol(mol2, 0) 204 | new_stack.push_mol(mol3, 0) 205 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 206 | if success: 207 | scores.append(similarity_score(product_template, new_stack[-1].smiles)) 208 | new_stacks.append(new_stack) 209 | 210 | elif rxn.num_reactants == 2: 211 | assert len(smiles_to_search) == 2 212 | top_bbs_reactants1 = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k) 213 | top_bbs_reactants2 = get_top_k_smiles(smiles_to_search[1], smiles_searcher, fp_searcher, k) 214 | valid_mols = match_two_reactants(top_bbs_reactants1, top_bbs_reactants2, rxn, continue_rxn=False) 215 | for mol1, mol2 in valid_mols: 216 | new_stack = Stack() 217 | new_stack.push_mol(mol1, 0) 218 | new_stack.push_mol(mol2, 0) 219 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 220 | if success: 221 | 
scores.append(similarity_score(product_template, new_stack[-1].smiles))
222 |                     new_stacks.append(new_stack)
223 | 
224 |         elif rxn.num_reactants == 1:
225 |             assert len(smiles_to_search) == 1
226 |             top_bbs_reactants = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k)
227 |             for mol in top_bbs_reactants:
228 |                 new_stack = Stack()
229 |                 new_stack.push_mol(mol, 0)
230 |                 success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit)
231 |                 if success:
232 |                     scores.append(similarity_score(product_template, new_stack[-1].smiles))
233 |                     new_stacks.append(new_stack)
234 | 
235 |     new_stacks = [stack for stack in new_stacks if stack is not None and len(stack) > 0]
236 |     if len(new_stacks) == 0:
237 |         return None
238 |     if len(new_stacks) > n_stacks:
239 |         new_stacks = sorted(new_stacks, key=lambda x: scores[new_stacks.index(x)], reverse=True)[:n_stacks]
240 |     return new_stacks
241 | 
242 | def reconstruct_all_rxns(output, reaction_idx_map, embedding_path, k, n_stacks):
243 |     """
244 |     Reconstruct all reactions from a list of building blocks and reactants.
245 | 
246 |     Args:
247 |         output (dict): the output from the LLM.
248 |         reaction_idx_map (dict): the reaction idx map.
249 |         embedding_path (str): the path to the smiles embedding.
250 |         k (int): the number of top smiles to return for each query.
251 |         n_stacks (int): the number of stacks to keep.
252 |     """
253 |     if 'reactions' not in output or 'building_blocks' not in output: return None
254 |     building_blocks = [bb.split("<smiles>")[-1].split("</smiles>")[0] for bb in output['building_blocks']]  # NOTE: placeholder tags; the original literals were lost when angle-bracket markup was stripped from this listing
255 |     reactions = output['reactions']
256 |     stacks = [Stack()]
257 |     for i, reaction in enumerate(reactions[::-1]):
258 |         if 'reaction_template' not in reaction or 'reactants' not in reaction or 'product' not in reaction: continue
259 |         template = reaction['reaction_template'].split('<rxn>')[1].split('</rxn>')[0]  # placeholder tags (see note above)
260 |         if template not in reaction_idx_map:
261 |             template = find_closest_match(template, list(reaction_idx_map.keys()))
262 |         rxn_idx = reaction_idx_map[template]
263 |         reactants = reaction['reactants']
264 |         product_template = reaction['product']
265 |         smiles_to_search = [s for s in reactants if s in building_blocks]
266 |         smiles_searcher = SmilesSimilaritySearch.load(f"{embedding_path}/smiles_tfidf_{rxn_idx}.pkl")
267 |         fp_searcher = FingerprintIndex.load(f"{embedding_path}/fpindex_{rxn_idx}.pkl")
268 |         stacks = reconstruct_single_rxn(smiles_to_search, product_template, smiles_searcher, fp_searcher, template, rxn_idx, stacks, k, n_stacks)
269 |         if stacks is None:
270 |             print(f"Error reconstructing reaction {i}")
271 |             return None
272 |     return stacks
273 | 
274 | def reaction_scorer(stacks, target_mol, num_calc_extra_metrics: int = 10) -> pd.DataFrame:
275 |     """
276 |     Score the reactions by their similarity to the target molecule.
277 | 
278 |     Args:
279 |         stacks (list[Stack]): the stacks to score.
280 |         target_mol (Molecule): the target molecule.
281 |         num_calc_extra_metrics (int, optional): the number of extra metrics to calculate. Defaults to 10.
282 | 
283 |     Returns:
284 |         pd.DataFrame: a dataframe with the scores and extra metrics.
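    Note: only the top `num_calc_extra_metrics` rows (ranked by the Morgan-fingerprint
    score) receive the extra scf_sim / pharm2d_sim / rdkit_sim columns; the remaining
    rows are left as NaN in the resulting dataframe.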
285 | """ 286 | rows: list[dict[str, str | float]] = [] 287 | smiles_to_mol: dict[str, Molecule] = {} 288 | if not stacks: 289 | return pd.DataFrame() 290 | for stack in stacks: 291 | product_mol = stack[-1] 292 | rows.append( 293 | { 294 | "target": target_mol.smiles, 295 | "smiles": product_mol.smiles, 296 | "score": target_mol.sim(product_mol, FingerprintOption.morgan_for_tanimoto_similarity()), 297 | "synthesis": stack.get_action_string(), 298 | # "source": stack.get_source(), # for checking the source of the bb generation 299 | "num_steps": stack.count_reactions(), 300 | } 301 | ) 302 | smiles_to_mol[product_mol.smiles] = product_mol 303 | rows.sort(key=lambda r: r["score"], reverse=True) 304 | for row in rows[:num_calc_extra_metrics]: 305 | mol = smiles_to_mol[str(row["smiles"])] 306 | row["scf_sim"] = target_mol.scaffold.tanimoto_similarity( 307 | mol.scaffold, 308 | fp_option=FingerprintOption.morgan_for_tanimoto_similarity(), 309 | ) 310 | row["pharm2d_sim"] = target_mol.dice_similarity(mol, fp_option=FingerprintOption.gobbi_pharm2d()) 311 | row["rdkit_sim"] = target_mol.tanimoto_similarity(mol, fp_option=FingerprintOption.rdkit()) 312 | df = pd.DataFrame(rows) 313 | return df 314 | 315 | def result_generator(smiles, llama_answer, reaction_smarts_dict_path, embedding_path, k, n_stacks, num_calc_extra_metrics=10): 316 | """ 317 | Generate results by finding top k SMILES strings for building blocks 318 | and building products from reactants and reaction templates. 319 | 320 | Args: 321 | smiles (str): The product SMILES string. 322 | llama_answer (dict): The output containing reactants and reaction templates. 323 | reaction_smarts_dict_path (str): The path to the reaction smarts map. 324 | embedding_path (str): The path to the smiles embedding. 325 | k (int, optional): The number of top smiles to return. Defaults to 10. 
326 | """ 327 | reaction_smarts_dict = pickle.load(open(reaction_smarts_dict_path, "rb")) 328 | reaction_idx_map = {v[0]: k for k, v in reaction_smarts_dict.items()} 329 | print(f"Loaded reaction smarts dict with {len(reaction_idx_map)} reactions") 330 | product_mol = Molecule(smiles) 331 | df_product = pd.DataFrame() 332 | for i, output in enumerate(llama_answer): 333 | try: 334 | stacks = reconstruct_all_rxns(output, reaction_idx_map, embedding_path, k, n_stacks) 335 | if stacks is None: 336 | continue 337 | df = reaction_scorer(stacks, product_mol, num_calc_extra_metrics) 338 | df_product = pd.concat([df_product, df]) 339 | except Exception as e: 340 | print(e) 341 | continue 342 | print("Finished processing all reactions for " + smiles) 343 | return df_product.sort_values(by=["score", "rdkit_sim", "scf_sim", "pharm2d_sim"], ascending=[False, False, False, False]).reset_index(drop=True).iloc[:n_stacks] if len(df_product) > 0 else None 344 | 345 | def result_generator_wrapper(args): 346 | """Wrapper function to unpack arguments for result_generator.""" 347 | return result_generator(*args) 348 | 349 | def run_enamine_reconstruct(llama_output_path, embedding_path, reaction_smarts_dict_path, save_path, k, n_stacks): 350 | # Load data 351 | llama_outputs = pickle.load(open(llama_output_path, "rb")) 352 | tasks = [(smiles, llama_answer, reaction_smarts_dict_path, embedding_path, k, n_stacks) for smiles, llama_answer in list(llama_outputs.items())] 353 | 354 | # Use multiprocessing 355 | num_cores = mp.cpu_count() 356 | with mp.Pool(num_cores) as pool: 357 | # Create a tqdm progress bar 358 | with tqdm(total=len(tasks)) as pbar: 359 | results = [] 360 | for result in pool.imap_unordered(result_generator_wrapper, tasks): 361 | results.append(result) 362 | pbar.update() 363 | 364 | results = [r for r in results if r is not None] 365 | print(f"Found {len(results)} results") 366 | pickle.dump(results, open(save_path, "wb")) 367 | 368 | def main(): 369 | parser = argparse.ArgumentParser() 370 | parser.add_argument("--llama_folder", type=str, default="../synllama-data/results/table_2_syn_planning_91rxns") 371 | parser.add_argument("--embedding_path", type=str, default="../synllama-data/inference/reconstruction/91rxns/rxn_embeddings") 372 | parser.add_argument("--n_stacks", type=int, default=25) 373 | parser.add_argument("--k", type=int, default=5) 374 | parser.add_argument("--save_path", type=str, default=None) 375 | parser.add_argument("--total_num_mols", type=int, default=1000) 376 | parser.add_argument("--top_n_rows", type=int, default=1) 377 | 378 | args = parser.parse_args() 379 | args.reaction_smarts_dict_path = os.path.join(args.embedding_path, "reaction_smarts_map.pkl") 380 | mp.set_start_method('spawn') 381 | 382 | llama_output_paths = glob.glob(os.path.join(args.llama_folder, "*.pkl")) 383 | if args.save_path is None: 384 | args.save_path = os.path.join(args.llama_folder, "enamine_reconstruct") 385 | os.makedirs(args.save_path, exist_ok=True) 386 | 387 | for llama_output_path in llama_output_paths: 388 | print(f"Processing {llama_output_path}") 389 | name = llama_output_path[:-4].split("/")[-1] 390 | save_path = os.path.join(args.save_path, f"{name}.pkl") 391 | if os.path.exists(save_path): 392 | print(f"Skipping {name} because it already exists") 393 | continue 394 | run_enamine_reconstruct(llama_output_path, args.embedding_path, args.reaction_smarts_dict_path, save_path, args.k, args.n_stacks) 395 | 396 | results_folder = os.path.join(args.llama_folder, "enamine_reconstruct") 397 | 
397 |     results_file_paths = glob.glob(os.path.join(results_folder, "*.pkl"))
398 |     final_df = pd.DataFrame()
399 |     for results_file_path in results_file_paths:
400 |         result = analyze_results(results_file_path, args.total_num_mols, args.top_n_rows)
401 |         df = pd.DataFrame.from_dict(result, orient="index").T
402 |         final_df = pd.concat([final_df, df])
403 | 
404 |     final_df.sort_values(by="file_name", ascending=True, inplace=True)
405 |     final_df.to_csv(os.path.join(results_folder, "enamine_reconstruct_analysis.csv"), index=False)
406 | 
407 | if __name__ == "__main__":
408 |     main()
409 | 
--------------------------------------------------------------------------------
/steps/step_32_combined_stats.py:
--------------------------------------------------------------------------------
1 | # This script combines the Enamine-only and SynLlama reconstruction statistics into one summary table
2 | 
3 | import os, sys, argparse, glob
4 | import pickle
5 | import pandas as pd
6 | import numpy as np
7 | from collections import defaultdict
8 | 
9 | def combine_stats(enamine_reconstruct_csv, all_synllama_reconstruct_csv, non_enamine_synllama_reconstruct_csv, enamine_synllama_reconstruct_csv, total_num_mols, llama_folder = None):
10 |     enamine_reconstruct_df = pd.read_csv(enamine_reconstruct_csv)
11 |     file_name = enamine_reconstruct_csv[:-4].split("/")[-1].split("_enamine_reconstruct")[0]
12 |     # load the three SynLlama reconstruction summaries
13 |     synllama_enamine_reconstruct_df = pd.read_csv(enamine_synllama_reconstruct_csv)
14 |     synllama_all_reconstruct_df = pd.read_csv(all_synllama_reconstruct_csv)
15 |     synllama_non_enamine_reconstruct_df = pd.read_csv(non_enamine_synllama_reconstruct_csv)
16 |     non_enamine_reconstruct_mol = np.sum(synllama_non_enamine_reconstruct_df.groupby('target')['score'].max() == 1)
17 | 
18 |     enamine_all_df = pd.concat([enamine_reconstruct_df, synllama_enamine_reconstruct_df], ignore_index=True)
19 |     enamine_reconstruct_mol = np.sum(enamine_all_df.groupby('target')['score'].max() == 1)
20 | 
21 |     enamine_reconstruct_filtered = enamine_reconstruct_df[~enamine_reconstruct_df['target'].isin(synllama_all_reconstruct_df['target'])]  # remove the rows in enamine reconstruct that have the same target as in the raw output
22 |     combined_df = pd.concat([synllama_all_reconstruct_df, enamine_reconstruct_filtered], ignore_index=True)
23 |     no_recon_combined_df = combined_df[combined_df['score'] < 1]
24 |     combined_stats = {
25 |         "file_name": file_name,
26 |         "total_failure_rate %": round((1 - (len(combined_df) - np.sum(combined_df['score'].isna())) / total_num_mols) * 100, 2),
27 |         "total_enamine_reconstruct_rate %": round((enamine_reconstruct_mol / total_num_mols) * 100, 2),
28 |         "total_non_enamine_reconstruct_rate %": round((non_enamine_reconstruct_mol / total_num_mols) * 100, 2),
29 |         "total_all_reconstruction_rate %": round((np.sum(combined_df['score'] == 1) / total_num_mols) * 100, 2),
30 |         "morgan_sim": combined_df['score'].mean(),
31 |         "scf_sim": combined_df['scf_sim'].mean(),
32 |         "pharm2d_sim": combined_df['pharm2d_sim'].mean(),
33 |         "avg_rxn_steps": combined_df['num_steps'].mean(),
34 |         "morgan_sim_no_recon": no_recon_combined_df['score'].mean(),
35 |         "scf_sim_no_recon": no_recon_combined_df['scf_sim'].mean(),
36 |         "pharm2d_sim_no_recon": no_recon_combined_df['pharm2d_sim'].mean(),
37 |         "avg_rxn_steps_no_recon": no_recon_combined_df['num_steps'].mean(),
38 |     }
39 |     combined_df.to_csv(os.path.join(llama_folder, f"{file_name}_final_reconstruct_stats.csv"), index=False)
40 |     return combined_stats
41 | 
42 | if __name__ == 
"__main__": 43 | parser = argparse.ArgumentParser() 44 | parser.add_argument("--llama_folder", type=str, default="../synllama-data/results/table_2_syn_planning_91rxns") 45 | parser.add_argument("--total_num_mols", type=int, default=1000) 46 | 47 | args = parser.parse_args() 48 | 49 | enamine_reconstruct_paths = glob.glob(os.path.join(args.llama_folder, "enamine_reconstruct", "*_enamine_reconstruct.csv")) 50 | final_df = pd.DataFrame() 51 | for enamine_reconstruct_path in enamine_reconstruct_paths: 52 | # if you follow the default naming convention, you can use this line 53 | file_name = enamine_reconstruct_path[:-4].split("/")[-1].split("_enamine_reconstruct")[0] 54 | enamine_synllama_reconstruct_path = os.path.join(args.llama_folder, "synllama_reconstruct", f"{file_name}_enamine_synllama_reconstruct.csv") 55 | all_synllama_reconstruct_path = os.path.join(args.llama_folder, "synllama_reconstruct", f"{file_name}_all_synllama_reconstruct.csv") 56 | non_enamine_synllama_reconstruct_path = os.path.join(args.llama_folder, "synllama_reconstruct", f"{file_name}_non_enamine_synllama_reconstruct.csv") 57 | result = combine_stats(enamine_reconstruct_path, all_synllama_reconstruct_path, non_enamine_synllama_reconstruct_path, enamine_synllama_reconstruct_path, total_num_mols=args.total_num_mols, llama_folder = args.llama_folder) 58 | df = pd.DataFrame.from_dict(result, orient="index").T 59 | final_df = pd.concat([final_df, df]) 60 | final_df.sort_values(by="file_name", ascending=True, inplace=True) 61 | final_df.to_csv(os.path.join(args.llama_folder, "combined_final_stats.csv"), index=False) 62 | -------------------------------------------------------------------------------- /synllama/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/THGLab/SynLlama/5592cfc9d2338c6ebd7add7971c26b69e9aa1111/synllama/__init__.py -------------------------------------------------------------------------------- /synllama/chem/__init__.py: -------------------------------------------------------------------------------- 1 | import rdkit 2 | 3 | from . import matrix, mol, reaction 4 | 5 | rdkit.RDLogger.DisableLog("rdApp.*") 6 | -------------------------------------------------------------------------------- /synllama/chem/base.py: -------------------------------------------------------------------------------- 1 | import abc 2 | from typing import Literal, overload 3 | 4 | import PIL.Image 5 | 6 | 7 | class Drawable(abc.ABC): 8 | @overload 9 | def draw(self, size: int, svg: Literal[False]) -> PIL.Image.Image: ... 10 | 11 | @overload 12 | def draw(self, size: int, svg: Literal[True]) -> str: ... 13 | 14 | @abc.abstractmethod 15 | def draw(self, size: int, svg: bool) -> PIL.Image.Image | str: ... 
16 | -------------------------------------------------------------------------------- /synllama/chem/fpindex.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | import functools 3 | import os, pickle 4 | import pathlib 5 | import tempfile 6 | from collections.abc import Iterable, Sequence 7 | 8 | import joblib 9 | import numpy as np 10 | import torch 11 | from sklearn.neighbors import BallTree 12 | from tqdm.auto import tqdm 13 | 14 | from synllama.chem.mol import FingerprintOption, Molecule 15 | 16 | 17 | @dataclasses.dataclass 18 | class _QueryResult: 19 | index: int 20 | molecule: Molecule 21 | fingerprint: np.ndarray 22 | distance: float 23 | 24 | 25 | def _fill_fingerprint( 26 | fp: np.memmap, 27 | offset: int, 28 | molecules: Iterable[Molecule], 29 | fp_option: FingerprintOption, 30 | ): 31 | os.sched_setaffinity(0, range(os.cpu_count() or 1)) 32 | for i, mol in enumerate(molecules): 33 | fp[offset + i] = mol.get_fingerprint(fp_option).astype(np.uint8) 34 | 35 | 36 | def compute_fingerprints( 37 | molecules: Sequence[Molecule], 38 | fp_option: FingerprintOption, 39 | batch_size: int = 1024, 40 | ) -> np.ndarray: 41 | with tempfile.TemporaryDirectory() as tempdir_s: 42 | temp_fname = pathlib.Path(tempdir_s) / "fingerprint" 43 | fp = np.memmap( 44 | str(temp_fname), 45 | dtype=np.uint8, 46 | mode="w+", 47 | shape=(len(molecules), fp_option.dim), 48 | ) 49 | joblib.Parallel(n_jobs=joblib.cpu_count() // 2)( 50 | joblib.delayed(_fill_fingerprint)( 51 | fp=fp, 52 | offset=start, 53 | molecules=molecules[start : start + batch_size], 54 | fp_option=fp_option, 55 | ) 56 | for start in tqdm(range(0, len(molecules), batch_size), desc="Fingerprint") 57 | ) 58 | return np.array(fp) 59 | 60 | 61 | class FingerprintIndex: 62 | def __init__(self, molecules: Iterable[Molecule], fp_option: FingerprintOption) -> None: 63 | super().__init__() 64 | self._molecules = tuple(molecules) 65 | self._fp_option = fp_option 66 | self._fp = self._init_fingerprint() 67 | self._tree = self._init_tree() 68 | 69 | @property 70 | def molecules(self) -> tuple[Molecule, ...]: 71 | return self._molecules 72 | 73 | @property 74 | def fp_option(self) -> FingerprintOption: 75 | return self._fp_option 76 | 77 | def _init_fingerprint(self, batch_size: int = 1024) -> np.ndarray: 78 | return compute_fingerprints( 79 | molecules=self._molecules, 80 | fp_option=self._fp_option, 81 | batch_size=batch_size, 82 | ) 83 | 84 | def _init_tree(self) -> BallTree: 85 | tree = BallTree(self._fp, metric="manhattan") 86 | return tree 87 | 88 | def __getitem__(self, index: int) -> tuple[Molecule, np.ndarray]: 89 | return self._molecules[index], self._fp[index] 90 | 91 | def query_single(self, q: np.ndarray, k: int) -> list[_QueryResult]: 92 | dist, idx = self._tree.query(q.reshape([1, self._fp_option.dim]), k=k) 93 | results = [] 94 | for distance, idx in zip(dist[0], idx[0]): 95 | results.append(_QueryResult( 96 | index=idx, 97 | molecule=self._molecules[idx], 98 | fingerprint=None, 99 | distance=distance 100 | )) 101 | return sorted(results, key=lambda x: x.distance) 102 | 103 | def query(self, q: np.ndarray, k: int) -> list[list[_QueryResult]]: 104 | """ 105 | Args: 106 | q: shape (bsz, ..., fp_dim) 107 | """ 108 | bsz = q.shape[0] 109 | dist, idx = self._tree.query(q.reshape([-1, self._fp_option.dim]), k=k) 110 | dist = dist.reshape([bsz, -1]) 111 | idx = idx.reshape([bsz, -1]) 112 | results: list[list[_QueryResult]] = [] 113 | for i in range(dist.shape[0]): 114 | res: 
list[_QueryResult] = [] 115 | for j in range(dist.shape[1]): 116 | index = int(idx[i, j]) 117 | res.append( 118 | _QueryResult( 119 | index=index, 120 | molecule=self._molecules[index], 121 | fingerprint=self._fp[index], 122 | distance=dist[i, j], 123 | ) 124 | ) 125 | results.append(res) 126 | return results 127 | 128 | @functools.cache 129 | def fp_cuda(self, device: torch.device) -> torch.Tensor: 130 | return torch.tensor(self._fp, dtype=torch.float, device=device) 131 | 132 | @torch.inference_mode() 133 | def query_cuda(self, q: torch.Tensor, k: int) -> list[list[_QueryResult]]: 134 | bsz = q.size(0) 135 | q = q.reshape([-1, self._fp_option.dim]) 136 | pwdist = torch.cdist(self.fp_cuda(q.device), q, p=1) # (n_mols, n_queries) 137 | dist_t, idx_t = torch.topk(pwdist, k=k, dim=0, largest=False) # (k, n_queries) 138 | dist = dist_t.t().reshape([bsz, -1]).cpu().numpy() 139 | idx = idx_t.t().reshape([bsz, -1]).cpu().numpy() 140 | 141 | results: list[list[_QueryResult]] = [] 142 | for i in range(dist.shape[0]): 143 | res: list[_QueryResult] = [] 144 | for j in range(dist.shape[1]): 145 | index = int(idx[i, j]) 146 | res.append( 147 | _QueryResult( 148 | index=index, 149 | molecule=self._molecules[index], 150 | fingerprint=self._fp[index], 151 | distance=dist[i, j], 152 | ) 153 | ) 154 | results.append(res) 155 | return results 156 | 157 | def save(self, filename): 158 | with open(filename, 'wb') as f: 159 | pickle.dump(self, f) 160 | 161 | @classmethod 162 | def load(cls, filename): 163 | with open(filename, 'rb') as f: 164 | return pickle.load(f) 165 | 166 | if __name__ == "__main__": 167 | # Later, you can load the searcher without refitting 168 | loaded_searcher = FingerprintIndex.load("data/processed/fpindex.pkl") 169 | # Example search 170 | query_smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin 171 | fingerprints = compute_fingerprints(tuple([Molecule(query_smiles)]), fp_option=FingerprintOption.morgan_for_building_blocks()) 172 | results = loaded_searcher.query_single(fingerprints.reshape(1, -1), k=5) 173 | print(f"\nTop 5 similar SMILES to {query_smiles}:") 174 | for result in results: 175 | print(result.molecule.smiles) 176 | -------------------------------------------------------------------------------- /synllama/chem/matrix.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | import pickle 4 | import tempfile 5 | from collections.abc import Iterable 6 | from functools import cached_property 7 | 8 | import joblib 9 | import numpy as np 10 | from tqdm.auto import tqdm 11 | 12 | from synllama.chem.mol import Molecule 13 | from synllama.chem.reaction import Reaction, ReactionContainer 14 | 15 | 16 | def _fill_matrix(matrix: np.memmap, offset: int, reactants: Iterable[Molecule], reactions: Iterable[Reaction]): 17 | for i, reactant in enumerate(reactants): 18 | for j, reaction in enumerate(reactions): 19 | flag = 0 20 | for t in reaction.match_reactant_templates(reactant): 21 | flag |= 1 << t 22 | matrix[offset + i, j] = flag 23 | 24 | 25 | class ReactantReactionMatrix: 26 | def __init__( 27 | self, 28 | reactants: Iterable[Molecule], 29 | reactions: Iterable[Reaction], 30 | matrix: np.ndarray | os.PathLike | None = None, 31 | ) -> None: 32 | super().__init__() 33 | self._reactants = tuple(reactants) 34 | self._reactions = tuple(reactions) 35 | self._matrix = self._init_matrix(matrix) 36 | 37 | def _init_matrix(self, matrix: np.ndarray | os.PathLike | None, batch_size: int = 1024) -> np.ndarray: 38 | if isinstance(matrix, 
np.ndarray): 39 | return matrix 40 | elif isinstance(matrix, (os.PathLike, str)): 41 | return np.load(matrix) 42 | 43 | with tempfile.TemporaryDirectory() as tempdir_s: 44 | temp_fname = pathlib.Path(tempdir_s) / "matrix" 45 | matrix = np.memmap( 46 | str(temp_fname), 47 | dtype=np.uint8, 48 | mode="w+", 49 | shape=(len(self._reactants), len(self._reactions)), 50 | ) 51 | joblib.Parallel(n_jobs=joblib.cpu_count() // 2)( 52 | joblib.delayed(_fill_matrix)( 53 | matrix=matrix, 54 | offset=start, 55 | reactants=self._reactants[start : start + batch_size], 56 | reactions=self._reactions, 57 | ) 58 | for start in tqdm(range(0, len(self._reactants), batch_size), desc="Create matrix") 59 | ) 60 | return np.array(matrix) 61 | 62 | @property 63 | def reactants(self) -> tuple[Molecule, ...]: 64 | return self._reactants 65 | 66 | @cached_property 67 | def reactions(self) -> ReactionContainer: 68 | return ReactionContainer(self._reactions) 69 | 70 | @cached_property 71 | def seed_reaction_indices(self) -> list[int]: 72 | full_flag = np.array([0b01 if rxn.num_reactants == 1 else 0b11 for rxn in self._reactions], dtype=np.uint8) 73 | return np.nonzero(full_flag == np.bitwise_or.reduce(self._matrix, axis=0))[0].tolist() 74 | 75 | @cached_property 76 | def reactant_count(self) -> np.ndarray: 77 | return (self._matrix != 0).astype(np.int32).sum(0) 78 | 79 | @property 80 | def matrix(self) -> np.ndarray: 81 | return self._matrix 82 | 83 | def save(self, filename): 84 | with open(filename, 'wb') as f: 85 | pickle.dump(self, f) 86 | 87 | @classmethod 88 | def load(cls, filename): 89 | with open(filename, 'rb') as f: 90 | return pickle.load(f) 91 | -------------------------------------------------------------------------------- /synllama/chem/mol.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | import hashlib 3 | import os 4 | import pathlib 5 | from collections.abc import Iterable, Sequence 6 | from functools import cache, cached_property, partial 7 | from typing import Literal, overload 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from rdkit import Chem 12 | from rdkit.Chem import AllChem, DataStructs, Draw 13 | from rdkit.Chem.Pharm2D import Generate as Generate2D 14 | from rdkit.Chem.Pharm2D import Gobbi_Pharm2D 15 | from rdkit.Chem.Scaffolds import MurckoScaffold 16 | from tqdm.auto import tqdm 17 | 18 | from .base import Drawable 19 | 20 | 21 | @dataclasses.dataclass(frozen=True, eq=True, unsafe_hash=True) 22 | class FingerprintOption: 23 | type: str = "morgan" 24 | # Morgan 25 | morgan_radius: int = 2 26 | morgan_n_bits: int = 256 27 | # RDKit 28 | rdkit_fp_size: int = 2048 29 | 30 | def __post_init__(self): 31 | supported_types = ("morgan", "rdkit", "gobbi_pharm2d") 32 | if self.type not in supported_types: 33 | raise ValueError(f"Unsupported fingerprint type: {self.type}") 34 | 35 | @classmethod 36 | def morgan_for_tanimoto_similarity(cls): 37 | return FingerprintOption( 38 | type="morgan", 39 | morgan_radius=2, 40 | morgan_n_bits=4096, 41 | ) 42 | 43 | @classmethod 44 | def gobbi_pharm2d(cls): 45 | return FingerprintOption( 46 | type="gobbi_pharm2d", 47 | ) 48 | 49 | @classmethod 50 | def morgan_for_building_blocks(cls): 51 | return FingerprintOption( 52 | type="morgan", 53 | morgan_radius=2, 54 | morgan_n_bits=256, 55 | ) 56 | 57 | @classmethod 58 | def rdkit(cls): 59 | return FingerprintOption( 60 | type="rdkit", 61 | ) 62 | 63 | @property 64 | def dim(self) -> int: 65 | if self.type == "morgan": 66 | return 
self.morgan_n_bits 67 | elif self.type == "rdkit": 68 | return self.rdkit_fp_size 69 | elif self.type == "gobbi_pharm2d": 70 | return 39972 71 | raise ValueError(f"Unsupported fingerprint type: {self.type}") 72 | 73 | 74 | class Molecule(Drawable): 75 | def __init__(self, smiles: str, source: Literal["smiles", "fp", ''] = '') -> None: 76 | super().__init__() 77 | self._smiles = smiles.strip() 78 | self.meta_info = {} 79 | self._source = source 80 | 81 | @classmethod 82 | def from_rdmol(cls, rdmol: Chem.Mol) -> "Molecule": 83 | return cls(Chem.MolToSmiles(rdmol)) 84 | 85 | def __getstate__(self): 86 | return self._smiles 87 | 88 | def __setstate__(self, state): 89 | self._smiles = state 90 | self._source = '' 91 | 92 | @property 93 | def smiles(self) -> str: 94 | return self._smiles 95 | 96 | @property 97 | def source(self) -> Literal["smiles", "fp", '']: 98 | return self._source 99 | 100 | @cached_property 101 | def _rdmol(self): 102 | return Chem.MolFromSmiles(self._smiles) 103 | 104 | @cached_property 105 | def _rdmol_no_hs(self): 106 | return Chem.RemoveHs(self._rdmol) 107 | 108 | @cached_property 109 | def is_valid(self) -> bool: 110 | return self._rdmol is not None 111 | 112 | @cached_property 113 | def csmiles(self) -> str: 114 | return Chem.MolToSmiles(self._rdmol, canonical=True, isomericSmiles=False) 115 | 116 | @cached_property 117 | def num_atoms(self) -> int: 118 | return self._rdmol.GetNumAtoms() 119 | 120 | def draw(self, size: int = 100, svg: bool = False): 121 | if svg: 122 | return Draw._moltoSVG(self._rdmol, sz=(size, size), highlights=[], legend=[], kekulize=True) 123 | else: 124 | return Draw.MolToImage(self._rdmol, size=(size, size), kekulize=True) 125 | 126 | def __hash__(self) -> int: 127 | return hash(self._smiles) 128 | 129 | def __eq__(self, __value: object) -> bool: 130 | return isinstance(__value, Molecule) and self.csmiles == __value.csmiles 131 | 132 | @cached_property 133 | def major_molecule(self) -> "Molecule": 134 | if "." in self.smiles: 135 | segs = self.smiles.split(".") 136 | segs.sort(key=lambda a: -len(a)) 137 | return Molecule(segs[0]) 138 | return self 139 | 140 | @overload 141 | def get_fingerprint(self, option: FingerprintOption) -> np.ndarray: ... 142 | 143 | @overload 144 | def get_fingerprint(self, option: FingerprintOption, as_bitvec: Literal[True]) -> Sequence[Literal[0, 1]]: ... 145 | 146 | @overload 147 | def get_fingerprint(self, option: FingerprintOption, as_bitvec: Literal[False]) -> np.ndarray: ... 
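    # Illustrative usage (the SMILES below is an arbitrary example):
    #   mol = Molecule("CC(=O)Oc1ccccc1C(=O)O")
    #   fp = mol.get_fingerprint(FingerprintOption.morgan_for_building_blocks())   # 256-dim numpy array
    #   bv = mol.get_fingerprint(FingerprintOption.morgan_for_tanimoto_similarity(), as_bitvec=True)  # RDKit bit vector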
148 | 149 | def get_fingerprint(self, option: FingerprintOption, as_bitvec: bool = False): 150 | return self._get_fingerprint(option, as_bitvec) # work-around for mypy check 151 | 152 | @cache 153 | def _get_fingerprint(self, option: FingerprintOption, as_bitvec: bool): 154 | if option.type == "morgan": 155 | bit_vec = AllChem.GetMorganFingerprintAsBitVect(self._rdmol, option.morgan_radius, option.morgan_n_bits) 156 | elif option.type == "rdkit": 157 | bit_vec = Chem.RDKFingerprint(self._rdmol, fpSize=option.rdkit_fp_size) 158 | elif option.type == "gobbi_pharm2d": 159 | bit_vec = DataStructs.cDataStructs.ConvertToExplicit( 160 | Generate2D.Gen2DFingerprint(self._rdmol, Gobbi_Pharm2D.factory) 161 | ) 162 | else: 163 | raise ValueError(f"Unsupported fingerprint type: {option.type}") 164 | 165 | if as_bitvec: 166 | return bit_vec 167 | feat = np.zeros((1,), dtype=np.float32) 168 | DataStructs.ConvertToNumpyArray(bit_vec, feat) 169 | return feat 170 | 171 | @cached_property 172 | def scaffold(self) -> "Molecule": 173 | s = Molecule.from_rdmol(MurckoScaffold.GetScaffoldForMol(self._rdmol)) 174 | if not s.is_valid: 175 | s = self 176 | return s 177 | 178 | def tanimoto_similarity(self, other: "Molecule", fp_option: FingerprintOption) -> float: 179 | fp1 = self.get_fingerprint(fp_option, as_bitvec=True) 180 | fp2 = other.get_fingerprint(fp_option, as_bitvec=True) 181 | return DataStructs.TanimotoSimilarity(fp1, fp2) 182 | 183 | def dice_similarity(self, other: "Molecule", fp_option: FingerprintOption) -> float: 184 | fp1 = self.get_fingerprint(fp_option, as_bitvec=True) 185 | fp2 = other.get_fingerprint(fp_option, as_bitvec=True) 186 | return DataStructs.DiceSimilarity(fp1, fp2) 187 | 188 | @cache 189 | def sim( 190 | self, 191 | other: "Molecule", 192 | fp_option: FingerprintOption = FingerprintOption.morgan_for_tanimoto_similarity(), 193 | ) -> float: 194 | return self.tanimoto_similarity(other, fp_option) 195 | 196 | @cached_property 197 | def csmiles_md5(self) -> bytes: 198 | return hashlib.md5(self.csmiles.encode()).digest() 199 | 200 | @cached_property 201 | def csmiles_sha256(self) -> bytes: 202 | return hashlib.sha256(self.csmiles.encode()).digest() 203 | 204 | def get_meta_info(mol): 205 | return { 206 | "id": mol.GetProp("id") if mol.HasProp("id") else None, 207 | "IUPAC Name": mol.GetProp("IUPAC Name") if mol.HasProp("IUPAC Name") else None, 208 | "CAS": mol.GetProp("CAS") if mol.HasProp("CAS") else None, 209 | "purity": mol.GetProp("purity") if mol.HasProp("purity") else None, 210 | "MDLNUMBER": mol.GetProp("MDLNUMBER") if mol.HasProp("MDLNUMBER") else None, 211 | "LogP": mol.GetProp("LogP") if mol.HasProp("LogP") else None, 212 | "URL": mol.GetProp("URL") if mol.HasProp("URL") else None, 213 | "avail_US_100mg": mol.GetProp("avail_US_100mg") if mol.HasProp("avail_US_100mg") else None, 214 | "avail_US_250mg": mol.GetProp("avail_US_250mg") if mol.HasProp("avail_US_250mg") else None, 215 | "avail_US_1g": mol.GetProp("avail_US_1g") if mol.HasProp("avail_US_1g") else None, 216 | "avail_US_2_5g": mol.GetProp("avail_US_2_5g") if mol.HasProp("avail_US_2_5g") else None 217 | } 218 | 219 | 220 | def read_mol_file( 221 | path: os.PathLike, 222 | major_only: bool = True, 223 | drop_duplicates: bool = True, 224 | show_pbar: bool = True, 225 | smiles_col: str | None = None, 226 | pbar_fn=partial(tqdm, desc="Reading"), 227 | ) -> Iterable[Molecule]: 228 | path = pathlib.Path(path) 229 | if path.suffix == ".sdf": 230 | f = Chem.SDMolSupplier(str(path)) 231 | elif path.suffix == ".smi": 232 | f = 
Chem.SmilesMolSupplier(str(path))
233 |     elif path.suffix == ".csv":
234 |         df = pd.read_csv(path)
235 |         if smiles_col is None:
236 |             if "smiles" in df.columns:
237 |                 smiles_col = "smiles"
238 |             elif "SMILES" in df.columns:
239 |                 smiles_col = "SMILES"
240 |             else:
241 |                 raise ValueError(f"Cannot find SMILES column in {path}")
242 |         f = (Chem.MolFromSmiles(smiles) for smiles in df[smiles_col])
243 |     else:
244 |         raise ValueError(f"Unsupported file type: {path.suffix}")
245 |     visited: set[str] = set()
246 |     if show_pbar:
247 |         f_iter = pbar_fn(f)
248 |     else:
249 |         f_iter = f
250 |     for rdmol in f_iter:
251 |         if rdmol is not None:
252 |             meta_info = get_meta_info(rdmol)
253 |             mol = Molecule.from_rdmol(rdmol)
254 |             mol.meta_info = meta_info
255 |             if major_only:
256 |                 mol = mol.major_molecule
257 |             if drop_duplicates and mol.csmiles in visited:
258 |                 continue
259 |             yield mol
260 |             visited.add(mol.csmiles)
261 | 
262 | 
263 | def write_to_smi(path: os.PathLike, mols: Sequence[Molecule]):
264 |     with open(path, "w") as f:
265 |         for mol in mols:
266 |             f.write(f"{mol.smiles}\n")
267 | 
--------------------------------------------------------------------------------
/synllama/chem/reaction.py:
--------------------------------------------------------------------------------
1 | import os
2 | from collections.abc import Iterable, Sequence
3 | from functools import cached_property
4 | from typing import overload
5 | 
6 | from rdkit import Chem
7 | from rdkit.Chem import AllChem, Draw, rdChemReactions
8 | 
9 | from synllama.chem.base import Drawable
10 | from synllama.chem.mol import Molecule
11 | 
12 | 
13 | class Template(Drawable):
14 |     def __init__(self, smarts: str) -> None:
15 |         super().__init__()
16 |         self._smarts = smarts.strip()
17 | 
18 |     def __getstate__(self):
19 |         return self._smarts
20 | 
21 |     def __setstate__(self, state):
22 |         self._smarts = state
23 | 
24 |     @property
25 |     def smarts(self) -> str:
26 |         return self._smarts
27 | 
28 |     @cached_property
29 |     def _rdmol(self):
30 |         return AllChem.MolFromSmarts(self._smarts)
31 | 
32 |     def draw(self, size: int = 100, svg: bool = False):
33 |         if svg:
34 |             return Draw._moltoSVG(self._rdmol, sz=(size, size), highlights=[], legend=[], kekulize=True)
35 |         else:
36 |             return Draw.MolToImage(self._rdmol, size=(size, size), kekulize=True)
37 | 
38 |     def match(self, mol: Molecule) -> bool:
39 |         return mol._rdmol.HasSubstructMatch(self._rdmol)
40 | 
41 |     def __hash__(self) -> int:
42 |         return hash(self._smarts)
43 | 
44 |     def __eq__(self, __value: object) -> bool:
45 |         return isinstance(__value, Template) and self.smarts == __value.smarts
46 | 
47 | 
48 | class Reaction(Drawable):
49 |     def __init__(self, smarts: str) -> None:
50 |         super().__init__()
51 |         self._smarts = smarts.strip()
52 | 
53 |     def __getstate__(self):
54 |         return self._smarts
55 | 
56 |     def __setstate__(self, state):
57 |         self._smarts = state
58 | 
59 |     @property
60 |     def smarts(self) -> str:
61 |         return self._smarts
62 | 
63 |     @cached_property
64 |     def _reaction(self):
65 |         r = AllChem.ReactionFromSmarts(self._smarts)
66 |         rdChemReactions.ChemicalReaction.Initialize(r)
67 |         return r
68 | 
69 |     def draw(self, size: int = 100, svg: bool = False):
70 |         return Draw.ReactionToImage(self._reaction, subImgSize=(size, size), useSVG=svg)
71 | 
72 |     @cached_property
73 |     def num_reactants(self) -> int:
74 |         return self._reaction.GetNumReactantTemplates()
75 | 
76 |     @cached_property
77 |     def num_agents(self) -> int:
78 |         return self._reaction.GetNumAgentTemplates()
79 | 
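    # Illustrative usage (the amide-coupling SMARTS below is an assumed example,
    # not necessarily one of the shipped reaction templates):
    #   rxn = Reaction("[C:1](=[O:2])[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]")
    #   rxn.num_reactants                              # -> 2
    #   rxn([Molecule("CC(=O)O"), Molecule("CCN")])    # -> [Molecule("CCNC(C)=O")]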
80 |     @cached_property
81 |     def num_products(self) -> int:
82 |         return self._reaction.GetNumProductTemplates()
83 | 
84 |     @cached_property
85 |     def reactant_templates(self) -> tuple[Template, ...]:
86 |         # reactant_smarts = self.smarts.split(">")[0].split(".")
87 |         reactant_smarts = [Chem.MolToSmarts(self._reaction.GetReactantTemplate(i)) for i in range(self.num_reactants)]
88 |         return tuple(Template(s) for s in reactant_smarts)
89 | 
90 |     def match_reactant_templates(self, mol: Molecule) -> tuple[int, ...]:
91 |         matched: list[int] = []
92 |         for i, template in enumerate(self.reactant_templates):
93 |             if template.match(mol):
94 |                 matched.append(i)
95 |         return tuple(matched)
96 | 
97 |     @cached_property
98 |     def product_templates(self) -> tuple[Template, ...]:
99 |         product_smarts = self.smarts.split(">")[2].split(".")
100 |         return tuple(Template(s) for s in product_smarts)
101 | 
102 |     def is_reactant(self, mol: Molecule) -> bool:
103 |         return self._reaction.IsMoleculeReactant(mol._rdmol)
104 | 
105 |     def is_agent(self, mol: Molecule) -> bool:
106 |         return self._reaction.IsMoleculeAgent(mol._rdmol)
107 | 
108 |     def is_product(self, mol: Molecule) -> bool:
109 |         return self._reaction.IsMoleculeProduct(mol._rdmol)
110 | 
111 |     def __call__(self, reactants: Sequence[Molecule] | Sequence[str]) -> list[Molecule]:
112 |         if isinstance(reactants, Sequence) and not isinstance(reactants, str):
113 |             reactants = [Molecule(m) if isinstance(m, str) else Molecule.from_rdmol(m._rdmol) for m in reactants]  # accept SMILES strings as well as Molecule objects
114 |         products = [Molecule.from_rdmol(p[0]) for p in self._reaction.RunReactants([m._rdmol for m in reactants])]
115 |         products = [p for p in products if p.is_valid]
116 |         return products
117 | 
118 |     def __hash__(self) -> int:
119 |         return hash(self._smarts)
120 | 
121 |     def __eq__(self, __value: object) -> bool:
122 |         return isinstance(__value, Reaction) and self.smarts == __value.smarts
123 | 
124 | 
125 | class ReactionContainer(Sequence[Reaction]):
126 |     def __init__(self, reactions: Iterable[Reaction]) -> None:
127 |         super().__init__()
128 |         self._reactions = tuple(reactions)
129 | 
130 |     @overload
131 |     def __getitem__(self, index: int) -> Reaction: ...
132 | 
133 |     @overload
134 |     def __getitem__(self, index: slice) -> tuple[Reaction, ...]: ...
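    # Illustrative usage (the file path comes from this repo's data folder):
    #   rxns = ReactionContainer(read_reaction_file("data/91_rxns/91_rxn_templates.txt"))
    #   rxns[0]                                     # -> Reaction
    #   rxns.match_reactions(Molecule("CC(=O)O"))   # -> {reaction_index: matched reactant-template positions}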
135 | 136 | def __getitem__(self, index: int | slice): 137 | return self._reactions[index] 138 | 139 | def __len__(self) -> int: 140 | return len(self._reactions) 141 | 142 | def match_reactions(self, mol: Molecule) -> dict[int, tuple[int, ...]]: 143 | matched: dict[int, tuple[int, ...]] = {} 144 | for i, rxn in enumerate(self._reactions): 145 | m = rxn.match_reactant_templates(mol) 146 | if len(m) > 0: 147 | matched[i] = m 148 | return matched 149 | 150 | 151 | def read_reaction_file(path: os.PathLike) -> list[Reaction]: 152 | reactions: list[Reaction] = [] 153 | with open(path) as f: 154 | for line in f: 155 | line = line.strip() 156 | if not line: 157 | continue 158 | reactions.append(Reaction(line)) 159 | return reactions 160 | -------------------------------------------------------------------------------- /synllama/chem/smiles_tfidf.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | from collections import Counter 3 | from itertools import chain 4 | from collections.abc import Iterable 5 | from tqdm import tqdm 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from difflib import SequenceMatcher 8 | 9 | import numpy as np 10 | from typing import List, Tuple 11 | from sklearn.neighbors import BallTree 12 | from joblib import Parallel, delayed 13 | 14 | from synllama.chem.mol import Molecule 15 | from synllama.chem.fpindex import _QueryResult 16 | 17 | def string_similarity(s1, s2): 18 | return SequenceMatcher(None, s1, s2).ratio() 19 | 20 | def sort_by_similarity(string, target_list): 21 | return sorted(target_list, key=lambda x: string_similarity(string, x)) 22 | 23 | def find_closest_match(string, target_list): 24 | return sort_by_similarity(string, target_list)[0] 25 | 26 | class SmilesTokenizer: 27 | def __init__(self, token_list_path, n_gram_range=(2, 4)): 28 | with open(token_list_path, "r") as f: 29 | self.token_list = [line.strip() for line in f.readlines()] 30 | self.token_list = sorted(self.token_list, key=len, reverse=True) 31 | self.n_gram_range = n_gram_range 32 | 33 | def __call__(self, smiles): 34 | tokens = self._tokenize(smiles) 35 | return self._create_ngrams(tokens) 36 | 37 | def _tokenize(self, smiles): 38 | tokens = [] 39 | i = 0 40 | while i < len(smiles): 41 | matched = False 42 | for token in self.token_list: 43 | if smiles.startswith(token, i): 44 | tokens.append(token) 45 | i += len(token) 46 | matched = True 47 | break 48 | if not matched: 49 | tokens.append(smiles[i]) 50 | i += 1 51 | return tokens 52 | 53 | def _create_ngrams(self, tokens): 54 | ngrams = [] 55 | for n in range(self.n_gram_range[0], self.n_gram_range[1] + 1): 56 | for i in range(len(tokens) - n + 1): 57 | ngrams.append(" ".join(tokens[i:i+n])) 58 | return ngrams 59 | 60 | def compute_embeddings(vectorizer: TfidfVectorizer, molecules: Iterable[Molecule], batch_size: int = 1024) -> np.ndarray: 61 | all_smiles = [mol.smiles for mol in molecules] 62 | 63 | def process_batch(batch): 64 | return vectorizer.transform(batch) 65 | 66 | batches = [all_smiles[i:i + batch_size] for i in range(0, len(all_smiles), batch_size)] 67 | 68 | embeddings_list = Parallel(n_jobs=-1)( 69 | delayed(process_batch)(batch) for batch in tqdm(batches, desc="Computing embeddings") 70 | ) 71 | 72 | return np.vstack(embeddings_list) 73 | 74 | class SmilesSimilaritySearch: 75 | def __init__(self, n_gram_range=(2, 4), token_list_path=None, max_features=1024): 76 | self.n_gram_range = n_gram_range 77 | self.token_list_path = token_list_path 78 | 
self.tokenizer = SmilesTokenizer(token_list_path, n_gram_range) if token_list_path else None 79 | self.ngram_vocab = None 80 | self.idf = None 81 | self._molecules = None 82 | self._embeddings = None 83 | self._tree = None 84 | self.max_features = max_features 85 | 86 | def fit(self, molecules: Iterable[Molecule], save_ngram = False): 87 | self._molecules = tuple(molecules) 88 | all_smiles = [mol.smiles for mol in self._molecules] 89 | 90 | if self.tokenizer: 91 | all_tokens = [self.tokenizer(smiles) for smiles in tqdm(all_smiles, desc="Tokenizing SMILES")] 92 | else: 93 | all_tokens = all_smiles 94 | 95 | # Generate n-gram vocabulary (now sorted and limited) 96 | self.ngram_vocab = self._generate_ngram_vocab(all_tokens) 97 | if save_ngram: 98 | with open("data/processed/smiles_similarity_search_ngram_vocab.txt", "w") as f: 99 | for ngram in self.ngram_vocab: 100 | f.write(ngram + "\n") 101 | 102 | # Compute IDF (now using the limited vocabulary) 103 | self.idf = self._compute_idf(all_tokens) 104 | 105 | # Compute TF-IDF embeddings (now using the limited vocabulary) 106 | self._embeddings = self._compute_tfidf_embeddings(all_tokens) 107 | 108 | # Initialize BallTree 109 | self._tree = self._init_tree() 110 | 111 | def _generate_ngram_vocab(self, all_tokens): 112 | ngram_counter = Counter() 113 | for tokens in tqdm(all_tokens, desc="Generating n-gram vocabulary"): 114 | if self.tokenizer: 115 | ngrams = self.tokenizer._create_ngrams(tokens) 116 | else: 117 | ngrams = [tokens[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 118 | for i in range(len(tokens)-n+1)] 119 | ngram_counter.update(ngrams) 120 | 121 | # Sort n-grams by frequency and keep only the top max_features 122 | return [ngram for ngram, _ in ngram_counter.most_common(self.max_features)] 123 | 124 | def _compute_idf(self, all_tokens): 125 | doc_freq = Counter() 126 | for tokens in tqdm(all_tokens, desc="Computing IDF"): 127 | if self.tokenizer: 128 | ngrams = set(self.tokenizer._create_ngrams(tokens)) 129 | else: 130 | ngrams = set(tokens[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 131 | for i in range(len(tokens)-n+1)) 132 | doc_freq.update(ngram for ngram in ngrams if ngram in self.ngram_vocab) 133 | 134 | num_docs = len(all_tokens) 135 | return {ngram: np.log(num_docs / (count + 1)) + 1 for ngram, count in doc_freq.items()} 136 | 137 | def _compute_tfidf_embeddings(self, all_tokens): 138 | embeddings = [] 139 | for tokens in tqdm(all_tokens, desc="Computing TF-IDF embeddings"): 140 | if self.tokenizer: 141 | ngrams = self.tokenizer._create_ngrams(tokens) 142 | else: 143 | ngrams = [tokens[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 144 | for i in range(len(tokens)-n+1)] 145 | 146 | tf = Counter(ngram for ngram in ngrams if ngram in self.ngram_vocab) 147 | tfidf = np.zeros(len(self.ngram_vocab)) 148 | for i, ngram in enumerate(self.ngram_vocab): 149 | if ngram in tf: 150 | tfidf[i] = tf[ngram] * self.idf.get(ngram, 0) 151 | embeddings.append(tfidf) 152 | return np.array(embeddings) 153 | 154 | def _init_tree(self) -> BallTree: 155 | return BallTree(self._embeddings, metric='manhattan') 156 | 157 | def query(self, query_smiles: str, k: int = 10) -> List[_QueryResult]: 158 | if self.tokenizer: 159 | query_tokens = self.tokenizer(query_smiles) 160 | query_ngrams = self.tokenizer._create_ngrams(query_tokens) 161 | else: 162 | query_ngrams = [query_smiles[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 163 | for i in range(len(query_smiles)-n+1)] 
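        # Build the query's TF-IDF vector over the fitted n-gram vocabulary:
        # term frequencies come from the query n-grams, weights from the stored IDF.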
164 | 165 | tf = Counter(ngram for ngram in query_ngrams if ngram in self.ngram_vocab) 166 | query_embedding = np.zeros(len(self.ngram_vocab)) 167 | for i, ngram in enumerate(self.ngram_vocab): 168 | if ngram in tf: 169 | query_embedding[i] = tf[ngram] * self.idf.get(ngram, 0) 170 | 171 | distances, indices = self._tree.query(query_embedding.reshape(1, -1), k=k) 172 | 173 | results = [] 174 | for distance, idx in zip(distances[0], indices[0]): 175 | results.append(_QueryResult( 176 | index=idx, 177 | molecule=self._molecules[idx], 178 | fingerprint=None, 179 | distance=distance 180 | )) 181 | 182 | return sorted(results, key=lambda x: x.distance) 183 | 184 | def save(self, filename): 185 | with open(filename, 'wb') as f: 186 | pickle.dump(self, f) 187 | 188 | @classmethod 189 | def load(cls, filename): 190 | with open(filename, 'rb') as f: 191 | return pickle.load(f) 192 | -------------------------------------------------------------------------------- /synllama/chem/stack.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | import itertools 3 | import random 4 | from collections.abc import Iterable 5 | from typing import TypeAlias 6 | 7 | import numpy as np 8 | 9 | from synllama.chem.matrix import ReactantReactionMatrix 10 | from synllama.chem.mol import Molecule 11 | from synllama.chem.reaction import Reaction 12 | from synllama.chem.smiles_tfidf import sort_by_similarity 13 | 14 | _NumReactants: TypeAlias = int 15 | _MolOrRxnIndex: TypeAlias = int 16 | _TokenType: TypeAlias = tuple[_NumReactants, _MolOrRxnIndex] 17 | 18 | 19 | def _flatten(l): 20 | for el in l: 21 | if isinstance(el, list): 22 | yield from _flatten(el) 23 | else: 24 | yield el 25 | 26 | 27 | @dataclasses.dataclass 28 | class _Node: 29 | mol: Molecule 30 | rxn: Reaction | None 31 | token: _TokenType 32 | children: list["_Node"] 33 | 34 | def to_str(self, depth: int) -> str: 35 | pad = " " * depth * 2 36 | lines = [f"{pad}{self.mol.smiles}"] 37 | if self.rxn is not None: 38 | for c in self.children: 39 | lines.append(f"{c.to_str(depth + 1)}") 40 | return "\n".join(lines) 41 | 42 | def __repr__(self) -> str: 43 | return f"Node(\n{self.to_str(1)}\n)" 44 | 45 | 46 | class Stack: 47 | def __init__(self) -> None: 48 | super().__init__() 49 | self._mols: list[Molecule] = [] 50 | self._rxns: list[Reaction | None] = [] 51 | self._tokens: list[_TokenType] = [] 52 | self._stack: list[set[Molecule]] = [] 53 | 54 | @property 55 | def mols(self) -> tuple[Molecule, ...]: 56 | return tuple(self._mols) 57 | 58 | @property 59 | def rxns(self) -> tuple[Reaction | None, ...]: 60 | return tuple(self._rxns) 61 | 62 | @property 63 | def tokens(self) -> tuple[_TokenType, ...]: 64 | return tuple(self._tokens) 65 | 66 | def get_top(self) -> set[Molecule]: 67 | return self._stack[-1] 68 | 69 | def get_second_top(self) -> set[Molecule]: 70 | return self._stack[-2] 71 | 72 | def get_third_top(self) -> set[Molecule]: 73 | return self._stack[-3] 74 | 75 | def push_mol(self, mol: Molecule, index: int) -> None: 76 | self._mols.append(mol) 77 | self._rxns.append(None) 78 | self._tokens.append((-1, index)) 79 | self._stack.append({mol}) 80 | 81 | def push_rxn(self, rxn: Reaction, index: int, product_limit: int | None = None, product_template: str | None = None) -> bool: 82 | if len(self._stack) < rxn.num_reactants: 83 | return False 84 | 85 | prods: list[Molecule] = [] 86 | if rxn.num_reactants == 1: 87 | for r in self.get_top(): 88 | prods += rxn([r]) 89 | elif rxn.num_reactants == 2: 90 | for 
r1, r2 in itertools.product(self.get_top(), self.get_second_top()): 91 | if product_limit is not None and len(prods) >= product_limit: 92 | break 93 | prods += rxn([r1, r2]) + rxn([r2, r1]) 94 | elif rxn.num_reactants == 3: 95 | for r1, r2, r3 in itertools.product(self.get_top(), self.get_second_top(), self.get_third_top()): 96 | if product_limit is not None and len(prods) >= product_limit: 97 | break 98 | prods += ( 99 | rxn([r1, r2, r3]) 100 | + rxn([r1, r3, r2]) 101 | + rxn([r2, r1, r3]) 102 | + rxn([r2, r3, r1]) 103 | + rxn([r3, r2, r1]) 104 | + rxn([r3, r1, r2]) 105 | ) 106 | else: 107 | return False 108 | 109 | if len(prods) == 0: 110 | return False 111 | prod_dict = {m.smiles: m for m in prods} 112 | if product_template is not None: 113 | prod_sorted = sort_by_similarity(product_template, list(prod_dict.keys())) 114 | prod_sorted = [prod_dict[p] for p in prod_sorted] 115 | else: 116 | prod_sorted = prods 117 | if product_limit is not None: 118 | prod_sorted = prod_sorted[:product_limit] 119 | prod: Molecule = prod_sorted[0] if product_template is not None else random.choices(prod_sorted, weights=[len(p.smiles) for p in prod_sorted])[0] 120 | 121 | self._mols.append(prod) # need to look into this step for why there's wrong molecules, the prod here is not necessarily the product of the reaction 122 | self._rxns.append(rxn) 123 | self._tokens.append((rxn.num_reactants, index)) 124 | for _ in range(rxn.num_reactants): 125 | self._stack.pop() 126 | self._stack.append(set([prod])) 127 | return True 128 | 129 | def get_tree(self) -> _Node: 130 | stack: list[_Node] = [] 131 | for i in range(len(self._tokens)): 132 | token = self._tokens[i] 133 | n_react = token[0] 134 | if n_react > 0: 135 | item = _Node(self._mols[i], self._rxns[i], token, []) 136 | for _ in range(n_react): 137 | item.children.append(stack.pop()) 138 | stack.append(item) 139 | else: 140 | stack.append(_Node(self._mols[i], self._rxns[i], token, [])) 141 | return stack[-1] 142 | 143 | def get_postfix_tokens(self) -> tuple[_TokenType, ...]: 144 | return tuple(self._tokens) 145 | 146 | def __len__(self) -> int: 147 | return len(self._mols) 148 | 149 | def __getitem__(self, index: int) -> Molecule: 150 | return self._mols[index] 151 | 152 | def get_mol_idx_seq(self) -> list[int | None]: 153 | return [t[1] if t[0] == 0 else None for t in self.tokens] 154 | 155 | def get_rxn_idx_seq(self) -> list[int | None]: 156 | return [t[1] if t[0] > 0 else None for t in self.tokens] 157 | 158 | def count_reactions(self) -> int: 159 | cnt = 0 160 | for rxn in self._rxns: 161 | if rxn is not None: 162 | cnt += 1 163 | return cnt 164 | 165 | def get_state_repr(self) -> str: 166 | rl: list[str] = [] 167 | for s in self._stack: 168 | sl = list(map(lambda m: m.smiles, s)) 169 | sl.sort() 170 | rl.append(",".join(sl)) 171 | return ";".join(rl) 172 | 173 | def get_action_string(self, delim: str = ";") -> str: 174 | tokens: list[str] = [] 175 | for mol, (num_reactants, idx) in zip(self._mols, self._tokens): 176 | if num_reactants == -1: 177 | tokens.append(f"{mol.smiles}") 178 | else: 179 | tokens.append(f"R{idx}_{num_reactants}") 180 | tokens.append(f"{mol.smiles}") 181 | return delim.join(tokens) 182 | 183 | def get_source(self, delim: str = ";") -> str: 184 | return delim.join([mol.source for mol in self._mols]) 185 | 186 | def create_init_stack(matrix: ReactantReactionMatrix, rxn_count: dict[str, int], weighted_ratio: float = 0.0, prob_u_fp: str = None) -> Stack: 187 | stack = Stack() 188 | prob_u = np.ones(len(matrix.reactant_count)) / 
len(matrix.reactant_count) 189 | if prob_u_fp is not None: 190 | with open(prob_u_fp, "r") as f: 191 | prob_u = np.array(list(map(float, f.read().splitlines())), dtype=np.float32) 192 | prob_u = prob_u / prob_u.sum() 193 | prob_w = list(rxn_count.values()) 194 | prob_w = np.array([prob_w[i] if i in matrix.seed_reaction_indices else 0 for i in range(len(matrix.reactant_count))], dtype=np.float32) 195 | prob_w = prob_w / prob_w.sum() 196 | prob = weighted_ratio * prob_w + (1 - weighted_ratio) * prob_u 197 | prob = prob / np.sum(prob) 198 | rxn_index: int = np.random.choice(np.arange(len(matrix.reactions)), p=prob) 199 | rxn_col = matrix.matrix[:, rxn_index] 200 | rxn = matrix.reactions[rxn_index] 201 | 202 | if rxn.num_reactants == 2: 203 | m1 = np.random.choice(np.bitwise_and(rxn_col, 0b01).nonzero()[0]) 204 | m2 = np.random.choice(np.bitwise_and(rxn_col, 0b10).nonzero()[0]) 205 | if random.randint(0, 1) % 2 == 1: 206 | m1, m2 = m2, m1 207 | stack.push_mol(matrix.reactants[m1], m1) 208 | stack.push_mol(matrix.reactants[m2], m2) 209 | elif rxn.num_reactants == 1: 210 | m = np.random.choice(rxn_col.nonzero()[0]) 211 | stack.push_mol(matrix.reactants[m], m) 212 | elif rxn.num_reactants == 3: 213 | m1 = np.random.choice(np.bitwise_and(rxn_col, 0b001).nonzero()[0]) 214 | m2 = np.random.choice(np.bitwise_and(rxn_col, 0b010).nonzero()[0]) 215 | m3 = np.random.choice(np.bitwise_and(rxn_col, 0b100).nonzero()[0]) 216 | m1, m2, m3 = random.sample([m1, m2, m3], 3) 217 | stack.push_mol(matrix.reactants[m1], m1) 218 | stack.push_mol(matrix.reactants[m2], m2) 219 | stack.push_mol(matrix.reactants[m3], m3) 220 | 221 | stack.push_rxn(rxn, rxn_index) 222 | return stack 223 | 224 | 225 | def expand_stack(stack: Stack, matrix: ReactantReactionMatrix): 226 | matches = matrix.reactions.match_reactions(random.choice(list(stack.get_top()))) 227 | if len(matches) == 0: 228 | return stack, False 229 | rxn_index = random.choice(list(matches.keys())) 230 | reactant_flag = 1 << matches[rxn_index][0] 231 | 232 | rxn_col = matrix.matrix[:, rxn_index] 233 | if np.any(rxn_col >= 4): 234 | # Case of tri-mol reaction 235 | all_reactants = 0b111 236 | remaining_reactants = all_reactants ^ reactant_flag 237 | reactant_1 = remaining_reactants & 0b001 # Isolate the 001 bit 238 | reactant_2 = remaining_reactants & 0b010 # Isolate the 010 bit 239 | reactant_3 = remaining_reactants & 0b100 # Isolate the 100 bit 240 | valid_reactants = [reactant for reactant in [reactant_1, reactant_2, reactant_3] if reactant != 0] 241 | s_indices_1 = np.logical_and(rxn_col != 0, (rxn_col & valid_reactants[0]) == valid_reactants[0]).nonzero()[0] 242 | s_indices_2 = np.logical_and(rxn_col != 0, (rxn_col & valid_reactants[1]) == valid_reactants[1]).nonzero()[0] 243 | s_indices_1, s_indices_2 = random.sample([s_indices_1, s_indices_2], 2) 244 | s_index1 = np.random.choice(s_indices_1) 245 | stack.push_mol(matrix.reactants[s_index1], s_index1) 246 | s_index2 = np.random.choice(s_indices_2) 247 | stack.push_mol(matrix.reactants[s_index2], s_index2) 248 | rxn_success = stack.push_rxn(matrix.reactions[rxn_index], rxn_index) 249 | else: 250 | s_indices = np.logical_and(rxn_col != 0, rxn_col != reactant_flag).nonzero()[0] 251 | if len(s_indices) == 0: 252 | stack.push_rxn(matrix.reactions[rxn_index], rxn_index) 253 | return stack, True 254 | s_index = np.random.choice(s_indices) 255 | stack.push_mol(matrix.reactants[s_index], s_index) 256 | rxn_success = stack.push_rxn(matrix.reactions[rxn_index], rxn_index) 257 | if not rxn_success: 258 | # pop the last 
pushed molecule and end the reaction 259 | stack._mols.pop() 260 | stack._rxns.pop() 261 | stack._tokens.pop() 262 | stack._stack.pop() 263 | return stack, rxn_success 264 | 265 | 266 | def create_stack( 267 | matrix: ReactantReactionMatrix, 268 | rxn_count: dict[str, int], 269 | max_num_reactions: int = 5, 270 | max_num_atoms: int = 80, 271 | init_stack_weighted_ratio: float = 0.0, 272 | prob_u_fp: str = None, 273 | ) -> Stack: 274 | stack = create_init_stack(matrix, rxn_count=rxn_count, weighted_ratio=init_stack_weighted_ratio, prob_u_fp=prob_u_fp) 275 | for _ in range(1, max_num_reactions): 276 | stack, changed = expand_stack(stack, matrix) 277 | if not changed: 278 | break 279 | if max(map(lambda m: m.num_atoms, stack.get_top())) > max_num_atoms: 280 | break 281 | return stack 282 | 283 | 284 | def create_stack_step_by_step( 285 | matrix: ReactantReactionMatrix, 286 | rxn_count: dict[str, int], 287 | max_num_reactions: int = 5, 288 | max_num_atoms: int = 80, 289 | init_stack_weighted_ratio: float = 0.0, 290 | prob_u_fp: str = None, 291 | ) -> Iterable[Stack]: 292 | stack = create_init_stack(matrix, rxn_count=rxn_count, weighted_ratio=init_stack_weighted_ratio, prob_u_fp=prob_u_fp) 293 | yield stack 294 | for _ in range(1, max_num_reactions): 295 | stack, changed = expand_stack(stack, matrix) 296 | if changed: 297 | yield stack 298 | else: 299 | break 300 | if max(map(lambda m: m.num_atoms, stack.get_top())) > max_num_atoms: 301 | break -------------------------------------------------------------------------------- /synllama/llm/parallel_inference.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import json, pickle, argparse, os 3 | import multiprocessing as mp 4 | from synllama.llm.vars import * 5 | from tqdm import tqdm 6 | from transformers import AutoTokenizer, AutoModelForCausalLM 7 | 8 | instruction = TEMPLATE["instruction"] 9 | input_template = TEMPLATE["input"] 10 | 11 | def generate_text(smiles, tokenizer, model, stopping_ids, sampling_params, max_length=1600): 12 | input = input_template.replace("SMILES_STRING", smiles) 13 | prompt_complete = "### Instruction:\n" + instruction + "\n\n### Input:\n"+ input + "\n\n### Response: \n" 14 | inputs = tokenizer(prompt_complete, return_tensors="pt").to(model.device) 15 | prompt_length = inputs.input_ids.shape[1] 16 | 17 | generated_texts = [] 18 | 19 | for params in sampling_params: 20 | temp = params["temp"] 21 | top_p = params["top_p"] 22 | repeat = params["repeat"] 23 | with torch.no_grad(): 24 | outputs = model.generate( 25 | **inputs, 26 | max_new_tokens=max_length, 27 | do_sample=True, 28 | temperature=temp, 29 | top_p=top_p, 30 | num_return_sequences=repeat, 31 | eos_token_id=stopping_ids, 32 | pad_token_id=tokenizer.eos_token_id 33 | ) 34 | for output in outputs: 35 | generated_text = tokenizer.decode(output[prompt_length:], skip_special_tokens=True) 36 | generated_texts.append(generated_text.strip()) 37 | 38 | return generated_texts 39 | 40 | def process_batch(args): 41 | gpu_id, model_path, smiles_batch, sampling_params = args 42 | device = f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu' 43 | 44 | tokenizer = AutoTokenizer.from_pretrained(model_path) 45 | model = AutoModelForCausalLM.from_pretrained( 46 | model_path, 47 | torch_dtype=torch.float16 if 'cuda' in device else torch.float32, 48 | device_map={'': device} 49 | ) 50 | 51 | stopping_ids = [ 52 | tokenizer.eos_token_id, 53 | tokenizer.convert_tokens_to_ids("<|eot_id|>"), 54 | ] 55 | 56 | results = {} 57 
|     for smiles in tqdm(smiles_batch, desc=f"Processing on {device.upper()}"):
58 |         try:
59 |             response = generate_text(smiles, tokenizer, model, stopping_ids, sampling_params)
60 |             json_responses = []
61 |             for r in response:
62 |                 try:
63 |                     json_responses.append(json.loads(r))
64 |                 except json.JSONDecodeError:
65 |                     json_responses.append("json format error")
66 |             results[smiles] = json_responses
67 |         except Exception as e:
68 |             results[smiles] = f"Error: {str(e)}"
69 | 
70 |     return results
71 | 
72 | def main(model_path, smiles_path, save_path, sampling_params, gpus=None):
73 |     with open(smiles_path, "r") as f:
74 |         smiles_list = [line.strip() for line in f]
75 | 
76 |     num_gpus = torch.cuda.device_count() if gpus is None else gpus
77 |     print(f"Number of GPUs to use: {num_gpus}")
78 | 
79 |     if num_gpus > 1:
80 |         pool = mp.Pool(num_gpus)
81 |         try:
82 |             # Split the SMILES list into one interleaved batch per GPU and process each in its own worker
83 |             batches = [smiles_list[i::num_gpus] for i in range(num_gpus)]
84 |             results = pool.map(process_batch, [(i, model_path, batch, sampling_params) for i, batch in enumerate(batches)])
85 | 
86 |             # Combine results from all GPUs
87 |             combined_results = {}
88 |             for r in results:
89 |                 combined_results.update(r)
90 | 
91 |         finally:
92 |             # Ensure pool cleanup happens even if an error occurs
93 |             pool.close()      # Prevent any more tasks from being submitted
94 |             pool.join()       # Wait for all worker processes to finish
95 |             pool.terminate()  # Redundant once join() has returned; kept as a defensive cleanup
96 |     else:
97 |         # If only one GPU is available, process all SMILES in the current process
98 |         combined_results = process_batch((0, model_path, smiles_list, sampling_params))
99 | 
100 |     # Save results
101 |     with open(save_path, "wb") as f:
102 |         pickle.dump(combined_results, f)
103 | 
104 |     # Also return the results for programmatic use
105 |     return combined_results
106 | 
107 | if __name__ == "__main__":
108 |     parser = argparse.ArgumentParser(description="Run inference pipeline for reaction prediction")
109 |     parser.add_argument("--model_path", type=str, help="Path to the model", default="data/model/SynLlama-1B-2M")
110 |     parser.add_argument("--smiles_path", type=str, help="Path to the SMILES file")
111 |     parser.add_argument("--save_path", type=str, help="Pickle file path to save the results", default=None)
112 |     parser.add_argument("--sample_mode", type=str, default=None, help="Sampling mode, choose from: greedy, frugal, frozen_only, low_only, medium_only, high_only")
113 |     parser.add_argument("--temp", type=float, default=None, help="Sampling temperature (required if --sample_mode is not given)")
114 |     parser.add_argument("--top_p", type=float, default=None, help="Top-p (nucleus sampling) cutoff (required if --sample_mode is not given)")
115 |     parser.add_argument("--repeat", type=int, default=None, help="Number of generations per SMILES (required if --sample_mode is not given)")
116 |     parser.add_argument("--gpus", type=int, default=None, help="Number of GPUs to use; defaults to all available GPUs")
117 |     args = parser.parse_args()
118 |     mp.set_start_method('spawn', force=True)
119 |     if args.save_path is None:
120 |         args.save_path = args.smiles_path.replace(".smi", "_results.pkl")
121 |     directory = os.path.dirname(args.save_path)
122 |     if directory: os.makedirs(directory, exist_ok=True)  # guard against an empty dirname when saving to the cwd
123 |     sample_mode_mapping = {
124 |         "greedy": sampling_params_greedy,
125 |         "frugal": sampling_params_frugal,
126 |         "frozen_only": sampling_params_frozen_only,
127 |         "low_only": sampling_params_low_only,
128 |         "medium_only": sampling_params_medium_only,
129 |         "high_only": sampling_params_high_only
130 |     }
131 |     if args.sample_mode is None:
132 |         assert args.temp is not None and args.top_p is not None and args.repeat is not None, "Please provide either a sample mode or all three sampling parameters (--temp, --top_p, --repeat)"
133 |         sampling_params = [
134 |             {"temp": args.temp, "top_p": args.top_p, "repeat": args.repeat}
135 |         ]
136 |     else:
137 |         assert args.sample_mode in sample_mode_mapping, f"Invalid sample mode: {args.sample_mode}"
138 |         sampling_params = sample_mode_mapping[args.sample_mode]
139 | 
140 |     main(args.model_path, args.smiles_path, args.save_path, sampling_params=sampling_params, gpus=args.gpus)
--------------------------------------------------------------------------------
/synllama/llm/sft/synllama_sft.yml:
--------------------------------------------------------------------------------
1 | base_model: meta-llama/Llama-3.2-1B-Instruct
2 | model_type: LlamaForCausalLM
3 | tokenizer_type: AutoTokenizer
4 | is_llama_derived_model: true
5 | 
6 | load_in_8bit: true
7 | load_in_4bit: false
8 | strict: false
9 | 
10 | datasets:
11 |   - path: CHANGE_TO_YOUR_DATASET_PATH (prepared dataset path in jsonl format)
12 |     ds_type: json
13 |     type: alpaca
14 | dataset_prepared_path: CHANGE_TO_YOUR_DATASET_PATH (file path to save the prepared dataset)
15 | val_set_size: 0.05
16 | output_dir: CHANGE_TO_YOUR_OUTPUT_PATH (file path to save the outputs)
17 | 
18 | sequence_len: 2048
19 | sample_packing: true
20 | eval_sample_packing: true
21 | pad_to_sequence_len: true
22 | 
23 | overrides_of_model_config:
24 |   rope_scaling:
25 |     factor: 1.0
26 |     low_freq_factor: 1.0
27 |     high_freq_factor: 4.0
28 |     original_max_position_embeddings: 8192
29 |     rope_type: llama3
30 | 
31 | adapter: lora
32 | lora_model_dir:
33 | lora_r: 32
34 | lora_alpha: 16
35 | lora_dropout: 0.05
36 | lora_target_linear: true
37 | lora_fan_in_fan_out:
38 | 
39 | wandb_mode:
40 | wandb_project:
41 | wandb_entity:
42 | wandb_run_id:
43 | wandb_watch:
44 | wandb_log_model:
45 | wandb_name:
46 | 
47 | gradient_accumulation_steps: 4
48 | micro_batch_size: 4
49 | num_epochs: 1
50 | optimizer: adamw_bnb_8bit
51 | lr_scheduler: cosine
52 | learning_rate: 0.0002
53 | 
54 | train_on_inputs: false
55 | group_by_length: false
56 | bf16: auto
57 | fp16:
58 | tf32: false
59 | 
60 | gradient_checkpointing: true
61 | early_stopping_patience:
62 | resume_from_checkpoint:
63 | local_rank:
64 | logging_steps: 1
65 | xformers_attention:
66 | flash_attention: true
67 | s2_attention:
68 | 
69 | warmup_steps: 10
70 | evals_per_epoch: 4
71 | eval_table_size:
72 | eval_table_max_new_tokens: 128
73 | saves_per_epoch: 1
74 | debug:
75 | deepspeed:
76 | weight_decay: 0.0
77 | fsdp:
78 | fsdp_config:
79 | special_tokens:
80 |   pad_token: <|finetune_right_pad_id|>
--------------------------------------------------------------------------------
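The YAML above follows axolotl's fine-tuning config format (a LoRA adapter on `meta-llama/Llama-3.2-1B-Instruct`, loaded in 8-bit). As a minimal sketch of how such a config is typically run — assuming axolotl is installed and the `CHANGE_TO_YOUR_*` placeholders have been filled in; the project's authoritative steps are in the Retraining Guide:

```bash
# Sketch only: assumes axolotl is installed in the active environment and the
# dataset/output placeholders in synllama_sft.yml point at real paths.

# Tokenize and pack the alpaca-format JSONL dataset declared under `datasets:`.
python -m axolotl.cli.preprocess synllama/llm/sft/synllama_sft.yml

# Launch LoRA fine-tuning; adapter checkpoints are written to `output_dir`.
accelerate launch -m axolotl.cli.train synllama/llm/sft/synllama_sft.yml
```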
/synllama/llm/vars.py:
--------------------------------------------------------------------------------
1 | TEMPLATE = {
2 |     "instruction": "You are an expert synthetic organic chemist. Your task is to design a synthesis pathway for a given target molecule using common and reliable reaction templates and building blocks. Follow these instructions:\n\n1. **Input the SMILES String:** Read in the SMILES string of the target molecule and identify common reaction templates that can be applied.\n\n2. **Decompose the Target Molecule:** Use the identified reaction templates to decompose the target molecule into different intermediates.\n\n3. **Check for Building Blocks:** For each intermediate:\n - Identify if it is a building block. If it is, wrap it in and tags and save it for later use.\n - If it is not a building block, apply additional reaction templates to further decompose it into building blocks.\n\n4. **Document Reactions:** For each reaction documented in the output, wrap the reaction template in and tags.\n\n5. **Repeat the Process:** Continue this process until all intermediates are decomposed into building blocks, and document each step clearly in a structured JSON format.",
3 |     "input": "Provide a synthetic pathway for this SMILES string: SMILES_STRING",
4 |     "output": "{\"reactions\": [REACTIONS], \"building_blocks\": [BUILDING_BLOCKS]}"
5 | }
6 | BB_BASE = "\"Building_Block\""
7 | REACTION_BASE_MAX2 = "{\"reaction_number\": REACTION_NUM, \"reaction_template\": \"RXN_TEMPLATE\", \"reactants\": [\"REACTANT1\", \"REACTANT2\"], \"product\": \"PRODUCT\"}"
8 | REACTION_BASE_MAX3 = "{\"reaction_number\": REACTION_NUM, \"reaction_template\": \"RXN_TEMPLATE\", \"reactants\": [\"REACTANT1\", \"REACTANT2\", \"REACTANT3\"], \"product\": \"PRODUCT\"}"
9 | 
10 | sampling_params_frugal = [
11 |     {"temp": 0.1, "top_p": 0.1, "repeat": 1, "name": "frozen"},
12 |     {"temp": 0.6, "top_p": 0.5, "repeat": 1, "name": "low"},
13 |     {"temp": 1.0, "top_p": 0.7, "repeat": 1, "name": "medium"},
14 |     {"temp": 1.5, "top_p": 0.9, "repeat": 1, "name": "high"}
15 | ]
16 | 
17 | sampling_params_greedy = [
18 |     {"temp": 0.1, "top_p": 0.1, "repeat": 1, "name": "frozen"},
19 |     {"temp": 0.6, "top_p": 0.5, "repeat": 2, "name": "low"},
20 |     {"temp": 1.0, "top_p": 0.7, "repeat": 3, "name": "medium"},
21 |     {"temp": 1.5, "top_p": 0.9, "repeat": 4, "name": "high"}
22 | ]
23 | 
24 | sampling_params_frozen_only = [
25 |     {"temp": 0.1, "top_p": 0.1, "repeat": 1, "name": "frozen"}
26 | ]
27 | 
28 | sampling_params_low_only = [
29 |     {"temp": 0.6, "top_p": 0.5, "repeat": 5, "name": "low"}
30 | ]
31 | 
32 | sampling_params_medium_only = [
33 |     {"temp": 1.0, "top_p": 0.7, "repeat": 5, "name": "medium"}
34 | ]
35 | 
36 | sampling_params_high_only = [
37 |     {"temp": 1.5, "top_p": 0.9, "repeat": 5, "name": "high"}
38 | ]
--------------------------------------------------------------------------------
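Taken together, `vars.py` and `parallel_inference.py` expose two ways to control generation: a named `--sample_mode` preset (each entry pairs a temperature/top-p setting with a repeat count), or explicit `--temp`, `--top_p`, and `--repeat` values, all three of which are required when no preset is named. A minimal sketch of the explicit form — the file paths are placeholders, and the values mirror the `medium_only` preset:

```bash
# Sketch only: model/SMILES paths are placeholders. Values reproduce the
# `medium_only` preset (temp 1.0, top_p 0.7, 5 generations per target SMILES).
# With --save_path omitted, results are written to path/to/targets_results.pkl.
python synllama/llm/parallel_inference.py \
    --model_path path/to/SynLlama-model \
    --smiles_path path/to/targets.smi \
    --temp 1.0 --top_p 0.7 --repeat 5 \
    --gpus 2
```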