├── LICENSE ├── README.md ├── assets ├── docs │ ├── inference_guide.md │ └── retraining_guide.md └── toc.png ├── data ├── 115_rxns │ └── 115_rxn_templates.txt ├── 91_rxns │ └── 91_rxn_templates.txt └── smiles_vocab.txt ├── environment.yml ├── pyproject.toml ├── setup.py ├── steps ├── step_10_calc_embedding.py ├── step_11_generate_fpindex_smiles_tfidf.py ├── step_20_generate_reactions.py ├── step_30_0_benchmark_filter_raw_output.py ├── step_30_1_molport_raw_reconstruct.py ├── step_31_enamine_reconstruct.py └── step_32_combined_stats.py └── synllama ├── __init__.py ├── chem ├── __init__.py ├── base.py ├── fpindex.py ├── matrix.py ├── mol.py ├── reaction.py ├── smiles_tfidf.py └── stack.py └── llm ├── parallel_inference.py ├── sft └── synllama_sft.yml └── vars.py /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright ©2025 The Regents of the University of California (Regents). All Rights Reserved. Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 408, Berkeley, CA 94704-1362, otl@berkeley.edu. 2 | 3 | IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 4 | 5 | REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models 🧬 2 | [![License](https://img.shields.io/badge/License-UC%20Regents-blue.svg)](LICENSE) 3 | [![Arxiv](https://img.shields.io/badge/Arxiv-2503.12602-red.svg)](https://arxiv.org/abs/2503.12602) 4 | 5 | ## 📖 Overview 6 | ![SynLlama](assets/toc.png) 7 | SynLlama is a fine-tuned version of Meta's Llama 3 large language models that generates synthesizable analogs of small molecules by creating full synthetic pathways from commonly accessible building blocks and robust organic reaction templates. It offers a valuable tool for drug discovery, with strong performance in bottom-up synthesis, synthesizable analog generation, and hit expansion. 8 | 9 | ## 💡 Usage 10 | 11 | ### Prerequisites 12 | Ensure you have `conda` installed on your system. All additional dependencies will be managed via the `environment.yml` file. 13 | 14 | ### Installation 15 | To get started with SynLlama, follow these steps: 16 | ```bash 17 | git clone https://github.com/THGLab/SynLlama 18 | cd SynLlama 19 | conda env create -f environment.yml 20 | conda activate synllama 21 | pip install -e . 
22 | ``` 23 | 24 | ### Inference 25 | To perform inference with the trained SynLlama models, download them and the relevant files from [here](https://figshare.com/s/39a37d31cea2c190498d) and follow the instructions in the [Inference Guide](assets/docs/inference_guide.md). 26 | 27 | ### Retraining 28 | If you are interested in retraining the model, please refer to the [Retraining Guide](assets/docs/retraining_guide.md) for detailed instructions. 29 | 30 | ## 📄 License 31 | This project is licensed under a UC Regents license that permits educational, research, and not-for-profit use - see the [LICENSE](LICENSE) file for details. 32 | 33 | ## 🙏 Acknowledgments 34 | This project is built on top of the [ChemProjector Repo](https://github.com/luost26/ChemProjector). We thank the authors for building such a user-friendly GitHub repository! 35 | 36 | ## 📝 Citation 37 | If you use this code in your research, please cite: 38 | 39 | ```bibtex 40 | @misc{sun_synllama_2025, 41 | title = {SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models}, 42 | url = {http://arxiv.org/abs/2503.12602}, 43 | doi = {10.48550/arXiv.2503.12602}, 44 | publisher = {arXiv}, 45 | author = {Sun, Kunyang and Bagni, Dorian and Cavanagh, Joseph M. and Wang, Yingze and Sawyer, Jacob M. and Gritsevskiy, Andrew and Head-Gordon, Teresa}, 46 | month = mar, 47 | year = {2025} 48 | } 49 | ``` 50 | -------------------------------------------------------------------------------- /assets/docs/inference_guide.md: -------------------------------------------------------------------------------- 1 | ## 🔍 Inference Guide 2 | 3 | After downloading the trained models and relevant files from the [figshare link](https://figshare.com/s/39a37d31cea2c190498d), you can use the following examples to perform inference. 4 | 5 | In the downloaded folder, you will find the following folders under the `inference` sub-directory: 6 | 7 | - `model`: The trained SynLlama models. 8 | - `reconstruction`: The necessary reaction embeddings for the reconstruction algorithm. 9 | - `smiles`: The `.smi` files containing the SMILES strings used in the paper. 10 | 11 | If you want to perform **synthesis planning tasks**, please follow *all the steps below*. 12 | 13 | If you want to perform just the **synthesizable analog search** or **hit expansion** tasks, please *only* follow the steps of "🦙 LLM inference using the trained SynLlama models" and "📝 Reconstruction algorithm using exclusively Enamine BBs". 14 | 15 | ### 🦙 LLM inference using the trained SynLlama models 16 | 17 | In the `model` folder, you will find the following trained models: 18 | 19 | - `SynLlama-1B-2M-91rxns`: The trained model for SynLlama-1B-2M using RXN Set 1. 20 | - `SynLlama-1B-2M-115rxns`: The trained model for SynLlama-1B-2M using RXN Set 2. 21 | 22 | You can choose one of the models to perform inference. Here we use the `synllama-data/inference/model/SynLlama-1B-2M-91rxns` model and the file `synllama-data/inference/smiles/syn-planning/1k_chembl.smi` as an example. 23 | 24 | ```bash 25 | cd SynLlama 26 | python synllama/llm/parallel_inference.py \ 27 | --model_path synllama-data/inference/model/SynLlama-1B-2M-91rxns \ 28 | --smiles_path synllama-data/inference/smiles/syn-planning/1k_chembl.smi \ 29 | --save_path synllama-data/inference/results/temp/1k_chembl_synllama_91rxns.pkl \ 30 | --sample_mode greedy 31 | ``` 32 | 33 | This will generate a `.pkl` file containing the inference results in the `synllama-data/inference/results/temp` folder using the 'greedy' sampling mode. 
All the details of the `sample_mode` options can be found in the `synllama/llm/parallel_inference.py` file and the Supplementary Information. 34 | 35 | ### 🔄 SynLlama raw reconstruction and Molport search 36 | 37 | In the synthesis planning task, we report that SynLlama models can generate New BBs beyond the training Enamine BBs. To validate the purchasability of the generated BBs, we first sample and filter all the New BBs that are part of valid synthetic pathways, and then use Molport to search whether the New BBs are purchasable. 38 | 39 | To do so, we first need to filter the raw results for Molport search. Assuming that you have already generated the raw results with the `SynLlama-1B-2M-91rxns` model and saved them in the `synllama-data/inference/results/temp/1k_chembl_synllama_91rxns.pkl` file, you can use the following command to filter the raw results for Molport search. 40 | 41 | ```bash 42 | cd SynLlama 43 | python steps/step_30_0_benchmark_filter_raw_output.py \ 44 | --llama_folder synllama-data/inference/results/temp/ \ 45 | --rxn_mapping_path synllama-data/inference/reconstruction/91rxns/rxn_embeddings/reaction_smarts_map.pkl \ 46 | --fp_searcher_path synllama-data/inference/reconstruction/91rxns/processed/fpindex.pkl \ 47 | --raw_output_only 48 | ``` 49 | 50 | This will generate a `synllama_reconstruct` folder in the `synllama-data/inference/results/temp` folder, which contains the filtered results for Molport search. To conduct the Molport search, please follow this [link](https://www.molport.com/shop/swl-step-1) and paste all the SMILES strings in the `*_successful_bbs_not_in_enamine_1.txt` file into the List Search. Further details can be found in the Supplementary Information. 51 | 52 | Once finished, download the `.xls` file under the `Selected Items` column from the Molport List Search History page, and name it `*_molport_ls.xls`. Place this file in the `synllama-data/inference/results/temp/synllama_reconstruct` folder for the next filtering step, which outputs CSV files containing successful synthetic pathways with SynLlama raw outputs. 53 | 54 | ```bash 55 | cd SynLlama 56 | python steps/step_30_1_molport_raw_reconstruct.py \ 57 | --llama_folder synllama-data/inference/results/temp/ 58 | ``` 59 | 60 | ### 📝 Reconstruction algorithm using exclusively Enamine BBs 61 | 62 | In cases where SynLlama's raw outputs fail to generate a full synthetic pathway for the target molecule, or where analog generation is the desired task, we use the reconstruction algorithm with exclusively Enamine BBs to generate synthetic pathways for the target molecule and its close analogs. Using the same example as above, you can run the following command to generate the synthetic pathways for the target molecule and its close analogs. 63 | 64 | ```bash 65 | cd SynLlama 66 | python steps/step_31_enamine_reconstruct.py \ 67 | --llama_folder synllama-data/inference/results/temp/ \ 68 | --embedding_path synllama-data/inference/reconstruction/91rxns/rxn_embeddings \ 69 | --total_num_mols 1000 # change this to the total number of molecules you want to generate 70 | ``` 71 | 72 | Note that the `llama_folder` should be the folder that contains the `.pkl` file generated by the LLM inference step. If you are doing analog generation, you might want to increase `k` and `n_stacks` to 10 and 50, respectively, for a more thorough search (see the sketch below). Please refer to Supplementary Information Table S1 for more details. 
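For analog generation, the call might look like the following sketch. Note that the exact flag names for `k` and `n_stacks` are assumptions here; check `python steps/step_31_enamine_reconstruct.py --help` for the authoritative argument names.

```bash
cd SynLlama
# Hypothetical analog-generation run; --k and --n_stacks are assumed flag
# names for the k and n_stacks parameters mentioned above.
python steps/step_31_enamine_reconstruct.py \
    --llama_folder synllama-data/inference/results/temp/ \
    --embedding_path synllama-data/inference/reconstruction/91rxns/rxn_embeddings \
    --total_num_mols 1000 \
    --k 10 \
    --n_stacks 50
```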
73 | 74 | ### 🧩 Putting Everything Together 75 | 76 | If you are doing synthesis planning, you can use the following command to combine the raw reconstruction and Enamine reconstruction results. 77 | 78 | ```bash 79 | cd SynLlama 80 | python steps/step_32_combined_stats.py \ 81 | --llama_folder synllama-data/inference/results/temp/ \ 82 | --total_num_mols 1000 # change this to the total number of molecules you want to generate 83 | ``` 84 | 85 | This will generate a `combined_final_stats.csv` file in the `synllama-data/inference/results/temp` folder, which contains the combined statistics of the raw reconstruction and Enamine reconstruction results. 86 | 87 | 88 | 89 | -------------------------------------------------------------------------------- /assets/docs/retraining_guide.md: -------------------------------------------------------------------------------- 1 | ## 🔍 Retraining Guide 2 | 3 | This section provides a guide for retraining the SynLlama model. You will first need to generate your fine-tuning data by accessing Enamine BBs and then generating synthetic pathways. After that, you can perform supervised fine-tuning with the Axolotl package. 4 | 5 | ### 📦 Enamine Synthetic Pathway Generation 6 | 7 | **Step 1:** Since Enamine BBs are not publicly available, you will need to access them first. Please refer to the [Enamine BBs](https://enamine.net/building-blocks/building-blocks-catalog) and follow the necessary steps to create an account and download the BBs from the **US Stock**. After downloading the BBs, you can place the file under the `data/` directory and run the following command to prepare the BBs for the pathway generation. We still use the 91 reaction templates from the original SynLlama paper as an example: 8 | 9 | ```bash 10 | cd SynLlama 11 | python steps/step_10_calc_embedding.py \ 12 | --data_folder data/91_rxns \ 13 | --bb_path ENAMINE_FILE_PATH \ 14 | --rxn_template_path data/91_rxns/91_rxn_templates.txt \ 15 | --testing_data_path TESTING_DATA_PATH # replace ENAMINE_FILE_PATH with your downloaded BBs file path; omit --testing_data_path if you don't have a predefined testing .smi file 16 | ``` 17 | 18 | After this step, you will have a `data/91_rxns/processed` folder containing the reaction matrices for pathway generation. 19 | 20 | **Step 2: [Optional]** If you download the most recent Enamine BBs, you will have more than the 230k BBs specified in the paper. Therefore, you should recalculate all the reaction embeddings with your new BBs using the following command: 21 | 22 | ```bash 23 | python steps/step_11_generate_fpindex_smiles_tfidf.py \ 24 | --matrix_file data/91_rxns/processed/all/reaction_matrix.pkl \ 25 | --output_dir data/91_rxns/rxn_embeddings \ 26 | --token_list_path data/smiles_vocab.txt 27 | ``` 28 | 29 | After this step, you will have a `data/91_rxns/rxn_embeddings` folder containing the reaction embeddings for inference. In this case, you don't need to download the figshare data as specified in the [Inference Guide](inference_guide.md). 
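As a quick sanity check after Step 2, you can list the regenerated folder. `reaction_smarts_map.pkl` should be present, since later steps reference it explicitly; the other entries in the comment below are only illustrative guesses based on the script's name:

```bash
ls data/91_rxns/rxn_embeddings
# reaction_smarts_map.pkl          <- referenced by step_20 and the inference steps
# fingerprint-index / SMILES-TFIDF pickles  <- assumed outputs of step_11_generate_fpindex_smiles_tfidf.py
```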
30 | 31 | **Step 3:** Finally, you can generate your fine-tuning data in [alpaca format](https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/inst_tune.html#alpaca) with the following command: 32 | 33 | ```bash 34 | python steps/step_20_generate_reactions.py \ 35 | --matrix_path data/91_rxns/processed/train/reaction_matrix_train.pkl \ 36 | --rxn_mapping_path data/91_rxns/rxn_embeddings/reaction_smarts_map.pkl \ 37 | --num_reactions NUM_REACTIONS \ 38 | --name NAME # replace NUM_REACTIONS and NAME with your desired values; point --matrix_path at the testing/all reaction matrix file if needed 39 | ``` 40 | 41 | This step will generate a `data/NAME.jsonl` file containing the fine-tuning data, which will be used for the next step. 42 | 43 | ### 📦 Supervised Fine-Tuning (SFT) 44 | 45 | Here, we provide instructions to reproduce the fine-tuning results in the paper using a package called [Axolotl](https://github.com/axolotl-ai-cloud/axolotl). Axolotl is a user-friendly tool that simplifies the process of fine-tuning large language models. It provides: 46 | 47 | - Easy configuration through YAML files 48 | - Support for multiple model architectures 49 | - Efficient training with various optimization techniques 50 | - Comprehensive documentation and examples 51 | 52 | #### Installation 53 | 54 | For detailed instructions on fine-tuning the model, please refer to the [Axolotl repository](https://github.com/axolotl-ai-cloud/axolotl). We strongly recommend creating a separate conda environment for fine-tuning to avoid dependency conflicts. Please follow the installation and usage guides in the Axolotl repository to fine-tune the model for your specific needs. Make sure to activate your dedicated fine-tuning environment before proceeding to the following steps. 55 | 56 | #### Supervised Fine-Tuning 57 | 58 | Axolotl uses a configuration file to specify training parameters and data paths; we provide one as `synllama_sft.yml`. After generating your fine-tuning data following the previous steps, you'll need to update the provided [config file](../../synllama/llm/sft/synllama_sft.yml) with: 59 | 60 | - The path to your generated training data 61 | - The path to save the prepared dataset 62 | - The path to save the outputs 63 | - [Optional] The project name and run ID for logging ([wandb](https://wandb.ai/site)) 64 | 65 | Make sure to **review** and **modify** the provided [config file](../../synllama/llm/sft/synllama_sft.yml) according to your specific training requirements before proceeding with the fine-tuning process. 66 | 67 | **Step 1:** To preprocess the data before fine-tuning, run the following command: 68 | 69 | ```bash 70 | source activate axolotl # activate the fine-tuning environment 71 | CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.preprocess synllama_sft.yml 72 | ``` 73 | 74 | **Step 2:** To perform supervised fine-tuning with multiple GPUs, run the following command: 75 | 76 | ```bash 77 | source activate axolotl # activate the fine-tuning environment 78 | accelerate launch -m axolotl.cli.train synllama_sft.yml 79 | ``` 80 | 81 | **Step 3:** To merge the LoRA weights with the base model, run the following command: 82 | 83 | ```bash 84 | source activate axolotl # activate the fine-tuning environment 85 | python -m axolotl.cli.merge_lora synllama_sft.yml --lora_model_dir=CHANGE_TO_YOUR_OUTPUT_PATH 86 | ``` 87 | 88 | Once the merging is done, you can use the merged model for inference following the instructions in the [Inference Guide](inference_guide.md); a sketch of such a call is shown below. 
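As an example, the merged model can be dropped into the same inference command used in that guide. This is a minimal sketch, assuming Axolotl saved the merged weights to a `merged/` sub-directory of your output path (the exact location may vary between Axolotl versions), and using placeholder paths in the same style as the rest of this guide:

```bash
cd SynLlama
# MERGED_MODEL_PATH is a placeholder; point it at the directory written by
# axolotl.cli.merge_lora (often CHANGE_TO_YOUR_OUTPUT_PATH/merged).
python synllama/llm/parallel_inference.py \
    --model_path MERGED_MODEL_PATH \
    --smiles_path YOUR_SMILES_FILE.smi \
    --save_path YOUR_RESULTS_FOLDER/results.pkl \
    --sample_mode greedy
```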
-------------------------------------------------------------------------------- /assets/toc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/THGLab/SynLlama/5592cfc9d2338c6ebd7add7971c26b69e9aa1111/assets/toc.png -------------------------------------------------------------------------------- /data/115_rxns/115_rxn_templates.txt: -------------------------------------------------------------------------------- 1 | [#6:1][N:2]=[C:3]=[S:4].[F,Cl,Br,I][C:5][C;!$(C(=O)[N,O,S,F,Cl,Br,I]):6]=O.[NH2;$([N][#6]);!$([N]C=[O,S,N]):7]>>[#6:1][N:2]=[c:3]1[n:7][c:6][c:5][s:4]1 2 | [#6:1][N:2]=[C:3]=[S:4].[O:5]=[C:6]1[CH:7]=[C:8][C:9](=[O:10])[N:11]1.[NH2;$([N][#6]);!$([N]C=[O,S,N]):12]>>[#6:1][N:2]=[c:3]1[n:12][c:6]([O:5])[c:7]([C:8][C:9](=[O:10])[N:11])[s:4]1 3 | [NH2:1][NH:2][#6:3].[#6:4][CH:5]=O>>[#6:4][CH:5]=[N:1][N:2][#6:3] 4 | [$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):1][CH3,CH2&$([C]([#6])[#6]):2].[#6:4][CH:5]=O>>[*:1][C:2]=[CH:5][#6:4] 5 | [#6:1][N:2]=[C:3]=[S:4].[NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:5]>>[N:5][C:3](=[S:4])[N:2][#6:1] 6 | [#6:1][CH:2]=O.[c:3][C:4](=[O:5])[NH:6][CH2:7][C:8](=[O:9])[OH,O-]>>[#6:1][CH:2]=[C:7]1[C:8](=[O:9])[O:5][C:4]([c:3])=[N:6]1 7 | [N;$([NH](C=[O,S])[#6]),$([NH2](C=[O,S])):1][C:2](=[O,S:3])[N;$([NH](C=[O,S])[#6]),$([NH2](C=[O,S])):4].[c:5][CH:6]=O.[C;$([CH2](C=O)[#6]),$([CH3](C=O)):7][C:8](=O)[CH2:9][C:10](=[O:12])[$([O][C]),$([NH][C]),$([N]([C])[C]):11]>>[c:5][C:6]1[N:1][C:2](=[*:3])[N:4][C:8]([C:7])=[C:9]1[C:10](=[O:12])[*:11] 8 | [NH2;$(NC),$(Nc),$(NNC(=O)C):1].[C:3]=[C:4]-O[CH2][CH3]>>[C:3]=[C:4]-[NH:1] 9 | [NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:1].[#6:4][C:5](=[O:6])[OH,O-]>>[N:1][C:5](=[O:6])[#6:4] 10 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[#6:2][Cl,Br,I]>>[N:1][#6:2] 11 | [$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):1][CH2:2][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):3].[#6:4][CH:5]=O>>[*:1][C:2]([*:3])=[CH:5][#6:4] 12 | [NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);H1,H2:1].[#6:2][S:3](=[O:4])(=[O:5])[F,Cl,Br,I]>>[N:1][S:3](=[O:4])(=[O:5])[#6:2] 13 | [#6:1][NH:2][C:3](=[S:4])[NH:5][NH2:6].[#6:7][C:8](=O)[C:9]([#6:10])[Cl,Br,I]>>[#6:1][NH:2][C:3]1=[N:5]-[N:6]=[C:8]([#6:7])-[C:9]([#6:10])-[S:4]1 14 | [O]=[C;$([C]([C])[#6]),$([CH]([C])):1][C;H2,H3:2].[#6:3][C:4](=O)[c:5][c:6][NH2:7]>>[c:1]1[c:2][c:4]([#6:3])[c:5][c:6][n:7]1 15 | [#6:1][NH:2][NH2:3].O=[C;$(C(C)[#6]),$([CH][C]):4][C;H1,H2;!R:5][C;$(C(C)[#6]),$([CH][C]):6]=O>>[#6:1][n:2]1[n:3][c:6][c:5][c:4]1 16 | [#6:1][NH:2][NH2:3].O=[C;$(C(C)[#6]),$([CH][C]):4][#6;H1,H2;!R:5][#6:6][C;$(C(C)[#6]),$([CH][C]):7]=O>>[#6:1][#7H0+0:2]1[#7H0+0:3]=[#6:7][#6:6][#6:5]=[#6:4]1 17 | [#6,#1:1][NH:2][C:3](=[S:4])[NH2:5].[#6,H:6][N:7]1[C:8](=[O:9])[C:10]=[C:11][C:12]1=[O:13]>>[*:1][N:2]=[C:3]1[S:4][C:11]([C:10][C:8](=[O:9])[N:7][*:6])[C:12](=[O:13])[NH:5]1 18 | [#6:1][NH:2][NH2:3].O=[C:4][c:5]1[c:6][o:11][c:7][c:8][c:9]1=[O:10]>>[#6:1][N:2]1[N:3]=[C:4][C:5]([C:9](=[O:10])[c:8][c:7][OH:11])=[C:6]1 19 | [Cl,Br,I][CH2:1][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):2].O=[CH:3][c:4][c:5][OH:6]>>[*:2][c:1]1[c:3][c:4][c:5][O:6]1 20 | [#6:1][$(CO):2](=O)[O].[NH2:3][c:4][c:5][NH2,NH,SH,OH:6]>>[#6:1][c:2]1[nH0+0:3][c:4][c:5][*:6]1 21 | [#6:1][CH:2](=O).[NH2:3][c:4][c:5][NH2,NH,SH,OH:6]>>[#6:1][c:2]1[nH0+0:3][c:4][c:5][*:6]1 22 | [#6;!$(C=[S,O,N]):1][Cl,Br,I].[#6;!$(C=[S,O,N]):2][Cl,Br,I]>>[#6:1]S[#6:2] 23 | [OH;$(Oc):1].[C:2]1[O:4][C:3]1>>[O:1][C:2][C:3][OH:4] 24 | [OH;$(Oc):1].[C:4]=[C:5][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):6]>>[O:1][C:4][C:5][*:6] 25 | 
[NX3;!$(N[C]=[S,O,N]);H1,H2:1][c:2][c:3][C:4](=[O:5])[NX3;H1,H2:6].O=[CX3;!$(C(=O)[O,S,N,F,Cl,Br,I]):7]>>[N:1]1[c:2][c:3][C:4](=[O:5])[N:6][C:7]1 26 | [#6:1][NH2:2].([S:3]=[C:4]=[N;$([N][#6]):5].[#6;$([c][c]N=C=S),$([C][C]N=C=S),$([C]N=C=S):6][C:7](=[O:8])[O;$(O(C)C)])>>[NH:5][C:4](=[S:3])[N:2]([#6:1])[C:7](=[O:8])[#6:6] 27 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[NX3;$(N[#6]);!$(N[#7]);H1,H0:2][$([C][O][CH2][C](F)(F)F),$([C]n1cncc1)&!R:3](=[O:4])[O,n]>>[N:1][C:3](=[O:4])[N:2] 28 | [$(O[#6]),$([NH][#6]),$([#6]):1][C:2](=[O:3])[CH2:4][C:5](=[O:6])[$(O[#6]),$([NH][#6]),$([#6]):7].[C:8][C:9]([$([O][CH3]),$([O][CH2][CH3])])([$([O][CH3]),$([O][CH2][CH3])])[$([O][CH3]),$([O][CH2][CH3])].[#6,$([NH][C](=O)[CH3]);!$(C=[O,S,N]):10][NH2:11]>>[*:1][C:2](=[O:3])[C:4]([C:5](=[O:6])[*:7])=[C:9]([C:8])[NH:11][*:10] 29 | [#6:1][N:2]=[C:3]=[S:4].[c:5][CH:6]=O>>[c:5][C:6]=[C]1[S][C:3](=[S:4])[N:2]([#6:1])C1=O 30 | [N:1]#[C:2][#6:3].[NH2:4][c:5][c:6][C:7](=[O:8])[$(O(C=O)C)]>>[#6:3][C:2]1=[N:4][c:5][c:6][C:7](=[O:8])[N:1]1 31 | [C:1][NH2:2].([OH,$(O(C=O)C),Cl,Br,I][C;!R1:3](=[O:4])[CR1,$([c][c][NH][C]=[O]):5].[CR1,$([c][c][C]=[O]):6][NH:7][C;!R1:8](=[O:9])[OH,$(O(C=O)C),Cl,Br,I])>>[*:6][NH:7][C:8](=[O:9])[N:2]([C:1])[C:3](=[O:4])[*:5] 32 | [#6:1][C:2](=[O:3])[OH,O-:4].[Cl,Br,I][C;!$(C=[N,O,S]):5]>>[#6:1][C:2](=[O:3])[O:4][C:5] 33 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][Cl,$(OS(=O)(=O)[#6;!R])]>>[#6:1]S(=O)(=O)[#6:2] 34 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][Cl,$(OS(=O)(=O)[#6;!R])]>>[#6:1]S(=O)(=O)[#6:2] 35 | [#6:1][NH2:2].Cl[S:3](=[O:4])(=[O:5])[c:6][c:7][C:8](=[O:9])[O;$(O(C)C)]>>[C:1][N:2]1[S:3](=[O:4])(=[O:5])[c:6][c:7][C:8](=[O:9])1 36 | [NH2:1]-[c:2][c:3]-[C:4]#[N:5].[#6;!$(C=[O,S,N]):6][NH2:7]>>[#6:6][NH:7][cH0+0:4]1[nH0+0:5][cH0+0][nH0+0:1][c:2][c:3]1 37 | [NH2;$([N][#6]);!$([N]C=[O,S,N]):1].[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:2]>>[N:1][C](=[O])[N:2] 38 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][SH:3]>>[#6:1][S:3](=O)(=O)[#6:2] 39 | [NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:1].[NX3;!$(N[C]=[S,O,N]);$(N[#6,S]);!$(N[#7]);H1,H2:2]>>[N:1]C(=O)C(=O)[N:2] 40 | [N;$([N](=CS)[C]),$([NH]=CS):1]=[C:2]([S;!R]C)[NX3;$([NH](CS)[C]),$([NH2]CS):3].[C;!$(C=[N,O,S]):4][NH2:5]>>[C:4][N:5][C:2](=[N:1])[N:3] 41 | [$(C=O),$([N+](=O)[O-]):1][CH2:2][$(C=O),$([N+](=O)[O-]):3].FC(F)(F)CO[C:4](=[O:5])[NH:6][#6;!$(C=[O,S,N]):7]>>[*:1][CH:2]([*:3])[C:4](=[O:5])[NH:6][*:7] 42 | [NX3;$([NH](C=S)[C]),$([NH2]C=S):1][C:2](=S)[NX3;$([NH](C=S)[C]),$([NH2]C=S):3].[C;!$(C=[N,O,S]):4][NH2:5]>>[C:4][NH:5][C:2](=[N:1])[NH:3] 43 | [#6;!$(C=[S,O,N]):1][Cl,$(OS(=O)(=O)[#6;!R])].[#6;!$(C=[S,O,N]):2][SH:3]>>[#6:1][S:3](=O)[#6:2] 44 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[#6:3][CX3;!$(C[!#6]);H0,H1:4]=[O]>>[N:1][C:4][#6:3] 45 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[#6:4][C:5](=[O:6])[O:7][C:8](=[O:9])[#6:10]>>([N:1][C:5](=[O:6])[#6:4].[OH:7][C:8](=[O:9])[#6:10]) 46 | [NH2:1][c:2][c:3][C:4](=[O:5])[O][C].[C:6][NH2:7]>>[NH:1]1[c:2][c:3][C:4](=[O:5])[N:7]([C:6])C1=O 47 | [#6:1]-[C:2]#[N:3].[#6:4]-[C:5](=O)[OH,O-]>>[#6:1]-[cH0+0:2]1[nH0+0][oH0+0][cH0+0:5]([#6:4])[nH0+0:3]1 48 | [OH:1]-[N:2]=[C:3]([#6:4])-[NH2:5]>>[#6:4]-[cH0+0:3]1[nH0+0:2][oH0+0:1][cH0+0](=O)[nH1+0:5]1 49 | [#6:1]-[N:2]=[C:3]=S.[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:5].[#6:7]-[C:8](=O)[NH:9][NH2:10]>>[#6:1]-[nH0+0:2]1[cH0+0:8](-[#6:7])[nH0+0:9][nH0+0:10][cH0+0:3](-[N:5])1 50 | 
[#6:1][C:2](=[O:3])[O;$(O[CH3]),$(O[CH2][CH3]),$(O[CH2][C](F)(F)F)].[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:4]>>[#6:1][C:2](=[O:3])[N:4] 51 | [c:1]-B(O)O.[Cl,Br,I][c:2]>>[c:1]-[c:2] 52 | [#6,#1:1][C:2](=[O:3])[NH:4][NH2:5].[#6,H:6][C:7](=O)[OH,O-]>>[*:1][c:2]1[nH0+0:4][nH0+0:5][c:7]([*:6])[oH0+0:3]1 53 | [NX3;$([NH]([#6])[C]),$([NH2][C]):1][C:2][C:3](=[O:4])O[C;!R].[C;!$(C=[N,O,S]):5][NH2:6]>>[N:1]1[C:2][C:3](=[O:4])[N:6]([C:5])C(=O)1 54 | [NX3;$([NH]([#6])[C]),$([NH2][C]):1][C:2][C:3](=[O:4])O[C;!R].[C;!$(C=[N,O,S]):5][NH:6][C:7](=[O:8])O[CH2]C(F)(F)F>>[N:1]1[C:2][C:3](=[O:4])[N:6]([C:5])[C:7](=[O:8])1 55 | [#6,#1:1][C:2](=O)[C:3]([#6,H:4])[Cl,Br,I]>>[*:1]-[cH0+0:2]1[nH0+0][cH0+0](-[NH2])[sH0+0][c:3]1[*:4] 56 | [NX3;$([NH]([#6])[C]),$([NH2][C]):1][C:2][C:3](=[O:4])O[C;!R].[C;!$(C=[N,O,S]):5][NH:6][C:7](=[O:8])O[CH2]C(F)(F)F>>[C:5][NH:6][C:7](=[O:8])[N:1][C:2][C:3](=[O:4])[OH] 57 | [#6;!$(C=[O,S,N]):1][OH,SH:2].[Cl,Br,I][#6;!$(C=[N,O,S]):3]>>[#6:1][O,S:2][#6:3] 58 | [#6:1][C:2](=[O:3])[F,Cl,Br].[NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:4]>>[#6:1][C:2](=[O:3])[N:4] 59 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[C:2]1[O:4][C:3]1>>[#7:1][C:2][C:3][OH:4] 60 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1,H2:1].[C:4]=[C:5][$(C=O),$(C#N),$(S=O),$([N+](=O)[O-]):6]>>[N:1][C:4][C:5][*:6] 61 | [#6:1]-[N:5]=[N+:6]=[N-:7].[#6:2]-[C:3]#[CH:4]>>[#6:2][cH0+0:3]1[cH1+0:4][nH0+0:5]([#6:1])[nH0+0:6][nH0+0:7]1 62 | CO-[CH0+0:1]1=[NH0+0:2][CH2+0:3][C,O,N:4][C,O,N:5][CH2+0:6]1.[#6:7]-[C:8](=O)[NH:9][NH2:10]>>[#6:7]-[cH0+0:8]1[nH0+0:9][nH0+0:10][cH0+0:1]2[nH0+0:2]1[CH2+0:3][*:4][*:5][CH2+0:6]2 63 | [#6,#1:1][C:2](=[O:3])[NH:4][NH2:5].[#6,H:6][N:7]=[C:8]=S>>[*:1][c:2]1[nH0+0:4][nH0+0:5][c:8]([NH:7][*:6])[oH0+0:3]1 64 | [CH1:1]-[Cl,Br,I].[#6:2]-[C:3]#[CH:4]>>[#6:2][cH0+0:3]1[cH1+0:4][nH0+0]([CH1:1])[nH0+0][nH0+0]1 65 | [#6:1]-[N:5]=[N+:6]=[N-:7].[#6;!$(C=[O,S,N]):2]-[NH2:3]>>[#6:2]-[NH1:3][CH2][cH0+0]1[cH1+0][nH0+0:5]([#6:1])[nH0+0:6][nH0+0:7]1 66 | [#6,#7:1][C:2](=[S:3])[NH2:4].[F,Cl,Br,I][CH:5][C:6](=O)[C:7]#[C:8][Si](C)(C)C.[N-:9]=[N+:10]=[N:11][c:12]>>[*:1][c:2]1[s:3][c:5][c:6]([c:7]2[nHo+0:9][nHo+0:10][nHo+0:11]([c:12])[c:8]2)[n:4]1 67 | [#6;!$(C=[O,S,N]):1]-[NH2:2].[#6;!$(C=[O,S,N]):3]-[NH2:4]>>[#6:1][NH:2]c1ncnc2c1nc[nH0+0:4]2[#6:3] 68 | [C;!$(C=[N,O,S]):1][NH2:2].[C;!$(C=[N,O,S]):3][NH2:4].[C;!$(C=[N,O,S]):5][NH2:6]>>[C:1][N:2]=C([NH:6][C:5])[NH:4][C:3] 69 | [NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H2:1].[#6:3][CX3;!$(C[!#6]);H0,H1:4]=[O].[Cl,Br,I][C,S;$([C,S]=[S,O,N]):5]>>[*:5][N:1][C:4][#6:3] 70 | ([NX3;!$(N[C,S]=[S,O,N]);$(N[#6]);!$(N[#7]);H1:1].[NX3;H0:2]C(=O)OC(C)(C)C).[#6:3][C:4](=[O:5])[OH,O-].[#6:6][C:7](=[O:8])[OH,O-]>>([N:1][C:4](=[O:5])[#6:3].[N:2][C:7](=[O:8])[#6:6]) 71 | [NH2:1][C:2](=S)[N:3]([#6;!$(C=[O,S,N,C]):4])[#6,H;!$(C=[O,S,N,C]):5].[#6:6]-[C:7](=O)[NH:8][NH2:9]>>[nH1+0:1]1[cH0+0:7](-[#6:6])[nH0+0:8][nH0+0:9][cH0+0:2](-[N:3]([#6,H:5])[#6:4])1 72 | [#6:1][C:2](=O)[#6:3].[#6:4][OH,nH,NH:5]>>[#6:4][*:5][C:2]([#6:1])([#6:3])C(=O)O 73 | [#6:1][C:2](=[O:3])[OH,O-:4].[OH][C;!$(C=[N,O,S]):5]>>[#6:1][C:2](=[O:3])[O:4][C:5] 74 | ([C;!$(C=[N,O,S]):1][NH:2][#6;!$([c][c][c][NH2]);$([#6][#6][NH2]),$([#6][#6][#6][NH2]):3].[#6;$([#6][#6][NH:2]),$([#6][#6][#6][NH:2]):4][NH2:5]).[C;!$(C=[N,O,S]):7][NH2:6]>>[#6:3][N:2]([C:1])C([NH:6][C:7])=[N:5][#6:4] 75 | CO[C:1](=[O:2])[C:3][NH2:4].CC(C)(C)OC(=O)[NH:5][C:6][C:7](=[O:8])[OH,O-]>>[C:1](=[O:2])1[C:3][NH:4][C:7](=[O:8])[C:6][NH:5]1 76 | 
[C;!$(C=[N,O,S]):1][NH2:2].[NH2:3][C;!$(C=[O,S,N]):4][C:5][C:6](=[O:7])[O;$(OC)]>>[C:1][N:2]1[C:6](=[O:7])[C:5][C:4][NH:3]C(=O)1 77 | [NH2:1][C:2](=[NH])-[#6,H:3].[#6,H:4]-[C:5](=O)[OH,O-].[#6,H:6][NH:7][NH2:8]>>[*:3]-[cH0+0:2]1[nH0+0:1][cH0+0:5](-[*:4])[nH0+0:7]([*:6])[nH0+0:8]1 78 | [#6,#1:1][C:2](=O)[NH:3][NH2:4].[#6,H:5][C:6](=[NH:7])O[CH3]>>[*:1]-[cH0+0:2]1[nH0+0:3][nH0+0:4][cH0+0:6]([*:5])[nH1+0:7]1 79 | [#6,#1:1][C:2](=O)[c:9]1[c:10][c:11][c:12][c:13][c:14]1[NH2:3].CC(C)(C)OC(=O)[NH:4][C:5]([#6,#1:6])[C:7](=[O:8])[OH,O-]>>[O:8]=[C:7]1[NH:3][c:14]2[c:13][c:12][c:11][c:10][c:9]2[CH0+0:2]([*:1])=[NH0+0:4][C:5]1[*:6] 80 | [CH3:1][C:2](=[O])[#6:4].[NH2:5][C:6](=[O:7])[N;$([NH](C=O)[#6]),$([NH2](C=O)):8].[OH,O-:9][c:10][c:11][CH:12]=O>>[#6:4][C:2]12[CH2:1][CH:12]([NH:5][C:6](=[O:7])[N:8]1)[c:11][c:10][O:9]2 81 | [C;!$(C=[N,O,S]):1][NH2:2].[C;!$(C=[N,O,S]):3][NH2:4].[C:5]=[C:6][C:7](=[O:8])[O;$(OC)]>>[C:1][N:2]1[C:5][C:6][C:7](=[O:8])[N:4]([C:3])C(=O)1 82 | [NH2:1][NH:2][#6:3].O=[CH:4][c:5][c:6][C:7](=[O:8])OC>>[#6:3][N:2]1[N:1]=[CH:4][c:5][c:6][C:7](=[O:8])1 83 | [NH2;$([N][#6]);!$([N]C=[O,S,N]):1].[OH,O-][C:2](=[O:3])[C:4][C:5](=[C;H1,H2:6])[C:7](=[O:8])[O:9]>>[N:1]1[C:2](=[O:3])[C:4][C:5]([C:7](=[O:8])[O:9])[C:6]1 84 | [#6:1][C:2]#[N:3]>>[#6:1][c:2]1[n:3][nH][n][n]1 85 | [N:1]C(=O)OC([CH3])([CH3])[CH3]>>[N:1] 86 | [C:1](=[O:2])[O:3][#6]>>[C:1](=[O:2])[O:3] 87 | [NH3+,NH2:1]-[C$(C(N)(C)(C)(C)),C$([CH](N)(C)(C)),C$([CH2](N)(C)):2]-[C$(C(c)(C)(C)(C)),C$([CH](c)(C)(C)),C$([CH2](c)(C)):3]-[c:4][cH1:5].[CH:6](-[#6:7])=[O]>>[#6:7]-[CH:6]1[N:1][C:2][C:3][c:4][c:5]1 88 | [C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):1]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1[C:2][C:3][C:4]=[C:5][C:6]1 89 | [C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):1]#[C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1=[C:2][C:3][C:4]=[C:5][C:6]1 90 | [CH3,$([CH2](C=O)[#6])&!R:4][CD3:2](=[O:3])[c:1][c:5][OH:6].[C;$(C1[#6][#6][N,C][#6][#6]1):7](=[OD1])>>[O:6]1[c:5][c:1][C:2](=[OD1:3])[C:4][C:7]1 91 | [C$(C(=O)(C)([CX4]))&!R,C$(C[H](=O)(C))&!R:1](=[O])[CH1,CH2:3][CH1,CH2:4][C$(C(=O)(C)([CX4]))&!R,C$(C[H](=O)(C))&!R:5]=[O].[N$([NH2,NH3+1]([CX4])):7]>>[c:5]1[c:4][c:3][c:1][n:7]1 92 | [NH1;$(N-c1ccccc1):1]([NH2])[c:5][cH1:4].[C;$(C([#6])[#6]):2](=[OD1])-[CH2;$(C([#6])[#6]);!$(C(C=O)C=O):3]>>[c:5]1[nH:1][c:2][c:3][c:4]1 93 | [OH:7][c:6]1[cH:1][c:2][c:3][c:4][c:5]1.[O$(O(C)([CX4]))][C:11](=[O:15])[CH,CH2:10][C:8]=[O]>>[C:8]1=[C:10][C:11](=[O:15])[O:7][c:6]2[c:5][c:4][c:3][c:2][c:1]12 94 | [*;Br,I;$(*c1ccccc1)][c:1][c:2][OH1,SH1,NH2:3].[CH1:5]#[C:4]>>[c:1]1[c:2][*:3][c:4][c:5]1 95 | [C$([C](O)([CX4])([CX4])([CX4])),C$([CH](O)([CX4])([CX4])),C$([CH2](O)([CX4]))][O][C$(C(=O)([CX4])),C$([CH](=O)):2]=[O:5].[C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):7]-[C$(C(=O)([CX4])),C$([CH](=O)):8]=[O:9]>>[C:7]([C:2]=[O:5])[C:8]=[O:9] 96 | 
[Br,I][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):1].[Br,I][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):2]>>[#6:1][#6:2] 97 | [C$([CH](=C)([CX4])),C$([CH2](=C)):2]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):3].[Br,I][C$([CX4]([Br,I])),c$([c]([Br,I])):4]>>[#6:4][C:2]=[C:3] 98 | [#6:1][C:2]#[N;D1].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=[O,S,N]):3]>>[#6:1][C:2](=O)[#6:3] 99 | [#6:1][C;H1,$([C]([#6])[#6]):2]=[OD1:3].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=[O,S,N]):4]>>[C:1][#6:2]([OH1:3])[#6:4] 100 | [c:1]B(O)O.[nH1;+0;r5;!$(n[#6]=[O,S,N]);!$(n~n~n);!$(n~n~c~n);!$(n~c~n~n):2]>>[c:1][n:2] 101 | [*:1][C:2]#[CH:3].[Cl,Br,I][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):4]>>[#6:4][C:3]#[C:2][*:1] 102 | [C$(C(C)([CX4])([CX4])([CX4])),C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):1][C:2]#[CH:3].[Cl,Br,I][C$(C(=O)([CX4])),C$([CH](=O)):5]=[O:6]>>[#6:1][C:2]#[C:3][C:5]=[O:6] 103 | [#6:1][C;H1,$([CH0]([#6])[#6]);!$(CC=O):2]=[OD1].[Cl,Br,I][C;H2;$(C[#6]);!$(CC[I,Br]);!$(CCO[CH3]):3]>>[C:1][C:2]=[C:3] 104 | [Cl,Br,I][c;$(c1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):1].[N;$(NC)&!$(N=*)&!$([N-])&!$(N#*)&!$([ND3])&!$([ND4])&!$([N][N])&!$(N[c,O])&!$(N[C,S]=[S,O,N]),H2&$(Nc1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):2]>>[c:1][N:2] 105 | [C;$(C([#6])[#6;!$([#6]Br)]):4](=[OD1])[CH;$(C([#6])[#6]):5]Br.[NH2:3][C;$(C(=N)(N)[#6,#7]):2]=[NH;D1:1]>>[c:4]1[c:5][nH:3][c:2][n:1]1 106 | [c;$(c1[c;$(c[C,S,N](=[OD1])[*;R0;!OH1])]cccc1):1][C;$(C(=O)[OH])].[c;$(c1aaccc1):2][Cl,Br,I]>>[c:1][c:2] 107 | [N;$(N[#6]):1]=[C;$(C=O):2].[N;$(N[#6]);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[O,N]);!$(N[C,S]=[S,O,N]):3]>>[N:1][C:2][N+0:3] 108 | [OH,O-][C$(C(=O)([OH,O-])([CX4])),C$([CH](=O)([OH,O-])):1]=[O:2]>>[Cl][C:1]=[O:2] 109 | [OH][$([CX4]),c:1]>>[Br][#6:1] 110 | [OH][$([CX4]),c:1]>>[Cl][#6:1] 111 | [OH,O-][S$(S([CX4])):1](=[O:2])=[O:3]>>[Cl][S:1](=[O:2])=[O:3] 112 | [OH+0,O-:1][C:2](=[O:3])[C$([CH]([CX4])),C$([CH2]):4]>>[#8:1][C:2](=[O:3])[C:4][Br] 113 | [OH+0,O-:1][C:2](=[O:3])[C$([CH]([CX4])),C$([CH2]):4]>>[#8:1][C:2](=[O:3])[C:4][Cl] 114 | [Cl,Br,I][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):1]>>[N]#[C][#6:1] 115 | [OH,NH2,NH3+][CH2:2][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):1]>>[#6:1][C:2]#[N] -------------------------------------------------------------------------------- /data/91_rxns/91_rxn_templates.txt: -------------------------------------------------------------------------------- 1 | [cH1:1]1:[c:2](-[CH2:7]-[CH2:8]-[NH2:9]):[c:3]:[c:4]:[c:5]:[c:6]:1.[#6:11]-[CH1;R0:10]=[OD1]>>[c:1]12:[c:2](-[CH2:7]-[CH2:8]-[NH1:9]-[C:10]-2(-[#6:11])):[c:3]:[c:4]:[c:5]:[c:6]:1 2 | [c;r6:1](-[NH1;$(N-[#6]):2]):[c;r6:3](-[NH2:4]).[#6:6]-[C;R0:5](=[OD1])-[#8;H1,$(O-[CH3])]>>[c:3]2:[c:1]:[n:2]:[c:5](-[#6:6]):[n:4]2 3 | [c;r6:1](-[NH1;$(N-[#6]):2]):[c;r6:3](-[NH2:4]).[#6:6]-[CH1;R0:5](=[OD1])>>[c:3]2:[c:1]:[n:2]:[c:5](-[#6:6]):[n:4]2 4 | [c;r6:1](-[SH1:2]):[c;r6:3](-[NH2:4]).[#6:6]-[CH1;R0:5](=[OD1])>>[c:3]2:[c:1]:[s:2]:[c:5](-[#6:6]):[n:4]2 5 | 
[c:1](-[OH1;$(Oc1ccccc1):2]):[c;r6:3](-[NH2:4]).[c:6]-[CH1;R0:5](=[OD1])>>[c:3]2:[c:1]:[o:2]:[c:5](-[c:6]):[n:4]2 6 | [c;r6:1](-[OH1:2]):[c;r6:3](-[NH2:4]).[#6:6]-[C;R0:5](=[OD1])-[OH1]>>[c:3]2:[c:1]:[o:2]:[c:5](-[#6:6]):[n:4]2 7 | [#6:6]-[C;R0:1](=[OD1])-[CH1;R0:5](-[#6:7])-[*;#17,#35,#53].[NH2:2]-[C:3]=[SD1:4]>>[c:1]2(-[#6:6]):[n:2]:[c:3]:[s:4][c:5]([#6:7]):2 8 | [c:1](-[C;$(C-c1ccccc1):2](=[OD1:3])-[OH1]):[c:4](-[NH2:5]).[N;!H0;!$(N-N);!$(N-C=N);!$(N(-C=O)-C=O):6]-[C;H1,$(C-[#6]):7]=[OD1]>>[c:4]2:[c:1]-[C:2](=[O:3])-[N:6]-[C:7]=[N:5]-2 9 | [CH0;$(C-[#6]):1]#[NH0:2]>>[C:1]1=[N:2]-N-N=N-1 10 | [CH0;$(C-[#6]):1]#[NH0:2].[C;A;!$(C=O):3]-[*;#17,#35,#53]>>[C:1]1=[N:2]-N(-[C:3])-N=N-1 11 | [CH0;$(C-[#6]):1]#[NH0:2].[C;A;!$(C=O):3]-[*;#17,#35,#53]>>[C:1]1=[N:2]-N=N-N-1(-[C:3]) 12 | [CH0;$(C-[#6]):1]#[CH1:2].[C;H1,H2;A;!$(C=O):3]-[*;#17,#35,#53,OH1]>>[C:1]1=[C:2]-N(-[C:3])-N=N-1 13 | [CH0;$(C-[#6]):1]#[CH1:2].[C;H1,H2;A;!$(C=O):3]-[*;#17,#35,#53,OH1]>>[C:1]1=[C:2]-N=NN(-[C:3])-1 14 | [CH0;$(C-[#6]):1]#[CH0;$(C-[#6]):2].[C;H1,H2;A;!$(C=O):3]-[*;#17,#35,#53,OH1]>>[C:1]1=[C:2]-N=NN(-[C:3])-1 15 | [CH0;$(C-[#6]):1]#[NH0:2].[NH2:3]-[NH1:4]-[CH0;$(C-[#6]);R0:5]=[OD1]>>[N:2]1-[C:1]=[N:3]-[N:4]-[C:5]=1 16 | [CH0;$(C-[#6]):1]#[NH0:2].[CH0;$(C-[#6]);R0:5](=[OD1])-[#8;H1,$(O-[CH3]),$(O-[CH2]-[CH3])]>>[N:2]1-[C:1]=N-N-[C:5]=1 17 | [c:1](-[C;$(C-c1ccccc1):2](=[OD1:3])-[CH3:4]):[c:5](-[OH1:6]).[C;$(C1-[CH2]-[CH2]-[N,C]-[CH2]-[CH2]-1):7](=[OD1])>>[O:6]1-[c:5]:[c:1]-[C:2](=[OD1:3])-[C:4]-[C:7]-1 18 | [c;r6:1](-[C;$(C=O):6]-[OH1]):[c;r6:2]-[C;H1,$(C-C):3]=[OD1].[NH2:4]-[NH1;$(N-[#6]);!$(NC=[O,S,N]):5]>>[c:1]1:[c:2]-[C:3]=[N:4]-[N:5]-[C:6]-1 19 | [C;$(C-c1ccccc1):1](=[OD1])-[C;D3;$(C-c1ccccc1):2]~[O;D1,H1].[CH1;$(C-c):3]=[OD1]>>[C:1]1-N=[C:3]-[NH1]-[C:2]=1 20 | [NH1;$(N-c1ccccc1):1](-[NH2])-[c:5]:[cH1:4].[C;$(C([#6])[#6]):2](=[OD1])-[CH2;$(C([#6])[#6]);!$(C(C=O)C=O):3]>>[C:5]1-[N:1]-[C:2]=[C:3]-[C:4]:1 21 | [NH2;$(N-c1ccccc1):1]-[c:2]:[c:3]-[CH1:4]=[OD1].[C;$(C([#6])[#6]):6](=[OD1])-[CH2;$(C([#6])[#6]);!$(C(C=O)C=O):5]>>[N:1]1-[c:2]:[c:3]-[C:4]=[C:5]-[C:6]:1 22 | [*;Br,I;$(*c1ccccc1)]-[c:1]:[c:2]-[OH1:3].[CH1:5]#[C;$(C-[#6]):4]>>[c:1]1:[c:2]-[O:3]-[C:4]=[C:5]-1 23 | [*;Br,I;$(*c1ccccc1)]-[c:1]:[c:2]-[SD2:3]-[CH3].[CH1:5]#[C;$(C-[#6]):4]>>[c:1]1:[c:2]-[S:3]-[C:4]=[C:5]-1 24 | [*;Br,I;$(*c1ccccc1)]-[c:1]:[c:2]-[NH2:3].[CH1:5]#[C;$(C-[#6]):4]>>[c:1]1:[c:2]-[N:3]-[C:4]=[C:5]-1 25 | [#6:6][C:5]#[#7;D1:4].[#6:1][C:2](=[OD1:3])[OH1]>>[#6:6][c:5]1[n:4][o:3][c:2]([#6:1])n1 26 | [#6;$([#6]~[#6]);!$([#6]=O):2][#8;H1:3].[Cl,Br,I][#6;H2;$([#6]~[#6]):4]>>[CH2:4][O:3][#6:2] 27 | [#6;H0;D3;$([#6](~[#6])~[#6]):1]B(O)O.[#6;H0;D3;$([#6](~[#6])~[#6]):2][Cl,Br,I]>>[#6:2][#6:1] 28 | [c;H1:3]1:[c:4]:[c:5]:[c;H1:6]:[c:7]2:[nH:8]:[c:9]:[c;H1:1]:[c:2]:1:2.O=[C:10]1[#6;H2:11][#6;H2:12][N:13][#6;H2:14][#6;H2:15]1>>[#6;H2:12]3[#6;H1:11]=[C:10]([c:1]1:[c:9]:[n:8]:[c:7]2:[c:6]:[c:5]:[c:4]:[c:3]:[c:2]:1:2)[#6;H2:15][#6;H2:14][N:13]3 29 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[NH1;$(N(C=O)C=O):2]>>[C:1][N:2] 30 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[OH1;$(Oc1ccccc1):2]>>[C:1][O:2] 31 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[NH1;$(N([#6])S(=O)=O):2]>>[C:1][N:2] 32 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7H1:2]1~[#7:3]~[#7:4]~[#7:5]~[#6:6]~1>>[C:1][#7:2]1:[#7:3]:[#7:4]:[#7:5]:[#6:6]:1 33 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7H1:2]1~[#7:3]~[#7:4]~[#7:5]~[#6:6]~1>>[#7H0:2]1:[#7:3]:[#7H0:4]([C:1]):[#7:5]:[#6:6]:1 34 | 
[C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7:2]1~[#7:3]~[#7H1:4]~[#7:5]~[#6:6]~1>>[C:1][#7H0:2]1:[#7:3]:[#7H0:4]:[#7:5]:[#6:6]:1 35 | [C;H1&$(C([#6])[#6]),H2&$(C[#6]):1][OH1].[#7:2]1~[#7:3]~[#7H1:4]~[#7:5]~[#6:6]~1>>[#7:2]1:[#7:3]:[#7:4]([C:1]):[#7:5]:[#6:6]:1 36 | [#6;$(C=C-[#6]),$(c:c):1][Br,I].[Cl,Br,I][c:2]>>[c:2][#6:1] 37 | [#6:1][C:2]#[#7;D1].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=O):3]>>[#6:1][C:2](=O)[#6:3] 38 | [#6:1][C;H1,$([C]([#6])[#6]):2]=[OD1:3].[Cl,Br,I][#6;$([#6]~[#6]);!$([#6]([Cl,Br,I])[Cl,Br,I]);!$([#6]=O):4]>>[C:1][#6:2]([OH1:3])[#6:4] 39 | [S;$(S(=O)(=O)[C,N]):1][Cl].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[S:1][N+0:2] 40 | [c:1]B(O)O.[nH1;+0;r5;!$(n[#6]=[O,S,N]);!$(n~n~n);!$(n~n~c~n);!$(n~c~n~n):2]>>[c:1][n:2] 41 | [#6:3]-[C;H1,$([CH0](-[#6])[#6]);!$(CC=O):1]=[OD1].[Cl,Br,I][C;H2;$(C-[#6]);!$(CC[I,Br]);!$(CCO[CH3]):2]>>[C:3][C:1]=[C:2] 42 | [Cl,Br,I][c;$(c1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):1].[N;$(NC)&!$(N=*)&!$([N-])&!$(N#*)&!$([ND3])&!$([ND4])&!$(N[c,O])&!$(N[C,S]=[S,O,N]),H2&$(Nc1:[c,n]:[c,n]:[c,n]:[c,n]:[c,n]:1):2]>>[c:1][N:2] 43 | [C;$(C([#6])[#6;!$([#6]Br)]):4](=[OD1])[CH;$(C([#6])[#6]):5]Br.[#7;H2:3][C;$(C(=N)(N)[c,#7]):2]=[#7;H1;D1:1]>>[C:4]1=[CH0:5][NH:3][C:2]=[N:1]1 44 | [c;$(c1[c;$(c[C,S,N](=[OD1])[*;R0;!OH1])]cccc1):1][C;$(C(=O)[O;H1])].[c;$(c1aaccc1):2][Cl,Br,I]>>[c:1][c:2] 45 | [c;!$(c1ccccc1);$(c1[n,c]c[n,c]c[n,c]1):1][Cl,F].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[c:1][N:2] 46 | [c;$(c1c(N(~O)~O)cccc1):1][Cl,F].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[c:1][N:2] 47 | [c;$(c1ccc(N(~O)~O)cc1):1][Cl,F].[N;$(NC);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[c,O]);!$(N[C,S]=[S,O,N]):2]>>[c:1][N:2] 48 | [N;$(N-[#6]):3]=[C;$(C=O):1].[N;$(N[#6]);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[O,N]);!$(N[C,S]=[S,O,N]):2]>>[N:3]-[C:1]-[N+0:2] 49 | [N;$(N-[#6]):3]=[C;$(C=S):1].[N;$(N[#6]);!$(N=*);!$([N-]);!$(N#*);!$([ND3]);!$([ND4]);!$(N[O,N]);!$(N[C,S]=[S,O,N]):2]>>[N:3]-[C:1]-[N+0:2] 50 | [$(C([CH2,CH3])),CH:10](=[O:11])-[NH+0:9]-[C$(C(N)(C)(C)(C)),C$([CH](N)(C)(C)),C$([CH2](N)(C)):8]-[C$(C(c)(C)(C)(C)),C$([CH](c)(C)(C)),C$([CH2](c)(C)):7]-[c:6]1[cH:1][c:2][c:3][c:4][c:5]1>>[C:10]-1=[N+0:9]-[C:8]-[C:7]-[c:6]2[c:5][c:4][c:3][c:2][c:1]-12 51 | [$(C([CH2,CH3])),CH:10](=[O:11])-[NH+0:9]-[C$([CH](N)(C)(C)),C$([CH2](N)(C)):8]-[C$([C](c)(C)(C)),C$([CH](c)(C)):7]([O$(OC),OH])-[c:6]1[cH:1][c:2][c:3][c:4][c:5]1>>[c:10]-1[n:9][c:8][c:7][c:6]2[c:5][c:4][c:3][c:2][c:1]-12 52 | [NH3+,NH2]-[C$(C(N)(C)(C)(C)),C$([CH](N)(C)(C)),C$([CH2](N)(C)):8]-[C$(C(c)(C)(C)(C)),C$([CH](c)(C)(C)),C$([CH2](c)(C)):7]-[c:6]1[c:1][c:2][nH:3][cH:5]1.[CH:10](-[CX4:12])=[O:11]>>[c,C:12]-[CH:10]-1-[N]-[C:8]-[C:7]-[c:6]2[c:1][c:2][nH:3][c:5]-12 53 | [NH2,NH3+1:8]-[c:5]1[cH:4][c:3][c:2][c:1][c:6]1.[Br:18][C$([CH2](C)(Br)),C$([CH](C)(C)(Br)):17]-[C:15](=[O:16])-[c:10]1[c:11][c:12][c:13][c:14][c:9]1>>[c:13]1[c:12][c:11][c:10]([c:9][c:14]1)-[c:15]1[c:17][c:4]2[c:3][c:2][c:1][c:6][c:5]2[nH+0:8]1 54 | [Cl:1][CH2:2]-[C$([CH](C)),C$(C(C)(C)):3]=[O:4].[OH:12]-[c:11]1[c:6][c:7][c:8][c:9][c:10]1-[CH:13]=[O:14]>>[C:3](=[O:4])-[c:2]1[c:13][c:10]2[c:9][c:8][c:7][c:6][c:11]2[o:12]1 55 | [NH2,NH3+]-[C$([CX4](N)([c,C])([c,C])([c,C])),C$([CH](N)([c,C])([c,C])),C$([CH2](N)([c,C])),C$([CH3](N)):2].[NH2:12]-[c:7]1[c:6][c:5][c:4][c:3][c:8]1-[C:9](-[OH,O-:11])=[O:10]>>[C:2]-[n+0]-1[c:13][n:12][c:7]2[c:6][c:5][c:4][c:3][c:8]2[c:9]-1=[O:10] 56 | 
[N$([NH2]([CX4])),N$([NH3+1]([CX4])):1].[O:5]-[C$([CH]([CX4])(C)(O)),C$([CH2]([CX4])(O)):3][C$(C([CX4])(=O)([CX4])),C$([CH]([CX4])(=O)):4]=[O:6]>[O:15]=[C:9]-1-[CH2:10]-[CH2:11]-[CH2:12]-[CH2:13]-[CH2:14]-1>[c:4]1[c:3][n+0:1][c:10]2-[C:11]-[C:12]-[C:13]-[C:14]-[c:9]12 57 | [C$(C(=O)([CX4])([CX4])),C$([CH](=O)([CX4])):2](=[O:6])-[C$([CH]([CX4])),C$([CH2]):3]-[C$(C(=O)([CX4])([CX4])),C$([CH](=O)([CX4])):4]=[O:7].[NH2:8]-[C:9](=[O:10])-[CH2:11][C:12]#[N:13]>>[OH:10]-[c:9]1[n:8][c:4][c:3][c:2][c:11]1[C:12]#[N:13] 58 | [C$(C(#C)([CX4])):2]#[C$(C(#C)([CX4])):1].[N$(N(~N)([CX4])):5]~[N]~[N]>>[c:2]1[c:1][n:5][n][n]1 59 | [C$(C(=C)([CX4])):2]=[C$(C(=C)([CX4])):1].[N$(N(~N)([CX4])):5]~[N]~[N]>>[C:2]1[C:1][N:5][N]=[N]1 60 | [C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):1]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1[C:2][C:3][C:4]=[C:5][C:6]1 61 | [C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):1]#[C$(C(#C)([CX4,OX2,NX3])),C$([CH](#C)):2].[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):3]=[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):4]-[C$([C](=C)(C)([CX4,OX2,NX3])),C$([CH](=C)(C)):5]=[C$(C(=C)([CX4,OX2,NX3])([CX4,OX2,NX3])),C$([CH](=C)([CX4,OX2,NX3])),C$([CH2](=C)):6]>>[C:1]1=[C:2][C:3][C:4]=[C:5][C:6]1 62 | [NH2,NH3+:3]-[N$([NH](N)([CX4])):2].[C$([CH](C)(C)([CX4])),C$([CH2](C)(C)):6](-[C$(C(=O)(C)([CX4])),C$([CH](=O)(C)):5]=[O:9])-[C$(C(=O)(C)([CX4])),C$([CH](=O)(C)):7]=[O:10]>>[c:7]1[n:3][n:2][c:5][c:6]1 63 | [C$(C(=O)(C)([CX4])),C$(C[H](=O)(C)):1](=[O:2])-[$([CH](C)(C)([CX4])),$([CH2](C)(C)):3]-[$([CH](C)(C)([CX4])),$([CH2](C)(C)):4]-[C$(C(=O)(C)([CX4])),C$(C[H](=O)(C)):5]=[O:6].[N$([NH2,NH3+1]([CX4])):7]>>[c:5]1[c:4][c:3][c:1][n+0:7]1 64 | [CH:7](=[O:8])-[c:1]1[c:2][c:3][c:4][c:5][c:6]1.[O:24]=[C:23](-[C:22](=[O:25])-[c:15]1[c:10][c:11][c:12][c:13][c:14]1)-[c:20]1[c:21][c:16][c:17][c:18][c:19]1>[NH4].[O-]C(=O)C>[nH:27]-1[c:7]([n:26][c:23]([c:22]-1[c:15]1[c:10][c:11][c:12][c:13][c:14]1)-[c:20]1[c:21][c:16][c:17][c:18][c:19]1)-[c:1]1[c:2][c:3][c:4][c:5][c:6]1 65 | [OH:7]-[c:6]1[cH:1][c:2][c:3][c:4][c:5]1.[O$(O(C)([CX4])):12]-[C:11](=[O:15])-[C$([CH](C)(C)([CX4])),C$([CH2](C)(C)):10]-[C:8]=[O:16]>>[C:8]-1=[C:10]-[C:11](=[O:15])-[O]-[c:6]2[c:5][c:4][c:3][c:2][c:1]-12 66 | [O$(O(C)([CX4])):8][C:7](=[O:9])[CH:6][C:5][C:4][C:3][C:2]([O$(O(C)([CX4])):10])=[O:1]>>[O:8][C:7](=[O:9])[C:6]1[C:5][C:4][C:3][C:2]1=[O:1] 67 | [O$(O(C)([CX4])):8][C:7](=[O:9])[CH:6][C:5][C:11][C:4][C:3][C:2]([O$(O(C)([CX4])):10])=[O:1]>>[O:8][C:7](=[O:9])[C:6]1[C:5][C:11][C:4][C:3][C:2]1=[O:1] 68 | [Cl:9][C:7](=[O:8])-[c:3]1[c:2][c:1][c:6][c:5][c:4]1.[C$([CH2](C)([CX4])),C$([CH3](C)):18]-[C:16](=[O:17])-[c:14]1[c:13][c:12][c:11][c:10][c:15]1-[OH:19]>>[O:17]=[C:16]-1-[C:18]=[C:7](-[O:8]-[c:15]2[c:10][c:11][c:12][c:13][c:14]-12)-[c:3]1[c:2][c:1][c:6][c:5][c:4]1 69 | 
[C$(C(C)(=O)([CX4,OX2&H0])),C$(C(C)(#N)),N$([N+1](C)(=O)([O-1])):1][C$([CH]([C,N])([C,N])([CX4])),C$([CH2]([C,N])([C,N])):2][C$(C(C)(=O)([CX4,OX2&H0])),C$(C(C)(#N)),N$([N+1](C)(=O)([O-1])):3].[C$(C(C)(#N)),C$(C(C)([CX4,OX2&H0])([CX4,OX2&H0])([OX2&H0])),C$([CH](C)([CX4,OX2&H0])([OX2&H0])),C$([CH2](C)([OX2&H0])),C$(C(C)(=O)([OX2&H0])):6][CH:5]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):4]>>[C:6][C:5][C:4][C:2]([C:1])[C:3] 70 | [C$([C](O)([CX4])([CX4])([CX4])),C$([CH](O)([CX4])([CX4])),C$([CH2](O)([CX4])):4]-[O:3]-[C$(C(=O)([CX4])),C$([CH](=O)):2]=[O:5].[C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):7]-[C$(C(=O)([CX4])),C$([CH](=O)):8]=[O:9]>>[C:7](-[C:2]=[O:5])-[C:8]=[O:9] 71 | [Cl,OH,O-:3][C$(C(=O)([CX4,c])),C$([CH](=O)):2]=[O:4].[O$([OH]([CX4,c])),O$([OH]([CX4,c])([CX4,c])),S$([SH]([CX4,c])),S$([SH]([CX4,c])([CX4,c])):6]>>[*:6]-[C:2]=[O:4] 72 | [C$(C(=O)([CX4,c])([CX4,c])),C$([CH](=O)([CX4,c])):1]=[O:2].[N$([NH2,NH3+1]([CX4,c])),N$([NH]([CX4,c])([CX4,c])):3]>>[N+0:3][C:1] 73 | [Br:1][c$(c(Br)),n$(n(Br)),o$(o(Br)),C$([CH](Br)(=C)):2].[C$(C(B)([CX4])([CX4])([CX4])),C$([CH](B)([CX4])([CX4])),C$([CH2](B)([CX4])),C$([CH2](B)),C$(C(B)(=C)),c$(c(B)),o$(o(B)),n$(n(B)):3][B$(B([C,c,n,o])([OH,$(OC)])([OH,$(OC)])),B$([B-1]([C,c,n,o])(N)([OH,$(OC)])([OH,$(OC)])):4]>>[C,c,n,o:2][C,c,n,o:3] 74 | [Br,I:1][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):2].[Br,I:3][C$(C([Br,I])([CX4])([CX4])([CX4])),C$([CH]([Br,I])([CX4])([CX4])),C$([CH2]([Br,I])([CX4])),C$([CH3]([Br,I])),C$([C]([Br,I])(=C)([CX4])),C$([CH]([Br,I])(=C)),C$(C([Br,I])(#C)),c$(c([Br,I])):4]>>[C,c:2][C,c:4] 75 | [OH,O-]-[C$(C(=O)(O)([CX4,c])):2]=[O:3].[OH:8]-[C$([CH](O)([CX4,c])([CX4,c])),C$([CH2](O)([CX4,c])),C$([CH3](O)):6]>>[C:6][O]-[C:2]=[O:3] 76 | [C$([CH](=C)([CX4])),C$([CH2](=C)):2]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):3].[Br,I:7][C$([CX4]([Br,I])),c$([c]([Br,I])):4]>>[C,c:4][C:2]=[C:3] 77 | [Cl,OH,O-:3][C$(C(=O)([CX4,c])),C$([CH](=O)):2]=[O:4].[N$([NH2,NH3+1]([CX4,c])),N$([NH]([CX4,c])([CX4,c])):6]>>[N+0:6]-[C:2]=[O:4] 78 | [C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):1]=[C$(C(=C)([CX4])([CX4])),C$([CH](=C)([CX4])),C$([CH2](=C)):2].[SH:4]-[CX4:5][Br,Cl,I]>>[C:1]-[C:2]-[S:4][C:5] 79 | [C$([C](=O)([CX4])),C$([CH](=O)):2](=[O:1])[OH,Cl,O-:6].[SH:4]-[CX4:5][Br,Cl,I]>>[CH2:2]-[S:4][C:5] 80 | [I:1][C$(C(I)([CX4,c])([CX4,c])([CX4,c])),C$([CH](I)([CX4,c])([CX4,c])),C$([CH2](I)([CX4,c])),C$([CH3](I)):2].[C$(C(=O)([Cl,OH,O-])([CX4,c])),C$([CH]([Cl,OH,O-])(=O)):3](=[O:6])[Cl,OH,O-:5]>>[C:2]-[C:3]=[O:6] 81 | [Cl:5][S$(S(=O)(=O)(Cl)([CX4])):2](=[O:3])=[O:4].[NH2+0,NH3+:6]-[C$(C(N)([CX4,c])([CX4,c])([CX4,c])),C$([CH](N)([CX4,c])([CX4,c])),C$([CH2](N)([CX4,c])),C$([CH3](N)),c$(c(N)):7]>>[C,c:7]-[NH+0:6][S:2](=[O:4])=[O:3] 82 | [*:1][C:2]#[CH:3].[Br,I:4][C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):5]>>[C,c:5][C:3]#[C:2][*:1] 83 | [C$(C(C)([CX4])([CX4])([CX4])),C$([CH](C)([CX4])([CX4])),C$([CH2](C)([CX4])),C$([CH3](C)):1][C:2]#[CH:3].[Br,I:4][C$(C(=O)([Br,I])([CX4])),C$([CH](=O)([Br,I])):5]=[O:6]>>[C:1][C:2]#[C:3][C:5]=[O:6] 84 | [OH,O-:4]-[C$(C(=O)([OH,O-])([CX4])),C$([CH](=O)([OH,O-])):2]=[O:3]>>[Cl:5][C:2]=[O:3] 85 | [OH:2]-[$([CX4]),c:1]>>[Br:3][C,c:1] 86 | [OH:2]-[$([CX4]),c:1]>>[Cl:3][C,c:1] 87 | [OH,O-:3][S$(S([CX4])):2](=[O:4])=[O:5]>>[Cl:6][S:2](=[O:5])=[O:4] 88 | 
[OH+0,O-:5]-[C:3](=[O:4])-[C$([CH]([CX4])),C$([CH2]):2]>>[OH+0,O-:5]-[C:3](=[O:4])-[C:2]([Br:6]) 89 | [OH+0,O-:5]-[C:3](=[O:4])-[C$([CH]([CX4])),C$([CH2]):2]>>[OH+0,O-:5]-[C:3](=[O:4])-[C:2]([Cl:6]) 90 | [Cl,I,Br:7][c:1]1[c:2][c:3][c:4][c:5][c:6]1>>[N:9]#[C:8][c:1]1[c:2][c:3][c:4][c:5][c:6]1 91 | [OH,NH2,NH3+:3]-[CH2:2]-[C$(C([CX4,c])([CX4,c])([CX4,c])),C$([CH]([CX4,c])([CX4,c])),C$([CH2]([CX4,c])),C$([CH3]),c$(c):1]>>[C,c:1][C:2]#[N:4] 92 | -------------------------------------------------------------------------------- /data/smiles_vocab.txt: -------------------------------------------------------------------------------- 1 | H 2 | He 3 | Li 4 | Be 5 | B 6 | C 7 | N 8 | O 9 | F 10 | Ne 11 | Na 12 | Mg 13 | Al 14 | Si 15 | P 16 | S 17 | Cl 18 | Ar 19 | K 20 | Ca 21 | Sc 22 | Ti 23 | V 24 | Cr 25 | Mn 26 | Fe 27 | Co 28 | Ni 29 | Cu 30 | Zn 31 | Ga 32 | Ge 33 | As 34 | Se 35 | Br 36 | Kr 37 | Rb 38 | Sr 39 | Y 40 | Zr 41 | Nb 42 | Mo 43 | Tc 44 | Ru 45 | Rh 46 | Pd 47 | Ag 48 | Cd 49 | In 50 | Sn 51 | Sb 52 | Te 53 | I 54 | Xe 55 | Cs 56 | Ba 57 | La 58 | Ce 59 | Pr 60 | Nd 61 | Pm 62 | Sm 63 | Eu 64 | Gd 65 | Tb 66 | Dy 67 | Ho 68 | Er 69 | Tm 70 | Yb 71 | Lu 72 | Hf 73 | Ta 74 | W 75 | Re 76 | Os 77 | Ir 78 | Pt 79 | Au 80 | Hg 81 | Tl 82 | Pb 83 | Bi 84 | Po 85 | At 86 | Rn 87 | Fr 88 | Ra 89 | Ac 90 | Th 91 | Pa 92 | U 93 | Np 94 | Pu 95 | Am 96 | Cm 97 | Bk 98 | Cf 99 | Es 100 | Fm 101 | Md 102 | No 103 | Lr 104 | Rf 105 | Db 106 | Sg 107 | Bh 108 | Hs 109 | Mt 110 | Ds 111 | Rg 112 | Cn 113 | Nh 114 | Fl 115 | Mc 116 | Lv 117 | Ts 118 | Og 119 | b 120 | c 121 | n 122 | o 123 | s 124 | p 125 | 0 126 | 1 127 | 2 128 | 3 129 | 4 130 | 5 131 | 6 132 | 7 133 | 8 134 | 9 135 | [ 136 | ] 137 | ( 138 | ) 139 | . 140 | = 141 | # 142 | - 143 | + 144 | \ 145 | / 146 | : 147 | ~ 148 | @ 149 | ? 
150 | > 151 | * 152 | $ 153 | % 154 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | channels: 2 | - pytorch 3 | - nvidia 4 | - conda-forge 5 | dependencies: 6 | - _libgcc_mutex=0.1 7 | - _openmp_mutex=4.5 8 | - aiohappyeyeballs=2.5.0 9 | - aiohttp=3.11.13 10 | - aiosignal=1.3.2 11 | - alsa-lib=1.2.13 12 | - antlr-python-runtime=4.9.3 13 | - aom=3.6.1 14 | - async-timeout=5.0.1 15 | - attr=2.5.1 16 | - attrs=25.1.0 17 | - aws-c-auth=0.7.22 18 | - aws-c-cal=0.6.15 19 | - aws-c-common=0.9.23 20 | - aws-c-compression=0.2.18 21 | - aws-c-event-stream=0.4.2 22 | - aws-c-http=0.8.2 23 | - aws-c-io=0.14.9 24 | - aws-c-mqtt=0.10.4 25 | - aws-c-s3=0.5.10 26 | - aws-c-sdkutils=0.1.16 27 | - aws-checksums=0.1.18 28 | - aws-crt-cpp=0.26.12 29 | - aws-sdk-cpp=1.11.329 30 | - azure-core-cpp=1.14.0 31 | - azure-identity-cpp=1.10.0 32 | - azure-storage-blobs-cpp=12.13.0 33 | - azure-storage-common-cpp=12.8.0 34 | - azure-storage-files-datalake-cpp=12.12.0 35 | - black=25.1.0 36 | - brotli=1.1.0 37 | - brotli-bin=1.1.0 38 | - brotli-python=1.1.0 39 | - bzip2=1.0.8 40 | - c-ares=1.34.4 41 | - ca-certificates=2025.1.31 42 | - cairo=1.18.2 43 | - certifi=2025.1.31 44 | - cffi=1.17.1 45 | - cfgv=3.3.1 46 | - chardet=5.2.0 47 | - charset-normalizer=3.4.1 48 | - click=8.1.8 49 | - colorama=0.4.6 50 | - contourpy=1.3.1 51 | - cpython=3.10.16 52 | - cryptography=44.0.2 53 | - cuda-cudart=12.1.105 54 | - cuda-cupti=12.1.105 55 | - cuda-libraries=12.1.0 56 | - cuda-nvrtc=12.1.105 57 | - cuda-nvtx=12.1.105 58 | - cuda-opencl=12.6.77 59 | - cuda-runtime=12.1.0 60 | - cuda-version=12.6 61 | - cudnn=8.9.7.29 62 | - cycler=0.12.1 63 | - cyrus-sasl=2.1.27 64 | - datasets=3.3.2 65 | - dbus=1.13.6 66 | - dill=0.3.8 67 | - distlib=0.3.9 68 | - double-conversion=3.3.0 69 | - einops=0.8.1 70 | - expat=2.6.4 71 | - ffmpeg=4.4.2 72 | - filelock=3.16.1 73 | - flake8=7.1.2 74 | - font-ttf-dejavu-sans-mono=2.37 75 | - font-ttf-inconsolata=3.000 76 | - font-ttf-source-code-pro=2.038 77 | - font-ttf-ubuntu=0.83 78 | - fontconfig=2.15.0 79 | - fonts-conda-ecosystem=1 80 | - fonts-conda-forge=1 81 | - fonttools=4.55.3 82 | - freetype=2.12.1 83 | - freetype-py=2.3.0 84 | - frozenlist=1.5.0 85 | - fsspec=2024.12.0 86 | - gettext=0.23.1 87 | - gettext-tools=0.23.1 88 | - gflags=2.2.2 89 | - giflib=5.2.2 90 | - gitdb=4.0.12 91 | - gitpython=3.1.44 92 | - glog=0.7.1 93 | - gmp=6.3.0 94 | - gmpy2=2.1.5 95 | - gnutls=3.7.9 96 | - graphite2=1.3.13 97 | - greenlet=3.1.1 98 | - h2=4.1.0 99 | - harfbuzz=10.2.0 100 | - hpack=4.0.0 101 | - huggingface_hub=0.29.1 102 | - hyperframe=6.0.1 103 | - icu=75.1 104 | - identify=2.6.5 105 | - idna=3.10 106 | - isort=6.0.1 107 | - jinja2=3.1.5 108 | - joblib=1.4.2 109 | - keyutils=1.6.1 110 | - kiwisolver=1.4.7 111 | - krb5=1.21.3 112 | - lame=3.100 113 | - lcms2=2.16 114 | - ld_impl_linux-64=2.43 115 | - lerc=4.0.0 116 | - libabseil=20240116.2 117 | - libarrow=16.1.0 118 | - libarrow-acero=16.1.0 119 | - libarrow-dataset=16.1.0 120 | - libarrow-substrait=16.1.0 121 | - libasprintf=0.23.1 122 | - libasprintf-devel=0.23.1 123 | - libblas=3.9.0 124 | - libboost=1.86.0 125 | - libboost-python=1.86.0 126 | - libbrotlicommon=1.1.0 127 | - libbrotlidec=1.1.0 128 | - libbrotlienc=1.1.0 129 | - libcap=2.71 130 | - libcblas=3.9.0 131 | - libclang-cpp19.1=19.1.7 132 | - libclang13=19.1.7 133 | - libcrc32c=1.1.2 134 | - libcublas=12.1.0.26 135 | - libcufft=11.0.2.4 136 | - 
libcufile=1.11.1.6 137 | - libcups=2.3.3 138 | - libcurand=10.3.7.77 139 | - libcurl=8.12.1 140 | - libcusolver=11.4.4.55 141 | - libcusparse=12.0.2.55 142 | - libdeflate=1.23 143 | - libdrm=2.4.124 144 | - libedit=3.1.20240808 145 | - libegl=1.7.0 146 | - libev=4.33 147 | - libevent=2.1.12 148 | - libexpat=2.6.4 149 | - libffi=3.4.2 150 | - libgcc=14.2.0 151 | - libgcc-ng=14.2.0 152 | - libgcrypt-lib=1.11.0 153 | - libgettextpo=0.23.1 154 | - libgettextpo-devel=0.23.1 155 | - libgfortran=14.2.0 156 | - libgfortran5=14.2.0 157 | - libgl=1.7.0 158 | - libglib=2.82.2 159 | - libglvnd=1.7.0 160 | - libglx=1.7.0 161 | - libgoogle-cloud=2.25.0 162 | - libgoogle-cloud-storage=2.25.0 163 | - libgpg-error=1.51 164 | - libgrpc=1.62.2 165 | - libhwloc=2.11.2 166 | - libiconv=1.17 167 | - libidn2=2.3.7 168 | - libjpeg-turbo=3.0.0 169 | - liblapack=3.9.0 170 | - libllvm19=19.1.7 171 | - liblzma=5.6.3 172 | - liblzma-devel=5.6.3 173 | - libmagma=2.8.0 174 | - libmagma_sparse=2.8.0 175 | - libnghttp2=1.64.0 176 | - libnl=3.11.0 177 | - libnpp=12.0.2.50 178 | - libnsl=2.0.1 179 | - libntlm=1.8 180 | - libnvjitlink=12.1.105 181 | - libnvjpeg=12.1.1.14 182 | - libopenblas=0.3.29 183 | - libopengl=1.7.0 184 | - libopentelemetry-cpp=1.16.1 185 | - libopentelemetry-cpp-headers=1.16.1 186 | - libparquet=16.1.0 187 | - libpciaccess=0.18 188 | - libpng=1.6.45 189 | - libpq=17.4 190 | - libprotobuf=4.25.3 191 | - librdkit=2024.09.6 192 | - libre2-11=2023.09.01 193 | - libsqlite=3.48.0 194 | - libssh2=1.11.1 195 | - libstdcxx=14.2.0 196 | - libstdcxx-ng=14.2.0 197 | - libsystemd0=256.9 198 | - libtasn1=4.20.0 199 | - libthrift=0.19.0 200 | - libtiff=4.7.0 201 | - libtorch=2.4.0 202 | - libudev1=257.2 203 | - libunistring=0.9.10 204 | - libutf8proc=2.8.0 205 | - libuuid=2.38.1 206 | - libuv=1.50.0 207 | - libva=2.22.0 208 | - libvpx=1.13.1 209 | - libwebp=1.5.0 210 | - libwebp-base=1.5.0 211 | - libxcb=1.17.0 212 | - libxcrypt=4.4.36 213 | - libxkbcommon=1.7.0 214 | - libxml2=2.13.5 215 | - libxslt=1.1.39 216 | - libzlib=1.3.1 217 | - llvm-openmp=19.1.7 218 | - lz4-c=1.9.4 219 | - markdown-it-py=3.0.0 220 | - markupsafe=3.0.2 221 | - matplotlib=3.10.1 222 | - matplotlib-base=3.10.1 223 | - mccabe=0.7.0 224 | - mdurl=0.1.2 225 | - mkl=2023.2.0 226 | - mpc=1.3.1 227 | - mpfr=4.2.1 228 | - mpmath=1.3.0 229 | - multidict=6.1.0 230 | - multiprocess=0.70.16 231 | - munkres=1.1.4 232 | - mypy=1.15.0 233 | - mypy_extensions=1.0.0 234 | - mysql-common=9.0.1 235 | - mysql-libs=9.0.1 236 | - nccl=2.25.1.1 237 | - ncurses=6.5 238 | - nettle=3.9.1 239 | - networkx=3.4.2 240 | - nlohmann_json=3.11.3 241 | - nodeenv=1.9.1 242 | - numpy=2.2.3 243 | - ocl-icd=2.3.2 244 | - omegaconf=2.3.0 245 | - openbabel=3.1.1 246 | - opencl-headers=2024.10.24 247 | - openh264=2.3.1 248 | - openjpeg=2.5.3 249 | - openldap=2.6.9 250 | - openssl=3.4.1 251 | - optree=0.14.1 252 | - orc=2.0.1 253 | - p11-kit=0.24.1 254 | - packaging=24.2 255 | - pandas=2.2.3 256 | - pathspec=0.12.1 257 | - patsy=1.0.1 258 | - pcre2=10.44 259 | - pillow=11.1.0 260 | - pip=24.3.1 261 | - pixman=0.44.2 262 | - platformdirs=4.3.6 263 | - pre-commit=4.1.0 264 | - prometheus-cpp=1.2.4 265 | - propcache=0.2.1 266 | - psutil=6.1.1 267 | - pthread-stubs=0.4 268 | - pyarrow=16.1.0 269 | - pyarrow-core=16.1.0 270 | - pybind11=2.13.6 271 | - pybind11-global=2.13.6 272 | - pycairo=1.27.0 273 | - pycodestyle=2.12.1 274 | - pycparser=2.22 275 | - pyflakes=3.2.0 276 | - pygments=2.19.1 277 | - pyparsing=3.2.1 278 | - pyside6=6.8.1 279 | - pysocks=1.7.1 280 | - python=3.10.16 281 | - 
python-dateutil=2.9.0.post0 282 | - python-tzdata=2024.2 283 | - python-xxhash=3.5.0 284 | - python_abi=3.10 285 | - pytorch=2.4.0 286 | - pytorch-cuda=12.1 287 | - pytorch-mutex=1.0 288 | - pytz=2024.1 289 | - pyyaml=6.0.2 290 | - qhull=2020.2 291 | - qt6-main=6.8.1 292 | - rdkit=2024.09.6 293 | - rdma-core=55.0 294 | - re2=2023.09.01 295 | - readline=8.2 296 | - regex=2024.11.6 297 | - reportlab=4.2.5 298 | - requests=2.32.3 299 | - rich=13.9.4 300 | - rlpycairo=0.2.0 301 | - s2n=1.4.16 302 | - safetensors=0.5.3 303 | - scikit-learn=1.6.1 304 | - scipy=1.15.1 305 | - seaborn=0.13.2 306 | - seaborn-base=0.13.2 307 | - setuptools=75.8.0 308 | - six=1.17.0 309 | - sleef=3.8 310 | - smmap=5.0.0 311 | - snappy=1.2.1 312 | - sqlalchemy=2.0.37 313 | - statsmodels=0.14.4 314 | - svt-av1=1.4.1 315 | - sympy=1.13.3 316 | - tbb=2021.13.0 317 | - threadpoolctl=3.5.0 318 | - tk=8.6.13 319 | - tomli=2.2.1 320 | - torchaudio=2.4.0 321 | - torchvision=0.19.0 322 | - tornado=6.4.2 323 | - tqdm=4.67.1 324 | - types-pyyaml=6.0.12.20241230 325 | - typing-extensions=4.12.2 326 | - typing_extensions=4.12.2 327 | - tzdata=2025a 328 | - ukkonen=1.0.1 329 | - unicodedata2=16.0.0 330 | - urllib3=2.3.0 331 | - virtualenv=20.29.1 332 | - wayland=1.23.1 333 | - wayland-protocols=1.41 334 | - wheel=0.45.1 335 | - x264=1!164.3095 336 | - x265=3.5 337 | - xcb-util=0.4.1 338 | - xcb-util-cursor=0.1.5 339 | - xcb-util-image=0.4.0 340 | - xcb-util-keysyms=0.4.1 341 | - xcb-util-renderutil=0.3.10 342 | - xcb-util-wm=0.4.2 343 | - xkeyboard-config=2.43 344 | - xlrd=2.0.1 345 | - xorg-libice=1.1.2 346 | - xorg-libsm=1.2.5 347 | - xorg-libx11=1.8.10 348 | - xorg-libxau=1.0.12 349 | - xorg-libxcomposite=0.4.6 350 | - xorg-libxcursor=1.2.3 351 | - xorg-libxdamage=1.1.6 352 | - xorg-libxdmcp=1.1.5 353 | - xorg-libxext=1.3.6 354 | - xorg-libxfixes=6.0.1 355 | - xorg-libxi=1.8.2 356 | - xorg-libxrandr=1.5.4 357 | - xorg-libxrender=0.9.12 358 | - xorg-libxtst=1.2.5 359 | - xorg-libxxf86vm=1.1.6 360 | - xxhash=0.8.3 361 | - xz=5.6.3 362 | - xz-gpl-tools=5.6.3 363 | - xz-tools=5.6.3 364 | - yaml=0.2.5 365 | - yarl=1.18.3 366 | - zlib=1.3.1 367 | - zstandard=0.23.0 368 | - zstd=1.5.6 369 | - pip: 370 | - accelerate==1.4.0 371 | - huggingface-hub==0.29.2 372 | - synllama==0.1.0 373 | - tokenizers==0.19.1 374 | - transformers==4.44.2 -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.black] 2 | line-length = 119 3 | target-version = ['py310'] 4 | 5 | [tool.isort] 6 | extra_standard_library = "typing_extensions,mypy,mypy_extensions" 7 | profile = "black" 8 | 9 | [tool.autoflake] 10 | remove-all-unused-imports = true 11 | expand-star-imports = true 12 | ignore-init-module-imports = true 13 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name="synllama", 5 | version="0.1.0", 6 | packages=["synllama"], 7 | ) -------------------------------------------------------------------------------- /steps/step_10_calc_embedding.py: -------------------------------------------------------------------------------- 1 | # Preprocessing step 1: Extract metadata and create fingerprint index and reactant reaction matrices 2 | import pathlib, shutil 3 | from sklearn.cluster import KMeans 4 | from synllama.chem.mol import read_mol_file, 
FingerprintOption
5 | from synllama.chem.fpindex import FingerprintIndex
6 | from synllama.chem.matrix import ReactantReactionMatrix, ReactionContainer
7 | from synllama.chem.reaction import read_reaction_file
8 | from synllama.chem.smiles_tfidf import SmilesSimilaritySearch
9 | import numpy as np
10 | import pickle, click
11 | 
12 | # As noted in the README, the Enamine data needs to be downloaded separately. If you want to use the default data,
13 | # please download the data from the following link: https://enamine.net/building-blocks/building-blocks-catalog
14 | 
15 | # If you want to request these exact files, please contact me at kysun@berkeley.edu or open an issue on the GitHub repo.
16 | 
17 | _default_sdf_path = pathlib.Path("data/Enamine_Rush-Delivery_Building_Blocks-US_253345cmpd_20250212.sdf")
18 | _default_reaction_path = pathlib.Path("data/91_rxns/91_rxn_templates.txt")
19 | _default_data_folder = pathlib.Path("data/91_rxns/")
20 | _default_testing_data_path = pathlib.Path("data/13k_unseen_enamine_bbs.smi")
21 | _random_state = 0
22 | np.random.seed(_random_state) # for reproducibility of the test set bb if no testing data is provided
23 | 
24 | @click.command()
25 | @click.option(
26 |     "--data_folder",
27 |     type=click.Path(path_type=pathlib.Path),
28 |     default=_default_data_folder,
29 |     help="Path to the data folder."
30 | )
31 | @click.option(
32 |     "--bb_path",
33 |     type=click.Path(exists=True, path_type=pathlib.Path),
34 |     default=_default_sdf_path,
35 |     help="Path to the input building blocks SDF file."
36 | )
37 | @click.option(
38 |     "--rxn_template_path",
39 |     type=click.Path(exists=True, path_type=pathlib.Path),
40 |     default=_default_reaction_path,
41 |     help="Path to the input reaction templates file."
42 | )
43 | @click.option(
44 |     "--testing_data_path",
45 |     type=click.Path(exists=True, path_type=pathlib.Path),
46 |     default=_default_testing_data_path,
47 |     help="Path to the testing data file."
48 | )
49 | 
50 | def run_all_preprocessing(data_folder, bb_path, rxn_template_path, testing_data_path = None):
51 |     processed_folder = data_folder / "processed"
52 |     processed_folder.mkdir(parents=True, exist_ok=True)
53 |     testing_folder = data_folder / "testing_data"
54 |     testing_folder.mkdir(parents=True, exist_ok=True)
55 |     molecules = list(read_mol_file(bb_path))
56 |     print(f"Generating fingerprints for {bb_path}")
57 |     generate_morgan_fingerprints(molecules, processed_folder / "fpindex.pkl", processed_folder / "enamine_metadata.csv")
58 |     # print(f"Generating smiles embedding for {bb_path}")
59 |     # generate_smiles_embedding(molecules, data_folder / "smiles_embedding.pkl", smiles_vocab_path)
60 |     if testing_data_path is None:
61 |         print("Clustering fingerprints")
62 |         knn_clustering(processed_folder / "fpindex.pkl", testing_folder, n_clusters=128, random_state=_random_state)
63 |     else:
64 |         shutil.copy(testing_data_path, testing_folder / "test_bb.smi")
65 |     print("Creating reactant reaction matrix cache")
66 |     create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "all" / "reaction_matrix.pkl")
67 |     create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "train" / "reaction_matrix_train.pkl", testing_folder / "test_bb.smi")
68 |     create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "test" / "reaction_matrix_test.pkl", testing_folder / "test_bb.smi", test_only=True)
69 | 
70 | def generate_morgan_fingerprints(molecules, out, meta_data_path):
71 |     """Generate Morgan fingerprints from the specified SDF file and save the FingerprintIndex."""
72 |     # Define the fingerprint option
73 |     fp_option = FingerprintOption.morgan_for_building_blocks()
74 |     if meta_data_path:
75 |         import csv
76 |         with open(meta_data_path, 'w', newline='') as csvfile:
77 |             fieldnames = ['SMILES'] + list(molecules[0].meta_info.keys())
78 |             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
79 |             writer.writeheader()
80 |             for mol in molecules:
81 |                 row = {'SMILES': mol.smiles}
82 |                 row.update(mol.meta_info)
83 |                 writer.writerow(row)
84 |     # Generate fingerprints
85 |     fp_index = FingerprintIndex(molecules, fp_option)
86 |     out.parent.mkdir(parents=True, exist_ok=True)
87 |     fp_index.save(out)
88 | 
89 | # don't need this for now because we are using reaction template-based smiles embedding
90 | def generate_smiles_embedding(molecules, out, smiles_vocab_path):
91 |     """Generate smiles embedding from the specified SDF file and save the SmilesSimilaritySearch."""
92 |     smiles_tokens = [line.strip() for line in open(smiles_vocab_path)]
93 |     smiles_searcher = SmilesSimilaritySearch(token_list=smiles_tokens)
94 |     smiles_searcher.fit(molecules, save_ngram=True)
95 |     out.parent.mkdir(parents=True, exist_ok=True)
96 |     smiles_searcher.save(out)
97 | 
98 | def knn_clustering(fp_index_path, out, n_clusters=128, random_state=_random_state):
99 |     """Cluster the FingerprintIndex with k-means and save the SMILES of the molecules in cluster 0 to the output file as the held-out test set."""
100 |     fp_index = FingerprintIndex.load(fp_index_path)
101 |     fp = fp_index._fp
102 |     kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
103 |     kmeans.fit(fp)
104 |     for i in range(n_clusters):
105 |         print(f"Cluster {i} has {np.sum(kmeans.labels_ == i)} molecules")
106 |         cluster_idx = np.where(kmeans.labels_ == i)[0]
107 |         cluster_smiles = [fp_index.molecules[j].smiles for j in cluster_idx]
108 |         if i == 0:
109 |             cluster_out_path = out / "test_bb.smi"
110 |             with
open(cluster_out_path, "w") as f: 111 | for smi in cluster_smiles: 112 | f.write(smi + "\n") 113 | 114 | def create_reactant_reaction_matrix_cache(molecules, reaction_path, cache_path, excl_path = None, test_only= False): 115 | """Create a reactant reaction matrix cache for reaction generation.""" 116 | rxns = ReactionContainer(read_reaction_file(reaction_path)) 117 | if test_only and excl_path is None: 118 | raise ValueError("test_only is True but excl_path is None") 119 | if test_only: 120 | mols = list(read_mol_file(excl_path)) 121 | else: 122 | mols = molecules 123 | if excl_path is not None: 124 | excl_mols = list(read_mol_file(excl_path)) 125 | excl_smiles = {m.smiles for m in excl_mols} 126 | mols = [m for m in mols if m.smiles not in excl_smiles] 127 | m = ReactantReactionMatrix(mols, rxns) 128 | cache_path.parent.mkdir(parents=True, exist_ok=True) 129 | m.save(cache_path) 130 | 131 | if __name__ == "__main__": 132 | run_all_preprocessing() 133 | # old_bbs = pathlib.Path("data/Enamine_Rush-Delivery_Building_Blocks-US_243540cmpd_20240806.sdf") 134 | # new_bbs = pathlib.Path("data/Enamine_Rush-Delivery_Building_Blocks-US_253345cmpd_20250212.sdf") 135 | # old_mols = list(read_mol_file(old_bbs)) 136 | # new_mols = list(read_mol_file(new_bbs)) 137 | # old_smiles = {m.smiles for m in old_mols} 138 | # new_smiles_list = [] 139 | # for mol in new_mols: 140 | # if mol.smiles not in old_smiles: 141 | # new_smiles_list.append(mol.smiles) 142 | # with open("data/new_test_smiles_list.smi", "w") as f: 143 | # for smi in new_smiles_list: 144 | # f.write(smi + "\n") 145 | # molecules = list(read_mol_file("data/new_test_smiles_list.smi")) 146 | # rxn_template_path = pathlib.Path("data/91_rxns/reaction_templates_hb.txt") 147 | # processed_folder = pathlib.Path("data/91_rxns/") 148 | # create_reactant_reaction_matrix_cache(molecules, rxn_template_path, processed_folder / "test" / f"reaction_matrix_test_new_enamine.pkl","data/new_test_smiles_list.smi", test_only=True) -------------------------------------------------------------------------------- /steps/step_11_generate_fpindex_smiles_tfidf.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | from pathlib import Path 3 | import click 4 | from synllama.chem.fpindex import FingerprintIndex 5 | from synllama.chem.smiles_tfidf import SmilesSimilaritySearch 6 | from synllama.chem.mol import FingerprintOption 7 | from synllama.chem.matrix import ReactantReactionMatrix 8 | 9 | _default_matrix_file = "data/91_rxns/processed/all/reaction_matrix.pkl" 10 | _default_output_dir = "data/91_rxns/rxn_embeddings" 11 | _default_token_list_path = "data/smiles_vocab.txt" 12 | 13 | @click.command() 14 | @click.option("--matrix_file", type=click.Path(exists=True, path_type=Path), default=_default_matrix_file) 15 | @click.option("--output_dir", type=click.Path(path_type=Path), default=_default_output_dir) 16 | @click.option("--token_list_path", type=click.Path(exists=True, path_type=Path), default=_default_token_list_path) 17 | 18 | def main(matrix_file: Path, output_dir: Path, token_list_path: Path): 19 | # Load the reactant-reaction matrix 20 | with open(matrix_file, 'rb') as f: 21 | reactant_reaction_matrix: ReactantReactionMatrix = pickle.load(f) 22 | 23 | # Create output directory if it doesn't exist 24 | output_dir = Path(output_dir) 25 | output_dir.mkdir(parents=True, exist_ok=True) 26 | 27 | # Dictionary to map reaction index to reaction SMARTS 28 | reaction_smarts_map = {} 29 | 30 | # Iterate over each reaction 
31 | for reaction_idx, reaction in enumerate(reactant_reaction_matrix.reactions): 32 | # Find reactants that can participate in this reaction 33 | reactant_indices = reactant_reaction_matrix.matrix[:, reaction_idx].nonzero()[0] 34 | reactants = [reactant_reaction_matrix.reactants[i] for i in reactant_indices] 35 | 36 | # Generate FingerprintIndex 37 | fp_option = FingerprintOption.morgan_for_building_blocks() 38 | fp_index = FingerprintIndex(molecules=reactants, fp_option=fp_option) 39 | 40 | # Save FingerprintIndex 41 | fp_index_file = output_dir / f"fpindex_{reaction_idx}.pkl" 42 | fp_index.save(fp_index_file) 43 | 44 | # Generate SmilesSimilaritySearch 45 | smiles_search = SmilesSimilaritySearch(token_list_path=token_list_path) 46 | smiles_search.fit(molecules=reactants) 47 | 48 | # Save SmilesSimilaritySearch 49 | smiles_search_file = output_dir / f"smiles_tfidf_{reaction_idx}.pkl" 50 | smiles_search.save(smiles_search_file) 51 | 52 | # Map reaction index to reaction SMARTS 53 | reaction_smarts_map[reaction_idx] = (reaction.smarts, len(reactants)) # Assuming `reaction` has a `smarts` attribute 54 | print(f"Processed reaction {reaction_idx}: {len(reactants)} reactants") 55 | 56 | # Save the reaction SMARTS map 57 | smarts_map_file = output_dir / "reaction_smarts_map.pkl" 58 | with open(smarts_map_file, 'wb') as f: 59 | pickle.dump(reaction_smarts_map, f) 60 | 61 | print("All reactions processed and SMARTS map saved.") 62 | 63 | if __name__ == "__main__": 64 | main() 65 | -------------------------------------------------------------------------------- /steps/step_20_generate_reactions.py: -------------------------------------------------------------------------------- 1 | # Preprocessing step 2: Generate prompt-response pairs of reactions for LLM fine-tuning. 
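# A sketch of one JSONL record this script writes (field values are illustrative;
# the exact prompt/response wording comes from TEMPLATE, REACTION_BASE_MAX2/3, and
# BB_BASE in synllama/llm/vars.py, imported below):
#   {"input": "... SMILES string: <target SMILES>",
#    "output": "{\"reactions\": [{\"reaction_template\": \"...\", \"reactants\": [...], \"product\": \"...\"}, ...], \"building_blocks\": [...]}"}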
2 | 3 | import pickle, random, json, os 4 | from copy import deepcopy 5 | from pathlib import Path 6 | from tqdm import tqdm 7 | import click 8 | import multiprocessing as mp 9 | from joblib import Parallel, delayed 10 | from collections import defaultdict 11 | from synllama.chem.matrix import ReactantReactionMatrix 12 | from synllama.chem.stack import create_stack 13 | from synllama.chem.reaction import Reaction 14 | from synllama.llm.vars import TEMPLATE, BB_BASE, REACTION_BASE_MAX2, REACTION_BASE_MAX3 15 | 16 | def rebuild_response(synthesis_route, rxn_mapping, max_reactants = 3): 17 | if max_reactants == 2: 18 | reaction_base = REACTION_BASE_MAX2 19 | elif max_reactants == 3: 20 | reaction_base = REACTION_BASE_MAX3 21 | else: 22 | raise ValueError(f"Invalid number of reactants: {max_reactants}") 23 | 24 | synthesis = synthesis_route.replace("\\", "\\\\").split(";")[::-1] # fix json escaping 25 | target_smiles = synthesis[0] 26 | rxn_positions = [i for i, s in enumerate(synthesis) if s.startswith("R")] 27 | bb_list = [target_smiles] 28 | rxn_list = [] 29 | 30 | for j, rxn_pos in enumerate(rxn_positions): 31 | product = synthesis[rxn_pos-1] 32 | rxn_idx = j+1 33 | if j+1 < len(rxn_positions): 34 | reactants = synthesis[rxn_pos+1:rxn_positions[j+1]] 35 | else: 36 | reactants = synthesis[rxn_pos+1:] 37 | reactants_padded = reactants + [""] * (max_reactants - len(reactants)) 38 | rxn_copy = deepcopy(reaction_base) 39 | rxn_copy = rxn_copy.replace('PRODUCT', product) 40 | rxn_copy = rxn_copy.replace("RXN_TEMPLATE", rxn_mapping[int(synthesis[rxn_pos][1:].split("_")[0])]) 41 | rxn_copy = rxn_copy.replace("REACTION_NUM", str(rxn_idx)) 42 | rxn_copy = rxn_copy.replace("REACTANT1", reactants_padded[-1]) 43 | rxn_copy = rxn_copy.replace("REACTANT2", reactants_padded[-2]) 44 | if max_reactants == 3: 45 | rxn_copy = rxn_copy.replace("REACTANT3", reactants_padded[-3]) 46 | rxn_list.append(rxn_copy) 47 | bb_list.remove(product) 48 | bb_list.extend(reactants) 49 | 50 | bb_list_formatted = [deepcopy(BB_BASE).replace("Building_Block", bb) for bb in bb_list] 51 | 52 | template_copy = deepcopy(TEMPLATE) 53 | template_copy['input'] = template_copy['input'].replace("SMILES_STRING", target_smiles) 54 | template_copy['output'] = template_copy['output'].replace("REACTIONS", ", ".join(rxn_list)) 55 | template_copy['output'] = template_copy['output'].replace("BUILDING_BLOCKS", ", ".join(bb_list_formatted)) 56 | output_dict = json.loads(template_copy['output']) 57 | return template_copy 58 | 59 | def generate_reaction_data(matrix: ReactantReactionMatrix, rxn_mapping, rxn_count, init_stack_weighted_ratio, prob_u_fp, max_num_reactions=5, max_num_atoms=80): 60 | stack = create_stack( 61 | matrix, 62 | rxn_count, 63 | max_num_reactions=max_num_reactions, 64 | max_num_atoms=max_num_atoms, 65 | init_stack_weighted_ratio=init_stack_weighted_ratio, 66 | prob_u_fp=prob_u_fp, 67 | ) 68 | rebuilt_response = rebuild_response(stack.get_action_string(), rxn_mapping) 69 | return rebuilt_response 70 | 71 | def generate_reaction_chunk(matrix, rxn_mapping, rxn_count, num_reactions, init_stack_weighted_ratio, prob_u_fp, max_num_reactions=5, max_num_atoms=80): 72 | reactions_dict = defaultdict(int) 73 | all_reactions = [] 74 | while len(all_reactions) < num_reactions: 75 | try: 76 | stack = create_stack( 77 | matrix, 78 | rxn_count, 79 | max_num_reactions=max_num_reactions, 80 | max_num_atoms=max_num_atoms, 81 | init_stack_weighted_ratio=init_stack_weighted_ratio, 82 | prob_u_fp=prob_u_fp, 83 | ) 84 | 
all_reactions.append(rebuild_response(stack.get_action_string(), rxn_mapping)) 85 | rxns = [r for r in stack.get_action_string().split(";") if r.startswith("R")] 86 | for rxn in rxns: 87 | reactions_dict[rxn] += 1 88 | except Exception as e: 89 | continue 90 | print(sorted(reactions_dict.items(), key=lambda x: int(x[0][1:]))) 91 | return all_reactions 92 | 93 | @click.command() 94 | @click.option("--matrix_path", type=click.Path(exists=True, path_type=Path), required=True, default="data/91_rxns/processed/test/reaction_matrix_test.pkl") 95 | @click.option("--rxn_mapping_path", type=click.Path(exists=True, path_type=Path), required=True, default="data/91_rxns/rxn_embeddings/reaction_smarts_map.pkl") 96 | @click.option("--prob_u_fp", type=click.Path(exists=True, path_type=Path), default=None) 97 | @click.option("--init_stack_weighted_ratio", type=float, required=True, default=0.8) 98 | @click.option("--name", default=None) 99 | @click.option("--num_reactions", type=int, required=True, default=2000000) 100 | @click.option("--batch_size", type=int, required=True, default=1000) 101 | @click.option("--write_for_benchmark", is_flag=True) 102 | 103 | def main(matrix_path, rxn_mapping_path, prob_u_fp, num_reactions, init_stack_weighted_ratio=0.8, name=None, batch_size=1000, write_for_benchmark=False): 104 | matrix: ReactantReactionMatrix = ReactantReactionMatrix.load(matrix_path) 105 | reaction_smarts_dict = pickle.load(open(rxn_mapping_path, "rb")) 106 | rxn_mapping = {k: v[0] for k, v in reaction_smarts_dict.items()} 107 | rxn_count = {k: v[1] for k, v in reaction_smarts_dict.items()} 108 | prob_u_fp = prob_u_fp 109 | num_cores = mp.cpu_count() 110 | num_batches = num_reactions // batch_size // num_cores 111 | remainder = num_reactions - num_batches * batch_size * num_cores 112 | if name is None: 113 | name = f"{num_reactions/1000000:.1f}m_reactions" 114 | 115 | # check if the file exists 116 | if os.path.exists(f"data/{name}.jsonl"): 117 | print(f"File {name}.jsonl already exists. 
Deleting to start fresh...") 118 | os.remove(f"data/{name}.jsonl") 119 | 120 | for batch_num in range(num_batches): 121 | with tqdm(total=batch_size, desc=f"Generating reactions batch {batch_num+1} of {num_batches}") as pbar: 122 | with open(f"data/{name}.jsonl", "a") as f: 123 | results = Parallel(n_jobs=num_cores)( 124 | delayed(generate_reaction_chunk)(matrix, rxn_mapping, rxn_count, batch_size, init_stack_weighted_ratio, prob_u_fp) 125 | for _ in range(num_cores) 126 | ) 127 | for result in results: 128 | for r in result: 129 | json.dump(r, f) 130 | f.write("\n") 131 | 132 | if remainder > 0: 133 | with tqdm(total=remainder, desc=f"Generating reactions batch {num_batches+1} of {num_batches}") as pbar: 134 | results = Parallel(n_jobs=num_cores)( 135 | delayed(generate_reaction_chunk)(matrix, rxn_mapping, rxn_count, remainder // num_cores + 1, init_stack_weighted_ratio, prob_u_fp) 136 | for _ in range(num_cores) 137 | ) 138 | results = [r for rr in results for r in rr if r is not None] 139 | results = results[:remainder] 140 | with open(f"data/{name}.jsonl", "a") as f: 141 | for result in results: 142 | json.dump(result, f) 143 | f.write("\n") 144 | 145 | if write_for_benchmark: 146 | with open(f"data/{name}.jsonl", "r") as f: 147 | reactions = [json.loads(line) for line in f] 148 | reactions_dict = {r['input'].split("SMILES string:")[1].strip(): [json.loads(r['output'])] for r in reactions} 149 | with open(f"data/{name}_benchmark.pkl", "wb") as f: 150 | pickle.dump(reactions_dict, f) 151 | with open(f"data/{name}.smi", "w") as f: 152 | for r in reactions: 153 | f.write(r['input'].split("SMILES string:")[1].strip() + "\n") 154 | 155 | if __name__ == "__main__": 156 | mp.set_start_method("spawn") 157 | main() 158 | -------------------------------------------------------------------------------- /steps/step_30_0_benchmark_filter_raw_output.py: -------------------------------------------------------------------------------- 1 | import pickle, argparse, os, glob, csv 2 | import pandas as pd 3 | import numpy as np 4 | from tqdm import tqdm 5 | from collections import defaultdict 6 | from synllama.chem.reaction import Reaction 7 | from synllama.chem.fpindex import FingerprintIndex, compute_fingerprints 8 | from synllama.chem.mol import FingerprintOption, Molecule 9 | import multiprocessing as mp 10 | 11 | def arrange_reactants_and_react_synllama(template, reactant_mols): 12 | rxn = Reaction(template) 13 | if len(reactant_mols) != rxn.num_reactants: 14 | return None, False 15 | if len(reactant_mols) == 1: 16 | product = rxn(reactant_mols) 17 | if len(product) == 0: 18 | return None, False 19 | elif len(reactant_mols) == 2: 20 | product = [] 21 | product.extend(rxn([reactant_mols[0], reactant_mols[1]])) 22 | product.extend(rxn([reactant_mols[1], reactant_mols[0]])) 23 | if len(product) == 0: 24 | return None, False 25 | elif len(reactant_mols) == 3: 26 | product = [] 27 | product.extend(rxn([reactant_mols[0], reactant_mols[1], reactant_mols[2]])) 28 | product.extend(rxn([reactant_mols[0], reactant_mols[2], reactant_mols[1]])) 29 | product.extend(rxn([reactant_mols[1], reactant_mols[0], reactant_mols[2]])) 30 | product.extend(rxn([reactant_mols[1], reactant_mols[2], reactant_mols[0]])) 31 | product.extend(rxn([reactant_mols[2], reactant_mols[0], reactant_mols[1]])) 32 | product.extend(rxn([reactant_mols[2], reactant_mols[1], reactant_mols[0]])) 33 | if len(product) == 0: 34 | return None, False 35 | else: 36 | return None, False 37 | return product, True 38 | 39 | def 
filter_raw_output(llama_output, reaction_idx_map):
40 | 
41 |     successful_synthesis = defaultdict(list)
42 |     for product_smiles, example_data in tqdm(llama_output.items()):
43 |         if type(example_data) == str: continue
44 |         for output in example_data:
45 |             if type(output) == str: continue
46 |             try:
47 |                 assert 'reactions' in output and 'building_blocks' in output
48 |                 reactions = output['reactions']
49 |                 building_blocks = output['building_blocks']
50 |                 reactant_stack = []
51 |                 reaction_strings = []
52 |                 reactant_stack.append(product_smiles)
53 |                 reaction_strings.append(product_smiles)
54 | 
55 |                 for reaction in reactions:
56 |                     assert 'reaction_template' in reaction and 'reactants' in reaction and 'product' in reaction
57 |                     product = reaction['product']
58 |                     assert product in reactant_stack
59 |                     reactant_stack.remove(product)
60 |                     reaction_strings.remove(product)
61 |                     reaction_strings.append(product)
62 |                     template = reaction['reaction_template'].split('<rxn>')[1].split('</rxn>')[0]  # NOTE: the original tag literals were lost when angle-bracket markup was stripped from this listing; '<rxn>'/'</rxn>' are placeholders for the tags defined in synllama/llm/vars.py
63 |                     assert template in reaction_idx_map
64 |                     reaction_strings.append(f"R{reaction_idx_map[template]}")
65 |                     reactants = reaction['reactants']
66 |                     reactants = [reactant.split("<smiles>")[-1].split("</smiles>")[0] if "<smiles>" in reactant else reactant for reactant in reactants]  # '<smiles>'/'</smiles>' are likewise placeholder tags
67 |                     reactant_stack.extend(reactants)
68 |                     reactant_mols = []
69 |                     for reactant in reactants:
70 |                         if reactant == '': continue
71 |                         mol = Molecule(reactant, source="smiles")
72 |                         if not mol.is_valid:
73 |                             raise ValueError(f"Invalid molecule {reactant}")
74 |                         reactant_mols.append(mol)
75 |                         reaction_strings.append(reactant)
76 |                     product_mol = Molecule(product, source="smiles")
77 |                     if not product_mol.is_valid:
78 |                         raise ValueError(f"Invalid molecule {product}")
79 |                     product_from_rxn, matched = arrange_reactants_and_react_synllama(template, reactant_mols)
80 |                     assert matched
81 |                     product_from_rxn = [prod.csmiles for prod in product_from_rxn if prod is not None]
82 |                     assert product_mol.csmiles in product_from_rxn
83 | 
84 |                 bbs = []
85 |                 for bb in building_blocks:
86 |                     bb_clean = bb.split("<smiles>")[-1].split("</smiles>")[0]  # placeholder tags (see note above)
87 |                     assert bb_clean in reactant_stack
88 |                     reactant_stack.remove(bb_clean)
89 |                     bbs.append(bb_clean)
90 | 
91 |                 successful_synthesis[product_smiles].append({
92 |                     "reaction_strings": ";".join(reaction_strings[::-1]),
93 |                     "bbs": bbs,
94 |                 })
95 | 
96 |             except Exception as e:
97 |                 continue
98 | 
99 |     return successful_synthesis
100 | 
101 | def check_bb_in_enamine(args):
102 |     bb, fp_searcher = args
103 |     fingerprints = Molecule(bb).get_fingerprint(FingerprintOption.morgan_for_building_blocks(), as_bitvec=False)
104 |     fp_searched_results = fp_searcher.query_single(np.array([fingerprints]), k=10)
105 |     fp_searched_mols = [result.molecule for result in fp_searched_results]
106 |     return np.max([Molecule(bb).sim(mol, FingerprintOption.morgan_for_tanimoto_similarity()) for mol in fp_searched_mols])
107 | 
108 | def check_bbs_in_enamine_parallel(bbs, fp_searcher, num_cores):
109 | 
110 |     bb_mols = [Molecule(bb, source="smiles") for bb in bbs]
111 |     fingerprints = compute_fingerprints(bb_mols, FingerprintOption.morgan_for_building_blocks(), batch_size=1024)
112 |     fp_searched_results = fp_searcher.query(fingerprints, k=10)
113 |     bbs_similarity = []
114 |     for bb, result in zip(bbs, fp_searched_results):
115 |         bbs_similarity.append(np.max([Molecule(bb).sim(r.molecule, FingerprintOption.morgan_for_tanimoto_similarity()) for r in result]))
116 | 
117 |     # Create a dictionary with bb as the key and its similarity score as the value
118 |     bb_similarity_dict = {bb: similarity for bb, similarity in
zip(bbs, bbs_similarity)} 119 | return bb_similarity_dict 120 | 121 | def convert_smiles_dict(successful_synthesis, save_folder, file_name, fp_searcher, num_cores): 122 | 123 | if not os.path.exists(save_folder): 124 | os.makedirs(save_folder, exist_ok=True) 125 | # collect all bbs to search in enamine 126 | all_bbs = [] 127 | for _, value in successful_synthesis.items(): 128 | for v in value: 129 | all_bbs.extend(v['bbs']) 130 | all_bbs = list(set(all_bbs)) 131 | bb_similarity_dict = check_bbs_in_enamine_parallel(all_bbs, fp_searcher, num_cores) 132 | 133 | for _, value in successful_synthesis.items(): 134 | for v in value: 135 | v['bbs_similarity'] = [bb_similarity_dict[bb] for bb in v['bbs']] 136 | v['bbs_not_in_enamine'] = [bb for bb, sim in zip(v['bbs'], v['bbs_similarity']) if sim < 1] 137 | v['bbs_in_enamine'] = [bb for bb, sim in zip(v['bbs'], v['bbs_similarity']) if sim == 1] 138 | # save the successful_synthesis to a pickle file 139 | with open(os.path.join(save_folder, f"{file_name}_successful_synthesis.pkl"), "wb") as f: 140 | pickle.dump(successful_synthesis, f) 141 | 142 | all_bbs_not_in_enamine = [] 143 | all_bbs_in_enamine = [] 144 | for key, value in successful_synthesis.items(): 145 | for v in value: 146 | bbs_not_in_enamine = v['bbs_not_in_enamine'] 147 | all_bbs_not_in_enamine.extend(bbs_not_in_enamine) 148 | bbs_in_enamine = v['bbs_in_enamine'] 149 | all_bbs_in_enamine.extend(bbs_in_enamine) 150 | 151 | smiles_list = list(set(all_bbs_not_in_enamine)) 152 | file_count = 1 153 | for i in range(0, len(smiles_list), 10000): 154 | with open(os.path.join(save_folder, f"{file_name}_successful_bbs_not_in_enamine_{file_count}.txt"), "w") as f: 155 | f.write("\n".join(smiles_list[i:i+10000])) 156 | file_count += 1 157 | 158 | smiles_list = list(set(all_bbs_in_enamine)) 159 | file_count = 1 160 | for i in range(0, len(smiles_list), 10000): 161 | with open(os.path.join(save_folder, f"{file_name}_successful_bbs_in_enamine_{file_count}.txt"), "w") as f: 162 | f.write("\n".join(smiles_list[i:i+10000])) 163 | file_count += 1 164 | 165 | def calc_benchmark_rxn(llama_output, reaction_idx_map): 166 | # check the correctness in general with rdkit functions 167 | total_trials = len(llama_output) * len([k for k in llama_output.values() if type(k) != str][0]) 168 | successful_trials = 0 169 | template_obedience = defaultdict(int) 170 | reactant_matched = defaultdict(int) 171 | product_obedience = defaultdict(int) 172 | bb_obedience = defaultdict(list) 173 | total_reactions = 0 174 | successful_reactions = 0 175 | product_not_in_reactant_stack = 0 176 | invalid_smiles = [] 177 | total_molecules = 0 178 | failed_structured_output = [] 179 | template_no_rxn_tag = 0 180 | failed_cases = [] # template, reactants, product 181 | total_success_formats = defaultdict(int) 182 | total_success_reactions = defaultdict(int) 183 | 184 | for product_smiles, example_data in tqdm(llama_output.items()): 185 | if type(example_data) == str: 186 | print(example_data) 187 | continue 188 | format_success = True 189 | reaction_success = True 190 | for output in example_data: 191 | if 'reactions' not in output or 'building_blocks' not in output: 192 | failed_structured_output.append(output) 193 | format_success = False 194 | continue 195 | reactions = output['reactions'] 196 | building_blocks = output['building_blocks'] 197 | reactant_stack = [] 198 | reactant_stack.append(product_smiles) 199 | successful_trials += 1 200 | 201 | for reaction in reactions: 202 | # Extract the reaction template between tags 203 | if 
'reaction_template' not in reaction or 'reactants' not in reaction or 'product' not in reaction:
204 |                     successful_trials -= 1
205 |                     format_success = False
206 |                     failed_structured_output.append(reaction)
207 |                     break
208 |                 try:
209 |                     template = reaction['reaction_template'].split('<rxn>')[1].split('</rxn>')[0]  # NOTE: placeholder tags; the original literals were lost when angle-bracket markup was stripped from this listing (cf. the template_no_rxn_tag counter below)
210 |                 except Exception:
211 |                     template_no_rxn_tag += 1
212 |                     successful_trials -= 1
213 |                     format_success = False
214 |                     break
215 |                 total_reactions += 1
216 |                 if template not in reaction_idx_map:
217 |                     template_obedience['not_in_template'] += 1
218 |                     print(f"Template {template} not found in reaction_templates")
219 |                     failed_cases.append((reaction, "template"))
220 |                     format_success = False
221 |                     continue
222 |                 else:
223 |                     template_obedience[template] += 1
224 |                 # check if reactants can form the product through the reaction template
225 |                 reactants = reaction['reactants']
226 |                 reactants = [reactant.split("<smiles>")[-1].split("</smiles>")[0] if "<smiles>" in reactant else reactant for reactant in reactants]  # placeholder tags (see note above)
227 |                 reactant_stack.extend(reactants)
228 |                 product = reaction['product']
229 |                 if product not in reactant_stack:
230 |                     product_not_in_reactant_stack += 1
231 |                 else:
232 |                     reactant_stack.remove(product)
233 |                 reactant_mols = []
234 |                 total_molecules += len(reactants)
235 |                 for reactant in reactants:
236 |                     if not Molecule(reactant, source="smiles").is_valid:
237 |                         invalid_smiles.append(reactant)
238 |                     elif reactant == '':
239 |                         total_molecules -= 1
240 |                         continue
241 |                     else:
242 |                         reactant_mols.append(Molecule(reactant, source="smiles"))
243 |                 product_mol = Molecule(product, source="smiles")
244 |                 total_molecules += 1
245 |                 if not product_mol.is_valid:
246 |                     invalid_smiles.append(product)
247 |                 product_from_rxn, matched = arrange_reactants_and_react_synllama(template, reactant_mols)
248 |                 if template in reaction_idx_map:
249 |                     reactant_matched[template] += int(matched)
250 |                 if product_from_rxn is None:
251 |                     failed_cases.append((reaction, "reactants"))
252 |                     print(f"Reactants {reactants} cannot react through template {template}")
253 |                     reaction_success = False
254 |                     continue
255 |                 product_from_rxn = [prod.csmiles for prod in product_from_rxn]
256 |                 successful_reactions += 1
257 |                 if not product_mol.is_valid or product_mol.csmiles not in product_from_rxn:
258 |                     failed_cases.append((reaction, "product"))
259 |                     print(f"Product {product} not in product_from_rxn {product_from_rxn}")
260 |                     reaction_success = False
261 |                 else:
262 |                     product_obedience[template] += 1
263 | 
264 |             bb_count = 0
265 |             for bb in building_blocks:
266 |                 bb_clean = bb.split("<smiles>")[-1].split("</smiles>")[0]  # placeholder tags (see note above)
267 |                 if bb_clean not in reactant_stack:
268 |                     print(f"Building block {bb_clean} not found in reactant stack")
269 |                     format_success = False
270 |                 else:
271 |                     bb_count += 1
272 |             if len(building_blocks) > 0:
273 |                 bb_obedience[product_smiles].append(bb_count / len(building_blocks))
274 |             else:
275 |                 bb_obedience[product_smiles].append(1)
276 | 
277 |         bb_obedience[product_smiles] = sum(bb_obedience[product_smiles]) / len(bb_obedience[product_smiles]) if len(bb_obedience[product_smiles]) > 0 else 1
278 |         total_success_formats[product_smiles] = format_success
279 |         total_success_reactions[product_smiles] = reaction_success and format_success
280 | 
281 |     stats_rxn = {
282 |         "total_trials": total_trials,
283 |         "failed_structured_output": len(failed_structured_output),
284 |         "template_no_rxn_tag": template_no_rxn_tag,
285 |         "valid_responses": round(successful_trials / total_trials * 100, 2),
286 |         "valid_smiles": round((1 - len(invalid_smiles) / total_molecules) * 100, 2),
287 | 
"recycled_bbs": round(sum(bb_obedience.values()) / len(bb_obedience) * 100, 2), 288 | "template_memorization": round((1 - template_obedience['not_in_template'] / total_reactions) * 100, 2), 289 | "matched_reactants": round(successful_reactions / total_reactions * 100, 2), 290 | "good_products": round(sum(product_obedience.values()) / successful_reactions * 100, 2), 291 | "total_success_formats": round(sum(total_success_formats.values()) / len(llama_output) * 100, 2), 292 | "total_success_reactions": round(sum(total_success_reactions.values()) / len(llama_output) * 100, 2), 293 | } 294 | return stats_rxn, failed_structured_output 295 | 296 | def main(): 297 | parser = argparse.ArgumentParser() 298 | parser.add_argument("--llama_folder", type=str, default = "../synllama-data/results/table_2_syn_planning_91rxns") 299 | parser.add_argument("--save_folder", type=str, default=None) 300 | parser.add_argument("--raw_output_only", action="store_true") 301 | parser.add_argument("--benchmark_only", action="store_true") 302 | parser.add_argument("--rxn_mapping_path", type=str, default="../synllama-data/inference/reconstruction/91rxns/rxn_embeddings/reaction_smarts_map.pkl") 303 | parser.add_argument("--fp_searcher_path", type=str, default="../synllama-data/inference/reconstruction/91rxns/processed/fpindex.pkl") 304 | args = parser.parse_args() 305 | 306 | reaction_smarts_dict = pickle.load(open(args.rxn_mapping_path, "rb")) 307 | reaction_idx_map = {v[0]: k for k, v in reaction_smarts_dict.items()} 308 | if args.save_folder is None: 309 | args.save_folder = os.path.join(args.llama_folder, "synllama_reconstruct") 310 | os.makedirs(args.save_folder, exist_ok=True) 311 | fp_searcher = FingerprintIndex.load(args.fp_searcher_path) 312 | 313 | file_list = glob.glob(os.path.join(args.llama_folder, "*.pkl")) 314 | all_data = [] 315 | for file in file_list: 316 | file_name = file.split("/")[-1][:-4] 317 | with open(file, "rb") as f: 318 | llama_output = pickle.load(f) 319 | if not args.raw_output_only: 320 | stats_rxn, failed_cases = calc_benchmark_rxn(llama_output, reaction_idx_map) 321 | combined_stats = {**stats_rxn} 322 | combined_stats['file_name'] = file_name # Add the file name as an index 323 | # with open(f"{args.llama_folder}/failed_cases_{file_name}.pkl", "wb") as f: 324 | # pickle.dump(failed_cases, f) 325 | all_data.append(combined_stats) 326 | if not args.benchmark_only: 327 | successful_synthesis = filter_raw_output(llama_output, reaction_idx_map) 328 | if not args.raw_output_only: 329 | combined_stats['total_success_molecules'] = len(successful_synthesis) 330 | convert_smiles_dict(successful_synthesis, args.save_folder, file_name, fp_searcher, mp.cpu_count() // 2) 331 | 332 | if not args.raw_output_only: 333 | all_keys = ['file_name', 'total_trials', 'valid_responses', 'template_memorization', 'recycled_bbs', 'valid_smiles', 'matched_reactants', 'good_products', 'total_success_formats', 'total_success_reactions'] 334 | if not args.benchmark_only: 335 | all_keys.append('total_success_molecules') 336 | df = pd.DataFrame(all_data, columns=all_keys) 337 | df.sort_values(by="file_name", ascending=True, inplace=True) 338 | df.to_csv(os.path.join(args.save_folder, "llm_benchmark_stats.csv"), index=False) 339 | 340 | if __name__ == "__main__": 341 | main() 342 | 343 | -------------------------------------------------------------------------------- /steps/step_30_1_molport_raw_reconstruct.py: -------------------------------------------------------------------------------- 1 | # This script is used to 
reconstruct the raw output from MolPort
2 | 
3 | import os, sys, argparse, glob
4 | import pickle
5 | import pandas as pd
6 | from collections import defaultdict
7 | 
8 | # Once the raw output (*_bbs_1.txt & *_successful_synthesis.pkl) is generated, please first go to Molport (https://www.molport.com/shop/swl-step-1)
9 | # to do a "list search" for available building blocks and then run this script to reconstruct the raw output.
10 | 
11 | # Upon the completion of the Molport search, please download the list search results under the "Selected Items" column.
12 | # This file contains the list of building blocks that are available in the market based on the amount requested.
13 | 
14 | # Once the file is downloaded, please rename it to "*_molport_ls.xls" (the extension this script looks for) and put it in the same folder as the raw output files.
15 | # Then, run this script to reconstruct the raw output.
16 | 
17 | def extract_best_csv(total_reconstruction, save_path):
18 |     df = pd.DataFrame(columns = ["target","smiles","score","synthesis","num_steps","scf_sim","pharm2d_sim","rdkit_sim"])
19 |     for k, v in total_reconstruction.items():
20 |         item = v[0]
21 |         df = pd.concat([df, pd.DataFrame.from_dict({
22 |             "target": k,
23 |             "smiles": k,
24 |             "score": 1.0,
25 |             "synthesis": item['synthesis'],
26 |             "num_steps": sum(["R" in r for r in item['synthesis'].split(";")]),
27 |             "scf_sim": 1.0,
28 |             "pharm2d_sim": 1.0,
29 |             "rdkit_sim": 1.0
30 |         }, orient="index").T])
31 |     df.to_csv(save_path, index=False)
32 |     return df
33 | 
34 | def find_synllama_reconstruction(success_raw_syn_path, molport_ls_path):
35 |     successful_synthesis = pickle.load(open(success_raw_syn_path, "rb"))
36 |     search_results = pd.read_excel(molport_ls_path)
37 |     found_bbs = search_results['Search Criteria'].tolist()
38 |     enamine_reconstruction = defaultdict(list)
39 |     non_enamine_reconstruction = defaultdict(list)
40 |     total_reconstruction = defaultdict(list)
41 |     total_building_blocks = 0
42 |     non_enamine_building_blocks = 0
43 |     for k, v in successful_synthesis.items():
44 |         for item in v:
45 |             total_building_blocks += len(item['bbs'])
46 |             non_enamine_building_blocks += len(item['bbs_not_in_enamine'])
47 |             if all(bb in found_bbs for bb in item['bbs_not_in_enamine']):
48 |                 total_reconstruction[k].append({
49 |                     "bbs": item['bbs'],
50 |                     "synthesis": item['reaction_strings']
51 |                 })
52 |                 if len(item['bbs_not_in_enamine']) > 0:
53 |                     non_enamine_reconstruction[k].append({
54 |                         "bbs": item['bbs'],
55 |                         "synthesis": item['reaction_strings']
56 |                     })
57 |                 else:
58 |                     enamine_reconstruction[k].append({
59 |                         "bbs": item['bbs'],
60 |                         "synthesis": item['reaction_strings']
61 |                     })
62 |     enamine_save_path = success_raw_syn_path.replace("_successful_synthesis.pkl", "_enamine_synllama_reconstruct.csv")
63 |     extract_best_csv(enamine_reconstruction, enamine_save_path)
64 |     non_enamine_save_path = success_raw_syn_path.replace("_successful_synthesis.pkl", "_non_enamine_synllama_reconstruct.csv")
65 |     extract_best_csv(non_enamine_reconstruction, non_enamine_save_path)
66 |     all_synllama_reconstruct_path = success_raw_syn_path.replace("_successful_synthesis.pkl", "_all_synllama_reconstruct.csv")
67 |     extract_best_csv(total_reconstruction, all_synllama_reconstruct_path)
68 |     return len(set(total_reconstruction.keys())), len(set(enamine_reconstruction.keys())), len(set(non_enamine_reconstruction.keys())), total_building_blocks, non_enamine_building_blocks
69 | 
70 | def main():
71 |     parser = argparse.ArgumentParser()
72 |     parser.add_argument("--llama_folder", type=str, 
default="../synllama-data/results/table_2_syn_planning_91rxns") 73 | args = parser.parse_args() 74 | 75 | raw_output_folder = os.path.join(args.llama_folder, "synllama_reconstruct") 76 | raw_output_files = glob.glob(os.path.join(raw_output_folder, "*_successful_synthesis.pkl")) 77 | for raw_output_file in raw_output_files: 78 | molport_ls_path = raw_output_file.replace("_successful_synthesis.pkl", "_molport_ls.xls") 79 | synllama_reconstruct, enamine_synllama_reconstruct, non_enamine_synllama_reconstruct, total_building_blocks, non_enamine_building_blocks = find_synllama_reconstruction(raw_output_file, molport_ls_path) 80 | print(f"{raw_output_file} has {synllama_reconstruct} total successful syntheses") 81 | print(f"{raw_output_file} has {enamine_synllama_reconstruct} enamine successful syntheses") 82 | print(f"{raw_output_file} has {non_enamine_synllama_reconstruct} non-enamine successful syntheses") 83 | print(f"{raw_output_file} has {total_building_blocks} total building blocks") 84 | print(f"{raw_output_file} has {non_enamine_building_blocks} non-enamine building blocks") 85 | print(f"{raw_output_file} has {(1 - non_enamine_building_blocks / total_building_blocks)*100:.2f}% enamine building blocks") 86 | if __name__ == "__main__": 87 | main() 88 | -------------------------------------------------------------------------------- /steps/step_31_enamine_reconstruct.py: -------------------------------------------------------------------------------- 1 | import pickle, argparse, copy, glob, os 2 | import multiprocessing as mp 3 | import numpy as np 4 | import pandas as pd 5 | from tqdm import tqdm 6 | from rdkit import Chem 7 | 8 | from synllama.chem.smiles_tfidf import SmilesSimilaritySearch 9 | from synllama.chem.fpindex import FingerprintIndex 10 | from synllama.chem.mol import FingerprintOption, Molecule 11 | from synllama.chem.reaction import Reaction 12 | from synllama.chem.smiles_tfidf import find_closest_match, string_similarity 13 | from synllama.chem.stack import Stack 14 | 15 | def load_results(file_path): 16 | """Load the results from a pickle file.""" 17 | with open(file_path, "rb") as f: 18 | return pickle.load(f) 19 | 20 | def analyze_results(result_file_path, total_num_mols, top_n_rows = 1): 21 | """Perform analysis on the reconstruction results.""" 22 | file_name = result_file_path[:-4].split("/")[-1] 23 | results = load_results(result_file_path) 24 | max_similarity = [] 25 | total_failure_rate = [] 26 | total_reconstruction_rate = [] 27 | failure_rate_within_group = [] 28 | reconstruction_rate_within_group = [] 29 | scf_sim_all = [] 30 | pharm2d_sim_all = [] 31 | rdkit_sim_all = [] 32 | scf_sim_no_reconstruction = [] 33 | pharm2d_sim_no_reconstruction = [] 34 | rdkit_sim_no_reconstruction = [] 35 | average_number_of_steps = [] 36 | morgan_no_reconstruction = [] 37 | 38 | max_rows_df = pd.DataFrame() 39 | 40 | for df in results: 41 | # Calculate average maximum similarity 42 | max_similarity.append(df['score'].max()) 43 | max_row = df.loc[[df['score'].idxmax()]] if top_n_rows == 1 else df.drop_duplicates(subset=['smiles']).nlargest(top_n_rows, 'score') 44 | # remove response_num column 45 | if 'response_num' in max_row.columns: max_row = max_row.drop(columns=['response_num']) 46 | max_rows_df = pd.concat([max_rows_df, max_row]) 47 | failure_rate_within_group.append(df['score'].isna().mean()) 48 | total_failure_rate.append(all(df['score'].isna())) 49 | # Calculate reconstruction rate (where similarity == 1) 50 | reconstruction_rate_within_group.append((df['score'] == 
1).mean()) 51 | total_reconstruction_rate.append(any(df['score'] == 1)) 52 | scf_sim_all.append(max_row['scf_sim'].values[0] if not max_row['scf_sim'].isna().any() else np.nan) 53 | pharm2d_sim_all.append(max_row['pharm2d_sim'].values[0] if not max_row['pharm2d_sim'].isna().any() else np.nan) 54 | rdkit_sim_all.append(max_row['rdkit_sim'].values[0] if not max_row['rdkit_sim'].isna().any() else np.nan) 55 | synthesis_steps = max_row['num_steps'].values[0] 56 | average_number_of_steps.append(synthesis_steps) 57 | if df['score'].max() < 1: 58 | morgan_no_reconstruction.append(df['score'].max()) 59 | scf_sim_no_reconstruction.append(max_row['scf_sim'].values[0] if not max_row['scf_sim'].isna().any() else np.nan) 60 | pharm2d_sim_no_reconstruction.append(max_row['pharm2d_sim'].values[0] if not max_row['pharm2d_sim'].isna().any() else np.nan) 61 | rdkit_sim_no_reconstruction.append(max_row['rdkit_sim'].values[0] if not max_row['rdkit_sim'].isna().any() else np.nan) 62 | result_file_folder = os.path.dirname(result_file_path) 63 | max_rows_df.to_csv(os.path.join(result_file_folder, f"{file_name}_enamine_reconstruct.csv"), index=False) 64 | 65 | return { 66 | "file_name": file_name, 67 | "max_similarity": np.mean(max_similarity), 68 | "total_failure_rate %": round((1 - (len(results) - np.sum(total_failure_rate)) / total_num_mols) * 100, 2), 69 | "total_reconstruction_rate %": round((np.sum(total_reconstruction_rate) / total_num_mols) * 100, 2), 70 | "scf_sim_including_reconstruction": np.nanmean(scf_sim_all), 71 | "pharm2d_sim_including_reconstruction": np.nanmean(pharm2d_sim_all), 72 | "avg_rxn_steps": np.nanmean(average_number_of_steps), 73 | "morgan_no_reconstruction": np.nanmean(morgan_no_reconstruction), 74 | "scf_sim_no_reconstruction": np.nanmean(scf_sim_no_reconstruction), 75 | "pharm2d_sim_no_reconstruction": np.nanmean(pharm2d_sim_no_reconstruction), 76 | } 77 | 78 | def similarity_score(product_template, stack_prod_smiles): 79 | if not Chem.MolFromSmiles(product_template): 80 | return string_similarity(product_template, stack_prod_smiles) 81 | else: 82 | return Molecule(product_template).sim(Molecule(stack_prod_smiles), FingerprintOption.morgan_for_tanimoto_similarity()) 83 | 84 | def get_top_k_smiles(input_smiles, smiles_searcher, fp_searcher, k=10): 85 | """ 86 | get the top k smiles from the smiles_searcher and fp_searcher. 87 | 88 | Args: 89 | input_smiles (str): the smiles of the input molecule 90 | smiles_searcher (SmilesSimilaritySearch): the smiles searcher 91 | fp_searcher (FingerprintIndex): the fingerprint searcher 92 | k (int, optional): the number of top smiles to return. Defaults to 10. 
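    Returns:
        list[Molecule]: deduplicated candidate building blocks pooled from the
            string-similarity search and, when the input SMILES is valid, the
            fingerprint search as well.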
93 | """ 94 | # check if smiles is valid 95 | input_mol = Chem.MolFromSmiles(input_smiles) 96 | if input_mol is None: 97 | searched_smiles = smiles_searcher.query(input_smiles, k=k*2) 98 | results = [result.molecule.smiles for result in searched_smiles] 99 | result_mols = [Molecule(s, source="smiles") for s in results] 100 | else: 101 | searched_smiles = smiles_searcher.query(input_smiles, k=k) 102 | results = [result.molecule.smiles for result in searched_smiles] 103 | result_mols = [Molecule(s, source="smiles") for s in results] 104 | fingerprints = Molecule(input_smiles).get_fingerprint(FingerprintOption.morgan_for_building_blocks(), as_bitvec=False) 105 | fp_searched_results = fp_searcher.query_single(np.array([fingerprints]), k=k) 106 | results.extend([result.molecule.smiles for result in fp_searched_results]) 107 | result_mols.extend([Molecule(s, source="fp") for s in results]) 108 | return list(set(result_mols)) 109 | 110 | def match_two_reactants(reactant1_list, reactant2_list, rxn, continue_rxn = False): 111 | valid_combinations = [] 112 | for reactant1 in reactant1_list: 113 | for reactant2 in reactant2_list: 114 | reactant_combo1 = [reactant1, reactant2] 115 | reactant_combo2 = [reactant2, reactant1] 116 | if rxn(reactant_combo1) or rxn(reactant_combo2): 117 | if continue_rxn: 118 | valid_combinations.append(reactant2) 119 | else: 120 | valid_combinations.append(reactant_combo1) 121 | return valid_combinations 122 | 123 | def match_three_reactants(reactant1_list, reactant2_list, reactant3_list, rxn, continue_rxn = False): 124 | valid_combinations = [] 125 | for reactant1 in reactant1_list: 126 | for reactant2 in reactant2_list: 127 | for reactant3 in reactant3_list: 128 | reactant_combo1 = [reactant1, reactant2, reactant3] 129 | reactant_combo2 = [reactant1, reactant3, reactant2] 130 | reactant_combo3 = [reactant2, reactant1, reactant3] 131 | reactant_combo4 = [reactant2, reactant3, reactant1] 132 | reactant_combo5 = [reactant3, reactant1, reactant2] 133 | reactant_combo6 = [reactant3, reactant2, reactant1] 134 | if rxn(reactant_combo1) or rxn(reactant_combo2) or rxn(reactant_combo3) or rxn(reactant_combo4) or rxn(reactant_combo5) or rxn(reactant_combo6): 135 | if continue_rxn: 136 | valid_combinations.append([reactant2, reactant3]) 137 | else: 138 | valid_combinations.append(reactant_combo1) 139 | return valid_combinations 140 | 141 | def reconstruct_single_rxn(smiles_to_search, product_template, smiles_searcher, fp_searcher, template, rxn_idx, stacks = None, k=5, n_stacks=25, product_limit = 3): 142 | """ 143 | Reconstruct a single reaction from a list of building blocks and reactants. 144 | 145 | Args: 146 | smiles_list (list): a list of tuples of (smiles, is_building_block). 147 | product_template (str): the product template. 148 | smiles_searcher (SmilesSimilaritySearch): the smiles searcher. 149 | fp_searcher (FingerprintIndex): the fingerprint searcher. 150 | template (str): the reaction template. 151 | rxn_idx (int): the reaction index. 152 | stack (Stack): the stack to push the reactants. 153 | k (int, optional): the number of top smiles to return. Defaults to 10. 
154 | """ 155 | # check if reaction template is in the reaction_templates 156 | rxn = Reaction(template) 157 | new_stacks = [] 158 | if len(stacks) > 0 and len(stacks[0]) > 0: 159 | scores = [] 160 | for stack in stacks: 161 | prev_mol = list(stack.get_top()) 162 | # see how many reactants are needed 163 | if rxn.num_reactants == 1: 164 | assert len(smiles_to_search) == 0 165 | success = stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 166 | if success: 167 | new_stacks.append(stack) 168 | elif rxn.num_reactants == 2: 169 | assert len(smiles_to_search) == 1 170 | top_bbs_reactants = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k) 171 | valid_mols = match_two_reactants(prev_mol, top_bbs_reactants, rxn, continue_rxn = True) 172 | for mol in valid_mols: 173 | new_stack = copy.deepcopy(stack) 174 | new_stack.push_mol(mol, 0) 175 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 176 | if success: 177 | scores.append(similarity_score(product_template, new_stack[-1].smiles)) 178 | new_stacks.append(new_stack) 179 | elif rxn.num_reactants == 3: 180 | assert len(smiles_to_search) == 2 181 | top_bbs_reactants1 = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k) 182 | top_bbs_reactants2 = get_top_k_smiles(smiles_to_search[1], smiles_searcher, fp_searcher, k) 183 | valid_mols = match_three_reactants(prev_mol, top_bbs_reactants1, top_bbs_reactants2, rxn, continue_rxn = True) 184 | for mol1, mol2 in valid_mols: 185 | new_stack = copy.deepcopy(stack) 186 | new_stack.push_mol(mol1, 0) 187 | new_stack.push_mol(mol2, 0) 188 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 189 | if success: 190 | scores.append(similarity_score(product_template, new_stack[-1].smiles)) 191 | new_stacks.append(new_stack) 192 | else: 193 | scores = [] 194 | if rxn.num_reactants == 3: 195 | assert len(smiles_to_search) == 3 196 | top_bbs_reactants1 = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k // 2 + 1) 197 | top_bbs_reactants2 = get_top_k_smiles(smiles_to_search[1], smiles_searcher, fp_searcher, k // 2 + 1) 198 | top_bbs_reactants3 = get_top_k_smiles(smiles_to_search[2], smiles_searcher, fp_searcher, k // 2 + 1) 199 | valid_mols = match_three_reactants(top_bbs_reactants1, top_bbs_reactants2, top_bbs_reactants3, rxn, continue_rxn = False) 200 | for mol1, mol2, mol3 in valid_mols: 201 | new_stack = Stack() 202 | new_stack.push_mol(mol1, 0) 203 | new_stack.push_mol(mol2, 0) 204 | new_stack.push_mol(mol3, 0) 205 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 206 | if success: 207 | scores.append(similarity_score(product_template, new_stack[-1].smiles)) 208 | new_stacks.append(new_stack) 209 | 210 | elif rxn.num_reactants == 2: 211 | assert len(smiles_to_search) == 2 212 | top_bbs_reactants1 = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k) 213 | top_bbs_reactants2 = get_top_k_smiles(smiles_to_search[1], smiles_searcher, fp_searcher, k) 214 | valid_mols = match_two_reactants(top_bbs_reactants1, top_bbs_reactants2, rxn, continue_rxn=False) 215 | for mol1, mol2 in valid_mols: 216 | new_stack = Stack() 217 | new_stack.push_mol(mol1, 0) 218 | new_stack.push_mol(mol2, 0) 219 | success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit) 220 | if success: 221 | 
scores.append(similarity_score(product_template, new_stack[-1].smiles))
222 |                     new_stacks.append(new_stack)
223 | 
224 |         elif rxn.num_reactants == 1:
225 |             assert len(smiles_to_search) == 1
226 |             top_bbs_reactants = get_top_k_smiles(smiles_to_search[0], smiles_searcher, fp_searcher, k)
227 |             for mol in top_bbs_reactants:
228 |                 new_stack = Stack()
229 |                 new_stack.push_mol(mol, 0)
230 |                 success = new_stack.push_rxn(rxn, rxn_idx, product_template=product_template, product_limit=product_limit)
231 |                 if success:
232 |                     scores.append(similarity_score(product_template, new_stack[-1].smiles))
233 |                     new_stacks.append(new_stack)
234 | 
235 |     new_stacks = [stack for stack in new_stacks if stack is not None and len(stack) > 0]
236 |     if len(new_stacks) == 0:
237 |         return None
238 |     if len(new_stacks) > n_stacks:
239 |         new_stacks = sorted(new_stacks, key=lambda x: scores[new_stacks.index(x)], reverse=True)[:n_stacks]
240 |     return new_stacks
241 | 
242 | def reconstruct_all_rxns(output, reaction_idx_map, embedding_path, k, n_stacks):
243 |     """
244 |     Reconstruct all reactions from a list of building blocks and reactants.
245 | 
246 |     Args:
247 |         output (dict): the output from the LLM.
248 |         reaction_idx_map (dict): the reaction idx map.
249 |         embedding_path (str): the path to the smiles embedding.
250 |         k (int): the number of top smiles to return for each query.
251 |         n_stacks (int): the number of stacks to keep.
252 |     """
253 |     if 'reactions' not in output or 'building_blocks' not in output: return None
254 |     building_blocks = [bb.split("<smiles>")[-1].split("</smiles>")[0] for bb in output['building_blocks']]  # NOTE: placeholder tags; the original literals were lost when angle-bracket markup was stripped from this listing
255 |     reactions = output['reactions']
256 |     stacks = [Stack()]
257 |     for i, reaction in enumerate(reactions[::-1]):
258 |         if 'reaction_template' not in reaction or 'reactants' not in reaction or 'product' not in reaction: continue
259 |         template = reaction['reaction_template'].split('<rxn>')[1].split('</rxn>')[0]  # placeholder tags (see note above)
260 |         if template not in reaction_idx_map:
261 |             template = find_closest_match(template, list(reaction_idx_map.keys()))
262 |         rxn_idx = reaction_idx_map[template]
263 |         reactants = reaction['reactants']
264 |         product_template = reaction['product']
265 |         smiles_to_search = [s for s in reactants if s in building_blocks]
266 |         smiles_searcher = SmilesSimilaritySearch.load(f"{embedding_path}/smiles_tfidf_{rxn_idx}.pkl")
267 |         fp_searcher = FingerprintIndex.load(f"{embedding_path}/fpindex_{rxn_idx}.pkl")
268 |         stacks = reconstruct_single_rxn(smiles_to_search, product_template, smiles_searcher, fp_searcher, template, rxn_idx, stacks, k, n_stacks)
269 |         if stacks is None:
270 |             print(f"Error reconstructing reaction {i}")
271 |             return None
272 |     return stacks
273 | 
274 | def reaction_scorer(stacks, target_mol, num_calc_extra_metrics: int = 10) -> pd.DataFrame:
275 |     """
276 |     Score the reactions by their similarity to the target molecule.
277 | 
278 |     Args:
279 |         stacks (list[Stack]): the stacks to score.
280 |         target_mol (Molecule): the target molecule.
281 |         num_calc_extra_metrics (int, optional): the number of extra metrics to calculate. Defaults to 10.
282 | 
283 |     Returns:
284 |         pd.DataFrame: a dataframe with the scores and extra metrics.
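    Note: only the top `num_calc_extra_metrics` rows (ranked by the Morgan-fingerprint
    score) receive the extra scf_sim / pharm2d_sim / rdkit_sim columns; the remaining
    rows are left as NaN in the resulting dataframe.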
285 | """ 286 | rows: list[dict[str, str | float]] = [] 287 | smiles_to_mol: dict[str, Molecule] = {} 288 | if not stacks: 289 | return pd.DataFrame() 290 | for stack in stacks: 291 | product_mol = stack[-1] 292 | rows.append( 293 | { 294 | "target": target_mol.smiles, 295 | "smiles": product_mol.smiles, 296 | "score": target_mol.sim(product_mol, FingerprintOption.morgan_for_tanimoto_similarity()), 297 | "synthesis": stack.get_action_string(), 298 | # "source": stack.get_source(), # for checking the source of the bb generation 299 | "num_steps": stack.count_reactions(), 300 | } 301 | ) 302 | smiles_to_mol[product_mol.smiles] = product_mol 303 | rows.sort(key=lambda r: r["score"], reverse=True) 304 | for row in rows[:num_calc_extra_metrics]: 305 | mol = smiles_to_mol[str(row["smiles"])] 306 | row["scf_sim"] = target_mol.scaffold.tanimoto_similarity( 307 | mol.scaffold, 308 | fp_option=FingerprintOption.morgan_for_tanimoto_similarity(), 309 | ) 310 | row["pharm2d_sim"] = target_mol.dice_similarity(mol, fp_option=FingerprintOption.gobbi_pharm2d()) 311 | row["rdkit_sim"] = target_mol.tanimoto_similarity(mol, fp_option=FingerprintOption.rdkit()) 312 | df = pd.DataFrame(rows) 313 | return df 314 | 315 | def result_generator(smiles, llama_answer, reaction_smarts_dict_path, embedding_path, k, n_stacks, num_calc_extra_metrics=10): 316 | """ 317 | Generate results by finding top k SMILES strings for building blocks 318 | and building products from reactants and reaction templates. 319 | 320 | Args: 321 | smiles (str): The product SMILES string. 322 | llama_answer (dict): The output containing reactants and reaction templates. 323 | reaction_smarts_dict_path (str): The path to the reaction smarts map. 324 | embedding_path (str): The path to the smiles embedding. 325 | k (int, optional): The number of top smiles to return. Defaults to 10. 
326 | """ 327 | reaction_smarts_dict = pickle.load(open(reaction_smarts_dict_path, "rb")) 328 | reaction_idx_map = {v[0]: k for k, v in reaction_smarts_dict.items()} 329 | print(f"Loaded reaction smarts dict with {len(reaction_idx_map)} reactions") 330 | product_mol = Molecule(smiles) 331 | df_product = pd.DataFrame() 332 | for i, output in enumerate(llama_answer): 333 | try: 334 | stacks = reconstruct_all_rxns(output, reaction_idx_map, embedding_path, k, n_stacks) 335 | if stacks is None: 336 | continue 337 | df = reaction_scorer(stacks, product_mol, num_calc_extra_metrics) 338 | df_product = pd.concat([df_product, df]) 339 | except Exception as e: 340 | print(e) 341 | continue 342 | print("Finished processing all reactions for " + smiles) 343 | return df_product.sort_values(by=["score", "rdkit_sim", "scf_sim", "pharm2d_sim"], ascending=[False, False, False, False]).reset_index(drop=True).iloc[:n_stacks] if len(df_product) > 0 else None 344 | 345 | def result_generator_wrapper(args): 346 | """Wrapper function to unpack arguments for result_generator.""" 347 | return result_generator(*args) 348 | 349 | def run_enamine_reconstruct(llama_output_path, embedding_path, reaction_smarts_dict_path, save_path, k, n_stacks): 350 | # Load data 351 | llama_outputs = pickle.load(open(llama_output_path, "rb")) 352 | tasks = [(smiles, llama_answer, reaction_smarts_dict_path, embedding_path, k, n_stacks) for smiles, llama_answer in list(llama_outputs.items())] 353 | 354 | # Use multiprocessing 355 | num_cores = mp.cpu_count() 356 | with mp.Pool(num_cores) as pool: 357 | # Create a tqdm progress bar 358 | with tqdm(total=len(tasks)) as pbar: 359 | results = [] 360 | for result in pool.imap_unordered(result_generator_wrapper, tasks): 361 | results.append(result) 362 | pbar.update() 363 | 364 | results = [r for r in results if r is not None] 365 | print(f"Found {len(results)} results") 366 | pickle.dump(results, open(save_path, "wb")) 367 | 368 | def main(): 369 | parser = argparse.ArgumentParser() 370 | parser.add_argument("--llama_folder", type=str, default="../synllama-data/results/table_2_syn_planning_91rxns") 371 | parser.add_argument("--embedding_path", type=str, default="../synllama-data/inference/reconstruction/91rxns/rxn_embeddings") 372 | parser.add_argument("--n_stacks", type=int, default=25) 373 | parser.add_argument("--k", type=int, default=5) 374 | parser.add_argument("--save_path", type=str, default=None) 375 | parser.add_argument("--total_num_mols", type=int, default=1000) 376 | parser.add_argument("--top_n_rows", type=int, default=1) 377 | 378 | args = parser.parse_args() 379 | args.reaction_smarts_dict_path = os.path.join(args.embedding_path, "reaction_smarts_map.pkl") 380 | mp.set_start_method('spawn') 381 | 382 | llama_output_paths = glob.glob(os.path.join(args.llama_folder, "*.pkl")) 383 | if args.save_path is None: 384 | args.save_path = os.path.join(args.llama_folder, "enamine_reconstruct") 385 | os.makedirs(args.save_path, exist_ok=True) 386 | 387 | for llama_output_path in llama_output_paths: 388 | print(f"Processing {llama_output_path}") 389 | name = llama_output_path[:-4].split("/")[-1] 390 | save_path = os.path.join(args.save_path, f"{name}.pkl") 391 | if os.path.exists(save_path): 392 | print(f"Skipping {name} because it already exists") 393 | continue 394 | run_enamine_reconstruct(llama_output_path, args.embedding_path, args.reaction_smarts_dict_path, save_path, args.k, args.n_stacks) 395 | 396 | results_folder = os.path.join(args.llama_folder, "enamine_reconstruct") 397 | 
397 |     results_file_paths = glob.glob(os.path.join(results_folder, "*.pkl"))
398 |     final_df = pd.DataFrame()
399 |     for results_file_path in results_file_paths:
400 |         result = analyze_results(results_file_path, args.total_num_mols, args.top_n_rows)
401 |         df = pd.DataFrame.from_dict(result, orient="index").T
402 |         final_df = pd.concat([final_df, df])
403 | 
404 |     final_df.sort_values(by="file_name", ascending=True, inplace=True)
405 |     final_df.to_csv(os.path.join(results_folder, "enamine_reconstruct_analysis.csv"), index=False)
406 | 
407 | if __name__ == "__main__":
408 |     main()
409 | 
--------------------------------------------------------------------------------
/steps/step_32_combined_stats.py:
--------------------------------------------------------------------------------
1 | # This script combines the Enamine-only and SynLlama reconstruction statistics into one summary table
2 | 
3 | import os, sys, argparse, glob
4 | import pickle
5 | import pandas as pd
6 | import numpy as np
7 | from collections import defaultdict
8 | 
9 | def combine_stats(enamine_reconstruct_csv, all_synllama_reconstruct_csv, non_enamine_synllama_reconstruct_csv, enamine_synllama_reconstruct_csv, total_num_mols, llama_folder = None):
10 |     enamine_reconstruct_df = pd.read_csv(enamine_reconstruct_csv)
11 |     file_name = enamine_reconstruct_csv[:-4].split("/")[-1].split("_enamine_reconstruct")[0]
12 |     # load the three SynLlama reconstruction summaries
13 |     synllama_enamine_reconstruct_df = pd.read_csv(enamine_synllama_reconstruct_csv)
14 |     synllama_all_reconstruct_df = pd.read_csv(all_synllama_reconstruct_csv)
15 |     synllama_non_enamine_reconstruct_df = pd.read_csv(non_enamine_synllama_reconstruct_csv)
16 |     non_enamine_reconstruct_mol = np.sum(synllama_non_enamine_reconstruct_df.groupby('target')['score'].max() == 1)
17 | 
18 |     enamine_all_df = pd.concat([enamine_reconstruct_df, synllama_enamine_reconstruct_df], ignore_index=True)
19 |     enamine_reconstruct_mol = np.sum(enamine_all_df.groupby('target')['score'].max() == 1)
20 | 
21 |     enamine_reconstruct_filtered = enamine_reconstruct_df[~enamine_reconstruct_df['target'].isin(synllama_all_reconstruct_df['target'])]  # remove the rows in enamine reconstruct that have the same target as in the raw output
22 |     combined_df = pd.concat([synllama_all_reconstruct_df, enamine_reconstruct_filtered], ignore_index=True)
23 |     no_recon_combined_df = combined_df[combined_df['score'] < 1]
24 |     combined_stats = {
25 |         "file_name": file_name,
26 |         "total_failure_rate %": round((1 - (len(combined_df) - np.sum(combined_df['score'].isna())) / total_num_mols) * 100, 2),
27 |         "total_enamine_reconstruct_rate %": round((enamine_reconstruct_mol / total_num_mols) * 100, 2),
28 |         "total_non_enamine_reconstruct_rate %": round((non_enamine_reconstruct_mol / total_num_mols) * 100, 2),
29 |         "total_all_reconstruction_rate %": round((np.sum(combined_df['score'] == 1) / total_num_mols) * 100, 2),
30 |         "morgan_sim": combined_df['score'].mean(),
31 |         "scf_sim": combined_df['scf_sim'].mean(),
32 |         "pharm2d_sim": combined_df['pharm2d_sim'].mean(),
33 |         "avg_rxn_steps": combined_df['num_steps'].mean(),
34 |         "morgan_sim_no_recon": no_recon_combined_df['score'].mean(),
35 |         "scf_sim_no_recon": no_recon_combined_df['scf_sim'].mean(),
36 |         "pharm2d_sim_no_recon": no_recon_combined_df['pharm2d_sim'].mean(),
37 |         "avg_rxn_steps_no_recon": no_recon_combined_df['num_steps'].mean(),
38 |     }
39 |     combined_df.to_csv(os.path.join(llama_folder, f"{file_name}_final_reconstruct_stats.csv"), index=False)
40 |     return combined_stats
41 | 
42 | if __name__ == 
"__main__": 43 | parser = argparse.ArgumentParser() 44 | parser.add_argument("--llama_folder", type=str, default="../synllama-data/results/table_2_syn_planning_91rxns") 45 | parser.add_argument("--total_num_mols", type=int, default=1000) 46 | 47 | args = parser.parse_args() 48 | 49 | enamine_reconstruct_paths = glob.glob(os.path.join(args.llama_folder, "enamine_reconstruct", "*_enamine_reconstruct.csv")) 50 | final_df = pd.DataFrame() 51 | for enamine_reconstruct_path in enamine_reconstruct_paths: 52 | # if you follow the default naming convention, you can use this line 53 | file_name = enamine_reconstruct_path[:-4].split("/")[-1].split("_enamine_reconstruct")[0] 54 | enamine_synllama_reconstruct_path = os.path.join(args.llama_folder, "synllama_reconstruct", f"{file_name}_enamine_synllama_reconstruct.csv") 55 | all_synllama_reconstruct_path = os.path.join(args.llama_folder, "synllama_reconstruct", f"{file_name}_all_synllama_reconstruct.csv") 56 | non_enamine_synllama_reconstruct_path = os.path.join(args.llama_folder, "synllama_reconstruct", f"{file_name}_non_enamine_synllama_reconstruct.csv") 57 | result = combine_stats(enamine_reconstruct_path, all_synllama_reconstruct_path, non_enamine_synllama_reconstruct_path, enamine_synllama_reconstruct_path, total_num_mols=args.total_num_mols, llama_folder = args.llama_folder) 58 | df = pd.DataFrame.from_dict(result, orient="index").T 59 | final_df = pd.concat([final_df, df]) 60 | final_df.sort_values(by="file_name", ascending=True, inplace=True) 61 | final_df.to_csv(os.path.join(args.llama_folder, "combined_final_stats.csv"), index=False) 62 | -------------------------------------------------------------------------------- /synllama/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/THGLab/SynLlama/5592cfc9d2338c6ebd7add7971c26b69e9aa1111/synllama/__init__.py -------------------------------------------------------------------------------- /synllama/chem/__init__.py: -------------------------------------------------------------------------------- 1 | import rdkit 2 | 3 | from . import matrix, mol, reaction 4 | 5 | rdkit.RDLogger.DisableLog("rdApp.*") 6 | -------------------------------------------------------------------------------- /synllama/chem/base.py: -------------------------------------------------------------------------------- 1 | import abc 2 | from typing import Literal, overload 3 | 4 | import PIL.Image 5 | 6 | 7 | class Drawable(abc.ABC): 8 | @overload 9 | def draw(self, size: int, svg: Literal[False]) -> PIL.Image.Image: ... 10 | 11 | @overload 12 | def draw(self, size: int, svg: Literal[True]) -> str: ... 13 | 14 | @abc.abstractmethod 15 | def draw(self, size: int, svg: bool) -> PIL.Image.Image | str: ... 
16 | -------------------------------------------------------------------------------- /synllama/chem/fpindex.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | import functools 3 | import os, pickle 4 | import pathlib 5 | import tempfile 6 | from collections.abc import Iterable, Sequence 7 | 8 | import joblib 9 | import numpy as np 10 | import torch 11 | from sklearn.neighbors import BallTree 12 | from tqdm.auto import tqdm 13 | 14 | from synllama.chem.mol import FingerprintOption, Molecule 15 | 16 | 17 | @dataclasses.dataclass 18 | class _QueryResult: 19 | index: int 20 | molecule: Molecule 21 | fingerprint: np.ndarray 22 | distance: float 23 | 24 | 25 | def _fill_fingerprint( 26 | fp: np.memmap, 27 | offset: int, 28 | molecules: Iterable[Molecule], 29 | fp_option: FingerprintOption, 30 | ): 31 | os.sched_setaffinity(0, range(os.cpu_count() or 1)) 32 | for i, mol in enumerate(molecules): 33 | fp[offset + i] = mol.get_fingerprint(fp_option).astype(np.uint8) 34 | 35 | 36 | def compute_fingerprints( 37 | molecules: Sequence[Molecule], 38 | fp_option: FingerprintOption, 39 | batch_size: int = 1024, 40 | ) -> np.ndarray: 41 | with tempfile.TemporaryDirectory() as tempdir_s: 42 | temp_fname = pathlib.Path(tempdir_s) / "fingerprint" 43 | fp = np.memmap( 44 | str(temp_fname), 45 | dtype=np.uint8, 46 | mode="w+", 47 | shape=(len(molecules), fp_option.dim), 48 | ) 49 | joblib.Parallel(n_jobs=joblib.cpu_count() // 2)( 50 | joblib.delayed(_fill_fingerprint)( 51 | fp=fp, 52 | offset=start, 53 | molecules=molecules[start : start + batch_size], 54 | fp_option=fp_option, 55 | ) 56 | for start in tqdm(range(0, len(molecules), batch_size), desc="Fingerprint") 57 | ) 58 | return np.array(fp) 59 | 60 | 61 | class FingerprintIndex: 62 | def __init__(self, molecules: Iterable[Molecule], fp_option: FingerprintOption) -> None: 63 | super().__init__() 64 | self._molecules = tuple(molecules) 65 | self._fp_option = fp_option 66 | self._fp = self._init_fingerprint() 67 | self._tree = self._init_tree() 68 | 69 | @property 70 | def molecules(self) -> tuple[Molecule, ...]: 71 | return self._molecules 72 | 73 | @property 74 | def fp_option(self) -> FingerprintOption: 75 | return self._fp_option 76 | 77 | def _init_fingerprint(self, batch_size: int = 1024) -> np.ndarray: 78 | return compute_fingerprints( 79 | molecules=self._molecules, 80 | fp_option=self._fp_option, 81 | batch_size=batch_size, 82 | ) 83 | 84 | def _init_tree(self) -> BallTree: 85 | tree = BallTree(self._fp, metric="manhattan") 86 | return tree 87 | 88 | def __getitem__(self, index: int) -> tuple[Molecule, np.ndarray]: 89 | return self._molecules[index], self._fp[index] 90 | 91 | def query_single(self, q: np.ndarray, k: int) -> list[_QueryResult]: 92 | dist, idx = self._tree.query(q.reshape([1, self._fp_option.dim]), k=k) 93 | results = [] 94 | for distance, idx in zip(dist[0], idx[0]): 95 | results.append(_QueryResult( 96 | index=idx, 97 | molecule=self._molecules[idx], 98 | fingerprint=None, 99 | distance=distance 100 | )) 101 | return sorted(results, key=lambda x: x.distance) 102 | 103 | def query(self, q: np.ndarray, k: int) -> list[list[_QueryResult]]: 104 | """ 105 | Args: 106 | q: shape (bsz, ..., fp_dim) 107 | """ 108 | bsz = q.shape[0] 109 | dist, idx = self._tree.query(q.reshape([-1, self._fp_option.dim]), k=k) 110 | dist = dist.reshape([bsz, -1]) 111 | idx = idx.reshape([bsz, -1]) 112 | results: list[list[_QueryResult]] = [] 113 | for i in range(dist.shape[0]): 114 | res: 
list[_QueryResult] = [] 115 | for j in range(dist.shape[1]): 116 | index = int(idx[i, j]) 117 | res.append( 118 | _QueryResult( 119 | index=index, 120 | molecule=self._molecules[index], 121 | fingerprint=self._fp[index], 122 | distance=dist[i, j], 123 | ) 124 | ) 125 | results.append(res) 126 | return results 127 | 128 | @functools.cache 129 | def fp_cuda(self, device: torch.device) -> torch.Tensor: 130 | return torch.tensor(self._fp, dtype=torch.float, device=device) 131 | 132 | @torch.inference_mode() 133 | def query_cuda(self, q: torch.Tensor, k: int) -> list[list[_QueryResult]]: 134 | bsz = q.size(0) 135 | q = q.reshape([-1, self._fp_option.dim]) 136 | pwdist = torch.cdist(self.fp_cuda(q.device), q, p=1) # (n_mols, n_queries) 137 | dist_t, idx_t = torch.topk(pwdist, k=k, dim=0, largest=False) # (k, n_queries) 138 | dist = dist_t.t().reshape([bsz, -1]).cpu().numpy() 139 | idx = idx_t.t().reshape([bsz, -1]).cpu().numpy() 140 | 141 | results: list[list[_QueryResult]] = [] 142 | for i in range(dist.shape[0]): 143 | res: list[_QueryResult] = [] 144 | for j in range(dist.shape[1]): 145 | index = int(idx[i, j]) 146 | res.append( 147 | _QueryResult( 148 | index=index, 149 | molecule=self._molecules[index], 150 | fingerprint=self._fp[index], 151 | distance=dist[i, j], 152 | ) 153 | ) 154 | results.append(res) 155 | return results 156 | 157 | def save(self, filename): 158 | with open(filename, 'wb') as f: 159 | pickle.dump(self, f) 160 | 161 | @classmethod 162 | def load(cls, filename): 163 | with open(filename, 'rb') as f: 164 | return pickle.load(f) 165 | 166 | if __name__ == "__main__": 167 | # Later, you can load the searcher without refitting 168 | loaded_searcher = FingerprintIndex.load("data/processed/fpindex.pkl") 169 | # Example search 170 | query_smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin 171 | fingerprints = compute_fingerprints(tuple([Molecule(query_smiles)]), fp_option=FingerprintOption.morgan_for_building_blocks()) 172 | results = loaded_searcher.query_single(fingerprints.reshape(1, -1), k=5) 173 | print(f"\nTop 5 similar SMILES to {query_smiles}:") 174 | for result in results: 175 | print(result.molecule.smiles) 176 | -------------------------------------------------------------------------------- /synllama/chem/matrix.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | import pickle 4 | import tempfile 5 | from collections.abc import Iterable 6 | from functools import cached_property 7 | 8 | import joblib 9 | import numpy as np 10 | from tqdm.auto import tqdm 11 | 12 | from synllama.chem.mol import Molecule 13 | from synllama.chem.reaction import Reaction, ReactionContainer 14 | 15 | 16 | def _fill_matrix(matrix: np.memmap, offset: int, reactants: Iterable[Molecule], reactions: Iterable[Reaction]): 17 | for i, reactant in enumerate(reactants): 18 | for j, reaction in enumerate(reactions): 19 | flag = 0 20 | for t in reaction.match_reactant_templates(reactant): 21 | flag |= 1 << t 22 | matrix[offset + i, j] = flag 23 | 24 | 25 | class ReactantReactionMatrix: 26 | def __init__( 27 | self, 28 | reactants: Iterable[Molecule], 29 | reactions: Iterable[Reaction], 30 | matrix: np.ndarray | os.PathLike | None = None, 31 | ) -> None: 32 | super().__init__() 33 | self._reactants = tuple(reactants) 34 | self._reactions = tuple(reactions) 35 | self._matrix = self._init_matrix(matrix) 36 | 37 | def _init_matrix(self, matrix: np.ndarray | os.PathLike | None, batch_size: int = 1024) -> np.ndarray: 38 | if isinstance(matrix, 
np.ndarray): 39 | return matrix 40 | elif isinstance(matrix, (os.PathLike, str)): 41 | return np.load(matrix) 42 | 43 | with tempfile.TemporaryDirectory() as tempdir_s: 44 | temp_fname = pathlib.Path(tempdir_s) / "matrix" 45 | matrix = np.memmap( 46 | str(temp_fname), 47 | dtype=np.uint8, 48 | mode="w+", 49 | shape=(len(self._reactants), len(self._reactions)), 50 | ) 51 | joblib.Parallel(n_jobs=joblib.cpu_count() // 2)( 52 | joblib.delayed(_fill_matrix)( 53 | matrix=matrix, 54 | offset=start, 55 | reactants=self._reactants[start : start + batch_size], 56 | reactions=self._reactions, 57 | ) 58 | for start in tqdm(range(0, len(self._reactants), batch_size), desc="Create matrix") 59 | ) 60 | return np.array(matrix) 61 | 62 | @property 63 | def reactants(self) -> tuple[Molecule, ...]: 64 | return self._reactants 65 | 66 | @cached_property 67 | def reactions(self) -> ReactionContainer: 68 | return ReactionContainer(self._reactions) 69 | 70 | @cached_property 71 | def seed_reaction_indices(self) -> list[int]: 72 | full_flag = np.array([0b01 if rxn.num_reactants == 1 else 0b11 for rxn in self._reactions], dtype=np.uint8) 73 | return np.nonzero(full_flag == np.bitwise_or.reduce(self._matrix, axis=0))[0].tolist() 74 | 75 | @cached_property 76 | def reactant_count(self) -> np.ndarray: 77 | return (self._matrix != 0).astype(np.int32).sum(0) 78 | 79 | @property 80 | def matrix(self) -> np.ndarray: 81 | return self._matrix 82 | 83 | def save(self, filename): 84 | with open(filename, 'wb') as f: 85 | pickle.dump(self, f) 86 | 87 | @classmethod 88 | def load(cls, filename): 89 | with open(filename, 'rb') as f: 90 | return pickle.load(f) 91 | -------------------------------------------------------------------------------- /synllama/chem/mol.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | import hashlib 3 | import os 4 | import pathlib 5 | from collections.abc import Iterable, Sequence 6 | from functools import cache, cached_property, partial 7 | from typing import Literal, overload 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from rdkit import Chem 12 | from rdkit.Chem import AllChem, DataStructs, Draw 13 | from rdkit.Chem.Pharm2D import Generate as Generate2D 14 | from rdkit.Chem.Pharm2D import Gobbi_Pharm2D 15 | from rdkit.Chem.Scaffolds import MurckoScaffold 16 | from tqdm.auto import tqdm 17 | 18 | from .base import Drawable 19 | 20 | 21 | @dataclasses.dataclass(frozen=True, eq=True, unsafe_hash=True) 22 | class FingerprintOption: 23 | type: str = "morgan" 24 | # Morgan 25 | morgan_radius: int = 2 26 | morgan_n_bits: int = 256 27 | # RDKit 28 | rdkit_fp_size: int = 2048 29 | 30 | def __post_init__(self): 31 | supported_types = ("morgan", "rdkit", "gobbi_pharm2d") 32 | if self.type not in supported_types: 33 | raise ValueError(f"Unsupported fingerprint type: {self.type}") 34 | 35 | @classmethod 36 | def morgan_for_tanimoto_similarity(cls): 37 | return FingerprintOption( 38 | type="morgan", 39 | morgan_radius=2, 40 | morgan_n_bits=4096, 41 | ) 42 | 43 | @classmethod 44 | def gobbi_pharm2d(cls): 45 | return FingerprintOption( 46 | type="gobbi_pharm2d", 47 | ) 48 | 49 | @classmethod 50 | def morgan_for_building_blocks(cls): 51 | return FingerprintOption( 52 | type="morgan", 53 | morgan_radius=2, 54 | morgan_n_bits=256, 55 | ) 56 | 57 | @classmethod 58 | def rdkit(cls): 59 | return FingerprintOption( 60 | type="rdkit", 61 | ) 62 | 63 | @property 64 | def dim(self) -> int: 65 | if self.type == "morgan": 66 | return 
self.morgan_n_bits 67 | elif self.type == "rdkit": 68 | return self.rdkit_fp_size 69 | elif self.type == "gobbi_pharm2d": 70 | return 39972 71 | raise ValueError(f"Unsupported fingerprint type: {self.type}") 72 | 73 | 74 | class Molecule(Drawable): 75 | def __init__(self, smiles: str, source: Literal["smiles", "fp", ''] = '') -> None: 76 | super().__init__() 77 | self._smiles = smiles.strip() 78 | self.meta_info = {} 79 | self._source = source 80 | 81 | @classmethod 82 | def from_rdmol(cls, rdmol: Chem.Mol) -> "Molecule": 83 | return cls(Chem.MolToSmiles(rdmol)) 84 | 85 | def __getstate__(self): 86 | return self._smiles 87 | 88 | def __setstate__(self, state): 89 | self._smiles = state 90 | self._source = '' 91 | 92 | @property 93 | def smiles(self) -> str: 94 | return self._smiles 95 | 96 | @property 97 | def source(self) -> Literal["smiles", "fp", '']: 98 | return self._source 99 | 100 | @cached_property 101 | def _rdmol(self): 102 | return Chem.MolFromSmiles(self._smiles) 103 | 104 | @cached_property 105 | def _rdmol_no_hs(self): 106 | return Chem.RemoveHs(self._rdmol) 107 | 108 | @cached_property 109 | def is_valid(self) -> bool: 110 | return self._rdmol is not None 111 | 112 | @cached_property 113 | def csmiles(self) -> str: 114 | return Chem.MolToSmiles(self._rdmol, canonical=True, isomericSmiles=False) 115 | 116 | @cached_property 117 | def num_atoms(self) -> int: 118 | return self._rdmol.GetNumAtoms() 119 | 120 | def draw(self, size: int = 100, svg: bool = False): 121 | if svg: 122 | return Draw._moltoSVG(self._rdmol, sz=(size, size), highlights=[], legend=[], kekulize=True) 123 | else: 124 | return Draw.MolToImage(self._rdmol, size=(size, size), kekulize=True) 125 | 126 | def __hash__(self) -> int: 127 | return hash(self._smiles) 128 | 129 | def __eq__(self, __value: object) -> bool: 130 | return isinstance(__value, Molecule) and self.csmiles == __value.csmiles 131 | 132 | @cached_property 133 | def major_molecule(self) -> "Molecule": 134 | if "." in self.smiles: 135 | segs = self.smiles.split(".") 136 | segs.sort(key=lambda a: -len(a)) 137 | return Molecule(segs[0]) 138 | return self 139 | 140 | @overload 141 | def get_fingerprint(self, option: FingerprintOption) -> np.ndarray: ... 142 | 143 | @overload 144 | def get_fingerprint(self, option: FingerprintOption, as_bitvec: Literal[True]) -> Sequence[Literal[0, 1]]: ... 145 | 146 | @overload 147 | def get_fingerprint(self, option: FingerprintOption, as_bitvec: Literal[False]) -> np.ndarray: ... 
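    # Illustrative usage (the SMILES below is an arbitrary example):
    #   mol = Molecule("CC(=O)Oc1ccccc1C(=O)O")
    #   fp = mol.get_fingerprint(FingerprintOption.morgan_for_building_blocks())   # 256-dim numpy array
    #   bv = mol.get_fingerprint(FingerprintOption.morgan_for_tanimoto_similarity(), as_bitvec=True)  # RDKit bit vector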
148 | 149 | def get_fingerprint(self, option: FingerprintOption, as_bitvec: bool = False): 150 | return self._get_fingerprint(option, as_bitvec) # work-around for mypy check 151 | 152 | @cache 153 | def _get_fingerprint(self, option: FingerprintOption, as_bitvec: bool): 154 | if option.type == "morgan": 155 | bit_vec = AllChem.GetMorganFingerprintAsBitVect(self._rdmol, option.morgan_radius, option.morgan_n_bits) 156 | elif option.type == "rdkit": 157 | bit_vec = Chem.RDKFingerprint(self._rdmol, fpSize=option.rdkit_fp_size) 158 | elif option.type == "gobbi_pharm2d": 159 | bit_vec = DataStructs.cDataStructs.ConvertToExplicit( 160 | Generate2D.Gen2DFingerprint(self._rdmol, Gobbi_Pharm2D.factory) 161 | ) 162 | else: 163 | raise ValueError(f"Unsupported fingerprint type: {option.type}") 164 | 165 | if as_bitvec: 166 | return bit_vec 167 | feat = np.zeros((1,), dtype=np.float32) 168 | DataStructs.ConvertToNumpyArray(bit_vec, feat) 169 | return feat 170 | 171 | @cached_property 172 | def scaffold(self) -> "Molecule": 173 | s = Molecule.from_rdmol(MurckoScaffold.GetScaffoldForMol(self._rdmol)) 174 | if not s.is_valid: 175 | s = self 176 | return s 177 | 178 | def tanimoto_similarity(self, other: "Molecule", fp_option: FingerprintOption) -> float: 179 | fp1 = self.get_fingerprint(fp_option, as_bitvec=True) 180 | fp2 = other.get_fingerprint(fp_option, as_bitvec=True) 181 | return DataStructs.TanimotoSimilarity(fp1, fp2) 182 | 183 | def dice_similarity(self, other: "Molecule", fp_option: FingerprintOption) -> float: 184 | fp1 = self.get_fingerprint(fp_option, as_bitvec=True) 185 | fp2 = other.get_fingerprint(fp_option, as_bitvec=True) 186 | return DataStructs.DiceSimilarity(fp1, fp2) 187 | 188 | @cache 189 | def sim( 190 | self, 191 | other: "Molecule", 192 | fp_option: FingerprintOption = FingerprintOption.morgan_for_tanimoto_similarity(), 193 | ) -> float: 194 | return self.tanimoto_similarity(other, fp_option) 195 | 196 | @cached_property 197 | def csmiles_md5(self) -> bytes: 198 | return hashlib.md5(self.csmiles.encode()).digest() 199 | 200 | @cached_property 201 | def csmiles_sha256(self) -> bytes: 202 | return hashlib.sha256(self.csmiles.encode()).digest() 203 | 204 | def get_meta_info(mol): 205 | return { 206 | "id": mol.GetProp("id") if mol.HasProp("id") else None, 207 | "IUPAC Name": mol.GetProp("IUPAC Name") if mol.HasProp("IUPAC Name") else None, 208 | "CAS": mol.GetProp("CAS") if mol.HasProp("CAS") else None, 209 | "purity": mol.GetProp("purity") if mol.HasProp("purity") else None, 210 | "MDLNUMBER": mol.GetProp("MDLNUMBER") if mol.HasProp("MDLNUMBER") else None, 211 | "LogP": mol.GetProp("LogP") if mol.HasProp("LogP") else None, 212 | "URL": mol.GetProp("URL") if mol.HasProp("URL") else None, 213 | "avail_US_100mg": mol.GetProp("avail_US_100mg") if mol.HasProp("avail_US_100mg") else None, 214 | "avail_US_250mg": mol.GetProp("avail_US_250mg") if mol.HasProp("avail_US_250mg") else None, 215 | "avail_US_1g": mol.GetProp("avail_US_1g") if mol.HasProp("avail_US_1g") else None, 216 | "avail_US_2_5g": mol.GetProp("avail_US_2_5g") if mol.HasProp("avail_US_2_5g") else None 217 | } 218 | 219 | 220 | def read_mol_file( 221 | path: os.PathLike, 222 | major_only: bool = True, 223 | drop_duplicates: bool = True, 224 | show_pbar: bool = True, 225 | smiles_col: str | None = None, 226 | pbar_fn=partial(tqdm, desc="Reading"), 227 | ) -> Iterable[Molecule]: 228 | path = pathlib.Path(path) 229 | if path.suffix == ".sdf": 230 | f = Chem.SDMolSupplier(str(path)) 231 | elif path.suffix == ".smi": 232 | f = 
Chem.SmilesMolSupplier(str(path))
233 |     elif path.suffix == ".csv":
234 |         df = pd.read_csv(path)
235 |         if smiles_col is None:
236 |             if "smiles" in df.columns:
237 |                 smiles_col = "smiles"
238 |             elif "SMILES" in df.columns:
239 |                 smiles_col = "SMILES"
240 |             else:
241 |                 raise ValueError(f"Cannot find SMILES column in {path}")
242 |         f = (Chem.MolFromSmiles(smiles) for smiles in df[smiles_col])
243 |     else:
244 |         raise ValueError(f"Unsupported file type: {path.suffix}")
245 |     visited: set[str] = set()
246 |     if show_pbar:
247 |         f_iter = pbar_fn(f)
248 |     else:
249 |         f_iter = f
250 |     for rdmol in f_iter:
251 |         if rdmol is not None:
252 |             meta_info = get_meta_info(rdmol)
253 |             mol = Molecule.from_rdmol(rdmol)
254 |             mol.meta_info = meta_info
255 |             if major_only:
256 |                 mol = mol.major_molecule
257 |             if drop_duplicates and mol.csmiles in visited:
258 |                 continue
259 |             yield mol
260 |             visited.add(mol.csmiles)
261 | 
262 | 
263 | def write_to_smi(path: os.PathLike, mols: Sequence[Molecule]):
264 |     with open(path, "w") as f:
265 |         for mol in mols:
266 |             f.write(f"{mol.smiles}\n")
267 | 
--------------------------------------------------------------------------------
/synllama/chem/reaction.py:
--------------------------------------------------------------------------------
1 | import os
2 | from collections.abc import Iterable, Sequence
3 | from functools import cached_property
4 | from typing import overload
5 | 
6 | from rdkit import Chem
7 | from rdkit.Chem import AllChem, Draw, rdChemReactions
8 | 
9 | from synllama.chem.base import Drawable
10 | from synllama.chem.mol import Molecule
11 | 
12 | 
13 | class Template(Drawable):
14 |     def __init__(self, smarts: str) -> None:
15 |         super().__init__()
16 |         self._smarts = smarts.strip()
17 | 
18 |     def __getstate__(self):
19 |         return self._smarts
20 | 
21 |     def __setstate__(self, state):
22 |         self._smarts = state
23 | 
24 |     @property
25 |     def smarts(self) -> str:
26 |         return self._smarts
27 | 
28 |     @cached_property
29 |     def _rdmol(self):
30 |         return AllChem.MolFromSmarts(self._smarts)
31 | 
32 |     def draw(self, size: int = 100, svg: bool = False):
33 |         if svg:
34 |             return Draw._moltoSVG(self._rdmol, sz=(size, size), highlights=[], legend=[], kekulize=True)
35 |         else:
36 |             return Draw.MolToImage(self._rdmol, size=(size, size), kekulize=True)
37 | 
38 |     def match(self, mol: Molecule) -> bool:
39 |         return mol._rdmol.HasSubstructMatch(self._rdmol)
40 | 
41 |     def __hash__(self) -> int:
42 |         return hash(self._smarts)
43 | 
44 |     def __eq__(self, __value: object) -> bool:
45 |         return isinstance(__value, Template) and self.smarts == __value.smarts
46 | 
47 | 
48 | class Reaction(Drawable):
49 |     def __init__(self, smarts: str) -> None:
50 |         super().__init__()
51 |         self._smarts = smarts.strip()
52 | 
53 |     def __getstate__(self):
54 |         return self._smarts
55 | 
56 |     def __setstate__(self, state):
57 |         self._smarts = state
58 | 
59 |     @property
60 |     def smarts(self) -> str:
61 |         return self._smarts
62 | 
63 |     @cached_property
64 |     def _reaction(self):
65 |         r = AllChem.ReactionFromSmarts(self._smarts)
66 |         rdChemReactions.ChemicalReaction.Initialize(r)
67 |         return r
68 | 
69 |     def draw(self, size: int = 100, svg: bool = False):
70 |         return Draw.ReactionToImage(self._reaction, subImgSize=(size, size), useSVG=svg)
71 | 
72 |     @cached_property
73 |     def num_reactants(self) -> int:
74 |         return self._reaction.GetNumReactantTemplates()
75 | 
76 |     @cached_property
77 |     def num_agents(self) -> int:
78 |         return self._reaction.GetNumAgentTemplates()
79 | 
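    # Illustrative usage (the amide-coupling SMARTS below is an assumed example,
    # not necessarily one of the shipped reaction templates):
    #   rxn = Reaction("[C:1](=[O:2])[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]")
    #   rxn.num_reactants                              # -> 2
    #   rxn([Molecule("CC(=O)O"), Molecule("CCN")])    # -> [Molecule("CCNC(C)=O")]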
80 |     @cached_property
81 |     def num_products(self) -> int:
82 |         return self._reaction.GetNumProductTemplates()
83 | 
84 |     @cached_property
85 |     def reactant_templates(self) -> tuple[Template, ...]:
86 |         # reactant_smarts = self.smarts.split(">")[0].split(".")
87 |         reactant_smarts = [Chem.MolToSmarts(self._reaction.GetReactantTemplate(i)) for i in range(self.num_reactants)]
88 |         return tuple(Template(s) for s in reactant_smarts)
89 | 
90 |     def match_reactant_templates(self, mol: Molecule) -> tuple[int, ...]:
91 |         matched: list[int] = []
92 |         for i, template in enumerate(self.reactant_templates):
93 |             if template.match(mol):
94 |                 matched.append(i)
95 |         return tuple(matched)
96 | 
97 |     @cached_property
98 |     def product_templates(self) -> tuple[Template, ...]:
99 |         product_smarts = self.smarts.split(">")[2].split(".")
100 |         return tuple(Template(s) for s in product_smarts)
101 | 
102 |     def is_reactant(self, mol: Molecule) -> bool:
103 |         return self._reaction.IsMoleculeReactant(mol._rdmol)
104 | 
105 |     def is_agent(self, mol: Molecule) -> bool:
106 |         return self._reaction.IsMoleculeAgent(mol._rdmol)
107 | 
108 |     def is_product(self, mol: Molecule) -> bool:
109 |         return self._reaction.IsMoleculeProduct(mol._rdmol)
110 | 
111 |     def __call__(self, reactants: Sequence[Molecule] | Sequence[str]) -> list[Molecule]:
112 |         if isinstance(reactants, Sequence) and not isinstance(reactants, str):
113 |             reactants = [Molecule(m) if isinstance(m, str) else Molecule.from_rdmol(m._rdmol) for m in reactants]  # accept SMILES strings as well as Molecule objects
114 |         products = [Molecule.from_rdmol(p[0]) for p in self._reaction.RunReactants([m._rdmol for m in reactants])]
115 |         products = [p for p in products if p.is_valid]
116 |         return products
117 | 
118 |     def __hash__(self) -> int:
119 |         return hash(self._smarts)
120 | 
121 |     def __eq__(self, __value: object) -> bool:
122 |         return isinstance(__value, Reaction) and self.smarts == __value.smarts
123 | 
124 | 
125 | class ReactionContainer(Sequence[Reaction]):
126 |     def __init__(self, reactions: Iterable[Reaction]) -> None:
127 |         super().__init__()
128 |         self._reactions = tuple(reactions)
129 | 
130 |     @overload
131 |     def __getitem__(self, index: int) -> Reaction: ...
132 | 
133 |     @overload
134 |     def __getitem__(self, index: slice) -> tuple[Reaction, ...]: ...
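    # Illustrative usage (the file path comes from this repo's data folder):
    #   rxns = ReactionContainer(read_reaction_file("data/91_rxns/91_rxn_templates.txt"))
    #   rxns[0]                                     # -> Reaction
    #   rxns.match_reactions(Molecule("CC(=O)O"))   # -> {reaction_index: matched reactant-template positions}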
135 | 136 | def __getitem__(self, index: int | slice): 137 | return self._reactions[index] 138 | 139 | def __len__(self) -> int: 140 | return len(self._reactions) 141 | 142 | def match_reactions(self, mol: Molecule) -> dict[int, tuple[int, ...]]: 143 | matched: dict[int, tuple[int, ...]] = {} 144 | for i, rxn in enumerate(self._reactions): 145 | m = rxn.match_reactant_templates(mol) 146 | if len(m) > 0: 147 | matched[i] = m 148 | return matched 149 | 150 | 151 | def read_reaction_file(path: os.PathLike) -> list[Reaction]: 152 | reactions: list[Reaction] = [] 153 | with open(path) as f: 154 | for line in f: 155 | line = line.strip() 156 | if not line: 157 | continue 158 | reactions.append(Reaction(line)) 159 | return reactions 160 | -------------------------------------------------------------------------------- /synllama/chem/smiles_tfidf.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | from collections import Counter 3 | from itertools import chain 4 | from collections.abc import Iterable 5 | from tqdm import tqdm 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from difflib import SequenceMatcher 8 | 9 | import numpy as np 10 | from typing import List, Tuple 11 | from sklearn.neighbors import BallTree 12 | from joblib import Parallel, delayed 13 | 14 | from synllama.chem.mol import Molecule 15 | from synllama.chem.fpindex import _QueryResult 16 | 17 | def string_similarity(s1, s2): 18 | return SequenceMatcher(None, s1, s2).ratio() 19 | 20 | def sort_by_similarity(string, target_list): 21 | return sorted(target_list, key=lambda x: string_similarity(string, x)) 22 | 23 | def find_closest_match(string, target_list): 24 | return sort_by_similarity(string, target_list)[0] 25 | 26 | class SmilesTokenizer: 27 | def __init__(self, token_list_path, n_gram_range=(2, 4)): 28 | with open(token_list_path, "r") as f: 29 | self.token_list = [line.strip() for line in f.readlines()] 30 | self.token_list = sorted(self.token_list, key=len, reverse=True) 31 | self.n_gram_range = n_gram_range 32 | 33 | def __call__(self, smiles): 34 | tokens = self._tokenize(smiles) 35 | return self._create_ngrams(tokens) 36 | 37 | def _tokenize(self, smiles): 38 | tokens = [] 39 | i = 0 40 | while i < len(smiles): 41 | matched = False 42 | for token in self.token_list: 43 | if smiles.startswith(token, i): 44 | tokens.append(token) 45 | i += len(token) 46 | matched = True 47 | break 48 | if not matched: 49 | tokens.append(smiles[i]) 50 | i += 1 51 | return tokens 52 | 53 | def _create_ngrams(self, tokens): 54 | ngrams = [] 55 | for n in range(self.n_gram_range[0], self.n_gram_range[1] + 1): 56 | for i in range(len(tokens) - n + 1): 57 | ngrams.append(" ".join(tokens[i:i+n])) 58 | return ngrams 59 | 60 | def compute_embeddings(vectorizer: TfidfVectorizer, molecules: Iterable[Molecule], batch_size: int = 1024) -> np.ndarray: 61 | all_smiles = [mol.smiles for mol in molecules] 62 | 63 | def process_batch(batch): 64 | return vectorizer.transform(batch) 65 | 66 | batches = [all_smiles[i:i + batch_size] for i in range(0, len(all_smiles), batch_size)] 67 | 68 | embeddings_list = Parallel(n_jobs=-1)( 69 | delayed(process_batch)(batch) for batch in tqdm(batches, desc="Computing embeddings") 70 | ) 71 | 72 | return np.vstack(embeddings_list) 73 | 74 | class SmilesSimilaritySearch: 75 | def __init__(self, n_gram_range=(2, 4), token_list_path=None, max_features=1024): 76 | self.n_gram_range = n_gram_range 77 | self.token_list_path = token_list_path 78 | 
self.tokenizer = SmilesTokenizer(token_list_path, n_gram_range) if token_list_path else None 79 | self.ngram_vocab = None 80 | self.idf = None 81 | self._molecules = None 82 | self._embeddings = None 83 | self._tree = None 84 | self.max_features = max_features 85 | 86 | def fit(self, molecules: Iterable[Molecule], save_ngram = False): 87 | self._molecules = tuple(molecules) 88 | all_smiles = [mol.smiles for mol in self._molecules] 89 | 90 | if self.tokenizer: 91 | all_tokens = [self.tokenizer(smiles) for smiles in tqdm(all_smiles, desc="Tokenizing SMILES")] 92 | else: 93 | all_tokens = all_smiles 94 | 95 | # Generate n-gram vocabulary (now sorted and limited) 96 | self.ngram_vocab = self._generate_ngram_vocab(all_tokens) 97 | if save_ngram: 98 | with open("data/processed/smiles_similarity_search_ngram_vocab.txt", "w") as f: 99 | for ngram in self.ngram_vocab: 100 | f.write(ngram + "\n") 101 | 102 | # Compute IDF (now using the limited vocabulary) 103 | self.idf = self._compute_idf(all_tokens) 104 | 105 | # Compute TF-IDF embeddings (now using the limited vocabulary) 106 | self._embeddings = self._compute_tfidf_embeddings(all_tokens) 107 | 108 | # Initialize BallTree 109 | self._tree = self._init_tree() 110 | 111 | def _generate_ngram_vocab(self, all_tokens): 112 | ngram_counter = Counter() 113 | for tokens in tqdm(all_tokens, desc="Generating n-gram vocabulary"): 114 | if self.tokenizer: 115 | ngrams = self.tokenizer._create_ngrams(tokens) 116 | else: 117 | ngrams = [tokens[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 118 | for i in range(len(tokens)-n+1)] 119 | ngram_counter.update(ngrams) 120 | 121 | # Sort n-grams by frequency and keep only the top max_features 122 | return [ngram for ngram, _ in ngram_counter.most_common(self.max_features)] 123 | 124 | def _compute_idf(self, all_tokens): 125 | doc_freq = Counter() 126 | for tokens in tqdm(all_tokens, desc="Computing IDF"): 127 | if self.tokenizer: 128 | ngrams = set(self.tokenizer._create_ngrams(tokens)) 129 | else: 130 | ngrams = set(tokens[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 131 | for i in range(len(tokens)-n+1)) 132 | doc_freq.update(ngram for ngram in ngrams if ngram in self.ngram_vocab) 133 | 134 | num_docs = len(all_tokens) 135 | return {ngram: np.log(num_docs / (count + 1)) + 1 for ngram, count in doc_freq.items()} 136 | 137 | def _compute_tfidf_embeddings(self, all_tokens): 138 | embeddings = [] 139 | for tokens in tqdm(all_tokens, desc="Computing TF-IDF embeddings"): 140 | if self.tokenizer: 141 | ngrams = self.tokenizer._create_ngrams(tokens) 142 | else: 143 | ngrams = [tokens[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 144 | for i in range(len(tokens)-n+1)] 145 | 146 | tf = Counter(ngram for ngram in ngrams if ngram in self.ngram_vocab) 147 | tfidf = np.zeros(len(self.ngram_vocab)) 148 | for i, ngram in enumerate(self.ngram_vocab): 149 | if ngram in tf: 150 | tfidf[i] = tf[ngram] * self.idf.get(ngram, 0) 151 | embeddings.append(tfidf) 152 | return np.array(embeddings) 153 | 154 | def _init_tree(self) -> BallTree: 155 | return BallTree(self._embeddings, metric='manhattan') 156 | 157 | def query(self, query_smiles: str, k: int = 10) -> List[_QueryResult]: 158 | if self.tokenizer: 159 | query_tokens = self.tokenizer(query_smiles) 160 | query_ngrams = self.tokenizer._create_ngrams(query_tokens) 161 | else: 162 | query_ngrams = [query_smiles[i:i+n] for n in range(self.n_gram_range[0], self.n_gram_range[1]+1) 163 | for i in range(len(query_smiles)-n+1)] 
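        # Build the query's TF-IDF vector over the fitted n-gram vocabulary:
        # term frequencies come from the query n-grams, weights from the stored IDF.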
164 | 165 | tf = Counter(ngram for ngram in query_ngrams if ngram in self.ngram_vocab) 166 | query_embedding = np.zeros(len(self.ngram_vocab)) 167 | for i, ngram in enumerate(self.ngram_vocab): 168 | if ngram in tf: 169 | query_embedding[i] = tf[ngram] * self.idf.get(ngram, 0) 170 | 171 | distances, indices = self._tree.query(query_embedding.reshape(1, -1), k=k) 172 | 173 | results = [] 174 | for distance, idx in zip(distances[0], indices[0]): 175 | results.append(_QueryResult( 176 | index=idx, 177 | molecule=self._molecules[idx], 178 | fingerprint=None, 179 | distance=distance 180 | )) 181 | 182 | return sorted(results, key=lambda x: x.distance) 183 | 184 | def save(self, filename): 185 | with open(filename, 'wb') as f: 186 | pickle.dump(self, f) 187 | 188 | @classmethod 189 | def load(cls, filename): 190 | with open(filename, 'rb') as f: 191 | return pickle.load(f) 192 | -------------------------------------------------------------------------------- /synllama/chem/stack.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | import itertools 3 | import random 4 | from collections.abc import Iterable 5 | from typing import TypeAlias 6 | 7 | import numpy as np 8 | 9 | from synllama.chem.matrix import ReactantReactionMatrix 10 | from synllama.chem.mol import Molecule 11 | from synllama.chem.reaction import Reaction 12 | from synllama.chem.smiles_tfidf import sort_by_similarity 13 | 14 | _NumReactants: TypeAlias = int 15 | _MolOrRxnIndex: TypeAlias = int 16 | _TokenType: TypeAlias = tuple[_NumReactants, _MolOrRxnIndex] 17 | 18 | 19 | def _flatten(l): 20 | for el in l: 21 | if isinstance(el, list): 22 | yield from _flatten(el) 23 | else: 24 | yield el 25 | 26 | 27 | @dataclasses.dataclass 28 | class _Node: 29 | mol: Molecule 30 | rxn: Reaction | None 31 | token: _TokenType 32 | children: list["_Node"] 33 | 34 | def to_str(self, depth: int) -> str: 35 | pad = " " * depth * 2 36 | lines = [f"{pad}{self.mol.smiles}"] 37 | if self.rxn is not None: 38 | for c in self.children: 39 | lines.append(f"{c.to_str(depth + 1)}") 40 | return "\n".join(lines) 41 | 42 | def __repr__(self) -> str: 43 | return f"Node(\n{self.to_str(1)}\n)" 44 | 45 | 46 | class Stack: 47 | def __init__(self) -> None: 48 | super().__init__() 49 | self._mols: list[Molecule] = [] 50 | self._rxns: list[Reaction | None] = [] 51 | self._tokens: list[_TokenType] = [] 52 | self._stack: list[set[Molecule]] = [] 53 | 54 | @property 55 | def mols(self) -> tuple[Molecule, ...]: 56 | return tuple(self._mols) 57 | 58 | @property 59 | def rxns(self) -> tuple[Reaction | None, ...]: 60 | return tuple(self._rxns) 61 | 62 | @property 63 | def tokens(self) -> tuple[_TokenType, ...]: 64 | return tuple(self._tokens) 65 | 66 | def get_top(self) -> set[Molecule]: 67 | return self._stack[-1] 68 | 69 | def get_second_top(self) -> set[Molecule]: 70 | return self._stack[-2] 71 | 72 | def get_third_top(self) -> set[Molecule]: 73 | return self._stack[-3] 74 | 75 | def push_mol(self, mol: Molecule, index: int) -> None: 76 | self._mols.append(mol) 77 | self._rxns.append(None) 78 | self._tokens.append((-1, index)) 79 | self._stack.append({mol}) 80 | 81 | def push_rxn(self, rxn: Reaction, index: int, product_limit: int | None = None, product_template: str | None = None) -> bool: 82 | if len(self._stack) < rxn.num_reactants: 83 | return False 84 | 85 | prods: list[Molecule] = [] 86 | if rxn.num_reactants == 1: 87 | for r in self.get_top(): 88 | prods += rxn([r]) 89 | elif rxn.num_reactants == 2: 90 | for 
r1, r2 in itertools.product(self.get_top(), self.get_second_top()): 91 | if product_limit is not None and len(prods) >= product_limit: 92 | break 93 | prods += rxn([r1, r2]) + rxn([r2, r1]) 94 | elif rxn.num_reactants == 3: 95 | for r1, r2, r3 in itertools.product(self.get_top(), self.get_second_top(), self.get_third_top()): 96 | if product_limit is not None and len(prods) >= product_limit: 97 | break 98 | prods += ( 99 | rxn([r1, r2, r3]) 100 | + rxn([r1, r3, r2]) 101 | + rxn([r2, r1, r3]) 102 | + rxn([r2, r3, r1]) 103 | + rxn([r3, r2, r1]) 104 | + rxn([r3, r1, r2]) 105 | ) 106 | else: 107 | return False 108 | 109 | if len(prods) == 0: 110 | return False 111 | prod_dict = {m.smiles: m for m in prods} 112 | if product_template is not None: 113 | prod_sorted = sort_by_similarity(product_template, list(prod_dict.keys())) 114 | prod_sorted = [prod_dict[p] for p in prod_sorted] 115 | else: 116 | prod_sorted = prods 117 | if product_limit is not None: 118 | prod_sorted = prod_sorted[:product_limit] 119 | prod: Molecule = prod_sorted[0] if product_template is not None else random.choices(prod_sorted, weights=[len(p.smiles) for p in prod_sorted])[0] 120 | 121 | self._mols.append(prod) # need to look into this step for why there's wrong molecules, the prod here is not necessarily the product of the reaction 122 | self._rxns.append(rxn) 123 | self._tokens.append((rxn.num_reactants, index)) 124 | for _ in range(rxn.num_reactants): 125 | self._stack.pop() 126 | self._stack.append(set([prod])) 127 | return True 128 | 129 | def get_tree(self) -> _Node: 130 | stack: list[_Node] = [] 131 | for i in range(len(self._tokens)): 132 | token = self._tokens[i] 133 | n_react = token[0] 134 | if n_react > 0: 135 | item = _Node(self._mols[i], self._rxns[i], token, []) 136 | for _ in range(n_react): 137 | item.children.append(stack.pop()) 138 | stack.append(item) 139 | else: 140 | stack.append(_Node(self._mols[i], self._rxns[i], token, [])) 141 | return stack[-1] 142 | 143 | def get_postfix_tokens(self) -> tuple[_TokenType, ...]: 144 | return tuple(self._tokens) 145 | 146 | def __len__(self) -> int: 147 | return len(self._mols) 148 | 149 | def __getitem__(self, index: int) -> Molecule: 150 | return self._mols[index] 151 | 152 | def get_mol_idx_seq(self) -> list[int | None]: 153 | return [t[1] if t[0] == 0 else None for t in self.tokens] 154 | 155 | def get_rxn_idx_seq(self) -> list[int | None]: 156 | return [t[1] if t[0] > 0 else None for t in self.tokens] 157 | 158 | def count_reactions(self) -> int: 159 | cnt = 0 160 | for rxn in self._rxns: 161 | if rxn is not None: 162 | cnt += 1 163 | return cnt 164 | 165 | def get_state_repr(self) -> str: 166 | rl: list[str] = [] 167 | for s in self._stack: 168 | sl = list(map(lambda m: m.smiles, s)) 169 | sl.sort() 170 | rl.append(",".join(sl)) 171 | return ";".join(rl) 172 | 173 | def get_action_string(self, delim: str = ";") -> str: 174 | tokens: list[str] = [] 175 | for mol, (num_reactants, idx) in zip(self._mols, self._tokens): 176 | if num_reactants == -1: 177 | tokens.append(f"{mol.smiles}") 178 | else: 179 | tokens.append(f"R{idx}_{num_reactants}") 180 | tokens.append(f"{mol.smiles}") 181 | return delim.join(tokens) 182 | 183 | def get_source(self, delim: str = ";") -> str: 184 | return delim.join([mol.source for mol in self._mols]) 185 | 186 | def create_init_stack(matrix: ReactantReactionMatrix, rxn_count: dict[str, int], weighted_ratio: float = 0.0, prob_u_fp: str = None) -> Stack: 187 | stack = Stack() 188 | prob_u = np.ones(len(matrix.reactant_count)) / 
len(matrix.reactant_count) 189 | if prob_u_fp is not None: 190 | with open(prob_u_fp, "r") as f: 191 | prob_u = np.array(list(map(float, f.read().splitlines())), dtype=np.float32) 192 | prob_u = prob_u / prob_u.sum() 193 | prob_w = list(rxn_count.values()) 194 | prob_w = np.array([prob_w[i] if i in matrix.seed_reaction_indices else 0 for i in range(len(matrix.reactant_count))], dtype=np.float32) 195 | prob_w = prob_w / prob_w.sum() 196 | prob = weighted_ratio * prob_w + (1 - weighted_ratio) * prob_u 197 | prob = prob / np.sum(prob) 198 | rxn_index: int = np.random.choice(np.arange(len(matrix.reactions)), p=prob) 199 | rxn_col = matrix.matrix[:, rxn_index] 200 | rxn = matrix.reactions[rxn_index] 201 | 202 | if rxn.num_reactants == 2: 203 | m1 = np.random.choice(np.bitwise_and(rxn_col, 0b01).nonzero()[0]) 204 | m2 = np.random.choice(np.bitwise_and(rxn_col, 0b10).nonzero()[0]) 205 | if random.randint(0, 1) % 2 == 1: 206 | m1, m2 = m2, m1 207 | stack.push_mol(matrix.reactants[m1], m1) 208 | stack.push_mol(matrix.reactants[m2], m2) 209 | elif rxn.num_reactants == 1: 210 | m = np.random.choice(rxn_col.nonzero()[0]) 211 | stack.push_mol(matrix.reactants[m], m) 212 | elif rxn.num_reactants == 3: 213 | m1 = np.random.choice(np.bitwise_and(rxn_col, 0b001).nonzero()[0]) 214 | m2 = np.random.choice(np.bitwise_and(rxn_col, 0b010).nonzero()[0]) 215 | m3 = np.random.choice(np.bitwise_and(rxn_col, 0b100).nonzero()[0]) 216 | m1, m2, m3 = random.sample([m1, m2, m3], 3) 217 | stack.push_mol(matrix.reactants[m1], m1) 218 | stack.push_mol(matrix.reactants[m2], m2) 219 | stack.push_mol(matrix.reactants[m3], m3) 220 | 221 | stack.push_rxn(rxn, rxn_index) 222 | return stack 223 | 224 | 225 | def expand_stack(stack: Stack, matrix: ReactantReactionMatrix): 226 | matches = matrix.reactions.match_reactions(random.choice(list(stack.get_top()))) 227 | if len(matches) == 0: 228 | return stack, False 229 | rxn_index = random.choice(list(matches.keys())) 230 | reactant_flag = 1 << matches[rxn_index][0] 231 | 232 | rxn_col = matrix.matrix[:, rxn_index] 233 | if np.any(rxn_col >= 4): 234 | # Case of tri-mol reaction 235 | all_reactants = 0b111 236 | remaining_reactants = all_reactants ^ reactant_flag 237 | reactant_1 = remaining_reactants & 0b001 # Isolate the 001 bit 238 | reactant_2 = remaining_reactants & 0b010 # Isolate the 010 bit 239 | reactant_3 = remaining_reactants & 0b100 # Isolate the 100 bit 240 | valid_reactants = [reactant for reactant in [reactant_1, reactant_2, reactant_3] if reactant != 0] 241 | s_indices_1 = np.logical_and(rxn_col != 0, (rxn_col & valid_reactants[0]) == valid_reactants[0]).nonzero()[0] 242 | s_indices_2 = np.logical_and(rxn_col != 0, (rxn_col & valid_reactants[1]) == valid_reactants[1]).nonzero()[0] 243 | s_indices_1, s_indices_2 = random.sample([s_indices_1, s_indices_2], 2) 244 | s_index1 = np.random.choice(s_indices_1) 245 | stack.push_mol(matrix.reactants[s_index1], s_index1) 246 | s_index2 = np.random.choice(s_indices_2) 247 | stack.push_mol(matrix.reactants[s_index2], s_index2) 248 | rxn_success = stack.push_rxn(matrix.reactions[rxn_index], rxn_index) 249 | else: 250 | s_indices = np.logical_and(rxn_col != 0, rxn_col != reactant_flag).nonzero()[0] 251 | if len(s_indices) == 0: 252 | stack.push_rxn(matrix.reactions[rxn_index], rxn_index) 253 | return stack, True 254 | s_index = np.random.choice(s_indices) 255 | stack.push_mol(matrix.reactants[s_index], s_index) 256 | rxn_success = stack.push_rxn(matrix.reactions[rxn_index], rxn_index) 257 | if not rxn_success: 258 | # pop the last 
pushed molecule and end the reaction 259 | stack._mols.pop() 260 | stack._rxns.pop() 261 | stack._tokens.pop() 262 | stack._stack.pop() 263 | return stack, rxn_success 264 | 265 | 266 | def create_stack( 267 | matrix: ReactantReactionMatrix, 268 | rxn_count: dict[str, int], 269 | max_num_reactions: int = 5, 270 | max_num_atoms: int = 80, 271 | init_stack_weighted_ratio: float = 0.0, 272 | prob_u_fp: str = None, 273 | ) -> Stack: 274 | stack = create_init_stack(matrix, rxn_count=rxn_count, weighted_ratio=init_stack_weighted_ratio, prob_u_fp=prob_u_fp) 275 | for _ in range(1, max_num_reactions): 276 | stack, changed = expand_stack(stack, matrix) 277 | if not changed: 278 | break 279 | if max(map(lambda m: m.num_atoms, stack.get_top())) > max_num_atoms: 280 | break 281 | return stack 282 | 283 | 284 | def create_stack_step_by_step( 285 | matrix: ReactantReactionMatrix, 286 | rxn_count: dict[str, int], 287 | max_num_reactions: int = 5, 288 | max_num_atoms: int = 80, 289 | init_stack_weighted_ratio: float = 0.0, 290 | prob_u_fp: str = None, 291 | ) -> Iterable[Stack]: 292 | stack = create_init_stack(matrix, rxn_count=rxn_count, weighted_ratio=init_stack_weighted_ratio, prob_u_fp=prob_u_fp) 293 | yield stack 294 | for _ in range(1, max_num_reactions): 295 | stack, changed = expand_stack(stack, matrix) 296 | if changed: 297 | yield stack 298 | else: 299 | break 300 | if max(map(lambda m: m.num_atoms, stack.get_top())) > max_num_atoms: 301 | break -------------------------------------------------------------------------------- /synllama/llm/parallel_inference.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import json, pickle, argparse, os 3 | import multiprocessing as mp 4 | from synllama.llm.vars import * 5 | from tqdm import tqdm 6 | from transformers import AutoTokenizer, AutoModelForCausalLM 7 | 8 | instruction = TEMPLATE["instruction"] 9 | input_template = TEMPLATE["input"] 10 | 11 | def generate_text(smiles, tokenizer, model, stopping_ids, sampling_params, max_length=1600): 12 | input = input_template.replace("SMILES_STRING", smiles) 13 | prompt_complete = "### Instruction:\n" + instruction + "\n\n### Input:\n"+ input + "\n\n### Response: \n" 14 | inputs = tokenizer(prompt_complete, return_tensors="pt").to(model.device) 15 | prompt_length = inputs.input_ids.shape[1] 16 | 17 | generated_texts = [] 18 | 19 | for params in sampling_params: 20 | temp = params["temp"] 21 | top_p = params["top_p"] 22 | repeat = params["repeat"] 23 | with torch.no_grad(): 24 | outputs = model.generate( 25 | **inputs, 26 | max_new_tokens=max_length, 27 | do_sample=True, 28 | temperature=temp, 29 | top_p=top_p, 30 | num_return_sequences=repeat, 31 | eos_token_id=stopping_ids, 32 | pad_token_id=tokenizer.eos_token_id 33 | ) 34 | for output in outputs: 35 | generated_text = tokenizer.decode(output[prompt_length:], skip_special_tokens=True) 36 | generated_texts.append(generated_text.strip()) 37 | 38 | return generated_texts 39 | 40 | def process_batch(args): 41 | gpu_id, model_path, smiles_batch, sampling_params = args 42 | device = f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu' 43 | 44 | tokenizer = AutoTokenizer.from_pretrained(model_path) 45 | model = AutoModelForCausalLM.from_pretrained( 46 | model_path, 47 | torch_dtype=torch.float16 if 'cuda' in device else torch.float32, 48 | device_map={'': device} 49 | ) 50 | 51 | stopping_ids = [ 52 | tokenizer.eos_token_id, 53 | tokenizer.convert_tokens_to_ids("<|eot_id|>"), 54 | ] 55 | 56 | results = {} 57 
|     for smiles in tqdm(smiles_batch, desc=f"Processing on {device.upper()}"):
58 |         try:
59 |             response = generate_text(smiles, tokenizer, model, stopping_ids, sampling_params)
60 |             json_responses = []
61 |             for r in response:
62 |                 try:
63 |                     json_responses.append(json.loads(r))
64 |                 except json.JSONDecodeError:
65 |                     json_responses.append("json format error")
66 |             results[smiles] = json_responses
67 |         except Exception as e:
68 |             results[smiles] = f"Error: {str(e)}"
69 | 
70 |     return results
71 | 
72 | def main(model_path, smiles_path, save_path, sampling_params, gpus=None):
73 |     with open(smiles_path, "r") as f:
74 |         smiles_list = [line.strip() for line in f]
75 | 
76 |     num_gpus = torch.cuda.device_count() if gpus is None else gpus
77 |     print(f"Number of GPUs to use: {num_gpus}")
78 | 
79 |     if num_gpus > 1:
80 |         pool = mp.Pool(num_gpus)
81 |         try:
82 |             # Split the SMILES list into one interleaved batch per GPU and process each in its own worker
83 |             batches = [smiles_list[i::num_gpus] for i in range(num_gpus)]
84 |             results = pool.map(process_batch, [(i, model_path, batch, sampling_params) for i, batch in enumerate(batches)])
85 | 
86 |             # Combine results from all GPUs
87 |             combined_results = {}
88 |             for r in results:
89 |                 combined_results.update(r)
90 | 
91 |         finally:
92 |             # Ensure pool cleanup happens even if an error occurs
93 |             pool.close()      # Prevent any more tasks from being submitted
94 |             pool.join()       # Wait for all worker processes to finish
95 |             pool.terminate()  # Redundant once join() has returned; kept as a defensive cleanup
96 |     else:
97 |         # If only one GPU is available, process all SMILES in the current process
98 |         combined_results = process_batch((0, model_path, smiles_list, sampling_params))
99 | 
100 |     # Save results
101 |     with open(save_path, "wb") as f:
102 |         pickle.dump(combined_results, f)
103 | 
104 |     # Also return the results for programmatic use
105 |     return combined_results
106 | 
107 | if __name__ == "__main__":
108 |     parser = argparse.ArgumentParser(description="Run inference pipeline for reaction prediction")
109 |     parser.add_argument("--model_path", type=str, help="Path to the model", default="data/model/SynLlama-1B-2M")
110 |     parser.add_argument("--smiles_path", type=str, help="Path to the SMILES file")
111 |     parser.add_argument("--save_path", type=str, help="Pickle file path to save the results", default=None)
112 |     parser.add_argument("--sample_mode", type=str, default=None, help="Sampling mode, choose from: greedy, frugal, frozen_only, low_only, medium_only, high_only")
113 |     parser.add_argument("--temp", type=float, default=None, help="Sampling temperature (required if --sample_mode is not given)")
114 |     parser.add_argument("--top_p", type=float, default=None, help="Top-p (nucleus sampling) cutoff (required if --sample_mode is not given)")
115 |     parser.add_argument("--repeat", type=int, default=None, help="Number of generations per SMILES (required if --sample_mode is not given)")
116 |     parser.add_argument("--gpus", type=int, default=None, help="Number of GPUs to use; defaults to all available GPUs")
117 |     args = parser.parse_args()
118 |     mp.set_start_method('spawn', force=True)
119 |     if args.save_path is None:
120 |         args.save_path = args.smiles_path.replace(".smi", "_results.pkl")
121 |     directory = os.path.dirname(args.save_path)
122 |     if directory: os.makedirs(directory, exist_ok=True)  # guard against an empty dirname when saving to the cwd
123 |     sample_mode_mapping = {
124 |         "greedy": sampling_params_greedy,
125 |         "frugal": sampling_params_frugal,
126 |         "frozen_only": sampling_params_frozen_only,
127 |         "low_only": sampling_params_low_only,
128 |         "medium_only": sampling_params_medium_only,
129 |         "high_only": sampling_params_high_only
130 |     }
131 |     if args.sample_mode is None:
132 |         assert args.temp is not None and args.top_p is not None and args.repeat is not None, "Please provide either a sample mode or all three sampling parameters (--temp, --top_p, --repeat)"
133 |         sampling_params = [
134 |             {"temp": args.temp, "top_p": args.top_p, "repeat": args.repeat}
135 |         ]
136 |     else:
137 |         assert args.sample_mode in sample_mode_mapping, f"Invalid sample mode: {args.sample_mode}"
138 |         sampling_params = sample_mode_mapping[args.sample_mode]
139 | 
140 |     main(args.model_path, args.smiles_path, args.save_path, sampling_params=sampling_params, gpus=args.gpus)
--------------------------------------------------------------------------------
/synllama/llm/sft/synllama_sft.yml:
--------------------------------------------------------------------------------
1 | base_model: meta-llama/Llama-3.2-1B-Instruct
2 | model_type: LlamaForCausalLM
3 | tokenizer_type: AutoTokenizer
4 | is_llama_derived_model: true
5 | 
6 | load_in_8bit: true
7 | load_in_4bit: false
8 | strict: false
9 | 
10 | datasets:
11 |   - path: CHANGE_TO_YOUR_DATASET_PATH (prepared dataset path in jsonl format)
12 |     ds_type: json
13 |     type: alpaca
14 | dataset_prepared_path: CHANGE_TO_YOUR_DATASET_PATH (file path to save the prepared dataset)
15 | val_set_size: 0.05
16 | output_dir: CHANGE_TO_YOUR_OUTPUT_PATH (file path to save the outputs)
17 | 
18 | sequence_len: 2048
19 | sample_packing: true
20 | eval_sample_packing: true
21 | pad_to_sequence_len: true
22 | 
23 | overrides_of_model_config:
24 |   rope_scaling:
25 |     factor: 1.0
26 |     low_freq_factor: 1.0
27 |     high_freq_factor: 4.0
28 |     original_max_position_embeddings: 8192
29 |     rope_type: llama3
30 | 
31 | adapter: lora
32 | lora_model_dir:
33 | lora_r: 32
34 | lora_alpha: 16
35 | lora_dropout: 0.05
36 | lora_target_linear: true
37 | lora_fan_in_fan_out:
38 | 
39 | wandb_mode:
40 | wandb_project:
41 | wandb_entity:
42 | wandb_run_id:
43 | wandb_watch:
44 | wandb_log_model:
45 | wandb_name:
46 | 
47 | gradient_accumulation_steps: 4
48 | micro_batch_size: 4
49 | num_epochs: 1
50 | optimizer: adamw_bnb_8bit
51 | lr_scheduler: cosine
52 | learning_rate: 0.0002
53 | 
54 | train_on_inputs: false
55 | group_by_length: false
56 | bf16: auto
57 | fp16:
58 | tf32: false
59 | 
60 | gradient_checkpointing: true
61 | early_stopping_patience:
62 | resume_from_checkpoint:
63 | local_rank:
64 | logging_steps: 1
65 | xformers_attention:
66 | flash_attention: true
67 | s2_attention:
68 | 
69 | warmup_steps: 10
70 | evals_per_epoch: 4
71 | eval_table_size:
72 | eval_table_max_new_tokens: 128
73 | saves_per_epoch: 1
74 | debug:
75 | deepspeed:
76 | weight_decay: 0.0
77 | fsdp:
78 | fsdp_config:
79 | special_tokens:
80 |   pad_token: <|finetune_right_pad_id|>
--------------------------------------------------------------------------------
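The YAML above follows axolotl's fine-tuning config format (a LoRA adapter on `meta-llama/Llama-3.2-1B-Instruct`, loaded in 8-bit). As a minimal sketch of how such a config is typically run — assuming axolotl is installed and the `CHANGE_TO_YOUR_*` placeholders have been filled in; the project's authoritative steps are in the Retraining Guide:

```bash
# Sketch only: assumes axolotl is installed in the active environment and the
# dataset/output placeholders in synllama_sft.yml point at real paths.

# Tokenize and pack the alpaca-format JSONL dataset declared under `datasets:`.
python -m axolotl.cli.preprocess synllama/llm/sft/synllama_sft.yml

# Launch LoRA fine-tuning; adapter checkpoints are written to `output_dir`.
accelerate launch -m axolotl.cli.train synllama/llm/sft/synllama_sft.yml
```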
/synllama/llm/vars.py:
--------------------------------------------------------------------------------
1 | TEMPLATE = {
2 |     "instruction": "You are an expert synthetic organic chemist. Your task is to design a synthesis pathway for a given target molecule using common and reliable reaction templates and building blocks. Follow these instructions:\n\n1. **Input the SMILES String:** Read in the SMILES string of the target molecule and identify common reaction templates that can be applied.\n\n2. **Decompose the Target Molecule:** Use the identified reaction templates to decompose the target molecule into different intermediates.\n\n3. **Check for Building Blocks:** For each intermediate:\n - Identify if it is a building block. If it is, wrap it in and tags and save it for later use.\n - If it is not a building block, apply additional reaction templates to further decompose it into building blocks.\n\n4. **Document Reactions:** For each reaction documented in the output, wrap the reaction template in and tags.\n\n5. **Repeat the Process:** Continue this process until all intermediates are decomposed into building blocks, and document each step clearly in a structured JSON format.",
3 |     "input": "Provide a synthetic pathway for this SMILES string: SMILES_STRING",
4 |     "output": "{\"reactions\": [REACTIONS], \"building_blocks\": [BUILDING_BLOCKS]}"
5 | }
6 | BB_BASE = "\"Building_Block\""
7 | REACTION_BASE_MAX2 = "{\"reaction_number\": REACTION_NUM, \"reaction_template\": \"RXN_TEMPLATE\", \"reactants\": [\"REACTANT1\", \"REACTANT2\"], \"product\": \"PRODUCT\"}"
8 | REACTION_BASE_MAX3 = "{\"reaction_number\": REACTION_NUM, \"reaction_template\": \"RXN_TEMPLATE\", \"reactants\": [\"REACTANT1\", \"REACTANT2\", \"REACTANT3\"], \"product\": \"PRODUCT\"}"
9 | 
10 | sampling_params_frugal = [
11 |     {"temp": 0.1, "top_p": 0.1, "repeat": 1, "name": "frozen"},
12 |     {"temp": 0.6, "top_p": 0.5, "repeat": 1, "name": "low"},
13 |     {"temp": 1.0, "top_p": 0.7, "repeat": 1, "name": "medium"},
14 |     {"temp": 1.5, "top_p": 0.9, "repeat": 1, "name": "high"}
15 | ]
16 | 
17 | sampling_params_greedy = [
18 |     {"temp": 0.1, "top_p": 0.1, "repeat": 1, "name": "frozen"},
19 |     {"temp": 0.6, "top_p": 0.5, "repeat": 2, "name": "low"},
20 |     {"temp": 1.0, "top_p": 0.7, "repeat": 3, "name": "medium"},
21 |     {"temp": 1.5, "top_p": 0.9, "repeat": 4, "name": "high"}
22 | ]
23 | 
24 | sampling_params_frozen_only = [
25 |     {"temp": 0.1, "top_p": 0.1, "repeat": 1, "name": "frozen"}
26 | ]
27 | 
28 | sampling_params_low_only = [
29 |     {"temp": 0.6, "top_p": 0.5, "repeat": 5, "name": "low"}
30 | ]
31 | 
32 | sampling_params_medium_only = [
33 |     {"temp": 1.0, "top_p": 0.7, "repeat": 5, "name": "medium"}
34 | ]
35 | 
36 | sampling_params_high_only = [
37 |     {"temp": 1.5, "top_p": 0.9, "repeat": 5, "name": "high"}
38 | ]
--------------------------------------------------------------------------------
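Taken together, `vars.py` and `parallel_inference.py` expose two ways to control generation: a named `--sample_mode` preset (each entry pairs a temperature/top-p setting with a repeat count), or explicit `--temp`, `--top_p`, and `--repeat` values, all three of which are required when no preset is named. A minimal sketch of the explicit form — the file paths are placeholders, and the values mirror the `medium_only` preset:

```bash
# Sketch only: model/SMILES paths are placeholders. Values reproduce the
# `medium_only` preset (temp 1.0, top_p 0.7, 5 generations per target SMILES).
# With --save_path omitted, results are written to path/to/targets_results.pkl.
python synllama/llm/parallel_inference.py \
    --model_path path/to/SynLlama-model \
    --smiles_path path/to/targets.smi \
    --temp 1.0 --top_p 0.7 --repeat 5 \
    --gpus 2
```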