├── LICENSE ├── README.md ├── data ├── Ab8.csv ├── PDGF38_heavy.csv ├── PDGF38_light.csv └── PDGF38_raw.csv ├── images └── PfAbNet-viscosity_workflow.png ├── notebooks ├── 1_preprocess.ipynb ├── 2_build_hm.ipynb ├── 3_validation.ipynb ├── 4_attributions.ipynb ├── 5_sensitivity.ipynb └── __init__.py ├── pfabnet ├── __init__.py ├── base.py ├── dataset.py ├── esp_generator.py ├── generate_attributions.py ├── generate_testset_attributions.py ├── model.py ├── predict.py ├── sbatch_tmpl.sh ├── train.py ├── trainer.py └── utils.py ├── pfabnet_eisenberg ├── __init__.py ├── base.py ├── dataset.py ├── eisenberg_generator.py ├── model.py ├── predict.py ├── train.py ├── trainer.py └── utils.py └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PfAbNet-viscosity 2 | This repository accompanies the manuscript "Low-Data Interpretable Deep Learning Prediction of Antibody Viscosity using a Biophysically Meaningful Representation." The code and notebooks in this repository can be used to train PfAbNet-viscosity models, generate test set predictions and reproduce all analyses reported in the manuscript. 
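To run the code locally, clone the repository and install the Python dependencies. This is a minimal sketch; the exact pinned versions are listed in ```requirements.txt```, and the Schrodinger and OpenEye toolkits listed below are licensed separately and are not installable via pip.

```
git clone https://github.com/PfizerRD/PfAbNet-viscosity.git
cd PfAbNet-viscosity
pip install -r requirements.txt
```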
3 | ![alt text](https://github.com/PfizerRD/PfAbNet-viscosity/blob/main/images/PfAbNet-viscosity_workflow.png?raw=true) 4 | 5 | This workflow requires licenses for the following software/toolkits: 6 | 1. Bioluminate (Schrodinger LLC) 7 | 2. oechem, oespicoli, and oezap toolkits (OpenEye Scientific Software) 8 | 9 | Run the Jupyter notebooks in the following order to reproduce the analyses presented in the manuscript: 10 | 1. ```1_preprocess.ipynb```: Retrieve and process the raw data (measured viscosity and antibody sequences). 11 | 2. ```2_build_hm.ipynb```: Build homology models. Analyze and plot dataset diversity. 12 | 3. ```3_validation.ipynb```: Train PfAbNet-PDGF and PfAbNet-Ab21 models. Generate test set predictions and performance plots. 13 | 4. ```4_attributions.ipynb```: Perform attribution analysis. 14 | 5. ```5_sensitivity.ipynb```: Perform sensitivity analysis. 15 | 16 | ## Training 17 | The following command can be used to train PfAbNet models from the command line after the required input files have been created (see the Jupyter notebooks on how to specify input arguments). 18 | 19 | For example, this will train models using the PDGF38 dataset: 20 | 21 | ``` 22 | python pfabnet/train.py --training_data_file data/PDGF38.csv \ 23 | --homology_model_dir data/hm \ 24 | --output_model_prefix PfAbNet-PDGF38 \ 25 | --output_model_dir models/pdgf38 26 | ``` 27 | 28 | ## Inference 29 | The following command can be used to generate predictions for a test antibody using a .mol2 file with charges (see the Jupyter notebooks on how to specify input arguments). 30 | 31 | ``` 32 | python pfabnet/predict.py --structure_file data/hm/mAb1.mol2 \ 33 | --PfAbNet_model_dir models/pdgf38 \ 34 | --PfAbNet_model_prefix PfAbNet-PDGF38 \ 35 | --output_file models/pdgf38/mAb1.csv 36 | ``` 37 | 38 | -------------------------------------------------------------------------------- /data/Ab8.csv: -------------------------------------------------------------------------------- 1 | Entity,Viscosity_at_150,SCM 2 | TGN1412,16.42,844.6 3 | Basiliximab,25.05,640.8 4 | Natalizumab,13.67,815.5 5 | Tremelimumab,8.8,704.2 6 | Ipilimumab,8.6,754.0 7 | Atezolizumab,11.56,759.6 8 | Ganitumab,10.1,806.5 9 | Vesencumab,23.57,661.3 10 | -------------------------------------------------------------------------------- /data/PDGF38_heavy.csv: -------------------------------------------------------------------------------- 1 | Name FW1 CDR1 FW2 CDR2 FW3 CDR3 FW4 2 | AB-001 EVQLLESGGGLVQPGGSLRLSCAAS GFTFSSYAMS WVRQAPGKGLEWVS YISDDGSLKYYADSVKG RFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK HPYWYGGQLDL WGQGTLVTVSS 3 | 4QCI ----V-------------------- ---------- -------------- ----------------- -------------------------------R ----------- ----------- 4 | R1-002 -----Q------------------- ---------- -------------- ----------------- -------------------------------R ----------- ----------- 5 | R1-003 ------------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 6 | R1-004 ----------------------R-- ---------- -------------- ----------------- -------------------------------R ----------- ----------- 7 | R1-005 ------------------------- ---------- -------------- ---N------------- -------------------------------R ----------- ----------- 8 | R1-006 ------------------------- ---------- -------------- ----------------- -------------------------------R ----------- --R-------- 9 | R1-007 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R -----------
----------- 10 | R1-008 ------------------------- ---------- -------------- ----------------- -------------------------------- ----------- ----------- 11 | R1-009 ------------------------- ---------- -------------- ----------------- -------------------------------- ----------- ----------- 12 | R1-010 ------------------------- ---------- -------------- ----------------- -------------------------------- ----------- ----------- 13 | R1-011 ------------------------- ---------- -------------- ----------------- -------------------------------- ----------- ----------- 14 | R1-012 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 15 | R1-013 -----Q------------------- ---------- -------------- ----------------- -------------------------------R ----------- ----------- 16 | R1-014 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 17 | R1-015 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 18 | R1-016 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 19 | R1-017 ------------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 20 | R1-018 ------------------------- ---------- -------------- ----------------- -------------------------------R ----------- --R-------- 21 | R2-001 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 22 | R2-002 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 23 | R2-003 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 24 | R2-004 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 25 | R2-005 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 26 | R2-006 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 27 | R2-007 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 28 | R2-008 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 29 | R2-009 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 30 | R2-010 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R --H-------- ----------- 31 | R2-011 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R -------K--- ----------- 32 | R2-012 -----Q------K------------ ---------- -------------- ---K------------- -------------------------------R ----------- ----------- 33 | R2-013 -----Q------K------------ ---------- -------------- ---N------------- -------------------------------R ----------- ----------- 34 | R2-014 -----Q------K------------ ---------- -------------- ----Q------------ -------------------------------R ----------- ----------- 35 | R2-015 -----Q------K------------ ---------- -------------- 
------------N---- -------------------------------R ----------- ----------- 36 | R2-016 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ---------N- ----------- 37 | R2-017 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ---------Y- ----------- 38 | R2-018 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 39 | R2-019 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 40 | R2-020 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 41 | R2-021 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 42 | R2-022 -----Q------K------------ ---------- -------------- ----------------- -------------------------------R ----------- ----------- 43 | 44 | -------------------------------------------------------------------------------- /data/PDGF38_light.csv: -------------------------------------------------------------------------------- 1 | Name FW1 CDR1 FW2 CDR2 FW3 CDR3 FW4 2 | AB-001 SYELTQPPSVSVSPGQTASITC SGDSLGSYFVH WYQQKPGQSPVLVIY DDSNRPS GIPERFSGSNSGNTATLTISGTQAMDEADYYC SAFTHNSDV FGGGTKLTVL 3 | 4QCI ------------A-----R-S- ----------- --------A------ ------- ------------------------E------- --------- ---------- 4 | R1-002 ---------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 5 | R1-003 ---------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 6 | R1-004 ---------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 7 | R1-005 ---------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 8 | R1-006 ---------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 9 | R1-007 ---------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 10 | R1-008 --V------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 11 | R1-009 ----------------R----- ----------- --------------- ------- -------------------------------- --------- ---------- 12 | R1-010 ---------------------- ----------- --------------- ---K--- -------------------------------- --------- ---------- 13 | R1-011 ---------------------- ----------- --------------- ------- -------------------------------- -------N- ---------- 14 | R1-012 --V------------------- ----------- --------------- ------- -------------------------------- --------- ---------- 15 | R1-013 ----------------R----- ----------- --------------- ------- -------------------------------- --------- ---------- 16 | R1-014 ----------------R----- ----------- --------------- ------- -------------------------------- --------- ---------- 17 | R1-015 ---------------------- ----------- --------------- ------- -------------------------------- -------N- ---------- 18 | R1-016 ---------------------- ----------- --------------- ---K--- -------------------------------- --------- ---------- 19 | R1-017 ----------------R----- ----------- --------------- ------- -------------------------------- --------- ---------- 20 | R1-018 
----------------R----- ----------- --------------- ------- -------------------------------- --------- ---------- 21 | R2-001 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 22 | R2-002 --V---------A--K--R--- ---K------- --------------- ---K--- -------------------------------- --------- ---------- 23 | R2-003 --V---------A--K--R--- ----------K --------------- ---K--- -------------------------------- --------- ---------- 24 | R2-004 --V---------A--K--R--- ----------- --------------H ---K--- -------------------------------- --------- ---------- 25 | R2-005 --V---------A--K--R--- ----------- --------------R ---K--- -------------------------------- --------- ---------- 26 | R2-006 --V---------A--K--R--- ----------- --------------- --KK--- -------------------------------- --------- ---------- 27 | R2-007 --V---------A--K--R--- ----------- --------------- ---K--- --------K----------------------- --------- ---------- 28 | R2-008 --V---------A--K--R--- ----------- --------------- ---K--- ----------K--------------------- --------- ---------- 29 | R2-009 --V---------A--K--R--- ----------- --------------- ---K--- -----------K-------------------- --------- ---------- 30 | R2-010 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 31 | R2-011 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 32 | R2-012 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 33 | R2-013 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 34 | R2-014 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 35 | R2-015 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 36 | R2-016 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 37 | R2-017 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- --------- ---------- 38 | R2-018 --V---------A--K--R--- --N-------- --------------- ---K--- -------------------------------- --------- ---------- 39 | R2-019 --V---------A--K--R--- ----------- --------------- L--K--- -------------------------------- --------- ---------- 40 | R2-020 --V---------A--K--R--- ----------- --------------- -N-K--- -------------------------------- --------- ---------- 41 | R2-021 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- -------K- ---------- 42 | R2-022 --V---------A--K--R--- ----------- --------------- ---K--- -------------------------------- -------N- ---------- 43 | 44 | -------------------------------------------------------------------------------- /data/PDGF38_raw.csv: -------------------------------------------------------------------------------- 1 | Entity,Viscosity_at_150,SCM 2 | AB-001,440,-2213 3 | R1-002,288,-2008 4 | R1-003,523,-1985 5 | R1-004,310,-1961 6 | R1-005,190,-1838 7 | R1-006,314,-1941 8 | R1-007,233,-1988 9 | R1-008,567,-2085 10 | R1-009,430,-2180 11 | R1-010,99,-1898 12 | R1-011,519,-2035 13 | R1-012,471,-1861 14 | R1-013,414,-1972 15 | R1-014,415,-1983 16 | R1-015,452,-1817 17 | R1-016,73,-1706 18 | R1-017,1534,-1949 19 | R1-018,416,-1914 20 | R2-001,37,-1503 21 | R2-004,54,-1480 22 | R2-005,37,-1456 23 | 
R2-006,13,-1470 24 | R2-007,21,-1433 25 | R2-008,23,-1460 26 | R2-009,19,-1469 27 | R2-010,35,-1656 28 | R2-011,26,-1500 29 | R2-012,39,-1273 30 | R2-013,26,-1357 31 | R2-014,51,-1378 32 | R2-015,83,-1446 33 | R2-016,67,-1416 34 | R2-017,84,-1564 35 | R2-018,20,-1346 36 | R2-019,60,-1503 37 | R2-020,10,-1363 38 | R2-021,119,-1197 39 | R2-022,135,-1311 40 | -------------------------------------------------------------------------------- /images/PfAbNet-viscosity_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfizer-opensource/pfabnet-viscosity/60970a752a3e74cc336db13576a9c3a21448fe2e/images/PfAbNet-viscosity_workflow.png -------------------------------------------------------------------------------- /notebooks/1_preprocess.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "735f2307-49ee-42de-99bd-d3acd30711d4", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import os\n", 11 | "import copy\n", 12 | "import numpy as np\n", 13 | "import pandas as pd" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "id": "5cc62605-e842-49a0-a545-6c373969999f", 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "try:\n", 24 | "    from pfabnet import base, utils\n", 25 | "    from pfabnet.base import ENTITY_KEY, VISCOSITY_KEY\n", 26 | "except ModuleNotFoundError:  # running from notebooks/ - retry from the repo root\n", 27 | "    os.chdir(os.getcwd() + '/../')\n", 28 | "    from pfabnet import base, utils\n", 29 | "    from pfabnet.base import ENTITY_KEY, VISCOSITY_KEY" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "id": "b8daec3b-7a25-4f51-9563-8672e2bdda92", 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "BASE_DIR = os.path.dirname(base.get_file_path()) + '/../'\n", 40 | "DATA_DIR = os.path.join(BASE_DIR, 'data')\n", 41 | "RAW_DATA_DIR = os.path.join(DATA_DIR, 'raw')\n", 42 | "FASTA_DIR = os.path.join(DATA_DIR, 'fasta')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 4, 48 | "id": "df9be626-a510-4969-b9c3-1fad57a016fd", 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "# create the data directories\n", 53 | "os.makedirs(DATA_DIR, exist_ok=True)\n", 54 | "os.makedirs(RAW_DATA_DIR, exist_ok=True)\n", 55 | "os.makedirs(FASTA_DIR, exist_ok=True)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "id": "1502e6f4-6f1c-4cc8-b035-b92bb0500192", 61 | "metadata": {}, 62 | "source": [ 63 | "## Process Ab21 dataset" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 11, 69 | "id": "8a1a39e8-ceb7-4c8f-beb1-048ba7937c07", 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "# Extract supplementary data from Lai et al. Mol. Pharmaceutics 2021, 18, 3, 1167–1175\n", 74 | "# https://pubs.acs.org/doi/suppl/10.1021/acs.molpharmaceut.0c01073/suppl_file/mp0c01073_si_001.zip\n", 75 | "\n", 76 | "# 1. copy mp0c01073_si_001.zip to data\n", 77 | "# 2. unzip mp0c01073_si_001.zip - this will create a subdirectory SI" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 19, 83 | "id": "70cfc7f3-9d72-49ad-ba5b-a3db5ffd679e", 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Number of antibodies in Ab21 set: 21\n" 91 | ] 92 | }, 93 | { 94 | "data": { 95 | "text/html": [ 96 | "
\n", 97 | "\n", 110 | "\n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | "
EntityViscosity_at_150IsotypeN_hydrophobic FvN_hydrophobic mAbN_hydrophilic FvN_hydrophilic mAbHI_FvHI_mAbSASA_phobic_Fv...net charges mAbFvCSPmAbCSPFv_pImAb_pISAP FvSAP mAbSCM FvSCM mAbclassifier
0mAb114.4IgG1834921207121.0989221.0022463760.633301...26-10408.888.96134.8526.32522.96979.00
1mAb220.9IgG1854981106901.3640371.0781384469.273438...220108.028.75161.4573.41687.75731.80
2mAb314.9IgG1804861227201.3176411.0572774007.478271...260127.678.71149.5552.12170.06075.00
3mAb493.4IgG1784821227141.1950391.0248953754.267578...20-2-248.198.83161.9539.72406.37008.61
4mAb58.6IgG1895041127001.2732851.0528175683.704590...260227.868.85213.3598.21636.95795.40
\n", 260 | "

5 rows × 24 columns

\n", 261 | "
" 262 | ], 263 | "text/plain": [ 264 | " Entity Viscosity_at_150 Isotype N_hydrophobic Fv N_hydrophobic mAb \\\n", 265 | "0 mAb1 14.4 IgG1 83 492 \n", 266 | "1 mAb2 20.9 IgG1 85 498 \n", 267 | "2 mAb3 14.9 IgG1 80 486 \n", 268 | "3 mAb4 93.4 IgG1 78 482 \n", 269 | "4 mAb5 8.6 IgG1 89 504 \n", 270 | "\n", 271 | " N_hydrophilic Fv N_hydrophilic mAb HI_Fv HI_mAb SASA_phobic_Fv \\\n", 272 | "0 120 712 1.098922 1.002246 3760.633301 \n", 273 | "1 110 690 1.364037 1.078138 4469.273438 \n", 274 | "2 122 720 1.317641 1.057277 4007.478271 \n", 275 | "3 122 714 1.195039 1.024895 3754.267578 \n", 276 | "4 112 700 1.273285 1.052817 5683.704590 \n", 277 | "\n", 278 | " ... net charges mAb FvCSP mAbCSP Fv_pI mAb_pI SAP Fv SAP mAb \\\n", 279 | "0 ... 26 -10 40 8.88 8.96 134.8 526.3 \n", 280 | "1 ... 22 0 10 8.02 8.75 161.4 573.4 \n", 281 | "2 ... 26 0 12 7.67 8.71 149.5 552.1 \n", 282 | "3 ... 20 -2 -24 8.19 8.83 161.9 539.7 \n", 283 | "4 ... 26 0 22 7.86 8.85 213.3 598.2 \n", 284 | "\n", 285 | " SCM Fv SCM mAb classifier \n", 286 | "0 2522.9 6979.0 0 \n", 287 | "1 1687.7 5731.8 0 \n", 288 | "2 2170.0 6075.0 0 \n", 289 | "3 2406.3 7008.6 1 \n", 290 | "4 1636.9 5795.4 0 \n", 291 | "\n", 292 | "[5 rows x 24 columns]" 293 | ] 294 | }, 295 | "execution_count": 19, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | } 299 | ], 300 | "source": [ 301 | "df_Ab21 = pd.read_csv(os.path.join(os.path.join(DATA_DIR, 'SI'), 'features_values_SI.csv'))\n", 302 | "df_Ab21 = df_Ab21.loc[df_Ab21.Isotype == 'IgG1']\n", 303 | "df_Ab21.rename({'mabs':ENTITY_KEY}, inplace=True, axis=1)\n", 304 | "df_Ab21.reset_index(drop=True, inplace=True)\n", 305 | "\n", 306 | "df_Ab21_2 = pd.read_csv(os.path.join(DATA_DIR, 'Ab21_raw.csv'))\n", 307 | "df_Ab21_merged = df_Ab21_2.merge(df_Ab21, on=ENTITY_KEY)\n", 308 | "\n", 309 | "df_Ab21_merged.to_csv(os.path.join(DATA_DIR, 'Ab21.csv'), index=False)\n", 310 | "print('Number of antibodies in Ab21 set: %d' % len(df_Ab21_merged))\n", 311 | "df_Ab21_merged.head()" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 20, 317 | "id": "ffc22546-f7ff-4f2a-83ed-f68de958afeb", 318 | "metadata": {}, 319 | "outputs": [ 320 | { 321 | "name": "stdout", 322 | "output_type": "stream", 323 | "text": [ 324 | "21 fasta files were saved in FASTA_DIR\n" 325 | ] 326 | } 327 | ], 328 | "source": [ 329 | "# Extract and save fasta entry for each antibody in the SI fasta file into separate file\n", 330 | "\n", 331 | "from Bio.SeqIO.FastaIO import FastaIterator\n", 332 | "\n", 333 | "# extract light and heavy chain sequences from fasta file\n", 334 | "light_chains = {}\n", 335 | "heavy_chains = {}\n", 336 | "with open(os.path.join(os.path.join(DATA_DIR, 'SI'), 'seq_vis_SI.fasta'), 'r') as handle:\n", 337 | " for record in FastaIterator(handle):\n", 338 | " id_fields = record.id.split('_')\n", 339 | " title = id_fields[0]\n", 340 | " if title == 'mAB27': # handle the inconsistent naming in the SI file\n", 341 | " title = 'mAb27'\n", 342 | " chain_type = id_fields[1]\n", 343 | " if chain_type == 'light':\n", 344 | " light_chains[title] = str(record.seq)\n", 345 | " else:\n", 346 | " heavy_chains[title] = str(record.seq)\n", 347 | "\n", 348 | "fasta_files = []\n", 349 | "for k, v in light_chains.items():\n", 350 | " if k in df_Ab21_merged[ENTITY_KEY].values:\n", 351 | " fasta_file = os.path.join(FASTA_DIR, k + '.fasta')\n", 352 | " fasta_files.append(fasta_file)\n", 353 | " with open(fasta_file, 'w') as fptr:\n", 354 | " fptr.write('>' + k + '_VH\\n')\n", 355 | " 
fptr.write(heavy_chains[k] + '\\n')\n", 356 | " fptr.write('>' + k + '_VL\\n')\n", 357 | " fptr.write(v + '\\n')\n", 358 | "\n", 359 | "print('%d fasta files were saved in FASTA_DIR' % len(fasta_files))" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 22, 365 | "id": "4fae4778-49f5-4d48-848e-6f05ee696716", 366 | "metadata": {}, 367 | "outputs": [ 368 | { 369 | "data": { 370 | "text/html": [ 371 | "
\n", 372 | "\n", 385 | "\n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | "
EntityViscosity_at_150IsotypeN_hydrophobic FvN_hydrophobic mAbN_hydrophilic FvN_hydrophilic mAbHI_FvHI_mAbSASA_phobic_Fv...mAbCSPFv_pImAb_pISAP FvSAP mAbSCM FvSCM mAbclassifierLCHC
0mAb114.4IgG1834921207121.0989221.0022463760.633301...408.888.96134.8526.32522.96979.00DIQMTQSPSSLSASVGDRVTITCRASQGIRNYLAWYQQKPGKAPKL...EVQLVESGGGLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLE...
1mAb220.9IgG1854981106901.3640371.0781384469.273438...108.028.75161.4573.41687.75731.80DIQMTQSPSSLSASVGDRVTITCRASQDVSTAVAWYQQKPGKAPKL...EVQLVESGGGLVQPGGSLRLSCAASGFTFSDSWIHWVRQAPGKGLE...
2mAb314.9IgG1804861227201.3176411.0572774007.478271...127.678.71149.5552.12170.06075.00DIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKV...EVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLE...
3mAb493.4IgG1784821227141.1950391.0248953754.267578...-248.198.83161.9539.72406.37008.61DILLTQSPVILSVSPGERVSFSCRASQSIGTNIHWYQQRTNGSPRL...QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE...
4mAb58.6IgG1895041127001.2732851.0528175683.704590...227.868.85213.3598.21636.95795.40EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL...EVQLLESGGGLVQPGGSLRLSCAVSGFTFNSFAMSWVRQAPGKGLE...
\n", 535 | "

5 rows × 26 columns

\n", 536 | "
" 537 | ], 538 | "text/plain": [ 539 | " Entity Viscosity_at_150 Isotype N_hydrophobic Fv N_hydrophobic mAb \\\n", 540 | "0 mAb1 14.4 IgG1 83 492 \n", 541 | "1 mAb2 20.9 IgG1 85 498 \n", 542 | "2 mAb3 14.9 IgG1 80 486 \n", 543 | "3 mAb4 93.4 IgG1 78 482 \n", 544 | "4 mAb5 8.6 IgG1 89 504 \n", 545 | "\n", 546 | " N_hydrophilic Fv N_hydrophilic mAb HI_Fv HI_mAb SASA_phobic_Fv \\\n", 547 | "0 120 712 1.098922 1.002246 3760.633301 \n", 548 | "1 110 690 1.364037 1.078138 4469.273438 \n", 549 | "2 122 720 1.317641 1.057277 4007.478271 \n", 550 | "3 122 714 1.195039 1.024895 3754.267578 \n", 551 | "4 112 700 1.273285 1.052817 5683.704590 \n", 552 | "\n", 553 | " ... mAbCSP Fv_pI mAb_pI SAP Fv SAP mAb SCM Fv SCM mAb classifier \\\n", 554 | "0 ... 40 8.88 8.96 134.8 526.3 2522.9 6979.0 0 \n", 555 | "1 ... 10 8.02 8.75 161.4 573.4 1687.7 5731.8 0 \n", 556 | "2 ... 12 7.67 8.71 149.5 552.1 2170.0 6075.0 0 \n", 557 | "3 ... -24 8.19 8.83 161.9 539.7 2406.3 7008.6 1 \n", 558 | "4 ... 22 7.86 8.85 213.3 598.2 1636.9 5795.4 0 \n", 559 | "\n", 560 | " LC \\\n", 561 | "0 DIQMTQSPSSLSASVGDRVTITCRASQGIRNYLAWYQQKPGKAPKL... \n", 562 | "1 DIQMTQSPSSLSASVGDRVTITCRASQDVSTAVAWYQQKPGKAPKL... \n", 563 | "2 DIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKV... \n", 564 | "3 DILLTQSPVILSVSPGERVSFSCRASQSIGTNIHWYQQRTNGSPRL... \n", 565 | "4 EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... \n", 566 | "\n", 567 | " HC \n", 568 | "0 EVQLVESGGGLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLE... \n", 569 | "1 EVQLVESGGGLVQPGGSLRLSCAASGFTFSDSWIHWVRQAPGKGLE... \n", 570 | "2 EVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLE... \n", 571 | "3 QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE... \n", 572 | "4 EVQLLESGGGLVQPGGSLRLSCAVSGFTFNSFAMSWVRQAPGKGLE... \n", 573 | "\n", 574 | "[5 rows x 26 columns]" 575 | ] 576 | }, 577 | "execution_count": 22, 578 | "metadata": {}, 579 | "output_type": "execute_result" 580 | } 581 | ], 582 | "source": [ 583 | "Ab21_entity_list = []; Ab21_LC_list = []; Ab21_HC_list = []\n", 584 | "for k, v in light_chains.items():\n", 585 | " Ab21_entity_list.append(k)\n", 586 | " Ab21_LC_list.append(v)\n", 587 | " Ab21_HC_list.append(heavy_chains[k])\n", 588 | " \n", 589 | "df_tmp = pd.DataFrame({ENTITY_KEY:Ab21_entity_list, 'LC':Ab21_LC_list, 'HC':Ab21_HC_list})\n", 590 | "\n", 591 | "df_Ab21 = df_Ab21_merged.merge(df_tmp, on=ENTITY_KEY)\n", 592 | "df_Ab21.head()" 593 | ] 594 | }, 595 | { 596 | "cell_type": "markdown", 597 | "id": "023937c8-1c65-44b9-9592-35f140677fc7", 598 | "metadata": {}, 599 | "source": [ 600 | "## Process PDGF38 dataset" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 23, 606 | "id": "5f49968c-43e2-43a7-94cc-b5ffd6ce3626", 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "def get_plos_seq_data(df_seq_plos):\n", 611 | " df_tmp = df_seq_plos.loc[:,'FW1':'FW4']\n", 612 | " df_seq_plos['seq'] = df_tmp.apply(''.join, axis=1)\n", 613 | " entity_to_sequence = {}\n", 614 | " sequences = df_seq_plos['seq'].values\n", 615 | " for _, row in df_seq_plos.iterrows():\n", 616 | " ref_sequence = list(row['seq'])\n", 617 | " entity_to_sequence[row['Name']] = ref_sequence\n", 618 | " for _, row2 in df_seq_plos.iterrows():\n", 619 | " if row2['Name'] == row['Name']: \n", 620 | " continue\n", 621 | " sequence2 = list(row2['seq'])\n", 622 | " sequence2_mod = copy.copy(ref_sequence)\n", 623 | " for idx, (aa1, aa2) in enumerate(zip(ref_sequence, sequence2)):\n", 624 | " if aa2 != '-':\n", 625 | " sequence2_mod[idx] = aa2\n", 626 | " entity_to_sequence[row2['Name']] = 
sequence2_mod\n", 627 | "\n", 628 | " return entity_to_sequence" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 25, 634 | "id": "c59733e5-0c59-473b-b8fe-88cd80c46134", 635 | "metadata": {}, 636 | "outputs": [ 637 | { 638 | "data": { 639 | "text/html": [ 640 | "
\n", 641 | "\n", 654 | "\n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | "
EntityViscosity_at_150SCMHCLC
0AB-001440-2213EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE...SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV...
1R1-002288-2008EVQLLQSGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE...SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV...
2R1-003523-1985EVQLLESGGGLVKPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE...SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV...
3R1-004310-1961EVQLLESGGGLVQPGGSLRLSCRASGFTFSSYAMSWVRQAPGKGLE...SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV...
4R1-005190-1838EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE...SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV...
\n", 708 | "
" 709 | ], 710 | "text/plain": [ 711 | " Entity Viscosity_at_150 SCM \\\n", 712 | "0 AB-001 440 -2213 \n", 713 | "1 R1-002 288 -2008 \n", 714 | "2 R1-003 523 -1985 \n", 715 | "3 R1-004 310 -1961 \n", 716 | "4 R1-005 190 -1838 \n", 717 | "\n", 718 | " HC \\\n", 719 | "0 EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... \n", 720 | "1 EVQLLQSGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... \n", 721 | "2 EVQLLESGGGLVKPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... \n", 722 | "3 EVQLLESGGGLVQPGGSLRLSCRASGFTFSSYAMSWVRQAPGKGLE... \n", 723 | "4 EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... \n", 724 | "\n", 725 | " LC \n", 726 | "0 SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV... \n", 727 | "1 SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV... \n", 728 | "2 SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV... \n", 729 | "3 SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV... \n", 730 | "4 SYELTQPPSVSVSPGQTASITCSGDSLGSYFVHWYQQKPGQSPVLV... " 731 | ] 732 | }, 733 | "execution_count": 25, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "# Extract PDGF sequences and viscosity values from Lai SI: SI/mutants_SI.xlsx\n", 740 | "\n", 741 | "# Extract sequences from PLOS SI\n", 742 | "df_plos_lc = pd.read_csv('data/PDGF38_light.csv', sep='\\t')\n", 743 | "df_plos_hc = pd.read_csv('data/PDGF38_heavy.csv', sep='\\t')\n", 744 | "\n", 745 | "entity_to_sequence_hc = get_plos_seq_data(df_plos_hc)\n", 746 | "entity_to_sequence_lc = get_plos_seq_data(df_plos_lc)\n", 747 | "plos_data = [(k, ''.join(entity_to_sequence_hc[k]), ''.join(v)) for k, v in entity_to_sequence_lc.items()]\n", 748 | "df_plos = pd.DataFrame(plos_data, columns=[ENTITY_KEY, 'HC', 'LC'])\n", 749 | "\n", 750 | "# Lai SI\n", 751 | "df_PDGF38_raw = pd.read_csv(os.path.join(DATA_DIR, 'PDGF38_raw.csv'))\n", 752 | "\n", 753 | "xls = open(os.path.join(DATA_DIR, 'SI/mutants_SI.xlsx'), 'rb')\n", 754 | "df_PDGF38_sheet3 = pd.read_excel(xls, 'result')\n", 755 | "df_PDGF38_sheet3.rename({'Unnamed: 0':ENTITY_KEY}, inplace=True, axis=1)\n", 756 | "\n", 757 | "df_PDGF38 = df_PDGF38_raw.merge(df_plos, on=ENTITY_KEY)\n", 758 | "\n", 759 | "df_PDGF38 = df_PDGF38[[ENTITY_KEY, VISCOSITY_KEY, 'SCM', 'HC', 'LC']]\n", 760 | "df_PDGF38.to_csv(os.path.join(DATA_DIR, 'PDGF38.csv'), index=False)\n", 761 | "\n", 762 | "df_PDGF38.head()" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": 26, 768 | "id": "b2da622c-9513-4155-b110-d9e3718f7319", 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "name": "stdout", 773 | "output_type": "stream", 774 | "text": [ 775 | "38 fasta files were saved in FASTA_DIR\n" 776 | ] 777 | } 778 | ], 779 | "source": [ 780 | "fasta_files = []\n", 781 | "for entity, hc, lc in zip(df_PDGF38[ENTITY_KEY].values, df_PDGF38['HC'].values, df_PDGF38['LC'].values):\n", 782 | " if 'R1-001' in entity:\n", 783 | " print(entity)\n", 784 | " fasta_file = os.path.join(FASTA_DIR, entity + '.fasta')\n", 785 | " fasta_files.append(fasta_file)\n", 786 | " with open(fasta_file, 'w') as fptr:\n", 787 | " fptr.write('>' + entity + '_VH\\n')\n", 788 | " fptr.write(hc + '\\n')\n", 789 | " fptr.write('>' + entity + '_VL\\n')\n", 790 | " fptr.write(lc + '\\n')\n", 791 | "\n", 792 | "print('%d fasta files were saved in FASTA_DIR' % len(fasta_files))\n" 793 | ] 794 | }, 795 | { 796 | "cell_type": "markdown", 797 | "id": "c5db5ded-b3f2-479d-ac02-1812473ccc5b", 798 | "metadata": {}, 799 | "source": [ 800 | "### Prepare Ab8 dataset" 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": 27, 806 
| "id": "32c95ecb-b93e-4796-8485-580e708f3ae9", 807 | "metadata": {}, 808 | "outputs": [], 809 | "source": [ 810 | "# Extract supplementary data from Lai et al. MABS 2021, VOL. 13, NO. 1, e1991256 (19 pages) \n", 811 | "# https://www.tandfonline.com/doi/suppl/10.1080/19420862.2021.1991256/suppl_file/kmab_a_1991256_sm4057.zip\n", 812 | "#\n", 813 | "# 1. copy kmab_a_1991256_sm4057.zip to data\n", 814 | "# 2. unzip kmab_a_1991256_sm4057.zip - this will extract files in data directory" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 107, 820 | "id": "a06a8a16-966a-406f-aa9e-cb41ffcc8293", 821 | "metadata": {}, 822 | "outputs": [ 823 | { 824 | "data": { 825 | "text/html": [ 826 | "
\n", 827 | "\n", 840 | "\n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | "
Clone NameEntityISOTYPEHCLCUnnamed: 5Variable Domain SourceSource DetailsHC ClassHFR1...VHLC ClassLFR1CDRL1LFR2CDRL2LFR3CDRL3LFR4VL
0TGN1412 analogTGN1412IgG1 / KappaQVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYIHWVRQAPGQGLE...DIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL...NaNPDB1YJDIgG1QVQLVQSGAEVKKPGASVKVSCKAS...QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYIHWVRQAPGQGLE...KappaDIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKLLIYKASNLHTGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQGQTYPYTFGGGTKVEIKDIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL...
1Avastin analogBevacizumabIgG1 / KappaEVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLE...DIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKV...NaNPDB1BJ1IgG1EVQLVESGGGLVQPGGSLRLSCAAS...EVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLE...KappaDIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKVLIYFTSSLHSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQYSTVPWTFGQGTKVEIKDIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKV...
2Herceptin analogTrastuzumabIgG1 / KappaEVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...NaNPDB1N8ZIgG1EVQLVESGGGLVQPGGSLRLSCAAS...EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...KappaDIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSRSGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGQGTKVEIKDIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...
3Basiliximab analogBasiliximabIgG1 / KappaQLQQSGTVLARPGASVKMSCKASGYSFTRYWMHWIKQRPGQGLEWI...QIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW...NaNPDB1MIMIgG1QLQQSGTVLARPGASVKMSCKAS...QLQQSGTVLARPGASVKMSCKASGYSFTRYWMHWIKQRPGQGLEWI...KappaQIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRWIYDTSKLASGVPARFSGSGSGTSYSLTISSMEAEDAATYYCHQRSSYTFGGGTKLEIKQIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW...
4Natalizumab analogNatalizumabIgG1 / KappaQVQLVQSGAEVKKPGASVKVSCKASGFNIKDTYIHWVRQAPGQRLE...DIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL...NaNUS PatentUS5840299AIgG1QVQLVQSGAEVKKPGASVKVSCKAS...QVQLVQSGAEVKKPGASVKVSCKASGFNIKDTYIHWVRQAPGQRLE...KappaDIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRLLIHYTSALQPGIPSRFSGSGSGRDYTFTISSLQPEDIATYYCLQYDNLWTFGQGTKVEIKDIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL...
\n", 990 | "

5 rows × 26 columns

\n", 991 | "
" 992 | ], 993 | "text/plain": [ 994 | " Clone Name Entity ISOTYPE \\\n", 995 | "0 TGN1412 analog TGN1412 IgG1 / Kappa \n", 996 | "1 Avastin analog Bevacizumab IgG1 / Kappa \n", 997 | "2 Herceptin analog Trastuzumab IgG1 / Kappa \n", 998 | "3 Basiliximab analog Basiliximab IgG1 / Kappa \n", 999 | "4 Natalizumab analog Natalizumab IgG1 / Kappa \n", 1000 | "\n", 1001 | " HC \\\n", 1002 | "0 QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYIHWVRQAPGQGLE... \n", 1003 | "1 EVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLE... \n", 1004 | "2 EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE... \n", 1005 | "3 QLQQSGTVLARPGASVKMSCKASGYSFTRYWMHWIKQRPGQGLEWI... \n", 1006 | "4 QVQLVQSGAEVKKPGASVKVSCKASGFNIKDTYIHWVRQAPGQRLE... \n", 1007 | "\n", 1008 | " LC Unnamed: 5 \\\n", 1009 | "0 DIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL... NaN \n", 1010 | "1 DIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKV... NaN \n", 1011 | "2 DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL... NaN \n", 1012 | "3 QIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW... NaN \n", 1013 | "4 DIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL... NaN \n", 1014 | "\n", 1015 | " Variable Domain Source Source Details HC Class HFR1 \\\n", 1016 | "0 PDB 1YJD IgG1 QVQLVQSGAEVKKPGASVKVSCKAS \n", 1017 | "1 PDB 1BJ1 IgG1 EVQLVESGGGLVQPGGSLRLSCAAS \n", 1018 | "2 PDB 1N8Z IgG1 EVQLVESGGGLVQPGGSLRLSCAAS \n", 1019 | "3 PDB 1MIM IgG1 QLQQSGTVLARPGASVKMSCKAS \n", 1020 | "4 US Patent US5840299A IgG1 QVQLVQSGAEVKKPGASVKVSCKAS \n", 1021 | "\n", 1022 | " ... VH LC Class \\\n", 1023 | "0 ... QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYIHWVRQAPGQGLE... Kappa \n", 1024 | "1 ... EVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLE... Kappa \n", 1025 | "2 ... EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE... Kappa \n", 1026 | "3 ... QLQQSGTVLARPGASVKMSCKASGYSFTRYWMHWIKQRPGQGLEWI... Kappa \n", 1027 | "4 ... QVQLVQSGAEVKKPGASVKVSCKASGFNIKDTYIHWVRQAPGQRLE... Kappa \n", 1028 | "\n", 1029 | " LFR1 CDRL1 LFR2 CDRL2 \\\n", 1030 | "0 DIQMTQSPSSLSASVGDRVTITC HASQNIYVWLN WYQQKPGKAPKLLIY KASNLHT \n", 1031 | "1 DIQMTQSPSSLSASVGDRVTITC SASQDISNYLN WYQQKPGKAPKVLIY FTSSLHS \n", 1032 | "2 DIQMTQSPSSLSASVGDRVTITC RASQDVNTAVA WYQQKPGKAPKLLIY SASFLYS \n", 1033 | "3 QIVSTQSPAIMSASPGEKVTMTC SASSSRSYMQ WYQQKPGTSPKRWIY DTSKLAS \n", 1034 | "4 DIQMTQSPSSLSASVGDRVTITC KTSQDINKYMA WYQQTPGKAPRLLIH YTSALQP \n", 1035 | "\n", 1036 | " LFR3 CDRL3 LFR4 \\\n", 1037 | "0 GVPSRFSGSGSGTDFTLTISSLQPEDFATYYC QQGQTYPYT FGGGTKVEIK \n", 1038 | "1 GVPSRFSGSGSGTDFTLTISSLQPEDFATYYC QQYSTVPWT FGQGTKVEIK \n", 1039 | "2 GVPSRFSGSRSGTDFTLTISSLQPEDFATYYC QQHYTTPPT FGQGTKVEIK \n", 1040 | "3 GVPARFSGSGSGTSYSLTISSMEAEDAATYYC HQRSSYT FGGGTKLEIK \n", 1041 | "4 GIPSRFSGSGSGRDYTFTISSLQPEDIATYYC LQYDNLWT FGQGTKVEIK \n", 1042 | "\n", 1043 | " VL \n", 1044 | "0 DIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL... \n", 1045 | "1 DIQMTQSPSSLSASVGDRVTITCSASQDISNYLNWYQQKPGKAPKV... \n", 1046 | "2 DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL... \n", 1047 | "3 QIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW... \n", 1048 | "4 DIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL... 
\n", 1049 | "\n", 1050 | "[5 rows x 26 columns]" 1051 | ] 1052 | }, 1053 | "execution_count": 107, 1054 | "metadata": {}, 1055 | "output_type": "execute_result" 1056 | } 1057 | ], 1058 | "source": [ 1059 | "df_Ab14 = pd.read_excel(open(os.path.join(DATA_DIR, 'supplemental Table.xlsx'), 'rb'), sheet_name='Sequence Listing')\n", 1060 | "df_Ab14 = df_Ab14.loc[df_Ab14.ISOTYPE == 'IgG1 / Kappa']\n", 1061 | "df_Ab14.reset_index(drop=True, inplace=True)\n", 1062 | "\n", 1063 | "df_Ab14_VH = pd.read_excel(open(os.path.join(DATA_DIR, 'supplemental Table.xlsx'), 'rb'), sheet_name='VHs')\n", 1064 | "df_Ab14_VH = df_Ab14_VH.loc[df_Ab14_VH['HC Class'] == 'IgG1']\n", 1065 | "df_Ab14_VH['VH'] = df_Ab14_VH.apply(lambda x: x['HFR1'] + x['CDRH1'] + x['HFR2'] + x['CDRH2'] + x['HFR3'] + x['CDRH3'] + x['HFR4'], axis=1)\n", 1066 | "\n", 1067 | "df_Ab14_VL = pd.read_excel(open(os.path.join(DATA_DIR, 'supplemental Table.xlsx'), 'rb'), sheet_name='VLs')\n", 1068 | "df_Ab14_VL['VL'] = df_Ab14_VL.apply(lambda x: x['LFR1'] + x['CDRL1'] + x['LFR2'] + x['CDRL2'] + x['LFR3'] + x['CDRL3'] + x['LFR4'], axis=1)\n", 1069 | "df_Ab14_VL.drop_duplicates(inplace=True)\n", 1070 | "df_Ab14_VL.head()\n", 1071 | "\n", 1072 | "df_Ab14 = df_Ab14.merge(df_Ab14_VH, on='mAb')\n", 1073 | "df_Ab14 = df_Ab14.merge(df_Ab14_VL, on='mAb')\n", 1074 | "\n", 1075 | "df_Ab14.rename({'mAb':ENTITY_KEY, 'Amino Acids, Mature Heavy Chain':'HC', \n", 1076 | " 'Amino Acids, Mature Light Chain':'LC'}, inplace=True, axis=1)\n", 1077 | "\n", 1078 | "df_Ab14.head()" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "code", 1083 | "execution_count": 108, 1084 | "id": "52f3df41-875f-4bef-8eee-60f886703bcf", 1085 | "metadata": { 1086 | "tags": [] 1087 | }, 1088 | "outputs": [ 1089 | { 1090 | "data": { 1091 | "text/html": [ 1092 | "
\n", 1093 | "\n", 1106 | "\n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | "
Clone NameEntityISOTYPEHCLCUnnamed: 5Variable Domain SourceSource DetailsHC ClassHFR1...LC ClassLFR1CDRL1LFR2CDRL2LFR3CDRL3LFR4VLmatch
0TGN1412 analogTGN1412IgG1 / KappaQVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYIHWVRQAPGQGLE...DIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL...NaNPDB1YJDIgG1QVQLVQSGAEVKKPGASVKVSCKAS...KappaDIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKLLIYKASNLHTGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQGQTYPYTFGGGTKVEIKDIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL...False
1Basiliximab analogBasiliximabIgG1 / KappaQLQQSGTVLARPGASVKMSCKASGYSFTRYWMHWIKQRPGQGLEWI...QIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW...NaNPDB1MIMIgG1QLQQSGTVLARPGASVKMSCKAS...KappaQIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRWIYDTSKLASGVPARFSGSGSGTSYSLTISSMEAEDAATYYCHQRSSYTFGGGTKLEIKQIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW...False
2Natalizumab analogNatalizumabIgG1 / KappaQVQLVQSGAEVKKPGASVKVSCKASGFNIKDTYIHWVRQAPGQRLE...DIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL...NaNUS PatentUS5840299AIgG1QVQLVQSGAEVKKPGASVKVSCKAS...KappaDIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRLLIHYTSALQPGIPSRFSGSGSGRDYTFTISSLQPEDIATYYCLQYDNLWTFGQGTKVEIKDIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL...False
3Tremelimumab analogTremelimumabIgG1 / KappaQVQLVESGGGVVQPGRSLRLSCAASGFTFSSYGMHWVRQAPGKGLE...DIQMTQSPSSLSASVGDRVTITCRASQSINSYLDWYQQKPGKAPKL...NaNUS PatentUS6682736IgG1QVQLVESGGGVVQPGRSLRLSCAAS...KappaDIQMTQSPSSLSASVGDRVTITCRASQSINSYLDWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQYYSTPFTFGPGTKVEIKDIQMTQSPSSLSASVGDRVTITCRASQSINSYLDWYQQKPGKAPKL...False
4Ipilimumab analogIpilimumabIgG1 / KappaQVQLVESGGGVVQPGRSLRLSCAASGFTFSSYTMHWVRQAPGKGLE...EIVLTQSPGTLSLSPGERATLSCRASQSVGSSYLAWYQQKPGQAPR...NaNUS PatentUS6984720IgG1QVQLVESGGGVVQPGRSLRLSCAAS...KappaEIVLTQSPGTLSLSPGERATLSCRASQSVGSSYLAWYQQKPGQAPRLLIYGAFSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYGSSPWTFGQGTKVEIKEIVLTQSPGTLSLSPGERATLSCRASQSVGSSYLAWYQQKPGQAPR...False
\n", 1256 | "

5 rows × 27 columns

\n", 1257 | "
" 1258 | ], 1259 | "text/plain": [ 1260 | " Clone Name Entity ISOTYPE \\\n", 1261 | "0 TGN1412 analog TGN1412 IgG1 / Kappa \n", 1262 | "1 Basiliximab analog Basiliximab IgG1 / Kappa \n", 1263 | "2 Natalizumab analog Natalizumab IgG1 / Kappa \n", 1264 | "3 Tremelimumab analog Tremelimumab IgG1 / Kappa \n", 1265 | "4 Ipilimumab analog Ipilimumab IgG1 / Kappa \n", 1266 | "\n", 1267 | " HC \\\n", 1268 | "0 QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYIHWVRQAPGQGLE... \n", 1269 | "1 QLQQSGTVLARPGASVKMSCKASGYSFTRYWMHWIKQRPGQGLEWI... \n", 1270 | "2 QVQLVQSGAEVKKPGASVKVSCKASGFNIKDTYIHWVRQAPGQRLE... \n", 1271 | "3 QVQLVESGGGVVQPGRSLRLSCAASGFTFSSYGMHWVRQAPGKGLE... \n", 1272 | "4 QVQLVESGGGVVQPGRSLRLSCAASGFTFSSYTMHWVRQAPGKGLE... \n", 1273 | "\n", 1274 | " LC Unnamed: 5 \\\n", 1275 | "0 DIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL... NaN \n", 1276 | "1 QIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW... NaN \n", 1277 | "2 DIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL... NaN \n", 1278 | "3 DIQMTQSPSSLSASVGDRVTITCRASQSINSYLDWYQQKPGKAPKL... NaN \n", 1279 | "4 EIVLTQSPGTLSLSPGERATLSCRASQSVGSSYLAWYQQKPGQAPR... NaN \n", 1280 | "\n", 1281 | " Variable Domain Source Source Details HC Class HFR1 \\\n", 1282 | "0 PDB 1YJD IgG1 QVQLVQSGAEVKKPGASVKVSCKAS \n", 1283 | "1 PDB 1MIM IgG1 QLQQSGTVLARPGASVKMSCKAS \n", 1284 | "2 US Patent US5840299A IgG1 QVQLVQSGAEVKKPGASVKVSCKAS \n", 1285 | "3 US Patent US6682736 IgG1 QVQLVESGGGVVQPGRSLRLSCAAS \n", 1286 | "4 US Patent US6984720 IgG1 QVQLVESGGGVVQPGRSLRLSCAAS \n", 1287 | "\n", 1288 | " ... LC Class LFR1 CDRL1 LFR2 \\\n", 1289 | "0 ... Kappa DIQMTQSPSSLSASVGDRVTITC HASQNIYVWLN WYQQKPGKAPKLLIY \n", 1290 | "1 ... Kappa QIVSTQSPAIMSASPGEKVTMTC SASSSRSYMQ WYQQKPGTSPKRWIY \n", 1291 | "2 ... Kappa DIQMTQSPSSLSASVGDRVTITC KTSQDINKYMA WYQQTPGKAPRLLIH \n", 1292 | "3 ... Kappa DIQMTQSPSSLSASVGDRVTITC RASQSINSYLD WYQQKPGKAPKLLIY \n", 1293 | "4 ... Kappa EIVLTQSPGTLSLSPGERATLSC RASQSVGSSYLA WYQQKPGQAPRLLIY \n", 1294 | "\n", 1295 | " CDRL2 LFR3 CDRL3 LFR4 \\\n", 1296 | "0 KASNLHT GVPSRFSGSGSGTDFTLTISSLQPEDFATYYC QQGQTYPYT FGGGTKVEIK \n", 1297 | "1 DTSKLAS GVPARFSGSGSGTSYSLTISSMEAEDAATYYC HQRSSYT FGGGTKLEIK \n", 1298 | "2 YTSALQP GIPSRFSGSGSGRDYTFTISSLQPEDIATYYC LQYDNLWT FGQGTKVEIK \n", 1299 | "3 AASSLQS GVPSRFSGSGSGTDFTLTISSLQPEDFATYYC QQYYSTPFT FGPGTKVEIK \n", 1300 | "4 GAFSRAT GIPDRFSGSGSGTDFTLTISRLEPEDFAVYYC QQYGSSPWT FGQGTKVEIK \n", 1301 | "\n", 1302 | " VL match \n", 1303 | "0 DIQMTQSPSSLSASVGDRVTITCHASQNIYVWLNWYQQKPGKAPKL... False \n", 1304 | "1 QIVSTQSPAIMSASPGEKVTMTCSASSSRSYMQWYQQKPGTSPKRW... False \n", 1305 | "2 DIQMTQSPSSLSASVGDRVTITCKTSQDINKYMAWYQQTPGKAPRL... False \n", 1306 | "3 DIQMTQSPSSLSASVGDRVTITCRASQSINSYLDWYQQKPGKAPKL... False \n", 1307 | "4 EIVLTQSPGTLSLSPGERATLSCRASQSVGSSYLAWYQQKPGQAPR... 
False \n", 1308 | "\n", 1309 | "[5 rows x 27 columns]" 1310 | ] 1311 | }, 1312 | "execution_count": 108, 1313 | "metadata": {}, 1314 | "output_type": "execute_result" 1315 | } 1316 | ], 1317 | "source": [ 1318 | "def vl_vh_match_in_Ab21(x):\n", 1319 | " for vl, vh in zip(df_Ab21.LC.values, df_Ab21.HC.values):\n", 1320 | " if x['VL'] in vl and x['VH'] in vh:\n", 1321 | " return True\n", 1322 | " return False\n", 1323 | "\n", 1324 | "df_Ab14['match'] = df_Ab14.apply(lambda x: vl_vh_match_in_Ab21(x), axis=1)\n", 1325 | "df_Ab8 = df_Ab14[~df_Ab14.match]\n", 1326 | "df_Ab8.reset_index(drop=True, inplace=True)\n", 1327 | "df_Ab8.head()" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "code", 1332 | "execution_count": 109, 1333 | "id": "624e9d02-b403-4a7b-b316-eec252fbc93b", 1334 | "metadata": {}, 1335 | "outputs": [ 1336 | { 1337 | "name": "stdout", 1338 | "output_type": "stream", 1339 | "text": [ 1340 | "8 fasta files were saved in FASTA_DIR\n" 1341 | ] 1342 | } 1343 | ], 1344 | "source": [ 1345 | "fasta_files = []\n", 1346 | "for entity, hc, lc in zip(df_Ab8[ENTITY_KEY].values, df_Ab8['HC'].values, df_Ab8['LC'].values):\n", 1347 | " fasta_file = os.path.join(FASTA_DIR, entity + '.fasta')\n", 1348 | " fasta_files.append(fasta_file)\n", 1349 | " with open(fasta_file, 'w') as fptr:\n", 1350 | " fptr.write('>' + entity + '_VH\\n')\n", 1351 | " fptr.write(hc + '\\n')\n", 1352 | " fptr.write('>' + entity + '_VL\\n')\n", 1353 | " fptr.write(lc + '\\n')\n", 1354 | "\n", 1355 | "print('%d fasta files were saved in FASTA_DIR' % len(fasta_files))" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "execution_count": 111, 1361 | "id": "14146155-5238-44fd-8b5a-a45f34c16c8e", 1362 | "metadata": {}, 1363 | "outputs": [ 1364 | { 1365 | "name": "stdout", 1366 | "output_type": "stream", 1367 | "text": [ 1368 | "Number of antibodies in Ab8 set: 8\n" 1369 | ] 1370 | }, 1371 | { 1372 | "data": { 1373 | "text/html": [ 1374 | "
\n", 1375 | "\n", 1388 | "\n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | "
EntityViscosity_at_150SCM
0TGN141216.42844.6
1Basiliximab25.05640.8
2Natalizumab13.67815.5
3Tremelimumab8.80704.2
4Ipilimumab8.60754.0
\n", 1430 | "
" 1431 | ], 1432 | "text/plain": [ 1433 | " Entity Viscosity_at_150 SCM\n", 1434 | "0 TGN1412 16.42 844.6\n", 1435 | "1 Basiliximab 25.05 640.8\n", 1436 | "2 Natalizumab 13.67 815.5\n", 1437 | "3 Tremelimumab 8.80 704.2\n", 1438 | "4 Ipilimumab 8.60 754.0" 1439 | ] 1440 | }, 1441 | "execution_count": 111, 1442 | "metadata": {}, 1443 | "output_type": "execute_result" 1444 | } 1445 | ], 1446 | "source": [ 1447 | "# Save viscosity and other computed properties\n", 1448 | "Ab8_entities = ['TGN1412', 'Basiliximab', 'Natalizumab', 'Tremelimumab', 'Ipilimumab', 'Atezolizumab', 'Ganitumab', 'Vesencumab']\n", 1449 | "Ab8_visc = [16.42, 25.05, 13.67, 8.8, 8.6, 11.56, 10.1, 23.57]\n", 1450 | "Ab8_SCM = [844.6, 640.8, 815.5, 704.2, 754, 759.6, 806.5, 661.3]\n", 1451 | "df_Ab8_visc = pd.DataFrame({ENTITY_KEY: Ab8_entities, VISCOSITY_KEY: Ab8_visc, 'SCM': Ab8_SCM})\n", 1452 | "\n", 1453 | "df_Ab8_visc.to_csv(os.path.join(DATA_DIR, 'Ab8.csv'), index=False)\n", 1454 | "print('Number of antibodies in Ab8 set: %d' % len(df_Ab8_visc))\n", 1455 | "df_Ab8_visc.head()" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "code", 1460 | "execution_count": null, 1461 | "id": "cfe0c281-5a01-48a5-9f28-93bcf517ab9c", 1462 | "metadata": {}, 1463 | "outputs": [], 1464 | "source": [] 1465 | } 1466 | ], 1467 | "metadata": { 1468 | "kernelspec": { 1469 | "display_name": "Python 3 (ipykernel)", 1470 | "language": "python", 1471 | "name": "python3" 1472 | }, 1473 | "language_info": { 1474 | "codemirror_mode": { 1475 | "name": "ipython", 1476 | "version": 3 1477 | }, 1478 | "file_extension": ".py", 1479 | "mimetype": "text/x-python", 1480 | "name": "python", 1481 | "nbconvert_exporter": "python", 1482 | "pygments_lexer": "ipython3", 1483 | "version": "3.9.12" 1484 | } 1485 | }, 1486 | "nbformat": 4, 1487 | "nbformat_minor": 5 1488 | } 1489 | -------------------------------------------------------------------------------- /notebooks/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfizer-opensource/pfabnet-viscosity/60970a752a3e74cc336db13576a9c3a21448fe2e/notebooks/__init__.py -------------------------------------------------------------------------------- /pfabnet/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataset import ViscosityDataset 2 | 3 | -------------------------------------------------------------------------------- /pfabnet/base.py: -------------------------------------------------------------------------------- 1 | ENTITY_KEY = 'Entity' 2 | VISCOSITY_KEY = 'Viscosity_at_150' 3 | SCHRODINGER_BASE = '/localscratch/software/schrodinger/adv-2021-2' 4 | 5 | 6 | def get_file_path(): 7 | return __file__ 8 | 9 | -------------------------------------------------------------------------------- /pfabnet/dataset.py: -------------------------------------------------------------------------------- 1 | from torch.utils.data import Dataset 2 | import torch 3 | 4 | class ViscosityDataset(Dataset): 5 | def __init__(self, X, y): 6 | self.X = X 7 | self.y = y 8 | 9 | def __getitem__(self, index): 10 | return torch.Tensor(self.X[index]), torch.Tensor([self.y[index]]) 11 | 12 | def __len__(self): 13 | return len(self.y) 14 | 15 | -------------------------------------------------------------------------------- /pfabnet/esp_generator.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import argparse 4 | import pickle 5 | 6 | import 
numpy as np 7 | from openeye import oechem 8 | from utils import generate_esp_grids 9 | 10 | 11 | 12 | if __name__ == '__main__': 13 | parser = argparse.ArgumentParser( 14 | description='Generate PfAbNet ESP grid input') 15 | parser.add_argument('--input_mols_dir', type=str, default='./', 16 | help='directory containing antibody structures/models') 17 | parser.add_argument('--esp_output_dir', type=str, default='./', 18 | help='directory to save the generated ESP grid files') 19 | parser.add_argument('--grid_dim', type=int, default=96, 20 | help='number of grid points along each axis (default = 96)') 21 | parser.add_argument('--grid_spacing', type=float, default=0.75, 22 | help='spacing between grid points (default = 0.75 Angstrom)') 23 | parser.add_argument('--shell_width', type=float, default=2.0, 24 | help='thickness of the surface shell (default 2.0 Angstrom)') 25 | parser.add_argument('--NX', type=int, default=10, 26 | help='augmentation level (default 10x)') 27 | parser.add_argument('--processors', type=int, default=10, 28 | help='Number of CPUs for ESP grid calculation (default 10)') 29 | parser.add_argument('--seed', type=int, default=42, 30 | help='random seed (default 42)') 31 | 32 | parser.add_argument('-v', '--verbose', action='count', default=0) 33 | in_args = parser.parse_args() 34 | 35 | input_mols_dir = in_args.input_mols_dir 36 | esp_dir = in_args.esp_output_dir 37 | seed = in_args.seed 38 | 39 | args = in_args.__dict__ 40 | np.random.seed(seed) 41 | 42 | try: 43 | os.mkdir(esp_dir) 44 | except Exception as e: 45 | pass 46 | 47 | mol_files = glob.glob(input_mols_dir + '/*.mol2') 48 | for mol_file in mol_files: 49 | print(mol_file) 50 | output = generate_esp_grids(args, mol_file) 51 | for idx, (esp_grid, output_mol) in enumerate(output): 52 | output_dir = os.path.join(esp_dir, 'rotation_%d' % (idx + 1)) 53 | try: 54 | os.mkdir(output_dir) 55 | except Exception as e: 56 | pass 57 | 58 | base_mol_file = os.path.basename(mol_file).split('.mol2')[0] 59 | esp_file = os.path.join(output_dir, base_mol_file + '.pyb') 60 | with open(esp_file, 'wb') as fptr: 61 | pickle.dump(esp_grid, fptr) 62 | 63 | output_mol_file = os.path.join(output_dir, os.path.basename(mol_file)) 64 | 65 | ofs = oechem.oemolostream(output_mol_file) 66 | oechem.OEWriteConstMolecule(ofs, output_mol) 67 | ofs.close() 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /pfabnet/generate_attributions.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import glob 4 | import argparse 5 | 6 | import numpy as np 7 | import torch 8 | 9 | from openeye import oechem 10 | from openeye import oegrid 11 | 12 | from model import ViscosityNet 13 | from utils import seed_everything 14 | from utils import GRID_DIM_KEY, GRID_SPACING_KEY, ESP_DIR_KEY, HOMOLOGY_MODEL_DIR_KEY 15 | from utils import get_molecule, calculate_attribution_grid 16 | from utils import get_esp_grids, generate_esp_grids 17 | from base import ENTITY_KEY 18 | 19 | 20 | 21 | device = 'cpu' 22 | if torch.cuda.is_available(): 23 | device = torch.cuda.current_device() 24 | 25 | def get_cnn_models(args, model_files): 26 | models = [] 27 | for model_file in model_files: 28 | model = ViscosityNet(args['grid_dim']) 29 | if os.path.exists(model_file): 30 | print('loading %s...' 
% model_file) 31 | model.load_state_dict(torch.load(model_file)) 32 | model.eval() 33 | 34 | model = torch.nn.DataParallel(model).to(device) 35 | models.append(model) 36 | 37 | return models 38 | 39 | 40 | def overlay(reference_mol, fit_mol, attribution_mol=None): 41 | alignment = oechem.OEGetAlignment(reference_mol, fit_mol) 42 | rot = oechem.OEDoubleArray(9) 43 | trans = oechem.OEDoubleArray(3) 44 | oechem.OERMSD(reference_mol, fit_mol, alignment, True, True, rot, trans) 45 | oechem.OERotate(fit_mol, rot) 46 | oechem.OETranslate(fit_mol, trans) 47 | 48 | if attribution_mol is not None: 49 | oechem.OERotate(attribution_mol, rot) 50 | oechem.OETranslate(attribution_mol, trans) 51 | 52 | 53 | def get_attribution_mol(args, attribution_grid): 54 | attribution_mol = oechem.OEGraphMol() 55 | grid_dim, grid_spacing = args[GRID_DIM_KEY], args[GRID_SPACING_KEY] 56 | significant_thres = args['significant_attribution_threshold'] 57 | grid = oegrid.OEScalarGrid(grid_dim, grid_dim, grid_dim, 0.0, 0.0, 0.0, grid_spacing) 58 | for i in range(grid_dim): 59 | for j in range(grid_dim): 60 | for k in range(grid_dim): 61 | gradient = attribution_grid[0][0][i][j][k] 62 | 63 | if np.abs(attribution_grid[0][0][i][j][k]) > significant_thres: 64 | x, y, z = grid.GridIdxToSpatialCoord(i, j, k) 65 | if gradient > 0.0: 66 | atom = attribution_mol.NewAtom(oechem.OEElemNo_O) 67 | else: 68 | atom = attribution_mol.NewAtom(oechem.OEElemNo_N) 69 | atom.SetPartialCharge(attribution_grid[0][0][i][j][k]) 70 | attribution_mol.SetCoords(atom, oechem.OEFloatArray([x, y, z])) 71 | 72 | return attribution_mol 73 | 74 | 75 | def generate_attributions(args, models): 76 | def save_molecule(f, mol): 77 | ofs = oechem.oemolostream(f) 78 | oechem.OEWriteMolecule(ofs, mol) 79 | ofs.close() 80 | 81 | reference_mol = get_molecule(args['reference_structure_file'], perceive_residue=True, center_mol=False) 82 | 83 | df = pd.read_csv(args['test_data_file']) 84 | 85 | hm_model_dir = args[HOMOLOGY_MODEL_DIR_KEY] 86 | output_attribution_dir = args['output_attribution_dir'] 87 | for row_idx, row in df.iterrows(): 88 | if args['process_structure_index'] >= 0 and args['process_structure_index'] != row_idx: 89 | continue 90 | mol_file = os.path.join(hm_model_dir, row[ENTITY_KEY] + '.mol2') 91 | if len(args[ESP_DIR_KEY]) > 0: 92 | esp_grids = get_esp_grids(args, mol_file) 93 | else: 94 | esp_grids = generate_esp_grids(args, mol_file) 95 | 96 | for grid_idx, (esp_grid, mol) in enumerate(esp_grids): 97 | for model_idx, model in enumerate(models): 98 | print('processing... 
row_idx: %d grid_idx: %d, model_idx: %d'
 99 |                       % (row_idx, grid_idx, model_idx))
100 |                 mol2 = oechem.OEGraphMol(mol)
101 |                 oechem.OEPerceiveResidues(mol2)
102 |                 attribution_grid, _ = calculate_attribution_grid(model, esp_grid, device)
103 |                 attribution_mol = get_attribution_mol(args, attribution_grid)
104 |                 overlay(reference_mol, mol2, attribution_mol)
105 |                 outfile_base = os.path.join(output_attribution_dir,
106 |                                             '%s_%d_%d' % (row[ENTITY_KEY], grid_idx, model_idx))
107 |                 save_molecule(outfile_base + '.mol2', mol2)
108 |                 save_molecule(outfile_base + '.oeb.gz', attribution_mol)
109 |                 if len(args[ESP_DIR_KEY]) > 0:
110 |                     pdb_file = mol_file.split('.mol2')[0] + '.pdb'
111 |                     pdb_mol = get_molecule(pdb_file, perceive_residue=False, center_mol=False)
112 |                     oechem.OEPerceiveResidues(mol2)
113 |                     overlay(mol2, pdb_mol)
114 |                     save_molecule(outfile_base + '.pdb', pdb_mol)
115 | 
116 | 
117 | 
118 | 
119 | 
120 | if __name__ == "__main__":
121 |     parser = argparse.ArgumentParser(
122 |         description='Generate attributions using PfAbNet models')
123 |     parser.add_argument('--test_data_file', type=str, help='test set csv files with entity names')
124 |     parser.add_argument('--reference_structure_file', type=str,
125 |                         help='align each generated attribution molecule to the reference molecule (.mol2)')
126 |     parser.add_argument('--homology_model_dir', type=str, help='homology model directory')
127 |     parser.add_argument('--PfAbNet_model_prefix', type=str, default='PfAbNet', help='PfAbNet model prefix')
128 |     parser.add_argument('--PfAbNet_model_dir', type=str, help='PfAbNet model directory')
129 |     parser.add_argument('--grid_dim', type=int, default=96,
130 |                         help='number of grid points along each axis (default = 96)')
131 |     parser.add_argument('--grid_spacing', type=float, default=0.75,
132 |                         help='spacing between grid points (default = 0.75 Angstrom)')
133 |     parser.add_argument('--shell_width', type=float, default=2.0,
134 |                         help='thickness of the surface shell (default 2.0 Angstrom)')
135 |     parser.add_argument('--NX', type=int, default=10,
136 |                         help='number of rotated structures for each input structure (default 10)')
137 |     parser.add_argument('--processors', type=int, default=5,
138 |                         help='Number of CPUs for ESP grid calculation (default 5)')
139 |     parser.add_argument('--esp_dir', type=str, default='', help='directory with precomputed ESP grids')
140 |     parser.add_argument('--significant_attribution_threshold', type=float,
141 |                         help='significant attribution threshold')
142 |     parser.add_argument('--process_structure_index', type=int, default=-1,
143 |                         help='process structure index (default: -1, process all)')
144 |     parser.add_argument('--output_attribution_dir', type=str, help='directory to save attribution outputs')
145 |     parser.add_argument('-v', '--verbose', action='count', default=0)
146 |     args = parser.parse_args()
147 | 
148 |     seed_everything(42)
149 | 
150 |     model_files_prefix = os.path.join(args.PfAbNet_model_dir, args.PfAbNet_model_prefix)
151 |     model_files = glob.glob('%s*.pt' % model_files_prefix)
152 | 
153 |     args = vars(args)
154 |     cnn_models = get_cnn_models(args, model_files)
155 |     generate_attributions(args, cnn_models)
156 | 
157 | 
158 | 
159 | 
--------------------------------------------------------------------------------
/pfabnet/generate_testset_attributions.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import pandas as pd
 3 | import glob
 4 | import argparse
 5 | 
 6 | import numpy as np
 7 | import torch
 8 | 
 9 | 
10 | 
11 | from model import 
ViscosityNet 12 | from utils import prepare_test_input 13 | from utils import calculate_attribution_grid 14 | from utils import seed_everything 15 | 16 | 17 | device = 'cpu' 18 | if torch.cuda.is_available(): 19 | device = torch.cuda.current_device() 20 | 21 | 22 | def get_cnn_models(args, model_files): 23 | models = [] 24 | for model_file in model_files: 25 | model = ViscosityNet(args['grid_dim']) 26 | if os.path.exists(model_file): 27 | print('loading %s...' % model_file) 28 | model.load_state_dict(torch.load(model_file)) 29 | model.eval() 30 | 31 | model = torch.nn.DataParallel(model).to(device) 32 | models.append(model) 33 | 34 | return models 35 | 36 | 37 | def calculate_test_set_attribution_scores(args, models): 38 | df = pd.read_csv(args.test_data_file) 39 | 40 | args = vars(args) 41 | X, _ = prepare_test_input(df, args) 42 | 43 | attribution_scores = [] 44 | for model in models: 45 | for i in range(len(X)): 46 | attribution_grid, esp_grid = calculate_attribution_grid(model, X[i], device) 47 | attribution_grid = attribution_grid[np.abs(esp_grid) > 1e-5] 48 | attribution_scores.extend(attribution_grid.flatten()) 49 | 50 | return np.array(attribution_scores) 51 | 52 | 53 | 54 | if __name__ == "__main__": 55 | parser = argparse.ArgumentParser( 56 | description='Generate attributions using PfAbNet models') 57 | parser.add_argument('--test_data_file', type=str, help='test set csv files with entity names') 58 | parser.add_argument('--homology_model_dir', type=str, help='homology model directory') 59 | parser.add_argument('--PfAbNet_model_prefix', type=str, default='PfAbNet', help='PfAbNet model prefix') 60 | parser.add_argument('--PfAbNet_model_dir', type=str, help='PfAbNet model directory') 61 | parser.add_argument('--grid_dim', type=int, default=96, 62 | help='number of grid points along each axis (default = 96)') 63 | parser.add_argument('--grid_spacing', type=float, default=0.75, 64 | help='spacing between grid points (default = 0.75 Angstrom)') 65 | parser.add_argument('--shell_width', type=float, default=2.0, 66 | help='thickness of the surface shell (default 2.0 Angstrom)') 67 | parser.add_argument('--NX', type=int, default=1, 68 | help='number of rotated structures for each input structure (default 1)') 69 | parser.add_argument('--processors', type=int, default=1, 70 | help='Number of CPUs for ESP grid calculation (default 1)') 71 | parser.add_argument('--esp_dir', type=str, default='', help='directory with precomputed ESP grids') 72 | parser.add_argument('--output_attribution_scores', type=str, help='file to save attribution scores') 73 | parser.add_argument('--output_attribution_threshold', type=str, help='file to save attribution threshold') 74 | parser.add_argument('-v', '--verbose', action='count', default=0) 75 | args = parser.parse_args() 76 | 77 | seed_everything(42) 78 | 79 | model_files_prefix = os.path.join(args.PfAbNet_model_dir, args.PfAbNet_model_prefix) 80 | model_files = glob.glob('%s*.pt' % model_files_prefix) 81 | cnn_models = get_cnn_models(vars(args), model_files) 82 | 83 | attribution_scores = calculate_test_set_attribution_scores(args, cnn_models) 84 | np.save(args.output_attribution_scores, attribution_scores) 85 | np.save(args.output_attribution_threshold, np.std(attribution_scores)) 86 | 87 | -------------------------------------------------------------------------------- /pfabnet/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class ViscosityNet(nn.Module): 6 
| def __init__(self, grid_dim=96): 7 | super(ViscosityNet, self).__init__() 8 | nfilt = 2 9 | ks = 3 10 | 11 | dilation = 1 12 | if grid_dim >= 64: 13 | self.convnet = nn.Sequential(nn.Conv3d(1, nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 14 | nn.MaxPool3d(2), 15 | nn.Conv3d(nfilt, 2*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 16 | nn.MaxPool3d(2), 17 | nn.Conv3d(2*nfilt, 4*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 18 | nn.MaxPool3d(2), 19 | nn.Conv3d(4*nfilt, 8*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 20 | nn.MaxPool3d(2), 21 | nn.Conv3d(8*nfilt, 16*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 22 | nn.MaxPool3d(2), 23 | nn.Conv3d(16*nfilt, 32*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 24 | nn.MaxPool3d(2), 25 | nn.Conv3d(32*nfilt, 512*nfilt, ks, padding='same', dilation=dilation), nn.ReLU() 26 | ) 27 | else: 28 | self.convnet = nn.Sequential(nn.Conv3d(1, nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 29 | nn.MaxPool3d(2), 30 | nn.Conv3d(nfilt, 2*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 31 | nn.MaxPool3d(2), 32 | nn.Conv3d(2*nfilt, 4*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 33 | nn.MaxPool3d(2), 34 | nn.Conv3d(4*nfilt, 8*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 35 | nn.MaxPool3d(2), 36 | nn.Conv3d(8*nfilt, 16*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 37 | nn.MaxPool3d(2), 38 | nn.Conv3d(16*nfilt, 32*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 39 | nn.Conv3d(32*nfilt, 512*nfilt, ks, padding='same', dilation=dilation), nn.ReLU() 40 | ) 41 | 42 | 43 | self.fc = nn.Sequential(nn.Linear(512*nfilt, 1), nn.ReLU()) 44 | 45 | self.drop_out = nn.Dropout(0.05) 46 | 47 | 48 | def forward(self, x, y=None): 49 | x = self.convnet(x) 50 | 51 | emb = torch.flatten(x, 1) 52 | 53 | x = self.drop_out(emb) 54 | x = self.fc(x) 55 | 56 | if y is not None: 57 | loss = nn.functional.huber_loss(y, x, reduction='mean') 58 | return x, loss 59 | else: 60 | return x 61 | 62 | -------------------------------------------------------------------------------- /pfabnet/predict.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import torch 3 | from dataset import ViscosityDataset 4 | from torch.utils.data.dataloader import DataLoader 5 | 6 | import argparse 7 | import glob 8 | import os 9 | 10 | import numpy as np 11 | 12 | from model import ViscosityNet 13 | from utils import seed_everything 14 | from utils import generate_esp_grids, get_esp_grids 15 | from utils import DEFAULT_GRID_PARAMS, ESP_DIR_KEY 16 | from base import ENTITY_KEY 17 | 18 | 19 | device = 'cpu' 20 | if torch.cuda.is_available(): 21 | device = torch.cuda.current_device() 22 | 23 | def get_cnn_models(args, model_files): 24 | models = [] 25 | for model_file in model_files: 26 | model = ViscosityNet(args.grid_dim) 27 | if os.path.exists(model_file): 28 | print('loading %s...' 
% model_file)
 29 |             model.load_state_dict(torch.load(model_file))
 30 |             model.eval()
 31 | 
 32 |             model = model.to(device)
 33 |             models.append(model)
 34 | 
 35 |     return models
 36 | 
 37 | 
 38 | def predict(cnn_models, mol_file, args = DEFAULT_GRID_PARAMS):
 39 |     if len(args[ESP_DIR_KEY]) > 0:
 40 |         esp_grids = get_esp_grids(args, mol_file)
 41 |     else:
 42 |         esp_grids = generate_esp_grids(args, mol_file)
 43 | 
 44 |     esp_grids = [esp_array for esp_array, _ in esp_grids]
 45 | 
 46 |     dummy_y = [0.0]*len(esp_grids)
 47 | 
 48 |     test_dataset = ViscosityDataset(esp_grids, dummy_y)
 49 | 
 50 |     loader = DataLoader(test_dataset, shuffle=False, pin_memory=True,
 51 |                         batch_size=1, num_workers=0)
 52 | 
 53 |     y_preds = []
 54 |     for it, d_it in enumerate(loader):
 55 |         x, y = d_it
 56 | 
 57 |         # place data on the correct device
 58 |         x = x.to(device)
 59 | 
 60 |         for model in cnn_models:
 61 |             # forward the model
 62 |             with torch.set_grad_enabled(False):
 63 |                 output = model(x)
 64 | 
 65 |             y1 = output.detach().cpu().squeeze(1).numpy()
 66 |             y_preds.extend(y1)
 67 | 
 68 | 
 69 |     return np.power(10, np.mean(np.array(y_preds)))
 70 | 
 71 | 
 72 | 
 73 | 
 74 | if __name__ == "__main__":
 75 |     parser = argparse.ArgumentParser(
 76 |         description='Generate predictions using PfAbNet models')
 77 |     parser.add_argument('--structure_file', type=str, help='Input Fv structure')
 78 |     parser.add_argument('--PfAbNet_model_prefix', type=str, default='PfAbNet', help='output model prefix')
 79 |     parser.add_argument('--PfAbNet_model_dir', type=str, help='output model directory')
 80 |     parser.add_argument('--grid_dim', type=int, default=96,
 81 |                         help='number of grid points along each axis (default = 96)')
 82 |     parser.add_argument('--grid_spacing', type=float, default=0.75,
 83 |                         help='spacing between grid points (default = 0.75 Angstrom)')
 84 |     parser.add_argument('--shell_width', type=float, default=2.0,
 85 |                         help='thickness of the surface shell (default 2.0 Angstrom)')
 86 |     parser.add_argument('--NX', type=int, default=10,
 87 |                         help='augmentation level (default 10x)')
 88 |     parser.add_argument('--processors', type=int, default=5,
 89 |                         help='Number of CPUs for ESP grid calculation (default 5)')
 90 |     parser.add_argument('--esp_dir', type=str, default='', help='directory with precomputed ESP grids')
 91 |     parser.add_argument('--output_file', type=str, help='Output file with prediction')
 92 |     parser.add_argument('-v', '--verbose', action='count', default=0)
 93 |     args = parser.parse_args()
 94 | 
 95 |     seed_everything(42)
 96 | 
 97 |     model_files_prefix = os.path.join(args.PfAbNet_model_dir, args.PfAbNet_model_prefix)
 98 |     model_files = glob.glob('%s*.pt' % model_files_prefix)
 99 |     cnn_models = get_cnn_models(args, model_files)
100 | 
101 |     output = []
102 |     ypred = predict(cnn_models, args.structure_file, args.__dict__)
103 |     output.append({ENTITY_KEY:os.path.basename(args.structure_file).split('.mol2')[0], 'VISCOSITY_PRED':ypred})
104 |     print(args.structure_file, ypred)
105 | 
106 |     df = pd.DataFrame(output)
107 |     df.to_csv(args.output_file, index=False)
108 | 
109 | 
110 | 
111 | 
--------------------------------------------------------------------------------
/pfabnet/sbatch_tmpl.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -l
2 | #SBATCH -e %j.err
3 | #SBATCH -o %j.out
4 | #SBATCH --nodes=1
5 | #SBATCH --gres=gpu:v100:1
6 | #SBATCH --mem=32gb
7 | #SBATCH --wait
8 | 
--------------------------------------------------------------------------------
/pfabnet/train.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import pandas as pd
 3 | import torch
 4 | import torch.nn as nn
 5 | 
 6 | import argparse
 7 | from sklearn.model_selection import KFold
 8 | 
 9 | from dataset import ViscosityDataset
10 | from model import ViscosityNet
11 | from trainer import Trainer, TrainerConfig
12 | from utils import seed_everything, prepare_training_input
13 | from base import VISCOSITY_KEY
14 | 
15 | 
16 | def train(args):
17 |     seed_everything(42)
18 | 
19 |     training_data_files = args.training_data_file.split(',')
20 |     df_list = []
21 |     for training_data_file in training_data_files:
22 |         if training_data_file.endswith('.csv'):
23 |             df = pd.read_csv(training_data_file)
24 |         else:
25 |             df = pd.read_pickle(training_data_file)
26 | 
27 |         df_list.append(df)
28 | 
29 |     df = pd.concat(df_list)
30 |     df.loc[df[VISCOSITY_KEY] > 1000, VISCOSITY_KEY] = 1000
31 | 
32 |     X, y = prepare_training_input(df, args.__dict__)
33 | 
34 |     kf = KFold(n_splits=10, shuffle=True)
35 |     train_index, val_index = list(kf.split(y))[args.fold_idx]
36 | 
37 |     X_train, y_train = X[train_index], y[train_index]
38 |     X_val, y_val = X[val_index], y[val_index]
39 |     print('Number of datapoints; train: %d, val: %d' % (len(y_train), len(y_val)))
40 | 
41 |     train_dataset = ViscosityDataset(X_train, y_train)
42 |     val_dataset = ViscosityDataset(X_val, y_val)
43 | 
44 |     # save model path
45 |     ckpt_file = '%s_%d.pt' % (args.output_model_prefix, args.fold_idx)
46 |     ckpt_path = os.path.join(args.output_model_dir, ckpt_file)
47 |     print('PyTorch model will be saved in ', ckpt_path)
48 | 
49 |     def weights_init(m):
50 |         if isinstance(m, nn.Conv3d) or isinstance(m, nn.Linear):
51 |             torch.nn.init.kaiming_normal_(m.weight)
52 |             torch.nn.init.zeros_(m.bias)
53 | 
54 |     model = ViscosityNet(args.grid_dim)
55 |     model.apply(weights_init)
56 |     if os.path.exists(ckpt_path):
57 |         print('loading saved model...')
58 |         model.load_state_dict(torch.load(ckpt_path))
59 |         model.eval()
60 | 
61 |     print(sum(p.numel() for p in model.parameters() if p.requires_grad), 'model parameters')
62 | 
63 |     bs = 1
64 | 
65 |     history_file = '%s_hist_%d.pkl' % (args.output_model_prefix, args.fold_idx)
66 |     history_path = os.path.join(args.output_model_dir, history_file)
67 |     tconf = TrainerConfig(max_epochs=2000, batch_size=bs, learning_rate=1e-5,
68 |                           num_workers=0, ckpt_path=ckpt_path, history_path=history_path)
69 | 
70 |     trainer = Trainer(model, train_dataset, val_dataset, tconf)
71 |     trainer.train()
72 | 
73 | 
74 | 
75 | if __name__ == "__main__":
76 |     parser = argparse.ArgumentParser(
77 |         description='train PfAbNet model')
78 |     parser.add_argument('--training_data_file', type=str, help='training data file')
79 |     parser.add_argument('--homology_model_dir', type=str, help='homology model directory')
80 |     parser.add_argument('--output_model_prefix', type=str, default='PfAbNet', help='output model prefix')
81 |     parser.add_argument('--output_model_dir', type=str, help='output model directory')
82 |     parser.add_argument('--grid_dim', type=int, default=96,
83 |                         help='number of grid points along each axis (default = 96)')
84 |     parser.add_argument('--grid_spacing', type=float, default=0.75,
85 |                         help='spacing between grid points (default = 0.75 Angstrom)')
86 |     parser.add_argument('--shell_width', type=float, default=2.0,
87 |                         help='thickness of the surface shell (default 2.0 Angstrom)')
88 |     parser.add_argument('--NX', type=int, default=10,
89 |                         help='augmentation level (default 10x)')
90 |     parser.add_argument('--processors', 
type=int, default=5, 91 | help='Number of CPUs for ESP grid calculation (default 5)') 92 | parser.add_argument('--esp_dir', type=str, default='', help='directory with precomputed ESP grids') 93 | parser.add_argument('--fold_idx', default=0, type=int, 94 | help='index of the k-fold split (default = 0)') 95 | parser.add_argument('-v', '--verbose', action='count', default=0) 96 | args = parser.parse_args() 97 | 98 | os.makedirs(args.output_model_dir, exist_ok=True) 99 | 100 | train(args) 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /pfabnet/trainer.py: -------------------------------------------------------------------------------- 1 | from tqdm import tqdm 2 | import numpy as np 3 | import torch 4 | from torch.utils.data.dataloader import DataLoader 5 | import pickle 6 | 7 | 8 | class TrainerConfig: 9 | # optimization parameters 10 | betas = (0.9, 0.999) 11 | grad_norm_clip = 1.0 12 | ckpt_path = None 13 | history_path = None 14 | num_workers = 0 # for DataLoader 15 | 16 | def __init__(self, **kwargs): 17 | for k,v in kwargs.items(): 18 | setattr(self, k, v) 19 | 20 | class Trainer: 21 | 22 | def __init__(self, model, train_dataset, val_dataset, config): 23 | self.model = model 24 | self.train_dataset = train_dataset 25 | self.val_dataset = val_dataset 26 | self.config = config 27 | 28 | self.device = 'cpu' 29 | if torch.cuda.is_available(): 30 | self.device = torch.cuda.current_device() 31 | self.model = torch.nn.DataParallel(self.model).to(self.device) 32 | 33 | def save_checkpoint(self): 34 | raw_model = self.model.module if hasattr(self.model, "module") else self.model 35 | torch.save(raw_model.state_dict(), self.config.ckpt_path) 36 | 37 | 38 | def train(self): 39 | model, config = self.model, self.config 40 | optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate, 41 | betas=config.betas, weight_decay=0.005) 42 | 43 | def run_epoch(split): 44 | is_train = split == 'train' 45 | model.train(is_train) 46 | 47 | if is_train: 48 | data = self.train_dataset 49 | batch_size = config.batch_size 50 | else: 51 | data = self.val_dataset 52 | batch_size = config.batch_size 53 | 54 | shuffle = False 55 | if is_train: 56 | shuffle = True 57 | 58 | loader = DataLoader(data, shuffle=shuffle, pin_memory=True, 59 | batch_size=batch_size, 60 | num_workers=config.num_workers) 61 | 62 | losses = [] 63 | pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader) 64 | for it, d_it in pbar: 65 | x, y = d_it 66 | 67 | x = x.to(self.device) 68 | y = y.to(self.device) 69 | 70 | with torch.set_grad_enabled(is_train): 71 | output, loss = model(x, y) 72 | loss = loss.mean() 73 | losses.append(loss.item()) 74 | 75 | if is_train: 76 | model.zero_grad() 77 | loss.backward() 78 | torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip) 79 | optimizer.step() 80 | 81 | pbar.set_description(f"epoch {epoch+1} iter {it}: train loss {loss.item():.5f}. 
" 82 | f"lr {config.learning_rate:e}") 83 | 84 | return float(np.mean(losses)) 85 | 86 | best_loss = float('inf') 87 | try: 88 | with open(self.config.history_path, 'rb') as fptr: 89 | history = pickle.load(fptr) 90 | start_epoch = len(np.array(history['val_loss'])) 91 | history = {'train_loss':history['train_loss'][:start_epoch], 'val_loss':history['val_loss'][:start_epoch]} 92 | except Exception as e: 93 | history = {'train_loss': [], 'val_loss': []} 94 | start_epoch = 0 95 | 96 | for epoch in range(start_epoch, config.max_epochs): 97 | train_loss = run_epoch('train') 98 | val_loss = run_epoch('val') 99 | history['train_loss'].append(train_loss) 100 | history['val_loss'].append(val_loss) 101 | 102 | with open(self.config.history_path, 'wb') as fptr: 103 | pickle.dump(history, fptr) 104 | 105 | if epoch < 1950: 106 | self.save_checkpoint() 107 | continue 108 | 109 | good_model = val_loss < best_loss 110 | if self.config.ckpt_path is not None and good_model: 111 | best_loss = val_loss 112 | self.save_checkpoint() 113 | 114 | 115 | -------------------------------------------------------------------------------- /pfabnet/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import multiprocessing 4 | import pickle 5 | 6 | import numpy as np 7 | 8 | import torch 9 | from captum.attr import IntegratedGradients 10 | 11 | from openeye import oechem 12 | from openeye import oegrid 13 | from openeye import oezap 14 | from openeye import oespicoli 15 | 16 | try: 17 | from base import VISCOSITY_KEY, ENTITY_KEY 18 | except Exception as e: 19 | from .base import VISCOSITY_KEY, ENTITY_KEY 20 | 21 | ESP_GRID_KEY = 'ESP_GRID' 22 | 23 | INPUT_MOL_KEY = 'INPUT_MOL' 24 | ROT_X_KEY = 'rot_x' 25 | ROT_Y_KEY = 'rot_y' 26 | ROT_Z_KEY = 'rot_z' 27 | GRID_SPACING_KEY = 'grid_spacing' 28 | GRID_DIM_KEY = 'grid_dim' 29 | SHELL_WIDTH_KEY = 'shell_width' 30 | NX_KEY = 'NX' # augmentation level 31 | PROCESSORS_KEY = 'processors' 32 | HOMOLOGY_MODEL_DIR_KEY = 'homology_model_dir' 33 | ESP_DIR_KEY = 'esp_dir' 34 | 35 | DEFAULT_GRID_PARAMS = {GRID_DIM_KEY: 96, GRID_SPACING_KEY: 0.75, 36 | SHELL_WIDTH_KEY: 2.0, NX_KEY: 10} 37 | 38 | def get_molecule(input_file, perceive_residue=True, center_mol=True): 39 | ifs = oechem.oemolistream(input_file) 40 | mol = oechem.OEGraphMol() 41 | oechem.OEReadMolecule(ifs, mol) 42 | ifs.close() 43 | 44 | if perceive_residue: 45 | oechem.OEPerceiveResidues(mol) 46 | if center_mol: 47 | oechem.OECenter(mol) 48 | 49 | return mol 50 | 51 | 52 | def get_esp_array(params): 53 | mol = params[INPUT_MOL_KEY] 54 | theta_x = params[ROT_X_KEY] 55 | theta_y = params[ROT_Y_KEY] 56 | theta_z = params[ROT_Z_KEY] 57 | grid_spacing = params[GRID_SPACING_KEY] 58 | grid_dim = params[GRID_DIM_KEY] 59 | shell_width = params[SHELL_WIDTH_KEY] 60 | 61 | oechem.OEEulerRotate(mol, oechem.OEDoubleArray([theta_x, theta_y, theta_z])) 62 | 63 | oechem.OEAssignBondiVdWRadii(mol) 64 | 65 | zap = oezap.OEZap() 66 | zap.SetInnerDielectric(2.0) 67 | zap.SetGridSpacing(grid_spacing) 68 | zap.SetMolecule(mol) 69 | 70 | grid = oegrid.OEScalarGrid(grid_dim, grid_dim, grid_dim, 71 | 0.0, 0.0, 0.0, grid_spacing) 72 | zap.SetOuterDielectric(80) 73 | zap.CalcPotentialGrid(grid) 74 | 75 | surf = oespicoli.OESurface() 76 | oespicoli.OEMakeMolecularSurface(surf, mol) 77 | 78 | surf_grid = oegrid.OEScalarGrid(grid_dim, grid_dim, grid_dim, 0.0, 0.0, 0.0, grid_spacing) 79 | oespicoli.OEMakeGridFromSurface(surf_grid, surf) 80 | 81 | grid_size = grid.GetSize() 82 | arr 
= np.zeros(grid_size) 83 | idx = 0 84 | count = 0 85 | for i in range(0, grid_dim): 86 | for j in range(0, grid_dim): 87 | for k in range(0, grid_dim): 88 | v = surf_grid.GetValue(i, j, k) 89 | if 0 <= v < shell_width: 90 | val = grid.GetValue(i, j, k) 91 | arr[idx] = val 92 | 93 | count += 1 94 | idx += 1 95 | 96 | arr3d_esp = np.reshape(arr, (grid_dim, grid_dim, grid_dim, 1)) 97 | 98 | return arr3d_esp, mol 99 | 100 | 101 | 102 | def prepare_cnn_input(df, args, train=True): 103 | hm_model_dir = args[HOMOLOGY_MODEL_DIR_KEY] 104 | if hm_model_dir is None: 105 | raise Exception('Homology model directory not specified') 106 | 107 | X = [] 108 | y = [] 109 | for row_idx, row in df.iterrows(): 110 | entity = row[ENTITY_KEY] 111 | 112 | mol_file = os.path.join(hm_model_dir, entity + '.mol2') 113 | if len(args[ESP_DIR_KEY]) > 0: 114 | esp_grids = get_esp_grids(args, mol_file) 115 | else: 116 | esp_grids = generate_esp_grids(args, mol_file) 117 | 118 | esp_grids = [esp_array for esp_array, _ in esp_grids] 119 | 120 | X.extend(esp_grids) 121 | if train: 122 | log_visc = np.log10(row[VISCOSITY_KEY]) 123 | y.extend([log_visc] * args[NX_KEY]) 124 | else: 125 | y.extend([0.0] * args[NX_KEY]) 126 | 127 | return np.array(X), np.array(y) 128 | 129 | 130 | def get_esp_grids(args, mol_file): 131 | esp_dir = args[ESP_DIR_KEY] 132 | esp_array_output = [] 133 | for i in range(args[NX_KEY]): 134 | with open('%s/rotation_%d/%s.pyb' % (esp_dir, i + 1, 135 | os.path.basename(mol_file).split('.mol2')[0]), 'rb') as fptr: 136 | 137 | mol = get_molecule(os.path.join(os.path.join(esp_dir, 'rotation_%d' % (i+1)), os.path.basename(mol_file))) 138 | esp_array_output.append((pickle.load(fptr), mol)) 139 | 140 | return esp_array_output 141 | 142 | 143 | def generate_esp_grids(args, mol_file): 144 | mol = get_molecule(mol_file) 145 | 146 | params = [] 147 | for i in range(args[NX_KEY]): 148 | rot_x = np.random.uniform(0, 180) 149 | rot_y = np.random.uniform(0, 180) 150 | rot_z = np.random.uniform(0, 180) 151 | 152 | params.append({INPUT_MOL_KEY: oechem.OEGraphMol(mol), ROT_X_KEY: rot_x, 153 | ROT_Y_KEY: rot_y, ROT_Z_KEY: rot_z, 154 | GRID_DIM_KEY: args[GRID_DIM_KEY], 155 | GRID_SPACING_KEY: args[GRID_SPACING_KEY], 156 | SHELL_WIDTH_KEY: args[SHELL_WIDTH_KEY]}) 157 | if multiprocessing.cpu_count() >= args[PROCESSORS_KEY]: 158 | processors = args[PROCESSORS_KEY] 159 | else: 160 | processors = multiprocessing.cpu_count() 161 | p = multiprocessing.Pool(processes=processors) 162 | esp_array_output = p.map(get_esp_array, params) 163 | p.close() 164 | 165 | output = [(np.moveaxis(esp_array, 3, 0), output_mol) for esp_array, output_mol in esp_array_output] 166 | return output 167 | 168 | 169 | def prepare_training_input(df, args): 170 | return prepare_cnn_input(df, args, train=True) 171 | 172 | 173 | def prepare_test_input(df, args): 174 | return prepare_cnn_input(df, args, train=False) 175 | 176 | 177 | def calculate_attribution_grid(model, esp_grid_in, device='cpu'): 178 | esp_grid = torch.Tensor(esp_grid_in) 179 | baseline = torch.zeros(esp_grid.shape) 180 | esp_grid = esp_grid.unsqueeze(0) 181 | esp_grid2 = esp_grid.to(device) 182 | 183 | baseline = torch.unsqueeze(baseline, 0) 184 | baseline = baseline.to(device) 185 | 186 | ig = IntegratedGradients(model) 187 | attributions, delta = ig.attribute(esp_grid2, baseline, target=0, return_convergence_delta=True) 188 | attributions = attributions.detach().cpu().numpy() 189 | esp_grid2 = esp_grid2.detach().cpu().numpy() 190 | 191 | return attributions, esp_grid2 192 | 193 | 194 | def 
seed_everything(seed):
195 |     random.seed(seed)
196 |     np.random.seed(seed)
197 |     torch.manual_seed(seed)
198 |     torch.cuda.manual_seed_all(seed)
199 | 
--------------------------------------------------------------------------------
/pfabnet_eisenberg/__init__.py:
--------------------------------------------------------------------------------
1 | from .dataset import ViscosityDataset
2 | 
--------------------------------------------------------------------------------
/pfabnet_eisenberg/base.py:
--------------------------------------------------------------------------------
1 | ENTITY_KEY = 'Entity'
2 | VISCOSITY_KEY = 'Viscosity_at_150'
3 | SCHRODINGER_BASE = '/localscratch/software/schrodinger/adv-2021-2'
4 | 
5 | 
6 | def get_file_path():
7 |     return __file__
8 | 
--------------------------------------------------------------------------------
/pfabnet_eisenberg/dataset.py:
--------------------------------------------------------------------------------
 1 | from torch.utils.data import Dataset
 2 | import torch
 3 | 
 4 | class ViscosityDataset(Dataset):
 5 |     def __init__(self, X, y):
 6 |         self.X = X
 7 |         self.y = y
 8 | 
 9 |     def __getitem__(self, index):
10 |         return torch.Tensor(self.X[index]), torch.Tensor([self.y[index]])
11 | 
12 |     def __len__(self):
13 |         return len(self.y)
14 | 
--------------------------------------------------------------------------------
/pfabnet_eisenberg/eisenberg_generator.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import glob
 3 | import argparse
 4 | import pickle
 5 | 
 6 | import numpy as np
 7 | from openeye import oechem
 8 | from utils import generate_eisenberg_grids
 9 | 
10 | 
11 | 
12 | if __name__ == '__main__':
13 |     parser = argparse.ArgumentParser(
14 |         description='Generate PfAbNet Eisenberg/ESP grid input')
15 |     parser.add_argument('--input_mols_dir', type=str, default='./',
16 |                         help='directory containing antibody structures/models')
17 |     parser.add_argument('--eisenberg_output_dir', type=str, default='./',
18 |                         help='directory to save the generated Eisenberg grid files')
19 |     parser.add_argument('--grid_dim', type=int, default=96,
20 |                         help='number of grid points along each axis (default = 96)')
21 |     parser.add_argument('--grid_spacing', type=float, default=0.75,
22 |                         help='spacing between grid points (default = 0.75 Angstrom)')
23 |     parser.add_argument('--shell_width', type=float, default=2.0,
24 |                         help='thickness of the surface shell (default 2.0 Angstrom)')
25 |     parser.add_argument('--NX', type=int, default=10,
26 |                         help='augmentation level (default 10x)')
27 |     parser.add_argument('--processors', type=int, default=10,
28 |                         help='Number of CPUs for grid calculation (default 10)')
29 |     parser.add_argument('--seed', type=int, default=42,
30 |                         help='random seed (default 42)')
31 | 
32 |     parser.add_argument('-v', '--verbose', action='count', default=0)
33 |     in_args = parser.parse_args()
34 | 
35 |     input_mols_dir = in_args.input_mols_dir
36 |     eisenberg_dir = in_args.eisenberg_output_dir
37 |     seed = in_args.seed
38 | 
39 |     args = in_args.__dict__
40 |     np.random.seed(seed)
41 | 
42 |     try:
43 |         os.mkdir(eisenberg_dir)
44 |     except Exception as e:
45 |         pass
46 | 
47 |     mol_files = glob.glob(input_mols_dir + '/*.mol2')
48 |     for mol_file in mol_files:
49 |         print(mol_file)
50 |         output = generate_eisenberg_grids(args, mol_file)
51 |         for idx, (esp_grid, phobic_grid, philic_grid, output_mol) in enumerate(output):
52 |             output_dir = os.path.join(eisenberg_dir, 'rotation_%d' % (idx + 1))
53 |             try:
54 |                 os.mkdir(output_dir)
55 
| except Exception as e: 56 | pass 57 | 58 | base_mol_file = os.path.basename(mol_file).split('.mol2')[0] 59 | esp_file = os.path.join(output_dir, base_mol_file + '.pyb') 60 | with open(esp_file, 'wb') as fptr: 61 | pickle.dump([esp_grid, phobic_grid, philic_grid], fptr) 62 | 63 | output_mol_file = os.path.join(output_dir, os.path.basename(mol_file)) 64 | 65 | ofs = oechem.oemolostream(output_mol_file) 66 | oechem.OEWriteConstMolecule(ofs, output_mol) 67 | ofs.close() 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /pfabnet_eisenberg/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class ViscosityNet(nn.Module): 6 | def __init__(self, num_channels=2): 7 | super(ViscosityNet, self).__init__() 8 | nfilt = num_channels 9 | ks = 3 10 | 11 | dilation = 1 12 | if num_channels == 2: 13 | self.convnet = nn.Sequential(nn.Conv3d(num_channels, nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 14 | nn.MaxPool3d(2), 15 | nn.Conv3d(nfilt, 2*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 16 | nn.MaxPool3d(2), 17 | nn.Conv3d(2*nfilt, 4*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 18 | nn.MaxPool3d(2), 19 | nn.Conv3d(4*nfilt, 8*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 20 | nn.MaxPool3d(2), 21 | nn.Conv3d(8*nfilt, 16*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 22 | nn.MaxPool3d(2), 23 | nn.Conv3d(16*nfilt, 32*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 24 | nn.MaxPool3d(2), 25 | nn.Conv3d(32*nfilt, 1024, ks, padding='same', dilation=dilation), nn.ReLU() 26 | ) 27 | elif num_channels == 3: 28 | self.convnet = nn.Sequential(nn.Conv3d(num_channels, nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 29 | nn.MaxPool3d(2), 30 | nn.Conv3d(nfilt, 2*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 31 | nn.MaxPool3d(2), 32 | nn.Conv3d(2*nfilt, 4*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 33 | nn.MaxPool3d(2), 34 | nn.Conv3d(4*nfilt, 8*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 35 | nn.MaxPool3d(2), 36 | nn.Conv3d(8*nfilt, 16*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 37 | nn.MaxPool3d(2), 38 | nn.Conv3d(16*nfilt, 32*nfilt, ks, padding='same', dilation=dilation), nn.ReLU(), 39 | nn.MaxPool3d(2), 40 | nn.Conv3d(32*nfilt, 1024, ks, padding='same', dilation=dilation), nn.ReLU() 41 | ) 42 | else: 43 | print('ERROR... 
number of input channels must be either 2 or 3') 44 | 45 | 46 | self.fc = nn.Sequential(nn.Linear(1024, 1), nn.ReLU()) 47 | 48 | self.drop_out = nn.Dropout(0.05) 49 | 50 | 51 | def forward(self, x, y=None): 52 | x = self.convnet(x) 53 | 54 | emb = torch.flatten(x, 1) 55 | 56 | x = self.drop_out(emb) 57 | x = self.fc(x) 58 | 59 | if y is not None: 60 | loss = nn.functional.huber_loss(y, x, reduction='mean') 61 | return x, loss 62 | else: 63 | return x, emb 64 | 65 | -------------------------------------------------------------------------------- /pfabnet_eisenberg/predict.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import torch 3 | from dataset import ViscosityDataset 4 | from torch.utils.data.dataloader import DataLoader 5 | 6 | import argparse 7 | import glob 8 | import os 9 | 10 | import numpy as np 11 | 12 | from model import ViscosityNet 13 | from utils import seed_everything 14 | from utils import generate_eisenberg_grids, get_eisenberg_grids 15 | from utils import DEFAULT_GRID_PARAMS, EISENBERG_DIR_KEY 16 | from base import ENTITY_KEY 17 | 18 | 19 | device = 'cpu' 20 | if torch.cuda.is_available(): 21 | device = torch.cuda.current_device() 22 | 23 | def get_cnn_models(args, model_files): 24 | models = [] 25 | for model_file in model_files: 26 | model = ViscosityNet(args.num_channels) 27 | if os.path.exists(model_file): 28 | print('loading %s...' % model_file) 29 | model.load_state_dict(torch.load(model_file)) 30 | model.eval() 31 | 32 | model = model.to(device) 33 | models.append(model) 34 | 35 | return models 36 | 37 | 38 | def predict(cnn_models, mol_file, args = DEFAULT_GRID_PARAMS): 39 | if len(args[EISENBERG_DIR_KEY]) > 0: 40 | esp_grids = get_eisenberg_grids(args, mol_file) 41 | else: 42 | esp_grids = generate_eisenberg_grids(args, mol_file) 43 | 44 | if args['num_channels'] == 3: 45 | combined_grids = [np.concatenate([esp_arr, phobic_arr, philic_arr], axis=0) 46 | for esp_arr, phobic_arr, philic_arr, _ in esp_grids] 47 | else: 48 | combined_grids = [np.concatenate([phobic_arr, philic_arr], axis=0) 49 | for _, phobic_arr, philic_arr, _ in esp_grids] 50 | 51 | dummy_y = [0.0]*len(combined_grids) 52 | 53 | test_dataset = ViscosityDataset(combined_grids, dummy_y) 54 | 55 | loader = DataLoader(test_dataset, shuffle=False, pin_memory=True, 56 | batch_size=1, num_workers=0) 57 | 58 | y_preds = [] 59 | for it, d_it in enumerate(loader): 60 | x, y = d_it 61 | 62 | # place data on the correct device 63 | x = x.to(device) 64 | 65 | for model in cnn_models: 66 | # forward the model 67 | with torch.set_grad_enabled(False): 68 | output, _ = model(x) 69 | 70 | y1 = output.detach().cpu().squeeze(1).numpy() 71 | y_preds.extend(y1) 72 | 73 | 74 | return np.power(10, np.mean(np.array(y_preds))) 75 | 76 | 77 | 78 | 79 | if __name__ == "__main__": 80 | parser = argparse.ArgumentParser( 81 | description='Generate predictions using PfAbNet models') 82 | parser.add_argument('--structure_file', type=str, help='Input Fv structure') 83 | parser.add_argument('--PfAbNet_model_prefix', type=str, default='PfAbNet', help='output model prefix') 84 | parser.add_argument('--PfAbNet_model_dir', type=str, help='output model directory') 85 | parser.add_argument('--grid_dim', type=int, default=96, 86 | help='number of grid points along each axis (default = 96)') 87 | parser.add_argument('--grid_spacing', type=float, default=0.75, 88 | help='spacing between grid points (default = 0.75 Angstrom)') 89 | parser.add_argument('--shell_width', 
type=float, default=2.0, 90 | help='thickness of the surface shell (default 2.0 Angstrom)') 91 | parser.add_argument('--NX', type=int, default=10, 92 | help='augmentation level (default 10x)') 93 | parser.add_argument('--num_channels', type=int, default=2, 94 | help='number of input channels (2 for eisenberg ' 95 | 'phobic + philic or 3 for esp_eisenberg (default 2))') 96 | parser.add_argument('--processors', type=int, default=5, 97 | help='Number of CPUs for ESP grid calculation (default 5)') 98 | parser.add_argument('--eisenberg_dir', type=str, default='', help='directory with precomputed density grids') 99 | parser.add_argument('--output_file', type=str, help='Output file with prediction') 100 | parser.add_argument('-v', '--verbose', action='count', default=0) 101 | args = parser.parse_args() 102 | 103 | seed_everything(42) 104 | 105 | model_files_prefix = os.path.join(args.PfAbNet_model_dir, args.PfAbNet_model_prefix) 106 | model_files = glob.glob('%s*.pt' % model_files_prefix) 107 | cnn_models = get_cnn_models(args, model_files) 108 | 109 | output = [] 110 | ypred = predict(cnn_models, args.structure_file, args.__dict__) 111 | output.append({ENTITY_KEY:os.path.basename(args.structure_file).split('.mol2')[0], 'VISCOSITY_PRED':ypred}) 112 | print(args.structure_file, ypred) 113 | 114 | df = pd.DataFrame(output) 115 | df.to_csv(args.output_file, index=False) 116 | 117 | 118 | 119 | 120 | -------------------------------------------------------------------------------- /pfabnet_eisenberg/train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import torch 4 | import torch.nn as nn 5 | 6 | import argparse 7 | from sklearn.model_selection import KFold 8 | 9 | from dataset import ViscosityDataset 10 | from model import ViscosityNet 11 | from trainer import Trainer, TrainerConfig 12 | from utils import seed_everything, prepare_training_input 13 | from base import VISCOSITY_KEY 14 | 15 | 16 | def train(args): 17 | seed_everything(42) 18 | 19 | training_data_files = args.training_data_file.split(',') 20 | df_list = [] 21 | for training_data_file in training_data_files: 22 | if training_data_file.endswith('.csv'): 23 | df = pd.read_csv(training_data_file) 24 | else: 25 | df = pd.read_pickle(training_data_file) 26 | 27 | df_list.append(df) 28 | 29 | df = pd.concat(df_list) 30 | df.loc[df[VISCOSITY_KEY] > 1000, VISCOSITY_KEY] = 1000 31 | 32 | X, y = prepare_training_input(df, args.__dict__) 33 | 34 | kf = KFold(n_splits=10, shuffle=True) 35 | train_index, val_index = list(kf.split(y))[args.fold_idx] 36 | 37 | X_train, y_train = X[train_index], y[train_index] 38 | X_val, y_val = X[val_index], y[val_index] 39 | print('Number of datapoints; train: %d, val: %d' % (len(y_train), len(y_val))) 40 | 41 | train_dataset = ViscosityDataset(X_train, y_train) 42 | val_dataset = ViscosityDataset(X_val, y_val) 43 | 44 | # save model path 45 | ckpt_file = '%s_%d.pt' % (args.output_model_prefix, args.fold_idx) 46 | ckpt_path = os.path.join(args.output_model_dir, ckpt_file) 47 | print('PyTorch model will be saved in ', ckpt_path) 48 | 49 | def weights_init(m): 50 | if isinstance(m, nn.Conv3d) or isinstance(m, nn.Linear): 51 | torch.nn.init.kaiming_normal_(m.weight) 52 | torch.nn.init.zeros_(m.bias) 53 | 54 | model = ViscosityNet(args.num_channels) 55 | model.apply(weights_init) 56 | if os.path.exists(ckpt_path): 57 | print('loading saved model...') 58 | model.load_state_dict(torch.load(ckpt_path)) 59 | model.eval() 60 | 61 | 
--------------------------------------------------------------------------------
/pfabnet_eisenberg/train.py:
--------------------------------------------------------------------------------
import os
import pandas as pd
import torch
import torch.nn as nn

import argparse
from sklearn.model_selection import KFold

from dataset import ViscosityDataset
from model import ViscosityNet
from trainer import Trainer, TrainerConfig
from utils import seed_everything, prepare_training_input
from base import VISCOSITY_KEY


def train(args):
    seed_everything(42)

    training_data_files = args.training_data_file.split(',')
    df_list = []
    for training_data_file in training_data_files:
        if training_data_file.endswith('.csv'):
            df = pd.read_csv(training_data_file)
        else:
            df = pd.read_pickle(training_data_file)

        df_list.append(df)

    df = pd.concat(df_list)
    # cap extreme viscosity values at 1000 before the log10 transform
    df.loc[df[VISCOSITY_KEY] > 1000, VISCOSITY_KEY] = 1000

    X, y = prepare_training_input(df, args.__dict__)

    # KFold draws from numpy's global RNG here (shuffle=True, no random_state),
    # so the split is reproducible only because seed_everything(42) runs first
    kf = KFold(n_splits=10, shuffle=True)
    train_index, val_index = list(kf.split(y))[args.fold_idx]

    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]
    print('Number of datapoints; train: %d, val: %d' % (len(y_train), len(y_val)))

    train_dataset = ViscosityDataset(X_train, y_train)
    val_dataset = ViscosityDataset(X_val, y_val)

    # save model path
    ckpt_file = '%s_%d.pt' % (args.output_model_prefix, args.fold_idx)
    ckpt_path = os.path.join(args.output_model_dir, ckpt_file)
    print('PyTorch model will be saved in ', ckpt_path)

    def weights_init(m):
        if isinstance(m, nn.Conv3d) or isinstance(m, nn.Linear):
            torch.nn.init.kaiming_normal_(m.weight)
            torch.nn.init.zeros_(m.bias)

    model = ViscosityNet(args.num_channels)
    model.apply(weights_init)
    if os.path.exists(ckpt_path):
        # resume from an existing checkpoint; the Trainer toggles train/eval
        # modes per epoch, so eval() here only sets the initial state
        print('loading saved model...')
        model.load_state_dict(torch.load(ckpt_path))
        model.eval()

    print(sum(p.numel() for p in model.parameters() if p.requires_grad), 'model parameters')

    bs = 1

    history_file = '%s_hist_%d.pkl' % (args.output_model_prefix, args.fold_idx)
    history_path = os.path.join(args.output_model_dir, history_file)
    tconf = TrainerConfig(max_epochs=2000, batch_size=bs, learning_rate=1e-5,
                          num_workers=0, ckpt_path=ckpt_path, history_path=history_path)

    trainer = Trainer(model, train_dataset, val_dataset, tconf)
    trainer.train()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='train PfAbNet model')
    parser.add_argument('--training_data_file', type=str, help='training data file')
    parser.add_argument('--homology_model_dir', type=str, help='homology model directory')
    parser.add_argument('--output_model_prefix', type=str, default='PfAbNet', help='output model prefix')
    parser.add_argument('--output_model_dir', type=str, help='output model directory')
    parser.add_argument('--grid_dim', type=int, default=96,
                        help='number of grid points along each axis (default = 96)')
    parser.add_argument('--grid_spacing', type=float, default=0.75,
                        help='spacing between grid points (default = 0.75 Angstrom)')
    parser.add_argument('--shell_width', type=float, default=2.0,
                        help='thickness of the surface shell (default 2.0 Angstrom)')
    parser.add_argument('--NX', type=int, default=10,
                        help='augmentation level (default 10x)')
    parser.add_argument('--processors', type=int, default=5,
                        help='number of CPUs for grid generation (default 5)')
    parser.add_argument('--num_channels', type=int, default=2,
                        help='number of input channels (2 for Eisenberg phobic + philic or 3 for esp_eisenberg (default 2))')
    parser.add_argument('--eisenberg_dir', type=str, default='',
                        help='directory with precomputed Eisenberg grids')
    parser.add_argument('--fold_idx', default=0, type=int,
                        help='index of the k-fold split (default = 0)')
    parser.add_argument('-v', '--verbose', action='count', default=0)
    args = parser.parse_args()

    os.makedirs(args.output_model_dir, exist_ok=True)

    train(args)
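
# Example invocation for one cross-validation fold (hypothetical paths):
#
#   python train.py --training_data_file viscosity_data.csv \
#       --homology_model_dir ./hm_models --output_model_dir ./models \
#       --num_channels 2 --fold_idx 0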
--------------------------------------------------------------------------------
/pfabnet_eisenberg/trainer.py:
--------------------------------------------------------------------------------
from tqdm import tqdm
import numpy as np
import torch
from torch.utils.data.dataloader import DataLoader
import pickle


class TrainerConfig:
    # optimization parameters
    betas = (0.9, 0.999)
    grad_norm_clip = 1.0
    ckpt_path = None
    history_path = None
    num_workers = 0  # for DataLoader

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)


class Trainer:

    def __init__(self, model, train_dataset, val_dataset, config):
        self.model = model
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.config = config

        self.device = 'cpu'
        if torch.cuda.is_available():
            self.device = torch.cuda.current_device()
            self.model = torch.nn.DataParallel(self.model).to(self.device)

    def save_checkpoint(self):
        # unwrap DataParallel before saving so the checkpoint loads anywhere
        raw_model = self.model.module if hasattr(self.model, "module") else self.model
        torch.save(raw_model.state_dict(), self.config.ckpt_path)

    def train(self):
        model, config = self.model, self.config
        optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate,
                                     betas=config.betas, weight_decay=0.005)

        def run_epoch(split):
            is_train = split == 'train'
            model.train(is_train)

            data = self.train_dataset if is_train else self.val_dataset
            batch_size = config.batch_size

            loader = DataLoader(data, shuffle=is_train, pin_memory=True,
                                batch_size=batch_size,
                                num_workers=config.num_workers)

            losses = []
            pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
            for it, d_it in pbar:
                x, y = d_it

                x = x.to(self.device)
                y = y.to(self.device)

                with torch.set_grad_enabled(is_train):
                    output, loss = model(x, y)
                    # .mean() collapses per-GPU losses under DataParallel
                    loss = loss.mean()
                    losses.append(loss.item())

                if is_train:
                    model.zero_grad()
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
                    optimizer.step()

                    # 'epoch' is captured from the enclosing loop below
                    pbar.set_description(f"epoch {epoch+1} iter {it}: train loss {loss.item():.5f}. "
                                         f"lr {config.learning_rate:e}")

            return float(np.mean(losses))

        best_loss = float('inf')
        # resume the loss history if one exists, otherwise start fresh
        try:
            with open(self.config.history_path, 'rb') as fptr:
                history = pickle.load(fptr)
            start_epoch = len(history['val_loss'])
            history = {'train_loss': history['train_loss'][:start_epoch],
                       'val_loss': history['val_loss'][:start_epoch]}
        except Exception:
            history = {'train_loss': [], 'val_loss': []}
            start_epoch = 0

        for epoch in range(start_epoch, config.max_epochs):
            train_loss = run_epoch('train')
            val_loss = run_epoch('val')
            history['train_loss'].append(train_loss)
            history['val_loss'].append(val_loss)

            with open(self.config.history_path, 'wb') as fptr:
                pickle.dump(history, fptr)

            # checkpoint every epoch until epoch 1950; after that, keep only
            # the model with the best validation loss
            if epoch < 1950:
                self.save_checkpoint()
                continue

            good_model = val_loss < best_loss
            if self.config.ckpt_path is not None and good_model:
                best_loss = val_loss
                self.save_checkpoint()
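
# Minimal smoke test for the Trainer, as a sketch (assumes ViscosityDataset
# accepts an array of (channels, D, D, D) grids and log10 labels, as it is
# used in train.py):
#
#   import numpy as np
#   from dataset import ViscosityDataset
#   from model import ViscosityNet
#   from trainer import Trainer, TrainerConfig
#
#   X = np.random.randn(4, 2, 96, 96, 96).astype(np.float32)
#   y = np.array([0.5, 1.0, 1.5, 2.0])  # log10 viscosity labels
#   ds = ViscosityDataset(X, y)
#   conf = TrainerConfig(max_epochs=1, batch_size=1, learning_rate=1e-5,
#                        num_workers=0, ckpt_path='tmp.pt', history_path='tmp.pkl')
#   Trainer(ViscosityNet(2), ds, ds, conf).train()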
--------------------------------------------------------------------------------
/pfabnet_eisenberg/utils.py:
--------------------------------------------------------------------------------
import os
import random
import multiprocessing
import pickle
import collections

import numpy as np

import torch
from openeye import oechem
from openeye import oegrid
from openeye import oezap
from openeye import oespicoli

try:
    from base import VISCOSITY_KEY, ENTITY_KEY
except Exception:
    from .base import VISCOSITY_KEY, ENTITY_KEY

EISENBERG_GRID_KEY = 'EISENBERG_GRID'

INPUT_MOL_KEY = 'INPUT_MOL'
ROT_X_KEY = 'rot_x'
ROT_Y_KEY = 'rot_y'
ROT_Z_KEY = 'rot_z'
GRID_SPACING_KEY = 'grid_spacing'
GRID_DIM_KEY = 'grid_dim'
SHELL_WIDTH_KEY = 'shell_width'
NX_KEY = 'NX'  # augmentation level
PROCESSORS_KEY = 'processors'
HOMOLOGY_MODEL_DIR_KEY = 'homology_model_dir'
EISENBERG_DIR_KEY = 'eisenberg_dir'

DEFAULT_GRID_PARAMS = {GRID_DIM_KEY: 96, GRID_SPACING_KEY: 0.75,
                       SHELL_WIDTH_KEY: 2.0, NX_KEY: 10}


def get_molecule(input_file):
    ifs = oechem.oemolistream(input_file)
    mol = oechem.OEGraphMol()
    oechem.OEReadMolecule(ifs, mol)
    ifs.close()

    oechem.OEPerceiveResidues(mol)
    oechem.OECenter(mol)

    return mol


def get_eisenberg_grid(params, mol, grid_type='PHOBIC'):
    # Eisenberg hydrophobicity scale; positive = hydrophobic, negative = hydrophilic
    eisenberg_scale = collections.defaultdict(float)
    eisenberg_scale.update({
        'ALA': 0.25, 'CYS': 0.04, 'PHE': 0.61, 'ILE': 0.73, 'LEU': 0.53,
        'PRO': -0.07, 'VAL': 0.54, 'TRP': 0.37, 'TYR': 0.02, 'ASP': -0.72,
        'GLU': -0.62, 'GLY': 0.16, 'HIS': -0.40, 'LYS': -1.1, 'MET': 0.26,
        'ASN': -0.64, 'GLN': -0.69, 'ARG': -1.8, 'SER': -0.26, 'THR': -0.18})

    # keep only the residues matching the requested grid type, and scale each
    # remaining atom's radius by the magnitude of its residue's hydrophobicity
    mol_copy = oechem.OEGraphMol(mol)
    for atom in mol_copy.GetAtoms():
        res = oechem.OEAtomGetResidue(atom)
        aa = res.GetName()
        if grid_type == 'PHOBIC' and eisenberg_scale[aa] < 0.0:
            mol_copy.DeleteAtom(atom)
            continue
        if grid_type == 'PHILIC' and eisenberg_scale[aa] > 0.0:
            mol_copy.DeleteAtom(atom)
            continue

        atom.SetRadius(3 * np.abs(eisenberg_scale[aa]))

    mol_copy.Sweep()
    print(grid_type, mol_copy.NumAtoms(), oechem.OECount(mol_copy, oechem.OEIsHydrogen()))
    grid_spacing = params[GRID_SPACING_KEY]
    grid_dim = params[GRID_DIM_KEY]
    oe_grid = oegrid.OEScalarGrid(grid_dim, grid_dim, grid_dim, 0.0, 0.0, 0.0, grid_spacing)
    oegrid.OEMakeMolecularGaussianGrid(oe_grid, mol_copy)

    return oe_grid


def gen_eisenberg_array(params):
    mol = params[INPUT_MOL_KEY]
    theta_x = params[ROT_X_KEY]
    theta_y = params[ROT_Y_KEY]
    theta_z = params[ROT_Z_KEY]
    grid_spacing = params[GRID_SPACING_KEY]
    grid_dim = params[GRID_DIM_KEY]
    shell_width = params[SHELL_WIDTH_KEY]

    # apply one random rigid rotation (data augmentation)
    oechem.OEEulerRotate(mol, oechem.OEDoubleArray([theta_x, theta_y, theta_z]))

    oechem.OEAssignBondiVdWRadii(mol)

    # electrostatic potential grid from ZAP
    zap = oezap.OEZap()
    zap.SetInnerDielectric(2.0)
    zap.SetGridSpacing(grid_spacing)
    zap.SetMolecule(mol)

    grid = oegrid.OEScalarGrid(grid_dim, grid_dim, grid_dim,
                               0.0, 0.0, 0.0, grid_spacing)
    zap.SetOuterDielectric(80)
    zap.CalcPotentialGrid(grid)

    # grid derived from the molecular surface, used to select voxels that lie
    # within the surface shell
    surf = oespicoli.OESurface()
    oespicoli.OEMakeMolecularSurface(surf, mol)

    surf_grid = oegrid.OEScalarGrid(grid_dim, grid_dim, grid_dim, 0.0, 0.0, 0.0, grid_spacing)
    oespicoli.OEMakeGridFromSurface(surf_grid, surf)

    phobic_grid = get_eisenberg_grid(params, mol, 'PHOBIC')
    philic_grid = get_eisenberg_grid(params, mol, 'PHILIC')

    # keep grid values only inside the surface shell of the given width
    grid_size = grid.GetSize()
    arr = np.zeros(grid_size)
    phobic_arr = np.zeros(grid_size)
    philic_arr = np.zeros(grid_size)
    idx = 0
    for i in range(0, grid_dim):
        for j in range(0, grid_dim):
            for k in range(0, grid_dim):
                v = surf_grid.GetValue(i, j, k)
                if 0 <= v < shell_width:
                    arr[idx] = grid.GetValue(i, j, k)
                    phobic_arr[idx] = phobic_grid.GetValue(i, j, k)
                    philic_arr[idx] = philic_grid.GetValue(i, j, k)

                idx += 1

    arr3d_esp = np.reshape(arr, (grid_dim, grid_dim, grid_dim, 1))
    arr3d_phobic = np.reshape(phobic_arr, (grid_dim, grid_dim, grid_dim, 1))
    arr3d_philic = np.reshape(philic_arr, (grid_dim, grid_dim, grid_dim, 1))
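
    # Arrays leave this function with shape (grid_dim, grid_dim, grid_dim, 1);
    # generate_eisenberg_grids() then applies np.moveaxis(arr, 3, 0) to give
    # the channel-first (1, D, D, D) layout the 3D CNN consumes.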
    return arr3d_esp, arr3d_phobic, arr3d_philic, mol


def prepare_cnn_input(df, args, train=True):
    hm_model_dir = args[HOMOLOGY_MODEL_DIR_KEY]
    if hm_model_dir is None:
        raise Exception('Homology model directory not specified')

    X = []
    y = []
    for row_idx, row in df.iterrows():
        entity = row[ENTITY_KEY]

        mol_file = os.path.join(hm_model_dir, entity + '.mol2')
        if len(args[EISENBERG_DIR_KEY]) > 0:
            esp_grids = get_eisenberg_grids(args, mol_file)
        else:
            esp_grids = generate_eisenberg_grids(args, mol_file)

        if args['num_channels'] == 3:
            combined_grid = [np.concatenate([esp_arr, phobic_arr, philic_arr], axis=0)
                             for esp_arr, phobic_arr, philic_arr, _ in esp_grids]
        else:
            combined_grid = [np.concatenate([phobic_arr, philic_arr], axis=0)
                             for _, phobic_arr, philic_arr, _ in esp_grids]

        X.extend(combined_grid)
        if train:
            # the model is trained on log10 viscosity; every augmented copy
            # of an entity shares the same label
            log_visc = np.log10(row[VISCOSITY_KEY])
            y.extend([log_visc] * args[NX_KEY])
        else:
            y.extend([0.0] * args[NX_KEY])

    return np.array(X), np.array(y)


def get_eisenberg_grids(args, mol_file):
    # load precomputed grids (one pickle per rotation) along with the
    # matching rotated structure
    eisenberg_dir = args[EISENBERG_DIR_KEY]
    eisenberg_array_output = []
    for i in range(args[NX_KEY]):
        base_name = os.path.basename(mol_file)
        rotation_dir = os.path.join(eisenberg_dir, 'rotation_%d' % (i + 1))
        with open(os.path.join(rotation_dir, '%s.pyb' % base_name.split('.mol2')[0]), 'rb') as fptr:
            mol = get_molecule(os.path.join(rotation_dir, base_name))
            esp_arr, phobic_arr, philic_arr = pickle.load(fptr)
            eisenberg_array_output.append((esp_arr, phobic_arr, philic_arr, mol))

    return eisenberg_array_output


def generate_eisenberg_grids(args, mol_file):
    mol = get_molecule(mol_file)

    # NX random orientations, computed in parallel
    params = []
    for i in range(args[NX_KEY]):
        rot_x = np.random.uniform(0, 180)
        rot_y = np.random.uniform(0, 180)
        rot_z = np.random.uniform(0, 180)

        params.append({INPUT_MOL_KEY: oechem.OEGraphMol(mol), ROT_X_KEY: rot_x,
                       ROT_Y_KEY: rot_y, ROT_Z_KEY: rot_z,
                       GRID_DIM_KEY: args[GRID_DIM_KEY],
                       GRID_SPACING_KEY: args[GRID_SPACING_KEY],
                       SHELL_WIDTH_KEY: args[SHELL_WIDTH_KEY]})

    processors = min(args[PROCESSORS_KEY], multiprocessing.cpu_count())
    p = multiprocessing.Pool(processes=processors)
    eisenberg_array_output = p.map(gen_eisenberg_array, params)
    p.close()

    # move the channel axis to the front: (D, D, D, 1) -> (1, D, D, D)
    output = [(np.moveaxis(esp_array, 3, 0), np.moveaxis(phobic_array, 3, 0),
               np.moveaxis(philic_array, 3, 0), output_mol)
              for esp_array, phobic_array, philic_array, output_mol in eisenberg_array_output]
    return output


def prepare_training_input(df, args):
    return prepare_cnn_input(df, args, train=True)


def prepare_test_input(df, args):
    return prepare_cnn_input(df, args, train=False)


def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
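
# Sketch of calling the grid generator directly (hypothetical structure file;
# the CLI scripts normally assemble these parameters from argparse):
#
#   params = dict(DEFAULT_GRID_PARAMS)
#   params[PROCESSORS_KEY] = 2
#   grids = generate_eisenberg_grids(params, 'Ab8.mol2')
#   len(grids)  # NX tuples of (esp, phobic, philic, rotated mol)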
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
# packages in environment at /home/X/.conda/envs/X-env:
#
# Name  Version  Build  Channel
_libgcc_mutex  0.1  main
anyio  3.3.4  py39hf3d152e_0  conda-forge
argon2-cffi  20.1.0  py39h27cfd23_1
async_generator  1.10  py_0  conda-forge
attrs  21.2.0  pyhd8ed1ab_0  conda-forge
babel  2.9.1  pyh44b312d_0  conda-forge
backcall  0.2.0  pyh9f0ad1d_0  conda-forge
backports  1.0  py_2  conda-forge
backports.functools_lru_cache  1.6.4  pyhd8ed1ab_0  conda-forge
bcbio-gff  0.6.7  pypi_0  pypi
biopython  1.79  pypi_0  pypi
blas  1.0  mkl
bleach  4.1.0  pyhd8ed1ab_0  conda-forge
bottleneck  1.3.2  py39hdd57654_1
brotli  1.0.9  he6710b0_2
brotlipy  0.7.0  py39h27cfd23_1003
bzip2  1.0.8  h7b6447c_0
ca-certificates  2021.10.26  h06a4308_2
captum  0.4.1  pypi_0  pypi
certifi  2021.10.8  py39h06a4308_0
cffi  1.14.6  py39h400218f_0
charset-normalizer  2.0.4  pyhd3eb1b0_0
click  8.1.3  pypi_0  pypi
cloudpickle  2.0.0  pypi_0  pypi
conda  4.10.3  py39h06a4308_0
conda-package-handling  1.7.3  py39h27cfd23_1
cryptography  3.4.8  py39hd23ed53_0
cudatoolkit  10.1.243  h6bb024c_0
cx-oracle  8.3.0  pypi_0  pypi
cycler  0.10.0  py39h06a4308_0
dask  2021.11.1  pypi_0  pypi
dbus  1.13.18  hb2f20db_0
decorator  5.1.0  pyhd8ed1ab_0  conda-forge
defusedxml  0.7.1  pyhd8ed1ab_0  conda-forge
entrypoints  0.3  pyhd8ed1ab_1003  conda-forge
et-xmlfile  1.1.0  pypi_0  pypi
expat  2.4.1  h2531618_2
ffmpeg  4.3  hf484d3e_0  pytorch
flask  2.2.1  pypi_0  pypi
fontconfig  2.13.1  h6c09931_0
fonttools  4.25.0  pyhd3eb1b0_0
freetype  2.10.4  h5ab3b9f_0
fsspec  2021.11.0  pypi_0  pypi
giflib  5.2.1  h7b6447c_0
glib  2.69.1  h5202010_0
gmp  6.2.1  h2531618_2
gnutls  3.6.15  he1e5248_0
gpytorch  1.6.0  pypi_0  pypi
gst-plugins-base  1.14.0  h8213a91_2
gstreamer  1.14.0  h28cd5cc_2
icu  58.2  he6710b0_3
idna  3.2  pyhd3eb1b0_0
importlib-metadata  4.8.1  py39hf3d152e_0  conda-forge
iniconfig  1.1.1  pyhd3eb1b0_0
intel-openmp  2021.3.0  h06a4308_3350
ipykernel  5.5.5  py39hef51801_0  conda-forge
ipython  7.28.0  py39hef51801_0  conda-forge
ipython_genutils  0.2.0  py_1  conda-forge
ipywidgets  7.6.5  pypi_0  pypi
itsdangerous  2.1.2  pypi_0  pypi
jedi  0.18.0  py39hf3d152e_2  conda-forge
jinja2  3.1.2  pypi_0  pypi
joblib  1.0.1  pyhd3eb1b0_0
jpeg  9d  h7f8727e_0
json5  0.9.5  pyh9f0ad1d_0  conda-forge
jsonschema  4.1.2  pyhd8ed1ab_0  conda-forge
jupyter_client  7.0.6  pyhd8ed1ab_0  conda-forge
jupyter_core  4.8.1  py39hf3d152e_0  conda-forge
jupyter_server  1.11.1  pyhd8ed1ab_0  conda-forge
jupyterlab  3.2.1  pyhd8ed1ab_0  conda-forge
jupyterlab-widgets  1.0.2  pypi_0  pypi
jupyterlab_pygments  0.1.2  pyh9f0ad1d_0  conda-forge
jupyterlab_server  2.8.2  pyhd8ed1ab_0  conda-forge
kiwisolver  1.3.1  py39h2531618_0
lame  3.100  h7b6447c_0
lcms2  2.12  h3be6417_0
ld_impl_linux-64  2.35.1  h7274673_9
libffi  3.3  he6710b0_2
libgcc-ng  9.1.0  hdf63c60_0
libgfortran-ng  7.3.0  hdf63c60_0
libiconv  1.15  h63c8f33_5
libidn2  2.3.2  h7f8727e_0
libpng  1.6.37  hbc83047_0
libsodium  1.0.18  h36c2ea0_1  conda-forge
libstdcxx-ng  9.1.0  hdf63c60_0
libtasn1  4.16.0  h27cfd23_0
libtiff  4.2.0  h85742a9_0
libunistring  0.9.10  h27cfd23_0
libuuid  1.0.3  h7f8727e_2
libuv  1.40.0  h7b6447c_0
libwebp  1.2.0  h89dd481_0
libwebp-base  1.2.0  h27cfd23_0
libxcb  1.14  h7b6447c_0
libxml2  2.9.10  hb55368b_3
locket  0.2.1  pypi_0  pypi
lxml  4.6.4  pypi_0  pypi
lz4-c  1.9.3  h295c915_1
markupsafe  2.1.1  pypi_0  pypi
matplotlib  3.4.3  py39h06a4308_0
matplotlib-base  3.4.3  py39hbbc1b5f_0
matplotlib-inline  0.1.3  pyhd8ed1ab_0  conda-forge
mistune  0.8.4  py39hbd71b63_1002  conda-forge
mkl  2021.3.0  h06a4308_520
mkl-service  2.4.0  py39h7f8727e_0
mkl_fft  1.3.1  py39hd3c417c_0
mkl_random  1.2.2  py39h51133e4_0
more-itertools  8.8.0  pyhd3eb1b0_0
munkres  1.1.4  py_0
nbclassic  0.3.2  pyhd8ed1ab_0  conda-forge
nbclient  0.5.4  pyhd8ed1ab_0  conda-forge
nbconvert  6.2.0  py39hf3d152e_0  conda-forge
nbformat  5.1.3  pyhd8ed1ab_0  conda-forge
ncurses  6.2  he6710b0_1
nest-asyncio  1.5.1  pyhd8ed1ab_0  conda-forge
nettle  3.7.3  hbbd107a_1
ninja  1.10.2  hff7bd54_1
notebook  6.4.5  pyha770c72_0  conda-forge
numexpr  2.7.3  py39h22e1b3c_1
numpy  1.21.2  py39h20f2e39_0
numpy-base  1.21.2  py39h79a1101_0
olefile  0.46  pyhd3eb1b0_0
openeye-toolkits  2021.1.1  py39_0  openeye
openh264  2.1.0  hd408876_0
openpyxl  3.0.9  pypi_0  pypi
openssl  1.1.1l  h7f8727e_0
packaging  21.0  pyhd8ed1ab_0  conda-forge
pandas  1.3.3  py39h8c16a72_0
pandoc  2.14.2  h7f98852_0  conda-forge
pandocfilters  1.5.0  pyhd8ed1ab_0  conda-forge
parso  0.8.2  pyhd8ed1ab_0  conda-forge
partd  1.2.0  pypi_0  pypi
pcre  8.45  h295c915_0
pexpect  4.8.0  pyh9f0ad1d_2  conda-forge
pickleshare  0.7.5  py_1003  conda-forge
pillow  8.4.0  py39h5aabda8_0
pip  21.2.4  py39h06a4308_0
pluggy  0.13.1  py39h06a4308_0
prometheus_client  0.11.0  pyhd8ed1ab_0  conda-forge
prompt-toolkit  3.0.21  pyha770c72_0  conda-forge
psutil  5.8.0  pypi_0  pypi
ptyprocess  0.7.0  pyhd3deb0d_0  conda-forge
py  1.10.0  pyhd3eb1b0_0
pycosat  0.6.3  py39h27cfd23_0
pycparser  2.20  py_2
pygments  2.10.0  pyhd8ed1ab_0  conda-forge
pyopenssl  20.0.1  pyhd3eb1b0_1
pyparsing  2.4.7  pyhd3eb1b0_0
pyqt  5.9.2  py39h2531618_6
pyrsistent  0.17.3  py39hbd71b63_1  conda-forge
pysocks  1.7.1  py39h06a4308_0
pytest  6.2.4  py39h06a4308_2
python  3.9.7  h12debd9_1
python-dateutil  2.8.2  pyhd3eb1b0_0
python-docx  0.8.11  pypi_0  pypi
python_abi  3.9  2_cp39  conda-forge
pytorch-mutex  1.0  cuda  pytorch
pytz  2021.3  pyhd3eb1b0_0
pyyaml  6.0  pypi_0  pypi
pyzmq  19.0.2  py39hb69f2a1_2  conda-forge
qt  5.9.7  h5867ecd_1
readline  8.1  h27cfd23_0
requests  2.26.0  pyhd3eb1b0_0
requests-unixsocket  0.2.0  py_0  conda-forge
ruamel_yaml  0.15.100  py39h27cfd23_0
scikit-learn  0.24.2  py39ha9443f7_0
scipy  1.6.2  py39had2a1c9_1
selfies  2.0.0  pypi_0  pypi
send2trash  1.8.0  pyhd8ed1ab_0  conda-forge
setuptools  58.0.4  py39h06a4308_0
sip  4.19.13  py39h2531618_0
six  1.16.0  pyhd3eb1b0_0
sniffio  1.2.0  py39hf3d152e_1  conda-forge
sqlite  3.36.0  hc218d9a_0
swifter  1.0.9  pypi_0  pypi
terminado  0.12.1  py39hf3d152e_0  conda-forge
testpath  0.5.0  pyhd8ed1ab_0  conda-forge
threadpoolctl  2.2.0  pyh0d69192_0
tk  8.6.11  h1ccaba5_0
toml  0.10.2  pyhd3eb1b0_0
toolz  0.11.2  pypi_0  pypi
torch  1.10.0  pypi_0  pypi
torchvision  0.2.2  py_3  pytorch
tornado  6.1  py39h27cfd23_0
tqdm  4.62.2  pyhd3eb1b0_1
traitlets  5.1.0  pyhd8ed1ab_0  conda-forge
typing_extensions  3.10.0.2  pyh06a4308_0
tzdata  2021a  h5d7bf9c_0
urllib3  1.26.7  pyhd3eb1b0_0
viennarna  2.3.3  hfc679d8_2  bioconda
wcwidth  0.2.5  pyh9f0ad1d_2  conda-forge
webencodings  0.5.1  py_1  conda-forge
websocket-client  0.57.0  py39hf3d152e_4  conda-forge
werkzeug  2.2.1  pypi_0  pypi
wheel  0.37.0  pyhd3eb1b0_1
widgetsnbextension  3.5.2  pypi_0  pypi
xz  5.2.5  h7b6447c_0
yaml  0.2.5  h7b6447c_0
zeromq  4.3.4  h2531618_0
zipp  3.6.0  pyhd8ed1ab_0  conda-forge
zlib  1.2.11  h7b6447c_3
zstd  1.4.9  haebb681_0
--------------------------------------------------------------------------------