├── LICENSE ├── README.rst ├── .gitignore └── notebooks └── Notebook_1_MIL_for_conformers.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 KFU ChemoInformatics and Molecular Modeling Laboratory 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | Conformer Multi-Instance Machine Learning 2 | ========================================================== 3 | This repository contains the Python source code from the `paper `_. 4 | 5 | Overview 6 | ------------ 7 | In Multi-Instance Learning, each training object is represented by several feature 8 | vectors (bag) and a label. 
In our implementation, an example (i.e., a molecule) is represented 9 | by a bag of instances (i.e., a set of conformers), and a label (a bioactivity value) is available 10 | only for a bag (a molecule), but not for individual instances (conformations). 11 | 12 | Installation 13 | ------------ 14 | .. code-block:: bash 15 | 16 | pip install qsarmil 17 | 18 | Supplementary packages 19 | ---------------------- 20 | The modelling pipeline is based on two supplementary packages: 21 | 22 | - `QSARmil `_ – Molecular multi-instance machine learning 23 | - `milearn `_ – Multi-instance machine learning in Python 24 | 25 | Refer to these packages for more examples and application cases. 26 | 27 | How To Use 28 | ------------ 29 | Original datasets can be found at `datasets`. The folder contains 200 datasets on ligand bioactivity extracted from ChEMBL. 30 | Follow the `Notebook `_ for a usage example. 31 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | .idea 3 | .ipynb_checkpoints 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | pip-wheel-metadata/ 26 | share/python-wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | MANIFEST 31 | 32 | # PyInstaller 33 | # Usually these files are written by a python script from a template 34 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | db.sqlite3 64 | db.sqlite3-journal 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | target/ 78 | 79 | # Jupyter Notebook 80 | .ipynb_checkpoints 81 | 82 | # IPython 83 | profile_default/ 84 | ipython_config.py 85 | 86 | # pyenv 87 | .python-version 88 | 89 | # pipenv 90 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 91 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 92 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 93 | # install all needed dependencies. 94 | #Pipfile.lock 95 | 96 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 97 | __pypackages__/ 98 | 99 | # Celery stuff 100 | celerybeat-schedule 101 | celerybeat.pid 102 | 103 | # SageMath parsed files 104 | *.sage.py 105 | 106 | # Environments 107 | .env 108 | .venv 109 | env/ 110 | venv/ 111 | ENV/ 112 | env.bak/ 113 | venv.bak/ 114 | 115 | # Spyder project settings 116 | .spyderproject 117 | .spyproject 118 | 119 | # Rope project settings 120 | .ropeproject 121 | 122 | # mkdocs documentation 123 | /site 124 | 125 | # mypy 126 | .mypy_cache/ 127 | .dmypy.json 128 | dmypy.json 129 | 130 | # Pyre type checker 131 | .pyre/ 132 | -------------------------------------------------------------------------------- /notebooks/Notebook_1_MIL_for_conformers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction\n", 8 | "\n", 9 | "Multi-instance (MI) machine learning approaches can be used to solve the issues of representation of each molecule by multiple conformations (instances) and automatic selection of the most relevant ones. In the multi-instance approach, an example (i.e., a molecule) is presented by a bag of instances (i.e., a set of conformations), and a label (a molecule property value) is available only for a bag (a molecule), but not for individual instances (conformations).\n", 10 | "\n", 11 | "In this study, we have implemented several multi-instance algorithms, both conventional and based on deep learning, and investigated their performance. We have compared the performance of MI-QSAR models with those based on the classical single-instance QSAR (SI-QSAR) approach in which each molecule is encoded by either 2D descriptors computed for the corresponding molecular graph or 3D descriptors issued for a single lowest-energy conformation. " 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## 1. 
Load dataset\n", 19 | "\n", 20 | "The example datasets contain molecular structures (SMILES) and measured bioactivity values (pKi or IC50) – the higher the better." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "from rdkit import Chem\n", 32 | "\n", 33 | "from sklearn.metrics import r2_score, accuracy_score\n", 34 | "from sklearn.model_selection import train_test_split\n", 35 | "\n", 36 | "# Data\n", 37 | "from huggingface_hub import hf_hub_download" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "REPO_ID = \"KagakuData/notebooks\"\n", 47 | "\n", 48 | "csv_path = hf_hub_download(REPO_ID, filename=\"chembl_200/CHEMBL279.csv\", repo_type=\"dataset\")\n", 49 | "data = pd.read_csv(csv_path, header=None)\n", 50 | "\n", 51 | "data_train, data_test = train_test_split(data, test_size=0.2)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 5, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "smi_train, prop_train = data_train[0].to_list(), data_train[2].to_list()\n", 61 | "smi_test, prop_test = data_test[0].to_list(), data_test[2].to_list()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 6, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "mols_train, y_train = [], []\n", 71 | "for smi, prop in zip(smi_train, prop_train):\n", 72 | " mol = Chem.MolFromSmiles(smi)\n", 73 | " if mol:\n", 74 | " mols_train.append(mol)\n", 75 | " y_train.append(prop)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 7, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "mols_test, y_test = [], []\n", 85 | "for smi, prop in zip(smi_test, prop_test):\n", 86 | " mol = Chem.MolFromSmiles(smi)\n", 87 | " if mol:\n", 88 | " mols_test.append(mol)\n", 89 | " 
y_test.append(prop)" 90 | ] 91 | }, 92 | { 93 | "attachments": {}, 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## 1.5 Reduce the dataset size for a faster pipeline (for playing around)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# mols_train, y_train = mols_train[:80], y_train[:80]\n", 107 | "# mols_test, y_test = mols_test[:20], y_test[:20]" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## 2. Conformer generation\n", 115 | "\n", 116 | "For each molecule, an ensemble of conformers is generated. Then, molecules for which conformer generation failed are filtered out from both the training and test sets. Generated conformers can be accessed with mol.GetConformers(), or individually with mol.GetConformer(0)." 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 9, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "from qsarmil.conformer import RDKitConformerGenerator\n", 126 | "\n", 127 | "from qsarmil.utils.logging import FailedConformer, FailedDescriptor" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 10, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "conf_gen = RDKitConformerGenerator(num_conf=10, num_cpu=40)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 11, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stderr", 146 | "output_type": "stream", 147 | "text": [ 148 | "Generating conformers: 100%|████████████████████████████████████████████████████████████| 80/80 [00:04<00:00, 17.42it/s]\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "confs_train = conf_gen.run(mols_train)\n", 154 | "\n", 155 | "tmp = [(c, y) for c, y in zip(confs_train, y_train) if not isinstance(c, FailedConformer)]\n", 156 | "confs_train, y_train = zip(*tmp) \n", 157 | "confs_train, y_train = 
list(confs_train), list(y_train)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 12, 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "name": "stderr", 167 | "output_type": "stream", 168 | "text": [ 169 | "Generating conformers: 100%|████████████████████████████████████████████████████████████| 20/20 [00:02<00:00, 8.19it/s]\n" 170 | ] 171 | } 172 | ], 173 | "source": [ 174 | "confs_test = conf_gen.run(mols_test)\n", 175 | "\n", 176 | "tmp = [(c, y) for c, y in zip(confs_test, y_test) if not isinstance(c, FailedConformer)]\n", 177 | "confs_test, y_test = zip(*tmp) \n", 178 | "confs_test, y_test = list(confs_test), list(y_test)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## 3. Descriptor calculation\n", 186 | "\n", 187 | "Then, 3D descriptors are calculated for each molecule with associated conformers. Here, a descriptor wrapper is used, which is designed to apply descriptor calculators from external packages. The result is a list of 2D arrays (bags), one per molecule. Finally, the descriptors are scaled." 
188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 13, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "from qsarmil.descriptor.rdkit import (RDKitGEOM, \n", 197 | " RDKitAUTOCORR, \n", 198 | " RDKitRDF, \n", 199 | " RDKitMORSE, \n", 200 | " RDKitWHIM, \n", 201 | " RDKitGETAWAY)\n", 202 | "\n", 203 | "from molfeat.calc import Pharmacophore3D, USRDescriptors, ElectroShapeDescriptors\n", 204 | "\n", 205 | "from qsarmil.descriptor.wrapper import DescriptorWrapper\n", 206 | "\n", 207 | "from milearn.preprocessing import BagMinMaxScaler" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 14, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "desc_calc = DescriptorWrapper(RDKitRDF())" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 15, 222 | "metadata": { 223 | "scrolled": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "x_train = desc_calc.transform(confs_train)\n", 228 | "x_test = desc_calc.transform(confs_test)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 16, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "scaler = BagMinMaxScaler()\n", 238 | "\n", 239 | "scaler.fit(x_train)\n", 240 | "\n", 241 | "x_train_scaled = scaler.transform(x_train)\n", 242 | "x_test_scaled = scaler.transform(x_test)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "## 4. 
Mini-benchmark" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 18, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "import logging\n", 259 | "import warnings\n", 260 | "warnings.filterwarnings(\"ignore\")\n", 261 | "logging.getLogger(\"pytorch_lightning\").setLevel(logging.ERROR)\n", 262 | "logging.getLogger(\"lightning\").setLevel(logging.ERROR)\n", 263 | "\n", 264 | "import time\n", 265 | "import torch\n", 266 | "import random\n", 267 | "\n", 268 | "import numpy as np\n", 269 | "import pandas as pd\n", 270 | "\n", 271 | "import matplotlib.pyplot as plt\n", 272 | "\n", 273 | "# Preprocessing\n", 274 | "from milearn.preprocessing import BagMinMaxScaler\n", 275 | "\n", 276 | "# Network hparams\n", 277 | "from milearn.network.module.hopt import DEFAULT_PARAM_GRID\n", 278 | "\n", 279 | "# MIL wrappers\n", 280 | "from milearn.network.regressor import BagWrapperMLPNetworkRegressor, InstanceWrapperMLPNetworkRegressor\n", 281 | "from milearn.network.classifier import BagWrapperMLPNetworkClassifier, InstanceWrapperMLPNetworkClassifier\n", 282 | "\n", 283 | "# MIL networks\n", 284 | "from milearn.network.regressor import (InstanceNetworkRegressor,\n", 285 | " BagNetworkRegressor,\n", 286 | " AdditiveAttentionNetworkRegressor,\n", 287 | " SelfAttentionNetworkRegressor,\n", 288 | " HopfieldAttentionNetworkRegressor,\n", 289 | " DynamicPoolingNetworkRegressor)\n", 290 | "# Utils\n", 291 | "from sklearn.metrics import r2_score, accuracy_score\n", 292 | "from sklearn.model_selection import train_test_split" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 19, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "regressor_list = [\n", 302 | "\n", 303 | " # wrapper mil networks\n", 304 | " (\"MeanBagWrapperMLPNetworkRegressor\", BagWrapperMLPNetworkRegressor(pool=\"mean\")),\n", 305 | " (\"MeanInstanceWrapperMLPNetworkRegressor\", InstanceWrapperMLPNetworkRegressor(pool=\"mean\")),\n", 306 
| " \n", 307 | " # classic mil networks\n", 308 | " (\"MeanBagNetworkRegressor\", BagNetworkRegressor(pool=\"mean\")),\n", 309 | " (\"MeanInstanceNetworkRegressor\", InstanceNetworkRegressor(pool=\"mean\")),\n", 310 | "\n", 311 | " # attention mil networks\n", 312 | " (\"AdditiveAttentionNetworkRegressor\", AdditiveAttentionNetworkRegressor()),\n", 313 | " (\"SelfAttentionNetworkRegressor\", SelfAttentionNetworkRegressor()),\n", 314 | " (\"HopfieldAttentionNetworkRegressor\", HopfieldAttentionNetworkRegressor()),\n", 315 | "\n", 316 | " # other mil networks\n", 317 | " (\"DynamicPoolingNetworkRegressor\", DynamicPoolingNetworkRegressor()),\n", 318 | " ]" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 21, 324 | "metadata": { 325 | "scrolled": true 326 | }, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "MeanBagWrapperMLPNetworkRegressor\n", 333 | "MeanInstanceWrapperMLPNetworkRegressor\n", 334 | "MeanBagNetworkRegressor\n", 335 | "MeanInstanceNetworkRegressor\n", 336 | "AdditiveAttentionNetworkRegressor\n", 337 | "SelfAttentionNetworkRegressor\n", 338 | "HopfieldAttentionNetworkRegressor\n", 339 | "DynamicPoolingNetworkRegressor\n" 340 | ] 341 | } 342 | ], 343 | "source": [ 344 | "res_df = pd.DataFrame()\n", 345 | "for method_name, model in regressor_list:\n", 346 | "\n", 347 | " # model.hopt(x_train_scaled, y_train, param_grid=DEFAULT_PARAM_GRID, verbose=True)\n", 348 | " model.fit(x_train_scaled, y_train)\n", 349 | " y_pred = model.predict(x_test_scaled)\n", 350 | " \n", 351 | " res_df.loc[method_name, \"R2\"] = r2_score(y_test, y_pred)" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 22, 357 | "metadata": {}, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/html": [ 362 | "
" 419 | ], 420 | "text/plain": [ 421 | " R2\n", 422 | "DynamicPoolingNetworkRegressor 0.528351\n", 423 | "MeanBagWrapperMLPNetworkRegressor 0.422525\n", 424 | "MeanInstanceWrapperMLPNetworkRegressor 0.411559\n", 425 | "SelfAttentionNetworkRegressor 0.401706\n", 426 | "MeanBagNetworkRegressor 0.370688\n", 427 | "MeanInstanceNetworkRegressor 0.370688\n", 428 | "AdditiveAttentionNetworkRegressor 0.366663\n", 429 | "HopfieldAttentionNetworkRegressor 0.340709" 430 | ] 431 | }, 432 | "execution_count": 22, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "res_df.sort_values(by=\"R2\", ascending=False)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [] 454 | } 455 | ], 456 | "metadata": { 457 | "kernelspec": { 458 | "display_name": "tmp", 459 | "language": "python", 460 | "name": "tmp" 461 | }, 462 | "language_info": { 463 | "codemirror_mode": { 464 | "name": "ipython", 465 | "version": 3 466 | }, 467 | "file_extension": ".py", 468 | "mimetype": "text/x-python", 469 | "name": "python", 470 | "nbconvert_exporter": "python", 471 | "pygments_lexer": "ipython3", 472 | "version": "3.10.18" 473 | } 474 | }, 475 | "nbformat": 4, 476 | "nbformat_minor": 4 477 | } 478 | --------------------------------------------------------------------------------