├── .gitignore ├── LICENSE ├── README.md ├── doc ├── Makefile ├── README.md ├── conf.py └── index.rst ├── examples └── chembl_et-a_antagonists.zip ├── html └── mol_table │ ├── css │ ├── lit_style.css │ └── style.css │ └── index.html ├── make_doc.sh ├── rdkit_ipynb_tools ├── __init__.py ├── bokeh_tools.py ├── clustering.py ├── file_templ.py ├── hc_tools.py ├── html_templates.py ├── nb_tools.py ├── pandas_tools.py ├── pipeline.py ├── resources │ └── clustering │ │ ├── css │ │ ├── collapsible_list.css │ │ ├── index_style.css │ │ ├── lit_style.css │ │ └── style.css │ │ └── lib │ │ ├── btn_callbacks.js │ │ ├── folding.js │ │ └── jquery.min.js ├── sar.py └── tools.py └── tutorial ├── chembl_et-a_ant.sdf ├── chembl_et-a_antagonists.txt.gz ├── mol_grid.html ├── mol_table.html ├── pipeline.log ├── sim_map.html ├── tutorial_sar.ipynb └── tutorial_tools.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | .kdev4/ 3 | .vscode/ 4 | doc/_build/ 5 | tutorial/.ipynb_checkpoints/ 6 | tutorial/lib 7 | tutorial/chembl_et-a_antagonists.txt 8 | tutorial/chembl_et-a_ant_active.sdf 9 | *.pkl 10 | *.png 11 | *.mrv 12 | *.mol 13 | *.pyc 14 | .spyderproject 15 | branch_info.txt 16 | commit_msg.txt 17 | snippets.txt 18 | tags 19 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Axel Pahl 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The 
above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RDKit IPython Tools 2 | by Axel Pahl 3 | 4 | A set of tools to use with the Open Source Cheminformatics toolkit 5 | [RDKit](http://www.rdkit.org) in the Jupyter Notebook.
6 | Written for Python 3; only tested on Linux (Ubuntu 16.04) 7 | with the conda install of RDKit. 8 | 9 | # Module tools 10 | 11 | A Mol_List class was introduced, which is a subclass of a Python list for holding RDKit molecule objects and allows direct access to a lot of the RDKit functionality. 12 | It is meant to be used with the Jupyter Notebook and includes, among others: 13 | * display of the Mol_List 14 | * as HTML table, nested table or grid 15 | * display of a summary including number of records and min, max, mean, median for numeric properties 16 | * display of correlations between the Mol_List's properties 17 | (using np.corrcoef, this allows getting a quick overview of which properties correlate with each other) 18 | * methods for sorting, searching (by property or substructure) and filtering the Mol_List 19 | * methods for renaming, reordering and calculating properties 20 | * direct plotting of properties as publication-grade [Highcharts](http://www.highcharts.com/) *or* [Bokeh](http://bokeh.pydata.org/en/latest/) plots with **structure tooltips** (!). 21 | * the plotting functionalities reside in their own module and can also be used for plotting Pandas dataframes and Python dicts. 22 | * further development will focus on Bokeh because of its more Pythonic interface 23 | 24 | 25 | ## Other functions in the tools module: 26 | - *jsme*: Display Peter Ertl's [Javascript Molecule Editor](http://peter-ertl.com/jsme/) to enter a molecule directly in the IPython notebook (*how cool is that??*).
27 | The module tries to find a local version of JSME in `lib/` and, when it fails to do so, 28 | loads a web version of the editor. I use a central lib/ folder and create symlinks 29 | to it in all notebook folders where I want to use these libraries. 30 | 31 | ...plus many others. 32 | 33 | # Module pipeline 34 | 35 | A pipelining workflow using Python generators, mainly for RDKit and large compound sets. 36 | The use of generators allows working with arbitrarily large data sets; the memory usage at any given time stays low. 37 | 38 | Example use: 39 | 40 | >>> from rdkit_ipynb_tools import pipeline as p 41 | >>> s = p.Summary() 42 | >>> rd = p.start_csv_reader("test_data_b64.csv.gz", summary=s) 43 | >>> b64 = p.pipe_mol_from_b64(rd, summary=s) 44 | >>> filt = p.pipe_mol_filter(b64, "[H]c2c([H])c1ncoc1c([H])c2C(N)=O", summary=s) 45 | >>> p.stop_sdf_writer(filt, "test.sdf", summary=s) 46 | 47 | or, using the pipe function: 48 | 49 | >>> s = p.Summary() 50 | >>> rd = p.start_sdf_reader("test.sdf", summary=s) 51 | >>> p.pipe(rd, 52 | >>>     p.pipe_keep_largest_fragment, 53 | >>>     (p.pipe_neutralize_mol, {"summary": s}), 54 | >>>     (p.pipe_keep_props, ["Ordernumber", "NP_Score"]), 55 | >>>     (p.stop_csv_writer, "test.csv", {"summary": s}) 56 | >>> ) 57 | 58 | The progress of the pipeline is displayed as an HTML table in the Notebook and can also be followed in a separate terminal with: `watch -n 2 cat pipeline.log`. 
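The generator mechanics behind such a pipeline can be sketched in a few lines. This is an illustrative toy, not the module's actual implementation, and the component names below (`start_stream`, `pipe_scale`, `pipe_keep_positive`, `stop_collect`) are made up for the sketch:

```python
def start_stream(records):
    """Source component: yield records one at a time instead of loading all."""
    for rec in records:
        yield rec

def pipe_scale(stream, factor):
    """Running component: lazily transform each record."""
    for rec in stream:
        yield rec * factor

def pipe_keep_positive(stream):
    """Running component: lazily filter records."""
    for rec in stream:
        if rec > 0:
            yield rec

def stop_collect(stream):
    """Sink component: consuming the stream pulls records through the pipe."""
    return list(stream)

result = stop_collect(pipe_keep_positive(pipe_scale(start_stream([-2, 1, 3]), 2)))
# result is [2, 6]
```

Nothing runs until the sink iterates; each record flows through every component before the next record is read from the source, which is why the memory footprint stays flat regardless of data set size.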
59 | 60 | ## Currently Available Pipeline Components: 61 | | Starting | Running | Stopping | 62 | |----------------------------|----------------------------|---------------------------| 63 | | start_cache_reader | pipe_calc_props | stop_cache_writer | 64 | | start_csv_reader | pipe_custom_filter | stop_count_records | 65 | | start_mol_csv_reader | pipe_custom_man | stop_csv_writer | 66 | | start_sdf_reader | pipe_do_nothing | stop_df_from_stream | 67 | | start_stream_from_dict | pipe_has_prop_filter | stop_dict_from_stream | 68 | | start_stream_from_mol_list | pipe_id_filter | stop_mol_list_from_stream | 69 | | | pipe_inspect_stream | stop_sdf_writer | 70 | | | pipe_join_data_from_file | | 71 | | | pipe_keep_largest_fragment | | 72 | | | pipe_keep_props | | 73 | | | pipe_merge_data | | 74 | | | pipe_mol_filter | | 75 | | | pipe_mol_from_b64 | | 76 | | | pipe_mol_from_smiles | | 77 | | | pipe_mol_to_b64 | | 78 | | | pipe_mol_to_smiles | | 79 | | | pipe_neutralize_mol | | 80 | | | pipe_remove_props | | 81 | | | pipe_rename_prop | | 82 | | | pipe_sim_filter | | 83 | | | pipe_sleep | | 84 | 85 | 86 | Limitation: unlike in other pipelining tools, because of the nature of Python generators, the pipeline cannot be branched. 87 | 88 | # Other Modules 89 | ## Clustering 90 | Fully usable; documentation still needs to be written. 91 | Please refer to the docstrings until then. 92 | 93 | ## Scaffolds 94 | New, work in progress, **not** yet usable. Has been moved to the scaffolds branch. 95 | 96 | # Tutorial 97 | Much of the functionality is shown in the [tools tutorial notebook](tutorial/tutorial_tools.ipynb). 98 | SAR functionality is shown in the [SAR tutorial notebook](tutorial/tutorial_sar.ipynb). The SAR module is new and a work in progress. 99 | 100 | # Documentation 101 | The module documentation can be built with Sphinx using the `make_doc.sh` script. 102 | 103 | # Installation 104 | ## Requirements 105 | The recommended way to use this project is via conda. 106 | 107 | 1. Python 3 108 | 1. 
[RDKit](http://www.rdkit.org/) 109 | 1. Jupyter Notebook 110 | 1. ipywidgets 111 | 112 | ## Highly recommended 113 | 1. cairo (via conda or pip) and cairocffi (only via pip) 114 | to get decent-looking structures 115 | 1. [Bokeh](http://bokeh.pydata.org/en/latest/) for high-quality data plots 116 | with structure tooltips 117 | 118 | After installing the requirements, 119 | clone this repo; the rdkit_ipynb_tools can then be used by including 120 | the project's base directory (`rdkit_ipynb_tools`) 121 | in Python's import path (I actually prefer this to using setuptools, 122 | because a simple `git pull` will get you the newest version).
123 | This can be achieved by one of the following:
124 | * If you use conda (recommended), use [conda develop](http://conda.pydata.org/docs/commands/build/conda-develop.html). 125 | This works similarly to the next option. 126 | * Put a file with the extension `.pth`, e.g. `my_packages.pth`, 127 | into one of the `site-packages` directories of your Python installation 128 | and put the path to the base directory of this project 129 | (`rdkit_ipynb_tools`) into it.
130 | (I have the path to a dedicated folder on my machine included in such a `.pth` 131 | file and link all my development projects to that folder. 132 | This way, I only need to create the `.pth` file once.) 133 | 134 | # Tips & Tricks 135 | ## Pipelines, Structures and Performance 136 | Processing data from 200k compounds takes 10-15 sec on my notebook. 137 | 138 | Substructure searches take longer. 139 | 140 | For performance reasons, I store the molecule structures in text format as base64-encoded pickle strings of the mol objects
141 | (see also Greg's blog post for [faster structure generation](http://rdkit.blogspot.de/2016/09/avoiding-unnecessary-work-and.html)): 142 | 143 | ```python 144 | from base64 import b64encode 145 | import pickle 146 | 147 | b64encode(pickle.dumps(mol)).decode() 148 | ``` 149 | For me, that has proven to be the fastest method when dealing with flat text files and is also the reason why there are `pipe_mol_to_b64` and `pipe_mol_from_b64` components in the `pipeline` module. 150 | 151 | ## Working Offline 152 | * When you use a local copy of the Javascript Molecule Editor as described above 153 | and use Bokeh for plotting, you can work completely offline in your Notebook. 154 | 155 | # Roadmap 156 | * make pipelines more user-friendly 157 | * complete the scaffolds module 158 | * add functionality as needed / requested 159 | 160 | (probably not in this order) 161 | -------------------------------------------------------------------------------- /doc/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # User-friendly check for sphinx-build 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 13 | endif 14 | 15 | # Internal variables. 16 | PAPEROPT_a4 = -D latex_paper_size=a4 17 | PAPEROPT_letter = -D latex_paper_size=letter 18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
19 | # the i18n builder cannot share the environment and doctrees with the others 20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 21 | 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 23 | 24 | help: 25 | @echo "Please use \`make ' where is one of" 26 | @echo " html to make standalone HTML files" 27 | @echo " dirhtml to make HTML files named index.html in directories" 28 | @echo " singlehtml to make a single large HTML file" 29 | @echo " pickle to make pickle files" 30 | @echo " json to make JSON files" 31 | @echo " htmlhelp to make HTML files and a HTML help project" 32 | @echo " qthelp to make HTML files and a qthelp project" 33 | @echo " devhelp to make HTML files and a Devhelp project" 34 | @echo " epub to make an epub" 35 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 36 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 37 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 38 | @echo " text to make text files" 39 | @echo " man to make manual pages" 40 | @echo " texinfo to make Texinfo files" 41 | @echo " info to make Texinfo files and run them through makeinfo" 42 | @echo " gettext to make PO message catalogs" 43 | @echo " changes to make an overview of all changed/added/deprecated items" 44 | @echo " xml to make Docutils-native XML files" 45 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 46 | @echo " linkcheck to check all external links for integrity" 47 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 48 | 49 | clean: 50 | rm -rf $(BUILDDIR)/* 51 | 52 | html: 53 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 54 | @echo 55 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 
56 | 57 | dirhtml: 58 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 59 | @echo 60 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 61 | 62 | singlehtml: 63 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 64 | @echo 65 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 66 | 67 | pickle: 68 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 69 | @echo 70 | @echo "Build finished; now you can process the pickle files." 71 | 72 | json: 73 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 74 | @echo 75 | @echo "Build finished; now you can process the JSON files." 76 | 77 | htmlhelp: 78 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 79 | @echo 80 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 81 | ".hhp project file in $(BUILDDIR)/htmlhelp." 82 | 83 | qthelp: 84 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 85 | @echo 86 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 87 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 88 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/RDKitIPythonNotebookTools.qhcp" 89 | @echo "To view the help file:" 90 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/RDKitIPythonNotebookTools.qhc" 91 | 92 | devhelp: 93 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 94 | @echo 95 | @echo "Build finished." 96 | @echo "To view the help file:" 97 | @echo "# mkdir -p $$HOME/.local/share/devhelp/RDKitIPythonNotebookTools" 98 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/RDKitIPythonNotebookTools" 99 | @echo "# devhelp" 100 | 101 | epub: 102 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 103 | @echo 104 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 
105 | 106 | latex: 107 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 108 | @echo 109 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 110 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 111 | "(use \`make latexpdf' here to do that automatically)." 112 | 113 | latexpdf: 114 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 115 | @echo "Running LaTeX files through pdflatex..." 116 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 117 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 118 | 119 | latexpdfja: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through platex and dvipdfmx..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | text: 126 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 127 | @echo 128 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 129 | 130 | man: 131 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 132 | @echo 133 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 134 | 135 | texinfo: 136 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 137 | @echo 138 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 139 | @echo "Run \`make' in that directory to run these through makeinfo" \ 140 | "(use \`make info' here to do that automatically)." 141 | 142 | info: 143 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 144 | @echo "Running Texinfo files through makeinfo..." 145 | make -C $(BUILDDIR)/texinfo info 146 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 147 | 148 | gettext: 149 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 150 | @echo 151 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 
152 | 153 | changes: 154 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 155 | @echo 156 | @echo "The overview file is in $(BUILDDIR)/changes." 157 | 158 | linkcheck: 159 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 160 | @echo 161 | @echo "Link check complete; look for any errors in the above output " \ 162 | "or in $(BUILDDIR)/linkcheck/output.txt." 163 | 164 | doctest: 165 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 166 | @echo "Testing of doctests in the sources finished, look at the " \ 167 | "results in $(BUILDDIR)/doctest/output.txt." 168 | 169 | xml: 170 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml 171 | @echo 172 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml." 173 | 174 | pseudoxml: 175 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml 176 | @echo 177 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 178 | -------------------------------------------------------------------------------- /doc/README.md: -------------------------------------------------------------------------------- 1 | The documentation for this module can be built with Sphinx by running `make html` in this dir. The module dir rdkit_ipynb_tools has to be in the Python import path. 2 | This can be achieved, e.g., by one of the following methods: 3 | 1. Put the name of the package dir in a custom .pth file in Python's site-packages or dist-packages 4 | (that's how I do it as long as it is not yet a real package) 5 | 2. Add the path of the dir to the PYTHONPATH variable 6 | 3. 
Copy it directly to site-packages or dist-packages 7 | -------------------------------------------------------------------------------- /doc/conf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | # 4 | # RDKit IPython Notebook Tools documentation build configuration file, created by 5 | # sphinx-quickstart on Sat Oct 10 19:41:40 2015. 6 | # 7 | # This file is execfile()d with the current directory set to its 8 | # containing dir. 9 | # 10 | # Note that not all possible configuration values are present in this 11 | # autogenerated file. 12 | # 13 | # All configuration values have a default; values that are commented out 14 | # serve to show the default. 15 | 16 | import sys 17 | import os 18 | 19 | # If extensions (or modules to document with autodoc) are in another directory, 20 | # add these directories to sys.path here. If the directory is relative to the 21 | # documentation root, use os.path.abspath to make it absolute, like shown here. 22 | # sys.path.insert(0, os.path.abspath('..')) 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | needs_sphinx = '1.3' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 32 | extensions = [ 33 | 'sphinx.ext.autodoc', 34 | 'sphinx.ext.napoleon', 35 | 'sphinx.ext.doctest', 36 | 'sphinx.ext.todo', 37 | 'sphinx.ext.coverage', 38 | 'sphinx.ext.viewcode', 39 | ] 40 | 41 | # Add any paths that contain templates here, relative to this directory. 42 | templates_path = ['_templates'] 43 | 44 | # The suffix of source filenames. 45 | source_suffix = '.rst' 46 | 47 | # The encoding of source files. 48 | #source_encoding = 'utf-8-sig' 49 | 50 | # The master toctree document. 
51 | master_doc = 'index' 52 | 53 | # General information about the project. 54 | project = 'RDKit IPython Notebook Tools' 55 | copyright = '2015, A. Pahl' 56 | 57 | # The version info for the project you're documenting, acts as replacement for 58 | # |version| and |release|, also used in various other places throughout the 59 | # built documents. 60 | # 61 | # The short X.Y version. 62 | version = '0.5' 63 | # The full version, including alpha/beta/rc tags. 64 | release = '0.5' 65 | 66 | # The language for content autogenerated by Sphinx. Refer to documentation 67 | # for a list of supported languages. 68 | #language = None 69 | 70 | # There are two options for replacing |today|: either, you set today to some 71 | # non-false value, then it is used: 72 | #today = '' 73 | # Else, today_fmt is used as the format for a strftime call. 74 | #today_fmt = '%B %d, %Y' 75 | 76 | # List of patterns, relative to source directory, that match files and 77 | # directories to ignore when looking for source files. 78 | exclude_patterns = ['_build'] 79 | 80 | # The reST default role (used for this markup: `text`) to use for all 81 | # documents. 82 | #default_role = None 83 | 84 | # If true, '()' will be appended to :func: etc. cross-reference text. 85 | #add_function_parentheses = True 86 | 87 | # If true, the current module name will be prepended to all description 88 | # unit titles (such as .. function::). 89 | #add_module_names = True 90 | 91 | # If true, sectionauthor and moduleauthor directives will be shown in the 92 | # output. They are ignored by default. 93 | #show_authors = False 94 | 95 | # The name of the Pygments (syntax highlighting) style to use. 96 | pygments_style = 'sphinx' 97 | 98 | # A list of ignored prefixes for module index sorting. 99 | #modindex_common_prefix = [] 100 | 101 | # If true, keep warnings as "system message" paragraphs in the built documents. 
102 | #keep_warnings = False 103 | 104 | 105 | # -- Options for HTML output ---------------------------------------------- 106 | 107 | # The theme to use for HTML and HTML Help pages. See the documentation for 108 | # a list of builtin themes. 109 | html_theme = 'default' 110 | 111 | # Theme options are theme-specific and customize the look and feel of a theme 112 | # further. For a list of options available for each theme, see the 113 | # documentation. 114 | #html_theme_options = {} 115 | 116 | # Add any paths that contain custom themes here, relative to this directory. 117 | #html_theme_path = [] 118 | 119 | # The name for this set of Sphinx documents. If None, it defaults to 120 | # " v documentation". 121 | #html_title = None 122 | 123 | # A shorter title for the navigation bar. Default is the same as html_title. 124 | #html_short_title = None 125 | 126 | # The name of an image file (relative to this directory) to place at the top 127 | # of the sidebar. 128 | #html_logo = None 129 | 130 | # The name of an image file (within the static path) to use as favicon of the 131 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 132 | # pixels large. 133 | #html_favicon = None 134 | 135 | # Add any paths that contain custom static files (such as style sheets) here, 136 | # relative to this directory. They are copied after the builtin static files, 137 | # so a file named "default.css" will overwrite the builtin "default.css". 138 | html_static_path = ['_static'] 139 | 140 | # Add any extra paths that contain custom files (such as robots.txt or 141 | # .htaccess) here, relative to this directory. These files are copied 142 | # directly to the root of the documentation. 143 | #html_extra_path = [] 144 | 145 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 146 | # using the given strftime format. 
147 | #html_last_updated_fmt = '%b %d, %Y' 148 | 149 | # If true, SmartyPants will be used to convert quotes and dashes to 150 | # typographically correct entities. 151 | #html_use_smartypants = True 152 | 153 | # Custom sidebar templates, maps document names to template names. 154 | #html_sidebars = {} 155 | 156 | # Additional templates that should be rendered to pages, maps page names to 157 | # template names. 158 | #html_additional_pages = {} 159 | 160 | # If false, no module index is generated. 161 | #html_domain_indices = True 162 | 163 | # If false, no index is generated. 164 | #html_use_index = True 165 | 166 | # If true, the index is split into individual pages for each letter. 167 | #html_split_index = False 168 | 169 | # If true, links to the reST sources are added to the pages. 170 | #html_show_sourcelink = True 171 | 172 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 173 | #html_show_sphinx = True 174 | 175 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 176 | #html_show_copyright = True 177 | 178 | # If true, an OpenSearch description file will be output, and all pages will 179 | # contain a tag referring to it. The value of this option must be the 180 | # base URL from which the finished HTML is served. 181 | #html_use_opensearch = '' 182 | 183 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 184 | #html_file_suffix = None 185 | 186 | # Output file base name for HTML help builder. 187 | htmlhelp_basename = 'RDKitIPythonNotebookToolsdoc' 188 | 189 | 190 | # -- Options for LaTeX output --------------------------------------------- 191 | 192 | latex_elements = { 193 | # The paper size ('letterpaper' or 'a4paper'). 194 | 'papersize': 'a4paper', 195 | 196 | # The font size ('10pt', '11pt' or '12pt'). 197 | #'pointsize': '10pt', 198 | 199 | # Additional stuff for the LaTeX preamble. 200 | #'preamble': '', 201 | } 202 | 203 | # Grouping the document tree into LaTeX files. 
List of tuples 204 | # (source start file, target name, title, 205 | # author, documentclass [howto, manual, or own class]). 206 | latex_documents = [ 207 | ('index', 'RDKitIPythonNotebookTools.tex', 'RDKit IPython Notebook Tools Documentation', 208 | 'A. Pahl', 'manual'), 209 | ] 210 | 211 | # The name of an image file (relative to this directory) to place at the top of 212 | # the title page. 213 | #latex_logo = None 214 | 215 | # For "manual" documents, if this is true, then toplevel headings are parts, 216 | # not chapters. 217 | #latex_use_parts = False 218 | 219 | # If true, show page references after internal links. 220 | #latex_show_pagerefs = False 221 | 222 | # If true, show URL addresses after external links. 223 | #latex_show_urls = False 224 | 225 | # Documents to append as an appendix to all manuals. 226 | #latex_appendices = [] 227 | 228 | # If false, no module index is generated. 229 | #latex_domain_indices = True 230 | 231 | 232 | # -- Options for manual page output --------------------------------------- 233 | 234 | # One entry per manual page. List of tuples 235 | # (source start file, name, description, authors, manual section). 236 | man_pages = [ 237 | ('index', 'rdkitipythonnotebooktools', 'RDKit IPython Notebook Tools Documentation', 238 | ['A. Pahl'], 1) 239 | ] 240 | 241 | # If true, show URL addresses after external links. 242 | #man_show_urls = False 243 | 244 | 245 | # -- Options for Texinfo output ------------------------------------------- 246 | 247 | # Grouping the document tree into Texinfo files. List of tuples 248 | # (source start file, target name, title, author, 249 | # dir menu entry, description, category) 250 | texinfo_documents = [ 251 | ('index', 'RDKitIPythonNotebookTools', 'RDKit IPython Notebook Tools Documentation', 252 | 'A. Pahl', 'RDKitIPythonNotebookTools', 'One line description of project.', 253 | 'Miscellaneous'), 254 | ] 255 | 256 | # Documents to append as an appendix to all manuals. 
257 | #texinfo_appendices = [] 258 | 259 | # If false, no module index is generated. 260 | #texinfo_domain_indices = True 261 | 262 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 263 | #texinfo_show_urls = 'footnote' 264 | 265 | # If true, do not generate a @detailmenu in the "Top" node's menu. 266 | #texinfo_no_detailmenu = False 267 | -------------------------------------------------------------------------------- /doc/index.rst: -------------------------------------------------------------------------------- 1 | .. RDKit IPython Notebook Tools documentation master file, created by 2 | sphinx-quickstart on Sat Oct 10 19:41:40 2015. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to RDKit IPython Notebook Tools's documentation! 7 | ======================================================== 8 | 9 | Indices and tables 10 | ================== 11 | 12 | * :ref:`genindex` 13 | * :ref:`modindex` 14 | * :ref:`search` 15 | 16 | .. automodule:: rdkit_ipynb_tools.tools 17 | :members: 18 | :private-members: 19 | :special-members: 20 | 21 | .. automodule:: rdkit_ipynb_tools.pipeline 22 | :members: 23 | :private-members: 24 | :special-members: 25 | 26 | .. automodule:: rdkit_ipynb_tools.sar 27 | :members: 28 | :private-members: 29 | :special-members: 30 | 31 | .. automodule:: rdkit_ipynb_tools.clustering 32 | :members: 33 | :private-members: 34 | :special-members: 35 | 36 | .. automodule:: rdkit_ipynb_tools.bokeh_tools 37 | :members: 38 | :private-members: 39 | :special-members: 40 | 41 | .. automodule:: rdkit_ipynb_tools.hc_tools 42 | :members: 43 | :private-members: 44 | :special-members: 45 | 46 | .. automodule:: rdkit_ipynb_tools.pandas_tools 47 | :members: 48 | :private-members: 49 | :special-members: 50 | 51 | .. 
automodule:: rdkit_ipynb_tools.html_templates 52 | :members: 53 | :private-members: 54 | :special-members: 55 | -------------------------------------------------------------------------------- /examples/chembl_et-a_antagonists.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apahl/rdkit_ipynb_tools/c259ac8ee75709becd2a5e67f9a913bd20e0ae38/examples/chembl_et-a_antagonists.zip -------------------------------------------------------------------------------- /html/mol_table/css/lit_style.css: -------------------------------------------------------------------------------- 1 | body{ 2 | background-color: #FFFFFF; 3 | font-family: freesans, arial, verdana, sans-serif; 4 | font-size: small; 5 | } 6 | h3 { 7 | margin-bottom: 10px; 8 | } 9 | td { 10 | border-collapse:collapse; 11 | border-width:thin; 12 | border-style:hidden; 13 | border-color:black; 14 | padding-right: 1px; 15 | padding-bottom: 1px; 16 | } 17 | table { 18 | border-collapse:collapse; 19 | border-width:thin; 20 | border-style:hidden; 21 | border-color:black; 22 | background-color: #FFFFFF; 23 | text-align: left; 24 | } 25 | 26 | 27 | 28 | -------------------------------------------------------------------------------- /html/mol_table/css/style.css: -------------------------------------------------------------------------------- 1 | body{ 2 | background-color: #FFFFFF; 3 | font-family: freesans, arial, verdana, sans-serif; 4 | } 5 | th { 6 | border-collapse: collapse; 7 | border-width: thin; 8 | border-style: solid; 9 | border-color: black; 10 | text-align: left; 11 | font-weight: bold; 12 | } 13 | td { 14 | border-collapse:collapse; 15 | border-width:thin; 16 | border-style:solid; 17 | border-color:black; 18 | padding: 5px; 19 | } 20 | table { 21 | border-collapse:collapse; 22 | border-width:thin; 23 | border-style:solid; 24 | border-color:black; 25 | background-color: #FFFFFF; 26 | text-align: left; 27 | } 28 | 29 | 30 | 31 | 32 | 
-------------------------------------------------------------------------------- /make_doc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # generate the documentation 3 | # requires Sphinx with sphinxcontrib.napoleon 4 | 5 | make -C doc html 6 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/bokeh_tools.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ########### 5 | Bokeh Tools 6 | ########### 7 | 8 | *Created on 2015-12-12 by A. Pahl* 9 | 10 | Bokeh plotting functionality for Mol_lists. 11 | """ 12 | 13 | import colorsys 14 | import math 15 | 16 | import numpy as np 17 | 18 | # from bkcharts import Bar 19 | from bokeh.plotting import figure, ColumnDataSource 20 | import bokeh.io as io 21 | from bokeh.models import HoverTool, OpenURL, TapTool 22 | 23 | AVAIL_COLORS = ["#1F77B4", "firebrick", "goldenrod", "aqua", "brown", "chartreuse", "darkmagenta", 24 | "aquamarine", "blue", "red", "blueviolet", "darkorange", "forestgreen", "lime"] 25 | # AVAIL_MARKERS: circle, diamond, triangle, square, inverted_triangle, asterisk, 26 | # circle_cross, circle_x, cross, diamond_cross, square_cross, square_x, asterisk, diamond 27 | 28 | io.output_notebook() 29 | 30 | 31 | class ColorScale(): 32 | 33 | def __init__(self, num_values, val_min, val_max, middle_color="yellow", reverse=False): 34 | self.num_values = num_values 35 | self.num_val_1 = num_values - 1 36 | self.value_min = val_min 37 | self.value_max = val_max 38 | self.reverse = reverse 39 | self.value_range = self.value_max - self.value_min 40 | self.color_scale = [] 41 | if middle_color.startswith("y"): # middle
color yellow 42 | hsv_tuples = [(0.0 + ((x * 0.35) / (self.num_val_1)), 0.99, 0.9) for x in range(self.num_values)] 43 | self.reverse = not self.reverse 44 | else: # middle color blue 45 | hsv_tuples = [(0.35 + ((x * 0.65) / (self.num_val_1)), 0.9, 0.9) for x in range(self.num_values)] 46 | rgb_tuples = map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples) 47 | for rgb in rgb_tuples: 48 | rgb_int = [int(255 * x) for x in rgb] 49 | self.color_scale.append('#{:02x}{:02x}{:02x}'.format(*rgb_int)) 50 | 51 | if self.reverse: 52 | self.color_scale.reverse() 53 | 54 | def __call__(self, value): 55 | """return the color from the scale corresponding to the place in the value_min .. value_max range""" 56 | pos = int(((value - self.value_min) / self.value_range) * self.num_val_1) 57 | 58 | return self.color_scale[pos] 59 | 60 | 61 | def legend(self): 62 | """Return the value_range and a list of tuples (value, color) to be used in a legend.""" 63 | legend = [] 64 | for idx, color in enumerate(self.color_scale): 65 | val = self.value_min + idx / self.num_val_1 * self.value_range 66 | legend.append((val, color)) 67 | 68 | return legend 69 | 70 | 71 | class Chart(): 72 | """A Bokeh Plot.""" 73 | 74 | def __init__(self, kind="scatter", **kwargs): 75 | """Useful Chart kwargs: 76 | 77 | Parameters: 78 | xlabel (str): override the automatic x_axis_label. Default is None. 79 | ylabel (str): override the automatic y_axis_label. Default is None. 80 | callback (str): clicking on a point will link to the given HTML address. `@` can be used as placeholder for the point id (e.g. Compound_Id). 
Default is None.""" 81 | 82 | self.data = {} 83 | self.kwargs = kwargs 84 | self.kind = kind 85 | self.height = kwargs.get("height", 450) 86 | self.title = kwargs.get("title", "Scatter Plot") 87 | self.position = kwargs.get("position", kwargs.get("pos", "top_left")) 88 | 89 | self.series_counter = 0 90 | self.tools_added = False 91 | tools = ["pan", "wheel_zoom", "box_zoom", "reset", "resize", "save"] 92 | self.callback = kwargs.get("callback", None) 93 | if self.callback is not None: 94 | tools.append("tap") 95 | 96 | self.plot = figure(plot_height=self.height, title=self.title, tools=tools) 97 | self.plot.axis.axis_label_text_font_size = "14pt" 98 | self.plot.axis.major_label_text_font_size = "14pt" 99 | self.plot.title.text_font_size = "18pt" 100 | if self.callback is not None: 101 | taptool = self.plot.select(type=TapTool) 102 | taptool.callback = OpenURL(url=self.callback) 103 | 104 | 105 | def _add_series(self, x, y, series, size, source): 106 | color = self.add_data_kwargs.get("color", AVAIL_COLORS[self.series_counter]) 107 | 108 | if self.series_counter == 0: 109 | self.plot_type = self.plot.circle 110 | 111 | elif self.series_counter == 1: 112 | self.plot_type = self.plot.diamond 113 | if isinstance(size, int): 114 | size += 3 # diamonds appear smaller than circles of the same size 115 | elif self.series_counter == 2: 116 | self.plot_type = self.plot.triangle 117 | elif self.series_counter == 4: 118 | self.plot_type = self.plot.inverted_triangle 119 | elif self.series_counter == 5: 120 | self.plot_type = self.plot.asterisk 121 | elif self.series_counter == 6: 122 | self.plot_type = self.plot.circle_cross 123 | elif self.series_counter == 7: 124 | self.plot_type = self.plot.circle_x 125 | elif self.series_counter == 8: 126 | self.plot_type = self.plot.cross 127 | elif self.series_counter == 9: 128 | self.plot_type = self.plot.diamond_cross 129 | elif self.series_counter == 10: 130 | self.plot_type = self.plot.square_cross 131 | elif self.series_counter == 
11: 132 | self.plot_type = self.plot.square_x 133 | else: 134 | self.plot_type = self.plot.asterisk 135 | 136 | self.plot_type(x, y, legend=series, size=size, color=color, source=source) 137 | if self.add_data_kwargs.get("line", False): 138 | self.plot.line(x, y, legend=series, color=color, 139 | line_width=self.add_data_kwargs.get("width", 3), source=source) 140 | 141 | self.series_counter += 1 142 | if self.series_counter >= len(AVAIL_COLORS): 143 | print("* series overflow, starting again.") 144 | self.series_counter = 0 145 | 146 | 147 | 148 | def add_data(self, d, x, y, **kwargs): 149 | """Added line option. This does not work with the color_by option. 150 | 151 | Parameters: 152 | color, color_by, series, series_by, size, size_by; 153 | line (bool): whether to plot a line or not. Default is False. 154 | width (int): line width when line is plotted. Default is 3.""" 155 | 156 | colors = "#1F77B4" 157 | self.add_data_kwargs = kwargs 158 | series = kwargs.get("series", None) 159 | if series is not None: 160 | series_by = "Series" 161 | else: 162 | series_by = kwargs.get("series_by", None) 163 | 164 | color_by = kwargs.get("color_by", None) 165 | size_by = kwargs.get("size_by", None) 166 | pid = kwargs.get("pid", None) 167 | 168 | tooltip = get_tooltip(x, y, 169 | pid, 170 | series, 171 | series_by, 172 | color_by, 173 | size_by, 174 | kwargs.get("tooltip", None)) 175 | 176 | if self.series_counter == 0: 177 | self.plot.add_tools(tooltip) 178 | 179 | self.plot.xaxis.axis_label = self.kwargs.get("xlabel", x) 180 | self.plot.yaxis.axis_label = self.kwargs.get("ylabel", y) 181 | 182 | if size_by is not None: 183 | size = "{}_sizes".format(size_by) 184 | d[size] = get_sizes_from_values(d[size_by]) 185 | else: 186 | size = kwargs.get("radius", kwargs.get("r", kwargs.get("size", kwargs.get("s", 10)))) 187 | 188 | reverse = kwargs.get("invert", False) 189 | 190 | if series: 191 | d["x"] = d[x] 192 | d["y"] = d[y] 193 | d["series"] = [series] * len(d[x]) 194 | 195 | 
self._add_series(x, y, series, size=size, source=ColumnDataSource(d)) 196 | 197 | elif series_by: 198 | series_keys = set() 199 | for idx, item in enumerate(d[series_by]): 200 | if item is None: 201 | d[series_by][idx] = "None" 202 | elif item is np.nan: 203 | d[series_by][idx] = "NaN" 204 | 205 | series_keys.add(d[series_by][idx]) 206 | 207 | for series in series_keys: 208 | d_series = {x: [], y: [], "series": []} 209 | if size_by is not None: 210 | d_series[size_by] = [] 211 | d_series[size] = [] 212 | if pid is not None: 213 | d_series[pid] = [] 214 | d_series["mol"] = [] 215 | for idx, el in enumerate(d[x]): 216 | if d[series_by][idx] == series: 217 | d_series[x].append(d[x][idx]) 218 | d_series[y].append(d[y][idx]) 219 | d_series["series"].append(d[series_by][idx]) 220 | if size_by is not None: 221 | d_series[size_by].append(d[size_by][idx]) 222 | d_series[size].append(d[size][idx]) 223 | if pid is not None: 224 | d_series[pid].append(d[pid][idx]) 225 | d_series["mol"].append(d["mol"][idx]) 226 | 227 | 228 | d_series["x"] = d_series[x] 229 | d_series["y"] = d_series[y] 230 | 231 | self._add_series(x, y, series, size=size, source=ColumnDataSource(d_series)) 232 | 233 | 234 | elif color_by: 235 | color_by_min = min(d[color_by]) 236 | color_by_max = max(d[color_by]) 237 | color_scale = ColorScale(20, color_by_min, color_by_max, reverse=reverse)  # pass reverse as keyword; positionally it would land in middle_color 238 | colors = [] 239 | for val in d[color_by]: 240 | if val is not None and val is not np.nan: 241 | colors.append(color_scale(val)) 242 | else: 243 | colors.append("black") 244 | 245 | d["colors"] = colors 246 | d["x"] = d[x] 247 | d["y"] = d[y] 248 | self.plot.circle(x, y, size=size, color=colors, source=ColumnDataSource(d)) 249 | 250 | else: 251 | d["x"] = d[x] 252 | d["y"] = d[y] 253 | self.plot.circle(x, y, size=size, source=ColumnDataSource(d)) 254 | if self.add_data_kwargs.get("line", False): 255 | self.plot.line(x, y, line_width=self.add_data_kwargs.get("width", 3), 256 | source=ColumnDataSource(d)) 257 | 258 | 259 | def
show(self): 260 | self.plot.legend.location = self.position 261 | io.show(self.plot) 262 | 263 | 264 | class Hist(): 265 | """A Bokeh histogram, built from a Numpy histogram and a Bokeh quad glyph. 266 | The high-level Bokeh Chart Histogram class gave false results on the y axis for me (as of 9-Mar-2016).""" 267 | 268 | def __init__(self, title="Histogram", xlabel="Values", ylabel="Occurrence", **kwargs): 269 | """Generates a histogram. 270 | Possible useful additional kwargs include: plot_width, plot_height, y_axis_type="log", 271 | tick_size="14pt".""" 272 | 273 | self.colors = ["#FF596A", "#0066FF", "#00CC88", "#FFDD00"] 274 | self.plot_no = -1 275 | self.kwargs = kwargs 276 | self.pos = "top_left" 277 | tick_size = self.kwargs.pop("tick_size", "14pt") 278 | 279 | for arg in ["pos", "position"]: 280 | if arg in self.kwargs: 281 | self.pos = self.kwargs[arg] 282 | self.kwargs.pop(arg) 283 | 284 | self.plot = figure(title=title, **kwargs) 285 | self.plot.xaxis.axis_label = xlabel 286 | self.plot.yaxis.axis_label = ylabel 287 | self.plot.axis.axis_label_text_font_size = "14pt" 288 | self.plot.axis.major_label_text_font_size = tick_size 289 | self.plot.title.text_font_size = "18pt" 290 | 291 | 292 | 293 | def add_data(self, data, bins=10, series=None, color=None, normed=False, **kwargs): 294 | """Add actual data to the plot.""" 295 | # manage colors 296 | self.plot_no += 1 297 | if self.plot_no > len(self.colors) - 1: 298 | self.plot_no = 0 299 | if color is None: 300 | color = self.colors[self.plot_no] 301 | 302 | data = remove_nan(data) 303 | hist, edges = np.histogram(data, bins=bins) 304 | if normed: 305 | hist = normalize_largest_bin_to_one(hist) 306 | self.source = ColumnDataSource(data=dict(top=hist, left=edges[:-1], right=edges[1:])) 307 | 308 | if series is not None: 309 | self.plot.quad(top="top", bottom=0, left="left", right="right", 310 | color=color, line_color="black", alpha=0.5, legend=series, 312 |
source=self.source) 313 | 314 | else: 315 | self.plot.quad(top="top", bottom=0, left="left", right="right", color=color, line_color="black", alpha=0.8, source=self.source) 316 | 317 | 318 | def show(self): 319 | self.plot.legend.location = self.pos 320 | io.show(self.plot) 321 | 322 | 323 | # def bar_chart(d, x, show=True, **kwargs): 324 | # """Displays a bar chart for the occurrence of the given x-value. 325 | # This plot type is especially useful for plotting the occurrence of categorical data, 326 | # where only a small number (<= 10) of different values are present. 327 | # This function is directly calling the advanced bokeh bar chart type, 328 | # therefore no additional class is used. 329 | # Useful kwargs include: title, plot_height, plot_width.""" 330 | # title = kwargs.pop("title", "Occurrence of {}".format(x)) 331 | # p = Bar(d, x, values=x, agg="count", legend=False, title=title, **kwargs) 332 | # p.yaxis.axis_label = "Occurrence" 333 | # p.axis.axis_label_text_font_size = "14pt" 334 | # p.axis.major_label_text_font_size = "14pt" 335 | # p.title.text_font_size = "18pt" 336 | # if show: 337 | # io.show(p) 338 | # else: 339 | # return p 340 | 341 | 342 | def get_tooltip(x, y, pid=None, series=None, series_by=None, color_by=None, size_by=None, tooltip=None): 343 | if pid is not None: 344 | pid_tag = '{pid}: @{pid}<br>'.format(pid=pid) 345 | else: 346 | pid_tag = "" 347 | 348 | if size_by is not None: 349 | size_tag = '{size_by}: @{size_by}&nbsp;&nbsp;<br>'.format(size_by=size_by) 350 | else: 351 | size_tag = "" 352 | 353 | if series_by: 354 | series_tag = '{series_by}: @series&nbsp;&nbsp;<br>'.format(series_by=series_by) 355 | color_tag = "" 356 | elif color_by: 357 | series_tag = "" 358 | color_tag = '{color_by}: @{color_by}&nbsp;&nbsp;'.format(color_by=color_by) 359 | else: 360 | color_tag = "" 361 | series_tag = "" 362 | 363 | if tooltip == "struct": 364 | templ = HoverTool( 365 | tooltips=""" 366 | <div> 367 | <div style="width: 200px; height: 200px;"> 368 | <img 369 | src="data:image/png;base64,@mol" 370 | height="200" width="200" 371 | style="float: left; margin: 0px 15px 15px 0px;" 372 | border="2" alt="Mol" 373 | ></img> 374 | </div> 375 | <div>{series_tag}{pid_tag} 376 | {x}: @x<br> 377 | {y}: @y<br>{color_tag}{size_tag} 378 | </div> 379 | </div> 380 | """.format(pid_tag=pid_tag, series_tag=series_tag, color_tag=color_tag, size_tag=size_tag, x=x, y=y) 381 | ) 382 | else: 383 | templ = HoverTool( 384 | tooltips=""" 385 | <div> 386 | {series_tag}{pid_tag} 387 | {x}: @x<br> 388 | {y}: @y<br>{color_tag}{size_tag} 389 | </div>
390 | """.format(pid_tag=pid_tag, series_tag=series_tag, color_tag=color_tag, size_tag=size_tag, x=x, y=y) 391 | ) 392 | # templ = HoverTool(tooltips=[(x, "@x"), (y, "@y")]) 393 | 394 | return templ 395 | 396 | 397 | def remove_nan(l): 398 | """Remove Nans from a list for histograms.""" 399 | return [x for x in l if x is not np.nan] 400 | 401 | 402 | def guess_id_prop(prop_list): # try to guess an id_prop 403 | for prop in prop_list: 404 | if prop.lower().endswith("id"): 405 | return prop 406 | return None 407 | 408 | 409 | def normalize_largest_bin_to_one(hist): 410 | """Takes a Numpy histogram list and normalizes all values, so that the highest value becomes 1.0. 411 | Returns a new list.""" 412 | max_bin = max(hist) 413 | norm_hist = [b / max_bin for b in hist] 414 | return norm_hist 415 | 416 | 417 | def get_bin_centers(edges): 418 | """Returns a list of bin centers from a list of np.histogram edges. 419 | The returned centers are one element shorter than the provided edges list.""" 420 | l = len(edges) 421 | centers = [] 422 | for idx in range(l - 1): 423 | center = (edges[idx] + edges[idx + 1]) / 2 424 | center = float("{:.3f}".format(center)) # limit to three decimals 425 | centers.append(center) 426 | 427 | return centers 428 | 429 | 430 | def get_sizes_from_values(values, min_size=10, max_size=60, log_scale=True): 431 | max_val = max(values) 432 | mult = max_size - min_size 433 | 434 | if log_scale: 435 | min_val = min(values) - 1 436 | norm = math.log10(max_val - min_val) 437 | sizes = [min_size + mult * math.log10(x - min_val) / norm for x in values] 438 | 439 | else: 440 | min_val = min(values) 441 | norm = max_val - min_val 442 | sizes = [min_size + mult * (x - min_val) / norm for x in values] 443 | 444 | return sizes 445 | 446 | 447 | def cpd_scatter(df, x, y, r=7, pid=None, **kwargs): 448 | """Predefined Plot #1. 
449 | Quickly plot an RDKit Pandas dataframe or a molecule dictionary with structure tooltips.""" 450 | 451 | if not pid: 452 | if isinstance(df, dict): 453 | prop_list = df.keys() 454 | else: 455 | prop_list = [df.index.name] 456 | prop_list.extend(df.columns.values) 457 | 458 | pid = guess_id_prop(prop_list) 459 | 460 | callback = kwargs.pop("callback", None) 461 | title = kwargs.pop("title", "Compound Scatter Plot") 462 | scatter = Chart(title=title, r=r, callback=callback) 463 | scatter.add_data(df, x, y, pid=pid, r=r, **kwargs)  # pass r through; add_data reads the size from its own kwargs 464 | return scatter.show() 465 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/clustering.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ########## 5 | Clustering 6 | ########## 7 | 8 | *Created on Sun Feb 28 11:00 2016 by A. Pahl* 9 | 10 | Clustering molecules. 11 | """ 12 | 13 | import os 14 | import os.path as op 15 | from copy import deepcopy 16 | from collections import Counter, OrderedDict 17 | import shutil 18 | 19 | from rdkit.Chem import AllChem as Chem, MACCSkeys 20 | from rdkit.Chem import Draw 21 | import rdkit.Chem.Descriptors as Desc 22 | from rdkit.Chem.AtomPairs import Pairs, Torsions 23 | 24 | # import rdkit.Chem.Scaffolds.MurckoScaffold as MurckoScaffold 25 | try: 26 | Draw.DrawingOptions.atomLabelFontFace = "DejaVu Sans" 27 | Draw.DrawingOptions.atomLabelFontSize = 18 28 | except KeyError: # Font "DejaVu Sans" is not available 29 | pass 30 | 31 | from rdkit import DataStructs 32 | from rdkit.ML.Cluster import Butina 33 | 34 | # from PIL import Image, ImageChops 35 | import numpy as np 36 | try: 37 | # this mainly needs to be done because Sphinx can not import Matplotlib 38 | # (it gives `NotImplementedError('Implement enable_gui in a subclass')`) 39 | import matplotlib.pyplot as plt 40 | MPL = True 41 | except ImportError: 42 | MPL = False 43 | 44 | # from .
import html_templates as html 45 | from . import tools, html_templates as html, file_templ as ft, nb_tools as nbt 46 | 47 | try: 48 | from rdkit.Avalon import pyAvalonTools as pyAv 49 | USE_AVALON = True 50 | except ImportError: 51 | USE_AVALON = False 52 | 53 | 54 | nbits = 1024 55 | nbits_long = 16384 56 | 57 | # dictionary 58 | FPDICT = {} 59 | FPDICT['ecfp0'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 0, nBits=nbits) 60 | FPDICT['ecfp2'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 1, nBits=nbits) 61 | FPDICT['ecfp4'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) 62 | FPDICT['ecfp6'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 3, nBits=nbits) 63 | FPDICT['ecfc0'] = lambda m: Chem.GetMorganFingerprint(m, 0) 64 | FPDICT['ecfc2'] = lambda m: Chem.GetMorganFingerprint(m, 1) 65 | FPDICT['ecfc4'] = lambda m: Chem.GetMorganFingerprint(m, 2) 66 | FPDICT['ecfc6'] = lambda m: Chem.GetMorganFingerprint(m, 3) 67 | FPDICT['fcfp2'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 1, useFeatures=True, nBits=nbits) 68 | FPDICT['fcfp4'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 2, useFeatures=True, nBits=nbits) 69 | FPDICT['fcfp6'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 3, useFeatures=True, nBits=nbits) 70 | FPDICT['fcfc2'] = lambda m: Chem.GetMorganFingerprint(m, 1, useFeatures=True) 71 | FPDICT['fcfc4'] = lambda m: Chem.GetMorganFingerprint(m, 2, useFeatures=True) 72 | FPDICT['fcfc6'] = lambda m: Chem.GetMorganFingerprint(m, 3, useFeatures=True) 73 | FPDICT['lecfp4'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits_long) 74 | FPDICT['lecfp6'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 3, nBits=nbits_long) 75 | FPDICT['lfcfp4'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 2, useFeatures=True, nBits=nbits_long) 76 | FPDICT['lfcfp6'] = lambda m: Chem.GetMorganFingerprintAsBitVect(m, 3, useFeatures=True, nBits=nbits_long) 77 | FPDICT['maccs'] = lambda m: MACCSkeys.GenMACCSKeys(m) 78 | 
FPDICT['ap'] = lambda m: Pairs.GetAtomPairFingerprint(m) 79 | FPDICT['tt'] = lambda m: Torsions.GetTopologicalTorsionFingerprintAsIntVect(m) 80 | FPDICT['hashap'] = lambda m: Desc.GetHashedAtomPairFingerprintAsBitVect(m, nBits=nbits) 81 | FPDICT['hashtt'] = lambda m: Desc.GetHashedTopologicalTorsionFingerprintAsBitVect(m, nBits=nbits) 82 | FPDICT['rdk5'] = lambda m: Chem.RDKFingerprint(m, maxPath=5, fpSize=nbits, nBitsPerHash=2) 83 | FPDICT['rdk6'] = lambda m: Chem.RDKFingerprint(m, maxPath=6, fpSize=nbits, nBitsPerHash=2) 84 | FPDICT['rdk7'] = lambda m: Chem.RDKFingerprint(m, maxPath=7, fpSize=nbits, nBitsPerHash=2) 85 | if USE_AVALON: 86 | FPDICT['avalon'] = lambda m: pyAv.GetAvalonFP(m, nbits) 87 | FPDICT['avalon_l'] = lambda m: pyAv.GetAvalonFP(m, nbits_long) 88 | 89 | 90 | def mpl_hist(data, bins=10, xlabel="values", ylabel="Occurrence", show=False, save=True, **kwargs): 91 | """Useful kwargs: size (tuple), dpi (int), fn (filename, str), title (str)""" 92 | my_dpi = kwargs.get("dpi", 96) 93 | size = kwargs.get("size", (300, 350)) 94 | title = kwargs.get("title", None) 95 | figsize = (size[0] / my_dpi, size[1] / my_dpi) 96 | plt.style.use('seaborn-pastel') 97 | # plt.style.use('ggplot') 98 | plt.style.use('seaborn-whitegrid') 99 | fig = plt.figure(figsize=figsize, dpi=my_dpi) 100 | if title is not None: 101 | fig.suptitle(title, fontsize=24) 102 | plt.hist(data, bins=bins) 103 | plt.xlabel(xlabel, fontsize=20) 104 | plt.ylabel(ylabel, fontsize=20) 105 | plt.tick_params(axis='both', which='major', labelsize=16) 106 | 107 | if save: 108 | fn = kwargs.get("fn", "hist.png") 109 | plt.savefig(fn, bbox_inches='tight') 110 | 111 | if show: 112 | plt.show() 113 | 114 | 115 | def renumber_clusters(cluster_list, start_at=1): 116 | """Renumber clusters in-place.""" 117 | start_at -= 1 118 | # get the current individual cluster numbers present in the list 119 | id_list = sorted(set(tools.get_value(mol.GetProp("Cluster_No")) 120 | for mol in 
cluster_list.mols_with_prop("Cluster_No"))) 121 | 122 | # assign the new ids as values the old id's keys 123 | new_ids = {k: v for v, k in enumerate(id_list, 1 + start_at)} 124 | for mol in cluster_list: 125 | if not mol.HasProp("Cluster_No"): continue 126 | old_id = int(mol.GetProp("Cluster_No")) 127 | mol.SetProp("Cluster_No", str(new_ids[old_id])) 128 | 129 | 130 | def get_cluster_numbers(cluster_list): 131 | """Returns the cluster numbers present in the cluster_list as a list, keeping the original order.""" 132 | cl_no_od = OrderedDict() 133 | for mol in cluster_list.mols_with_prop("Cluster_No"): 134 | cl_no = int(mol.GetProp("Cluster_No")) 135 | cl_no_od[cl_no] = 0 136 | 137 | return list(cl_no_od.keys()) 138 | 139 | 140 | def get_clusters_by_no(cluster_list, cl_no, make_copy=True, renumber=False): 141 | """Return one or more clusters (provide numbers as list) by their number.""" 142 | if not isinstance(cl_no, list): 143 | cl_no = [cl_no] 144 | 145 | cluster = tools.Mol_List() 146 | if cluster_list.order: 147 | cluster.order = cluster_list.order.copy() 148 | for mol in cluster_list: 149 | if mol.HasProp("Cluster_No") and int(mol.GetProp("Cluster_No")) in cl_no: 150 | if make_copy: 151 | mol = deepcopy(mol) 152 | cluster.append(mol) 153 | 154 | if renumber: 155 | renumber_clusters(cluster) 156 | 157 | return cluster 158 | 159 | 160 | def remove_clusters_by_no(cluster_list, cl_no, make_copy=True, renumber=False): 161 | """Return a new cluster list where the clusters with the provided numbers are removed.""" 162 | if not isinstance(cl_no, list): 163 | cl_no = [cl_no] 164 | 165 | cluster = tools.Mol_List() 166 | for mol in cluster_list: 167 | if mol.HasProp("Cluster_No") and int(mol.GetProp("Cluster_No")) not in cl_no: 168 | if make_copy: 169 | mol = deepcopy(mol) 170 | cluster.append(mol) 171 | 172 | if renumber: 173 | renumber_clusters(cluster) 174 | 175 | return cluster 176 | 177 | 178 | def keep_clusters_by_len(cluster_list, min_len=3, max_len=1000, 
make_copy=True, renumber=False): 179 | """Returns a new cluster list with all clusters removed whose size is outside `min_len <= len <= max_len`.""" 180 | ctr = Counter() 181 | result_list = tools.Mol_List() 182 | id_prop = tools.guess_id_prop(tools.list_fields(cluster_list)) 183 | 184 | # Get the lengths, even if there are no cores: 185 | for mol in cluster_list: 186 | if int(mol.GetProp(id_prop)) < 100000: continue # is a core 187 | 188 | if not mol.HasProp("Cluster_No"): continue 189 | cl_id = int(mol.GetProp("Cluster_No")) 190 | ctr[cl_id] += 1 191 | 192 | # now, only keep mols which belong to clusters of the desired length, 193 | # including the cores 194 | for mol in cluster_list: 195 | if not mol.HasProp("Cluster_No"): continue 196 | cl_id = int(mol.GetProp("Cluster_No")) 197 | if ctr[cl_id] >= min_len and ctr[cl_id] <= max_len: 198 | result_list.append(mol) 199 | 200 | if renumber: 201 | renumber_clusters(result_list) 202 | 203 | return result_list 204 | 205 | 206 | def get_cores(cluster_list, make_copy=True): 207 | """Find and return the core molecules in a cluster_list.""" 208 | core_list = cluster_list.has_prop_filter("is_core", make_copy=make_copy) 209 | core_list.order = ["Compound_Id", "Cluster_No", "Num_Members", "Num_Values", "Min", "Max", "Mean", "Median", "is_core"] 210 | return core_list 211 | 212 | 213 | def get_members(cluster_list, make_copy=True): 214 | """Find and return the members of a cluster_list, excluding the cores.""" 215 | member_list = cluster_list.has_prop_filter("is_core", invert=True, make_copy=make_copy) 216 | return member_list 217 | 218 | 219 | def get_stats_for_cluster(cluster_list, activity_prop=None): 220 | stats = {} 221 | sups_raw_list = [mol.GetProp("Supplier") for mol in cluster_list.mols_with_prop("Supplier")] 222 | if sups_raw_list: 223 | # each supplier field of a mol may already contain "; " 224 | sups_raw_str = "; ".join(sups_raw_list) 225 | sups_set = set(sups_raw_str.split("; ")) 226 | stats["Supplier"] = "; ".join(sorted(sups_set)) 227 | 228
| prod_raw_list = [mol.GetProp("Producer") for mol in cluster_list.mols_with_prop("Producer")] 229 | if prod_raw_list: 230 | # each Producer field of a mol may already contain "; " 231 | prod_raw_str = "; ".join(prod_raw_list) 232 | prod_set = set(prod_raw_str.split("; ")) 233 | stats["Producer"] = "; ".join(sorted(prod_set)) 234 | 235 | if activity_prop is not None: 236 | value_list = [tools.get_value(mol.GetProp(activity_prop)) for mol in cluster_list.mols_with_prop(activity_prop)] 237 | stats["Num_Values"] = len(value_list) 238 | stats["Min"] = min(value_list) if value_list else None 239 | stats["Max"] = max(value_list) if value_list else None 240 | stats["Mean"] = np.mean(value_list) if value_list else None 241 | stats["Median"] = np.median(value_list) if value_list else None 242 | 243 | return stats 244 | 245 | 246 | def add_stats_to_cores(cluster_list_w_cores, props=None): 247 | """Add statistical information for the given props to the cluster cores. 248 | If 'props' is None, a default list of properties is used. 249 | The cores have to be already present in the list.""" 250 | if props is None: 251 | props = ["ALogP", "QED", "SA_Score"] 252 | elif not isinstance(props, list): 253 | props = [props] 254 | 255 | cores = get_cores(cluster_list_w_cores, make_copy=False) 256 | if len(cores) == 0: 257 | raise LookupError("Could not find any cores in the list! 
Please add them first with add_cores() or add_centers()") 258 | cores.order = ["Cluster_No", "Num_Members"] 259 | for prop in props: 260 | cores.order.extend(["{}_Min".format(prop), "{}_Max".format(prop), "{}_Mean".format(prop), 261 | "{}_Median".format(prop), "{}_#Values".format(prop)]) 262 | for core in cores: 263 | cl_no = tools.get_prop_val(core, "Cluster_No") 264 | cluster = get_members(get_clusters_by_no(cluster_list_w_cores, cl_no, make_copy=False), 265 | make_copy=False) 266 | for prop in props: 267 | value_list = list(filter( 268 | lambda x: x is not None, [tools.get_prop_val(mol, prop) for mol in cluster])) 269 | if len(value_list) == 0: continue 270 | 271 | core.SetProp("{}_#Values".format(prop), str(len(value_list))) 272 | 273 | if all(tools.isnumber(x) for x in value_list): 274 | core.SetProp("{}_Min".format(prop), "{:.2f}".format(min(value_list))) 275 | core.SetProp("{}_Max".format(prop), "{:.2f}".format(max(value_list))) 276 | core.SetProp("{}_Mean".format(prop), "{:.2f}".format(np.mean(value_list))) 277 | core.SetProp("{}_Median".format(prop), "{:.2f}".format(np.median(value_list))) 278 | 279 | else: # summarize the values as strings 280 | value_str = "; ".join(str(x) for x in value_list) 281 | value_list_ext = value_str.split("; ") 282 | value_ctr = Counter(value_list_ext) 283 | prop_values = [] 284 | for val in sorted(value_ctr): 285 | prop_values.append("{} ({})".format(val, value_ctr[val])) 286 | core.SetProp("{}s".format(prop), "; ".join(prop_values)) 287 | 288 | 289 | def get_clusters_with_activity(cluster_list, activity_prop, min_act=None, max_act=None, min_len=1, renumber=False): 290 | """Return only the clusters which fulfill the given activity criteria. 291 | 292 | Parameters: 293 | min_act (str): something like `< 50` or `> 70` that can be evaluated. 
294 | max_act (str): see above.""" 295 | 296 | if min_act is not None: 297 | min_act_comp = compile('stats["Min"] {}'.format(min_act), '<string>', 'eval') 298 | if max_act is not None: 299 | max_act_comp = compile('stats["Max"] {}'.format(max_act), '<string>', 'eval') 300 | 301 | cores_and_members = get_cores(cluster_list) 302 | if len(cores_and_members) > 0: 303 | cores_and_members = cores_and_members.prop_filter('Num_Members >= {}'.format(min_len)) 304 | members_all = get_members(cluster_list) 305 | tmp_list = tools.Mol_List() 306 | 307 | cl_ids = sorted(set(int(mol.GetProp("Cluster_No")) for mol in members_all.mols_with_prop("Cluster_No"))) 308 | for new_id, cl_id in enumerate(cl_ids, 1): 309 | cluster = get_clusters_by_no(members_all, cl_id) 310 | if len(cluster) < min_len: continue 311 | stats = get_stats_for_cluster(cluster, activity_prop) 312 | if stats["Num_Values"] == 0: continue 313 | stats["Min"] # to quiet the linter 314 | keep = True 315 | if min_act is not None and not eval(min_act_comp): 316 | keep = False 317 | if max_act is not None and not eval(max_act_comp): 318 | keep = False 319 | 320 | if keep: 321 | tmp_list.extend(cluster) 322 | 323 | cores_and_members.extend(tmp_list) 324 | 325 | if renumber: 326 | renumber_clusters(cores_and_members) 327 | 328 | return cores_and_members 329 | 330 | 331 | def add_cores(cluster_list, activity_prop=None, align_to_core=False): 332 | """Find and add cores to the cluster_list in-place. 333 | 334 | Parameters: 335 | align_to_core (bool): align the cluster members to their core. The core is the MCS of the cluster, which can lead to very small structures.
336 | activity_prop (str): the name of the property for the activity.""" 337 | 338 | # first, remove any already existing cores 339 | members_all = get_members(cluster_list) 340 | 341 | # find cluster numbers 342 | cl_ids = set(tools.get_value(mol.GetProp("Cluster_No")) for mol in members_all.mols_with_prop("Cluster_No")) 343 | 344 | for cl_id in cl_ids: 345 | cluster = get_clusters_by_no(members_all, cl_id, make_copy=False) 346 | if len(cluster) < 3: 347 | continue 348 | 349 | # determine the cluster core by MCSS 350 | # do not generate cores for singletons and pairs 351 | 352 | core_mol = tools.find_mcs(cluster) 353 | if core_mol is None: 354 | continue 355 | 356 | tools.check_2d_coords(core_mol) 357 | 358 | # set a number of properties for the cluster core 359 | id_prop = tools.guess_id_prop(cluster[0].GetPropNames()) 360 | core_mol.SetProp(id_prop, str(cl_id)) 361 | core_mol.SetProp("is_core", "yes") 362 | core_mol.SetProp("Cluster_No", str(cl_id)) 363 | core_mol.SetProp("Num_Members", str(len(cluster))) 364 | 365 | if align_to_core: 366 | # align_mol = MurckoScaffold.GetScaffoldForMol(core_mol) 367 | cluster.align(core_mol) 368 | 369 | members_all.insert(0, core_mol) 370 | 371 | return members_all 372 | 373 | 374 | def add_centers(cluster_list, mode="most_active", activity_prop=None, min_num_members=3, **kwargs): 375 | """Add cluster centers, which take the role of the cores. 376 | Unlike a core, a center is not an MCS but one of the actual cluster members, 377 | selected by the given mode. 378 | 379 | Parameters: 380 | mode (str): `most_active` (default): if activity_prop is not None, the most active compound is taken. 381 | `smallest`: the compound with the smallest number of heavy atoms is taken as center. 382 | `center`: the compound with the median number of heavy atoms is taken (the middle of the sorted list). 383 | `from_tag`: takes the molecule that carries the `tag` property as center.
384 | The `tag` parameter needs to be defined.""" 385 | 386 | if "active" in mode: 387 | if activity_prop is not None: 388 | al = activity_prop.lower() 389 | reverse = False 390 | if "pic50" in al or "pec50" in al: 391 | reverse = True 392 | 393 | else: 394 | mode = "center" 395 | 396 | # first, remove any already existing cores 397 | members_all = get_members(cluster_list) 398 | 399 | # find cluster numbers 400 | cl_ids = set(tools.get_value(mol.GetProp("Cluster_No")) for mol in members_all.mols_with_prop("Cluster_No")) 401 | 402 | for cl_id in cl_ids: 403 | cluster = get_clusters_by_no(members_all, cl_id, make_copy=False) 404 | if len(cluster) < min_num_members: 405 | continue 406 | 407 | if "active" in mode: 408 | cluster.sort_list(activity_prop, reverse=reverse) 409 | core_mol = deepcopy(cluster[0]) 410 | 411 | elif "smallest" in mode: # smallest 412 | cluster.sort(key=Desc.HeavyAtomCount) 413 | core_mol = deepcopy(cluster[0]) 414 | 415 | elif "center" in mode: # medium number of heavy atoms, the middle of the list 416 | cluster.sort(key=Desc.HeavyAtomCount) 417 | core_mol = deepcopy(cluster[len(cluster) // 2]) 418 | 419 | elif "tag" in mode: 420 | tag = kwargs.get("tag", None) 421 | if tag is None: 422 | raise KeyError("Parameter `tag` is required but could not be found.") 423 | tmp_list = cluster.has_prop_filter(tag) 424 | if len(tmp_list) == 0: 425 | print("No core for cluster {} (tag: {})".format(cl_id, tag)) 426 | continue 427 | core_mol = tmp_list[0] 428 | 429 | for prop in core_mol.GetPropNames(): 430 | core_mol.ClearProp(prop) 431 | 432 | # set a number of properties for the cluster core 433 | id_prop = tools.guess_id_prop(cluster[0].GetPropNames()) 434 | core_mol.SetProp(id_prop, str(cl_id)) 435 | core_mol.SetProp("is_core", "yes") 436 | core_mol.SetProp("Cluster_No", str(cl_id)) 437 | core_mol.SetProp("Num_Members", str(len(cluster))) 438 | 439 | members_all.insert(0, core_mol) 440 | 441 | return members_all 442 | 443 | 444 | def 
get_mol_list_from_index_list(orig_sdf, index_list, cl_id): 445 | """generate sdf_lists after clustering""" 446 | cluster_list = tools.Mol_List() 447 | for x in index_list: 448 | mol = deepcopy(orig_sdf[x]) 449 | mol.SetProp("Cluster_No", str(cl_id)) 450 | cluster_list.append(mol) 451 | 452 | if len(cluster_list) == 2: 453 | cluster_list[0].SetProp("is_pair", "yes") 454 | cluster_list[1].SetProp("is_pair", "yes") 455 | elif len(index_list) == 1: 456 | cluster_list[0].SetProp("is_single", "yes") 457 | 458 | return cluster_list 459 | 460 | 461 | def cluster_from_mol_list(mol_list, cutoff=0.8, fp="ecfp6", activity_prop=None, 462 | summary_only=True, generate_cores=False, align_to_core=False): 463 | """Clusters the input Mol_List. 464 | 465 | Parameters: 466 | mol_list (tools.Mol_List): the input molecule list. 467 | cutoff (float): similarity cutoff for putting molecules into the same cluster. 468 | 469 | Returns: 470 | A new Mol_List containing the input molecules with their respective cluster number, 471 | as well as additionally the cluster cores, containing some statistics.""" 472 | 473 | try: 474 | fp_func = FPDICT[fp] 475 | except KeyError: 476 | print("Fingerprint {} not found. 
Available fingerprints are: {}".format(fp, ", ".join(sorted(FPDICT.keys())))) 477 | return 478 | 479 | counter = Counter() 480 | 481 | # generate the fingerprints 482 | fp_list = [fp_func(mol) for mol in mol_list] 483 | 484 | # second generate the distance matrix: 485 | dists = [] 486 | num_of_fps = len(fp_list) 487 | for i in range(1, num_of_fps): 488 | sims = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i]) 489 | dists.extend([1 - x for x in sims]) 490 | 491 | # now cluster the data: 492 | cluster_idx_list = Butina.ClusterData(dists, num_of_fps, cutoff, isDistData=True) 493 | for cluster in cluster_idx_list: 494 | counter[len(cluster)] += 1 495 | print(" fingerprint:", fp) 496 | print(" clustersize num_of_clusters") 497 | print(" =========== ===============") 498 | for length in sorted(counter.keys(), reverse=True): 499 | print(" {:4d} {:3d}".format(length, counter[length])) 500 | print() 501 | 502 | if summary_only: 503 | return None 504 | 505 | cluster_list = tools.Mol_List() 506 | 507 | # go over each list of indices to collect the cluster's molecules 508 | for cl_id, idx_list in enumerate(sorted(cluster_idx_list, key=len, reverse=True), 1): 509 | cluster = get_mol_list_from_index_list(mol_list, idx_list, cl_id) 510 | cluster[0].SetProp("is_repr", "yes") # The first compound in a cluster is the representative 511 | cluster_list.extend(cluster) 512 | 513 | if generate_cores: 514 | cluster_list = add_cores(cluster_list, activity_prop, align_to_core) 515 | 516 | return cluster_list 517 | 518 | 519 | def show_numbers(cluster_list, show=True): 520 | """Calculate (and show) some numbers for the cluster_list. 
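`cluster_from_mol_list` above feeds `Butina.ClusterData` a flattened lower-triangle distance matrix: for each fingerprint `i`, the distances to all fingerprints before it, appended row by row. A library-free sketch of exactly that packing, with a toy Jaccard similarity on sets standing in for `BulkTanimotoSimilarity` (the set-based "fingerprints" are an assumption for illustration only):

```python
def jaccard(a, b):
    """Toy stand-in for a Tanimoto similarity on fingerprint bit sets."""
    return len(a & b) / len(a | b)


def condensed_distances(fps):
    """Flatten the lower triangle of the pairwise distance matrix,
    row by row, in the order Butina.ClusterData expects for isDistData=True."""
    dists = []
    for i in range(1, len(fps)):
        # distances from fps[i] to every earlier fingerprint fps[0..i-1]
        dists.extend(1 - jaccard(fps[i], fp) for fp in fps[:i])
    return dists
```

For `n` fingerprints this yields `n * (n - 1) / 2` values, which is why the full similarity matrix never needs to be materialized.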
Returns a list of the cluster sizes.""" 521 | all_members = get_members(cluster_list) 522 | ctr_cl_no = Counter([int(mol.GetProp("Cluster_No")) for mol in all_members]) 523 | sizes = [s for s in ctr_cl_no.values()] 524 | ctr_size = Counter(sizes) 525 | 526 | if show: 527 | total = 0 528 | print("\nCluster Size | Number of Clusters") 529 | print("------------- + -------------------") 530 | 531 | for i in sorted(ctr_size, reverse=True): 532 | 533 | print(" {:3d} | {:3d}".format(i, ctr_size[i])) 534 | total += (i * ctr_size[i]) 535 | 536 | print() 537 | print("Number of compounds as sum of members per cluster size:", total) 538 | 539 | return sizes 540 | 541 | 542 | def core_table(mol, props=None, hist=None): 543 | if props is None: 544 | props = ["Cluster_No", "Num_Members", "Producers"] 545 | 546 | td_opt = {"align": "center"} 547 | header_opt = {"bgcolor": "#94CAEF", "align": "center"} 548 | table_list = [] 549 | 550 | cells = html.td(html.b("Molecule"), header_opt) 551 | for prop in props: 552 | pl = prop.lower() 553 | if (pl.endswith("min") or pl.endswith("max") or pl.endswith("mean") or 554 | pl.endswith("median") or pl.endswith("ic50") or pl.endswith("ic50)") or 555 | pl.endswith("activity") or pl.endswith("activity)")): 556 | pos = prop.rfind("_") 557 | if pos > 0: 558 | prop = prop[:pos] + "<br>" + prop[pos + 1:] 559 | cells.extend(html.td(html.b(prop), header_opt)) 560 | 561 | if hist is not None: 562 | header_opt["class"] = "histogram" 563 | cells.extend(html.td(html.b("Histogram"), header_opt)) 564 | 565 | rows = html.tr(cells) 566 | 567 | cells = [] 568 | 569 | if not mol: 570 | cells.extend(html.td("no structure")) 571 | 572 | else: 573 | mol_props = mol.GetPropNames() 574 | cl_no = mol.GetProp("Cluster_No") 575 | img_file = "img/core_{}.png".format(cl_no) 576 | img = tools.autocrop(Draw.MolToImage(mol)) 577 | img.save(img_file, format='PNG') 578 | img_src = img_file 579 | 580 | cell = html.img(img_src) 581 | cells.extend(html.td(cell, td_opt)) 582 | 583 | for prop in props: 584 | td_opt = {"align": "center"} 585 | if prop in mol_props: 586 | prop_val = mol.GetProp(prop) 587 | cells.extend(html.td(prop_val, td_opt)) 588 | else: 589 | cells.extend(html.td("", td_opt)) 590 | 591 | if hist is not None: 592 | td_opt["class"] = "histogram" 593 | if "img/" not in hist: 594 | hist = "img/" + hist 595 | img_opt = {"height": "220"} 596 | cell = html.img(hist, img_opt) 597 | cells.extend(html.td(cell, td_opt)) 598 | 599 | rows.extend(html.tr(cells)) 600 | 601 | table_list.extend(html.table(rows)) 602 | 603 | # print(table_list) 604 | return "".join(table_list) 605 | 606 | 607 | def write_report(cluster_list, title="Clusters", props=None, reverse=True, **kwargs): 608 | """Useful kwargs: core_props (list, props to show for the core, 609 | default: ["Cluster_No", "Num_Members", "Producers"]). The exact names of the props have to be given (with `_Mean` etc.). 610 | bins (int or list, default=10), align (bool) 611 | add_stats (bool): whether to add the statistics on the fly 612 | or use any precalculated ones.
Default: False 613 | show_hist (bool): whether to show histograms or not (default: True).""" 614 | resource_dir = op.join(op.dirname(__file__), "resources") 615 | cur_dir = op.abspath(op.curdir) 616 | 617 | core_props = kwargs.get("core_props", None) 618 | align = kwargs.get("align", False) 619 | content = [ft.CLUSTER_REPORT_INTRO] 620 | bins = kwargs.get("bins", 10) 621 | show_hist = kwargs.get("show_hist", True) 622 | add_stats = kwargs.get("add_stats", False) 623 | 624 | pb = nbt.ProgressbarJS() 625 | 626 | print(" Copying resources...") 627 | if op.isdir("./clustering"): 628 | print("* Clustering dir already exists, writing into...") 629 | else: 630 | shutil.copytree(op.join(resource_dir, "clustering"), "./clustering") 631 | 632 | os.chdir("clustering") 633 | if add_stats: 634 | print(" Adding statistical information...") 635 | props_to_stat = [] 636 | for prop in core_props: 637 | if "Num_Members" in prop or "Cluster_No" in prop: continue # no stats for these props! 638 | for stat_type in ["_Min", "_Max", "_Mean", "_Median", "s"]: 639 | if prop.endswith(stat_type): 640 | prop = prop[:-(len(stat_type))] 641 | if prop not in props_to_stat: 642 | props_to_stat.append(prop) 643 | add_stats_to_cores(cluster_list, props_to_stat) 644 | 645 | print(" Generating Report...") 646 | 647 | # collect the cluster numbers in the order in which they are in the cluster_list: 648 | cluster_numbers = OrderedDict() 649 | for mol in cluster_list.has_prop_filter("Cluster_No"): 650 | cluster_numbers[int(mol.GetProp("Cluster_No"))] = 0 # dummy value 651 | 652 | if props is not None and not isinstance(props, list): 653 | props = [props] 654 | 655 | len_cluster_numbers = len(cluster_numbers) 656 | for idx, cl_no in enumerate(cluster_numbers, 1): 657 | pb.update(100 * idx / len_cluster_numbers) 658 | cluster = get_clusters_by_no(cluster_list, cl_no) 659 | if len(cluster) == 0: continue 660 | if align and len(cluster) > 1: 661 | cluster.align() 662 | 663 | hist_fn = None 664 | if props is
not None: 665 | first_prop = props[0] 666 | cluster.sort_list(first_prop, reverse=reverse) 667 | if MPL and show_hist and len(cluster) > 4: 668 | hist_fn = "img/hist_{}.png".format(cl_no) 669 | data = [tools.get_value(mol.GetProp(first_prop)) for mol in cluster if mol.HasProp(first_prop)] 670 | mpl_hist(data, bins=bins, xlabel=first_prop, fn=hist_fn) 671 | 672 | content.append("""
\n

Cluster {:03d}

""".format(cl_no, cl_no)) 673 | core = get_cores(cluster) 674 | if len(core) > 0: 675 | content.append("
    \n
  • \n") 676 | content.append(core_table(core[0], props=core_props, hist=hist_fn)) 677 | content.append("
      \n
    • \n

      Members:

      ") 678 | 679 | members = get_members(cluster) 680 | content.append(members.grid(props=props, size=300, raw=True, img_dir="img")) 681 | content.append("
      ") 682 | if len(core) > 0: 683 | content.append("
    • \n
  • \n
") 684 | 685 | content.append(ft.CLUSTER_REPORT_EXTRO) 686 | print(" Writing report...") 687 | open("index.html", "w").write("\n".join(content)) 688 | os.chdir(cur_dir) 689 | print(" done. The report has been written to \n {}.".format(op.join(cur_dir, "clustering", "index.html"))) 690 | pb.done() 691 | 692 | 693 | def add_remarks_to_report(remarks, highlights=None, index_file="clustering/index.html"): 694 | """Takes remarks and adds them to the clustering report by cluster number. 695 | Writes a new `index_remarks.html` file 696 | 697 | Parameters: 698 | remarks (dict): keys are cluster numbers, values are lists of two elements, 699 | first is true/false for highlighting of the cluster, second is the remark as text 700 | highlights (list): list of strings whose occurrence in the report should be highlighted 701 | index_file (str): the name of the report index file""" 702 | 703 | if isinstance(highlights, str): # make a list out of a single string argument 704 | highlights = [highlights] 705 | new_file = open(op.join(op.dirname(index_file), "index_remarks.html"), "w") 706 | for line in open(index_file): 707 | for cl_no in remarks: 708 | if "Cluster {:03d}".format(cl_no) in line: 709 | hl = remarks[cl_no][0] 710 | txt = remarks[cl_no][1] 711 | if hl: 712 | line = '

Cluster {0:03d}

'.format(cl_no) 713 | else: 714 | line = '

Cluster {0:03d}

'.format(cl_no) 715 | if len(txt) > 1: 716 | line += '

{0}

\n'.format(txt) 717 | else: 718 | line += '\n' 719 | break 720 | if highlights is not None: 721 | for hl in highlights: 722 | if hl in line: 723 | line = line.replace(hl, '
{}
'.format(hl)) 724 | new_file.write(line) 725 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/file_templ.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ############## 5 | File Templates 6 | ############## 7 | 8 | *Created on Sun Apr 3 19:30 2016 by A. Pahl* 9 | 10 | File templates for Reporting. 11 | """ 12 | 13 | 14 | STYLE = """""" 43 | 44 | CLUSTER_REPORT_INTRO = """ 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | Clustering 57 | 58 | 59 | 60 | 61 |

62 |

63 | """ 64 | 65 | CLUSTER_REPORT_EXTRO = """ 66 |

67 |

68 | 69 | 74 | 75 | 76 | """ 77 | 78 | # string template ${cluster_list}: 79 | CLUSTER_PY = """# list method index is not (yet) available 80 | def index(l, elmnt): 81 | for idx, el in enumerate(l): 82 | if el == elmnt: 83 | return idx 84 | return -1 # not found 85 | 86 | class Cluster: 87 | def __init__(self): 88 | self.current_cluster = 0 89 | self.list_idx = 0 90 | self.cluster_list = [${cluster_list}] 91 | 92 | def show_cluster(self): 93 | document.getElementById("display_no").innerHTML = self.current_cluster 94 | document.getElementById("cluster_frame").src = "clusters/cluster_{}.html".format(self.current_cluster) 95 | 96 | def get_cluster(self): 97 | cluster = int(document.getElementById("inp_cluster_no").value) 98 | idx = index(self.cluster_list, cluster) 99 | if idx >= 0: 100 | self.current_cluster = cluster 101 | self.list_idx = idx 102 | self.show_cluster() 103 | 104 | def next_cluster(self): 105 | if self.list_idx < len(self.cluster_list) - 1: 106 | self.list_idx += 1 107 | self.current_cluster = self.cluster_list[self.list_idx] 108 | self.show_cluster() 109 | 110 | def prev_cluster(self): 111 | if self.list_idx > 0: 112 | self.list_idx -= 1 113 | self.current_cluster = self.cluster_list[self.list_idx] 114 | self.show_cluster() 115 | 116 | cluster = Cluster() 117 | """ 118 | 119 | CLUSTER_HTML = """ 120 | 121 | 122 | ${style} 123 | 124 | 125 |
126 | 129 |

Cluster Report

130 |

131 | Cluster No.:
132 | 133 |

134 | 135 |   136 | 137 |

138 |
139 | 140 | 141 | 142 | """ 143 | 144 | # string template ${cluster_list}: 145 | CLUSTER_JS = """'use strict';function report_clusters(){function D(a,b,c){"undefined"==typeof b&&(b=a,a=0);"undefined"==typeof c&&(c=1);if(0=b||0>c&&a<=b)return[];for(var e=[];0b;a+=c)e.push(a);return e}function E(a){return F(D(n(a)),a)}function z(a){if(null==a||"object"==typeof a)return a;var b={},c;for(c in obj)a.hasOwnProperty(c)&&(b[c]=a[c]);return b}function G(a){if(null==a||"object"==typeof a)return a;var b={},c;for(c in obj)a.hasOwnProperty(c)&&(b[c]=G(a[c]));return b}function m(a){return a? 146 | [].slice.apply(a):[]}function h(a){a=a?[].slice.apply(a):[];a.__class__=h;return a}function k(a){var b=[];if(a)for(var c=0;cb(c)}):a.sort();c&&a.reverse()};a.Exception=b;a.ValueError=c;a.__sort__=e;a.sorted=function(a,b,c){if("undefined"==typeof b||null!=b&&b.__class__==l)b=null;if("undefined"==typeof c||null!=c&&c.__class__==l)c=!1;if(arguments.length){var d= 153 | arguments.length-1;if(arguments[d]&&arguments[d].__class__==l){var d=arguments[d--],f;for(f in d)switch(f){case "iterable":a=d[f];break;case "key":b=d[f];break;case "reverse":c=d[f]}}}f=C(a)==t?z(a.py_keys()):z(a);e(f,b,c);return f}}}});u(d,"",A(d.org.transcrypt.__base__));var L=d.__envir__;u(d,"",A(d.org.transcrypt.__standard__));var O=d.__sort__;L.executor_name=L.transpiler_name;d.main={__file__:""};d.__except__=null;var l=function(a){a.__class__=l;a.constructor=Object;return a};d.___kwargdict__= 154 | l;d.property=function(a,b){b||(b=function(){});return{get:function(){return a(this)},set:function(a){b(this,a)},enumerable:!0}};d.__merge__=function(a,b){var c={},e;for(e in a)c[e]=a[e];for(e in b)c[e]=b[e];return c};var M=function(){for(var a=[].slice.apply(arguments),b="",c=0;c)"),"???"}}}};d.repr=K;d.chr=function(a){return String.fromCharCode(a)};d.org=function(a){return a.charCodeAt(0)};var F=function(){var a=[].slice.call(arguments);return(0==a.length?[]:a.reduce(function(a,c){return 
a.lengtha&&(a=this.length+a);null==b?b=this.length: 158 | 0>b&&(b=this.length+b);for(var e=m([]);aa&&(a=this.length+a);null==b?b=this.length:0>b&&(b=this.length+b);if(null==c)Array.prototype.splice.apply(this,[a,b-a].concat(e));else for(var d=0;aa)c=e-1;else return e}return-1};Array.prototype.add=function(a){-1==this.indexOf(a)&&this.push(a)};Array.prototype.discard=function(a){a=this.indexOf(a);-1!=a&&this.splice(a, 161 | 1)};Array.prototype.isdisjoint=function(a){this.sort();for(var b=0;b 42 | HIGHCHARTS = """ 43 | 44 | 45 | 46 | 47 | """.format(hc_loc=HC_LOCATION) 48 | 49 | CHART_TEMPL = """
50 | 59 | """ 60 | 61 | #: Currently supported chart kinds 62 | CHART_KINDS = ["scatter", "column"] 63 | 64 | #: Currently supported tooltip options 65 | TOOLTIP_OPTIONS = "struct" 66 | 67 | if AP_TOOLS: 68 | #: Library version 69 | VERSION = apt.get_commit(__file__) 70 | # I use this to keep track of the library versions I use in my project notebooks 71 | print("{:45s} (commit: {})".format(__name__, VERSION)) 72 | else: 73 | print("- loading highcharts...") 74 | 75 | display(HTML(HIGHCHARTS)) 76 | 77 | 78 | class ColorScale(): 79 | """Used for continuous coloring.""" 80 | 81 | def __init__(self, num_values, val_min, val_max): 82 | self.num_values = num_values 83 | self.num_val_1 = num_values - 1 84 | self.value_min = val_min 85 | self.value_max = val_max 86 | self.value_range = self.value_max - self.value_min 87 | self.color_scale = [] 88 | hsv_tuples = [(0.35 + ((x * 0.65) / (self.num_val_1)), 0.9, 0.9) for x in range(self.num_values)] 89 | rgb_tuples = map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples) 90 | for rgb in rgb_tuples: 91 | rgb_int = [int(255 * x) for x in rgb] 92 | self.color_scale.append('#{:02x}{:02x}{:02x}'.format(*rgb_int)) 93 | 94 | def __call__(self, value, reverse=False): 95 | """return the color from the scale corresponding to the place in the value_min .. value_max range""" 96 | pos = int(((value - self.value_min) / self.value_range) * self.num_val_1) 97 | 98 | if reverse: 99 | pos = self.num_val_1 - pos 100 | 101 | return self.color_scale[pos] 102 | 103 | 104 | class Chart(): 105 | """Available Chart kinds: scatter, column. 106 | 107 | Parameters: 108 | radius (int): Size of the points. 
Alias: *r* 109 | y_title (str): Used in the column plot as title of the y axis, default: "".""" 110 | 111 | def __init__(self, kind="scatter", **kwargs): 112 | if kind not in CHART_KINDS: 113 | raise ValueError("{} is not a supported chart kind ({})".format(kind, CHART_KINDS)) 114 | 115 | self.kind = kind 116 | self.height = kwargs.get("height", 450) 117 | radius = kwargs.get("r", kwargs.get("radius", 5)) # accept "r" or "radius" for this option 118 | self.legend = kwargs.get("legend", None) 119 | self.y_title = kwargs.get("y_title", "") 120 | self.chart_id = time.strftime("%y%m%d%H%M%S") 121 | self.chart = {} 122 | self.chart["title"] = {"text": kwargs.get("title", "{} plot".format(self.kind))} 123 | self.chart["subtitle"] = {"text": kwargs.get("subtitle")} 124 | self.chart["series"] = [] 125 | self.chart["plotOptions"] = {"scatter": {"marker": {"radius": radius}}} 126 | self.chart["credits"] = {'enabled': False} 127 | self.chart["yAxis"] = {"title": {"enabled": True, "text": self.y_title}} 128 | 129 | 130 | def _structure_tooltip(self, i): 131 | tooltip = [] 132 | if self.arg_pid: 133 | tooltip.extend([str(self.dpid[i]), "
"]) 134 | tooltip.extend(['
', 135 | str(self.dmol[i]), "
"]) 136 | return "".join(tooltip) 137 | 138 | 139 | def _extended_tooltip(self): 140 | ext_tt = [[] for idx in self.dx] 141 | for idx, _ in enumerate(self.dx): 142 | for field in self.arg_include_in_tooltip: 143 | ext_tt[idx].append("{}: {}".format(field, self.include[field][idx])) 144 | if self.arg_pid: 145 | ext_tt[idx].append("{}: {}".format(self.arg_pid, self.dpid[idx])) 146 | self.dpid = ["
".join(i) for i in ext_tt] 147 | self.arg_pid = True 148 | 149 | 150 | def _data_columns(self): 151 | """Generate the data for the Column plot""" 152 | data = [] 153 | cats = [] 154 | 155 | for i in range(self.dlen): 156 | try: 157 | cv = str(self.dx[i]) 158 | dv = float(self.dy[i]) 159 | except TypeError: 160 | continue 161 | 162 | cats.append(cv) 163 | data.append(dv) 164 | 165 | self.chart["series"].append({"name": self.arg_y, "data": data}) 166 | self.chart["xAxis"]["categories"] = cats 167 | 168 | 169 | def _data_tuples(self, d): 170 | """Generate the data tuples required for Highcharts scatter plot.""" 171 | data = [] 172 | dx = d["x"] 173 | dy = d["y"] 174 | if self.arg_z: 175 | dz = d["z"] 176 | if self.arg_pid or self.arg_struct or self.arg_include_in_tooltip: 177 | dpid = d["id"] 178 | if self.arg_color_by: 179 | dcolorval = d["color_by"] 180 | 181 | for i in range(len(dx)): 182 | try: 183 | tmp_d = {"x": float(dx[i]), "y": float(dy[i])} 184 | if self.arg_z: 185 | tmp_d["z"] = float(dz[i]) 186 | if self.arg_pid or self.arg_struct or self.arg_include_in_tooltip: 187 | tmp_d["id"] = str(dpid[i]) 188 | if self.arg_color_by: 189 | color_val = float(dcolorval[i]) 190 | color_code = self.color_scale(color_val, reverse=self.arg_reverse) 191 | tmp_d["z"] = color_val 192 | tmp_d["color"] = color_code 193 | marker = {"fillColor": color_code, 194 | "states": {"hover": {"fillColor": color_code}}} 195 | tmp_d["marker"] = marker 196 | 197 | data.append(tmp_d) 198 | 199 | except TypeError: 200 | pass 201 | 202 | return data 203 | 204 | 205 | def _series_discrete(self): 206 | # [{"name": "A", "data": [{"x": 1, "y": 2}, {"x": 2, "y": 3}]}, 207 | # {"name": "B", "data": [{"x": 2, "y": 3}, {"x": 3, "y": 4}]}] 208 | self.arg_z = None # not implemented yet 209 | series = [] 210 | 211 | names = set(str(c) for c in self.dseries_by) 212 | data_series_x = {name: [] for name in names} 213 | data_series_y = {name: [] for name in names} 214 | if self.arg_pid: 215 | data_series_id = 
{name: [] for name in names} 216 | if self.arg_struct: 217 | data_series_mol = {name: [] for name in names} 218 | if self.arg_color_by: 219 | data_series_color = {name: [] for name in names} 220 | 221 | for i in range(self.dlen): 222 | series_by_str = str(self.dseries_by[i]) 223 | data_series_x[series_by_str].append(float(self.dx[i])) 224 | data_series_y[series_by_str].append(float(self.dy[i])) 225 | if self.arg_struct: 226 | data_series_mol[series_by_str].append(self._structure_tooltip(i)) 227 | elif self.arg_pid: 228 | data_series_id[series_by_str].append(str(self.dpid[i])) 229 | if self.arg_color_by: 230 | data_series_color[series_by_str].append(self.dcolor_by[i]) 231 | 232 | for name in names: 233 | tmp_d = {"x": data_series_x[name], "y": data_series_y[name]} 234 | if self.arg_struct: 235 | tmp_d["id"] = data_series_mol[name] 236 | elif self.arg_pid: 237 | tmp_d["id"] = data_series_id[name] 238 | if self.arg_color_by: 239 | tmp_d["color_by"] = data_series_color[name] 240 | 241 | series_dict = {"name": name} 242 | series_dict["data"] = self._data_tuples(tmp_d) 243 | series.append(series_dict) 244 | 245 | return series 246 | 247 | 248 | def add_data(self, d, x="x", y="y", z=None, **kwargs): 249 | """Add the data to the chart. 250 | 251 | Parameters: 252 | d (dictionary or dataframe): The input dictionary 253 | str x , y [, z]: The keys for the properties to plot. 254 | 255 | Other Parameters: 256 | pid (str): The name of a (compound) id to be displayed in the tooltip. 257 | Defaults to *None*. 258 | tooltip (str): enable structure tooltips (currently only implemented for RDKit dataframes). 259 | Possible values: *"", "struct"*. Defaults to "". 260 | mol_col (str): Structure column in the df used for the tooltip 261 | (used if tooltip="struct"). Defaults to *"mol"*. 262 | color_by (str, None): property to use for coloring. Defaults to *None* 263 | series_by (str, None): property to use as series. Defaults to *None* 264 | color_mode (str): Point coloring mode. 
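`_series_discrete` above builds one Highcharts series per distinct `series_by` value by fanning the parallel x/y lists out into per-category lists. The core of that grouping, sketched without the tooltip and color handling (the function name is illustrative):

```python
def split_into_series(d, x="x", y="y", series_by="series_by"):
    """Group the parallel lists d[x], d[y] by the category in d[series_by],
    returning one {"name", "x", "y"} dict per category."""
    names = sorted(set(str(c) for c in d[series_by]))
    series = {name: {"name": name, "x": [], "y": []} for name in names}
    for i, cat in enumerate(d[series_by]):
        s = series[str(cat)]
        s["x"].append(d[x][i])
        s["y"].append(d[y][i])
    return [series[name] for name in names]
```

Each returned entry corresponds to one `{"name": ..., "data": [...]}` series dict in the chart, so Highcharts assigns every category its own legend entry and default color.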
Alias: *mode* 265 | Available values: *"disc", "discrete", "cont", "continuos"*. Defaults to *"disc"* 266 | reverse (bool): Reverse the ColorScale. Defaults to *False*. 267 | """ 268 | 269 | if x not in d or y not in d: 270 | raise KeyError("'{x}' and '{y}' are required parameters for scatter plot, but could not all be found in dict.".format(x=x, y=y)) 271 | 272 | if len(d[x]) != len(d[y]): 273 | raise ValueError("'{x}' and '{y}' must have the same length.".format(x=self.arg_x, y=self.arg_y)) 274 | 275 | self.arg_x = x 276 | self.arg_y = y 277 | self.arg_z = z 278 | self.arg_series_by = kwargs.get("series_by", None) 279 | self.arg_color_by = kwargs.get("color_by", None) 280 | self.arg_pid = kwargs.get("pid", None) 281 | self.arg_color_discrete = "disc" in kwargs.get("color_mode", kwargs.get("mode", "discrete")) 282 | self.arg_reverse = kwargs.get("reverse", False) 283 | self.arg_include_in_tooltip = kwargs.get("include_in_tooltip", kwargs.get("include", "xxx")) 284 | if not isinstance(self.arg_include_in_tooltip, list): 285 | self.arg_include_in_tooltip = [self.arg_include_in_tooltip] 286 | self.arg_jitter = kwargs.get("jitter", None) 287 | self.jitter_mag = kwargs.get("mag", 0.2) 288 | 289 | 290 | 291 | self.dx = list(d[x]) 292 | self.dy = list(d[y]) 293 | self.dlen = len(self.dx) 294 | 295 | if self.arg_jitter: 296 | for j in self.arg_jitter: 297 | if j == x: 298 | for i, val in enumerate(self.dx): 299 | self.dx[i] = val + self.jitter_mag * random.random() * 2 - self.jitter_mag 300 | 301 | if j == y: 302 | for i, val in enumerate(self.dy): 303 | self.dy[i] = val + self.jitter_mag * random.random() * 2 - self.jitter_mag 304 | 305 | if self.arg_pid: 306 | # pandas data series and pid == index 307 | if not isinstance(d[x], list) and self.arg_pid == d.index.name: 308 | self.dpid = list(d.index) 309 | else: 310 | # self.dpid = ["{}: {}".format(self.arg_pid, i) for i in list(d[self.arg_pid])] 311 | self.dpid = list(d[self.arg_pid]) 312 | 313 | if self.dlen != 
len(self.dpid): 314 | raise ValueError("'{x}' and '{pid}' must have the same length.".format(x=self.arg_x, pid=self.arg_pid)) 315 | else: 316 | self.arg_pid = None 317 | 318 | self.chart["xAxis"] = {"title": {"enabled": True, "text": self.arg_x}} 319 | 320 | ######################### 321 | # plot-specific options # 322 | ######################### 323 | if self.kind in ["scatter"]: 324 | if not self.y_title: 325 | self.chart["yAxis"] = {"title": {"enabled": True, "text": self.arg_y}} 326 | self.arg_tooltip = kwargs.get("tooltip", "") 327 | if self.arg_tooltip not in TOOLTIP_OPTIONS: 328 | print("- unknown tooltip option {}, setting to empty.".format(self.arg_tooltip)) 329 | self.arg_tooltip = "" 330 | self.arg_struct = "struct" in self.arg_tooltip 331 | self.arg_mol_col = kwargs.get("mol_col", "mol") 332 | if self.arg_struct: 333 | self.dmol = list(d[self.arg_mol_col]) 334 | 335 | self.include = {} 336 | includes = self.arg_include_in_tooltip[:] 337 | for field in includes: 338 | if field in d: 339 | self.include[field] = list(d[field]) 340 | else: 341 | self.arg_include_in_tooltip.remove(field) # remove fields that are not present in the data set 342 | 343 | if self.arg_pid or self.arg_include_in_tooltip: 344 | self._extended_tooltip() 345 | 346 | 347 | if self.kind == "scatter": 348 | self.chart["chart"] = {"type": "scatter", "zoomType": "xy"} 349 | # defining the tooltip 350 | self.chart["tooltip"] = {"useHTML": True} 351 | # self.chart["tooltip"]["headerFormat"] = "{y} vs. {x}
".format(x=x, y=y) 352 | self.chart["tooltip"]["headerFormat"] = "" 353 | 354 | point_format = ["{x}: {{point.x}}
{y}: {{point.y}}".format(x=self.arg_x, y=self.arg_y)] 355 | if self.arg_color_by: 356 | point_format.append("{color_by}: {{point.z}}".format(color_by=self.arg_color_by)) 357 | if self.arg_pid or self.arg_struct or self.arg_include_in_tooltip: 358 | point_format.append("{point.id}") 359 | self.chart["tooltip"]["pointFormat"] = "
".join(point_format) 360 | 361 | if not self.legend: 362 | self.chart["legend"] = {'enabled': False} 363 | else: 364 | self.chart["legend"] = {'enabled': True, "align": "right"} 365 | 366 | 367 | ############################ 368 | # defining the data series # 369 | ############################ 370 | if self.arg_series_by: 371 | if self.dlen != len(d[self.arg_series_by]): 372 | raise ValueError("'{x}' and '{series_by}' must have the same length.".format(x=self.arg_x, series_by=self.arg_series_by)) 373 | self.dseries_by = list(d[self.arg_series_by]) 374 | if self.arg_color_by: 375 | if self.dlen != len(d[self.arg_color_by]): 376 | raise ValueError("'{x}' and '{color_by}' must have the same length.".format(x=self.arg_x, color_by=self.arg_color_by)) 377 | self.dcolor_by = list(d[self.arg_color_by]) 378 | # self.chart["colorAxis"] = {"minColor": "#FFFFFF", "maxColor": "Highcharts.getOptions().colors[0]"} 379 | min_color_by = min(self.dcolor_by) 380 | max_color_by = max(self.dcolor_by) 381 | self.color_scale = ColorScale(20, min_color_by, max_color_by) 382 | # self.chart["colorAxis"] = {"min": min_color_by, "max": max_color_by, 383 | # "minColor": '#16E52B', "maxColor": '#E51616'} 384 | # if self.legend != False: 385 | # self.chart["legend"] = {'enabled': True} 386 | if not self.chart["subtitle"]["text"]: 387 | self.chart["subtitle"]["text"] = "colored by {} ({:.2f} .. {:.2f})".format(self.arg_color_by, min_color_by, max_color_by) 388 | if self.arg_z: 389 | if self.dlen != len(d[z]): 390 | raise ValueError("'{x}' and '{z}' must have the same length.".format(x=self.arg_x, pid=self.arg_pid)) 391 | self.dz = list(d[z]) 392 | 393 | 394 | if self.arg_series_by: 395 | if self.legend is not False: 396 | self.chart["legend"] = {'enabled': True, "title": {"text": self.arg_series_by}, 397 | "align": "right"} 398 | self.chart["tooltip"]["headerFormat"] = '{series_by}: {{series.name}}
'.format(series_by=self.arg_series_by) 399 | series = self._series_discrete() 400 | self.chart["series"].extend(series) 401 | else: 402 | tmp_d = {"x": self.dx, "y": self.dy} 403 | if self.arg_struct: 404 | tmp_d["id"] = [self._structure_tooltip(i) for i in range(self.dlen)] 405 | elif self.arg_pid: 406 | tmp_d["id"] = self.dpid 407 | if self.arg_color_by: # continuous values 408 | tmp_d["color_by"] = self.dcolor_by 409 | 410 | data = self._data_tuples(tmp_d) 411 | self.chart["series"].append({"name": "series", "data": data}) 412 | 413 | if self.kind == "column": 414 | self.chart["chart"] = {"type": "column", "zoomType": "xy"} 415 | self._data_columns() 416 | 417 | 418 | def show(self, debug=False): 419 | """Show the plot.""" 420 | formatter = string.Template(CHART_TEMPL) 421 | # if debug: 422 | # print(self.chart) 423 | chart_json = json.dumps(self.chart) 424 | html = formatter.substitute({"id": self.chart_id, "chart": chart_json, 425 | "height": self.height}) 426 | # html = html.replace('"Highcharts.getOptions().colors[0]"', 'Highcharts.getOptions().colors[0]') 427 | if debug: 428 | print(self.dpid) 429 | print(html) 430 | return HTML(html) 431 | 432 | 433 | def guess_id_prop(prop_list): # try to guess an id_prop 434 | for prop in prop_list: 435 | if prop.lower().endswith("id"): 436 | return prop 437 | return None 438 | 439 | 440 | # Quick Predefined Plots 441 | def cpd_scatter(df, x, y, r=7, pid=None, **kwargs): 442 | """Predefined Plot #1. 
443 | Quickly plot an RDKit Pandas dataframe or a molecule dictionary with structure tooltips.""" 444 | 445 | if not pid: 446 | if isinstance(df, dict): 447 | prop_list = df.keys() 448 | else: 449 | prop_list = [df.index.name] 450 | prop_list.extend(df.columns.values) 451 | 452 | pid = guess_id_prop(prop_list) 453 | 454 | title = kwargs.get("title", "Compound Scatter Plot") 455 | scatter = Chart(title=title, r=r) 456 | scatter.add_data(df, x, y, pid=pid, **kwargs) 457 | return scatter.show() 458 | 459 | 460 | # interactive exploration of an RDKit Pandas dataframe 461 | def inspect_df(df, pid="Compound_Id", tooltip="struct"): 462 | """Use IPython's interactive widgets to visually and interactively explore an RDKit Pandas dataframe. 463 | TODO: implement!""" 464 | pass 465 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/html_templates.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | # html_templates.py 4 | """ 5 | ############## 6 | HTML Templates 7 | ############## 8 | 9 | *Created on Wed Jun 01 2015 by A. Pahl* 10 | 11 | A simple and pythonic templating library where every HTML tag is a function.
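The tag-as-function idea that html_templates is built on can be shown in a few lines. This is a deliberately simplified sketch (the module's own `tag()` additionally returns a list and supports line-feed flags; `simple_tag` is a hypothetical name, not part of the module):

```python
def simple_tag(name, content, options=None):
    """Wrap content in <name ...>...</name>; options become HTML attributes.
    Lists of stubs are joined, so tag calls nest naturally."""
    attrs = "".join(' {}="{}"'.format(k, v) for k, v in (options or {}).items())
    if isinstance(content, list):
        content = "".join(content)
    return "<{}{}>{}</{}>".format(name, attrs, content, name)
```

Because every tag call returns markup, table rows are composed by passing `td` results into `tr`, which is exactly how `core_table` in the clustering module assembles its report tables.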
12 | """ 13 | 14 | # import time 15 | # import os.path as op 16 | 17 | TABLE_OPTIONS = {"cellspacing": "1", "cellpadding": "1", "border": "1", 18 | "align": "", "height": "60px", "summary": "Table", } # "width": "800px", 19 | 20 | # PAGE_OPTIONS = {"icon": "icons/chart_bar.png", "css": ["css/style.css", "css/collapsible_list.css"], 21 | # "scripts": ["lib/jquery.js", "lib/highcharts.js", "script/folding.js"]} 22 | PAGE_OPTIONS = {"icon": "icons/benzene.png"} 23 | JSME = "lib/jsme/jsme.nocache.js" 24 | 25 | HTML_FILE_NAME = "mol_table.html" 26 | 27 | 28 | def tag(name, content, options=None, lf_open=False, lf_close=False): 29 | """creates a HTML stub with closed tags of type around 30 | with additional in the opening tag 31 | when lf_(open|close)==True, the respective tag will be appended with a line feed. 32 | returns: html stub as list""" 33 | 34 | if lf_open: 35 | lf_open_str = "\n" 36 | else: 37 | lf_open_str = "" 38 | if lf_close: 39 | lf_close_str = "\n" 40 | else: 41 | lf_close_str = "" 42 | 43 | option_str = "" 44 | if options: 45 | option_list = [" "] 46 | for option in options: 47 | option_list.extend([option, '="', str(options[option]), '" ']) 48 | 49 | option_str = "".join(option_list) 50 | 51 | stub = ["<{}{}>{}".format(name, option_str, lf_open_str)] 52 | if type(content) == list: 53 | stub.extend(content) 54 | else: 55 | stub.append(content) 56 | 57 | stub.append("{}".format(name, lf_close_str)) 58 | 59 | return stub 60 | 61 | 62 | def page(content, title="Results", header=None, summary=None, options=PAGE_OPTIONS): 63 | """create a full HTML page from a list of stubs below 64 | options dict: 65 | css: list of CSS style file paths to include. 66 | scripts: list of javascript library file paths to include. 
67 |         icon: path to icon image
68 |     returns HTML page as STRING !!!"""
69 | 
70 |     # override the title if there is a title in options
71 |     if "title" in options and len(options["title"]) > 2:
72 |         title = options["title"]
73 | 
74 |     if "icon" in options and len(options["icon"]) > 2:
75 |         icon_str = '<link rel="shortcut icon" href="{}" />'.format(options["icon"])
76 |     else:
77 |         icon_str = ""
78 | 
79 |     if "css" in options and options["css"]:
80 |         css = options["css"]
81 |         if not isinstance(css, list):
82 |             css = [css]
83 | 
84 |         css_str = "".join(['  <link rel="stylesheet" href="{}">\n'.format(file_name) for file_name in css])
85 | 
86 |     else:
87 |         # minimal inline CSS
88 |         css_str = """"""
118 | 
119 |     if "scripts" in options and options["scripts"]:
120 |         scripts = options["scripts"]
121 |         if type(scripts) != list:
122 |             scripts = [scripts]
123 | 
124 |         js_str = "".join(['  <script src="{}"></script>\n'.format(file_name) for file_name in scripts])
125 | 
126 |     else:
127 |         js_str = ""
128 | 
129 |     if header:
130 |         if not isinstance(header, list):
131 |             header = [header]
132 | 
133 |         header_str = "".join(h2(header))
134 | 
135 |     else:
136 |         header_str = ""
137 | 
138 |     if summary:
139 |         if not isinstance(summary, list):
140 |             summary = [summary]
141 | 
142 |         summary_str = "".join(p(summary))
143 | 
144 |     else:
145 |         summary_str = ""
146 | 
147 | 
148 | 
149 |     if isinstance(content, list):
150 |         content_str = "".join(content)
151 |     else:
152 |         content_str = content
153 | 
154 | 
155 |     html_page = """<!DOCTYPE html>
156 | <html>
157 | <head>
158 |   <meta charset="utf-8">
159 |   <title>{title}</title>
160 |   {icon_str}
161 |   {css_str}
162 |   {js_str}
163 | </head>
164 | <body>
165 | {header_str}
166 | {summary_str}
167 | {content_str}
168 | </body>
169 | </html>
170 | """.format(title=title, icon_str=icon_str, css_str=css_str, js_str=js_str,
171 |            header_str=header_str, summary_str=summary_str, content_str=content_str)
172 | 
173 |     return html_page
174 | 
175 | 
176 | def write(text, fn=HTML_FILE_NAME):
177 |     with open(fn, "w") as f:
178 |         f.write(text)
179 | 
180 | 
181 | def script(content):
182 |     return tag("script", content, lf_open=True, lf_close=True)
183 | 
184 | 
185 | def img(src,
options=None):
186 |     """takes a src, returns an img tag"""
187 | 
188 |     option_str = ""
189 |     if options:
190 |         option_list = [" "]
191 |         for option in options:
192 |             option_list.extend([option, '="', str(options[option]), '" '])
193 | 
194 |         option_str = "".join(option_list)
195 | 
196 |     stub = ['<img {}src="{}" alt="icon"/>'.format(option_str, src)]
197 | 
198 |     return stub
199 | 
200 | 
201 | def table(content, options=TABLE_OPTIONS):
202 |     tbody = tag("tbody", content, lf_open=True, lf_close=True)
203 |     return tag("table", tbody, options, lf_open=True, lf_close=True)
204 | 
205 | 
206 | def tr(content, options=None):
207 |     return tag("tr", content, options, lf_close=True)
208 | 
209 | 
210 | def td(content, options=None):
211 |     return tag("td", content, options, lf_close=False)
212 | 
213 | 
214 | def p(content):
215 |     return tag("p", content, lf_open=True, lf_close=True)
216 | 
217 | 
218 | def h1(content):
219 |     return tag("h1", content, lf_open=True, lf_close=True)
220 | 
221 | 
222 | def h2(content):
223 |     return tag("h2", content, lf_open=True, lf_close=True)
224 | 
225 | 
226 | def div(content, options=None):
227 |     return tag("div", content, options, lf_close=False)
228 | 
229 | 
230 | def ul(content):
231 |     return tag("ul", content, lf_open=True, lf_close=True)
232 | 
233 | 
234 | def li(content):
235 |     return tag("li", content, lf_open=False, lf_close=True)
236 | 
237 | 
238 | def li_lf(content):  # list item with opening line feed
239 |     return tag("li", content, lf_open=True, lf_close=True)
240 | 
241 | 
242 | def b(content, options=None):
243 |     return tag("b", content, options, lf_close=False)
244 | 
245 | 
246 | def a(content, options):
247 |     """the anchor tag requires an "href" in options,
248 |     therefore the "options" parameter is not optional in this case (sounds odd, but that's the way it is)"""
249 |     return tag("a", content, options, lf_close=False)
250 | 
--------------------------------------------------------------------------------
/rdkit_ipynb_tools/nb_tools.py:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | # nb_tools.py 4 | """ 5 | ############## 6 | Notebook Tools 7 | ############## 8 | 9 | *Created on Wed Apr 22 16:37:35 2015 by A. Pahl* 10 | 11 | A set of tools to use in the IPython (JuPyTer) Notebook 12 | """ 13 | 14 | # from rdkit.Chem import AllChem as Chem 15 | # from rdkit.Chem import Draw 16 | # Draw.DrawingOptions.atomLabelFontSize = 18 17 | # import rdkit.Chem.Descriptors as Desc 18 | 19 | # import sys 20 | import time 21 | 22 | 23 | def is_interactive_ipython(): 24 | try: 25 | get_ipython() 26 | ipy = True 27 | print("> interactive IPython session.") 28 | except NameError: 29 | ipy = False 30 | return ipy 31 | 32 | 33 | IPYTHON = is_interactive_ipython() 34 | 35 | if IPYTHON: 36 | from IPython.core.display import HTML, Javascript, display 37 | import uuid 38 | 39 | 40 | class ProgressbarJS(): 41 | """A class to display a Javascript progressbar in the IPython notebook.""" 42 | def __init__(self, color="#43ace8"): 43 | if IPYTHON: 44 | self.bar_id = str(uuid.uuid4()) 45 | self.eta_id = str(uuid.uuid4()) 46 | # possible colours: #94CAEF (blue from HTML reports), 47 | # #d6d2d0 (grey from window decorations) 48 | # 49 | self.pb = HTML( 50 | """ 51 | 52 | 55 |
53 |                 <div id="{}" style="background-color: {}; width: 0%;">&nbsp;</div>
54 |                 </div>&nbsp;ETA: <div id="{}" style="display: inline;">...</div>
56 | """.format(self.bar_id, color, self.eta_id)) 57 | self.prev_time = 0.0 58 | self.start_time = time.time() 59 | display(self.pb) 60 | 61 | 62 | def update(self, perc, force=False): 63 | """update the progressbar 64 | in: progress in percent""" 65 | if IPYTHON: 66 | # make sure that the update function is not called too often: 67 | self.cur_time = time.time() 68 | if force or (self.cur_time - self.prev_time >= 0.25): 69 | if perc > 100: perc = 100 70 | if perc >= 25: 71 | eta = (100 - perc) * (self.cur_time - self.start_time) / perc 72 | eta_str = format_seconds(eta) 73 | else: 74 | eta_str = "..." 75 | self.prev_time = self.cur_time 76 | display(Javascript(""" 77 | $('div#{}').width('{}%'); 78 | $('div#{}').text('{}'); 79 | """.format(self.bar_id, perc, self.eta_id, eta_str))) 80 | 81 | def done(self): 82 | """finalize with a full progressbar for aesthetics""" 83 | if IPYTHON: 84 | display(Javascript(""" 85 | $('div#{}').width('{}%'); 86 | $('div#{}').text('{}'); 87 | """.format(self.bar_id, 100, self.eta_id, "done"))) 88 | 89 | 90 | def show_progress(iterable, iter_len=0): 91 | """A convenience wrapper for the ProgressBar class around iterables. 92 | 93 | Parameters: 94 | iterable (list, generator): The iterable object over which to loop. 95 | iter_len (int): Optional length of the object. This can be given if the iter object is a generator. 
96 | 
97 |     Returns:
98 |         A generator from iterable and displays the Javascript progress bar in the notebook."""
99 | 
100 |     if iter_len == 0:
101 |         iter_len = len(iterable)
102 | 
103 |     steps = iter_len // 100
104 |     if steps < 1:
105 |         steps = 1
106 |     pb = ProgressbarJS()
107 | 
108 |     for x, item in enumerate(iterable):
109 |         if x % steps == 0:
110 |             pb.update(100 * x / iter_len)
111 | 
112 |         yield item
113 | 
114 |     pb.done()
115 | 
116 | 
117 | def format_seconds(seconds):
118 |     seconds = int(seconds)
119 |     m, s = divmod(seconds, 60)
120 |     h, m = divmod(m, 60)
121 |     t_str = "{:02.0f}h {:02d}m {:02d}s".format(h, m, s)
122 |     return t_str
123 | 
--------------------------------------------------------------------------------
/rdkit_ipynb_tools/pandas_tools.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | ############
4 | Pandas Tools
5 | ############
6 | 
7 | *Created on Wed Jul 29 12:20:19 2015 by A. Pahl*
8 | 
9 | A Set of Pandas Tools for RDKit.
10 | (built on top of rdkit.Chem.PandasTools)
11 | """
12 | 
13 | import time
14 | import os.path as op
15 | import pandas as pd
16 | 
17 | from rdkit.Chem import AllChem as Chem
18 | from rdkit.Chem import PandasTools as PT
19 | 
20 | from .
import tools, hc_tools as hct
21 | 
22 | 
23 | try:
24 |     from misc_tools import apl_tools as apt
25 |     AP_TOOLS = True
26 | except ImportError:
27 |     AP_TOOLS = False
28 | 
29 | if AP_TOOLS:
30 |     #: Library version
31 |     VERSION = apt.get_commit(__file__)
32 |     # I use this to keep track of the library versions I use in my project notebooks
33 |     print("{:45s} (commit: {})".format(__name__, VERSION))
34 | else:
35 |     print("{:45s} ({})".format(__name__, time.strftime("%y%m%d-%H:%M", time.localtime(op.getmtime(__file__)))))
36 | 
37 | 
38 | def init_PT():
39 |     """create a dummy df to initialize the RDKit PandasTools (a bit hacky, I know)."""
40 |     init = pd.DataFrame.from_dict({"id": [123, 124], "Smiles": ["c1ccccc1C(=O)N", "c1ccccc1C(=O)O"]})
41 |     PT.AddMoleculeColumnToFrame(init)
42 | 
43 | init_PT()
44 | 
45 | def move_col(df, col, new_pos=1):
46 |     """
47 |     Put column col on position new_pos.
48 |     """
49 | 
50 |     cols = list(df.columns.values)
51 |     index = cols.index(col)  # raises ValueError if not found
52 |     cols_len = len(cols)
53 |     if new_pos > cols_len - 1:
54 |         new_pos = cols_len - 1
55 | 
56 |     if index == new_pos:
57 |         print("{} is already on position {}".format(col, new_pos))
58 |         return df
59 | 
60 |     new_cols = []
61 |     for i, c in enumerate(cols):
62 |         if i == index: continue
63 |         if i == new_pos:
64 |             new_cols.append(col)
65 |         new_cols.append(c)
66 | 
67 |     new_df = df[new_cols]
68 | 
69 |     return new_df
70 | 
71 | 
72 | def df_from_mol_list(mol_list, id_prop="Compound_Id", props=None, set_index=True):
73 |     """Generate an RDKit Pandas dataframe from a Mol_List.
74 |     Now also including the structure.
75 |     If props contains a list of property names, then only these properties plus the id_prop are returned.
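The dataframe-building step can be illustrated without RDKit; here each "molecule" is reduced to a hypothetical dict of its SD properties, with the id property moved into the index as `df_from_mol_list` does:

```python
import pandas as pd

# hypothetical stand-in records; the real code collects these from a Mol_List
mol_records = [{"Compound_Id": 1, "pIC50": 7.2},
               {"Compound_Id": 2, "pIC50": 6.5}]
df = pd.DataFrame(mol_records).set_index("Compound_Id")  # id column becomes the index
```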
76 | Returns RDKit Pandas dataframe""" 77 | 78 | prop_list = tools.list_fields(mol_list) 79 | 80 | if id_prop: 81 | guessed_id = id_prop 82 | else: 83 | guessed_id = tools.guess_id_prop(prop_list) 84 | 85 | df_dict = tools.dict_from_sdf_list(mol_list, id_prop=id_prop, props=props, prop_list=prop_list) 86 | smiles_col = [Chem.MolToSmiles(mol) for mol in mol_list] 87 | df = pd.DataFrame(df_dict) 88 | if set_index and guessed_id: 89 | df = df.set_index(guessed_id) 90 | 91 | df["Smiles"] = pd.Series(data=smiles_col, index=df.index) 92 | PT.AddMoleculeColumnToFrame(df, smilesCol="Smiles", molCol='mol', includeFingerprints=False) 93 | df = df.drop("Smiles", axis=1) 94 | # move structure column to left side 95 | df = move_col(df, "mol", new_pos=0) 96 | 97 | return df 98 | 99 | 100 | def left_join_on_index(df1, df2): 101 | new_df = pd.merge(df1, df2, how="left", left_index=True, right_index=True) 102 | return new_df 103 | 104 | 105 | def inner_join_on_index(df1, df2): 106 | new_df = pd.merge(df1, df2, how="inner", left_index=True, right_index=True) 107 | return new_df 108 | 109 | 110 | def join_data_from_file(df, fn, dropna=True, gen_struct=True, remove_smiles=True): 111 | """Join data from file (e.g. Smiles, biol. data (tab-sep.) ) by index (set to Compound_Id) to df. 112 | If gen_struct is True and Smiles could be found in the resulting df, 113 | then the structure is generated and moved to first column. 114 | If remove_smiles is True, the Smiles column will be dropped. 115 | index in df has to be set to the Compound_Id. 
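The index join and the not-found bookkeeping used by `join_data_from_file` can be sketched with plain pandas (an in-memory `StringIO` stands in for the tab-separated data file; column names are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({"pIC50": [7.2, 6.5]},
                  index=pd.Index([1, 2], name="Compound_Id"))
# hypothetical tab-separated data file with only one matching Compound_Id
data = io.StringIO("Compound_Id\tMW\n1\t310.4\n")
joined = pd.merge(df, pd.read_table(data, index_col="Compound_Id"),
                  how="inner", left_index=True, right_index=True)
# ids that were in df but not in the data file
not_found = sorted(set(df.index) - set(joined.index))
```

The inner join keeps only compounds present in both sources; `not_found` reports the rest, mirroring the `* not found:` message.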
116 | returns new df""" 117 | 118 | # data_df = pd.concat([df, pd.read_table(fn, index_col=df.index.name)], axis=1, join_axes=[df.index]) 119 | data_df = inner_join_on_index(df, pd.read_table(fn, index_col=df.index.name)) 120 | 121 | not_found = list(set(df.index.tolist()) - set(data_df.index.tolist())) 122 | 123 | if dropna: 124 | data_df = data_df.dropna(axis=1, how="all") 125 | 126 | if gen_struct: 127 | smiles_col = None 128 | for col in list(data_df.columns.values): 129 | if "smiles" in col.lower(): 130 | smiles_col = col 131 | break 132 | 133 | if smiles_col: 134 | PT.AddMoleculeColumnToFrame(data_df, smilesCol=smiles_col, molCol='mol', includeFingerprints=False) 135 | 136 | if remove_smiles: 137 | data_df = data_df.drop(smiles_col, axis=1) 138 | 139 | data_df = move_col(data_df, "mol", new_pos=0) 140 | 141 | if not_found: 142 | print("* not found:", not_found) 143 | 144 | data_df = data_df.convert_objects(convert_numeric=True) # convert to numeric where possible 145 | 146 | return data_df 147 | 148 | 149 | def keep_numeric_only(df): 150 | """Keep only the numeric data in a df, 151 | remove all ROWS that contain non-numeric data. 152 | Do this prior to a highchart plot 153 | returns new df""" 154 | new_df = df.convert_objects(convert_numeric=True) 155 | new_df = new_df.dropna() 156 | return new_df 157 | 158 | 159 | def mol_list_from_df(df, mol_col="mol"): 160 | """ 161 | Creates a Mol_List from an RDKit Pandas dataframe. 162 | Returns Mol_List. 
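The property transfer from dataframe rows back to molecule objects can be shown without RDKit; `FakeMol` is a hypothetical stand-in for an RDKit Mol with a `SetProp` method:

```python
import pandas as pd

class FakeMol:
    """Hypothetical stand-in for an RDKit Mol that just stores its SD properties."""
    def __init__(self):
        self.props = {}

    def SetProp(self, name, value):
        self.props[name] = value

df = pd.DataFrame({"mol": [FakeMol(), FakeMol()], "pIC50": [7.2, 6.5]},
                  index=pd.Index([1, 2], name="Compound_Id"))

mol_list = []
for cid in df.index:
    mol = df.at[cid, "mol"]
    mol.SetProp(df.index.name, str(cid))          # index becomes the id property
    for prop in [c for c in df.columns if c != "mol"]:
        mol.SetProp(prop, str(df.at[cid, prop]))  # all other columns become properties
    mol_list.append(mol)
```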
163 | """ 164 | 165 | mol_list = tools.Mol_List() 166 | id_prop = df.index.name 167 | props = [k for k in df.keys().tolist() if k != mol_col] 168 | 169 | for cid in df.index.values.tolist(): 170 | mol = df.at[cid, mol_col] 171 | if not mol: 172 | continue 173 | mol.SetProp(id_prop, str(cid)) 174 | for prop in props: 175 | if df.at[cid, prop]: 176 | mol.SetProp(prop, str(df.at[cid, prop])) 177 | mol_list.append(mol) 178 | 179 | return mol_list 180 | 181 | 182 | def df_show_table(df): 183 | """ 184 | show df as mol_table 185 | """ 186 | pass 187 | 188 | 189 | def align_molecules(df, qry, mol_col="mol"): 190 | """ 191 | Align all molecules in a df to a given query molecule, 192 | e.g. after a substructure query. 193 | operates inline, returns nothing""" 194 | 195 | Chem.Compute2DCoords(qry) 196 | for mol in df[mol_col]: 197 | Chem.GenerateDepictionMatching2DStructure(mol, qry) 198 | 199 | 200 | def add_calc_prop(df, mol_col="mol", props="logp"): 201 | avail_props = ["logp", "mw", "sa"] 202 | if not isinstance(props, list): 203 | prop = list(props) 204 | for prop in props: 205 | if not prop in avail_props: 206 | raise ValueError("{} can not be calculated.".format(prop)) 207 | #TODO: fill stub 208 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/pipeline.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ######## 5 | Pipeline 6 | ######## 7 | 8 | *Created on 3-Feb-2016 by A. Pahl* 9 | 10 | A Pipelining Workflow using Python Generators, mainly for RDKit and large compound sets. 
11 | 
12 | Example use (converting a downloaded ChEMBL SDF to tab-separated Smiles):
13 | ```Python
14 | chembl_fn = "chembl_23.sdf.gz"
15 | keep_stereo = True
16 | fn_info = "stereo" if keep_stereo else "rac"
17 | s = p.Summary()
18 | rd = p.start_sdf_reader(chembl_fn, max_records=0, summary=s)
19 | res = p.pipe(rd,
20 |              (p.pipe_rename_prop, "chembl_id", "Chembl_Id"),
21 |              (p.pipe_keep_largest_fragment),
22 |              (p.pipe_mol_to_smiles, {"isomeric": keep_stereo}),
23 |              (p.stop_csv_writer, "chembl_23_{}.tsv".format(fn_info), {"summary": s})
24 |             )
25 | ```
26 | 
27 | The progress of the pipeline is displayed in the notebook.
28 | """
29 | 
30 | 
31 | # import sys
32 | # import os.path as op
33 | from copy import deepcopy
34 | import time
35 | from collections import OrderedDict, defaultdict
36 | import csv
37 | import gzip
38 | import pickle
39 | import base64 as b64
40 | import tempfile
41 | 
42 | import numpy as np
43 | try:
44 |     import pandas
45 |     from rdkit.Chem import PandasTools as PT
46 |     PANDAS = True
47 | except ImportError:
48 |     PANDAS = False
49 |     # The component stop_df_from_stream then checks for availability of pandas,
50 |     # so that users that do not use that component and do not have pandas installed
51 |     # do not get an import error
52 | 
53 | from rdkit.Chem import AllChem as Chem
54 | from rdkit.Chem import Draw
55 | import rdkit.Chem.Descriptors as Desc
56 | 
57 | # imports for similarity search
58 | from rdkit import DataStructs
59 | from rdkit.Chem.Fingerprints import FingerprintMols
60 | import rdkit.Chem.Scaffolds.MurckoScaffold as MurckoScaffold
61 | 
62 | try:
63 |     Draw.DrawingOptions.atomLabelFontFace = "DejaVu Sans"
64 |     Draw.DrawingOptions.atomLabelFontSize = 18
65 | except KeyError:  # Font "DejaVu Sans" is not available
66 |     pass
67 | 
68 | from .
import tools 69 | 70 | try: 71 | from rdkit.Avalon import pyAvalonTools as pyAv 72 | USE_AVALON = True 73 | except ImportError: 74 | USE_AVALON = False 75 | 76 | try: 77 | from Contrib.SA_Score import sascorer 78 | SASCORER = True 79 | except ImportError: 80 | print("* SA scorer not available. RDKit's Contrib dir needs to be in the Python import path...") 81 | SASCORER = False 82 | 83 | USE_FP = "morgan" # other options: "avalon", "default" 84 | 85 | try: 86 | # interactive IPython session 87 | _ = get_ipython() 88 | IPY = True 89 | except NameError: 90 | IPY = False 91 | 92 | if IPY: 93 | from IPython.core.display import HTML, display, clear_output 94 | 95 | 96 | def format_seconds(seconds): 97 | seconds = int(seconds) 98 | m, s = divmod(seconds, 60) 99 | h, m = divmod(m, 60) 100 | t_str = "{:02.0f}h {:02d}m {:02d}s".format(h, m, s) 101 | return t_str 102 | 103 | 104 | def get_value(str_val): 105 | if not str_val: 106 | return None 107 | if isinstance(str_val, list): 108 | print(str_val) 109 | 110 | try: 111 | val = float(str_val) 112 | if "." not in str_val: 113 | val = int(val) 114 | except ValueError: 115 | val = str_val 116 | 117 | return val 118 | 119 | 120 | class Summary(OrderedDict): 121 | """An OrderedDict-based class that keeps track of the time since its instantiation. 122 | Used for reporting running details of pipeline functions.""" 123 | 124 | def __init__(self, timeit=True, **kwargs): 125 | """Parameters: 126 | timeit: whether or not to use the timing functionality. Default: True""" 127 | super().__init__(**kwargs) 128 | self.timeit = timeit 129 | if self.timeit: 130 | self.t_start = time.time() 131 | 132 | 133 | def __html__(self, final=False): 134 | if final: 135 | pipe_status = "finished." 136 | else: 137 | pipe_status = "running..." 138 | 139 | outer = """{}
<tr><td colspan="2">Pipeline {}</td></tr>
<tr><td>Component</td><td># Records</td></tr>
""" 140 | rows = [] 141 | for k in self.keys(): 142 | value = self[k] 143 | row = """{}{}""".format(k, str(value)) 144 | rows.append(row) 145 | seconds = time.time() - self.t_start 146 | row = """Time elapsed{}""".format(format_seconds(seconds)) 147 | rows.append(row) 148 | return outer.format(pipe_status, "".join(rows)) 149 | 150 | 151 | def __str__(self): 152 | s_list = [] 153 | keys = self.keys() 154 | mlen = max(map(len, keys)) 155 | line_end = "\n" 156 | for idx, k in enumerate(keys, 1): 157 | value = self[k] 158 | if self.timeit and idx == len(keys): 159 | line_end = "" 160 | if type(value) == float: 161 | s_list.append("{k:{mlen}s}: {val:10.2f}".format(k=k, mlen=mlen, val=value)) 162 | s_list.append(line_end) 163 | else: 164 | s_list.append("{k:{mlen}s}: {val:>7}".format(k=k, mlen=mlen, val=value)) 165 | s_list.append(line_end) 166 | 167 | if self.timeit: 168 | seconds = time.time() - self.t_start 169 | s_list.append(" (time: {})".format(format_seconds(seconds))) 170 | 171 | return "".join(s_list) 172 | 173 | 174 | def __repr__(self): 175 | return self.__str__() 176 | 177 | 178 | def print(self): 179 | """print the content of a dict or Counter object in a formatted way""" 180 | print(self.__str__()) 181 | 182 | 183 | def update(self, final=False): 184 | if IPY: 185 | clear_output(wait=True) 186 | display(HTML(self.__html__(final))) 187 | else: 188 | print(self.__str__()) 189 | 190 | 191 | 192 | 193 | 194 | def pipe(val, *forms): 195 | """Inspired by the thread_first function of the `Toolz `_ project 196 | and adapted to also accept keyword arguments. Removed the discouraged reduce function. 197 | If functions of the pipeline nedd additional parameters, the function and 198 | the parameters have to be passed as tuples. 
Keyword arguments have to be 199 | passed as dicts in these tuples: 200 | 201 | >>> s = Summary() 202 | >>> rd = start_sdf_reader("test.sdf", summary=s) 203 | >>> pipe(rd, 204 | >>> pipe_keep_largest_fragment, 205 | >>> (pipe_neutralize_mol, {"summary": s}), 206 | >>> (pipe_keep_props, ["Ordernumber", "NP_Score"]), 207 | >>> (stop_csv_writer, "test.csv", {"summary": s}) 208 | >>> )""" 209 | 210 | def evalform_front(val, form): 211 | if callable(form): 212 | return form(val) 213 | if isinstance(form, tuple): 214 | args = [val] 215 | kwargs = {} 216 | func = form[0] 217 | for a in form[1:]: 218 | if isinstance(a, dict): 219 | kwargs.update(a) 220 | else: 221 | args.append(a) 222 | return func(*args, **kwargs) 223 | 224 | result = val 225 | for form in forms: 226 | result = evalform_front(result, form) 227 | 228 | return result 229 | 230 | 231 | def start_csv_reader(fn, max_records=0, tag=True, sep="\t", 232 | summary=None, comp_id="start_csv_reader"): 233 | """A reader for csv files. 234 | 235 | Returns: 236 | An iterator with the fields as dict 237 | 238 | Parameters: 239 | fn (str, list): filename or list of filenames. 240 | tag (bool): add the filename as a record when reading from more than one file. 241 | max_records (int): maximum number of records to read, 0 means all. 242 | summary (Summary): a Counter class to collect runtime statistics. 
243 | comp_id: (str): the component Id to use for the summary.""" 244 | 245 | if not isinstance(fn, list): 246 | fn = [fn] 247 | 248 | rec_counter = 0 249 | for filen in fn: 250 | if ".gz" in filen: 251 | f = gzip.open(filen, mode="rt") 252 | else: 253 | f = open(filen) 254 | 255 | if sep == "\t": 256 | dialect = "excel-tab" 257 | else: 258 | dialect = "excel" 259 | reader = csv.DictReader(f, dialect=dialect) 260 | prev_time = time.time() 261 | for row_dict in reader: 262 | rec_counter += 1 263 | if max_records > 0 and rec_counter > max_records: break 264 | # make a copy with non-empty values 265 | rec = {k: get_value(v) for k, v in row_dict.items() if v is not None and v != ""} # make a copy with non-empty values 266 | if len(fn) > 1 and tag: 267 | rec["tag"] = filen 268 | 269 | if summary is not None: 270 | summary[comp_id] = rec_counter 271 | curr_time = time.time() 272 | if curr_time - prev_time > 2.0: # write the log only every two seconds 273 | prev_time = curr_time 274 | print(summary, file=open("pipeline.log", "w")) 275 | summary.update() 276 | yield rec 277 | 278 | f.close() 279 | 280 | if summary: 281 | print(summary, file=open("pipeline.log", "w")) 282 | # print(summary) 283 | summary.update(final=True) 284 | 285 | 286 | 287 | def start_cache_reader(name, summary=None, comp_id="start_cache_reader"): 288 | fn = "/tmp/{}".format(name) 289 | start_csv_reader(fn, summary=None, comp_id=comp_id) 290 | 291 | 292 | def start_sdf_reader(fn, max_records=0, tag=True, summary=None, comp_id="start_sdf_reader"): 293 | """A reader for SD files. 294 | 295 | Returns: 296 | An iterator with the fields as dict, including the molecule in the "mol" key 297 | 298 | Parameters: 299 | fn (str, list): filename or list of filenames. 300 | max_records (int): maximum number of records to read, 0 means all. 301 | tag (bool): add the filename as a record when reading from more than one file. 302 | summary (Summary): a Counter class to collect runtime statistics. 
303 | comp_id: (str): the component Id to use for the summary.""" 304 | 305 | rec_counter = 0 306 | no_mol_counter = 0 307 | # also open lists of files 308 | if not isinstance(fn, list): 309 | fn = [fn] 310 | 311 | for filen in fn: 312 | if ".gz" in filen: 313 | f = gzip.open(filen, mode="rb") 314 | else: 315 | f = open(filen, "rb") 316 | 317 | reader = Chem.ForwardSDMolSupplier(f) 318 | prev_time = time.time() 319 | for mol in reader: 320 | if max_records > 0 and rec_counter > max_records: break 321 | rec = {} 322 | rec_counter += 1 323 | if mol: 324 | if len(fn) > 1 and tag: 325 | rec["tag"] = filen 326 | for prop in mol.GetPropNames(): 327 | val = mol.GetProp(prop) 328 | if len(val) > 0: # transfer only those properties to the stream which carry a value 329 | rec[prop] = get_value(val) 330 | mol.ClearProp(prop) 331 | 332 | rec["mol"] = mol 333 | 334 | if summary is not None: 335 | summary[comp_id] = rec_counter 336 | curr_time = time.time() 337 | if curr_time - prev_time > 2.0: # write the log only every two seconds 338 | prev_time = curr_time 339 | print(summary, file=open("pipeline.log", "w")) 340 | summary.update() 341 | 342 | yield rec 343 | 344 | else: 345 | no_mol_counter += 1 346 | if summary is not None: 347 | summary["{}_no_mol".format(comp_id)] = no_mol_counter 348 | 349 | f.close() 350 | 351 | if summary: 352 | print(summary, file=open("pipeline.log", "w")) 353 | summary.update() 354 | 355 | 356 | def start_stream_from_dict(d, summary=None, comp_id="start_stream_from_dict", show_first=False): 357 | """Provide a data stream from a dict.""" 358 | prev_time = time.time() 359 | d_keys = list(d.keys()) 360 | ln = len(d[d_keys[0]]) # length of the stream 361 | rec_counter = 0 362 | for idx in range(ln): 363 | rec = {} 364 | for k in d_keys: 365 | rec[k] = d[k][idx] 366 | 367 | rec_counter += 1 368 | if summary is not None: 369 | summary[comp_id] = rec_counter 370 | curr_time = time.time() 371 | if curr_time - prev_time > 2.0: # write the log only every two 
seconds 372 | prev_time = curr_time 373 | print(summary, file=open("pipeline.log", "w")) 374 | summary.update() 375 | 376 | if show_first and rec_counter == 1: 377 | print("{}:".format(comp_id), rec) 378 | 379 | yield rec 380 | 381 | if summary is not None: 382 | print(summary, file=open("pipeline.log", "w")) 383 | summary.update() 384 | 385 | 386 | def start_stream_from_mol_list(mol_list, summary=None, comp_id="start_stream_from_mol_list"): 387 | """Provide a data stream from a Mol_List.""" 388 | prev_time = time.time() 389 | rec_counter = 0 390 | for orig_mol in mol_list: 391 | if not orig_mol: continue 392 | mol = deepcopy(orig_mol) 393 | rec = {} 394 | props = mol.GetPropNames() 395 | for prop in props: 396 | val = get_value(mol.GetProp(prop)) 397 | mol.ClearProp(prop) 398 | if val is not None: 399 | rec[prop] = val 400 | 401 | rec["mol"] = mol 402 | 403 | rec_counter += 1 404 | if summary is not None: 405 | summary[comp_id] = rec_counter 406 | curr_time = time.time() 407 | if curr_time - prev_time > 2.0: # write the log only every two seconds 408 | prev_time = curr_time 409 | print(summary, file=open("pipeline.log", "w")) 410 | summary.update() 411 | 412 | yield rec 413 | 414 | if summary: 415 | print(summary, file=open("pipeline.log", "w")) 416 | summary.update() 417 | 418 | 419 | def stop_csv_writer(stream, fn, sep="\t", summary=None, comp_id="stop_csv_writer"): 420 | """Write CSV file from the incoming stream. 
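The two-pass strategy that `stop_csv_writer` uses (buffer rows first, emit the header only once all fields are known) can be sketched in miniature; `write_csv_sketch` is a hypothetical in-memory version that replaces the temp file with a list:

```python
import io

def write_csv_sketch(records, sep="\t"):
    # pass 1: buffer rows and grow the header as new fields appear
    fields, buf = {}, []  # dict keeps insertion order (Python 3.7+)
    for rec in records:
        cp = dict(rec)
        line = [str(cp.pop(key, "")) for key in fields]  # known columns first
        for key, val in cp.items():                      # late columns extend the header
            fields[key] = True
            line.append(str(val))
        buf.append(line)
    # pass 2: write the complete header and pad early, short rows
    out = io.StringIO()
    cols = list(fields)
    out.write(sep.join(cols) + "\n")
    for line in buf:
        line += [""] * (len(cols) - len(line))
        out.write(sep.join(line) + "\n")
    return out.getvalue()

csv_text = write_csv_sketch([{"a": 1}, {"a": 2, "b": 3}])
```

A field that first appears in a late record still ends up as a proper column, with earlier rows padded by empty cells.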
421 | 422 | Parameters: 423 | fn (str): filename 424 | summary (Summary): a Counter class to collect runtime statistics 425 | comp_id: (str): the component Id to use for the summary""" 426 | 427 | fields = OrderedDict() 428 | rec_counter = 0 429 | tmp = tempfile.TemporaryFile("w+") 430 | 431 | for rec in stream: 432 | if "mol" in rec: # molecule object can not be written to CSV 433 | rec.pop("mol") 434 | 435 | line = [] 436 | cp = rec.copy() 437 | 438 | # first write the records whose keys are already in fields: 439 | for key in fields: 440 | if key in cp: 441 | val = cp[key] 442 | if val is None: val = "" 443 | line.append(str(val)) 444 | cp.pop(key) 445 | else: 446 | line.append("") 447 | 448 | # now collect the additional records (and add them to fields) 449 | for key in cp: 450 | fields[key] = 0 # dummy value 451 | val = cp[key] 452 | if val is None: val = "" 453 | line.append(str(val)) 454 | 455 | tmp.write("\t".join(line) + "\n") 456 | rec_counter += 1 457 | if summary is not None: 458 | summary[comp_id] = rec_counter 459 | 460 | f = open(fn, "w") 461 | num_columns = len(fields) 462 | first_line = True 463 | tmp.seek(0) 464 | for line_str in tmp: 465 | if first_line: # write the final header 466 | first_line = False 467 | line = list(fields.keys()) 468 | f.write(sep.join(line) + "\n") 469 | 470 | line = line_str.rstrip("\n").split("\t") 471 | 472 | # fill up all lines with empty records to the number of columns 473 | num_fill_records = num_columns - len(line) 474 | fill = [""] * num_fill_records 475 | line.extend(fill) 476 | 477 | 478 | f.write(sep.join(line) + "\n") 479 | 480 | f.close() 481 | tmp.close() 482 | if not IPY: 483 | print("* {}: {} records written.".format(comp_id, rec_counter)) 484 | 485 | 486 | def stop_sdf_writer(stream, fn, max=500, summary=None, comp_id="stop_sdf_writer"): 487 | """Write records in stream as SD File.""" 488 | 489 | rec_counter = 0 490 | no_mol_counter = 0 491 | writer = Chem.SDWriter(fn) 492 | 493 | for rec in stream: 494 | 
if "mol" not in rec: 495 | no_mol_counter += 1 496 | if summary is not None: 497 | summary["{}_no_mol".format(comp_id)] = no_mol_counter 498 | return 499 | 500 | rec_counter += 1 501 | if rec_counter > max: 502 | continue # Let the pipe run to completion for the Summary 503 | 504 | mol = rec["mol"] 505 | check_2d_coords(mol) 506 | 507 | # assign the values from rec to the mol object 508 | for key in rec: 509 | if key == "mol": continue 510 | val = rec[key] 511 | if isinstance(val, str): 512 | if val != "": 513 | mol.SetProp(key, val) 514 | else: 515 | mol.SetProp(key, str(val)) 516 | 517 | if summary is not None: 518 | summary[comp_id] = rec_counter 519 | 520 | writer.write(mol) 521 | 522 | if rec_counter >= max: break 523 | 524 | writer.close() 525 | 526 | 527 | def stop_mol_list_from_stream(stream, max=250, summary=None, comp_id="stop_mol_list_from_stream"): 528 | """Creates a Mol_list from the records in the stream. Stops the pipeline stream.""" 529 | rec_counter = 0 530 | mol_list = tools.Mol_List() 531 | 532 | for rec in stream: 533 | if "mol" not in rec: continue 534 | 535 | rec_counter += 1 536 | if rec_counter > max: 537 | continue # Let the pipe run to completion for the Summary 538 | 539 | mol = rec["mol"] 540 | 541 | try: 542 | mol.GetConformer() 543 | except ValueError: # no 2D coords... 
calculate them 544 | mol.Compute2DCoords() 545 | 546 | # assign the values from rec to the mol object 547 | for key in rec: 548 | if key == "mol": continue 549 | val = rec[key] 550 | if isinstance(val, str): 551 | if val != "": 552 | mol.SetProp(key, val) 553 | else: 554 | mol.SetProp(key, str(val)) 555 | 556 | if summary is not None: 557 | summary[comp_id] = rec_counter 558 | 559 | mol_list.append(mol) 560 | 561 | 562 | return mol_list 563 | 564 | 565 | def stop_dict_from_stream(stream, summary=None, comp_id="stop_dict_from_stream"): 566 | """Generates a dict out of the stream""" 567 | rec_counter = 0 568 | for rec in stream: 569 | rec_counter += 1 570 | if rec_counter == 1: 571 | stream_dict = {k: [] for k in rec} 572 | stream_keys = set(stream_dict.keys()) 573 | 574 | for field in rec: 575 | if field in stream_keys: 576 | stream_dict[field].append(rec[field]) 577 | else: # this field was not in the records until now 578 | stream_dict[field] = rec_counter * [np.nan] 579 | stream_keys.add(field) 580 | 581 | empty_fields = stream_keys - set(rec.keys()) # handle fields which are in the stream, 582 | for field in empty_fields: # but not in this record 583 | stream_dict[field].append(np.nan) 584 | 585 | if summary is not None: 586 | summary[comp_id] = rec_counter 587 | 588 | return stream_dict 589 | 590 | 591 | def stop_df_from_stream(stream, summary=None, comp_id="stop_df_from_stream"): 592 | """Generates a Pandas DataFrame out of the data stream. 593 | The molecules need to be present in the stream, 594 | e.g. 
generated by `pipe_mol_from_smiles`.""" 595 | 596 | if not PANDAS: 597 | raise ImportError("pandas is not available.") 598 | PT.RenderImagesInAllDataFrames(images=True) 599 | df = pandas.DataFrame.from_dict(stop_dict_from_stream(stream, summary=summary, comp_id=comp_id)) 600 | return df 601 | 602 | 603 | def stop_count_records(stream, summary=None, comp_id="stop_count_records"): 604 | """Only count the records from the incoming stream.""" 605 | rec_counter = 0 606 | 607 | for rec in stream: 608 | 609 | rec_counter += 1 610 | if summary is not None: 611 | summary[comp_id] = rec_counter 612 | 613 | return rec_counter 614 | 615 | 616 | def stop_cache_writer(stream, name, summary=None, comp_id="stop_cache_writer"): 617 | """Write records in stream as cache.""" 618 | 619 | fn = "/tmp/{}".format(name) 620 | 621 | stop_csv_writer(stream, fn, summary=summary, comp_id=comp_id) 622 | 623 | 624 | def pipe_mol_from_smiles(stream, in_smiles="Smiles", remove=True, summary=None, comp_id="pipe_mol_from_smiles"): 625 | """Generate a molecule on the stream from Smiles.""" 626 | rec_counter = 0 627 | for rec in stream: 628 | if in_smiles in rec: 629 | mol = Chem.MolFromSmiles(rec[in_smiles]) 630 | if remove: 631 | rec.pop(in_smiles) 632 | 633 | if mol: 634 | rec_counter += 1 635 | if summary is not None: 636 | summary[comp_id] = rec_counter 637 | 638 | rec["mol"] = mol 639 | yield rec 640 | 641 | 642 | 643 | def pipe_mol_from_b64(stream, in_b64="Mol_b64", remove=True, summary=None, comp_id="pipe_mol_from_b64"): 644 | """Generate a molecule on the stream from a b64 encoded mol object.""" 645 | rec_counter = 0 646 | for rec in stream: 647 | if in_b64 in rec: 648 | mol = pickle.loads(b64.b64decode(rec[in_b64])) 649 | if remove: 650 | rec.pop(in_b64) 651 | 652 | if mol: 653 | rec_counter += 1 654 | if summary is not None: 655 | summary[comp_id] = rec_counter 656 | 657 | rec["mol"] = mol 658 | yield rec 659 | 660 | 661 | def start_mol_csv_reader(fn, max_records=0, in_b64="Mol_b64", 
tag=True, sep="\t", summary=None, comp_id="start_mol_csv_reader"): 662 | """A reader for csv files containing molecules in binary b64 format. 663 | 664 | Returns: 665 | An iterator with the fields and the molecule as dict 666 | 667 | Parameters: 668 | fn (str): filename. 669 | tag (bool): add the filename as a record when reading from more than one file. 670 | max_records (int): maximum number of records to read, 0 means all. 671 | summary (Summary): a Counter class to collect runtime statistics. 672 | comp_id (str): the component Id to use for the summary.""" 673 | 674 | rd = start_csv_reader(fn, max_records, tag, sep, summary, comp_id) 675 | mol = pipe_mol_from_b64(rd, in_b64) 676 | 677 | return mol 678 | 679 | 680 | def pipe_mol_to_smiles(stream, out_smiles="Smiles", isomeric=False, summary=None, comp_id="pipe_mol_to_smiles"): 681 | """Calculate Smiles from the mol object on the stream.""" 682 | rec_counter = 0 683 | for rec in stream: 684 | if "mol" in rec: 685 | rec[out_smiles] = Chem.MolToSmiles(rec["mol"], isomericSmiles=isomeric) 686 | rec_counter += 1 687 | if summary is not None: 688 | summary[comp_id] = rec_counter 689 | yield rec 690 | 691 | 692 | def pipe_mol_to_b64(stream, out_b64="Mol_b64", summary=None, comp_id="pipe_mol_to_b64"): 693 | rec_counter = 0 694 | for rec in stream: 695 | if "mol" in rec: 696 | mol = rec["mol"] 697 | if mol: 698 | rec[out_b64] = b64.b64encode(pickle.dumps(mol)).decode() 699 | rec_counter += 1 700 | if summary is not None: 701 | summary[comp_id] = rec_counter 702 | 703 | yield rec 704 | 705 | 706 | def check_2d_coords(mol, force=False): 707 | """Check if a mol has 2D coordinates and if not, calculate them.""" 708 | if not force: 709 | try: 710 | mol.GetConformer() 711 | except ValueError: 712 | force = True  # no 2D coords...
calculate them 713 | 714 | if force: 715 | if USE_AVALON: 716 | pyAv.Generate2DCoords(mol) 717 | else: 718 | mol.Compute2DCoords() 719 | 720 | 721 | def pipe_calc_ic50(stream, prop_pic50, prop_ic50=None, unit="uM", digits=3, 722 | summary=None, comp_id="pipe_calc_ic50"): 723 | """Calculates the IC50 from a pIC50 value that has to be present in the record. 724 | Parameters: 725 | prop_pic50 (string): the name of the pIC50 prop from which to calc the IC50. 726 | prop_ic50 (string): the name of the calculated IC50. 727 | digits (int): number of decimal digits to use.""" 728 | 729 | rec_counter = 0 730 | if prop_ic50 is None: 731 | pos = prop_pic50.rfind("_") 732 | if pos > 0: 733 | bn = prop_pic50[:pos] 734 | else: 735 | bn = prop_pic50 736 | 737 | prop_ic50 = "{}_(IC50_{})".format(bn, unit) 738 | 739 | for rec in stream: 740 | rec_counter += 1 741 | if prop_pic50 in rec: 742 | ic50 = tools.ic50(rec[prop_pic50], unit) 743 | rec[prop_ic50] = ic50 744 | 745 | if summary is not None: 746 | summary[comp_id] = rec_counter 747 | 748 | yield rec 749 | 750 | 751 | def pipe_calc_props(stream, props, force2d=False, summary=None, comp_id="pipe_calc_props"): 752 | """Calculate properties for the molecules on the stream. 753 | props can be a single property or a list of properties.
754 | 755 | Calculable properties: 756 | 2d, date, formula, hba, hbd, logp, molid, mw, smiles, rotb, sa (synthetic accessibility), tpsa 757 | 758 | Synthetic Accessibility (normalized): 759 | 0: hard to synthesize; 1: easy access 760 | 761 | as described in: 762 | | Estimation of Synthetic Accessibility Score of Drug-like Molecules based on Molecular Complexity and Fragment Contributions 763 | | *Peter Ertl and Ansgar Schuffenhauer* 764 | | Journal of Cheminformatics 1:8 (2009) (`link `_) 765 | """ 766 | 767 | rec_counter = 0 768 | if not isinstance(props, list): 769 | props = [props] 770 | 771 | # make all props lower-case: 772 | props = list(map(lambda x: x.lower(), props)) 773 | 774 | for rec in stream: 775 | if "mol" in rec: 776 | mol = rec["mol"] 777 | if "2d" in props: 778 | check_2d_coords(mol, force2d) 779 | 780 | if "date" in props: 781 | rec["Date"] = time.strftime("%Y%m%d") 782 | 783 | if "molid" in props: 784 | rec["Compound_Id"] = rec_counter 785 | 786 | if "formula" in props: 787 | rec["Formula"] = Chem.CalcMolFormula(mol) 788 | 789 | if "hba" in props: 790 | rec["HBA"] = Desc.NOCount(mol) 791 | 792 | if "hbd" in props: 793 | rec["HBD"] = Desc.NHOHCount(mol) 794 | 795 | if "logp" in props: 796 | rec["LogP"] = np.round(Desc.MolLogP(mol), 3) 797 | 798 | if "mw" in props: 799 | rec["MW"] = np.round(Desc.MolWt(mol), 3) 800 | 801 | if "rotb" in props: 802 | rec["RotB"] = Desc.NumRotatableBonds(mol) 803 | 804 | if "smiles" in props: 805 | rec["Smiles"] = Chem.MolToSmiles(mol) 806 | 807 | if SASCORER and "sa" in props: 808 | score = sascorer.calculateScore(mol) 809 | norm_score = 1 - (score / 10) 810 | rec["SA"] = np.round(norm_score, 3) 811 | 812 | if "tpsa" in props: 813 | rec["TPSA"] = int(Desc.TPSA(mol)) 814 | 815 | rec_counter += 1 816 | if summary is not None: 817 | summary[comp_id] = rec_counter 818 | 819 | yield rec 820 | 821 | 822 | def pipe_custom_filter(stream, run_code, start_code=None, summary=None, comp_id="pipe_custom_filter"): 823 | """If the 
evaluation of run_code is true, the respective record will be put on the stream.""" 824 | rec_counter = 0 825 | if start_code is not None: 826 | exec(start_code) 827 | 828 | # pre-compile the run_code statement for performance reasons 829 | byte_code = compile(run_code, '', 'eval') 830 | for rec in stream: 831 | if eval(byte_code): 832 | rec_counter += 1 833 | if summary is not None: 834 | summary[comp_id] = rec_counter 835 | 836 | yield rec 837 | 838 | 839 | def pipe_custom_man(stream, run_code, start_code=None, stop_code=None, comp_id="pipe_custom_man"): 840 | """Execute run_code for every record on the stream (custom manipulation component).""" 841 | if start_code is not None: 842 | exec(start_code) 843 | 844 | byte_code = compile(run_code, '', 'exec') 845 | for rec in stream: 846 | exec(byte_code) 847 | 848 | yield rec 849 | 850 | if stop_code: 851 | exec(stop_code) 852 | 853 | 854 | def pipe_has_prop_filter(stream, prop, invert=False, summary=None, comp_id="pipe_has_prop_filter"): 855 | rec_counter = 0 856 | 857 | for rec in stream: 858 | 859 | hit = prop in rec 860 | 861 | if invert: 862 | # reverse logic 863 | hit = not hit 864 | 865 | if hit: 866 | rec_counter += 1 867 | 868 | if summary is not None: 869 | summary[comp_id] = rec_counter 870 | 871 | yield rec 872 | 873 | 874 | def pipe_id_filter(stream, cpd_ids, id_prop="Compound_Id", summary=None, comp_id="pipe_id_filter"): 875 | rec_counter = 0 876 | if not isinstance(cpd_ids, list): 877 | cpd_ids = [cpd_ids] 878 | 879 | cpd_ids = {c_id: 0 for c_id in cpd_ids} 880 | 881 | for rec in stream: 882 | if id_prop not in rec: continue 883 | 884 | if rec[id_prop] in cpd_ids: 885 | rec_counter += 1 886 | 887 | if summary is not None: 888 | summary[comp_id] = rec_counter 889 | 890 | yield rec 891 | 892 | 893 | def pipe_mol_filter(stream, query, smarts=False, invert=False, add_h=False, summary=None, comp_id="pipe_mol_filter"): 894 | rec_counter = 0 895 | if "[H]" in query or "#1" in query: 896 | add_h = True
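As an aside, `pipe_custom_filter` above shows the pattern these components share: a plain generator function that compiles the user expression once and then `eval()`s it per record. A minimal self-contained sketch of that eval-filter pattern (the record field `"MW"` is made-up example data, not part of the library):

```python
# Sketch of the eval-based filter pattern used by pipe_custom_filter above.
# The field name "MW" is hypothetical example data.

def custom_filter(stream, run_code):
    """Yield only records for which the expression `run_code` is truthy.

    The current record is visible to the expression as `rec`."""
    # pre-compile once so the expression is not re-parsed for every record
    byte_code = compile(run_code, "<custom_filter>", "eval")
    for rec in stream:
        if eval(byte_code):
            yield rec

records = [{"MW": 230.5}, {"MW": 512.1}, {"MW": 180.0}]
small = list(custom_filter(iter(records), 'rec["MW"] < 300'))
```

Because each component only holds one record at a time, chains of such generators can stream files of arbitrary size with constant memory.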
897 | 898 | if add_h or "#6" in query or "#7" in query: 899 | smarts = True 900 | 901 | query_mol = Chem.MolFromSmarts(query) if smarts else Chem.MolFromSmiles(query) 902 | if not query_mol: 903 | print("* {} ERROR: could not generate query mol.".format(comp_id)) 904 | return None 905 | 906 | for rec in stream: 907 | if "mol" not in rec: continue 908 | 909 | mol = rec["mol"] 910 | 911 | hit = False 912 | if add_h: 913 | mol_with_h = Chem.AddHs(mol) 914 | if mol_with_h.HasSubstructMatch(query_mol): 915 | hit = True 916 | 917 | else: 918 | if mol.HasSubstructMatch(query_mol): 919 | hit = True 920 | 921 | if invert: 922 | # reverse logic 923 | hit = not hit 924 | 925 | if hit: 926 | rec_counter += 1 927 | 928 | if summary is not None: 929 | summary[comp_id] = rec_counter 930 | 931 | yield rec 932 | 933 | 934 | def pipe_sim_filter(stream, query, cutoff=80, summary=None, comp_id="pipe_sim_filter"): 935 | """Filter for compounds that have a similarity greater than or equal 936 | to `cutoff` (in percent) to the `query` Smiles. 937 | If the field `FP_b64` (e.g.
pre-calculated) is present, this will be used, 938 | otherwise the fingerprint of the Murcko scaffold will be generated on-the-fly (much slower).""" 939 | rec_counter = 0 940 | 941 | query_mol = Chem.MolFromSmiles(query) 942 | if not query_mol: 943 | print("* {} ERROR: could not generate query from SMILES.".format(comp_id)) 944 | return None 945 | 946 | murcko_mol = MurckoScaffold.GetScaffoldForMol(query_mol) 947 | if USE_FP == "morgan": 948 | query_fp = Desc.rdMolDescriptors.GetMorganFingerprintAsBitVect(murcko_mol, 2) 949 | elif USE_FP == "avalon": 950 | query_fp = pyAv.GetAvalonFP(murcko_mol, 1024) 951 | else: 952 | query_fp = FingerprintMols.FingerprintMol(murcko_mol) 953 | 954 | for rec in stream: 955 | if "mol" not in rec: continue 956 | 957 | if "FP_b64" in rec: # use the pre-defined fingerprint if it is present in the stream 958 | mol_fp = pickle.loads(b64.b64decode(rec["FP_b64"])) 959 | else: 960 | murcko_mol = MurckoScaffold.GetScaffoldForMol(rec["mol"]) 961 | if USE_FP == "morgan": 962 | mol_fp = Desc.rdMolDescriptors.GetMorganFingerprintAsBitVect(murcko_mol, 2) 963 | elif USE_FP == "avalon": 964 | mol_fp = pyAv.GetAvalonFP(murcko_mol, 1024) 965 | else: 966 | mol_fp = FingerprintMols.FingerprintMol(murcko_mol) 967 | 968 | sim = DataStructs.FingerprintSimilarity(query_fp, mol_fp) 969 | if sim * 100 >= cutoff: 970 | rec_counter += 1 971 | rec["Sim"] = np.round(sim * 100, 2) 972 | 973 | if summary is not None: 974 | summary[comp_id] = rec_counter 975 | 976 | yield rec 977 | 978 | 979 | def pipe_remove_props(stream, props, summary=None, comp_id="pipe_remove_props"): 980 | """Remove properties from the stream. 
981 | props can be a single property name or a list of property names.""" 982 | 983 | if not isinstance(props, list): 984 | props = [props] 985 | 986 | rec_counter = 0 987 | for rec_counter, rec in enumerate(stream, 1): 988 | for prop in props: 989 | if prop in rec: 990 | rec.pop(prop) 991 | 992 | if summary is not None: 993 | summary[comp_id] = rec_counter 994 | 995 | yield rec 996 | 997 | 998 | def pipe_keep_props(stream, props, summary=None, comp_id="pipe_keep_props", show_first=False): 999 | """Keep only the listed properties on the stream. "mol" is always kept by this component. 1000 | props can be a single property name or a list of property names. 1001 | show_first prints the first records for debugging purposes.""" 1002 | 1003 | if not isinstance(props, list): 1004 | props = [props] 1005 | 1006 | if "mol" not in props: 1007 | props.append("mol") 1008 | 1009 | for rec_counter, rec in enumerate(stream, 1): 1010 | for prop in rec.copy().keys(): 1011 | if prop not in props: 1012 | rec.pop(prop) 1013 | 1014 | if summary is not None: 1015 | summary[comp_id] = rec_counter 1016 | 1017 | if show_first and rec_counter == 1: 1018 | print("{}:".format(comp_id), rec) 1019 | 1020 | yield rec 1021 | 1022 | 1023 | def pipe_do_nothing(stream, *args, **kwargs): 1024 | """A stub component that does nothing.""" 1025 | 1026 | for rec in stream: 1027 | yield rec 1028 | 1029 | 1030 | def pipe_sleep(stream, duration): 1031 | """Another stub component that slows down the pipeline 1032 | by `duration` seconds for demonstration purposes.""" 1033 | 1034 | for rec in stream: 1035 | time.sleep(duration) 1036 | yield rec 1037 | 1038 | 1039 | def pipe_rename_prop(stream, prop_old, prop_new, summary=None, comp_id="pipe_rename_prop"): 1040 | """Rename a property on the stream.
1041 | Parameters: 1042 | prop_old (str): old name of the property 1043 | prop_new (str): new name of the property""" 1044 | 1045 | rec_counter = 0 1046 | for rec_counter, rec in enumerate(stream, 1): 1047 | if prop_old in rec: 1048 | rec[prop_new] = rec[prop_old] 1049 | rec.pop(prop_old) 1050 | 1051 | if summary is not None: 1052 | summary[comp_id] = rec_counter 1053 | 1054 | yield rec 1055 | 1056 | 1057 | def pipe_join_data_from_file(stream, fn, join_on, behaviour="joined_only", append=True, 1058 | summary=None, comp_id="pipe_join_data_from_file", show_first=False): 1059 | """Joins data from a csv or SD file. 1060 | CAUTION: The input stream will be held in memory by this component! 1061 | 1062 | Parameters: 1063 | stream (dict iterator): stream of input compounds. 1064 | fn (str): name of the file (type is determined by having "sdf" in the name or not). 1065 | join_on (str): property to join on. 1066 | behaviour (str): 1067 | "joined_only": only put those records on the stream on which data was joined (default). 1068 | "keep_all": put all input records on the stream again, including those on which no data was joined. 1069 | append (bool): if True (default), new values will be appended to existing fields 1070 | on the stream, forming a list. 1071 | This list has to be merged with the `pipe_merge_data` component.
1072 | If False, existing values are kept.""" 1073 | 1074 | # collect the records from the stream in a list, store the position of the join_on properties in a dict 1075 | stream_rec_list = [] 1076 | stream_id_list = [] # list to hold the join_on properties and their positions in the stream_rec_list 1077 | prev_time = time.time() 1078 | 1079 | stream_counter = -1 1080 | for rec in stream: 1081 | stream_join_on_val = rec.get(join_on, False) 1082 | if stream_join_on_val is False: continue 1083 | stream_counter += 1 1084 | stream_rec_list.append(rec) 1085 | stream_id_list.append(stream_join_on_val) 1086 | 1087 | if "sdf" in fn: 1088 | rd = start_sdf_reader(fn) 1089 | else: 1090 | rd = start_csv_reader(fn) 1091 | 1092 | rec_counter = 0 1093 | for rec in rd: 1094 | rec_join_on_val = rec.get(join_on, False) 1095 | if not rec_join_on_val: continue 1096 | 1097 | while rec_join_on_val in stream_id_list: 1098 | rec_copy = deepcopy(rec) 1099 | stream_join_on_idx = stream_id_list.index(rec_join_on_val) 1100 | stream_id_list.pop(stream_join_on_idx) 1101 | stream_rec = stream_rec_list.pop(stream_join_on_idx) 1102 | 1103 | for k in stream_rec: 1104 | if k != join_on: 1105 | if append and k in rec_copy: 1106 | val = rec_copy[k] 1107 | if not isinstance(val, list): 1108 | val = [val] 1109 | val.append(stream_rec[k]) 1110 | rec_copy[k] = val 1111 | else: 1112 | rec_copy[k] = stream_rec[k] 1113 | 1114 | rec_counter += 1 1115 | if summary is not None: 1116 | summary[comp_id] = rec_counter 1117 | curr_time = time.time() 1118 | if curr_time - prev_time > 2.0: # write the log only every two seconds 1119 | prev_time = curr_time 1120 | print(summary, file=open("pipeline.log", "w")) 1121 | summary.update() 1122 | 1123 | if show_first and rec_counter == 1: 1124 | print("{}:".format(comp_id), rec) 1125 | 1126 | yield rec_copy 1127 | 1128 | # with behaviour="keep_all", now add the records to the stream on which no data was joined. 
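The append branch above is what produces the value lists that `pipe_merge_data` later reduces: when a key exists on both the stream record and the file record, the two values are collected into a list. A standalone sketch of that join rule (field names are hypothetical example data; no file I/O):

```python
# Sketch of the append-on-collision join rule used above: when a key is
# present in both records, the values are gathered into a list for a later
# merge step. Field names below are hypothetical example data.

def join_values(file_rec, stream_rec, join_on, append=True):
    out = dict(file_rec)
    for k, v in stream_rec.items():
        if k == join_on:
            continue
        if append and k in out:
            val = out[k]
            if not isinstance(val, list):
                val = [val]   # promote the scalar to a list first
            val.append(v)
            out[k] = val
        else:
            out[k] = v
    return out

joined = join_values({"Id": 1, "pIC50": 6.5},
                     {"Id": 1, "pIC50": 7.1, "Source": "assay2"}, "Id")
```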
1129 | if "all" in behaviour.lower(): 1130 | for rec in stream_rec_list: 1131 | 1132 | rec_counter += 1 1133 | if summary is not None: 1134 | summary[comp_id] = rec_counter 1135 | curr_time = time.time() 1136 | if curr_time - prev_time > 2.0: # write the log only every two seconds 1137 | prev_time = curr_time 1138 | print(summary, file=open("pipeline.log", "w")) 1139 | summary.update() 1140 | 1141 | yield rec 1142 | 1143 | if summary: 1144 | print(summary, file=open("pipeline.log", "w")) 1145 | summary.update() 1146 | 1147 | 1148 | def pipe_keep_largest_fragment(stream, summary=None, comp_id="pipe_keep_largest_frag"): 1149 | rec_counter = 0 1150 | frag_counter = 0 1151 | for rec in stream: 1152 | if "mol" not in rec: continue 1153 | mol = rec["mol"] 1154 | if not mol: continue 1155 | 1156 | mols = Chem.GetMolFrags(mol, asMols=True) 1157 | if len(mols) > 1: 1158 | frag_counter += 1 1159 | mols = sorted(mols, key=Desc.HeavyAtomCount, reverse=True) 1160 | if summary is not None: 1161 | summary["{}_has_frags".format(comp_id)] = frag_counter 1162 | 1163 | mol = mols[0] 1164 | 1165 | rec["mol"] = mol 1166 | 1167 | rec_counter += 1 1168 | if summary is not None: 1169 | summary[comp_id] = rec_counter 1170 | 1171 | yield rec 1172 | 1173 | 1174 | def pipe_neutralize_mol(stream, summary=None, comp_id="pipe_neutralize_mol"): 1175 | pattern = ( 1176 | # Imidazoles 1177 | ('[n+;H]', 'n'), 1178 | # Amines 1179 | ('[N+;!H0]', 'N'), 1180 | # Carboxylic acids and alcohols 1181 | ('[$([O-]);!$([O-][#7])]', 'O'), 1182 | # Thiols 1183 | ('[S-;X1]', 'S'), 1184 | # Sulfonamides 1185 | ('[$([N-;X2]S(=O)=O)]', 'N'), 1186 | # Enamines 1187 | ('[$([N-;X2][C,N]=C)]', 'N'), 1188 | # Tetrazoles 1189 | ('[n-]', '[nH]'), 1190 | # Sulfoxides 1191 | ('[$([S-]=O)]', 'S'), 1192 | # Amides 1193 | ('[$([N-]C=O)]', 'N'), 1194 | ) 1195 | 1196 | reactions = [(Chem.MolFromSmarts(x), Chem.MolFromSmiles(y, False)) for x, y in pattern] 1197 | 1198 | rec_counter = 0 1199 | neutr_counter = 0 1200 | for rec in 
stream: 1201 | if "mol" not in rec: continue 1202 | mol = rec["mol"] 1203 | if not mol: continue 1204 | 1205 | replaced = False 1206 | for reactant, product in reactions: 1207 | while mol.HasSubstructMatch(reactant): 1208 | replaced = True 1209 | rms = Chem.ReplaceSubstructs(mol, reactant, product) 1210 | mol = rms[0] 1211 | 1212 | if replaced: 1213 | Chem.SanitizeMol(mol) 1214 | mol.Compute2DCoords() 1215 | 1216 | rec_counter += 1 1217 | if summary is not None: 1218 | summary[comp_id] = rec_counter 1219 | if replaced: 1220 | neutr_counter += 1 1221 | summary["{}_neutralized".format(comp_id)] = neutr_counter 1222 | 1223 | rec["mol"] = mol 1224 | 1225 | yield rec 1226 | 1227 | 1228 | def pipe_inspect_stream(stream, fn="pipe_inspect.txt", exclude=None, summary=None): 1229 | """Write records from the stream into the file `fn` every two seconds. 1230 | Do not write records from the exclude list.""" 1231 | prev_time = time.time() 1232 | if exclude is not None: 1233 | if not isinstance(exclude, list): 1234 | exclude = [exclude] 1235 | 1236 | for rec in stream: 1237 | if exclude is not None: 1238 | rec = deepcopy(rec) 1239 | for prop in exclude: 1240 | rec.pop(prop, None) 1241 | 1242 | curr_time = time.time() 1243 | if curr_time - prev_time > 2.0: # write the log only every two seconds 1244 | prev_time = curr_time 1245 | if summary is not None: 1246 | print(summary, "\n\n", rec, file=open(fn, "w")) 1247 | else: 1248 | print(rec, file=open(fn, "w")) 1249 | 1250 | yield rec 1251 | 1252 | 1253 | def pipe_merge_data(stream, merge_on, str_props="concat", num_props="mean", mark=True, digits=3, summary=None, comp_id="pipe_merge_data"): 1254 | """Merge the data from the stream on the `merge_on` property. 1255 | WARNING: The stream is collected in memory by this component! 1256 | 1257 | Parameters: 1258 | merge_on (str): Name of the property (key) to merge on. 1259 | mark (bool): if true, merged records will be marked with a `Merged=num_of_merged_records` field. 
1260 | str_props (str): Merge behaviour for string properties. 1261 | Allowed values are: concat ("; "-separated concatenation), 1262 | unique ("; "-separated concatenation of the unique values), 1263 | keep_first, keep_last. 1264 | num_props (str): Merge behaviour for numerical values. 1265 | Allowed values are: mean, median, keep_first, keep_last. 1266 | digits (int): The number of decimal digits for the merged numerical props 1267 | (mean or median).""" 1268 | 1269 | def _get_merged_val_from_val_list(val_list, str_props, num_props): 1270 | if isinstance(val_list[0], str): 1271 | if "concat" in str_props: 1272 | return "; ".join(val_list), None, None 1273 | if "unique" in str_props: 1274 | return "; ".join(set(val_list)), None, None 1275 | if "first" in str_props: 1276 | return val_list[0], None, None 1277 | if "last" in str_props: 1278 | return val_list[-1], None, None 1279 | 1280 | return val_list[0], None, None 1281 | 1282 | elif isinstance(val_list[0], float) or isinstance(val_list[0], int): 1283 | if "mean" in num_props: 1284 | val = np.mean(val_list) 1285 | return (np.round(val, digits), "Std", # Standard deviation 1286 | np.round(np.std(val_list), digits)) 1287 | if "median" in num_props: 1288 | val = np.median(val_list) 1289 | return (np.round(val, digits), "MAD", # Median Absolute Deviation 1290 | np.round(np.median([abs(x - val) for x in val_list]), digits)) 1291 | if "first" in num_props: 1292 | return val_list[0], None, None 1293 | if "last" in num_props: 1294 | return val_list[-1], None, None 1295 | 1296 | return val_list[0], None, None 1297 | 1298 | else: 1299 | return val_list[0], None, None 1300 | 1301 | 1302 | merged = defaultdict(lambda: defaultdict(list)) # defaultdict of defaultdict(list) 1303 | if summary is not None: 1304 | summary[comp_id] = "collecting..." 
1305 | 1306 | for rec in stream: 1307 | if merge_on not in rec: continue 1308 | 1309 | merge_on_val = rec.pop(merge_on) 1310 | for prop in rec.keys(): 1311 | val = rec[prop] 1312 | if isinstance(val, list):  # from a pipe_join operation with append == True 1313 | merged[merge_on_val][prop].extend(val) 1314 | else: 1315 | merged[merge_on_val][prop].append(val) 1316 | 1317 | rec_counter = 0 1318 | prev_time = time.time() 1319 | 1320 | for item in merged: 1321 | rec = {merge_on: item} 1322 | for prop in merged[item]: 1323 | val_list = merged[item][prop] 1324 | if len(val_list) > 1: 1325 | merge_result = _get_merged_val_from_val_list(val_list, str_props, num_props) 1326 | rec[prop] = merge_result[0] 1327 | if merge_result[1] is not None:  # deviation values from mean or median 1328 | rec["{}_{}".format(prop, merge_result[1])] = merge_result[2] 1329 | if mark: 1330 | rec["Merged"] = len(val_list) 1331 | else: 1332 | rec[prop] = val_list[0] 1333 | 1334 | rec_counter += 1 1335 | if summary is not None: 1336 | summary[comp_id] = rec_counter 1337 | curr_time = time.time() 1338 | if curr_time - prev_time > 2.0:  # write the log only every two seconds 1339 | prev_time = curr_time 1340 | print(summary, file=open("pipeline.log", "w")) 1341 | summary.update() 1342 | 1343 | yield rec 1344 | 1345 | if summary: 1346 | print(summary, file=open("pipeline.log", "w")) 1347 | summary.update() 1348 | 1349 | 1350 | 1351 | def dict_from_csv(fn, max_records=0): 1352 | """Read a CSV file and return a dict with the headers as keys and the columns as value lists.
1353 | Empty cells are np.nan.""" 1354 | 1355 | d = defaultdict(list) 1356 | 1357 | if ".gz" in fn: 1358 | f = gzip.open(fn, mode="rt") 1359 | else: 1360 | f = open(fn) 1361 | 1362 | reader = csv.DictReader(f, dialect="excel-tab") 1363 | 1364 | for rec_counter, row_dict in enumerate(reader, 1): 1365 | for k in row_dict: 1366 | v = row_dict[k] 1367 | if v == "" or v is None: 1368 | d[k].append(np.nan) 1369 | else: 1370 | d[k].append(get_value(v)) 1371 | 1372 | if max_records > 0 and rec_counter >= max_records: break 1373 | 1374 | print(" > {} records read".format(rec_counter)) 1375 | 1376 | return d 1377 | 1378 | 1379 | def generate_pipe_from_csv(fn): 1380 | """Generate a valid pipeline from a formatted csv file (see examples/example_pipe.ods).""" 1381 | 1382 | f = open(fn) 1383 | reader = list(csv.DictReader(f, dialect="excel-tab")) 1384 | num_of_lines = len(reader) 1385 | pipe_list = ["s = p.Summary()\n"] 1386 | for line_no, row_dict in enumerate(reader, 1): 1387 | # clean up the field 1388 | for k in row_dict: 1389 | # replace the weird quotation marks that my Libreoffice exports: 1390 | if "”" in row_dict[k]: 1391 | row_dict[k] = row_dict[k].replace("”", '"') 1392 | if "“" in row_dict[k]: 1393 | row_dict[k] = row_dict[k].replace("“", '"') 1394 | 1395 | if row_dict["Summary"]: 1396 | if row_dict["KWargs"]: 1397 | row_dict["KWargs"] = row_dict["KWargs"] + ", 'summary': s" 1398 | else: 1399 | row_dict["KWargs"] = "'summary': s" 1400 | 1401 | if line_no == 1: 1402 | if row_dict["KWargs"]: 1403 | pipe_list.append("rd = p.{Component}({Args}, **{{{KWargs}}})\n".format(**row_dict)) 1404 | else: 1405 | pipe_list.append("rd = p.{Component}({Args})\n".format(**row_dict)) 1406 | pipe_list.append("res = p.pipe(\n rd,\n") 1407 | continue 1408 | 1409 | if row_dict["Args"] or row_dict["KWargs"]: 1410 | pipe_list.append(" (p.{}".format(row_dict["Component"])) 1411 | if row_dict["Args"]: 1412 | pipe_list.append(", {}".format(row_dict["Args"])) 1413 | if row_dict["KWargs"]: 1414 | 
pipe_list.append(', {{{}}}'.format(row_dict["KWargs"])) 1415 | pipe_list.append(')') 1416 | else: 1417 | pipe_list.append(' p.{}'.format(row_dict["Component"])) 1418 | 1419 | if line_no < num_of_lines: 1420 | pipe_list.append(",\n") 1421 | else: 1422 | pipe_list.append("\n") 1423 | 1424 | pipe_list.append(')') 1425 | 1426 | pipe_str = "".join(pipe_list) 1427 | if IPY: 1428 | IPY.set_next_input(pipe_str) 1429 | else: 1430 | print(pipe_str) 1431 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/resources/clustering/css/collapsible_list.css: -------------------------------------------------------------------------------- 1 | ul { 2 | list-style: none; 3 | margin: 0; 4 | padding: 0; 5 | } 6 | li { 7 | background-image: url(../icons/table.png); 8 | background-position: 0 1px; 9 | background-repeat: no-repeat; 10 | padding-left: 20px; 11 | } 12 | li.folder { 13 | background-image: url(../icons/add.png); 14 | } 15 | 16 | a { 17 | color: #000000; 18 | cursor: pointer; 19 | text-decoration: none; 20 | } 21 | a:hover { 22 | text-decoration: underline; 23 | } 24 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/resources/clustering/css/index_style.css: -------------------------------------------------------------------------------- 1 | body{ 2 | background-color: #FFFFFF; 3 | font-family: freesans, arial, verdana, sans-serif; 4 | } 5 | h3 { 6 | margin-bottom: 10px; 7 | } 8 | td { 9 | border-collapse:collapse; 10 | border-width:thin; 11 | border-style:hidden; 12 | border-color:black; 13 | padding-right: 80px; 14 | padding-bottom: 10px; 15 | } 16 | table { 17 | border-collapse:collapse; 18 | border-width:thin; 19 | border-style:hidden; 20 | border-color:black; 21 | background-color: #FFFFFF; 22 | text-align: left; 23 | } 24 | 25 | 26 | 27 | 28 | -------------------------------------------------------------------------------- 
/rdkit_ipynb_tools/resources/clustering/css/lit_style.css: -------------------------------------------------------------------------------- 1 | body{ 2 | background-color: #FFFFFF; 3 | font-family: freesans, arial, verdana, sans-serif; 4 | font-size: small; 5 | } 6 | h3 { 7 | margin-bottom: 10px; 8 | } 9 | td { 10 | border-collapse:collapse; 11 | border-width:thin; 12 | border-style:hidden; 13 | border-color:black; 14 | padding-right: 1px; 15 | padding-bottom: 1px; 16 | } 17 | table { 18 | border-collapse:collapse; 19 | border-width:thin; 20 | border-style:hidden; 21 | border-color:black; 22 | background-color: #FFFFFF; 23 | text-align: left; 24 | } 25 | 26 | 27 | 28 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/resources/clustering/css/style.css: -------------------------------------------------------------------------------- 1 | body{ 2 | background-color: #FFFFFF; 3 | font-family: freesans, arial, verdana, sans-serif; 4 | } 5 | th { 6 | border-collapse: collapse; 7 | border-width: thin; 8 | border-style: solid; 9 | border-color: black; 10 | text-align: left; 11 | font-weight: bold; 12 | } 13 | td { 14 | border-collapse:collapse; 15 | border-width:thin; 16 | border-style:solid; 17 | border-color:black; 18 | } 19 | table { 20 | border-collapse:collapse; 21 | border-width:thin; 22 | border-style:solid; 23 | border-color:black; 24 | background-color: #FFFFFF; 25 | text-align: left; 26 | } 27 | 28 | 29 | 30 | 31 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/resources/clustering/lib/btn_callbacks.js: -------------------------------------------------------------------------------- 1 | function toggle_clusters() { 2 | if (window.flag_clusters_expanded) { 3 | $('ul ul').hide(); 4 | window.flag_clusters_expanded = false; 5 | } else { 6 | $('ul ul').show(); 7 | window.flag_clusters_expanded = true; 8 | } 9 | } 10 | 11 | function toggle_histograms() { 12 | if 
(window.flag_histograms_shown) { 13 | $('.histogram').hide(); 14 | window.flag_histograms_shown = false; 15 | } else { 16 | $('.histogram').show(); 17 | window.flag_histograms_shown = true; 18 | } 19 | } 20 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/resources/clustering/lib/folding.js: -------------------------------------------------------------------------------- 1 | // function which handles the folding 2 | function folding() { 3 | // Find list items representing folders and 4 | // style them accordingly. Also, turn them 5 | // into links that can expand/collapse the 6 | // tree leaf. 7 | $('li > ul').each(function(i) { 8 | // Find this list's parent list item. 9 | var parent_li = $(this).parent('li'); 10 | 11 | // Style the list item as folder. 12 | parent_li.addClass('folder'); 13 | 14 | // Temporarily remove the list from the 15 | // parent list item, wrap the remaining 16 | // text in an anchor, then reattach it. 17 | var sub_ul = $(this).remove(); 18 | parent_li.wrapInner('').find('a').click(function() { 19 | // Make the anchor toggle the leaf display. 20 | sub_ul.slideToggle(); 21 | }); 22 | parent_li.append(sub_ul); 23 | }); 24 | 25 | // Hide all lists except the outermost. 26 | $('ul ul').hide(); 27 | //$('ul ul').show(); 28 | }; 29 | -------------------------------------------------------------------------------- /rdkit_ipynb_tools/sar.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ### 5 | SAR 6 | ### 7 | 8 | *Created on Tue Mar 14, 2017 by A. Pahl* 9 | 10 | SAR Tools. 
11 | """ 12 | 13 | # import csv, os 14 | import base64, pickle, sys, time 15 | import os.path as op 16 | from collections import Counter 17 | import re 18 | import colorsys 19 | 20 | from rdkit.Chem import AllChem as Chem 21 | from rdkit.Chem import Draw 22 | 23 | from rdkit.Chem.Draw import SimilarityMaps 24 | from rdkit import DataStructs 25 | 26 | try: 27 | Draw.DrawingOptions.atomLabelFontFace = "DejaVu Sans" 28 | Draw.DrawingOptions.atomLabelFontSize = 18 29 | except KeyError: # Font "DejaVu Sans" is not available 30 | pass 31 | 32 | import numpy as np 33 | from sklearn.ensemble import RandomForestClassifier 34 | 35 | from . import tools, html_templates as html, nb_tools as nbt 36 | 37 | 38 | from IPython.core.display import HTML, display, clear_output 39 | 40 | if sys.version_info[0] > 2: 41 | PY3 = True 42 | from io import BytesIO as IO 43 | else: 44 | PY3 = False 45 | from cStringIO import StringIO as IO 46 | 47 | try: 48 | from misc_tools import apl_tools as apt 49 | AP_TOOLS = True 50 | except ImportError: 51 | AP_TOOLS = False 52 | 53 | if AP_TOOLS: 54 | #: Library version 55 | VERSION = apt.get_commit(__file__) 56 | # I use this to keep track of the library versions I use in my project notebooks 57 | print("{:45s} (commit: {})".format(__name__, VERSION)) 58 | else: 59 | print("{:45s} ({})".format(__name__, time.strftime("%y%m%d-%H:%M", time.localtime(op.getmtime(__file__))))) 60 | 61 | 62 | BGCOLOR = "#94CAEF" 63 | IMG_GRID_SIZE = 235 64 | 65 | 66 | TABLE_INTRO = """""" 67 | HTML_INTRO = """ 68 | 69 | 70 | %s 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 |

%s (%s)

81 | 82 | """ 83 | HTML_EXTRO = """
84 | 97 | 98 | """ 99 | 100 | LOGP_INTRO = """ 101 |
102 |

103 |

LogP color coding:

104 | 105 | 106 | 107 | """ 108 | LOGP_EXTRO = "\n
\n" 109 | 110 | 111 | class ColorScale(): 112 | 113 | def __init__(self, num_values, val_min, val_max, middle_color="yellow", 114 | reverse=False, is_lin=True): 115 | self.num_values = num_values 116 | self.num_val_1 = num_values - 1 117 | self.is_lin = is_lin # is the range fo the color prop linear (e.g. pIC50)? 118 | if self.is_lin: 119 | self.value_min = val_min 120 | self.value_max = val_max 121 | else: 122 | self.value_min = tools.pic50(val_min, "um") 123 | self.value_max = tools.pic50(val_max, "um") 124 | self.reverse = reverse 125 | self.value_range = self.value_max - self.value_min 126 | self.color_scale = [] 127 | if middle_color.startswith("y"): # middle color yellow 128 | hsv_tuples = [(0.0 + ((x * 0.35) / (self.num_val_1)), 0.99, 0.9) for x in range(self.num_values)] 129 | self.reverse = not self.reverse 130 | else: # middle color blue 131 | hsv_tuples = [(0.35 + ((x * 0.65) / (self.num_val_1)), 0.9, 0.9) for x in range(self.num_values)] 132 | rgb_tuples = map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples) 133 | for rgb in rgb_tuples: 134 | rgb_int = [int(255 * x) for x in rgb] 135 | self.color_scale.append('#{:02x}{:02x}{:02x}'.format(*rgb_int)) 136 | 137 | if self.reverse: 138 | self.color_scale.reverse() 139 | 140 | def __call__(self, value): 141 | """return the color from the scale corresponding to the place in the value_min .. 
value_max range""" 142 | if not self.is_lin: 143 | value = tools.pic50(value, "um") 144 | pos = int(((value - self.value_min) / self.value_range) * self.num_val_1) 145 | 146 | return self.color_scale[pos] 147 | 148 | 149 | def legend(self): 150 | """Return the value_range and a list of tuples (value, color) to be used in a legend.""" 151 | legend = [] 152 | for idx, color in enumerate(self.color_scale): 153 | val = self.value_min + idx / self.num_val_1 * self.value_range 154 | if not self.is_lin: 155 | val = tools.ic50(val, "um") 156 | legend.append((val, color)) 157 | 158 | return legend 159 | 160 | 161 | def format_num(val): 162 | """Return a suitable format string depending on the size of the value.""" 163 | if val > 50: 164 | return ".0f" 165 | if val > 1: 166 | return ".1f" 167 | else: 168 | return ".2f" 169 | 170 | 171 | def _get_proba(fp, predictionFunction): 172 | return predictionFunction([fp])[0][1] 173 | 174 | 175 | def b64_fig(fig, dpi=72): 176 | img_file = IO() 177 | fig.savefig(img_file, dpi=dpi, format='PNG', bbox_inches="tight") 178 | b64 = base64.b64encode(img_file.getvalue()) 179 | if PY3: 180 | b64 = b64.decode() 181 | img_file.close() 182 | return b64 183 | 184 | 185 | class SAR_List(tools.Mol_List): 186 | def __init__(self, *args, **kwargs): 187 | super().__init__(*args, **kwargs) 188 | self.model = None 189 | self.html = None 190 | 191 | 192 | def _pass_properties(self, new_list): 193 | new_list.order = self.order 194 | new_list.ia = self.ia 195 | new_list.plot_tool = self.plot_tool 196 | new_list.model = self.model 197 | new_list.html = None 198 | 199 | 200 | def __getitem__(self, item): 201 | result = list.__getitem__(self, item) 202 | try: 203 | new_list = type(self)(result) 204 | 205 | # pass on properties 206 | self._pass_properties(new_list) 207 | return new_list 208 | except TypeError: 209 | return result 210 | 211 | 212 | def new(self, *args): 213 | new_list = type(self)(*args) 214 | # pass on properties 215 | 
self._pass_properties(new_list) 216 | return new_list 217 | 218 | 219 | def train(self, act_class_prop="AC_Real"): 220 | """Generates the trained model.""" 221 | self.model = train(self, act_class_prop) 222 | 223 | 224 | def predict(self): 225 | """Adds predictions from the trained model to the SAR_List. 226 | Model has to be available as `self.model`.""" 227 | if self.model is None: 228 | raise LookupError("Model is not available. Please train first.") 229 | predict(self, self.model) 230 | self.html = None 231 | 232 | 233 | def analyze(self, act_class="AC_Real", pred_class="AC_Pred"): 234 | """Prints the ratio of successful predictions for the molecules which have `act_class` and `pred_class` properties.""" 235 | mol_ctr = Counter() 236 | hit_ctr = Counter() 237 | for mol in self: 238 | if mol.HasProp(act_class) and mol.HasProp(pred_class): 239 | mol_ctr[int(mol.GetProp(act_class))] += 1 240 | if mol.GetProp(act_class) != mol.GetProp(pred_class): 241 | continue 242 | hit_ctr[int(mol.GetProp(act_class))] += 1 243 | if len(mol_ctr) > 0: 244 | sum_mol_ctr = sum(mol_ctr.values()) 245 | sum_hit_ctr = sum(hit_ctr.values()) 246 | print("Number of correctly predicted molecules: {} / {} ({:.2f}%)" 247 | .format(sum_hit_ctr, sum_mol_ctr, 100 * sum_hit_ctr / 248 | sum_mol_ctr)) 249 | print("\nCorrectly predicted molecules per Activity Class:") 250 | for c in sorted(hit_ctr): 251 | print(" {}: {:.2f}".format(c, 100 * hit_ctr[c] / mol_ctr[c])) 252 | else: 253 | print("No molecules found with both {} and {}.".format(act_class, pred_class)) 254 | return hit_ctr, mol_ctr 255 | 256 | 257 | def save_model(self, fn="sar"): 258 | if self.model is None: 259 | print("No model available.") 260 | return 261 | save_model(self.model, fn) 262 | 263 | 264 | def load_model(self, fn="sar", force=False): 265 | if self.model is not None and not force: 266 | print("There is already a model available. 
Use `force=True` to override.") 267 | return 268 | if not fn.endswith(".model"): 269 | fn = fn + ".model" 270 | with open(fn, "rb") as f: 271 | self.model = pickle.load(f) 272 | print(" > model loaded (last modified: {}).".format(time.strftime("%Y-%m-%d %H:%M", time.localtime(op.getmtime(fn))))) 273 | 274 | 275 | def sim_map(self): 276 | if self.html is None: 277 | self.html = sim_map(self, self.model, id_prop=self.id_prop, order=self.order) 278 | else: 279 | print("Using cached HTML content...") 280 | print("Set property `html` to `None` to re-generate.") 281 | return HTML(self.html) 282 | 283 | 284 | def write_sim_map(self, fn="sim_map.html", title="Similarity Map", summary=None): 285 | if self.html is None: 286 | self.html = sim_map(self, self.model, id_prop=self.id_prop, order=self.order) 287 | else: 288 | print("Using cached HTML content...") 289 | print("Set property `html` to `None` to re-generate.") 290 | html.write(html.page(self.html, summary=summary, title=title), fn=fn) 291 | return HTML('
{}'.format(fn, fn)) 292 | 293 | 294 | def map_from_id(self, cpd_id=None): 295 | if cpd_id is None and tools.WIDGETS: 296 | def show_sim_map(ev): 297 | cpd_id = tools.get_value(w_input_id.value.strip()) 298 | clear_output() 299 | w_input_id.value = "" 300 | display(self.new_list_from_ids(cpd_id).sim_map()) 301 | 302 | w_input_id = tools.ipyw.Text(description="Compound Id:") 303 | # w_btn_clear_input = tools.ipyw.Button(description="Clear Input") 304 | # w_btn_clear_input.on_click(clear_input) 305 | w_btn_show = tools.ipyw.Button(description="Show Sim Map") 306 | w_btn_show.on_click(show_sim_map) 307 | 308 | w_hb_show = tools.ipyw.HBox(children=[w_input_id, w_btn_show]) 309 | # tools.set_margin(w_vb_search1) 310 | display(w_hb_show) 311 | else: 312 | new_list = self.new_list_from_ids(cpd_id) 313 | return HTML(sim_map(new_list, self.model, id_prop=self.id_prop, order=self.order)) 314 | 315 | 316 | def train(mol_list, act_class_prop="AC_Real"): 317 | """Returns the trained model.""" 318 | fps = [] 319 | act_classes = [] 320 | for mol in mol_list: 321 | fps.append(Chem.GetMorganFingerprintAsBitVect(mol, 2)) 322 | act_classes.append(tools.get_value(mol.GetProp(act_class_prop))) 323 | np_fps = [] 324 | for fp in fps: 325 | arr = np.zeros((1,)) 326 | DataStructs.ConvertToNumpyArray(fp, arr) 327 | np_fps.append(arr) 328 | 329 | # get a random forest classifier with 100 trees 330 | rf = RandomForestClassifier(n_estimators=100, random_state=1123) 331 | rf.fit(np_fps, act_classes) 332 | return rf 333 | 334 | 335 | def predict_mol(mol, model): 336 | """Returns the predicted class and the probabilities for a molecule. 
337 | 338 | Parameters: 339 | model: Output from `train()`.""" 340 | fp = np.zeros((1,)) 341 | DataStructs.ConvertToNumpyArray(Chem.GetMorganFingerprintAsBitVect(mol, 2), fp) 342 | fp = fp.reshape(1, -1) # this removes the deprecation warning 343 | predict_class = model.predict(fp) 344 | predict_prob = model.predict_proba(fp) 345 | return predict_class[0], predict_prob[0] 346 | 347 | 348 | def predict(mol_list, model): 349 | for mol in mol_list: 350 | pred_class, pred_prob = predict_mol(mol, model) 351 | mol.SetProp("AC_Pred", str(pred_class)) 352 | mol.SetProp("Prob", "{:.2}".format(pred_prob[pred_class])) 353 | 354 | 355 | def save_model(model, fn="sar"): 356 | if not fn.endswith(".model"): 357 | fn = fn + ".model" 358 | with open(fn, "wb") as f: 359 | pickle.dump(model, f) 360 | 361 | 362 | def load_sdf(fn, model_name=None): 363 | mol_list = tools.load_sdf(fn) 364 | sar_list = SAR_List(mol_list) 365 | if model_name is None: 366 | print(" * No model was loaded. Please provide a name to load.") 367 | else: 368 | try: 369 | sar_list.load_model(model_name) 370 | except FileNotFoundError: 371 | print(" * Model {} could not be found. No model was loaded".format(model_name)) 372 | return sar_list 373 | 374 | 375 | def sim_map(mol_list, model, id_prop=None, interact=False, highlight=None, show_hidden=False, order=None, size=300): 376 | """Parameters: 377 | mol_list (Mol_List): List of RDKit molecules 378 | highlight (dict): Dict of properties (special: *all*) and values to highlight cells, 379 | e.g. {"activity": "< 50"} 380 | show_hidden (bool): Whether to show hidden properties (name starts with _) or not. 381 | Defaults to *False*. 382 | link (str): column used for linking out 383 | target (str): column used as link target 384 | order (list): A list of substrings to match with the field names for ordering in the table header 385 | img_dir (str): if None, the molecule images are embedded in the HTML doc. 
386 | Otherwise the images will be stored in img_dir and linked in the doc. 387 | 388 | Returns: 389 | HTML table as TEXT to embed in IPython or a web page.""" 390 | 391 | time_stamp = time.strftime("%y%m%d%H%M%S") 392 | td_opt = {"style": "text-align: center;"} 393 | header_opt = {"bgcolor": "#94CAEF", "style": "text-align: center;"} 394 | table_list = [] 395 | prop_list = tools.list_fields(mol_list) 396 | 397 | if isinstance(order, list): 398 | for k in reversed(order): 399 | prop_list.sort(key=lambda x: k.lower() in x.lower(), reverse=True) 400 | 401 | if id_prop is None: 402 | guessed_id = tools.guess_id_prop(prop_list) 403 | else: 404 | guessed_id = id_prop 405 | 406 | if interact and guessed_id is not None: 407 | table_list.append(tools.TBL_JAVASCRIPT.format(ts=time_stamp, bgcolor="transparent")) 408 | 409 | if id_prop is not None: 410 | if id_prop not in prop_list: 411 | raise LookupError("Id property {} not found in data set.".format(id_prop)) 412 | 413 | if len(mol_list) > 5: 414 | pb = nbt.ProgressbarJS() 415 | 416 | if guessed_id: 417 | # make sure that the id_prop (or the guessed id prop) is first: 418 | prop_list.pop(prop_list.index(guessed_id)) 419 | tmp_list = [guessed_id] 420 | tmp_list.extend(prop_list) 421 | prop_list = tmp_list 422 | 423 | cells = html.td(html.b("#"), header_opt) 424 | cells.extend(html.td(html.b("Molecule"), header_opt)) 425 | cells.extend(html.td(html.b("SimMap"), header_opt)) 426 | for prop in prop_list: 427 | cells.extend(html.td(html.b(prop), header_opt)) 428 | rows = html.tr(cells) 429 | 430 | list_len = len(mol_list) 431 | for idx, mol in enumerate(mol_list): 432 | if len(mol_list) > 5: 433 | pb.update(100 * (idx + 1) / list_len) 434 | cells = [] 435 | mol_props = mol.GetPropNames() 436 | 437 | if guessed_id: 438 | id_prop_val = mol.GetProp(guessed_id) 439 | img_id = id_prop_val 440 | cell_opt = {"id": "{}_{}".format(id_prop_val, time_stamp)} 441 | else: 442 | img_id = idx 443 | cell_opt = {"id": str(idx)} 444 | 445 | cell 
= html.td(str(idx), cell_opt) 446 | cells.extend(cell) 447 | 448 | if not mol: 449 | cells.extend(html.td("no structure")) 450 | 451 | else: 452 | b64 = tools.b64_img(mol, size * 2) 453 | img_src = "data:image/png;base64,{}".format(b64) 454 | cell_opt = {} 455 | if interact and guessed_id is not None: 456 | img_opt = {"title": "Click to select / unselect", 457 | "onclick": "toggleCpd('{}')".format(id_prop_val)} 458 | else: 459 | img_opt = {"title": str(img_id)} 460 | # img_opt["width"] = size 461 | # img_opt["height"] = size 462 | img_opt["style"] = 'max-width: {}px; max-height: {}px; display: block; margin: auto;'.format(size, size) 463 | 464 | cell = html.img(img_src, img_opt) 465 | cells.extend(html.td(cell, cell_opt)) 466 | 467 | 468 | fig, _ = SimilarityMaps.GetSimilarityMapForModel( 469 | mol, SimilarityMaps.GetMorganFingerprint, lambda x: _get_proba(x, model.predict_proba)) 470 | b64 = b64_fig(fig, dpi=72) 471 | img_src = "data:image/png;base64,{}".format(b64) 472 | cell_opt = {} 473 | img_opt["style"] = 'max-width: {}px; max-height: {}px; display: block; margin: auto;'.format(size, size) 474 | cell = html.img(img_src, img_opt) 475 | cells.extend(html.td(cell, cell_opt)) 476 | 477 | for prop in prop_list: 478 | td_opt = {"style": "text-align: center;"} 479 | if prop in mol_props: 480 | if not show_hidden and prop.startswith("_"): continue 481 | td_opt["title"] = prop 482 | prop_val = mol.GetProp(prop) 483 | if highlight: 484 | eval_str = None 485 | if "*all*" in highlight: 486 | if not guessed_id or (guessed_id and prop != guessed_id): 487 | eval_str = " ".join([prop_val, highlight["*all*"]]) 488 | else: 489 | if prop in highlight: 490 | eval_str = " ".join([prop_val, highlight[prop]]) 491 | if eval_str and eval(eval_str): 492 | td_opt["bgcolor"] = "#99ff99" 493 | 494 | cells.extend(html.td(prop_val, td_opt)) 495 | else: 496 | cells.extend(html.td("", td_opt)) 497 | 498 | rows.extend(html.tr(cells)) 499 | 500 | table_list.extend(html.table(rows)) 501 | 502 | 
if interact and guessed_id is not None: 503 | table_list.append(tools.ID_LIST.format(ts=time_stamp)) 504 | 505 | if len(mol_list) > 5: 506 | pb.done() 507 | return "".join(table_list) 508 | 509 | 510 | def legend_table(legend): 511 | """Return an HTML table with the ColorScale legend as text. 512 | 513 | Parameters: 514 | legend (list): list of tuples as returned from ColorScale.legend().""" 515 | intro = "\n\n" 516 | extro = "\n\n
\n" 517 | tbl_list = [intro] 518 | rnge = abs(legend[0][0] - legend[-1][0]) 519 | digits = format_num(rnge) 520 | for tup in legend: 521 | cell = "{val:{digits}}".format(color=tup[1], val=tup[0], digits=digits) 522 | tbl_list.append(cell) 523 | 524 | tbl_list.append(extro) 525 | 526 | return "".join(tbl_list) 527 | 528 | 529 | def get_res_pos(smiles): 530 | pat = re.compile('\[(.*?)\*\]') 531 | pos_str = re.findall(pat, smiles)[0] 532 | if pos_str: 533 | return int(pos_str) 534 | else: 535 | return 0 536 | 537 | 538 | def generate_sar_table(db_list, core, id_prop, act_prop, sort_reverse=True, 539 | dir_name="html/sar_table", color_prop="logp"): 540 | """core: smiles string; id_prop, act_prop: string 541 | colorprop_is_lin: whether or not the property used for coloring is linear (e.g. LogP or PercActivity) or needs to be logarithmitized (e.g. IC50_uM).""" 542 | 543 | tools.create_dir_if_not_exist(dir_name) 544 | tools.create_dir_if_not_exist(op.join(dir_name, "img")) 545 | 546 | db_list.sort_list(act_prop, reverse=sort_reverse) 547 | 548 | act_xy = np.zeros([55, 55], dtype=np.float) # coordinates for the activity 549 | # color_xy = np.zeros([55, 55], dtype=np.float) 550 | color_xy = np.full([55, 55], np.NaN, dtype=np.float) 551 | molid_xy = np.zeros([55, 55], dtype=np.int) 552 | # molid_xy = np.arange(900, dtype=np.int).reshape(30, 30) # coordinates for the molid 553 | rx_dict = {} # axes for the residues 554 | ry_dict = {} 555 | max_x = -1 # keep track of the arraysize 556 | max_y = -1 557 | res_pos_x = -1 558 | res_pos_y = -1 559 | 560 | core_mol = Chem.MolFromSmiles(core) 561 | Draw.MolToFile(core_mol, "%s/img/core.png" % dir_name, [90, 90]) 562 | 563 | for idx, mol in enumerate(db_list): 564 | act = float(mol.GetProp(act_prop)) 565 | color = float(mol.GetProp(color_prop)) 566 | molid = int(mol.GetProp(id_prop)) 567 | tmp = Chem.ReplaceCore(mol, core_mol, labelByIndex=True) 568 | frag_mols = list(Chem.GetMolFrags(tmp, asMols=True)) 569 | frag_smiles = 
[Chem.MolToSmiles(m, True) for m in frag_mols] 570 | if len(frag_mols) == 1: 571 | # one of the two residues is H: 572 | pos = get_res_pos(frag_smiles[0]) 573 | if pos == res_pos_x: 574 | h_smiles = "[%d*]([H])" % res_pos_y 575 | frag_smiles.append(h_smiles) 576 | frag_mols.append(Chem.MolFromSmiles(h_smiles)) 577 | else: 578 | h_smiles = "[%d*]([H])" % res_pos_x 579 | frag_smiles.insert(0, h_smiles) 580 | frag_mols.insert(0, Chem.MolFromSmiles(h_smiles)) 581 | 582 | print(" adding H residue in pos {} to mol #{} (molid: {})".format(pos, idx, mol.GetProp(id_prop))) 583 | 584 | elif len(frag_mols) > 2: 585 | print("* incorrect number of fragments ({}) in mol #{} (molid: {})".format(len(frag_mols), idx, mol.GetProp(id_prop))) 586 | continue 587 | 588 | if res_pos_x == -1: 589 | # print frag_smiles[0], frag_smiles[1] 590 | res_pos_x = get_res_pos(frag_smiles[0]) 591 | res_pos_y = get_res_pos(frag_smiles[1]) 592 | # print "res_pos_x: {} res_pos_y: {}".format(res_pos_x, res_pos_y) 593 | else: 594 | test_pos_x = get_res_pos(frag_smiles[0]) 595 | if test_pos_x != res_pos_x: # switch residues 596 | frag_smiles = frag_smiles[::-1] 597 | frag_mols = frag_mols[::-1] 598 | if frag_smiles[0] in rx_dict: 599 | curr_x = rx_dict[frag_smiles[0]] 600 | else: 601 | max_x += 1 602 | rx_dict[frag_smiles[0]] = max_x 603 | curr_x = max_x 604 | Draw.MolToFile(frag_mols[0], "%s/img/frag_x_%02d.png" % (dir_name, max_x), [100, 100]) 605 | if frag_smiles[1] in ry_dict: 606 | curr_y = ry_dict[frag_smiles[1]] 607 | else: 608 | max_y += 1 609 | ry_dict[frag_smiles[1]] = max_y 610 | curr_y = max_y 611 | Draw.MolToFile(frag_mols[1], "%s/img/frag_y_%02d.png" % (dir_name, max_y), [100, 100]) 612 | 613 | # draw the whole molecule for the tooltip 614 | img_file = op.join(dir_name, "img/", "cpd_{}_{}.png".format(curr_x, curr_y)) 615 | img = tools.autocrop(Draw.MolToImage(mol), "white") 616 | img.save(img_file, format='PNG') 617 | 618 | act_xy[curr_x][curr_y] = act 619 | color_xy[curr_x][curr_y] = color 
620 | molid_xy[curr_x][curr_y] = molid 621 | 622 | return act_xy, molid_xy, color_xy, max_x, max_y 623 | 624 | 625 | def sar_table_report_html(act_xy, molid_xy, color_xy, max_x, max_y, color_by="logp", 626 | reverse_color=False, colorprop_is_lin=True, 627 | show_link=False, show_tooltip=True): 628 | if "logp" in color_by.lower(): 629 | # logp_colors = {2.7: "#5F84FF", 3.0: "#A4D8FF", 4.2: "#66FF66", 5.0: "#FFFF66", 1000.0: "#FF4E4E"} 630 | logp_colors = {2.7: "#98C0FF", 3.0: "#BDF1FF", 4.2: "#AAFF9B", 5.0: "#F3FFBF", 1000.0: "#FF9E9E"} 631 | 632 | else: 633 | color_min = float(np.nanmin(color_xy)) 634 | color_max = float(np.nanmax(color_xy)) 635 | color_scale = ColorScale(20, color_min, color_max, 636 | reverse=reverse_color, is_lin=colorprop_is_lin) 637 | 638 | # write horizontal residues 639 | line = [TABLE_INTRO] 640 | line.append("\nCore:
\"icon\"") 641 | for curr_x in range(max_x + 1): 642 | line.append("\"icon\"" % curr_x) 643 | 644 | line.append("\n\n") 645 | 646 | for curr_y in range(max_y + 1): 647 | line.append("\"icon\"" % curr_y) 648 | for curr_x in range(max_x + 1): 649 | molid = molid_xy[curr_x][curr_y] 650 | if molid > 0: 651 | link_in = "" 652 | link_out = "" 653 | bg_color = " " 654 | mouseover = "" 655 | if show_link: 656 | link = "../reports/ind_stock_results.htm#cpd_%05d" % molid 657 | link_in = "" % link 658 | link_out = "" 659 | if "logp" in color_by.lower(): 660 | logp = color_xy[curr_x][curr_y] 661 | if show_tooltip: 662 | prop_tip = 'LogP: %.2f' % logp 663 | for limit in sorted(logp_colors): 664 | if logp <= limit: 665 | bg_color = ' bgcolor="%s"' % logp_colors[limit] 666 | break 667 | else: 668 | value = float(color_xy[curr_x][curr_y]) 669 | html_color = color_scale(value) 670 | bg_color = ' bgcolor="{}"'.format(html_color) 671 | if show_tooltip: 672 | prop_tip = '{}: {:.2f}'.format(color_by, color_xy[curr_x][curr_y]) 673 | 674 | if show_tooltip: 675 | tool_tip = '"icon"

{}'.format(curr_x, curr_y, prop_tip) 676 | mouseover = """ onmouseover="Tip('{}')" onmouseout="UnTip()" """.format(tool_tip) 677 | 678 | line.append("%.2f

(%s%d%s)" % (mouseover, bg_color, act_xy[curr_x][curr_y], link_in, molid_xy[curr_x][curr_y], link_out)) 679 | 680 | else: # empty value in numpy array 681 | line.append("") 682 | 683 | 684 | line.append("\n") 685 | 686 | line.append("\n\n") 687 | 688 | if "logp" in color_by.lower(): 689 | line.append(LOGP_INTRO) 690 | for limit in sorted(logp_colors): 691 | line.append('≤ %.2f' % (logp_colors[limit], limit)) 692 | line.append("\n\n") 693 | line.append(LOGP_EXTRO) 694 | else: 695 | line.append("

Coloring legend for {}:
\n".format(color_by)) 696 | legend = color_scale.legend() 697 | line.append(legend_table(legend)) 698 | 699 | 700 | html_table = "".join(line) 701 | 702 | return html_table 703 | 704 | 705 | def write_html_page(html_content, dir_name="html/sar_table", page_name="sar_table", page_title="SAR Table"): 706 | 707 | tools.create_dir_if_not_exist(dir_name) 708 | tools.create_dir_if_not_exist(op.join(dir_name, "img")) 709 | 710 | filename = op.join(dir_name, "%s.htm" % page_name) 711 | f = open(filename, "w") 712 | f.write(HTML_INTRO % (page_title, page_title, time.strftime("%d-%b-%Y"))) 713 | 714 | f.write(html_content) 715 | 716 | f.write(HTML_EXTRO) 717 | f.close() 718 | -------------------------------------------------------------------------------- /tutorial/chembl_et-a_antagonists.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apahl/rdkit_ipynb_tools/c259ac8ee75709becd2a5e67f9a913bd20e0ae38/tutorial/chembl_et-a_antagonists.txt.gz -------------------------------------------------------------------------------- /tutorial/pipeline.log: -------------------------------------------------------------------------------- 1 | start_csv_reader : 2323 2 | pipe_has_prop_filter : 1454 3 | stop_mol_list_from_stream: 1453 (time: 00h 00m 4.55s) 4 | --------------------------------------------------------------------------------