├── .gitignore
├── .readthedocs.yaml
├── CHANGELOG.md
├── LICENSE
├── README.md
├── docs
    ├── Makefile
    ├── conf.py
    ├── index.rst
    ├── make.bat
    └── requirements.txt
├── drug2cell
    ├── __init__.py
    ├── chembl.py
    ├── cpdb_dict.pkl
    ├── data.py
    ├── drug-target_dicts.pkl
    └── util.py
├── notebooks
    ├── chembl
    │   ├── filtering.ipynb
    │   └── initial_database_parsing.ipynb
    └── demo.ipynb
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | docs/_build
3 | drug2cell/__pycache__
4 | 


--------------------------------------------------------------------------------
/.readthedocs.yaml:
--------------------------------------------------------------------------------
 1 | # .readthedocs.yaml
 2 | # Read the Docs configuration file
 3 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
 4 | 
 5 | # Required
 6 | version: 2
 7 | 
 8 | # Set the version of Python and other tools you might need
 9 | build:
10 |   os: ubuntu-20.04
11 |   tools:
12 |     python: "3.9"
13 | 
14 | # Build documentation in the docs/ directory with Sphinx
15 | sphinx:
16 |    configuration: docs/conf.py
17 | 
18 | # Install our python package before building the docs
19 | python:
20 |   install:
21 |     - requirements: docs/requirements.txt
22 |     - method: pip
23 |       path: .
24 | 


--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
 1 | # Changelog
 2 | 
 3 | ## 0.1.2
 4 | - regenerated drug target dictionary with updated thresholds
 5 | 
 6 | ## 0.1.1
 7 | - update blitzGSEA to make use of pip install and ``interactive_plot``
 8 | 
 9 | ## 0.1.0
10 | - initial release
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Drug2cell, (c) 2023, GRL (the "Software")
 2 |  
 3 | The Software remains the property of Genome Research Ltd ("GRL").
 4 | 
 5 | The Software is distributed "AS IS" under this Licence solely for non-commercial use in the hope that it will be useful, but in order that GRL as a charitable foundation protects its assets for the benefit of its educational and research purposes, GRL makes clear that no condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of the Software, or that it will be suitable for any particular purpose or for use under any specific conditions. Furthermore, GRL disclaims all responsibility for the use which is made of the Software. It further disclaims any liability for the outcomes arising from using  the Software.
 6 | 
 7 | The Licensee agrees to indemnify GRL and hold GRL harmless from and against any and all claims, damages and liabilities asserted by third parties (including claims for negligence) which arise directly or indirectly from the use of the Software or the sale of any products based on the Software.
 8 |  
 9 | No part of the Software may be reproduced, modified, transmitted or transferred in any form or by any means, electronic or mechanical, without the express permission of GRL. The permission of GRL is not required if the said reproduction, modification, transmission or transference is done without financial return, the conditions of this Licence are imposed upon the receiver of the product, and all original and amended source code is included in any transmitted product. You may be held legally responsible for any copyright infringement that is caused or encouraged by your failure to abide by these terms and conditions. 
10 | 
11 | You are not permitted under this Licence to use this Software commercially. Use for which any financial return is received shall be defined as commercial use, and includes (1) integration of all or part of the source code or the Software into a product for sale or license by or on behalf of Licensee to third parties or (2) use of the Software or any derivative of it for research with the final aim of developing software products for sale or license to a third party or (3) use of the Software or any derivative of it for research with the final aim of developing non-software products for sale or license to a third party, or (4) use of the Software to provide any service to an external organisation for which payment is received. If you are interested in using the Software commercially, please contact legal@sanger.ac.uk. Contact details are: legal@sanger.ac.uk quoting reference Drug2cell-software.
12 | 
13 | 
14 | 
15 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Drug2cell
 2 | 
 3 | This is a collection of utility functions for gene group activity evaluation in scanpy, both for per-cell scoring and marker-based enrichment/overrepresentation analyses. Drug2cell makes use of established methodology, and offers it in a convenient and efficient form for easy scanpy use. The package comes with a set of ChEMBL-derived drug target sets, but has straightforward input formatting so it can be easily used with any gene groups of choice.
 4 | 
 5 | ## Installation
 6 | 
 7 | ```bash
 8 | pip install drug2cell
 9 | ```
10 | 
11 | ## Usage
12 | 
13 | ```python3
14 | import drug2cell as d2c
15 | ```
16 | 
17 | **Per-cell scoring** can be done via `d2c.score()`, and creates a new gene group feature space object in `.uns['drug2cell']` of the supplied object. The `.obs` and `.obsm` of the original object are copied over for ease of downstream use, like plotting or potential marker detection.
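
As a rough illustration of what the default `"mean"` scoring computes, here is a minimal pure-Python sketch. The helper name `mean_scores` and the toy data are made up for this example; the package itself obtains the same quantity with a matrix product on the AnnData expression matrix, so treat this as a sketch of the idea rather than the actual implementation.

```python
def mean_scores(X, var_names, targets):
    """Per-cell mean expression of each gene group (illustrative helper, not part of drug2cell).

    X is a cells-by-genes list of rows, var_names the gene names of its columns,
    and targets a {group: [genes]} dictionary.
    """
    column = {gene: j for j, gene in enumerate(var_names)}
    scores = {}
    for group, genes in targets.items():
        # keep only genes present in the data, counting each gene once
        idx = sorted({column[g] for g in genes if g in column})
        if not idx:
            # groups with no genes in the data are dropped
            continue
        scores[group] = [sum(row[j] for j in idx) / len(idx) for row in X]
    return scores

# toy data: two cells, three genes
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
var_names = ["A", "B", "C"]
targets = {"d1": ["A", "C"], "d2": ["Z"]}
scores = mean_scores(X, var_names, targets)
```

Here `d1` gets one score per cell (the mean of genes `A` and `C`), while `d2` is left out because none of its genes appear in `var_names`.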
18 | 
19 | **Enrichment** is done with GSEA, contained within the function `d2c.gsea()`. **Overrepresentation** analysis can be done with the hypergeometric test of `d2c.hypergeometric()`. Both of those require having run `sc.tl.rank_genes_groups()` on the input object, and return one data frame per evaluated cluster with enrichment/overrepresentation results.
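
For intuition on the overrepresentation test, the following is a minimal standard-library sketch of a hypergeometric upper-tail p-value. The function names and toy gene names are assumptions for illustration, and the tail convention (probability of an overlap at least as large as the observed one) is the usual one rather than something confirmed from the package code.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes, K of which are in the group."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total

def overrepresentation_pvalue(group, markers, universe):
    """Illustrative helper: restrict both gene sets to the universe, then test their overlap."""
    group = set(group) & universe
    markers = set(markers) & universe
    k = len(group & markers)
    return hypergeom_sf(k, len(universe), len(group), len(markers))

# toy example: 10 genes in the universe, a 4-gene group, 3 markers, 2 of them in the group
universe = {f"g{i}" for i in range(10)}
group = ["g0", "g1", "g2", "g3"]
markers = ["g0", "g1", "g9"]
pval = overrepresentation_pvalue(group, markers, universe)
```

In practice a p-value like this would be computed for every gene group and then corrected for multiple testing.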
20 | 
21 | It's possible to provide your own gene groups as a dictionary, with the names of the groups as keys and the corresponding gene lists as the values. Pass this dictionary as the `targets` argument of any of those three functions.
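
A small sketch of the expected input format follows. The group, category and gene names are placeholders, and `flatten_targets` is a hypothetical helper written for this example to show how the nested form described in the function documentation relates to the flat one; it is not the package's own code.

```python
# flat form: group name -> list of gene symbols
flat_targets = {
    "group_a": ["GENE1", "GENE2"],
    "group_b": ["GENE2", "GENE3", "GENE4"],
}

# nested form: category -> {group name -> list of gene symbols}
nested_targets = {
    "category_1": {"group_a": ["GENE1", "GENE2"]},
    "category_2": {"group_b": ["GENE2", "GENE3", "GENE4"]},
}

def flatten_targets(nested, categories=None):
    """Hypothetical helper: collapse the nested form to the flat form, optionally keeping some categories."""
    if isinstance(categories, str):
        categories = [categories]
    if categories is None:
        categories = list(nested)
    flat = {}
    for category in categories:
        flat.update(nested[category])
    return flat

everything = flatten_targets(nested_targets)
only_first = flatten_targets(nested_targets, categories="category_1")
```

With the package installed, the flat dictionary would be passed as `d2c.score(adata, targets=flat_targets)`, and the nested one as `d2c.score(adata, targets=nested_targets, nested=True, categories="category_1")`.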
22 | 
23 | **Please refer to the [demo notebook](https://nbviewer.org/github/Teichlab/drug2cell/blob/main/notebooks/demo.ipynb); all of the input arguments are detailed in [ReadTheDocs](https://drug2cell.readthedocs.io/en/latest/).**
24 | 
25 | ## ChEMBL parsing
26 | 
27 | Drug2cell also features instructions on how to parse the ChEMBL database into drugs and their targets, in the form of two additional notebooks. Both refer to a pre-parsed data frame of ChEMBL human targets, which can be accessed at `ftp://ftp.sanger.ac.uk/pub/users/kp9/chembl_30_merged_genesymbols_humans.pkl`.
28 |  - [Filtering](https://nbviewer.org/github/Teichlab/drug2cell/blob/main/notebooks/chembl/filtering.ipynb) shows how this data frame was turned into the drugs:targets dictionary shipped with the package by default. There are some helper functions included in `d2c.chembl` which can assist you should you wish to filter it in a different way.
29 |  - [Initial database parsing](https://nbviewer.org/github/Teichlab/drug2cell/blob/main/notebooks/chembl/initial_database_parsing.ipynb) shows how the data frame was created from online resources.
30 |  
31 | ## Citation
32 | 
33 | If you use drug2cell in your work, please cite the [paper](https://www.nature.com/articles/s41586-023-06311-1):
34 | 
35 | ```
36 | @article{kanemaru2023spatially,
37 |   title={Spatially resolved multiomics of human cardiac niches},
38 |   author={Kanemaru, Kazumasa and Cranley, James and Muraro, Daniele and Miranda, Antonio MA and Ho, Siew Yen and Wilbrey-Clark, Anna and Patrick Pett, Jan and Polanski, Krzysztof and Richardson, Laura and Litvinukova, Monika and others},
39 |   journal={Nature},
40 |   volume={619},
41 |   number={7971},
42 |   pages={801--810},
43 |   year={2023},
44 |   publisher={Nature Publishing Group UK London}
45 | }
46 | ```
47 | 


--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
 1 | # Minimal makefile for Sphinx documentation
 2 | #
 3 | 
 4 | # You can set these variables from the command line, and also
 5 | # from the environment for the first two.
 6 | SPHINXOPTS    ?=
 7 | SPHINXBUILD   ?= sphinx-build
 8 | SOURCEDIR     = .
 9 | BUILDDIR      = _build
10 | 
11 | # Put it first so that "make" without argument is like "make help".
12 | help:
13 | 	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14 | 
15 | .PHONY: help Makefile
16 | 
17 | # Catch-all target: route all unknown targets to Sphinx using the new
18 | # "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
19 | %: Makefile
20 | 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21 | 


--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
 1 | # Configuration file for the Sphinx documentation builder.
 2 | #
 3 | # For the full list of built-in configuration values, see the documentation:
 4 | # https://www.sphinx-doc.org/en/master/usage/configuration.html
 5 | 
 6 | # If extensions (or modules to document with autodoc) are in another directory,
 7 | # add these directories to sys.path here. If the directory is relative to the
 8 | # documentation root, use os.path.abspath to make it absolute, like shown here.
 9 | #
10 | import os
11 | import sys
12 | sys.path.insert(0, os.path.abspath('..'))
13 | autodoc_mock_imports = ['anndata', 'pandas', 'numpy', 'statsmodels.stats.multitest', 'scipy.sparse', 'scipy.stats', 'pkg_resources', 'collections', 'blitzgsea', 'statsmodels', 'scipy']
14 | 
15 | # -- Project information -----------------------------------------------------
16 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
17 | 
18 | project = 'Drug2cell'
19 | copyright = '2023, Krzysztof Polanski, Kazumasa Kanemaru'
20 | author = 'Krzysztof Polanski, Kazumasa Kanemaru'
21 | release = '0.1.2'
22 | 
23 | # -- General configuration ---------------------------------------------------
24 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
25 | 
26 | extensions = ['sphinx.ext.autodoc']
27 | 
28 | templates_path = ['_templates']
29 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
30 | 
31 | 
32 | 
33 | # -- Options for HTML output -------------------------------------------------
34 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
35 | 
36 | html_theme = 'sphinx_rtd_theme'
37 | html_static_path = ['_static']
38 | 


--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
 1 | .. Drug2cell documentation master file, created by
 2 |    sphinx-quickstart on Thu Feb  2 10:21:06 2023.
 3 |    You can adapt this file completely to your liking, but it should at least
 4 |    contain the root `toctree` directive.
 5 | 
 6 | Welcome to Drug2cell's documentation!
 7 | =====================================
 8 | 
 9 | .. automodule:: drug2cell
10 |    :members:  gsea, hypergeometric, score
11 | 
12 | Gene group loading
13 | ==================
14 | 
15 | .. automodule:: drug2cell.data
16 |    :members:  chembl, consensuspathdb
17 | 
18 | Utility functions
19 | =================
20 | 
21 | .. automodule:: drug2cell.util
22 |    :members:  prepare_plot_args, plot_gsea
23 | 
24 | .. automodule:: drug2cell.chembl
25 |    :members:  filter_activities


--------------------------------------------------------------------------------
/docs/make.bat:
--------------------------------------------------------------------------------
 1 | @ECHO OFF
 2 | 
 3 | pushd %~dp0
 4 | 
 5 | REM Command file for Sphinx documentation
 6 | 
 7 | if "%SPHINXBUILD%" == "" (
 8 | 	set SPHINXBUILD=sphinx-build
 9 | )
10 | set SOURCEDIR=.
11 | set BUILDDIR=_build
12 | 
13 | %SPHINXBUILD% >NUL 2>NUL
14 | if errorlevel 9009 (
15 | 	echo.
16 | 	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
17 | 	echo.installed, then set the SPHINXBUILD environment variable to point
18 | 	echo.to the full path of the 'sphinx-build' executable. Alternatively you
19 | 	echo.may add the Sphinx directory to PATH.
20 | 	echo.
21 | 	echo.If you don't have Sphinx installed, grab it from
22 | 	echo.https://www.sphinx-doc.org/
23 | 	exit /b 1
24 | )
25 | 
26 | if "%1" == "" goto help
27 | 
28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29 | goto end
30 | 
31 | :help
32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33 | 
34 | :end
35 | popd
36 | 


--------------------------------------------------------------------------------
/docs/requirements.txt:
--------------------------------------------------------------------------------
1 | sphinx-rtd-theme
2 | 


--------------------------------------------------------------------------------
/drug2cell/__init__.py:
--------------------------------------------------------------------------------
  1 | import blitzgsea
  2 | import anndata
  3 | import pandas as pd
  4 | import numpy as np
  5 | 
  6 | #gives access to submodules
  7 | from . import chembl
  8 | from . import data
  9 | from . import util
 10 | 
 11 | from statsmodels.stats.multitest import multipletests
 12 | from scipy.sparse import issparse
 13 | from scipy.stats import hypergeom
 14 | 
 15 | def _sparse_nanmean(X, axis):
 16 |     # function from https://github.com/scverse/scanpy/blob/034ca2823804645e0d4874c9b16ba2eb8c13ac0f/scanpy/tools/_score_genes.py
 17 |     """
 18 |     np.nanmean equivalent for sparse matrices
 19 |     """
 20 |     if not issparse(X):
 21 |         raise TypeError("X must be a sparse matrix")
 22 | 
 23 |     # count the number of nan elements per row/column (dep. on axis)
 24 |     Z = X.copy()
 25 |     Z.data = np.isnan(Z.data)
 26 |     Z.eliminate_zeros()
 27 |     n_elements = Z.shape[axis] - Z.sum(axis)
 28 | 
 29 |     # set the nans to 0, so that a normal .sum() works
 30 |     Y = X.copy()
 31 |     Y.data[np.isnan(Y.data)] = 0
 32 |     Y.eliminate_zeros()
 33 | 
 34 |     # the average
 35 |     s = Y.sum(axis, dtype='float64')  # float64 for score_genes function compatibility
 36 |     m = s / n_elements
 37 | 
 38 |     return m
 39 | 
 40 | def _mean(X,names,axis):
 41 |     '''
 42 |     Helper function to compute a mean of X across an axis, respecting names and possible nans.
 43 |     
 44 |     Derived from sc.tl.score_genes() logic.
 45 |     '''
 46 |     if issparse(X):
 47 |         obs_avg = pd.Series(
 48 |             np.array(_sparse_nanmean(X, axis=axis)).flatten(),  
 49 |             index=names,
 50 |         )  # average expression of genes
 51 |     else:
 52 |         obs_avg = pd.Series(
 53 |             np.nanmean(X, axis=axis), index=names
 54 |         )  # average expression of genes
 55 |     return obs_avg
 56 | 
 57 | def score(adata, targets=None, nested=False, categories=None, method="mean", layer=None, use_raw=False, n_bins=25, ctrl_size=50, sep=","):
 58 |     '''
 59 |     Obtain per-cell scoring of gene groups of interest. Distributed with a set of 
 60 |     ChEMBL drug targets that can be used immediately.
 61 |     
 62 |     Please ensure that the gene nomenclature in your target sets is compatible with your 
 63 |     ``.var_names`` (or ``.raw.var_names``). The ChEMBL drug targets use HGNC (human gene 
 64 |     names in line with standard cell ranger mapping output).
 65 |     
 66 |     Adds ``.uns['drug2cell']`` to the input AnnData, a new AnnData object with the same 
 67 |     observation space but with the scored gene groups as the features. The gene group 
 68 |     members used to compute the scores will be listed in ``.var['genes']`` of the new 
 69 |     object.
 70 |     
 71 |     Input
 72 |     -----
 73 |     adata : ``AnnData``
 74 |         Using log-normalised data is recommended.
 75 |     targets : ``dict`` of lists of ``str``, optional (default: ``None``)
 76 |         The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 
 77 |         memberships, anything you can assign genes to. If ``None``, will load the 
 78 |         ChEMBL-derived drug target sets distributed with the package.
 79 |         
 80 |         Accepts two forms:
 81 |         
 82 |             - A dictionary with the names of the groups as keys, and the entries being the 
 83 |             corresponding gene lists.
 84 |             - A dictionary of dictionaries defined like above, with names of gene group 
 85 |             categories as keys. If passing one of those, please specify ``nested=True``.
 86 |         
 87 |     nested : ``bool``, optional (default: ``False``)
 88 |         Whether ``targets`` is a dictionary of dictionaries with group categories as keys.
 89 |     categories : ``str`` or list of ``str``, optional (default: ``None``)
 90 |         If ``targets=None`` or ``nested=True``, this argument can be used to subset the 
 91 |         gene groups to one or more categories (keys of the original dictionary). In case 
 92 |         of the ChEMBL drug targets, these are ATC level 1/level 2 category codes.
 93 |     method : ``str``, optional (default: ``"mean"``)
 94 |         The method to use to score the gene groups. The default is ``"mean"``, which 
 95 |         computes the mean over all the genes. The other option is ``"seurat"``, which 
 96 |         generates an appropriate background profile for each target set and subtracts it 
 97 |         from the mean. This is inspired by ``sc.tl.score_genes()`` logic, which in turn 
 98 |         was inspired by Seurat's gene group scoring algorithm.
 99 |     layer : ``str``, optional (default: ``None``)
100 |         Which ``.layers`` of the input AnnData to use for the expression values. If 
101 |         ``None``, will default to ``.X``.
102 |     use_raw : ``bool``, optional (default: ``False``)
103 |         Whether to use ``.raw.X`` for the expression values.
104 |     n_bins : ``int``, optional (default: 25)
105 |         Only used with ``method="seurat"``. The number of expression bins to partition the 
106 |         feature space into.
107 |     ctrl_size : ``int``, optional (default: 50)
108 |         Only used with ``method="seurat"``. The number of genes to randomly sample from 
109 |         each expression bin.
110 |     sep : ``str``, optional (default: ``","``)
111 |         What delimiter to use when storing the corresponding gene groups for each feature 
112 |         in ``.uns['drug2cell'].var['genes']``
113 |     '''
114 |     #select expression and gene names to use
115 |     if layer is not None:
116 |         if use_raw:
117 |             raise ValueError("Cannot specify `layer` and have `use_raw=True`.")
118 |         X = adata.layers[layer]
119 |         var_names = adata.var_names
120 |     else:
121 |         if use_raw and adata.raw is not None:
122 |             X = adata.raw.X
123 |             var_names = adata.raw.var_names
124 |         else:
125 |             X = adata.X
126 |             var_names = adata.var_names
127 |     #get {group:[targets]} form of gene groups to evaluate based on arguments
128 |     #skip target reconstruction to overwrite any potential existing scoring
129 |     targets = util.prepare_targets(
130 |         adata,
131 |         targets=targets,
132 |         nested=nested,
133 |         categories=categories,
134 |         sep=sep,
135 |         reconstruct=False
136 |     )
137 |     #store full list of targets to have them on tap for later
138 |     full_targets = targets.copy()
139 |     #turn the list of gene IDs to a boolean mask of var_names
140 |     for drug in targets:
141 |         targets[drug] = np.isin(var_names, targets[drug])
142 |     #perform scoring
143 |     #the scoring shall be done via matrix multiplication
144 |     #of the original cell by gene matrix, by a new gene by drug matrix
145 |     #with the entries in the new matrix being the weights of each gene for that drug
146 |     #the first part, the mean across targets, is constant; prepare weights for that
147 |     weights = pd.DataFrame(targets, index=var_names)
148 |     #kick out drugs with no targets
149 |     weights = weights.loc[:, weights.sum()>0]
150 |     #scale to 1 sum for each column, weights for mean acquired. get mean
151 |     weights = weights/weights.sum()
152 |     if issparse(X):
153 |         scores = X.dot(weights)
154 |     else:
155 |         scores = np.dot(X, weights)
156 |     #the second part only happens for seurat scoring
157 |     #logic inspired by sc.tl.score_genes()
158 |     if method == "seurat":
159 |         #obtain per-gene means
160 |         obs_avg = _mean(X, names=var_names, axis=0)
161 |         #bin the genes (score_genes() logic in full effect here)
162 |         n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
163 |         obs_cut = obs_avg.rank(method='min') // n_items
164 |         #we'll be working on the array for ease of stuff going forward
165 |         obs_cut = obs_cut.values
166 |         #compute a cell by control group matrix
167 |         #using matrix multiplication again
168 |         #so build a gene by control group matrix to enable it
169 |         control_groups = {}
170 |         for cut in np.unique(obs_cut):
171 |             #get the locations of the value
172 |             mask = obs_cut == cut
173 |             #get the nonzero values, i.e. indices of the locations
174 |             r_genes = np.nonzero(mask)[0]
175 |             #shuffle these indices, and only keep the top N in the mask
176 |             np.random.shuffle(r_genes)
177 |             mask[r_genes[ctrl_size:]] = False
178 |             #store mask
179 |             control_groups[cut] = mask
180 |         #turn to weights matrix, like earlier
181 |         control_gene_weights = pd.DataFrame(control_groups, index=var_names)
182 |         control_gene_weights = control_gene_weights/control_gene_weights.sum()
183 |         #compute control profiles
184 |         if issparse(X):
185 |             control_profiles = X.dot(control_gene_weights)
186 |         else:
187 |             control_profiles = np.dot(X, control_gene_weights)
188 |         #identify the bins for each drug
189 |         drug_bins = {}
190 |         #use the subset form that's in weights
191 |         for drug in weights.columns:
192 |             #targets is still a handy boolean vector of membership
193 |             #get the bins for this drug
194 |             bins = np.unique(obs_cut[targets[drug]])
195 |             #mask the bin order in the existing variables with this
196 |             #control_gene_weights has column names
197 |             #control_profiles is ordered the same way but is just values
198 |             drug_bins[drug] = np.isin(control_gene_weights.columns, bins)
199 |         #turn to weights matrix again
200 |         drug_weights = pd.DataFrame(drug_bins, index=control_gene_weights.columns)
201 |         drug_weights = drug_weights/drug_weights.sum()
202 |         #can now get the seurat reference profile via matrix multiplication
203 |         #subtract from the existing scores
204 |         seurat = np.dot(control_profiles, drug_weights)
205 |         scores = scores - seurat
206 |     #we now have the final form of the scores
207 |     #create a little helper adata thingy based on them
208 |     #store existing .obsm in there for ease of plotting stuff
209 |     adata.uns['drug2cell'] = anndata.AnnData(scores, obs=adata.obs)
210 |     adata.uns['drug2cell'].var_names = weights.columns
211 |     adata.uns['drug2cell'].obsm = adata.obsm
212 |     #store gene group membership, going back to targets for it
213 |     for drug in weights.columns:
214 |         #mask the var_names with the membership, and join into a single delimited string
215 |         adata.uns['drug2cell'].var.loc[drug, 'genes'] = sep.join(var_names[targets[drug]])
216 |         #pull out the old full membership dict and store that too
217 |         adata.uns['drug2cell'].var.loc[drug, 'all_genes'] = sep.join(full_targets[drug])
218 | 
219 | def gsea(adata, targets=None, nested=False, categories=None, absolute=False, plot_args=True, sep=",", **kwargs):
220 |     '''
221 |     Perform gene set enrichment analysis on the marker gene scores computed for the 
222 |     original object. Uses blitzgsea.
223 |     
224 |     Returns:
225 |     
226 |     - a dictionary with clusters for which the original object markers were computed \
227 |     as the keys, and data frames of test results sorted on q-value as the items
228 |     
229 |     - a helper variable with plotting arguments for ``d2c.plot_gsea()``, if \
230 |     ``plot_args=True``. ``['scores']`` has the GSEA input, and ``['targets']`` is the \
231 |     gene group dictionary that was used.
232 |     
233 |     Input
234 |     -----
235 |     adata : ``AnnData``
236 |         With marker genes computed via ``sc.tl.rank_genes_groups()`` in the original 
237 |         expression space.
238 |     targets : ``dict`` of lists of ``str``, optional (default: ``None``)
239 |         The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 
240 |         memberships, anything you can assign genes to. If ``None``, will use 
241 |         ``d2c.score()`` output if present, and if not present load the ChEMBL-derived 
242 |         drug target sets distributed with the package.
243 |         
244 |         Accepts two forms:
245 |         
246 |             - A dictionary with the names of the groups as keys, and the entries being the 
247 |             corresponding gene lists.
248 |             - A dictionary of dictionaries defined like above, with names of gene group 
249 |             categories as keys. If passing one of those, please specify ``nested=True``.
250 |         
251 |     nested : ``bool``, optional (default: ``False``)
252 |         Whether ``targets`` is a dictionary of dictionaries with group categories as keys.
253 |     categories : ``str`` or list of ``str``, optional (default: ``None``)
254 |         If ``targets=None`` or ``nested=True``, this argument can be used to subset the 
255 |         gene groups to one or more categories (keys of the original dictionary). In case 
256 |         of the ChEMBL drug targets, these are ATC level 1/level 2 category codes.
257 |     absolute : ``bool``, optional (default: ``False``)
258 |         If ``True``, pass the absolute values of scores to GSEA. Improves statistical 
259 |         power.
260 |     plot_args : ``bool``, optional (default: ``True``)
261 |         Whether to return the second piece of output that holds pre-compiled information 
262 |         for ``d2c.plot_gsea()``.
263 |     sep : ``str``, optional (default: ``","``)
264 |         The delimiter that was used with ``d2c.score()`` for gene group storage.
265 |     kwargs
266 |         Any additional arguments to pass to ``blitzgsea.gsea()``.
267 |     '''
268 |     #get {group:[targets]} form of gene groups to evaluate based on arguments
269 |     #allow for target reconstruction for when this is run after scoring
270 |     targets = util.prepare_targets(
271 |         adata,
272 |         targets=targets,
273 |         nested=nested,
274 |         categories=categories,
275 |         sep=sep,
276 |         reconstruct=True
277 |     )
278 |     #store the GSEA results in a dictionary, with groups as the keys
279 |     enrichment = {}
280 |     #the plotting-minded output can already store its targets
281 |     #and will keep track of scores as they get made during the loop
282 |     plot_gsea_args = {"targets":targets, "scores":{}}
283 |     #this gets the names of the clusters in the original marker output
284 |     for cluster in adata.uns['rank_genes_groups']['names'].dtype.names:
285 |         #prepare blitzgsea input
286 |         df = pd.DataFrame({"0":adata.uns['rank_genes_groups']['names'][cluster],
287 |                            "1":adata.uns['rank_genes_groups']['scores'][cluster]})
288 |         #possibly sort on absolute value of scores
289 |         if absolute:
290 |             df["1"] = np.absolute(df["1"])
291 |             df = df.sort_values("1", ascending=False)
292 |         #compute GSEA and store results/scores in output
293 |         enrichment[cluster] = blitzgsea.gsea(df, targets, **kwargs)
294 |         plot_gsea_args["scores"][cluster] = df
295 |     #provide output
296 |     if plot_args:
297 |         return enrichment, plot_gsea_args
298 |     else:
299 |         return enrichment
300 | 
301 | def hypergeometric(adata, targets=None, nested=False, categories=None, pvals_adj_thresh=0.05, direction="both", corr_method="benjamini-hochberg", sep=","):
302 |     '''
303 |     Perform a hypergeometric test to assess the overrepresentation of gene group members 
304 |     among marker genes computed for the original object.
305 |     
306 |     Returns a dictionary with clusters for which the original object markers were computed 
307 |     as the keys, and data frames of test results sorted on q-value as the items.
308 |     
309 |     Input
310 |     -----
311 |     adata : ``AnnData``
312 |         With marker genes computed via ``sc.tl.rank_genes_groups()`` in the original 
313 |         expression space.
314 |     targets : ``dict`` of lists of ``str``, optional (default: ``None``)
315 |         The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 
316 |         memberships, anything you can assign genes to. If ``None``, will use 
317 |         ``d2c.score()`` output if present, and if not present load the ChEMBL-derived 
318 |         drug target sets distributed with the package.
319 |         
320 |         Accepts two forms:
321 |         
322 |             - A dictionary with the names of the groups as keys, and the entries being the 
323 |             corresponding gene lists.
324 |             - A dictionary of dictionaries defined like above, with names of gene group 
325 |             categories as keys. If passing one of those, please specify ``nested=True``.
326 |         
327 |     nested : ``bool``, optional (default: ``False``)
328 |         Whether ``targets`` is a dictionary of dictionaries with group categories as keys.
329 |     categories : ``str`` or list of ``str``, optional (default: ``None``)
330 |         If ``targets=None`` or ``nested=True``, this argument can be used to subset the 
331 |         gene groups to one or more categories (keys of the original dictionary). In case 
332 |         of the ChEMBL drug targets, these are ATC level 1/level 2 category codes.
333 |     pvals_adj_thresh : ``float``, optional (default: ``0.05``)
334 |         The ``pvals_adj`` cutoff to use on the ``sc.tl.rank_genes_groups()`` output to 
335 |         identify markers.
336 |     direction : ``str``, optional (default: ``"both"``)
337 |         Whether to seek out up/down-regulated genes for the groups, based on the values 
338 |         from ``scores``. Can be ``"up"``, ``"down"``, or ``"both"`` (for no selection).
339 |     corr_method : ``str``, optional (default: ``"benjamini-hochberg"``)
340 |         Which multiple testing correction to apply to the p-values of the hypergeometric test. Can be 
341 |         ``"benjamini-hochberg"`` or ``"bonferroni"``.
342 |     sep : ``str``, optional (default: ``","``)
343 |         The delimiter that was used with ``d2c.score()`` for gene group storage.
344 |     '''
345 |     #get the universe of available genes, as a set for easy intersecting
346 |     if adata.uns['rank_genes_groups']['params']['use_raw']:
347 |         universe = set(adata.raw.var_names)
348 |     else:
349 |         universe = set(adata.var_names)
350 |     #get {group:[targets]} form of gene groups to evaluate based on arguments
351 |     #allow for target reconstruction for when this is run after scoring
352 |     targets = util.prepare_targets(
353 |         adata,
354 |         targets=targets,
355 |         nested=nested,
356 |         categories=categories,
357 |         sep=sep,
358 |         reconstruct=True
359 |     )
360 |     #intersect each group membership with the universe after turning it to a set
361 |     for group in targets:
362 |         targets[group] = set(targets[group]).intersection(universe)
363 |     #kick out any empty keys, using dictionary comprehension
364 |     targets = {k:v for k,v in targets.items() if v}
365 |     #store the hypergeometric results in a dictionary, with groups as the keys
366 |     overrepresentation = {}
367 |     #this gets the names of the clusters in the original marker output
368 |     for cluster in adata.uns['rank_genes_groups']['names'].dtype.names:
369 |         #prepare overrepresentation output data frame
370 |         results = pd.DataFrame(
371 |             1, 
372 |             index=list(targets.keys()),
373 |             columns=['intersection','gene_group','markers','universe','pvals', 'pvals_adj']
374 |         )
375 |         #pre-type pvals as floats as otherwise pandas complains about deprecation
376 |         results = results.astype({"pvals":"float64", "pvals_adj":"float64"})
377 |         #identify the markers for this group from the original output
378 |         #construct mask for significance
379 |         mask = adata.uns['rank_genes_groups']['pvals_adj'][cluster] < pvals_adj_thresh
380 |         #the sign of the score indicates up/down-regulation
381 |         if direction == "up":
382 |             mask = mask & (adata.uns['rank_genes_groups']['scores'][cluster] > 0)
383 |         elif direction == "down":
384 |             mask = mask & (adata.uns['rank_genes_groups']['scores'][cluster] < 0)
385 |         #do this as a set because it will be easier to intersect later
386 |         markers = set(adata.uns['rank_genes_groups']['names'][cluster][mask])
387 |         #at this point we know how many genes we have as markers and the universe
388 |         results['markers'] = len(markers)
389 |         results['universe'] = len(universe)
390 |         #perform hypergeometric test to assess overrepresentation
391 |         #loop over the scored gene groups
392 |         for ind in results.index:
393 |             #retrieve gene membership and the intersection
394 |             gene_group = targets[ind]
395 |             common = gene_group.intersection(markers)
396 |             results.loc[ind, 'intersection'] = len(common)
397 |             results.loc[ind, 'gene_group'] = len(gene_group)
398 |             #need to subtract 1 from the intersection length
399 |             #https://alexlenail.medium.com/understanding-and-implementing-the-hypergeometric-test-in-python-a7db688a7458
400 |             pval = hypergeom.sf(
401 |                 len(common)-1,
402 |                 len(universe),
403 |                 len(markers),
404 |                 len(gene_group)
405 |             )
406 |             results.loc[ind, 'pvals'] = pval
407 |         #multiple testing correction
408 |         #mirror sc.tl.rank_genes_groups() logic for consistency
409 |         #just in case any NaNs popped up somehow, fill them to 1 so FDR works
410 |         results = results.fillna(1)
411 |         if corr_method == "benjamini-hochberg":
412 |             results['pvals_adj'] = multipletests(results['pvals'], method="fdr_bh")[1]
413 |         elif corr_method == "bonferroni":
414 |             results['pvals_adj'] = np.minimum(results['pvals']*results.shape[0], 1.0)
415 |         #sort on q-value and store output
416 |         overrepresentation[cluster] = results.sort_values("pvals_adj")
417 |     #that's it. return the dictionary
418 |     return overrepresentation
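
The p-value computed above is the upper tail of the hypergeometric distribution, P(X >= k). `hypergeom.sf()` returns P(X > k), which is why `len(common)-1` is passed in. The following is a minimal standard-library sketch of the same quantity. The helper name `hypergeom_upper_tail` is hypothetical and not part of drug2cell; the argument names follow the result columns used above.

```python
from math import comb

def hypergeom_upper_tail(overlap, universe, markers, gene_group):
    """P(X >= overlap) when drawing `gene_group` genes from a universe of
    `universe` genes, of which `markers` count as successes."""
    # number of ways to draw the gene group from the universe
    total = comb(universe, gene_group)
    # the overlap can never exceed either set size
    upper = min(markers, gene_group)
    # sum the probability mass from the observed overlap upwards
    hits = sum(
        comb(markers, i) * comb(universe - markers, gene_group - i)
        for i in range(overlap, upper + 1)
    )
    return hits / total

# toy numbers: 10 genes in total, 4 markers, a gene group of 3, 2 shared
p = hypergeom_upper_tail(overlap=2, universe=10, markers=4, gene_group=3)
print(p)
```

Under these assumptions, this should agree with `hypergeom.sf(overlap-1, universe, markers, gene_group)` from SciPy up to floating point error.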


--------------------------------------------------------------------------------
/drug2cell/chembl.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | import pandas as pd
  3 | 
  4 | def filter_activities(dataframe,
  5 |                       drug_max_phase=None,
  6 |                       add_drug_mechanism=True,
  7 |                       assay_type=None,
  8 |                       remove_inactive=True,
  9 |                       include_active=True,
 10 |                       pchembl_target_column=None,
 11 |                       pchembl_threshold=None
 12 |                         ):
 13 |     '''
 14 |     Perform a sequential set of filtering operations on the provided ChEMBL data frame. 
 15 |     The order of the filters matches the order of the arguments in the input description. 
 16 |     Returns a data frame with the rows fulfilling the resulting criteria.
 17 |     
 18 |     Input
 19 |     -----
 20 |     dataframe : ``pd.DataFrame``
 21 |         The ChEMBL data frame to perform filtering operations on.
 22 |     drug_max_phase : ``int`` or list of ``int``, optional (default: ``None``)
 23 |         Subset the data frame to drugs in the provided clinical stages:
 24 |         
 25 |             - Phase 1: Testing of drug on healthy volunteers for dose-ranging
 26 |             - Phase 2: Initial testing of drug on patients to assess efficacy and safety
 27 |             - Phase 3: Testing of drug on patients to assess efficacy, effectiveness and safety (larger test group)
 28 |             - Phase 4: approved
 29 |     
 30 |     add_drug_mechanism : ``bool``, optional (default: ``True``)
 31 |         Grant subsequent filtering immunity to rows with drug mechanism information 
 32 |         present.
 33 |     assay_type : ``str`` or list of ``str``, optional (default: ``None``)
 34 |         Subset the data frame based on assay type information:
 35 |         
 36 |             - Binding (B) - Data measuring binding of compound to a molecular target, e.g. Ki, IC50, Kd.
 37 |             - Functional (F) - Data measuring the biological effect of a compound, e.g. %cell death in a cell line, rat weight.
 38 |             - ADMET (A) - ADME data e.g. t1/2, oral bioavailability.
 39 |             - Toxicity (T) - Data measuring toxicity of a compound, e.g., cytotoxicity.
 40 |             - Physicochemical (P) - Assays measuring physicochemical properties of the compounds in the absence of biological material e.g., chemical stability, solubility.
 41 |             - Unclassified (U) - A small proportion of assays cannot be classified into one of the above categories e.g., ratio of binding vs efficacy.
 42 |     
 43 |     remove_inactive : ``bool``, optional (default: ``True``)
 44 |         Subset the data frame to remove inactive drug-target interactions.
 45 |     include_active : ``bool``, optional (default: ``True``)
 46 |         Grant subsequent filtering immunity to active drug-target interactions.
 47 |     pchembl_target_column : ``str``, optional (default: ``None``)
 48 |         Use the selected column in the data frame to dictate custom pChEMBL thresholds 
 49 |         for each unique value in the column.
 50 |     pchembl_threshold : ``float`` or ``dict`` of ``float``, optional (default: ``None``)
 51 |         Subset the data frame to this pChEMBL minimum. If a single ``float``, use that 
 52 |         value. If a ``dict`` provided in conjunction with a ``pchembl_target_column``, 
 53 |         have the unique values of the specified column as keys of the dictionary, with 
 54 |         entries being the desired threshold for that category.
 55 |     '''
 56 |     
 57 |     #create a helper column that will be set to True as needed to protect rows from deletion
 58 |     dataframe['keep']=False
 59 |     
 60 |     # check boolean inputs
 61 |     if type(add_drug_mechanism) != bool:
 62 |         raise TypeError("argument 'add_drug_mechanism' should be 'bool' type")
 63 |     if type(remove_inactive) != bool:
 64 |         raise TypeError("argument 'remove_inactive' should be 'bool' type")
 65 |     if type(include_active) != bool:
 66 |         raise TypeError("argument 'include_active' should be 'bool' type")
 67 |     
 68 |     # filter based on drug max phase
 69 |     if drug_max_phase!=None:
 70 |         if type(drug_max_phase)==list:
 71 |             dataframe = dataframe[dataframe['molecule_dictionary|max_phase'].isin(drug_max_phase)]
 72 |         elif type(drug_max_phase)==int:
 73 |             dataframe = dataframe[dataframe['molecule_dictionary|max_phase']==drug_max_phase]
 74 |         else:
 75 |             raise TypeError("argument 'drug_max_phase' should be 'int' or 'list' type")
 76 |     
 77 |     # grant subsequent filtering immunity in the event of drug mechanism presence
 78 |     if add_drug_mechanism:
 79 |         drugmech_index = dataframe[dataframe['drug_mechanism|molregno'].notnull()].index
 80 |         dataframe.loc[drugmech_index,'keep']=True  # to keep 'drug_mechanism' activity
 81 |     
 82 |     # filter based on assay type
 83 |     if assay_type!=None:
 84 |         if type(assay_type)==list:
 85 |             dataframe = dataframe[(dataframe['assays|assay_type'].isin(assay_type))| \
 86 |                                         (dataframe['keep'])]
 87 |         elif type(assay_type)==str:
 88 |             dataframe = dataframe[(dataframe['assays|assay_type']==assay_type)| \
 89 |                                         (dataframe['keep'])]
 90 |         else:
 91 |             raise TypeError("argument 'assay_type' should be 'str' or 'list' type")
 92 |     
 93 |     # filter based on activity
 94 |     ## remove inactive
 95 |     if remove_inactive:
 96 |         dataframe = dataframe[(dataframe['activities|activity_comment'].isin(['inactive',
 97 |                                                                                      'Not Active',
 98 |                                                                                      'Not Active (inhibition < 50% @ 10 uM and thus dose-reponse curve not measured)',
 99 |                                                                                      'Inactive'])==False)| \
100 |                                    (dataframe['keep'])] # keep 'drug_mechanism' activity if 'add_drug_mechanism' is True
101 |     
102 |     ## grant subsequent filtering immunity to active
103 |     if include_active:
104 |         active_index = dataframe[dataframe['activities|activity_comment'].isin(['active','Active'])].index
105 |         dataframe.loc[active_index,'keep']=True  # to keep 'active' activity
106 |         
107 |     ## pChEMBL thresholding
108 |     if pchembl_threshold!=None:
109 |         if type(pchembl_threshold) == dict:
110 |             if set(dataframe[pchembl_target_column])==set(pchembl_threshold.keys()):
111 |                 dataframe['pchembl_active']=False
112 |                 for k in pchembl_threshold.keys():
113 |                     eachclass_df = dataframe[dataframe[pchembl_target_column]==k]
114 |                     dataframe.loc[eachclass_df.index,'pchembl_active']=eachclass_df['activities|pchembl_value']>=pchembl_threshold[k]
115 |                     del eachclass_df
116 |             else:
117 |                 raise KeyError("keys of 'pchembl_threshold' should match the unique values of the 'pchembl_target_column' column")
118 |         else:
119 |             #single value, do a single thresholding (inclusive, matching the dict branch and the documented minimum)
120 |             dataframe['pchembl_active'] = dataframe['activities|pchembl_value']>=pchembl_threshold
121 |         dataframe = dataframe[dataframe['pchembl_active'] | dataframe['keep']] # keep 'active' and 'drug_mechanism' activity if including those are True
122 |     
123 |     return dataframe
124 |     
125 | 
126 | def create_dict(dataframe):
127 |     
128 |     dic = {}
129 |     for c in set(dataframe['molecule_dictionary|chembl_id']):
130 |         table_c = dataframe[dataframe['molecule_dictionary|chembl_id']==c]
131 |         c_name = list(set(table_c['molecule_dictionary|pref_name']))
132 | 
133 |         if len(c_name)>1:
134 |             raise ValueError(f'{c} has multiple names: {c_name}')
135 |         else:
136 |             c_name = c_name[0]
137 | 
138 |         l=[]
139 |         for x in table_c['component_synonyms|component_synonym']:
140 |             l=l+x.split('|')
141 |         dic[f'{c}|{c_name}']=list(set(l))
142 |         del l, table_c, c_name
143 | 
144 |     return dic
145 | 
146 | 
147 | def create_drug_dictionary(dataframe,
148 |                            drug_grouping=None,
149 |                            atc_level=['level1','level2']
150 |                             ):
151 |     '''
152 |     Create a dictionary mapping each drug to the gene names of its targets.
153 |     * dataframe: drug-target activity dataframe 
154 |     * drug_grouping: 'ATC_level', 'drug_max_phase', or None (default: None)
155 |     * atc_level: list of ATC levels used to categorise the drugs when drug_grouping='ATC_level' (default: ['level1','level2'])
156 |     '''
157 | 
158 |     drug_target_dict={}
159 |  
160 |     if drug_grouping=='ATC_level':
161 |         for l in atc_level:
162 |             dataframe.fillna(value={f'atc_classification|{l}':'No-category'},inplace=True)
163 |             for group in dataframe[f'atc_classification|{l}'].unique():
164 |                 # filtering category and creating dictionary
165 |                 group_df = dataframe[dataframe[f'atc_classification|{l}']==group]
166 |                 drug_target_dict[group] = create_dict(group_df)
167 |                 del group_df
168 |     elif drug_grouping=='drug_max_phase':
169 |         dataframe.fillna(value={'molecule_dictionary|max_phase':'No-phase'},inplace=True)
170 |         for group in set(dataframe['molecule_dictionary|max_phase']):
171 |             group_df = dataframe[dataframe['molecule_dictionary|max_phase']==group]
172 |             drug_target_dict[f'maxphase_{str(group)}'] = create_dict(group_df)
173 |             del group_df
174 |     elif drug_grouping!=None:
175 |         raise KeyError("invalid key <drug_grouping>")
176 |     else:
177 |         #no grouping, just make a drug:[targets] dict
178 |         drug_target_dict = create_dict(dataframe)
179 |     return drug_target_dict
180 |     
181 |     
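
The `keep` column in `filter_activities()` acts as an immunity flag: once a row is marked, it survives every later filter. The following is a simplified pure-Python sketch of that pattern on a list of dictionaries. The record fields (`name`, `mechanism`, `assay_type`, `comment`, `pchembl`) and the helper `filter_rows` are hypothetical stand-ins for the ChEMBL columns, not the package's API.

```python
INACTIVE = {"inactive", "Inactive", "Not Active"}
ACTIVE = {"active", "Active"}

def filter_rows(rows, assay_type=None, pchembl_threshold=None):
    # work on copies so the caller's records are left untouched
    rows = [dict(row, keep=False) for row in rows]
    # rows with mechanism information are immune to all later filters
    for row in rows:
        if row["mechanism"]:
            row["keep"] = True
    # assay type filter, skipped for immune rows
    if assay_type is not None:
        rows = [r for r in rows if r["assay_type"] == assay_type or r["keep"]]
    # drop inactive interactions unless immune
    rows = [r for r in rows if r["comment"] not in INACTIVE or r["keep"]]
    # active interactions become immune to the pChEMBL threshold
    for row in rows:
        if row["comment"] in ACTIVE:
            row["keep"] = True
    # inclusive pChEMBL minimum, skipped for immune rows
    if pchembl_threshold is not None:
        rows = [
            r for r in rows
            if r["keep"]
            or (r["pchembl"] is not None and r["pchembl"] >= pchembl_threshold)
        ]
    return rows

records = [
    {"name": "A", "mechanism": True, "assay_type": "B", "comment": None, "pchembl": None},
    {"name": "B", "mechanism": False, "assay_type": "F", "comment": "Active", "pchembl": 4.0},
    {"name": "C", "mechanism": False, "assay_type": "F", "comment": None, "pchembl": 5.0},
    {"name": "D", "mechanism": False, "assay_type": "B", "comment": None, "pchembl": 9.0},
    {"name": "E", "mechanism": False, "assay_type": "F", "comment": "inactive", "pchembl": 8.0},
    {"name": "F", "mechanism": False, "assay_type": "F", "comment": None, "pchembl": 6.5},
]
survivors = [r["name"] for r in filter_rows(records, assay_type="F", pchembl_threshold=6)]
print(survivors)
```

Row A survives through its mechanism flag despite the wrong assay type, and row B survives the threshold because it was annotated as active beforehand. The order of the steps matters: immunity only protects a row from filters applied after it was granted.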


--------------------------------------------------------------------------------
/drug2cell/cpdb_dict.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Teichlab/drug2cell/a786c632fc521b94f597e452e2f09c45fd451640/drug2cell/cpdb_dict.pkl


--------------------------------------------------------------------------------
/drug2cell/data.py:
--------------------------------------------------------------------------------
 1 | import pkg_resources
 2 | import pandas as pd
 3 | 
 4 | def chembl():
 5 |     '''
 6 |     Load the default ChEMBL drug target dictionary distributed with the package.
 7 |     
 8 |     Returns the drug target dictionary - ATC categories as keys, with each ATC category 
 9 |     a dictionary with corresponding drugs as keys.
10 |     '''
11 |     #this picks up the pickle shipped with the package
12 |     stream = pkg_resources.resource_stream(__name__, 'drug-target_dicts.pkl')
13 |     targets = pd.read_pickle(stream)
14 |     return targets
15 | 
16 | def consensuspathdb():
17 |     '''
18 |     Load the ConsensusPathDB pathway gene memberships distributed with the package.
19 |     
20 |     Returns a dictionary with pathway names as keys and memberships as items.
21 |     '''
22 |     #this picks up the pickle shipped with the package
23 |     stream = pkg_resources.resource_stream(__name__, 'cpdb_dict.pkl')
24 |     targets = pd.read_pickle(stream)
25 |     return targets


--------------------------------------------------------------------------------
/drug2cell/drug-target_dicts.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Teichlab/drug2cell/a786c632fc521b94f597e452e2f09c45fd451640/drug2cell/drug-target_dicts.pkl


--------------------------------------------------------------------------------
/drug2cell/util.py:
--------------------------------------------------------------------------------
  1 | import blitzgsea
  2 | import numpy as np
  3 | 
  4 | from . import data
  5 | 
  6 | from collections import ChainMap
  7 | 
  8 | def reconstruct_targets(adata, sep=","):
  9 |     '''
 10 |     Reconstruct the targets dictionary from an AnnData with ``d2c.score()`` output stored.
 11 |     '''
 12 |     targets = {}
 13 |     #loop over the group names, i.e. this object's feature space
 14 |     for group in adata.uns['drug2cell'].var_names:
 15 |         #split the stored gene list
 16 |         targets[group] = adata.uns['drug2cell'].var.loc[group,'all_genes'].split(sep)
 17 |     return targets
 18 | 
 19 | def prepare_targets(adata, targets=None, nested=False, categories=None, sep=",", reconstruct=False):
 20 |     '''
 21 |     A helper function that ensures the function calling it gets a {group:[targets]} 
 22 |     dictionary back.
 23 |     '''
 24 |     #no provided explicit target set
 25 |     if targets is None:
 26 |         #if there's .uns['drug2cell'] and we're instructed to reconstruct, do so
 27 |         if reconstruct and ('drug2cell' in adata.uns):
 28 |             targets = reconstruct_targets(adata, sep=sep)
 29 |         else:
 30 |             #load ChEMBL
 31 |             targets = data.chembl()
 32 |             #this is nested, regardless of what the input may say
 33 |             nested = True
 34 |     else:
 35 |         #copy so the original is unaltered
 36 |         targets = targets.copy()
 37 |     #subset to provided categories. this should only be provided for nested versions
 38 |     if categories is not None:
 39 |         #wrap a single category name in a list, as list() would split it into characters
 40 |         if isinstance(categories, str):
 41 |             categories = [categories]
 42 |         #so that they can be iterated over here for subsetting
 43 |         targets = {k:targets[k] for k in categories}
 44 |     #do we have a dict of dicts, with the categories as keys?
 45 |     if nested:
 46 |         #turn the dict of dicts structure to just a basic dictionary
 47 |         #get all the drugs for all the remaining categories, making a list of dicts
 48 |         #turn that list to a single dict via ChainMap magic
 49 |         targets = dict(ChainMap(*[targets[cat] for cat in targets]))
 50 |     #at this point we have the desired form of {group:[targets]}, return it
 51 |     return targets
 52 | 
 53 | def prepare_plot_args(adata, targets=None, categories=None):
 54 |     '''
 55 |     Prepare the ``var_names``, ``var_group_positions`` and ``var_group_labels`` arguments 
 56 |     for scanpy plotting functions to display scored gene groups and group them nicely. 
 57 |     Returns ``plot_args``, a dictionary of the values that can be used with scanpy 
 58 |     plotting as ``**plot_args``.
 59 |     
 60 |     Input:
 61 |     ------
 62 |     adata : ``AnnData``
 63 |         Point the function to the ``.uns['drug2cell']`` slot computed by the ``score()`` 
 64 |         function earlier. It's required to remove gene groups that were not represented 
 65 |         in the data.
 66 |     targets : ``dict`` of lists of ``str``, optional (default: ``None``)
 67 |         The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 
 68 |         memberships, anything you can assign genes to. If ``None``, will load the 
 69 |         ChEMBL-derived drug target sets distributed with the package. Must be the 
 70 |         ``nested=True`` version of the input as described in the score function.
 71 |     categories : ``str`` or list of ``str``, optional (default: ``None``)
 72 |         If ``targets=None`` or ``nested=True``, this argument can be used to subset the 
 73 |         gene groups to one or more categories (keys of the original dictionary). In case 
 74 |         of the ChEMBL drug targets, these are ATC level 1/level 2 category codes.
 75 |     '''
 76 |     #load ChEMBL dictionary if not specified
 77 |     if targets is None:
 78 |         targets = data.chembl()
 79 |     else:
 80 |         #copy so the original argument is unaltered
 81 |         targets = targets.copy()
 82 |     #subset to provided categories
 83 |     if categories is not None:
 84 |         #wrap a single category name in a list, as list() would split it into characters
 85 |         if isinstance(categories, str):
 86 |             categories = [categories]
 87 |         #so that they can be iterated over here for subsetting
 88 |         targets = {k:targets[k] for k in categories}
 89 |     #this time we only care about the group names. kick out their targets
 90 |     for group in targets:
 91 |         targets[group] = list(targets[group].keys())
 92 |     #can commence constructing the plotting arguments now
 93 |     var_names = []
 94 |     var_group_positions = []
 95 |     var_group_labels = []
 96 |     #we'll begin at the first feature. this is zero indexed
 97 |     start = 0
 98 |     for group in targets:
 99 |         #intersect the gene group names with what's available in the object
100 |         #i.e. what was actually scored and is available for plotting
101 |         targets[group] = list(adata.var_names[np.isin(adata.var_names, targets[group])])
102 |         #skip if empty
103 |         if len(targets[group]) == 0:
104 |             continue
105 |         #if not empty, store!
106 |         #append group names to list
107 |         var_names = var_names + targets[group]
108 |         #the newest group starts at start and ends at the end of the var_names
109 |         var_group_positions = var_group_positions + [(start, len(var_names)-1)]
110 |         var_group_labels = var_group_labels + [group]
111 |         #the new start will be the next feature that goes in the list
112 |         start = len(var_names)
113 |     #that's it, return the things as a dict
114 |     plot_args = {'var_names':var_names,
115 |                  'var_group_positions':var_group_positions,
116 |                  'var_group_labels':var_group_labels}
117 |     return plot_args
118 | 
119 | def plot_gsea(enrichment, targets, scores, n=10, interactive_plot=True, **kwargs):
120 |     '''
121 |     Display the output of ``d2c.gsea()`` with blitzgsea's ``top_table()`` plot.
122 |     
123 |     The first ``d2c.gsea()`` output variable is ``enrichment``, and passing the second 
124 |     ``d2c.gsea()`` output variable with a ``**`` in front of it provides ``targets`` and 
125 |     ``scores``.
126 |     
127 |     Input
128 |     -----
129 |     enrichment : ``dict`` of ``pd.DataFrame``
130 |         Cluster names as keys, blitzgsea's ``gsea()`` output as values
131 |     targets : ``dict`` of list of ``str``
132 |         The gene group memberships that were used to compute GSEA
133 |     scores : ``dict`` of ``pd.DataFrame``
134 |         Cluster names as keys, the input to blitzgsea
135 |     n : ``int``, optional (default: ``10``)
136 |         How many top scores to show for each group
137 |     interactive_plot : ``bool``, optional (default: ``True``)
138 |         If ``True``, will display the plots within a Jupyter Notebook. If ``False``, 
139 |         will collect the figures into a list and return it at the end.
140 |     kwargs
141 |         Any additional arguments to pass to ``blitzgsea.plot.top_table()``.
142 |     '''
143 |     #optionally save output
144 |     if not interactive_plot:
145 |         figs = []
146 |     #make a top_table() plot for each cluster
147 |     for cluster in enrichment:
148 |         #get a figure, passing the various arguments
149 |         #interactive_plot prepares the figure so that it .show()s in a notebook
150 |         fig = blitzgsea.plot.top_table(scores[cluster], 
151 |                                        targets, 
152 |                                        enrichment[cluster], 
153 |                                        n=n, 
154 |                                        interactive_plot=interactive_plot,
155 |                                        **kwargs
156 |                                       )
157 |         #retitle accordingly
158 |         fig.suptitle(cluster)
159 |         #either display the plot or stash it in output
160 |         if interactive_plot:
161 |             fig.show()
162 |         else:
163 |             figs.append(fig)
164 |     if not interactive_plot:
165 |         return figs
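
`prepare_targets()` flattens the nested `{category: {group: [genes]}}` form into a plain `{group: [genes]}` dictionary with `ChainMap`. The following is a minimal sketch of that step on made-up data. The helper `flatten` and the drug and gene names are hypothetical and only serve the illustration.

```python
from collections import ChainMap

# toy nested dictionary: categories -> drugs -> target genes
nested_targets = {
    "A01": {"drug_a": ["GENE1", "GENE2"]},
    "B03": {"drug_b": ["GENE3"], "drug_a": ["GENE9"]},
}

def flatten(nested, categories=None):
    # a single category name is wrapped rather than split into characters
    if isinstance(categories, str):
        categories = [categories]
    # optionally subset to the requested categories
    if categories is not None:
        nested = {k: nested[k] for k in categories}
    # merge the per-category dictionaries into one flat dictionary
    return dict(ChainMap(*nested.values()))

flat_all = flatten(nested_targets)
flat_b03 = flatten(nested_targets, categories="B03")
print(flat_all)
print(flat_b03)
```

One consequence worth knowing: when the same group name appears in several categories, `ChainMap` keeps the entry from the earliest category, so `drug_a` resolves to the `A01` gene list in `flat_all`.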


--------------------------------------------------------------------------------
/notebooks/chembl/filtering.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "id": "7df6b6c4",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "import pandas as pd\n",
 11 |     "import drug2cell as d2c"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "id": "409592b1",
 17 |    "metadata": {},
 18 |    "source": [
 19 |     "Load the human targets ChEMBL data frame created in the initial parsing notebook."
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": 2,
 25 |    "id": "78aecb53",
 26 |    "metadata": {},
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "original = pd.read_pickle(\"chembl_30_merged_genesymbols_humans.pkl\")"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "id": "67ee796c",
 35 |    "metadata": {},
 36 |    "source": [
 37 |     "Drug2cell's filtering functions allow setting a separate pChEMBL threshold for each category of a column of choice. We'll be using the `target_class` column, and basing our values on the Illuminating the Druggable Genome (IDG) project (ref: https://peerj.com/articles/15153/#:~:text=30%20nM%20for,other%20target%20families)"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 4,
 43 |    "id": "fe3f6848",
 44 |    "metadata": {},
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "#pChEMBL is -log10() as per https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions#what-is-pchembl\n",
 48 |     "#the threshold values were updated on 21Dec2024\n",
 49 |     "thresholds_dict={\n",
 50 |     "    'none':6, #1uM\n",
 51 |     "    'NHR':7, #100nM\n",
 52 |     "    'GPCR':7, #100nM\n",
 53 |     "    'Ion Channel':5, #10uM\n",
 54 |     "    'Kinase':7.53, #30nM\n",
 55 |     "}"
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "markdown",
 60 |    "id": "3e88f0c3",
 61 |    "metadata": {},
 62 |    "source": [
 63 |     "We'll add some more criteria to the filtering. For a comprehensive list of available options, consult the documentation."
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "code",
 68 |    "execution_count": 5,
 69 |    "id": "974f2309",
 70 |    "metadata": {},
 71 |    "outputs": [
 72 |     {
 73 |      "name": "stdout",
 74 |      "output_type": "stream",
 75 |      "text": [
 76 |       "(39660, 55)\n"
 77 |      ]
 78 |     }
 79 |    ],
 80 |    "source": [
 81 |     "filtered_df = d2c.chembl.filter_activities(\n",
 82 |     "    dataframe=original,\n",
 83 |     "    drug_max_phase=4,\n",
 84 |     "    assay_type='F',\n",
 85 |     "    add_drug_mechanism=True,\n",
 86 |     "    remove_inactive=True,\n",
 87 |     "    include_active=True,\n",
 88 |     "    pchembl_target_column=\"target_class\",\n",
 89 |     "    pchembl_threshold=thresholds_dict\n",
 90 |     ")\n",
 91 |     "print(filtered_df.shape)"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "markdown",
 96 |    "id": "c1d8d459",
 97 |    "metadata": {},
 98 |    "source": [
 99 |     "Now that we have our data frame subset to the drugs and targets of interest, we can convert them into a dictionary that can be used by drug2cell. The exact form distributed with the package was created like so:"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 6,
105 |    "id": "03d9dfd1",
106 |    "metadata": {},
107 |    "outputs": [
108 |     {
109 |      "data": {
110 |       "text/plain": [
111 |        "{'CHEMBL1201566|DARBEPOETIN ALFA': ['EPOR'],\n",
112 |        " 'CHEMBL3707314|METHOXY POLYETHYLENE GLYCOL-EPOETIN BETA': ['EPOR'],\n",
113 |        " 'CHEMBL2109092|EPOETIN BETA': ['EPOR'],\n",
114 |        " 'CHEMBL1622|FOLIC ACID': ['HSD17B10'],\n",
115 |        " 'CHEMBL3039545|LUSPATERCEPT': ['TGFB2', 'TGFB1', 'TGFB3'],\n",
116 |        " 'CHEMBL1963684|PEGINESATIDE ACETATE': ['EPOR'],\n",
117 |        " 'CHEMBL1705709|SODIUM FEREDETATE': ['NFE2L2']}"
118 |       ]
119 |      },
120 |      "execution_count": 6,
121 |      "metadata": {},
122 |      "output_type": "execute_result"
123 |     }
124 |    ],
125 |    "source": [
126 |     "chembldict = d2c.chembl.create_drug_dictionary(\n",
127 |     "    filtered_df,\n",
128 |     "    drug_grouping='ATC_level'\n",
129 |     ")\n",
130 |     "chembldict['B03']"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "markdown",
135 |    "id": "b0b1582d",
136 |    "metadata": {},
137 |    "source": [
138 |     "This results in a nested dictionary structure - a dictionary of categories, holding dictionaries of drugs, holding lists of targets. Drug2cell knows how to operate with this sort of structure as well as its normal groups:targets dictionary, but you need to specify `nested=True` in the scoring/enrichment/overrepresentation functions whenever you pass this structure."
139 |    ]
140 |   }
141 |  ],
142 |  "metadata": {
143 |   "kernelspec": {
144 |    "display_name": "drug2cell_test_env",
145 |    "language": "python",
146 |    "name": "drug2cell_test_env"
147 |   },
148 |   "language_info": {
149 |    "codemirror_mode": {
150 |     "name": "ipython",
151 |     "version": 3
152 |    },
153 |    "file_extension": ".py",
154 |    "mimetype": "text/x-python",
155 |    "name": "python",
156 |    "nbconvert_exporter": "python",
157 |    "pygments_lexer": "ipython3",
158 |    "version": "3.8.20"
159 |   }
160 |  },
161 |  "nbformat": 4,
162 |  "nbformat_minor": 5
163 | }
164 | 
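
The thresholds in the filtering notebook are written as pChEMBL values next to the concentrations they stand for. Since pChEMBL is the negative base-10 logarithm of the molar concentration, the conversion can be sketched as follows. The helper name `pchembl_from_nanomolar` is hypothetical and not part of drug2cell.

```python
import math

def pchembl_from_nanomolar(concentration_nm):
    # pChEMBL is -log10 of the concentration expressed in molar units
    return -math.log10(concentration_nm * 1e-9)

# the concentrations named in the notebook comments, in nM
cutoffs_nm = {"10 uM": 10_000, "1 uM": 1_000, "100 nM": 100, "30 nM": 30}
pchembl = {label: pchembl_from_nanomolar(nm) for label, nm in cutoffs_nm.items()}

for label, value in pchembl.items():
    print(f"{label}: {value:.2f}")
```

A higher pChEMBL value means a lower concentration, so raising a threshold makes the filter stricter. The 30 nM cutoff works out to roughly 7.52.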


--------------------------------------------------------------------------------
/notebooks/chembl/initial_database_parsing.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "id": "aerial-liquid",
   6 |    "metadata": {},
   7 |    "source": [
   8 |     "## Download the database"
   9 |    ]
  10 |   },
  11 |   {
  12 |    "cell_type": "code",
  13 |    "execution_count": null,
  14 |    "id": "arbitrary-feature",
  15 |    "metadata": {},
  16 |    "outputs": [],
  17 |    "source": [
  18 |     "wget -O chembl_30_sqlite.tar.gz https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30_sqlite.tar.gz"
  19 |    ]
  20 |   },
  21 |   {
  22 |    "cell_type": "code",
  23 |    "execution_count": null,
  24 |    "id": "german-doctrine",
  25 |    "metadata": {},
  26 |    "outputs": [],
  27 |    "source": [
  28 |     "# at farm\n",
  29 |     "# download checksums file\n",
  30 |     "wget -O db30_checksums.txt https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/checksums.txt"
  31 |    ]
  32 |   },
  33 |   {
  34 |    "cell_type": "code",
  35 |    "execution_count": null,
  36 |    "id": "handmade-canada",
  37 |    "metadata": {},
  38 |    "outputs": [],
  39 |    "source": [
  40 |     "!cat db30_checksums.txt"
  41 |    ]
  42 |   },
  43 |   {
  44 |    "cell_type": "code",
  45 |    "execution_count": null,
  46 |    "id": "bizarre-reservation",
  47 |    "metadata": {},
  48 |    "outputs": [],
  49 |    "source": [
  50 |     "!sha256sum chembl_30_sqlite.tar.gz"
  51 |    ]
  52 |   },
  53 |   {
  54 |    "cell_type": "code",
  55 |    "execution_count": null,
  56 |    "id": "biblical-banana",
  57 |    "metadata": {},
  58 |    "outputs": [],
  59 |    "source": [
  60 |     "# decompress\n",
  61 |     "!tar -xzvf chembl_30_sqlite.tar.gz"
  62 |    ]
  63 |   },
  64 |   {
  65 |    "cell_type": "code",
  66 |    "execution_count": null,
  67 |    "id": "raising-devon",
  68 |    "metadata": {},
  69 |    "outputs": [],
  70 |    "source": []
  71 |   },
  72 |   {
  73 |    "cell_type": "markdown",
  74 |    "id": "consistent-andorra",
  75 |    "metadata": {},
  76 |    "source": [
  77 |     "## Import required modules"
  78 |    ]
  79 |   },
  80 |   {
  81 |    "cell_type": "code",
  82 |    "execution_count": 28,
  83 |    "id": "confidential-reaction",
  84 |    "metadata": {},
  85 |    "outputs": [],
  86 |    "source": [
  87 |     "import pandas as pd\n",
  88 |     "pd.set_option('display.max_rows', 600)\n",
  89 |     "import numpy as np\n",
  90 |     "import sqlite3"
  91 |    ]
  92 |   },
  93 |   {
  94 |    "cell_type": "markdown",
  95 |    "id": "following-absolute",
  96 |    "metadata": {},
  97 |    "source": [
  98 |     "## Connect to the downloaded database"
  99 |    ]
 100 |   },
 101 |   {
 102 |    "cell_type": "code",
 103 |    "execution_count": 2,
 104 |    "id": "patient-digit",
 105 |    "metadata": {},
 106 |    "outputs": [],
 107 |    "source": [
 108 |     "con = sqlite3.connect('chembl_30.db')\n",
 109 |     "cur = con.cursor()"
 110 |    ]
 111 |   },
 112 |   {
 113 |    "cell_type": "code",
 114 |    "execution_count": 3,
 115 |    "id": "unnecessary-wonder",
 116 |    "metadata": {},
 117 |    "outputs": [
 118 |     {
 119 |      "name": "stdout",
 120 |      "output_type": "stream",
 121 |      "text": [
 122 |       "[('assay_type',), ('chembl_id_lookup',), ('target_type',), ('go_classification',), ('atc_classification',), ('source',), ('action_type',), ('variant_sequences',), ('protein_classification',), ('activity_smid',), ('irac_classification',), ('bioassay_ontology',), ('structural_alert_sets',), ('relationship_type',), ('confidence_score_lookup',), ('curation_lookup',), ('assay_classification',), ('data_validity_lookup',), ('products',), ('patent_use_codes',), ('research_stem',), ('hrac_classification',), ('component_sequences',), ('domains',), ('bio_component_sequences',), ('protein_family_classification',), ('version',), ('activity_stds_lookup',), ('frac_classification',), ('usan_stems',), ('organism_class',), ('target_dictionary',), ('molecule_dictionary',), ('docs',), ('protein_class_synonyms',), ('structural_alerts',), ('cell_dictionary',), ('tissue_dictionary',), ('product_patents',), ('component_synonyms',), ('component_domains',), ('research_companies',), ('component_go',), ('defined_daily_dose',), ('activity_supp',), ('component_class',), ('target_relations',), ('biotherapeutics',), ('compound_records',), ('binding_sites',), ('compound_structural_alerts',), ('assays',), ('molecule_hierarchy',), ('molecule_synonyms',), ('molecule_hrac_classification',), ('target_components',), ('compound_properties',), ('molecule_frac_classification',), ('molecule_irac_classification',), ('molecule_atc_classification',), ('compound_structures',), ('drug_indication',), ('drug_mechanism',), ('assay_parameters',), ('activities',), ('drug_warning',), ('metabolism',), ('biotherapeutic_components',), ('assay_class_map',), ('formulations',), ('site_components',), ('indication_refs',), ('mechanism_refs',), ('activity_properties',), ('warning_refs',), ('metabolism_refs',), ('predicted_binding_domains',), ('activity_supp_map',), ('ligand_eff',), ('sqlite_stat1',)]\n"
 123 |      ]
 124 |     }
 125 |    ],
 126 |    "source": [
 127 |     "cur.execute(\"SELECT name FROM sqlite_master WHERE type='table';\")\n",
 128 |     "print(cur.fetchall())"
 129 |    ]
 130 |   },
 131 |   {
 132 |    "cell_type": "code",
 133 |    "execution_count": null,
 134 |    "id": "driving-surrey",
 135 |    "metadata": {},
 136 |    "outputs": [],
 137 |    "source": []
 138 |   },
 139 |   {
 140 |    "cell_type": "markdown",
 141 |    "id": "spoken-wallace",
 142 |    "metadata": {},
 143 |    "source": [
 144 |     "## Activity data loading"
 145 |    ]
 146 |   },
 147 |   {
 148 |    "cell_type": "code",
 149 |    "execution_count": 4,
 150 |    "id": "uniform-assurance",
 151 |    "metadata": {},
 152 |    "outputs": [
 153 |     {
 154 |      "name": "stdout",
 155 |      "output_type": "stream",
 156 |      "text": [
 157 |       "(19286751, 27)\n",
 158 |       "(19286751, 27)\n",
 159 |       "CPU times: user 3min 39s, sys: 20.6 s, total: 4min\n",
 160 |       "Wall time: 4min 5s\n"
 161 |      ]
 162 |     },
 163 |     {
 164 |      "data": {
 165 |       "text/html": [
 166 |        "<div>\n",
 167 |        "<style scoped>\n",
 168 |        "    .dataframe tbody tr th:only-of-type {\n",
 169 |        "        vertical-align: middle;\n",
 170 |        "    }\n",
 171 |        "\n",
 172 |        "    .dataframe tbody tr th {\n",
 173 |        "        vertical-align: top;\n",
 174 |        "    }\n",
 175 |        "\n",
 176 |        "    .dataframe thead th {\n",
 177 |        "        text-align: right;\n",
 178 |        "    }\n",
 179 |        "</style>\n",
 180 |        "<table border=\"1\" class=\"dataframe\">\n",
 181 |        "  <thead>\n",
 182 |        "    <tr style=\"text-align: right;\">\n",
 183 |        "      <th></th>\n",
 184 |        "      <th>activities|activity_id</th>\n",
 185 |        "      <th>activities|assay_id</th>\n",
 186 |        "      <th>activities|doc_id</th>\n",
 187 |        "      <th>activities|record_id</th>\n",
 188 |        "      <th>activities|molregno</th>\n",
 189 |        "      <th>activities|standard_relation</th>\n",
 190 |        "      <th>activities|standard_value</th>\n",
 191 |        "      <th>activities|standard_units</th>\n",
 192 |        "      <th>activities|standard_flag</th>\n",
 193 |        "      <th>activities|standard_type</th>\n",
 194 |        "      <th>...</th>\n",
 195 |        "      <th>activities|toid</th>\n",
 196 |        "      <th>activities|upper_value</th>\n",
 197 |        "      <th>activities|standard_upper_value</th>\n",
 198 |        "      <th>activities|src_id</th>\n",
 199 |        "      <th>activities|type</th>\n",
 200 |        "      <th>activities|relation</th>\n",
 201 |        "      <th>activities|value</th>\n",
 202 |        "      <th>activities|units</th>\n",
 203 |        "      <th>activities|text_value</th>\n",
 204 |        "      <th>activities|standard_text_value</th>\n",
 205 |        "    </tr>\n",
 206 |        "  </thead>\n",
 207 |        "  <tbody>\n",
 208 |        "    <tr>\n",
 209 |        "      <th>0</th>\n",
 210 |        "      <td>31863</td>\n",
 211 |        "      <td>54505</td>\n",
 212 |        "      <td>6424</td>\n",
 213 |        "      <td>206172</td>\n",
 214 |        "      <td>180094</td>\n",
 215 |        "      <td>&gt;</td>\n",
 216 |        "      <td>100000.0</td>\n",
 217 |        "      <td>nM</td>\n",
 218 |        "      <td>1</td>\n",
 219 |        "      <td>IC50</td>\n",
 220 |        "      <td>...</td>\n",
 221 |        "      <td>NaN</td>\n",
 222 |        "      <td>NaN</td>\n",
 223 |        "      <td>None</td>\n",
 224 |        "      <td>1</td>\n",
 225 |        "      <td>IC50</td>\n",
 226 |        "      <td>&gt;</td>\n",
 227 |        "      <td>100.0</td>\n",
 228 |        "      <td>uM</td>\n",
 229 |        "      <td>None</td>\n",
 230 |        "      <td>None</td>\n",
 231 |        "    </tr>\n",
 232 |        "    <tr>\n",
 233 |        "      <th>1</th>\n",
 234 |        "      <td>31864</td>\n",
 235 |        "      <td>83907</td>\n",
 236 |        "      <td>6432</td>\n",
 237 |        "      <td>208970</td>\n",
 238 |        "      <td>182268</td>\n",
 239 |        "      <td>=</td>\n",
 240 |        "      <td>2500.0</td>\n",
 241 |        "      <td>nM</td>\n",
 242 |        "      <td>1</td>\n",
 243 |        "      <td>IC50</td>\n",
 244 |        "      <td>...</td>\n",
 245 |        "      <td>NaN</td>\n",
 246 |        "      <td>NaN</td>\n",
 247 |        "      <td>None</td>\n",
 248 |        "      <td>1</td>\n",
 249 |        "      <td>IC50</td>\n",
 250 |        "      <td>=</td>\n",
 251 |        "      <td>2.5</td>\n",
 252 |        "      <td>uM</td>\n",
 253 |        "      <td>None</td>\n",
 254 |        "      <td>None</td>\n",
 255 |        "    </tr>\n",
 256 |        "    <tr>\n",
 257 |        "      <th>2</th>\n",
 258 |        "      <td>31865</td>\n",
 259 |        "      <td>88152</td>\n",
 260 |        "      <td>6432</td>\n",
 261 |        "      <td>208970</td>\n",
 262 |        "      <td>182268</td>\n",
 263 |        "      <td>&gt;</td>\n",
 264 |        "      <td>50000.0</td>\n",
 265 |        "      <td>nM</td>\n",
 266 |        "      <td>1</td>\n",
 267 |        "      <td>IC50</td>\n",
 268 |        "      <td>...</td>\n",
 269 |        "      <td>NaN</td>\n",
 270 |        "      <td>NaN</td>\n",
 271 |        "      <td>None</td>\n",
 272 |        "      <td>1</td>\n",
 273 |        "      <td>IC50</td>\n",
 274 |        "      <td>&gt;</td>\n",
 275 |        "      <td>50.0</td>\n",
 276 |        "      <td>uM</td>\n",
 277 |        "      <td>None</td>\n",
 278 |        "      <td>None</td>\n",
 279 |        "    </tr>\n",
 280 |        "    <tr>\n",
 281 |        "      <th>3</th>\n",
 282 |        "      <td>31866</td>\n",
 283 |        "      <td>83907</td>\n",
 284 |        "      <td>6432</td>\n",
 285 |        "      <td>208987</td>\n",
 286 |        "      <td>182855</td>\n",
 287 |        "      <td>=</td>\n",
 288 |        "      <td>9000.0</td>\n",
 289 |        "      <td>nM</td>\n",
 290 |        "      <td>1</td>\n",
 291 |        "      <td>IC50</td>\n",
 292 |        "      <td>...</td>\n",
 293 |        "      <td>NaN</td>\n",
 294 |        "      <td>NaN</td>\n",
 295 |        "      <td>None</td>\n",
 296 |        "      <td>1</td>\n",
 297 |        "      <td>IC50</td>\n",
 298 |        "      <td>=</td>\n",
 299 |        "      <td>9.0</td>\n",
 300 |        "      <td>uM</td>\n",
 301 |        "      <td>None</td>\n",
 302 |        "      <td>None</td>\n",
 303 |        "    </tr>\n",
 304 |        "    <tr>\n",
 305 |        "      <th>4</th>\n",
 306 |        "      <td>31867</td>\n",
 307 |        "      <td>88153</td>\n",
 308 |        "      <td>6432</td>\n",
 309 |        "      <td>208987</td>\n",
 310 |        "      <td>182855</td>\n",
 311 |        "      <td>None</td>\n",
 312 |        "      <td>NaN</td>\n",
 313 |        "      <td>nM</td>\n",
 314 |        "      <td>0</td>\n",
 315 |        "      <td>IC50</td>\n",
 316 |        "      <td>...</td>\n",
 317 |        "      <td>NaN</td>\n",
 318 |        "      <td>NaN</td>\n",
 319 |        "      <td>None</td>\n",
 320 |        "      <td>1</td>\n",
 321 |        "      <td>IC50</td>\n",
 322 |        "      <td>None</td>\n",
 323 |        "      <td>NaN</td>\n",
 324 |        "      <td>uM</td>\n",
 325 |        "      <td>None</td>\n",
 326 |        "      <td>None</td>\n",
 327 |        "    </tr>\n",
 328 |        "  </tbody>\n",
 329 |        "</table>\n",
 330 |        "<p>5 rows × 27 columns</p>\n",
 331 |        "</div>"
 332 |       ],
 333 |       "text/plain": [
 334 |        "   activities|activity_id  activities|assay_id  activities|doc_id  \\\n",
 335 |        "0                   31863                54505               6424   \n",
 336 |        "1                   31864                83907               6432   \n",
 337 |        "2                   31865                88152               6432   \n",
 338 |        "3                   31866                83907               6432   \n",
 339 |        "4                   31867                88153               6432   \n",
 340 |        "\n",
 341 |        "   activities|record_id  activities|molregno activities|standard_relation  \\\n",
 342 |        "0                206172               180094                            >   \n",
 343 |        "1                208970               182268                            =   \n",
 344 |        "2                208970               182268                            >   \n",
 345 |        "3                208987               182855                            =   \n",
 346 |        "4                208987               182855                         None   \n",
 347 |        "\n",
 348 |        "   activities|standard_value activities|standard_units  \\\n",
 349 |        "0                   100000.0                        nM   \n",
 350 |        "1                     2500.0                        nM   \n",
 351 |        "2                    50000.0                        nM   \n",
 352 |        "3                     9000.0                        nM   \n",
 353 |        "4                        NaN                        nM   \n",
 354 |        "\n",
 355 |        "   activities|standard_flag activities|standard_type  ... activities|toid  \\\n",
 356 |        "0                         1                     IC50  ...             NaN   \n",
 357 |        "1                         1                     IC50  ...             NaN   \n",
 358 |        "2                         1                     IC50  ...             NaN   \n",
 359 |        "3                         1                     IC50  ...             NaN   \n",
 360 |        "4                         0                     IC50  ...             NaN   \n",
 361 |        "\n",
 362 |        "  activities|upper_value  activities|standard_upper_value  activities|src_id  \\\n",
 363 |        "0                    NaN                             None                  1   \n",
 364 |        "1                    NaN                             None                  1   \n",
 365 |        "2                    NaN                             None                  1   \n",
 366 |        "3                    NaN                             None                  1   \n",
 367 |        "4                    NaN                             None                  1   \n",
 368 |        "\n",
 369 |        "  activities|type activities|relation activities|value  activities|units  \\\n",
 370 |        "0            IC50                   >            100.0                uM   \n",
 371 |        "1            IC50                   =              2.5                uM   \n",
 372 |        "2            IC50                   >             50.0                uM   \n",
 373 |        "3            IC50                   =              9.0                uM   \n",
 374 |        "4            IC50                None              NaN                uM   \n",
 375 |        "\n",
 376 |        "   activities|text_value activities|standard_text_value  \n",
 377 |        "0                   None                           None  \n",
 378 |        "1                   None                           None  \n",
 379 |        "2                   None                           None  \n",
 380 |        "3                   None                           None  \n",
 381 |        "4                   None                           None  \n",
 382 |        "\n",
 383 |        "[5 rows x 27 columns]"
 384 |       ]
 385 |      },
 386 |      "execution_count": 4,
 387 |      "metadata": {},
 388 |      "output_type": "execute_result"
 389 |     }
 390 |    ],
 391 |    "source": [
 392 |     "%%time\n",
 393 |     "activities = pd.read_sql_query(\"SELECT * from activities\", con)\n",
 394 |     "print(activities.shape)\n",
 395 |     "\n",
 396 |     "# rename columns to be able to track back\n",
 397 |     "activities.columns=[f'activities|{x}' for x in activities.columns]\n",
 398 |     "\n",
 399 |     "print(activities.shape)\n",
 400 |     "activities.head()"
 401 |    ]
 402 |   },
 403 |   {
 404 |    "cell_type": "code",
 405 |    "execution_count": null,
 406 |    "id": "above-police",
 407 |    "metadata": {},
 408 |    "outputs": [],
 409 |    "source": []
 410 |   },
 411 |   {
 412 |    "cell_type": "markdown",
 413 |    "id": "unexpected-packet",
 414 |    "metadata": {},
 415 |    "source": [
 416 |     "## Assay data loading"
 417 |    ]
 418 |   },
 419 |   {
 420 |    "cell_type": "code",
 421 |    "execution_count": 5,
 422 |    "id": "actual-incidence",
 423 |    "metadata": {},
 424 |    "outputs": [
 425 |     {
 426 |      "name": "stdout",
 427 |      "output_type": "stream",
 428 |      "text": [
 429 |       "(1458215, 24)\n",
 430 |       "(1458215, 24)\n"
 431 |      ]
 432 |     },
 433 |     {
 434 |      "data": {
 435 |       "text/html": [
 436 |        "<div>\n",
 437 |        "<style scoped>\n",
 438 |        "    .dataframe tbody tr th:only-of-type {\n",
 439 |        "        vertical-align: middle;\n",
 440 |        "    }\n",
 441 |        "\n",
 442 |        "    .dataframe tbody tr th {\n",
 443 |        "        vertical-align: top;\n",
 444 |        "    }\n",
 445 |        "\n",
 446 |        "    .dataframe thead th {\n",
 447 |        "        text-align: right;\n",
 448 |        "    }\n",
 449 |        "</style>\n",
 450 |        "<table border=\"1\" class=\"dataframe\">\n",
 451 |        "  <thead>\n",
 452 |        "    <tr style=\"text-align: right;\">\n",
 453 |        "      <th></th>\n",
 454 |        "      <th>assays|assay_id</th>\n",
 455 |        "      <th>assays|doc_id</th>\n",
 456 |        "      <th>assays|description</th>\n",
 457 |        "      <th>assays|assay_type</th>\n",
 458 |        "      <th>assays|assay_test_type</th>\n",
 459 |        "      <th>assays|assay_category</th>\n",
 460 |        "      <th>assays|assay_organism</th>\n",
 461 |        "      <th>assays|assay_tax_id</th>\n",
 462 |        "      <th>assays|assay_strain</th>\n",
 463 |        "      <th>assays|assay_tissue</th>\n",
 464 |        "      <th>...</th>\n",
 465 |        "      <th>assays|confidence_score</th>\n",
 466 |        "      <th>assays|curated_by</th>\n",
 467 |        "      <th>assays|src_id</th>\n",
 468 |        "      <th>assays|src_assay_id</th>\n",
 469 |        "      <th>assays|chembl_id</th>\n",
 470 |        "      <th>assays|cell_id</th>\n",
 471 |        "      <th>assays|bao_format</th>\n",
 472 |        "      <th>assays|tissue_id</th>\n",
 473 |        "      <th>assays|variant_id</th>\n",
 474 |        "      <th>assays|aidx</th>\n",
 475 |        "    </tr>\n",
 476 |        "  </thead>\n",
 477 |        "  <tbody>\n",
 478 |        "    <tr>\n",
 479 |        "      <th>0</th>\n",
 480 |        "      <td>1</td>\n",
 481 |        "      <td>11087</td>\n",
 482 |        "      <td>The compound was tested for the in vitro inhib...</td>\n",
 483 |        "      <td>B</td>\n",
 484 |        "      <td>None</td>\n",
 485 |        "      <td>None</td>\n",
 486 |        "      <td>None</td>\n",
 487 |        "      <td>NaN</td>\n",
 488 |        "      <td>None</td>\n",
 489 |        "      <td>None</td>\n",
 490 |        "      <td>...</td>\n",
 491 |        "      <td>8</td>\n",
 492 |        "      <td>Autocuration</td>\n",
 493 |        "      <td>1</td>\n",
 494 |        "      <td>None</td>\n",
 495 |        "      <td>CHEMBL615117</td>\n",
 496 |        "      <td>NaN</td>\n",
 497 |        "      <td>BAO_0000019</td>\n",
 498 |        "      <td>NaN</td>\n",
 499 |        "      <td>NaN</td>\n",
 500 |        "      <td>CLD0</td>\n",
 501 |        "    </tr>\n",
 502 |        "    <tr>\n",
 503 |        "      <th>1</th>\n",
 504 |        "      <td>2</td>\n",
 505 |        "      <td>684</td>\n",
 506 |        "      <td>Compound was evaluated for its ability to mobi...</td>\n",
 507 |        "      <td>F</td>\n",
 508 |        "      <td>None</td>\n",
 509 |        "      <td>None</td>\n",
 510 |        "      <td>None</td>\n",
 511 |        "      <td>NaN</td>\n",
 512 |        "      <td>None</td>\n",
 513 |        "      <td>None</td>\n",
 514 |        "      <td>...</td>\n",
 515 |        "      <td>0</td>\n",
 516 |        "      <td>Autocuration</td>\n",
 517 |        "      <td>1</td>\n",
 518 |        "      <td>None</td>\n",
 519 |        "      <td>CHEMBL615118</td>\n",
 520 |        "      <td>NaN</td>\n",
 521 |        "      <td>BAO_0000219</td>\n",
 522 |        "      <td>NaN</td>\n",
 523 |        "      <td>NaN</td>\n",
 524 |        "      <td>CLD0</td>\n",
 525 |        "    </tr>\n",
 526 |        "    <tr>\n",
 527 |        "      <th>2</th>\n",
 528 |        "      <td>3</td>\n",
 529 |        "      <td>15453</td>\n",
 530 |        "      <td>None</td>\n",
 531 |        "      <td>B</td>\n",
 532 |        "      <td>None</td>\n",
 533 |        "      <td>None</td>\n",
 534 |        "      <td>None</td>\n",
 535 |        "      <td>NaN</td>\n",
 536 |        "      <td>None</td>\n",
 537 |        "      <td>None</td>\n",
 538 |        "      <td>...</td>\n",
 539 |        "      <td>0</td>\n",
 540 |        "      <td>Autocuration</td>\n",
 541 |        "      <td>1</td>\n",
 542 |        "      <td>None</td>\n",
 543 |        "      <td>CHEMBL615119</td>\n",
 544 |        "      <td>NaN</td>\n",
 545 |        "      <td>BAO_0000019</td>\n",
 546 |        "      <td>NaN</td>\n",
 547 |        "      <td>NaN</td>\n",
 548 |        "      <td>CLD0</td>\n",
 549 |        "    </tr>\n",
 550 |        "    <tr>\n",
 551 |        "      <th>3</th>\n",
 552 |        "      <td>4</td>\n",
 553 |        "      <td>17841</td>\n",
 554 |        "      <td>Binding affinity against A2 adenosine receptor...</td>\n",
 555 |        "      <td>B</td>\n",
 556 |        "      <td>None</td>\n",
 557 |        "      <td>None</td>\n",
 558 |        "      <td>Bos taurus</td>\n",
 559 |        "      <td>9913.0</td>\n",
 560 |        "      <td>None</td>\n",
 561 |        "      <td>Striatum</td>\n",
 562 |        "      <td>...</td>\n",
 563 |        "      <td>4</td>\n",
 564 |        "      <td>Autocuration</td>\n",
 565 |        "      <td>1</td>\n",
 566 |        "      <td>None</td>\n",
 567 |        "      <td>CHEMBL615120</td>\n",
 568 |        "      <td>NaN</td>\n",
 569 |        "      <td>BAO_0000249</td>\n",
 570 |        "      <td>2435.0</td>\n",
 571 |        "      <td>NaN</td>\n",
 572 |        "      <td>CLD0</td>\n",
 573 |        "    </tr>\n",
 574 |        "    <tr>\n",
 575 |        "      <th>4</th>\n",
 576 |        "      <td>5</td>\n",
 577 |        "      <td>17430</td>\n",
 578 |        "      <td>In vitro cell cytotoxicity against 143-B cell ...</td>\n",
 579 |        "      <td>F</td>\n",
 580 |        "      <td>None</td>\n",
 581 |        "      <td>None</td>\n",
 582 |        "      <td>Homo sapiens</td>\n",
 583 |        "      <td>9606.0</td>\n",
 584 |        "      <td>None</td>\n",
 585 |        "      <td>None</td>\n",
 586 |        "      <td>...</td>\n",
 587 |        "      <td>1</td>\n",
 588 |        "      <td>Intermediate</td>\n",
 589 |        "      <td>1</td>\n",
 590 |        "      <td>None</td>\n",
 591 |        "      <td>CHEMBL615121</td>\n",
 592 |        "      <td>163.0</td>\n",
 593 |        "      <td>BAO_0000219</td>\n",
 594 |        "      <td>NaN</td>\n",
 595 |        "      <td>NaN</td>\n",
 596 |        "      <td>CLD0</td>\n",
 597 |        "    </tr>\n",
 598 |        "  </tbody>\n",
 599 |        "</table>\n",
 600 |        "<p>5 rows × 24 columns</p>\n",
 601 |        "</div>"
 602 |       ],
 603 |       "text/plain": [
 604 |        "   assays|assay_id  assays|doc_id  \\\n",
 605 |        "0                1          11087   \n",
 606 |        "1                2            684   \n",
 607 |        "2                3          15453   \n",
 608 |        "3                4          17841   \n",
 609 |        "4                5          17430   \n",
 610 |        "\n",
 611 |        "                                  assays|description assays|assay_type  \\\n",
 612 |        "0  The compound was tested for the in vitro inhib...                 B   \n",
 613 |        "1  Compound was evaluated for its ability to mobi...                 F   \n",
 614 |        "2                                               None                 B   \n",
 615 |        "3  Binding affinity against A2 adenosine receptor...                 B   \n",
 616 |        "4  In vitro cell cytotoxicity against 143-B cell ...                 F   \n",
 617 |        "\n",
 618 |        "  assays|assay_test_type assays|assay_category assays|assay_organism  \\\n",
 619 |        "0                   None                  None                  None   \n",
 620 |        "1                   None                  None                  None   \n",
 621 |        "2                   None                  None                  None   \n",
 622 |        "3                   None                  None            Bos taurus   \n",
 623 |        "4                   None                  None          Homo sapiens   \n",
 624 |        "\n",
 625 |        "   assays|assay_tax_id assays|assay_strain assays|assay_tissue  ...  \\\n",
 626 |        "0                  NaN                None                None  ...   \n",
 627 |        "1                  NaN                None                None  ...   \n",
 628 |        "2                  NaN                None                None  ...   \n",
 629 |        "3               9913.0                None            Striatum  ...   \n",
 630 |        "4               9606.0                None                None  ...   \n",
 631 |        "\n",
 632 |        "  assays|confidence_score assays|curated_by  assays|src_id  \\\n",
 633 |        "0                       8      Autocuration              1   \n",
 634 |        "1                       0      Autocuration              1   \n",
 635 |        "2                       0      Autocuration              1   \n",
 636 |        "3                       4      Autocuration              1   \n",
 637 |        "4                       1      Intermediate              1   \n",
 638 |        "\n",
 639 |        "  assays|src_assay_id  assays|chembl_id assays|cell_id  assays|bao_format  \\\n",
 640 |        "0                None      CHEMBL615117            NaN        BAO_0000019   \n",
 641 |        "1                None      CHEMBL615118            NaN        BAO_0000219   \n",
 642 |        "2                None      CHEMBL615119            NaN        BAO_0000019   \n",
 643 |        "3                None      CHEMBL615120            NaN        BAO_0000249   \n",
 644 |        "4                None      CHEMBL615121          163.0        BAO_0000219   \n",
 645 |        "\n",
 646 |        "  assays|tissue_id assays|variant_id  assays|aidx  \n",
 647 |        "0              NaN               NaN         CLD0  \n",
 648 |        "1              NaN               NaN         CLD0  \n",
 649 |        "2              NaN               NaN         CLD0  \n",
 650 |        "3           2435.0               NaN         CLD0  \n",
 651 |        "4              NaN               NaN         CLD0  \n",
 652 |        "\n",
 653 |        "[5 rows x 24 columns]"
 654 |       ]
 655 |      },
 656 |      "execution_count": 5,
 657 |      "metadata": {},
 658 |      "output_type": "execute_result"
 659 |     }
 660 |    ],
 661 |    "source": [
 662 |     "# assay data\n",
 663 |     "assays = pd.read_sql_query(\"SELECT * from assays\", con)\n",
 664 |     "print(assays.shape)\n",
 665 |     "\n",
 666 |     "# rename columns to be able to track back\n",
 667 |     "assays.columns=[f'assays|{x}' for x in assays.columns]\n",
 668 |     "\n",
 669 |     "print(assays.shape)\n",
 670 |     "assays.head()"
 671 |    ]
 672 |   },
 673 |   {
 674 |    "cell_type": "code",
 675 |    "execution_count": null,
 676 |    "id": "touched-insulation",
 677 |    "metadata": {},
 678 |    "outputs": [],
 679 |    "source": []
 680 |   },
 681 |   {
 682 |    "cell_type": "markdown",
 683 |    "id": "independent-slave",
 684 |    "metadata": {},
 685 |    "source": [
 686 |     "## Target data loading"
 687 |    ]
 688 |   },
 689 |   {
 690 |    "cell_type": "code",
 691 |    "execution_count": 6,
 692 |    "id": "numeric-warrant",
 693 |    "metadata": {},
 694 |    "outputs": [
 695 |     {
 696 |      "name": "stdout",
 697 |      "output_type": "stream",
 698 |      "text": [
 699 |       "(14855, 7)\n"
 700 |      ]
 701 |     },
 702 |     {
 703 |      "data": {
 704 |       "text/html": [
 705 |        "<div>\n",
 706 |        "<style scoped>\n",
 707 |        "    .dataframe tbody tr th:only-of-type {\n",
 708 |        "        vertical-align: middle;\n",
 709 |        "    }\n",
 710 |        "\n",
 711 |        "    .dataframe tbody tr th {\n",
 712 |        "        vertical-align: top;\n",
 713 |        "    }\n",
 714 |        "\n",
 715 |        "    .dataframe thead th {\n",
 716 |        "        text-align: right;\n",
 717 |        "    }\n",
 718 |        "</style>\n",
 719 |        "<table border=\"1\" class=\"dataframe\">\n",
 720 |        "  <thead>\n",
 721 |        "    <tr style=\"text-align: right;\">\n",
 722 |        "      <th></th>\n",
 723 |        "      <th>target_dictionary|tid</th>\n",
 724 |        "      <th>target_dictionary|target_type</th>\n",
 725 |        "      <th>target_dictionary|pref_name</th>\n",
 726 |        "      <th>target_dictionary|tax_id</th>\n",
 727 |        "      <th>target_dictionary|organism</th>\n",
 728 |        "      <th>target_dictionary|chembl_id</th>\n",
 729 |        "      <th>target_dictionary|species_group_flag</th>\n",
 730 |        "    </tr>\n",
 731 |        "  </thead>\n",
 732 |        "  <tbody>\n",
 733 |        "    <tr>\n",
 734 |        "      <th>0</th>\n",
 735 |        "      <td>1</td>\n",
 736 |        "      <td>SINGLE PROTEIN</td>\n",
 737 |        "      <td>Maltase-glucoamylase</td>\n",
 738 |        "      <td>9606.0</td>\n",
 739 |        "      <td>Homo sapiens</td>\n",
 740 |        "      <td>CHEMBL2074</td>\n",
 741 |        "      <td>0</td>\n",
 742 |        "    </tr>\n",
 743 |        "    <tr>\n",
 744 |        "      <th>1</th>\n",
 745 |        "      <td>2</td>\n",
 746 |        "      <td>SINGLE PROTEIN</td>\n",
 747 |        "      <td>Sulfonylurea receptor 2</td>\n",
 748 |        "      <td>9606.0</td>\n",
 749 |        "      <td>Homo sapiens</td>\n",
 750 |        "      <td>CHEMBL1971</td>\n",
 751 |        "      <td>0</td>\n",
 752 |        "    </tr>\n",
 753 |        "    <tr>\n",
 754 |        "      <th>2</th>\n",
 755 |        "      <td>3</td>\n",
 756 |        "      <td>SINGLE PROTEIN</td>\n",
 757 |        "      <td>Phosphodiesterase 5A</td>\n",
 758 |        "      <td>9606.0</td>\n",
 759 |        "      <td>Homo sapiens</td>\n",
 760 |        "      <td>CHEMBL1827</td>\n",
 761 |        "      <td>0</td>\n",
 762 |        "    </tr>\n",
 763 |        "    <tr>\n",
 764 |        "      <th>3</th>\n",
 765 |        "      <td>4</td>\n",
 766 |        "      <td>SINGLE PROTEIN</td>\n",
 767 |        "      <td>Voltage-gated T-type calcium channel alpha-1H ...</td>\n",
 768 |        "      <td>9606.0</td>\n",
 769 |        "      <td>Homo sapiens</td>\n",
 770 |        "      <td>CHEMBL1859</td>\n",
 771 |        "      <td>0</td>\n",
 772 |        "    </tr>\n",
 773 |        "    <tr>\n",
 774 |        "      <th>4</th>\n",
 775 |        "      <td>5</td>\n",
 776 |        "      <td>SINGLE PROTEIN</td>\n",
 777 |        "      <td>Nicotinic acetylcholine receptor alpha subunit</td>\n",
 778 |        "      <td>6253.0</td>\n",
 779 |        "      <td>Ascaris suum</td>\n",
 780 |        "      <td>CHEMBL1884</td>\n",
 781 |        "      <td>0</td>\n",
 782 |        "    </tr>\n",
 783 |        "  </tbody>\n",
 784 |        "</table>\n",
 785 |        "</div>"
 786 |       ],
 787 |       "text/plain": [
 788 |        "   target_dictionary|tid target_dictionary|target_type  \\\n",
 789 |        "0                      1                SINGLE PROTEIN   \n",
 790 |        "1                      2                SINGLE PROTEIN   \n",
 791 |        "2                      3                SINGLE PROTEIN   \n",
 792 |        "3                      4                SINGLE PROTEIN   \n",
 793 |        "4                      5                SINGLE PROTEIN   \n",
 794 |        "\n",
 795 |        "                         target_dictionary|pref_name  \\\n",
 796 |        "0                               Maltase-glucoamylase   \n",
 797 |        "1                            Sulfonylurea receptor 2   \n",
 798 |        "2                               Phosphodiesterase 5A   \n",
 799 |        "3  Voltage-gated T-type calcium channel alpha-1H ...   \n",
 800 |        "4     Nicotinic acetylcholine receptor alpha subunit   \n",
 801 |        "\n",
 802 |        "   target_dictionary|tax_id target_dictionary|organism  \\\n",
 803 |        "0                    9606.0               Homo sapiens   \n",
 804 |        "1                    9606.0               Homo sapiens   \n",
 805 |        "2                    9606.0               Homo sapiens   \n",
 806 |        "3                    9606.0               Homo sapiens   \n",
 807 |        "4                    6253.0               Ascaris suum   \n",
 808 |        "\n",
 809 |        "  target_dictionary|chembl_id  target_dictionary|species_group_flag  \n",
 810 |        "0                  CHEMBL2074                                     0  \n",
 811 |        "1                  CHEMBL1971                                     0  \n",
 812 |        "2                  CHEMBL1827                                     0  \n",
 813 |        "3                  CHEMBL1859                                     0  \n",
 814 |        "4                  CHEMBL1884                                     0  "
 815 |       ]
 816 |      },
 817 |      "execution_count": 6,
 818 |      "metadata": {},
 819 |      "output_type": "execute_result"
 820 |     }
 821 |    ],
 822 |    "source": [
 823 |     "target_dictionary = pd.read_sql_query(\"SELECT * FROM target_dictionary\", con)\n",
 824 |     "print(target_dictionary.shape)\n",
 825 |     "\n",
 826 |     "# prefix every column with its source table name so it can be traced back later\n",
 827 |     "target_dictionary.columns = [f'target_dictionary|{x}' for x in target_dictionary.columns]\n",
 828 |     "\n",
 829 |     "target_dictionary.head()"
 830 |    ]
 831 |   },
 832 |   {
 833 |    "cell_type": "code",
 834 |    "execution_count": 7,
 835 |    "id": "athletic-producer",
 836 |    "metadata": {},
 837 |    "outputs": [
 838 |     {
 839 |      "name": "stdout",
 840 |      "output_type": "stream",
 841 |      "text": [
 842 |       "(13558, 4)\n"
 843 |      ]
 844 |     },
 845 |     {
 846 |      "data": {
 847 |       "text/html": [
 848 |        "<div>\n",
 849 |        "<style scoped>\n",
 850 |        "    .dataframe tbody tr th:only-of-type {\n",
 851 |        "        vertical-align: middle;\n",
 852 |        "    }\n",
 853 |        "\n",
 854 |        "    .dataframe tbody tr th {\n",
 855 |        "        vertical-align: top;\n",
 856 |        "    }\n",
 857 |        "\n",
 858 |        "    .dataframe thead th {\n",
 859 |        "        text-align: right;\n",
 860 |        "    }\n",
 861 |        "</style>\n",
 862 |        "<table border=\"1\" class=\"dataframe\">\n",
 863 |        "  <thead>\n",
 864 |        "    <tr style=\"text-align: right;\">\n",
 865 |        "      <th></th>\n",
 866 |        "      <th>target_components|tid</th>\n",
 867 |        "      <th>target_components|component_id</th>\n",
 868 |        "      <th>target_components|targcomp_id</th>\n",
 869 |        "      <th>target_components|homologue</th>\n",
 870 |        "    </tr>\n",
 871 |        "  </thead>\n",
 872 |        "  <tbody>\n",
 873 |        "    <tr>\n",
 874 |        "      <th>0</th>\n",
 875 |        "      <td>11004</td>\n",
 876 |        "      <td>3090</td>\n",
 877 |        "      <td>1</td>\n",
 878 |        "      <td>0</td>\n",
 879 |        "    </tr>\n",
 880 |        "    <tr>\n",
 881 |        "      <th>1</th>\n",
 882 |        "      <td>11028</td>\n",
 883 |        "      <td>1166</td>\n",
 884 |        "      <td>4</td>\n",
 885 |        "      <td>0</td>\n",
 886 |        "    </tr>\n",
 887 |        "    <tr>\n",
 888 |        "      <th>2</th>\n",
 889 |        "      <td>11037</td>\n",
 890 |        "      <td>1888</td>\n",
 891 |        "      <td>5</td>\n",
 892 |        "      <td>0</td>\n",
 893 |        "    </tr>\n",
 894 |        "    <tr>\n",
 895 |        "      <th>3</th>\n",
 896 |        "      <td>11043</td>\n",
 897 |        "      <td>1294</td>\n",
 898 |        "      <td>7</td>\n",
 899 |        "      <td>0</td>\n",
 900 |        "    </tr>\n",
 901 |        "    <tr>\n",
 902 |        "      <th>4</th>\n",
 903 |        "      <td>11056</td>\n",
 904 |        "      <td>2159</td>\n",
 905 |        "      <td>8</td>\n",
 906 |        "      <td>0</td>\n",
 907 |        "    </tr>\n",
 908 |        "  </tbody>\n",
 909 |        "</table>\n",
 910 |        "</div>"
 911 |       ],
 912 |       "text/plain": [
 913 |        "   target_components|tid  target_components|component_id  \\\n",
 914 |        "0                  11004                            3090   \n",
 915 |        "1                  11028                            1166   \n",
 916 |        "2                  11037                            1888   \n",
 917 |        "3                  11043                            1294   \n",
 918 |        "4                  11056                            2159   \n",
 919 |        "\n",
 920 |        "   target_components|targcomp_id  target_components|homologue  \n",
 921 |        "0                              1                            0  \n",
 922 |        "1                              4                            0  \n",
 923 |        "2                              5                            0  \n",
 924 |        "3                              7                            0  \n",
 925 |        "4                              8                            0  "
 926 |       ]
 927 |      },
 928 |      "execution_count": 7,
 929 |      "metadata": {},
 930 |      "output_type": "execute_result"
 931 |     }
 932 |    ],
 933 |    "source": [
 934 |     "target_components = pd.read_sql_query(\"SELECT * FROM target_components\", con)\n",
 935 |     "print(target_components.shape)\n",
 936 |     "\n",
 937 |     "# prefix every column with its source table name so it can be traced back later\n",
 938 |     "target_components.columns = [f'target_components|{x}' for x in target_components.columns]\n",
 939 |     "\n",
 940 |     "target_components.head()"
 941 |    ]
 942 |   },
 943 |   {
 944 |    "cell_type": "code",
 945 |    "execution_count": 8,
 946 |    "id": "immune-animal",
 947 |    "metadata": {},
 948 |    "outputs": [
 949 |     {
 950 |      "name": "stdout",
 951 |      "output_type": "stream",
 952 |      "text": [
 953 |       "(97172, 4)\n"
 954 |      ]
 955 |     },
 956 |     {
 957 |      "data": {
 958 |       "text/html": [
 959 |        "<div>\n",
 960 |        "<style scoped>\n",
 961 |        "    .dataframe tbody tr th:only-of-type {\n",
 962 |        "        vertical-align: middle;\n",
 963 |        "    }\n",
 964 |        "\n",
 965 |        "    .dataframe tbody tr th {\n",
 966 |        "        vertical-align: top;\n",
 967 |        "    }\n",
 968 |        "\n",
 969 |        "    .dataframe thead th {\n",
 970 |        "        text-align: right;\n",
 971 |        "    }\n",
 972 |        "</style>\n",
 973 |        "<table border=\"1\" class=\"dataframe\">\n",
 974 |        "  <thead>\n",
 975 |        "    <tr style=\"text-align: right;\">\n",
 976 |        "      <th></th>\n",
 977 |        "      <th>component_synonyms|compsyn_id</th>\n",
 978 |        "      <th>component_synonyms|component_id</th>\n",
 979 |        "      <th>component_synonyms|component_synonym</th>\n",
 980 |        "      <th>component_synonyms|syn_type</th>\n",
 981 |        "    </tr>\n",
 982 |        "  </thead>\n",
 983 |        "  <tbody>\n",
 984 |        "    <tr>\n",
 985 |        "      <th>0</th>\n",
 986 |        "      <td>860862</td>\n",
 987 |        "      <td>48</td>\n",
 988 |        "      <td>Gabra-1</td>\n",
 989 |        "      <td>GENE_SYMBOL_OTHER</td>\n",
 990 |        "    </tr>\n",
 991 |        "    <tr>\n",
 992 |        "      <th>1</th>\n",
 993 |        "      <td>860867</td>\n",
 994 |        "      <td>49</td>\n",
 995 |        "      <td>Gabrb-3</td>\n",
 996 |        "      <td>GENE_SYMBOL_OTHER</td>\n",
 997 |        "    </tr>\n",
 998 |        "    <tr>\n",
 999 |        "      <th>2</th>\n",
1000 |        "      <td>860872</td>\n",
1001 |        "      <td>50</td>\n",
1002 |        "      <td>Gabrb-2</td>\n",
1003 |        "      <td>GENE_SYMBOL_OTHER</td>\n",
1004 |        "    </tr>\n",
1005 |        "    <tr>\n",
1006 |        "      <th>3</th>\n",
1007 |        "      <td>860877</td>\n",
1008 |        "      <td>51</td>\n",
1009 |        "      <td>CDKN5</td>\n",
1010 |        "      <td>GENE_SYMBOL_OTHER</td>\n",
1011 |        "    </tr>\n",
1012 |        "    <tr>\n",
1013 |        "      <th>4</th>\n",
1014 |        "      <td>860915</td>\n",
1015 |        "      <td>56</td>\n",
1016 |        "      <td>NMDAR1</td>\n",
1017 |        "      <td>GENE_SYMBOL_OTHER</td>\n",
1018 |        "    </tr>\n",
1019 |        "  </tbody>\n",
1020 |        "</table>\n",
1021 |        "</div>"
1022 |       ],
1023 |       "text/plain": [
1024 |        "   component_synonyms|compsyn_id  component_synonyms|component_id  \\\n",
1025 |        "0                         860862                               48   \n",
1026 |        "1                         860867                               49   \n",
1027 |        "2                         860872                               50   \n",
1028 |        "3                         860877                               51   \n",
1029 |        "4                         860915                               56   \n",
1030 |        "\n",
1031 |        "  component_synonyms|component_synonym component_synonyms|syn_type  \n",
1032 |        "0                              Gabra-1           GENE_SYMBOL_OTHER  \n",
1033 |        "1                              Gabrb-3           GENE_SYMBOL_OTHER  \n",
1034 |        "2                              Gabrb-2           GENE_SYMBOL_OTHER  \n",
1035 |        "3                                CDKN5           GENE_SYMBOL_OTHER  \n",
1036 |        "4                               NMDAR1           GENE_SYMBOL_OTHER  "
1037 |       ]
1038 |      },
1039 |      "execution_count": 8,
1040 |      "metadata": {},
1041 |      "output_type": "execute_result"
1042 |     }
1043 |    ],
1044 |    "source": [
1045 |     "component_synonyms = pd.read_sql_query(\"SELECT * FROM component_synonyms\", con)\n",
1046 |     "print(component_synonyms.shape)\n",
1047 |     "\n",
1048 |     "# prefix every column with its source table name so it can be traced back later\n",
1049 |     "component_synonyms.columns = [f'component_synonyms|{x}' for x in component_synonyms.columns]\n",
1050 |     "\n",
1051 |     "component_synonyms.head()"
1052 |    ]
1053 |   },
1070 |   {
1071 |    "cell_type": "markdown",
1072 |    "id": "molecular-finland",
1073 |    "metadata": {},
1074 |    "source": [
1075 |     "## Compound data loading"
1076 |    ]
1077 |   },
1078 |   {
1079 |    "cell_type": "code",
1080 |    "execution_count": 9,
1081 |    "id": "responsible-architect",
1082 |    "metadata": {},
1083 |    "outputs": [
1084 |     {
1085 |      "name": "stdout",
1086 |      "output_type": "stream",
1087 |      "text": [
1088 |       "(6656, 14)\n"
1089 |      ]
1090 |     },
1091 |     {
1092 |      "data": {
1093 |       "text/html": [
1094 |        "<div>\n",
1095 |        "<style scoped>\n",
1096 |        "    .dataframe tbody tr th:only-of-type {\n",
1097 |        "        vertical-align: middle;\n",
1098 |        "    }\n",
1099 |        "\n",
1100 |        "    .dataframe tbody tr th {\n",
1101 |        "        vertical-align: top;\n",
1102 |        "    }\n",
1103 |        "\n",
1104 |        "    .dataframe thead th {\n",
1105 |        "        text-align: right;\n",
1106 |        "    }\n",
1107 |        "</style>\n",
1108 |        "<table border=\"1\" class=\"dataframe\">\n",
1109 |        "  <thead>\n",
1110 |        "    <tr style=\"text-align: right;\">\n",
1111 |        "      <th></th>\n",
1112 |        "      <th>drug_mechanism|mec_id</th>\n",
1113 |        "      <th>drug_mechanism|record_id</th>\n",
1114 |        "      <th>drug_mechanism|molregno</th>\n",
1115 |        "      <th>drug_mechanism|mechanism_of_action</th>\n",
1116 |        "      <th>drug_mechanism|tid</th>\n",
1117 |        "      <th>drug_mechanism|site_id</th>\n",
1118 |        "      <th>drug_mechanism|action_type</th>\n",
1119 |        "      <th>drug_mechanism|direct_interaction</th>\n",
1120 |        "      <th>drug_mechanism|molecular_mechanism</th>\n",
1121 |        "      <th>drug_mechanism|disease_efficacy</th>\n",
1122 |        "      <th>drug_mechanism|mechanism_comment</th>\n",
1123 |        "      <th>drug_mechanism|selectivity_comment</th>\n",
1124 |        "      <th>drug_mechanism|binding_site_comment</th>\n",
1125 |        "      <th>drug_mechanism|variant_id</th>\n",
1126 |        "    </tr>\n",
1127 |        "  </thead>\n",
1128 |        "  <tbody>\n",
1129 |        "    <tr>\n",
1130 |        "      <th>0</th>\n",
1131 |        "      <td>13</td>\n",
1132 |        "      <td>1343810</td>\n",
1133 |        "      <td>1124</td>\n",
1134 |        "      <td>Carbonic anhydrase VII inhibitor</td>\n",
1135 |        "      <td>11060.0</td>\n",
1136 |        "      <td>NaN</td>\n",
1137 |        "      <td>INHIBITOR</td>\n",
1138 |        "      <td>1</td>\n",
1139 |        "      <td>1</td>\n",
1140 |        "      <td>1</td>\n",
1141 |        "      <td>None</td>\n",
1142 |        "      <td>None</td>\n",
1143 |        "      <td>None</td>\n",
1144 |        "      <td>NaN</td>\n",
1145 |        "    </tr>\n",
1146 |        "    <tr>\n",
1147 |        "      <th>1</th>\n",
1148 |        "      <td>14</td>\n",
1149 |        "      <td>1344053</td>\n",
1150 |        "      <td>675068</td>\n",
1151 |        "      <td>Carbonic anhydrase I inhibitor</td>\n",
1152 |        "      <td>10193.0</td>\n",
1153 |        "      <td>NaN</td>\n",
1154 |        "      <td>INHIBITOR</td>\n",
1155 |        "      <td>1</td>\n",
1156 |        "      <td>1</td>\n",
1157 |        "      <td>1</td>\n",
1158 |        "      <td>None</td>\n",
1159 |        "      <td>None</td>\n",
1160 |        "      <td>None</td>\n",
1161 |        "      <td>NaN</td>\n",
1162 |        "    </tr>\n",
1163 |        "    <tr>\n",
1164 |        "      <th>2</th>\n",
1165 |        "      <td>15</td>\n",
1166 |        "      <td>1344649</td>\n",
1167 |        "      <td>674765</td>\n",
1168 |        "      <td>Carbonic anhydrase I inhibitor</td>\n",
1169 |        "      <td>10193.0</td>\n",
1170 |        "      <td>NaN</td>\n",
1171 |        "      <td>INHIBITOR</td>\n",
1172 |        "      <td>1</td>\n",
1173 |        "      <td>1</td>\n",
1174 |        "      <td>1</td>\n",
1175 |        "      <td>Expressed in eye</td>\n",
1176 |        "      <td>None</td>\n",
1177 |        "      <td>None</td>\n",
1178 |        "      <td>NaN</td>\n",
1179 |        "    </tr>\n",
1180 |        "    <tr>\n",
1181 |        "      <th>3</th>\n",
1182 |        "      <td>16</td>\n",
1183 |        "      <td>1343255</td>\n",
1184 |        "      <td>1085</td>\n",
1185 |        "      <td>Carbonic anhydrase I inhibitor</td>\n",
1186 |        "      <td>10193.0</td>\n",
1187 |        "      <td>NaN</td>\n",
1188 |        "      <td>INHIBITOR</td>\n",
1189 |        "      <td>1</td>\n",
1190 |        "      <td>1</td>\n",
1191 |        "      <td>1</td>\n",
1192 |        "      <td>None</td>\n",
1193 |        "      <td>None</td>\n",
1194 |        "      <td>None</td>\n",
1195 |        "      <td>NaN</td>\n",
1196 |        "    </tr>\n",
1197 |        "    <tr>\n",
1198 |        "      <th>4</th>\n",
1199 |        "      <td>17</td>\n",
1200 |        "      <td>1344903</td>\n",
1201 |        "      <td>1125</td>\n",
1202 |        "      <td>Carbonic anhydrase I inhibitor</td>\n",
1203 |        "      <td>10193.0</td>\n",
1204 |        "      <td>NaN</td>\n",
1205 |        "      <td>INHIBITOR</td>\n",
1206 |        "      <td>1</td>\n",
1207 |        "      <td>1</td>\n",
1208 |        "      <td>1</td>\n",
1209 |        "      <td>Expressed in eye</td>\n",
1210 |        "      <td>None</td>\n",
1211 |        "      <td>None</td>\n",
1212 |        "      <td>NaN</td>\n",
1213 |        "    </tr>\n",
1214 |        "  </tbody>\n",
1215 |        "</table>\n",
1216 |        "</div>"
1217 |       ],
1218 |       "text/plain": [
1219 |        "   drug_mechanism|mec_id  drug_mechanism|record_id  drug_mechanism|molregno  \\\n",
1220 |        "0                     13                   1343810                     1124   \n",
1221 |        "1                     14                   1344053                   675068   \n",
1222 |        "2                     15                   1344649                   674765   \n",
1223 |        "3                     16                   1343255                     1085   \n",
1224 |        "4                     17                   1344903                     1125   \n",
1225 |        "\n",
1226 |        "  drug_mechanism|mechanism_of_action  drug_mechanism|tid  \\\n",
1227 |        "0   Carbonic anhydrase VII inhibitor             11060.0   \n",
1228 |        "1     Carbonic anhydrase I inhibitor             10193.0   \n",
1229 |        "2     Carbonic anhydrase I inhibitor             10193.0   \n",
1230 |        "3     Carbonic anhydrase I inhibitor             10193.0   \n",
1231 |        "4     Carbonic anhydrase I inhibitor             10193.0   \n",
1232 |        "\n",
1233 |        "   drug_mechanism|site_id drug_mechanism|action_type  \\\n",
1234 |        "0                     NaN                  INHIBITOR   \n",
1235 |        "1                     NaN                  INHIBITOR   \n",
1236 |        "2                     NaN                  INHIBITOR   \n",
1237 |        "3                     NaN                  INHIBITOR   \n",
1238 |        "4                     NaN                  INHIBITOR   \n",
1239 |        "\n",
1240 |        "   drug_mechanism|direct_interaction  drug_mechanism|molecular_mechanism  \\\n",
1241 |        "0                                  1                                   1   \n",
1242 |        "1                                  1                                   1   \n",
1243 |        "2                                  1                                   1   \n",
1244 |        "3                                  1                                   1   \n",
1245 |        "4                                  1                                   1   \n",
1246 |        "\n",
1247 |        "   drug_mechanism|disease_efficacy drug_mechanism|mechanism_comment  \\\n",
1248 |        "0                                1                             None   \n",
1249 |        "1                                1                             None   \n",
1250 |        "2                                1                 Expressed in eye   \n",
1251 |        "3                                1                             None   \n",
1252 |        "4                                1                 Expressed in eye   \n",
1253 |        "\n",
1254 |        "  drug_mechanism|selectivity_comment drug_mechanism|binding_site_comment  \\\n",
1255 |        "0                               None                                None   \n",
1256 |        "1                               None                                None   \n",
1257 |        "2                               None                                None   \n",
1258 |        "3                               None                                None   \n",
1259 |        "4                               None                                None   \n",
1260 |        "\n",
1261 |        "   drug_mechanism|variant_id  \n",
1262 |        "0                        NaN  \n",
1263 |        "1                        NaN  \n",
1264 |        "2                        NaN  \n",
1265 |        "3                        NaN  \n",
1266 |        "4                        NaN  "
1267 |       ]
1268 |      },
1269 |      "execution_count": 9,
1270 |      "metadata": {},
1271 |      "output_type": "execute_result"
1272 |     }
1273 |    ],
1274 |    "source": [
1275 |     "drug_mechanism = pd.read_sql_query(\"SELECT * FROM drug_mechanism\", con)\n",
1276 |     "print(drug_mechanism.shape)\n",
1277 |     "\n",
1278 |     "# prefix every column with its source table name so it can be traced back later\n",
1279 |     "drug_mechanism.columns = [f'drug_mechanism|{x}' for x in drug_mechanism.columns]\n",
1280 |     "\n",
1281 |     "drug_mechanism.head()"
1282 |    ]
1283 |   },
1284 |   {
1285 |    "cell_type": "code",
1286 |    "execution_count": 10,
1287 |    "id": "challenging-swing",
1288 |    "metadata": {},
1289 |    "outputs": [
1290 |     {
1291 |      "name": "stdout",
1292 |      "output_type": "stream",
1293 |      "text": [
1294 |       "(2157379, 31)\n"
1295 |      ]
1296 |     },
1297 |     {
1298 |      "data": {
1299 |       "text/html": [
1300 |        "<div>\n",
1301 |        "<style scoped>\n",
1302 |        "    .dataframe tbody tr th:only-of-type {\n",
1303 |        "        vertical-align: middle;\n",
1304 |        "    }\n",
1305 |        "\n",
1306 |        "    .dataframe tbody tr th {\n",
1307 |        "        vertical-align: top;\n",
1308 |        "    }\n",
1309 |        "\n",
1310 |        "    .dataframe thead th {\n",
1311 |        "        text-align: right;\n",
1312 |        "    }\n",
1313 |        "</style>\n",
1314 |        "<table border=\"1\" class=\"dataframe\">\n",
1315 |        "  <thead>\n",
1316 |        "    <tr style=\"text-align: right;\">\n",
1317 |        "      <th></th>\n",
1318 |        "      <th>molecule_dictionary|molregno</th>\n",
1319 |        "      <th>molecule_dictionary|pref_name</th>\n",
1320 |        "      <th>molecule_dictionary|chembl_id</th>\n",
1321 |        "      <th>molecule_dictionary|max_phase</th>\n",
1322 |        "      <th>molecule_dictionary|therapeutic_flag</th>\n",
1323 |        "      <th>molecule_dictionary|dosed_ingredient</th>\n",
1324 |        "      <th>molecule_dictionary|structure_type</th>\n",
1325 |        "      <th>molecule_dictionary|chebi_par_id</th>\n",
1326 |        "      <th>molecule_dictionary|molecule_type</th>\n",
1327 |        "      <th>molecule_dictionary|first_approval</th>\n",
1328 |        "      <th>...</th>\n",
1329 |        "      <th>molecule_dictionary|usan_stem</th>\n",
1330 |        "      <th>molecule_dictionary|polymer_flag</th>\n",
1331 |        "      <th>molecule_dictionary|usan_substem</th>\n",
1332 |        "      <th>molecule_dictionary|usan_stem_definition</th>\n",
1333 |        "      <th>molecule_dictionary|indication_class</th>\n",
1334 |        "      <th>molecule_dictionary|withdrawn_flag</th>\n",
1335 |        "      <th>molecule_dictionary|withdrawn_year</th>\n",
1336 |        "      <th>molecule_dictionary|withdrawn_country</th>\n",
1337 |        "      <th>molecule_dictionary|withdrawn_reason</th>\n",
1338 |        "      <th>molecule_dictionary|withdrawn_class</th>\n",
1339 |        "    </tr>\n",
1340 |        "  </thead>\n",
1341 |        "  <tbody>\n",
1342 |        "    <tr>\n",
1343 |        "      <th>0</th>\n",
1344 |        "      <td>1</td>\n",
1345 |        "      <td>None</td>\n",
1346 |        "      <td>CHEMBL6329</td>\n",
1347 |        "      <td>0</td>\n",
1348 |        "      <td>0</td>\n",
1349 |        "      <td>0</td>\n",
1350 |        "      <td>MOL</td>\n",
1351 |        "      <td>NaN</td>\n",
1352 |        "      <td>Small molecule</td>\n",
1353 |        "      <td>NaN</td>\n",
1354 |        "      <td>...</td>\n",
1355 |        "      <td>None</td>\n",
1356 |        "      <td>0</td>\n",
1357 |        "      <td>None</td>\n",
1358 |        "      <td>None</td>\n",
1359 |        "      <td>None</td>\n",
1360 |        "      <td>0</td>\n",
1361 |        "      <td>NaN</td>\n",
1362 |        "      <td>None</td>\n",
1363 |        "      <td>None</td>\n",
1364 |        "      <td>None</td>\n",
1365 |        "    </tr>\n",
1366 |        "    <tr>\n",
1367 |        "      <th>1</th>\n",
1368 |        "      <td>2</td>\n",
1369 |        "      <td>None</td>\n",
1370 |        "      <td>CHEMBL6328</td>\n",
1371 |        "      <td>0</td>\n",
1372 |        "      <td>0</td>\n",
1373 |        "      <td>0</td>\n",
1374 |        "      <td>MOL</td>\n",
1375 |        "      <td>NaN</td>\n",
1376 |        "      <td>Small molecule</td>\n",
1377 |        "      <td>NaN</td>\n",
1378 |        "      <td>...</td>\n",
1379 |        "      <td>None</td>\n",
1380 |        "      <td>0</td>\n",
1381 |        "      <td>None</td>\n",
1382 |        "      <td>None</td>\n",
1383 |        "      <td>None</td>\n",
1384 |        "      <td>0</td>\n",
1385 |        "      <td>NaN</td>\n",
1386 |        "      <td>None</td>\n",
1387 |        "      <td>None</td>\n",
1388 |        "      <td>None</td>\n",
1389 |        "    </tr>\n",
1390 |        "    <tr>\n",
1391 |        "      <th>2</th>\n",
1392 |        "      <td>3</td>\n",
1393 |        "      <td>None</td>\n",
1394 |        "      <td>CHEMBL265667</td>\n",
1395 |        "      <td>0</td>\n",
1396 |        "      <td>0</td>\n",
1397 |        "      <td>0</td>\n",
1398 |        "      <td>MOL</td>\n",
1399 |        "      <td>NaN</td>\n",
1400 |        "      <td>Small molecule</td>\n",
1401 |        "      <td>NaN</td>\n",
1402 |        "      <td>...</td>\n",
1403 |        "      <td>None</td>\n",
1404 |        "      <td>0</td>\n",
1405 |        "      <td>None</td>\n",
1406 |        "      <td>None</td>\n",
1407 |        "      <td>None</td>\n",
1408 |        "      <td>0</td>\n",
1409 |        "      <td>NaN</td>\n",
1410 |        "      <td>None</td>\n",
1411 |        "      <td>None</td>\n",
1412 |        "      <td>None</td>\n",
1413 |        "    </tr>\n",
1414 |        "    <tr>\n",
1415 |        "      <th>3</th>\n",
1416 |        "      <td>4</td>\n",
1417 |        "      <td>None</td>\n",
1418 |        "      <td>CHEMBL6362</td>\n",
1419 |        "      <td>0</td>\n",
1420 |        "      <td>0</td>\n",
1421 |        "      <td>0</td>\n",
1422 |        "      <td>MOL</td>\n",
1423 |        "      <td>NaN</td>\n",
1424 |        "      <td>Small molecule</td>\n",
1425 |        "      <td>NaN</td>\n",
1426 |        "      <td>...</td>\n",
1427 |        "      <td>None</td>\n",
1428 |        "      <td>0</td>\n",
1429 |        "      <td>None</td>\n",
1430 |        "      <td>None</td>\n",
1431 |        "      <td>None</td>\n",
1432 |        "      <td>0</td>\n",
1433 |        "      <td>NaN</td>\n",
1434 |        "      <td>None</td>\n",
1435 |        "      <td>None</td>\n",
1436 |        "      <td>None</td>\n",
1437 |        "    </tr>\n",
1438 |        "    <tr>\n",
1439 |        "      <th>4</th>\n",
1440 |        "      <td>5</td>\n",
1441 |        "      <td>None</td>\n",
1442 |        "      <td>CHEMBL267864</td>\n",
1443 |        "      <td>0</td>\n",
1444 |        "      <td>0</td>\n",
1445 |        "      <td>0</td>\n",
1446 |        "      <td>MOL</td>\n",
1447 |        "      <td>NaN</td>\n",
1448 |        "      <td>Small molecule</td>\n",
1449 |        "      <td>NaN</td>\n",
1450 |        "      <td>...</td>\n",
1451 |        "      <td>None</td>\n",
1452 |        "      <td>0</td>\n",
1453 |        "      <td>None</td>\n",
1454 |        "      <td>None</td>\n",
1455 |        "      <td>None</td>\n",
1456 |        "      <td>0</td>\n",
1457 |        "      <td>NaN</td>\n",
1458 |        "      <td>None</td>\n",
1459 |        "      <td>None</td>\n",
1460 |        "      <td>None</td>\n",
1461 |        "    </tr>\n",
1462 |        "  </tbody>\n",
1463 |        "</table>\n",
1464 |        "<p>5 rows × 31 columns</p>\n",
1465 |        "</div>"
1466 |       ],
1467 |       "text/plain": [
1468 |        "   molecule_dictionary|molregno molecule_dictionary|pref_name  \\\n",
1469 |        "0                             1                          None   \n",
1470 |        "1                             2                          None   \n",
1471 |        "2                             3                          None   \n",
1472 |        "3                             4                          None   \n",
1473 |        "4                             5                          None   \n",
1474 |        "\n",
1475 |        "  molecule_dictionary|chembl_id  molecule_dictionary|max_phase  \\\n",
1476 |        "0                    CHEMBL6329                              0   \n",
1477 |        "1                    CHEMBL6328                              0   \n",
1478 |        "2                  CHEMBL265667                              0   \n",
1479 |        "3                    CHEMBL6362                              0   \n",
1480 |        "4                  CHEMBL267864                              0   \n",
1481 |        "\n",
1482 |        "   molecule_dictionary|therapeutic_flag  molecule_dictionary|dosed_ingredient  \\\n",
1483 |        "0                                     0                                     0   \n",
1484 |        "1                                     0                                     0   \n",
1485 |        "2                                     0                                     0   \n",
1486 |        "3                                     0                                     0   \n",
1487 |        "4                                     0                                     0   \n",
1488 |        "\n",
1489 |        "  molecule_dictionary|structure_type  molecule_dictionary|chebi_par_id  \\\n",
1490 |        "0                                MOL                               NaN   \n",
1491 |        "1                                MOL                               NaN   \n",
1492 |        "2                                MOL                               NaN   \n",
1493 |        "3                                MOL                               NaN   \n",
1494 |        "4                                MOL                               NaN   \n",
1495 |        "\n",
1496 |        "  molecule_dictionary|molecule_type  molecule_dictionary|first_approval  ...  \\\n",
1497 |        "0                    Small molecule                                 NaN  ...   \n",
1498 |        "1                    Small molecule                                 NaN  ...   \n",
1499 |        "2                    Small molecule                                 NaN  ...   \n",
1500 |        "3                    Small molecule                                 NaN  ...   \n",
1501 |        "4                    Small molecule                                 NaN  ...   \n",
1502 |        "\n",
1503 |        "   molecule_dictionary|usan_stem  molecule_dictionary|polymer_flag  \\\n",
1504 |        "0                           None                                 0   \n",
1505 |        "1                           None                                 0   \n",
1506 |        "2                           None                                 0   \n",
1507 |        "3                           None                                 0   \n",
1508 |        "4                           None                                 0   \n",
1509 |        "\n",
1510 |        "   molecule_dictionary|usan_substem  molecule_dictionary|usan_stem_definition  \\\n",
1511 |        "0                              None                                      None   \n",
1512 |        "1                              None                                      None   \n",
1513 |        "2                              None                                      None   \n",
1514 |        "3                              None                                      None   \n",
1515 |        "4                              None                                      None   \n",
1516 |        "\n",
1517 |        "   molecule_dictionary|indication_class  molecule_dictionary|withdrawn_flag  \\\n",
1518 |        "0                                  None                                   0   \n",
1519 |        "1                                  None                                   0   \n",
1520 |        "2                                  None                                   0   \n",
1521 |        "3                                  None                                   0   \n",
1522 |        "4                                  None                                   0   \n",
1523 |        "\n",
1524 |        "   molecule_dictionary|withdrawn_year  molecule_dictionary|withdrawn_country  \\\n",
1525 |        "0                                 NaN                                   None   \n",
1526 |        "1                                 NaN                                   None   \n",
1527 |        "2                                 NaN                                   None   \n",
1528 |        "3                                 NaN                                   None   \n",
1529 |        "4                                 NaN                                   None   \n",
1530 |        "\n",
1531 |        "   molecule_dictionary|withdrawn_reason  molecule_dictionary|withdrawn_class  \n",
1532 |        "0                                  None                                 None  \n",
1533 |        "1                                  None                                 None  \n",
1534 |        "2                                  None                                 None  \n",
1535 |        "3                                  None                                 None  \n",
1536 |        "4                                  None                                 None  \n",
1537 |        "\n",
1538 |        "[5 rows x 31 columns]"
1539 |       ]
1540 |      },
1541 |      "execution_count": 10,
1542 |      "metadata": {},
1543 |      "output_type": "execute_result"
1544 |     }
1545 |    ],
1546 |    "source": [
1547 |     "molecule_dictionary = pd.read_sql_query(\"SELECT * from molecule_dictionary\", con)\n",
1548 |     "print(molecule_dictionary.shape)\n",
1549 |     "\n",
1550 |     "# prefix each column with its table name so it can be traced back after merging\n",
1551 |     "molecule_dictionary.columns=[f'molecule_dictionary|{x}' for x in molecule_dictionary.columns]\n",
1552 |     "\n",
1553 |     "molecule_dictionary.head()"
1554 |    ]
1555 |   },
1556 |   {
1557 |    "cell_type": "code",
1558 |    "execution_count": 11,
1559 |    "id": "gothic-fantasy",
1560 |    "metadata": {},
1561 |    "outputs": [
1562 |     {
1563 |      "name": "stdout",
1564 |      "output_type": "stream",
1565 |      "text": [
1566 |       "(4470, 3)\n"
1567 |      ]
1568 |     },
1569 |     {
1570 |      "data": {
1571 |       "text/html": [
1572 |        "<div>\n",
1573 |        "<style scoped>\n",
1574 |        "    .dataframe tbody tr th:only-of-type {\n",
1575 |        "        vertical-align: middle;\n",
1576 |        "    }\n",
1577 |        "\n",
1578 |        "    .dataframe tbody tr th {\n",
1579 |        "        vertical-align: top;\n",
1580 |        "    }\n",
1581 |        "\n",
1582 |        "    .dataframe thead th {\n",
1583 |        "        text-align: right;\n",
1584 |        "    }\n",
1585 |        "</style>\n",
1586 |        "<table border=\"1\" class=\"dataframe\">\n",
1587 |        "  <thead>\n",
1588 |        "    <tr style=\"text-align: right;\">\n",
1589 |        "      <th></th>\n",
1590 |        "      <th>molecule_atc_classification|mol_atc_id</th>\n",
1591 |        "      <th>molecule_atc_classification|level5</th>\n",
1592 |        "      <th>molecule_atc_classification|molregno</th>\n",
1593 |        "    </tr>\n",
1594 |        "  </thead>\n",
1595 |        "  <tbody>\n",
1596 |        "    <tr>\n",
1597 |        "      <th>0</th>\n",
1598 |        "      <td>59409</td>\n",
1599 |        "      <td>L01EX15</td>\n",
1600 |        "      <td>2089491</td>\n",
1601 |        "    </tr>\n",
1602 |        "    <tr>\n",
1603 |        "      <th>1</th>\n",
1604 |        "      <td>59410</td>\n",
1605 |        "      <td>L01EX10</td>\n",
1606 |        "      <td>608601</td>\n",
1607 |        "    </tr>\n",
1608 |        "    <tr>\n",
1609 |        "      <th>2</th>\n",
1610 |        "      <td>59411</td>\n",
1611 |        "      <td>L01EM03</td>\n",
1612 |        "      <td>1567700</td>\n",
1613 |        "    </tr>\n",
1614 |        "    <tr>\n",
1615 |        "      <th>3</th>\n",
1616 |        "      <td>59412</td>\n",
1617 |        "      <td>D06BX03</td>\n",
1618 |        "      <td>579824</td>\n",
1619 |        "    </tr>\n",
1620 |        "    <tr>\n",
1621 |        "      <th>4</th>\n",
1622 |        "      <td>59413</td>\n",
1623 |        "      <td>L01EX13</td>\n",
1624 |        "      <td>1763584</td>\n",
1625 |        "    </tr>\n",
1626 |        "  </tbody>\n",
1627 |        "</table>\n",
1628 |        "</div>"
1629 |       ],
1630 |       "text/plain": [
1631 |        "   molecule_atc_classification|mol_atc_id molecule_atc_classification|level5  \\\n",
1632 |        "0                                   59409                            L01EX15   \n",
1633 |        "1                                   59410                            L01EX10   \n",
1634 |        "2                                   59411                            L01EM03   \n",
1635 |        "3                                   59412                            D06BX03   \n",
1636 |        "4                                   59413                            L01EX13   \n",
1637 |        "\n",
1638 |        "   molecule_atc_classification|molregno  \n",
1639 |        "0                               2089491  \n",
1640 |        "1                                608601  \n",
1641 |        "2                               1567700  \n",
1642 |        "3                                579824  \n",
1643 |        "4                               1763584  "
1644 |       ]
1645 |      },
1646 |      "execution_count": 11,
1647 |      "metadata": {},
1648 |      "output_type": "execute_result"
1649 |     }
1650 |    ],
1651 |    "source": [
1652 |     "molecule_atc_classification = pd.read_sql_query(\"SELECT * from molecule_atc_classification\", con)\n",
1653 |     "print(molecule_atc_classification.shape)\n",
1654 |     "\n",
1655 |     "# prefix each column with its table name so it can be traced back after merging\n",
1656 |     "molecule_atc_classification.columns=[f'molecule_atc_classification|{x}' for x in molecule_atc_classification.columns]\n",
1657 |     "\n",
1658 |     "molecule_atc_classification.head()"
1659 |    ]
1660 |   },
1661 |   {
1662 |    "cell_type": "code",
1663 |    "execution_count": 12,
1664 |    "id": "spatial-creation",
1665 |    "metadata": {},
1666 |    "outputs": [
1667 |     {
1668 |      "name": "stdout",
1669 |      "output_type": "stream",
1670 |      "text": [
1671 |       "(5148, 10)\n"
1672 |      ]
1673 |     },
1674 |     {
1675 |      "data": {
1676 |       "text/html": [
1677 |        "<div>\n",
1678 |        "<style scoped>\n",
1679 |        "    .dataframe tbody tr th:only-of-type {\n",
1680 |        "        vertical-align: middle;\n",
1681 |        "    }\n",
1682 |        "\n",
1683 |        "    .dataframe tbody tr th {\n",
1684 |        "        vertical-align: top;\n",
1685 |        "    }\n",
1686 |        "\n",
1687 |        "    .dataframe thead th {\n",
1688 |        "        text-align: right;\n",
1689 |        "    }\n",
1690 |        "</style>\n",
1691 |        "<table border=\"1\" class=\"dataframe\">\n",
1692 |        "  <thead>\n",
1693 |        "    <tr style=\"text-align: right;\">\n",
1694 |        "      <th></th>\n",
1695 |        "      <th>atc_classification|who_name</th>\n",
1696 |        "      <th>atc_classification|level1</th>\n",
1697 |        "      <th>atc_classification|level2</th>\n",
1698 |        "      <th>atc_classification|level3</th>\n",
1699 |        "      <th>atc_classification|level4</th>\n",
1700 |        "      <th>atc_classification|level5</th>\n",
1701 |        "      <th>atc_classification|level1_description</th>\n",
1702 |        "      <th>atc_classification|level2_description</th>\n",
1703 |        "      <th>atc_classification|level3_description</th>\n",
1704 |        "      <th>atc_classification|level4_description</th>\n",
1705 |        "    </tr>\n",
1706 |        "  </thead>\n",
1707 |        "  <tbody>\n",
1708 |        "    <tr>\n",
1709 |        "      <th>0</th>\n",
1710 |        "      <td>sodium fluoride</td>\n",
1711 |        "      <td>A</td>\n",
1712 |        "      <td>A01</td>\n",
1713 |        "      <td>A01A</td>\n",
1714 |        "      <td>A01AA</td>\n",
1715 |        "      <td>A01AA01</td>\n",
1716 |        "      <td>ALIMENTARY TRACT AND METABOLISM</td>\n",
1717 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1718 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1719 |        "      <td>Caries prophylactic agents</td>\n",
1720 |        "    </tr>\n",
1721 |        "    <tr>\n",
1722 |        "      <th>1</th>\n",
1723 |        "      <td>sodium monofluorophosphate</td>\n",
1724 |        "      <td>A</td>\n",
1725 |        "      <td>A01</td>\n",
1726 |        "      <td>A01A</td>\n",
1727 |        "      <td>A01AA</td>\n",
1728 |        "      <td>A01AA02</td>\n",
1729 |        "      <td>ALIMENTARY TRACT AND METABOLISM</td>\n",
1730 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1731 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1732 |        "      <td>Caries prophylactic agents</td>\n",
1733 |        "    </tr>\n",
1734 |        "    <tr>\n",
1735 |        "      <th>2</th>\n",
1736 |        "      <td>olaflur</td>\n",
1737 |        "      <td>A</td>\n",
1738 |        "      <td>A01</td>\n",
1739 |        "      <td>A01A</td>\n",
1740 |        "      <td>A01AA</td>\n",
1741 |        "      <td>A01AA03</td>\n",
1742 |        "      <td>ALIMENTARY TRACT AND METABOLISM</td>\n",
1743 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1744 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1745 |        "      <td>Caries prophylactic agents</td>\n",
1746 |        "    </tr>\n",
1747 |        "    <tr>\n",
1748 |        "      <th>3</th>\n",
1749 |        "      <td>stannous fluoride</td>\n",
1750 |        "      <td>A</td>\n",
1751 |        "      <td>A01</td>\n",
1752 |        "      <td>A01A</td>\n",
1753 |        "      <td>A01AA</td>\n",
1754 |        "      <td>A01AA04</td>\n",
1755 |        "      <td>ALIMENTARY TRACT AND METABOLISM</td>\n",
1756 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1757 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1758 |        "      <td>Caries prophylactic agents</td>\n",
1759 |        "    </tr>\n",
1760 |        "    <tr>\n",
1761 |        "      <th>4</th>\n",
1762 |        "      <td>combinations</td>\n",
1763 |        "      <td>A</td>\n",
1764 |        "      <td>A01</td>\n",
1765 |        "      <td>A01A</td>\n",
1766 |        "      <td>A01AA</td>\n",
1767 |        "      <td>A01AA30</td>\n",
1768 |        "      <td>ALIMENTARY TRACT AND METABOLISM</td>\n",
1769 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1770 |        "      <td>STOMATOLOGICAL PREPARATIONS</td>\n",
1771 |        "      <td>Caries prophylactic agents</td>\n",
1772 |        "    </tr>\n",
1773 |        "  </tbody>\n",
1774 |        "</table>\n",
1775 |        "</div>"
1776 |       ],
1777 |       "text/plain": [
1778 |        "  atc_classification|who_name atc_classification|level1  \\\n",
1779 |        "0             sodium fluoride                         A   \n",
1780 |        "1  sodium monofluorophosphate                         A   \n",
1781 |        "2                     olaflur                         A   \n",
1782 |        "3           stannous fluoride                         A   \n",
1783 |        "4                combinations                         A   \n",
1784 |        "\n",
1785 |        "  atc_classification|level2 atc_classification|level3  \\\n",
1786 |        "0                       A01                      A01A   \n",
1787 |        "1                       A01                      A01A   \n",
1788 |        "2                       A01                      A01A   \n",
1789 |        "3                       A01                      A01A   \n",
1790 |        "4                       A01                      A01A   \n",
1791 |        "\n",
1792 |        "  atc_classification|level4 atc_classification|level5  \\\n",
1793 |        "0                     A01AA                   A01AA01   \n",
1794 |        "1                     A01AA                   A01AA02   \n",
1795 |        "2                     A01AA                   A01AA03   \n",
1796 |        "3                     A01AA                   A01AA04   \n",
1797 |        "4                     A01AA                   A01AA30   \n",
1798 |        "\n",
1799 |        "  atc_classification|level1_description atc_classification|level2_description  \\\n",
1800 |        "0       ALIMENTARY TRACT AND METABOLISM           STOMATOLOGICAL PREPARATIONS   \n",
1801 |        "1       ALIMENTARY TRACT AND METABOLISM           STOMATOLOGICAL PREPARATIONS   \n",
1802 |        "2       ALIMENTARY TRACT AND METABOLISM           STOMATOLOGICAL PREPARATIONS   \n",
1803 |        "3       ALIMENTARY TRACT AND METABOLISM           STOMATOLOGICAL PREPARATIONS   \n",
1804 |        "4       ALIMENTARY TRACT AND METABOLISM           STOMATOLOGICAL PREPARATIONS   \n",
1805 |        "\n",
1806 |        "  atc_classification|level3_description atc_classification|level4_description  \n",
1807 |        "0           STOMATOLOGICAL PREPARATIONS            Caries prophylactic agents  \n",
1808 |        "1           STOMATOLOGICAL PREPARATIONS            Caries prophylactic agents  \n",
1809 |        "2           STOMATOLOGICAL PREPARATIONS            Caries prophylactic agents  \n",
1810 |        "3           STOMATOLOGICAL PREPARATIONS            Caries prophylactic agents  \n",
1811 |        "4           STOMATOLOGICAL PREPARATIONS            Caries prophylactic agents  "
1812 |       ]
1813 |      },
1814 |      "execution_count": 12,
1815 |      "metadata": {},
1816 |      "output_type": "execute_result"
1817 |     }
1818 |    ],
1819 |    "source": [
1820 |     "atc_classification = pd.read_sql_query(\"SELECT * from atc_classification\", con)\n",
1821 |     "print(atc_classification.shape)\n",
1822 |     "\n",
1823 |     "# prefix each column with its table name so it can be traced back after merging\n",
1824 |     "atc_classification.columns=[f'atc_classification|{x}' for x in atc_classification.columns]\n",
1825 |     "\n",
1826 |     "atc_classification.head()"
1827 |    ]
1828 |   },
1829 |   {
1830 |    "cell_type": "code",
1831 |    "execution_count": null,
1832 |    "id": "renewable-engineering",
1833 |    "metadata": {},
1834 |    "outputs": [],
1835 |    "source": [
1836 |     "# atc_classification = pd.read_sql_query(\"SELECT * from atc_classification\", con)\n",
1837 |     "# atc_classification.to_csv('/home/jovyan/projects/P50_ChEMBL/csv/atc_classification_db30.csv',index=False)"
1838 |    ]
1839 |   },
1848 |   {
1849 |    "cell_type": "markdown",
1850 |    "id": "plain-indonesian",
1851 |    "metadata": {},
1852 |    "source": [
1853 |     "## Concatenate information"
1854 |    ]
1855 |   },
1856 |   {
1857 |    "cell_type": "markdown",
1858 |    "id": "rubber-compromise",
1859 |    "metadata": {},
1860 |    "source": [
1861 |     "* activities_final:<br> \n",
1862 |     "'activities|activity_id', 'activities|assay_id', 'activities|molregno',\n",
1863 |     "'activities|pchembl_value', 'activities|type', 'activities|standard_relation', 'activities|standard_value',\n",
1864 |     "'activities|standard_units', 'activities|standard_flag',\n",
1865 |     "'activities|standard_type', 'activities|activity_comment',\n",
1866 |     "'assays|description','assays|assay_type','assays|tid', 'assays|confidence_score','assays|curated_by','assays|chembl_id'<br>\n",
1867 |     "<br>\n",
1868 |     "* targets_final:<br>\n",
1869 |     "'target_dictionary|tid','target_dictionary|target_type','target_dictionary|pref_name','target_dictionary|organism','target_dictionary|chembl_id',\n",
1870 |     "'component_synonyms|component_synonym', 'component_synonyms|syn_type'<br>\n",
1871 |     "<br>\n",
1872 |     "* molecule_dictionary:<br>\n",
1873 |     "'molecule_dictionary|molregno', 'molecule_dictionary|pref_name','molecule_dictionary|chembl_id', 'molecule_dictionary|max_phase', 'molecule_dictionary|molecule_type','molecule_dictionary|oral',\n",
1874 |     "'molecule_dictionary|parenteral', 'molecule_dictionary|topical','molecule_dictionary|black_box_warning','molecule_dictionary|natural_product'<br>\n",
1875 |     "<br>\n",
1876 |     "* drug_mechanism:<br>\n",
1877 |     "'drug_mechanism|molregno','drug_mechanism|mechanism_of_action','drug_mechanism|tid','drug_mechanism|action_type'<br>\n",
1878 |     "<br>\n",
1879 |     "* molecule_atc_classification:<br>\n",
1880 |     "'molecule_atc_classification|mol_atc_id','molecule_atc_classification|level5','molecule_atc_classification|molregno'<br>\n",
1881 |     "<br>\n",
1882 |     "* atc_classification:<br>\n",
1883 |     "'atc_classification|who_name', 'atc_classification|level1','atc_classification|level2', 'atc_classification|level3',\n",
1884 |     "'atc_classification|level4', 'atc_classification|level5','atc_classification|level1_description','atc_classification|level2_description',\n",
1885 |     "'atc_classification|level3_description','atc_classification|level4_description'"
1886 |    ]
1887 |   },
1896 |   {
1897 |    "cell_type": "code",
1898 |    "execution_count": 13,
1899 |    "id": "accurate-document",
1900 |    "metadata": {},
1901 |    "outputs": [
1902 |     {
1903 |      "name": "stdout",
1904 |      "output_type": "stream",
1905 |      "text": [
1906 |       "activities+assays: (19286751, 17)\n",
1907 |       "added drug mechanism: (19291779, 21)\n",
1908 |       "19291779\n",
1909 |       "19291382\n",
1910 |       "added compound info: (20181918, 45)\n",
1911 |       "UNIPROT              55495\n",
1912 |       "GENE_SYMBOL_OTHER    39293\n",
1913 |       "GENE_SYMBOL          13028\n",
1914 |       "EC_NUMBER             7943\n",
1915 |       "Name: component_synonyms|syn_type, dtype: int64\n",
1916 |       "added targets info: (21292684, 52)\n"
1917 |      ]
1918 |     },
1919 |     {
1920 |      "data": {
1921 |       "text/html": [
1922 |        "<div>\n",
1923 |        "<style scoped>\n",
1924 |        "    .dataframe tbody tr th:only-of-type {\n",
1925 |        "        vertical-align: middle;\n",
1926 |        "    }\n",
1927 |        "\n",
1928 |        "    .dataframe tbody tr th {\n",
1929 |        "        vertical-align: top;\n",
1930 |        "    }\n",
1931 |        "\n",
1932 |        "    .dataframe thead th {\n",
1933 |        "        text-align: right;\n",
1934 |        "    }\n",
1935 |        "</style>\n",
1936 |        "<table border=\"1\" class=\"dataframe\">\n",
1937 |        "  <thead>\n",
1938 |        "    <tr style=\"text-align: right;\">\n",
1939 |        "      <th></th>\n",
1940 |        "      <th>activities|activity_id</th>\n",
1941 |        "      <th>activities|assay_id</th>\n",
1942 |        "      <th>activities|molregno</th>\n",
1943 |        "      <th>activities|pchembl_value</th>\n",
1944 |        "      <th>activities|type</th>\n",
1945 |        "      <th>activities|standard_relation</th>\n",
1946 |        "      <th>activities|standard_value</th>\n",
1947 |        "      <th>activities|standard_units</th>\n",
1948 |        "      <th>activities|standard_flag</th>\n",
1949 |        "      <th>activities|standard_type</th>\n",
1950 |        "      <th>...</th>\n",
1951 |        "      <th>atc_classification|level3_description</th>\n",
1952 |        "      <th>atc_classification|level4_description</th>\n",
1953 |        "      <th>atc_classification|who_name</th>\n",
1954 |        "      <th>target_dictionary|tid</th>\n",
1955 |        "      <th>target_dictionary|target_type</th>\n",
1956 |        "      <th>target_dictionary|pref_name</th>\n",
1957 |        "      <th>target_dictionary|organism</th>\n",
1958 |        "      <th>target_dictionary|chembl_id</th>\n",
1959 |        "      <th>component_synonyms|component_synonym</th>\n",
1960 |        "      <th>component_synonyms|syn_type</th>\n",
1961 |        "    </tr>\n",
1962 |        "  </thead>\n",
1963 |        "  <tbody>\n",
1964 |        "    <tr>\n",
1965 |        "      <th>0</th>\n",
1966 |        "      <td>31863.0</td>\n",
1967 |        "      <td>54505.0</td>\n",
1968 |        "      <td>180094.0</td>\n",
1969 |        "      <td>NaN</td>\n",
1970 |        "      <td>IC50</td>\n",
1971 |        "      <td>&gt;</td>\n",
1972 |        "      <td>100000.00</td>\n",
1973 |        "      <td>nM</td>\n",
1974 |        "      <td>1.0</td>\n",
1975 |        "      <td>IC50</td>\n",
1976 |        "      <td>...</td>\n",
1977 |        "      <td>NaN</td>\n",
1978 |        "      <td>NaN</td>\n",
1979 |        "      <td>NaN</td>\n",
1980 |        "      <td>63.0</td>\n",
1981 |        "      <td>SINGLE PROTEIN</td>\n",
1982 |        "      <td>DNA topoisomerase II alpha</td>\n",
1983 |        "      <td>Homo sapiens</td>\n",
1984 |        "      <td>CHEMBL1806</td>\n",
1985 |        "      <td>TOP2A</td>\n",
1986 |        "      <td>GENE_SYMBOL</td>\n",
1987 |        "    </tr>\n",
1988 |        "    <tr>\n",
1989 |        "      <th>1</th>\n",
1990 |        "      <td>31864.0</td>\n",
1991 |        "      <td>83907.0</td>\n",
1992 |        "      <td>182268.0</td>\n",
1993 |        "      <td>5.60</td>\n",
1994 |        "      <td>IC50</td>\n",
1995 |        "      <td>=</td>\n",
1996 |        "      <td>2500.00</td>\n",
1997 |        "      <td>nM</td>\n",
1998 |        "      <td>1.0</td>\n",
1999 |        "      <td>IC50</td>\n",
2000 |        "      <td>...</td>\n",
2001 |        "      <td>NaN</td>\n",
2002 |        "      <td>NaN</td>\n",
2003 |        "      <td>NaN</td>\n",
2004 |        "      <td>11653.0</td>\n",
2005 |        "      <td>SINGLE PROTEIN</td>\n",
2006 |        "      <td>Heparanase</td>\n",
2007 |        "      <td>Homo sapiens</td>\n",
2008 |        "      <td>CHEMBL3921</td>\n",
2009 |        "      <td>HPSE</td>\n",
2010 |        "      <td>GENE_SYMBOL</td>\n",
2011 |        "    </tr>\n",
2012 |        "    <tr>\n",
2013 |        "      <th>2</th>\n",
2014 |        "      <td>2224237.0</td>\n",
2015 |        "      <td>531583.0</td>\n",
2016 |        "      <td>182268.0</td>\n",
2017 |        "      <td>5.60</td>\n",
2018 |        "      <td>pIC50</td>\n",
2019 |        "      <td>=</td>\n",
2020 |        "      <td>2511.89</td>\n",
2021 |        "      <td>nM</td>\n",
2022 |        "      <td>1.0</td>\n",
2023 |        "      <td>IC50</td>\n",
2024 |        "      <td>...</td>\n",
2025 |        "      <td>NaN</td>\n",
2026 |        "      <td>NaN</td>\n",
2027 |        "      <td>NaN</td>\n",
2028 |        "      <td>11653.0</td>\n",
2029 |        "      <td>SINGLE PROTEIN</td>\n",
2030 |        "      <td>Heparanase</td>\n",
2031 |        "      <td>Homo sapiens</td>\n",
2032 |        "      <td>CHEMBL3921</td>\n",
2033 |        "      <td>HPSE</td>\n",
2034 |        "      <td>GENE_SYMBOL</td>\n",
2035 |        "    </tr>\n",
2036 |        "    <tr>\n",
2037 |        "      <th>3</th>\n",
2038 |        "      <td>31865.0</td>\n",
2039 |        "      <td>88152.0</td>\n",
2040 |        "      <td>182268.0</td>\n",
2041 |        "      <td>NaN</td>\n",
2042 |        "      <td>IC50</td>\n",
2043 |        "      <td>&gt;</td>\n",
2044 |        "      <td>50000.00</td>\n",
2045 |        "      <td>nM</td>\n",
2046 |        "      <td>1.0</td>\n",
2047 |        "      <td>IC50</td>\n",
2048 |        "      <td>...</td>\n",
2049 |        "      <td>NaN</td>\n",
2050 |        "      <td>NaN</td>\n",
2051 |        "      <td>NaN</td>\n",
2052 |        "      <td>NaN</td>\n",
2053 |        "      <td>NaN</td>\n",
2054 |        "      <td>NaN</td>\n",
2055 |        "      <td>NaN</td>\n",
2056 |        "      <td>NaN</td>\n",
2057 |        "      <td>NaN</td>\n",
2058 |        "      <td>NaN</td>\n",
2059 |        "    </tr>\n",
2060 |        "    <tr>\n",
2061 |        "      <th>4</th>\n",
2062 |        "      <td>31866.0</td>\n",
2063 |        "      <td>83907.0</td>\n",
2064 |        "      <td>182855.0</td>\n",
2065 |        "      <td>5.05</td>\n",
2066 |        "      <td>IC50</td>\n",
2067 |        "      <td>=</td>\n",
2068 |        "      <td>9000.00</td>\n",
2069 |        "      <td>nM</td>\n",
2070 |        "      <td>1.0</td>\n",
2071 |        "      <td>IC50</td>\n",
2072 |        "      <td>...</td>\n",
2073 |        "      <td>NaN</td>\n",
2074 |        "      <td>NaN</td>\n",
2075 |        "      <td>NaN</td>\n",
2076 |        "      <td>11653.0</td>\n",
2077 |        "      <td>SINGLE PROTEIN</td>\n",
2078 |        "      <td>Heparanase</td>\n",
2079 |        "      <td>Homo sapiens</td>\n",
2080 |        "      <td>CHEMBL3921</td>\n",
2081 |        "      <td>HPSE</td>\n",
2082 |        "      <td>GENE_SYMBOL</td>\n",
2083 |        "    </tr>\n",
2084 |        "  </tbody>\n",
2085 |        "</table>\n",
2086 |        "<p>5 rows × 52 columns</p>\n",
2087 |        "</div>"
2088 |       ],
2089 |       "text/plain": [
2090 |        "   activities|activity_id  activities|assay_id  activities|molregno  \\\n",
2091 |        "0                 31863.0              54505.0             180094.0   \n",
2092 |        "1                 31864.0              83907.0             182268.0   \n",
2093 |        "2               2224237.0             531583.0             182268.0   \n",
2094 |        "3                 31865.0              88152.0             182268.0   \n",
2095 |        "4                 31866.0              83907.0             182855.0   \n",
2096 |        "\n",
2097 |        "   activities|pchembl_value activities|type activities|standard_relation  \\\n",
2098 |        "0                       NaN            IC50                            >   \n",
2099 |        "1                      5.60            IC50                            =   \n",
2100 |        "2                      5.60           pIC50                            =   \n",
2101 |        "3                       NaN            IC50                            >   \n",
2102 |        "4                      5.05            IC50                            =   \n",
2103 |        "\n",
2104 |        "   activities|standard_value activities|standard_units  \\\n",
2105 |        "0                  100000.00                        nM   \n",
2106 |        "1                    2500.00                        nM   \n",
2107 |        "2                    2511.89                        nM   \n",
2108 |        "3                   50000.00                        nM   \n",
2109 |        "4                    9000.00                        nM   \n",
2110 |        "\n",
2111 |        "   activities|standard_flag activities|standard_type  ...  \\\n",
2112 |        "0                       1.0                     IC50  ...   \n",
2113 |        "1                       1.0                     IC50  ...   \n",
2114 |        "2                       1.0                     IC50  ...   \n",
2115 |        "3                       1.0                     IC50  ...   \n",
2116 |        "4                       1.0                     IC50  ...   \n",
2117 |        "\n",
2118 |        "  atc_classification|level3_description atc_classification|level4_description  \\\n",
2119 |        "0                                   NaN                                   NaN   \n",
2120 |        "1                                   NaN                                   NaN   \n",
2121 |        "2                                   NaN                                   NaN   \n",
2122 |        "3                                   NaN                                   NaN   \n",
2123 |        "4                                   NaN                                   NaN   \n",
2124 |        "\n",
2125 |        "  atc_classification|who_name  target_dictionary|tid  \\\n",
2126 |        "0                         NaN                   63.0   \n",
2127 |        "1                         NaN                11653.0   \n",
2128 |        "2                         NaN                11653.0   \n",
2129 |        "3                         NaN                    NaN   \n",
2130 |        "4                         NaN                11653.0   \n",
2131 |        "\n",
2132 |        "   target_dictionary|target_type target_dictionary|pref_name  \\\n",
2133 |        "0                 SINGLE PROTEIN  DNA topoisomerase II alpha   \n",
2134 |        "1                 SINGLE PROTEIN                  Heparanase   \n",
2135 |        "2                 SINGLE PROTEIN                  Heparanase   \n",
2136 |        "3                            NaN                         NaN   \n",
2137 |        "4                 SINGLE PROTEIN                  Heparanase   \n",
2138 |        "\n",
2139 |        "  target_dictionary|organism  target_dictionary|chembl_id  \\\n",
2140 |        "0               Homo sapiens                   CHEMBL1806   \n",
2141 |        "1               Homo sapiens                   CHEMBL3921   \n",
2142 |        "2               Homo sapiens                   CHEMBL3921   \n",
2143 |        "3                        NaN                          NaN   \n",
2144 |        "4               Homo sapiens                   CHEMBL3921   \n",
2145 |        "\n",
2146 |        "  component_synonyms|component_synonym  component_synonyms|syn_type  \n",
2147 |        "0                                TOP2A                  GENE_SYMBOL  \n",
2148 |        "1                                 HPSE                  GENE_SYMBOL  \n",
2149 |        "2                                 HPSE                  GENE_SYMBOL  \n",
2150 |        "3                                  NaN                          NaN  \n",
2151 |        "4                                 HPSE                  GENE_SYMBOL  \n",
2152 |        "\n",
2153 |        "[5 rows x 52 columns]"
2154 |       ]
2155 |      },
2156 |      "execution_count": 13,
2157 |      "metadata": {},
2158 |      "output_type": "execute_result"
2159 |     }
2160 |    ],
2161 |    "source": [
2162 |     "# merge activities and assays data\n",
2163 |     "final_df = activities.merge(assays,how='left',left_on='activities|assay_id',right_on='assays|assay_id')\n",
2164 |     "final_df = final_df[['activities|activity_id', 'activities|assay_id', 'activities|molregno',\n",
2165 |     "                             'activities|pchembl_value', 'activities|type', 'activities|standard_relation', 'activities|standard_value',\n",
2166 |     "                             'activities|standard_units', 'activities|standard_flag','activities|standard_type', 'activities|activity_comment',\n",
2167 |     "                             'assays|description','assays|assay_type','assays|tid', 'assays|confidence_score','assays|curated_by','assays|chembl_id']]\n",
2168 |     "print(f'activities+assays: {final_df.shape}')\n",
2169 |     "\n",
2170 |     "# merge activities and drug_mechanism on both 'molregno' and 'tid': how='outer' keeps rows present in only one of the two tables\n",
2171 |     "final_df = final_df.merge(drug_mechanism[['drug_mechanism|molregno','drug_mechanism|mechanism_of_action','drug_mechanism|tid','drug_mechanism|action_type',]],\n",
2172 |     "                          how='outer',left_on=['activities|molregno','assays|tid'],right_on=['drug_mechanism|molregno','drug_mechanism|tid'])\n",
2173 |     "print(f'added drug mechanism: {final_df.shape}')\n",
2174 |     "\n",
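2175 |     "## remake molregno column by combining: x==x is False only for NaN, so ind picks rows that have a drug_mechanism molregno but no activities molregno\n",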
2176 |     "ind = final_df[(final_df['drug_mechanism|molregno']==final_df['drug_mechanism|molregno'])& \\\n",
2177 |     "         (final_df['activities|molregno']!=final_df['activities|molregno'])].index\n",
2178 |     "final_df['activities_drug_mechanism|molregno']=final_df['activities|molregno'].copy()\n",
2179 |     "final_df.loc[ind,'activities_drug_mechanism|molregno']=final_df.loc[ind,'drug_mechanism|molregno']\n",
2180 |     "print(sum(final_df['activities_drug_mechanism|molregno']==final_df['activities_drug_mechanism|molregno']))\n",
2181 |     "del ind\n",
2182 |     "\n",
2183 |     "## remake tid column by combining: same NaN test, so ind picks rows that have a drug_mechanism tid but no assays tid\n",
2184 |     "ind = final_df[(final_df['drug_mechanism|tid']==final_df['drug_mechanism|tid'])& \\\n",
2185 |     "         (final_df['assays|tid']!=final_df['assays|tid'])].index\n",
2186 |     "final_df['assays_drug_mechanism|tid']=final_df['assays|tid'].copy()\n",
2187 |     "final_df.loc[ind,'assays_drug_mechanism|tid']=final_df.loc[ind,'drug_mechanism|tid']\n",
2188 |     "print(sum(final_df['assays_drug_mechanism|tid']==final_df['assays_drug_mechanism|tid']))\n",
2189 |     "\n",
2190 |     "\n",
2191 |     "# merge compound information\n",
2192 |     "final_df = final_df.merge(molecule_dictionary[['molecule_dictionary|molregno', 'molecule_dictionary|pref_name','molecule_dictionary|chembl_id', 'molecule_dictionary|max_phase', \n",
2193 |     "                                               'molecule_dictionary|molecule_type','molecule_dictionary|oral','molecule_dictionary|parenteral', 'molecule_dictionary|topical',\n",
2194 |     "                                               'molecule_dictionary|black_box_warning','molecule_dictionary|natural_product']],\n",
2195 |     "                          how='left',left_on='activities_drug_mechanism|molregno',right_on='molecule_dictionary|molregno')\n",
2196 |     "\n",
2197 |     "final_df = final_df.merge(molecule_atc_classification[['molecule_atc_classification|molregno','molecule_atc_classification|level5']],\n",
2198 |     "                          how='left',left_on='activities_drug_mechanism|molregno',right_on='molecule_atc_classification|molregno')\n",
2199 |     "\n",
2200 |     "final_df = final_df.merge(atc_classification[['atc_classification|level1','atc_classification|level2','atc_classification|level3','atc_classification|level4','atc_classification|level5',\n",
2201 |     "                                              'atc_classification|level1_description','atc_classification|level2_description',\n",
2202 |     "                                              'atc_classification|level3_description','atc_classification|level4_description','atc_classification|who_name']],\n",
2203 |     "                          how='left',left_on='molecule_atc_classification|level5',right_on='atc_classification|level5')\n",
2204 |     "print(f'added compound info: {final_df.shape}')\n",
2205 |     "\n",
2206 |     "# merge targets\n",
2207 |     "## merging target dataframes first (to link tid with gene symbol)\n",
2208 |     "targets_final = target_dictionary.merge(target_components,how='left',left_on='target_dictionary|tid',right_on='target_components|tid')\n",
2209 |     "targets_final = targets_final.merge(component_synonyms,how='left',left_on='target_components|component_id',right_on='component_synonyms|component_id')\n",
2210 |     "print(targets_final['component_synonyms|syn_type'].value_counts())\n",
2211 |     "\n",
2212 |     "## selecting targets which have 'gene symbol'\n",
2213 |     "## syn_type == 'GENE_SYMBOL'\n",
2214 |     "targets_final = targets_final[targets_final['component_synonyms|syn_type']=='GENE_SYMBOL']\n",
2215 |     "\n",
2216 |     "## merge\n",
2217 |     "final_df = final_df.merge(targets_final[['target_dictionary|tid','target_dictionary|target_type','target_dictionary|pref_name','target_dictionary|organism','target_dictionary|chembl_id',\n",
2218 |     "                                         'component_synonyms|component_synonym', 'component_synonyms|syn_type']],\n",
2219 |     "                          how='left',left_on='assays_drug_mechanism|tid',right_on='target_dictionary|tid')\n",
2220 |     "\n",
2221 |     "print(f'added targets info: {final_df.shape}')\n",
2222 |     "final_df.head()"
2223 |    ]
2224 |   },
2225 |   {
2226 |    "cell_type": "markdown",
2227 |    "id": "piano-gibraltar",
2228 |    "metadata": {},
2229 |    "source": [
2230 |     "## Save the data frame above for access to non-human entries"
2231 |    ]
2232 |   },
2233 |   {
2234 |    "cell_type": "markdown",
2235 |    "id": "vertical-spiritual",
2236 |    "metadata": {},
2237 |    "source": [
2238 |     "## Filtering drugs targeting human molecules"
2239 |    ]
2240 |   },
2241 |   {
2242 |    "cell_type": "code",
2243 |    "execution_count": 30,
2244 |    "id": "handed-bracket",
2245 |    "metadata": {},
2246 |    "outputs": [
2247 |     {
2248 |      "data": {
2249 |       "text/plain": [
2250 |        "(6708016, 53)"
2251 |       ]
2252 |      },
2253 |      "execution_count": 30,
2254 |      "metadata": {},
2255 |      "output_type": "execute_result"
2256 |     }
2257 |    ],
2258 |    "source": [
2259 |     "final_df = final_df[final_df['target_dictionary|organism']=='Homo sapiens'].copy()\n",
2260 |     "final_df.shape"
2261 |    ]
2262 |   },
2263 |   {
2264 |    "cell_type": "code",
2265 |    "execution_count": null,
2266 |    "id": "identified-logan",
2267 |    "metadata": {},
2268 |    "outputs": [],
2269 |    "source": []
2270 |   },
2271 |   {
2272 |    "cell_type": "markdown",
2273 |    "id": "available-sharp",
2274 |    "metadata": {},
2275 |    "source": [
2276 |     "## Add target class"
2277 |    ]
2278 |   },
2279 |   {
2280 |    "cell_type": "markdown",
2281 |    "id": "dependent-oregon",
2282 |    "metadata": {},
2283 |    "source": [
2284 |     "- Based on [IDG Protein list](https://druggablegenome.net/IDGProteinList)\n",
2285 |     "- add 'Ion channel' from [HGNC, GID:177](https://www.genenames.org/data/genegroup/#!/group/177)\n",
2286 |     "- add 'GPCR' from [HGNC, GID:139](https://www.genenames.org/data/genegroup/#!/group/139)\n",
2287 |     "- add 'NHR' (Nuclear Hormone Receptors) from [HGNC, GID:71](https://www.genenames.org/data/genegroup/#!/group/71)"
2288 |    ]
2289 |   },
2290 |   {
2291 |    "cell_type": "code",
2292 |    "execution_count": 31,
2293 |    "id": "expensive-occurrence",
2294 |    "metadata": {},
2295 |    "outputs": [
2296 |     {
2297 |      "data": {
2298 |       "text/plain": [
2299 |        "dict_keys(['Kinase', 'GPCR', 'Ion Channel', 'NHR'])"
2300 |       ]
2301 |      },
2302 |      "execution_count": 31,
2303 |      "metadata": {},
2304 |      "output_type": "execute_result"
2305 |     }
2306 |    ],
2307 |    "source": [
2308 |     "# create dictionary for protein classes\n",
2309 |     "idg = pd.read_csv('IDG_TargetList_Y4.csv')\n",
2310 |     "\n",
2311 |     "targetclass_dict={}\n",
2312 |     "for c in set(idg['IDGFamily']):\n",
2313 |     "    targetclass_dict[c]=list(idg[idg['IDGFamily']==c]['Gene'])\n",
2314 |     "\n",
2315 |     "ion = pd.read_csv('HGNC_GID177_Ion-channels.txt',sep='\\t')\n",
2316 |     "gpcr = pd.read_csv('HGNC_GID139_G-protein-coupled-receptors.txt',sep='\\t')\n",
2317 |     "nr = pd.read_csv('HGNC_GID71_Nuclear-hormone-receptors.txt',sep='\\t')\n",
2318 |     "\n",
2319 |     "targetclass_dict['Ion Channel']=list(set(targetclass_dict['Ion Channel']+list(ion['Approved symbol'])))\n",
2320 |     "targetclass_dict['GPCR']=list(set(targetclass_dict['GPCR']+list(gpcr['Approved symbol'])))\n",
2321 |     "targetclass_dict['NHR']=list(nr['Approved symbol'].unique())\n",
2322 |     "targetclass_dict.keys()"
2323 |    ]
2324 |   },
2325 |   {
2326 |    "cell_type": "code",
2327 |    "execution_count": 32,
2328 |    "id": "understood-graduate",
2329 |    "metadata": {},
2330 |    "outputs": [
2331 |     {
2332 |      "name": "stderr",
2333 |      "output_type": "stream",
2334 |      "text": [
2335 |       "/home/jovyan/my-conda-envs/cellpymc/lib/python3.7/site-packages/ipykernel_launcher.py:13: SettingWithCopyWarning: \n",
2336 |       "A value is trying to be set on a copy of a slice from a DataFrame.\n",
2337 |       "Try using .loc[row_indexer,col_indexer] = value instead\n",
2338 |       "\n",
2339 |       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
2340 |       "  del sys.path[0]\n",
2341 |       "/home/jovyan/my-conda-envs/cellpymc/lib/python3.7/site-packages/ipykernel_launcher.py:14: SettingWithCopyWarning: \n",
2342 |       "A value is trying to be set on a copy of a slice from a DataFrame.\n",
2343 |       "Try using .loc[row_indexer,col_indexer] = value instead\n",
2344 |       "\n",
2345 |       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
2346 |       "  \n"
2347 |      ]
2348 |     },
2349 |     {
2350 |      "data": {
2351 |       "text/plain": [
2352 |        "none           5347947\n",
2353 |        "GPCR            903731\n",
2354 |        "NHR             177300\n",
2355 |        "Ion Channel     174983\n",
2356 |        "Kinase          104055\n",
2357 |        "Name: target_class, dtype: int64"
2358 |       ]
2359 |      },
2360 |      "execution_count": 32,
2361 |      "metadata": {},
2362 |      "output_type": "execute_result"
2363 |     }
2364 |    ],
2365 |    "source": [
2366 |     "# assign protein class to each target\n",
2367 |     "def which_class(dictionary, value):\n",
2368 |     "    out='none'\n",
2369 |     "    for k in dictionary.keys():\n",
2370 |     "        if value in dictionary[k]:\n",
2371 |     "            if out=='none':\n",
2372 |     "                out=k\n",
2373 |     "            else:\n",
2374 |     "                out=f'{out};{k}'\n",
2375 |     "    return out\n",
2376 |     "\n",
2377 |     "# add target class\n",
2378 |     "final_df['target_class']=final_df['component_synonyms|component_synonym'].copy()\n",
2379 |     "final_df['target_class']=[which_class(targetclass_dict,t) for t in final_df['target_class']]\n",
2380 |     "final_df['target_class'].value_counts()"
2381 |    ]
2382 |   },
2383 |   {
2384 |    "cell_type": "code",
2385 |    "execution_count": null,
2386 |    "id": "continued-brick",
2387 |    "metadata": {},
2388 |    "outputs": [],
2389 |    "source": []
2390 |   },
2391 |   {
2392 |    "cell_type": "markdown",
2393 |    "id": "fatal-ranch",
2394 |    "metadata": {},
2395 |    "source": [
2396 |     "## Save"
2397 |    ]
2398 |   },
2399 |   {
2400 |    "cell_type": "code",
2401 |    "execution_count": 34,
2402 |    "id": "metric-arnold",
2403 |    "metadata": {},
2404 |    "outputs": [
2405 |     {
2406 |      "name": "stdout",
2407 |      "output_type": "stream",
2408 |      "text": [
2409 |       "CPU times: user 25.7 s, sys: 5.28 s, total: 31 s\n",
2410 |       "Wall time: 50.3 s\n"
2411 |      ]
2412 |     }
2413 |    ],
2414 |    "source": [
2415 |     "%%time\n",
2416 |     "final_df.to_pickle('chembl_30_merged_genesymbols_humans.pkl')"
2417 |    ]
2418 |   },
2419 |   {
2420 |    "cell_type": "code",
2421 |    "execution_count": null,
2422 |    "id": "southern-furniture",
2423 |    "metadata": {},
2424 |    "outputs": [],
2425 |    "source": []
2426 |   },
2427 |   {
2428 |    "cell_type": "code",
2429 |    "execution_count": null,
2430 |    "id": "apart-scheme",
2431 |    "metadata": {},
2432 |    "outputs": [],
2433 |    "source": []
2434 |   },
2435 |   {
2436 |    "cell_type": "code",
2437 |    "execution_count": null,
2438 |    "id": "herbal-baltimore",
2439 |    "metadata": {},
2440 |    "outputs": [],
2441 |    "source": []
2442 |   },
2443 |   {
2444 |    "cell_type": "markdown",
2445 |    "id": "compatible-intersection",
2446 |    "metadata": {},
2447 |    "source": [
2448 |     "## Session info"
2449 |    ]
2450 |   },
2451 |   {
2452 |    "cell_type": "code",
2453 |    "execution_count": 35,
2454 |    "id": "bored-structure",
2455 |    "metadata": {},
2456 |    "outputs": [
2457 |     {
2458 |      "data": {
2459 |       "text/html": [
2460 |        "<details>\n",
2461 |        "<summary>Click to view session information</summary>\n",
2462 |        "<pre>\n",
2463 |        "-----\n",
2464 |        "numpy               1.21.5\n",
2465 |        "pandas              1.2.0\n",
2466 |        "session_info        1.0.0\n",
2467 |        "-----\n",
2468 |        "</pre>\n",
2469 |        "<details>\n",
2470 |        "<summary>Click to view modules imported as dependencies</summary>\n",
2471 |        "<pre>\n",
2472 |        "backcall            0.2.0\n",
2473 |        "colorama            0.4.4\n",
2474 |        "cython_runtime      NA\n",
2475 |        "dateutil            2.8.1\n",
2476 |        "decorator           4.4.2\n",
2477 |        "google              NA\n",
2478 |        "ipykernel           5.4.2\n",
2479 |        "ipython_genutils    0.2.0\n",
2480 |        "jedi                0.17.2\n",
2481 |        "mpl_toolkits        NA\n",
2482 |        "numexpr             2.7.2\n",
2483 |        "parso               0.7.1\n",
2484 |        "pexpect             4.8.0\n",
2485 |        "pickleshare         0.7.5\n",
2486 |        "prompt_toolkit      3.0.10\n",
2487 |        "ptyprocess          0.7.0\n",
2488 |        "pygments            2.7.4\n",
2489 |        "pytz                2020.5\n",
2490 |        "six                 1.14.0\n",
2491 |        "sphinxcontrib       NA\n",
2492 |        "storemagic          NA\n",
2493 |        "tornado             6.1\n",
2494 |        "traitlets           5.0.5\n",
2495 |        "wcwidth             0.2.5\n",
2496 |        "zmq                 21.0.1\n",
2497 |        "zope                NA\n",
2498 |        "</pre>\n",
2499 |        "</details> <!-- seems like this ends pre, so might as well be explicit -->\n",
2500 |        "<pre>\n",
2501 |        "-----\n",
2502 |        "IPython             7.19.0\n",
2503 |        "jupyter_client      6.1.11\n",
2504 |        "jupyter_core        4.7.0\n",
2505 |        "notebook            6.2.0\n",
2506 |        "-----\n",
2507 |        "Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 21:08:20) [GCC 9.3.0]\n",
2508 |        "Linux-4.15.0-112-generic-x86_64-with-debian-bullseye-sid\n",
2509 |        "-----\n",
2510 |        "Session information updated at 2022-07-12 21:29\n",
2511 |        "</pre>\n",
2512 |        "</details>"
2513 |       ],
2514 |       "text/plain": [
2515 |        "<IPython.core.display.HTML object>"
2516 |       ]
2517 |      },
2518 |      "execution_count": 35,
2519 |      "metadata": {},
2520 |      "output_type": "execute_result"
2521 |     }
2522 |    ],
2523 |    "source": [
2524 |     "import session_info\n",
2525 |     "session_info.show()"
2526 |    ]
2527 |   },
2528 |   {
2529 |    "cell_type": "code",
2530 |    "execution_count": null,
2531 |    "id": "opposed-memory",
2532 |    "metadata": {},
2533 |    "outputs": [],
2534 |    "source": []
2535 |   },
2536 |   {
2537 |    "cell_type": "code",
2538 |    "execution_count": null,
2539 |    "id": "heard-newman",
2540 |    "metadata": {},
2541 |    "outputs": [],
2542 |    "source": []
2543 |   }
2544 |  ],
2545 |  "metadata": {
2546 |   "kernelspec": {
2547 |    "display_name": "cellpymc",
2548 |    "language": "python",
2549 |    "name": "cellpymc"
2550 |   },
2551 |   "language_info": {
2552 |    "codemirror_mode": {
2553 |     "name": "ipython",
2554 |     "version": 3
2555 |    },
2556 |    "file_extension": ".py",
2557 |    "mimetype": "text/x-python",
2558 |    "name": "python",
2559 |    "nbconvert_exporter": "python",
2560 |    "pygments_lexer": "ipython3",
2561 |    "version": "3.7.9"
2562 |   }
2563 |  },
2564 |  "nbformat": 4,
2565 |  "nbformat_minor": 5
2566 | }
2567 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup, find_packages
 2 | 
 3 | setup(
 4 |     name='drug2cell',
 5 |     version='0.1.2',
 6 |     description='Gene group activity utility functions for scanpy',
 7 |     url='https://github.com/Teichlab/drug2cell',
 8 |     packages=find_packages(exclude=['docs', 'notebooks']),
 9 |     install_requires=[
10 |         'anndata',
11 |         'pandas',
12 |         'numpy',
13 |         'statsmodels',
14 |         'scipy',
15 |         'blitzgsea'
16 |     ],
17 |     package_data={
18 |         "drug2cell": ["*.pkl"]
19 |     },
20 |     author='Krzysztof Polanski, Kazumasa Kanemaru',
21 |     author_email='kp9@sanger.ac.uk',
22 |     license='non-commercial license'
23 | )
24 | 


--------------------------------------------------------------------------------