├── .gitignore ├── .readthedocs.yaml ├── CHANGELOG.md ├── LICENSE ├── README.md ├── docs ├── Makefile ├── conf.py ├── index.rst ├── make.bat └── requirements.txt ├── drug2cell ├── __init__.py ├── chembl.py ├── cpdb_dict.pkl ├── data.py ├── drug-target_dicts.pkl └── util.py ├── notebooks ├── chembl │ ├── filtering.ipynb │ └── initial_database_parsing.ipynb └── demo.ipynb └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | docs/_build 3 | drug2cell/__pycache__ 4 | -------------------------------------------------------------------------------- /.readthedocs.yaml: -------------------------------------------------------------------------------- 1 | # .readthedocs.yaml 2 | # Read the Docs configuration file 3 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details 4 | 5 | # Required 6 | version: 2 7 | 8 | # Set the version of Python and other tools you might need 9 | build: 10 | os: ubuntu-20.04 11 | tools: 12 | python: "3.9" 13 | 14 | # Build documentation in the docs/ directory with Sphinx 15 | sphinx: 16 | configuration: docs/conf.py 17 | 18 | # Install our python package before building the docs 19 | python: 20 | install: 21 | - requirements: docs/requirements.txt 22 | - method: pip 23 | path: . 
24 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## 0.1.2 4 | - regenerated drug target dictionary with updated thresholds 5 | 6 | ## 0.1.1 7 | - update blitzGSEA to make use of pip install and ``interactive_plot`` 8 | 9 | ## 0.1.0 10 | - initial release 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Drug2cell, (c) 2023, GRL (the "Software") 2 | 3 | The Software remains the property of Genome Research Ltd ("GRL"). 4 | 5 | The Software is distributed "AS IS" under this Licence solely for non-commercial use in the hope that it will be useful, but in order that GRL as a charitable foundation protects its assets for the benefit of its educational and research purposes, GRL makes clear that no condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of the Software, or that it will be suitable for any particular purpose or for use under any specific conditions. Furthermore, GRL disclaims all responsibility for the use which is made of the Software. It further disclaims any liability for the outcomes arising from using the Software. 6 | 7 | The Licensee agrees to indemnify GRL and hold GRL harmless from and against any and all claims, damages and liabilities asserted by third parties (including claims for negligence) which arise directly or indirectly from the use of the Software or the sale of any products based on the Software. 8 | 9 | No part of the Software may be reproduced, modified, transmitted or transferred in any form or by any means, electronic or mechanical, without the express permission of GRL. 
The permission of GRL is not required if the said reproduction, modification, transmission or transference is done without financial return, the conditions of this Licence are imposed upon the receiver of the product, and all original and amended source code is included in any transmitted product. You may be held legally responsible for any copyright infringement that is caused or encouraged by your failure to abide by these terms and conditions. 10 | 11 | You are not permitted under this Licence to use this Software commercially. Use for which any financial return is received shall be defined as commercial use, and includes (1) integration of all or part of the source code or the Software into a product for sale or license by or on behalf of Licensee to third parties or (2) use of the Software or any derivative of it for research with the final aim of developing software products for sale or license to a third party or (3) use of the Software or any derivative of it for research with the final aim of developing non-software products for sale or license to a third party, or (4) use of the Software to provide any service to an external organisation for which payment is received. If you are interested in using the Software commercially, please contact legal@sanger.ac.uk. Contact details are: legal@sanger.ac.uk quoting reference Drug2cell-software. 12 | 13 | 14 | 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Drug2cell 2 | 3 | This is a collection of utility functions for gene group activity evaluation in scanpy, both for per-cell scoring and marker-based enrichment/overrepresentation analyses. Drug2cell makes use of established methodology, and offers it in a convenient/efficient form for easy scanpy use. 
The package comes with a set of ChEMBL-derived drug target sets, but has straightforward input formatting so it can be easily used with any gene groups of choice. 4 | 5 | ## Installation 6 | 7 | ```bash 8 | pip install drug2cell 9 | ``` 10 | 11 | ## Usage 12 | 13 | ```python3 14 | import drug2cell as d2c 15 | ``` 16 | 17 | **Per-cell scoring** can be done via `d2c.score()`, and creates a new gene group feature space object in `.uns['drug2cell']` of the supplied object. The `.obs` and `.obsm` of the original object are copied over for ease of downstream use, like plotting or potential marker detection. 18 | 19 | **Enrichment** is done with GSEA, contained within the function `d2c.gsea()`. **Overrepresentation** analysis can be done with the hypergeometric test of `d2c.hypergeometric()`. Both of those require having run `sc.tl.rank_genes_groups()` on the input object, and return one data frame per evaluated cluster with enrichment/overrepresentation results. 20 | 21 | It's possible to provide your own gene groups as a dictionary, with the names of the groups as keys and the corresponding gene lists as the values. Pass this dictionary as the `targets` argument of any of those three functions. 22 | 23 | **Please refer to the [demo notebook](https://nbviewer.org/github/Teichlab/drug2cell/blob/main/notebooks/demo.ipynb); all of the input arguments are detailed in [ReadTheDocs](https://drug2cell.readthedocs.io/en/latest/).** 24 | 25 | ## ChEMBL parsing 26 | 27 | Drug2cell also features instructions on how to parse the ChEMBL database into drugs and their targets. Two additional notebooks are included. Both refer to a pre-parsed data frame of ChEMBL human targets, which can be accessed at `ftp://ftp.sanger.ac.uk/pub/users/kp9/chembl_30_merged_genesymbols_humans.pkl`. 
28 | - [Filtering](https://nbviewer.org/github/Teichlab/drug2cell/blob/main/notebooks/chembl/filtering.ipynb) shows how this data frame was turned into the drugs:targets dictionary shipped with the package by default. There are some helper functions included in `d2c.chembl` which can assist you should you wish to filter it in a different way. 29 | - [Initial database parsing](https://nbviewer.org/github/Teichlab/drug2cell/blob/main/notebooks/chembl/initial_database_parsing.ipynb) shows how the data frame was created from online resources. 30 | 31 | ## Citation 32 | 33 | If you use drug2cell in your work, please cite the [paper](https://www.nature.com/articles/s41586-023-06311-1): 34 | 35 | ``` 36 | @article{kanemaru2023spatially, 37 | title={Spatially resolved multiomics of human cardiac niches}, 38 | author={Kanemaru, Kazumasa and Cranley, James and Muraro, Daniele and Miranda, Antonio MA and Ho, Siew Yen and Wilbrey-Clark, Anna and Patrick Pett, Jan and Polanski, Krzysztof and Richardson, Laura and Litvinukova, Monika and others}, 39 | journal={Nature}, 40 | volume={619}, 41 | number={7971}, 42 | pages={801--810}, 43 | year={2023}, 44 | publisher={Nature Publishing Group UK London} 45 | } 46 | ``` 47 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. 
$(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for the Sphinx documentation builder. 2 | # 3 | # For the full list of built-in configuration values, see the documentation: 4 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 5 | 6 | # If extensions (or modules to document with autodoc) are in another directory, 7 | # add these directories to sys.path here. If the directory is relative to the 8 | # documentation root, use os.path.abspath to make it absolute, like shown here. 9 | # 10 | import os 11 | import sys 12 | sys.path.insert(0, os.path.abspath('..')) 13 | autodoc_mock_imports = ['anndata', 'pandas', 'numpy', 'statsmodels.stats.multitest', 'scipy.sparse', 'scipy.stats', 'pkg_resources', 'collections', 'blitzgsea', 'statsmodels', 'scipy'] 14 | 15 | # -- Project information ----------------------------------------------------- 16 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information 17 | 18 | project = 'Drug2cell' 19 | copyright = '2023, Krzysztof Polanski, Kazumasa Kanemaru' 20 | author = 'Krzysztof Polanski, Kazumasa Kanemaru' 21 | release = '0.1.2' 22 | 23 | # -- General configuration --------------------------------------------------- 24 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration 25 | 26 | extensions = ['sphinx.ext.autodoc'] 27 | 28 | templates_path = ['_templates'] 29 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 30 | 31 | 32 | 33 | # -- Options for HTML output ------------------------------------------------- 34 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output 35 | 36 | html_theme = 'sphinx_rtd_theme' 37 | 
html_static_path = ['_static'] 38 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. Drug2cell documentation master file, created by 2 | sphinx-quickstart on Thu Feb 2 10:21:06 2023. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to Drug2cell's documentation! 7 | ===================================== 8 | 9 | .. automodule:: drug2cell 10 | :members: gsea, hypergeometric, score 11 | 12 | Gene group loading 13 | ================== 14 | 15 | .. automodule:: drug2cell.data 16 | :members: chembl, consensuspathdb 17 | 18 | Utility functions 19 | ================= 20 | 21 | .. automodule:: drug2cell.util 22 | :members: prepare_plot_args, plot_gsea 23 | 24 | .. automodule:: drug2cell.chembl 25 | :members: filter_activities -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=. 11 | set BUILDDIR=_build 12 | 13 | %SPHINXBUILD% >NUL 2>NUL 14 | if errorlevel 9009 ( 15 | echo. 16 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 17 | echo.installed, then set the SPHINXBUILD environment variable to point 18 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 19 | echo.may add the Sphinx directory to PATH. 20 | echo. 
21 | echo.If you don't have Sphinx installed, grab it from 22 | echo.https://www.sphinx-doc.org/ 23 | exit /b 1 24 | ) 25 | 26 | if "%1" == "" goto help 27 | 28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 29 | goto end 30 | 31 | :help 32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 33 | 34 | :end 35 | popd 36 | -------------------------------------------------------------------------------- /docs/requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx-rtd-theme 2 | -------------------------------------------------------------------------------- /drug2cell/__init__.py: -------------------------------------------------------------------------------- 1 | import blitzgsea 2 | import anndata 3 | import pandas as pd 4 | import numpy as np 5 | 6 | #gives access to submodules 7 | from . import chembl 8 | from . import data 9 | from . import util 10 | 11 | from statsmodels.stats.multitest import multipletests 12 | from scipy.sparse import issparse 13 | from scipy.stats import hypergeom 14 | 15 | def _sparse_nanmean(X, axis): 16 | # function from https://github.com/scverse/scanpy/blob/034ca2823804645e0d4874c9b16ba2eb8c13ac0f/scanpy/tools/_score_genes.py 17 | """ 18 | np.nanmean equivalent for sparse matrices 19 | """ 20 | if not issparse(X): 21 | raise TypeError("X must be a sparse matrix") 22 | 23 | # count the number of nan elements per row/column (dep. 
on axis) 24 | Z = X.copy() 25 | Z.data = np.isnan(Z.data) 26 | Z.eliminate_zeros() 27 | n_elements = Z.shape[axis] - Z.sum(axis) 28 | 29 | # set the nans to 0, so that a normal .sum() works 30 | Y = X.copy() 31 | Y.data[np.isnan(Y.data)] = 0 32 | Y.eliminate_zeros() 33 | 34 | # the average 35 | s = Y.sum(axis, dtype='float64') # float64 for score_genes function compatibility) 36 | m = s / n_elements 37 | 38 | return m 39 | 40 | def _mean(X,names,axis): 41 | ''' 42 | Helper function to compute a mean of X across an axis, respecting names and possible nans. 43 | 44 | Derived from sc.tl.score_genes() logic. 45 | ''' 46 | if issparse(X): 47 | obs_avg = pd.Series( 48 | np.array(_sparse_nanmean(X, axis=axis)).flatten(), 49 | index=names, 50 | ) # average expression of genes 51 | else: 52 | obs_avg = pd.Series( 53 | np.nanmean(X, axis=axis), index=names 54 | ) # average expression of genes 55 | return obs_avg 56 | 57 | def score(adata, targets=None, nested=False, categories=None, method="mean", layer=None, use_raw=False, n_bins=25, ctrl_size=50, sep=","): 58 | ''' 59 | Obtain per-cell scoring of gene groups of interest. Distributed with a set of 60 | ChEMBL drug targets that can be used immediately. 61 | 62 | Please ensure that the gene nomenclature in your target sets is compatible with your 63 | ``.var_names`` (or ``.raw.var_names``). The ChEMBL drug targets use HGNC (human gene 64 | names in line with standard cell ranger mapping output). 65 | 66 | Adds ``.uns['drug2cell']`` to the input AnnData, a new AnnData object with the same 67 | observation space but with the scored gene groups as the features. The gene group 68 | members used to compute the scores will be listed in ``.var['genes']`` of the new 69 | object. 70 | 71 | Input 72 | ----- 73 | adata : ``AnnData`` 74 | Using log-normalised data is recommended. 75 | targets : ``dict`` of lists of ``str``, optional (default: ``None``) 76 | The gene groups to evaluate. 
Can be targets of known drugs, GO terms, pathway 77 | memberships, anything you can assign genes to. If ``None``, will load the 78 | ChEMBL-derived drug target sets distributed with the package. 79 | 80 | Accepts two forms: 81 | 82 | - A dictionary with the names of the groups as keys, and the entries being the 83 | corresponding gene lists. 84 | - A dictionary of dictionaries defined like above, with names of gene group 85 | categories as keys. If passing one of those, please specify ``nested=True``. 86 | 87 | nested : ``bool``, optional (default: ``False``) 88 | Whether ``targets`` is a dictionary of dictionaries with group categories as keys. 89 | categories : ``str`` or list of ``str``, optional (default: ``None``) 90 | If ``targets=None`` or ``nested=True``, this argument can be used to subset the 91 | gene groups to one or more categories (keys of the original dictionary). In case 92 | of the ChEMBL drug targets, these are ATC level 1/level 2 category codes. 93 | method : ``str``, optional (default: ``"mean"``) 94 | The method to use to score the gene groups. The default is ``"mean"``, which 95 | computes the mean over all the genes. The other option is ``"seurat"``, which 96 | generates an appropriate background profile for each target set and subtracts it 97 | from the mean. This is inspired by ``sc.tl.score_genes()`` logic, which in turn 98 | was inspired by Seurat's gene group scoring algorithm. 99 | layer : ``str``, optional (default: ``None``) 100 | Which ``.layers`` of the input AnnData to use for the expression values. If 101 | ``None``, will default to ``.X``. 102 | use_raw : ``bool``, optional (default: ``False``) 103 | Whether to use ``.raw.X`` for the expression values. 104 | n_bins : ``int``, optional (default: 25) 105 | Only used with ``method="seurat"``. The number of expression bins to partition the 106 | feature space into. 107 | ctrl_size : ``int``, optional (default: 50) 108 | Only used with ``method="seurat"``. 
The number of genes to randomly sample from 109 | each expression bin. 110 | sep : ``str``, optional (default: ``","``) 111 | What delimiter to use when storing the corresponding gene groups for each feature 112 | in ``.uns['drug2cell'].var['genes']`` 113 | ''' 114 | #select expression and gene names to use 115 | if layer is not None: 116 | if use_raw: 117 | raise ValueError("Cannot specify `layer` and have `use_raw=True`.") 118 | X = adata.layers[layer] 119 | var_names = adata.var_names 120 | else: 121 | if use_raw and adata.raw is not None: 122 | X = adata.raw.X 123 | var_names = adata.raw.var_names 124 | else: 125 | X = adata.X 126 | var_names = adata.var_names 127 | #get {group:[targets]} form of gene groups to evaluate based on arguments 128 | #skip target reconstruction to overwrite any potential existing scoring 129 | targets = util.prepare_targets( 130 | adata, 131 | targets=targets, 132 | nested=nested, 133 | categories=categories, 134 | sep=sep, 135 | reconstruct=False 136 | ) 137 | #store full list of targets to have them on tap for later 138 | full_targets = targets.copy() 139 | #turn the list of gene IDs to a boolean mask of var_names 140 | for drug in targets: 141 | targets[drug] = np.isin(var_names, targets[drug]) 142 | #perform scoring 143 | #the scoring shall be done via matrix multiplication 144 | #of the original cell by gene matrix, by a new gene by drug matrix 145 | #with the entries in the new matrix being the weights of each gene for that drug 146 | #the first part, the mean across targets, is constant; prepare weights for that 147 | weights = pd.DataFrame(targets, index=var_names) 148 | #kick out drugs with no targets 149 | weights = weights.loc[:, weights.sum()>0] 150 | #scale to 1 sum for each column, weights for mean acquired. 
get mean 151 | weights = weights/weights.sum() 152 | if issparse(X): 153 | scores = X.dot(weights) 154 | else: 155 | scores = np.dot(X, weights) 156 | #the second part only happens for seurat scoring 157 | #logic inspired by sc.tl.score_genes() 158 | if method == "seurat": 159 | #obtain per-gene means 160 | obs_avg = _mean(X, names=var_names, axis=0) 161 | #bin the genes (score_genes() logic in full effect here) 162 | n_items = int(np.round(len(obs_avg) / (n_bins - 1))) 163 | obs_cut = obs_avg.rank(method='min') // n_items 164 | #we'll be working on the array for ease of stuff going forward 165 | obs_cut = obs_cut.values 166 | #compute a cell by control group matrix 167 | #using matrix multiplication again 168 | #so build a gene by control group matrix to enable it 169 | control_groups = {} 170 | for cut in np.unique(obs_cut): 171 | #get the locations of the value 172 | mask = obs_cut == cut 173 | #get the nonzero values, i.e. indices of the locations 174 | r_genes = np.nonzero(mask)[0] 175 | #shuffle these indices, and only keep the top N in the mask 176 | np.random.shuffle(r_genes) 177 | mask[r_genes[ctrl_size:]] = False 178 | #store mask 179 | control_groups[cut] = mask 180 | #turn to weights matrix, like earlier 181 | control_gene_weights = pd.DataFrame(control_groups, index=var_names) 182 | control_gene_weights = control_gene_weights/control_gene_weights.sum() 183 | #compute control profiles 184 | if issparse(X): 185 | control_profiles = X.dot(control_gene_weights) 186 | else: 187 | control_profiles = np.dot(X, control_gene_weights) 188 | #identify the bins for each drug 189 | drug_bins = {} 190 | #use the subset form that's in weights 191 | for drug in weights.columns: 192 | #targets is still a handy boolean vector of membership 193 | #get the bins for this drug 194 | bins = np.unique(obs_cut[targets[drug]]) 195 | #mask the bin order in the existing variables with this 196 | #control_gene_weights has column names 197 | #control_profiles is ordered the same 
way but is just values 198 | drug_bins[drug] = np.isin(control_gene_weights.columns, bins) 199 | #turn to weights matrix again 200 | drug_weights = pd.DataFrame(drug_bins, index=control_gene_weights.columns) 201 | drug_weights = drug_weights/drug_weights.sum() 202 | #can now get the seurat reference profile via matrix multiplication 203 | #subtract from the existing scores 204 | seurat = np.dot(control_profiles, drug_weights) 205 | scores = scores - seurat 206 | #we now have the final form of the scores 207 | #create a little helper adata thingy based on them 208 | #store existing .obsm in there for ease of plotting stuff 209 | adata.uns['drug2cell'] = anndata.AnnData(scores, obs=adata.obs) 210 | adata.uns['drug2cell'].var_names = weights.columns 211 | adata.uns['drug2cell'].obsm = adata.obsm 212 | #store gene group membership, going back to targets for it 213 | for drug in weights.columns: 214 | #mask the var_names with the membership, and join into a single delimited string 215 | adata.uns['drug2cell'].var.loc[drug, 'genes'] = sep.join(var_names[targets[drug]]) 216 | #pull out the old full membership dict and store that too 217 | adata.uns['drug2cell'].var.loc[drug, 'all_genes'] = sep.join(full_targets[drug]) 218 | 219 | def gsea(adata, targets=None, nested=False, categories=None, absolute=False, plot_args=True, sep=",", **kwargs): 220 | ''' 221 | Perform gene set enrichment analysis on the marker gene scores computed for the 222 | original object. Uses blitzgsea. 223 | 224 | Returns: 225 | 226 | - a dictionary with clusters for which the original object markers were computed \ 227 | as the keys, and data frames of test results sorted on q-value as the items 228 | 229 | - a helper variable with plotting arguments for ``d2c.util.plot_gsea()``, if \ 230 | ``plot_args=True``. ``['scores']`` has the GSEA input, and ``['targets']`` is the \ 231 | gene group dictionary that was used. 
232 | 233 | Input 234 | ----- 235 | adata : ``AnnData`` 236 | With marker genes computed via ``sc.tl.rank_genes_groups()`` in the original 237 | expression space. 238 | targets : ``dict`` of lists of ``str``, optional (default: ``None``) 239 | The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 240 | memberships, anything you can assign genes to. If ``None``, will use 241 | ``d2c.score()`` output if present, and if not present load the ChEMBL-derived 242 | drug target sets distributed with the package. 243 | 244 | Accepts two forms: 245 | 246 | - A dictionary with the names of the groups as keys, and the entries being the 247 | corresponding gene lists. 248 | - A dictionary of dictionaries defined like above, with names of gene group 249 | categories as keys. If passing one of those, please specify ``nested=True``. 250 | 251 | nested : ``bool``, optional (default: ``False``) 252 | Whether ``targets`` is a dictionary of dictionaries with group categories as keys. 253 | categories : ``str`` or list of ``str``, optional (default: ``None``) 254 | If ``targets=None`` or ``nested=True``, this argument can be used to subset the 255 | gene groups to one or more categories (keys of the original dictionary). In case 256 | of the ChEMBL drug targets, these are ATC level 1/level 2 category codes. 257 | absolute : ``bool``, optional (default: ``False``) 258 | If ``True``, pass the absolute values of scores to GSEA. Improves statistical 259 | power. 260 | plot_args : ``bool``, optional (default: ``True``) 261 | Whether to return the second piece of output that holds pre-compiled information 262 | for ``d2c.util.plot_gsea()``. 263 | sep : ``str``, optional (default: ``","``) 264 | The delimiter that was used with ``d2c.score()`` for gene group storage. 265 | kwargs 266 | Any additional arguments to pass to ``blitzgsea.gsea()``. 
267 | ''' 268 | #get {group:[targets]} form of gene groups to evaluate based on arguments 269 | #allow for target reconstruction for when this is ran after scoring 270 | targets = util.prepare_targets( 271 | adata, 272 | targets=targets, 273 | nested=nested, 274 | categories=categories, 275 | sep=sep, 276 | reconstruct=True 277 | ) 278 | #store the GSEA results in a dictionary, with groups as the keys 279 | enrichment = {} 280 | #the plotting-minded output can already store its targets 281 | #and will keep track of scores as they get made during the loop 282 | plot_gsea_args = {"targets":targets, "scores":{}} 283 | #this gets the names of the clusters in the original marker output 284 | for cluster in adata.uns['rank_genes_groups']['names'].dtype.names: 285 | #prepare blitzgsea input 286 | df = pd.DataFrame({"0":adata.uns['rank_genes_groups']['names'][cluster], 287 | "1":adata.uns['rank_genes_groups']['scores'][cluster]}) 288 | #possibly sort on absolute value of scores 289 | if absolute: 290 | df["1"] = np.absolute(df["1"]) 291 | df = df.sort_values("1", ascending=False) 292 | #compute GSEA and store results/scores in output 293 | enrichment[cluster] = blitzgsea.gsea(df, targets, **kwargs) 294 | plot_gsea_args["scores"][cluster] = df 295 | #provide output 296 | if plot_args: 297 | return enrichment, plot_gsea_args 298 | else: 299 | return enrichment 300 | 301 | def hypergeometric(adata, targets=None, nested=False, categories=None, pvals_adj_thresh=0.05, direction="both", corr_method="benjamini-hochberg", sep=","): 302 | ''' 303 | Perform a hypergeometric test to assess the overrepresentation of gene group members 304 | among marker genes computed for the original object. 305 | 306 | Returns a dictionary with clusters for which the original object markers were computed 307 | as the keys, and data frames of test results sorted on q-value as the items. 
308 | 309 | Input 310 | ----- 311 | adata : ``AnnData`` 312 | With marker genes computed via ``sc.tl.rank_genes_groups()`` in the original 313 | expression space. 314 | targets : ``dict`` of lists of ``str``, optional (default: ``None``) 315 | The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 316 | memberships, anything you can assign genes to. If ``None``, will use 317 | ``d2c.score()`` output if present, and if not present load the ChEMBL-derived 318 | drug target sets distributed with the package. 319 | 320 | Accepts two forms: 321 | 322 | - A dictionary with the names of the groups as keys, and the entries being the 323 | corresponding gene lists. 324 | - A dictionary of dictionaries defined like above, with names of gene group 325 | categories as keys. If passing one of those, please specify ``nested=True``. 326 | 327 | nested : ``bool``, optional (default: ``False``) 328 | Whether ``targets`` is a dictionary of dictionaries with group categories as keys. 329 | categories : ``str`` or list of ``str``, optional (default: ``None``) 330 | If ``targets=None`` or ``nested=True``, this argument can be used to subset the 331 | gene groups to one or more categories (keys of the original dictionary). In case 332 | of the ChEMBL drug targets, these are ATC level 1/level 2 category codes. 333 | pvals_adj_thresh : ``float``, optional (default: ``0.05``) 334 | The ``pvals_adj`` cutoff to use on the ``sc.tl.rank_genes_groups()`` output to 335 | identify markers. 336 | direction : ``str``, optional (default: ``"both"``) 337 | Whether to seek out up/down-regulated genes for the groups, based on the values 338 | from ``scores``. Can be ``"up"``, ``"down"``, or ``"both"`` (for no selection). 339 | corr_method : ``str``, optional (default: ``"benjamini-hochberg"``) 340 | Which multiple testing correction to apply to the p-values of the hypergeometric 341 | test. Can be ``"benjamini-hochberg"`` or ``"bonferroni"``. 
342 | sep : ``str``, optional (default: ``","``) 343 | The delimiter that was used with ``d2c.score()`` for gene group storage. 344 | ''' 345 | #get the universe of available genes, as a set for easy intersecting 346 | if adata.uns['rank_genes_groups']['params']['use_raw']: 347 | universe = set(adata.raw.var_names) 348 | else: 349 | universe = set(adata.var_names) 350 | #get {group:[targets]} form of gene groups to evaluate based on arguments 351 | #allow for target reconstruction for when this is ran after scoring 352 | targets = util.prepare_targets( 353 | adata, 354 | targets=targets, 355 | nested=nested, 356 | categories=categories, 357 | sep=sep, 358 | reconstruct=True 359 | ) 360 | #intersect each group membership with the universe after turning it to a set 361 | for group in targets: 362 | targets[group] = set(targets[group]).intersection(universe) 363 | #kick out any empty keys, using dictionary comprehension 364 | targets = {k:v for k,v in targets.items() if v} 365 | #store the hypergeometric results in a dictionary, with groups as the keys 366 | overrepresentation = {} 367 | #this gets the names of the clusters in the original marker output 368 | for cluster in adata.uns['rank_genes_groups']['names'].dtype.names: 369 | #prepare overrepresentation output data frame 370 | results = pd.DataFrame( 371 | 1, 372 | index=list(targets.keys()), 373 | columns=['intersection','gene_group','markers','universe','pvals', 'pvals_adj'] 374 | ) 375 | #pre-type pvals as floats as otherwise pandas complains about deprecation 376 | results = results.astype({"pvals":"float64", "pvals_adj":"float64"}) 377 | #identify the markers for this group from the original output 378 | #construct mask for significance 379 | mask = adata.uns['rank_genes_groups']['pvals_adj'][cluster] < pvals_adj_thresh 380 | #the sign of the score indicates up/down-regulation 381 | if direction == "up": 382 | mask = mask & (adata.uns['rank_genes_groups']['scores'][cluster] > 0) 383 | elif direction == 
"down": 384 | mask = mask & (adata.uns['rank_genes_groups']['scores'][cluster] < 0) 385 | #do this as a set because it will be easier to intersect later 386 | markers = set(adata.uns['rank_genes_groups']['names'][cluster][mask]) 387 | #at this point we know how many genes we have as markers and the universe 388 | results['markers'] = len(markers) 389 | results['universe'] = len(universe) 390 | #perform hypergeometric test to assess overrepresentation 391 | #loop over the scored gene groups 392 | for ind in results.index: 393 | #retrieve gene membership and the intersection 394 | gene_group = targets[ind] 395 | common = gene_group.intersection(markers) 396 | results.loc[ind, 'intersection'] = len(common) 397 | results.loc[ind, 'gene_group'] = len(gene_group) 398 | #need to subtract 1 from the intersection length 399 | #https://alexlenail.medium.com/understanding-and-implementing-the-hypergeometric-test-in-python-a7db688a7458 400 | pval = hypergeom.sf( 401 | len(common)-1, 402 | len(universe), 403 | len(markers), 404 | len(gene_group) 405 | ) 406 | results.loc[ind, 'pvals'] = pval 407 | #multiple testing correction 408 | #mirror sc.tl.rank_genes_groups() logic for consistency 409 | #just in case any NaNs popped up somehow, fill them to 1 so FDR works 410 | results = results.fillna(1) 411 | if corr_method == "benjamini-hochberg": 412 | results['pvals_adj'] = multipletests(results['pvals'], method="fdr_bh")[1] 413 | elif corr_method == "bonferroni": 414 | results['pvals_adj'] = np.minimum(results['pvals']*results.shape[0], 1.0) 415 | #sort on q-value and store output 416 | overrepresentation[cluster] = results.sort_values("pvals_adj") 417 | #that's it. 
return the dictionary 418 | return overrepresentation -------------------------------------------------------------------------------- /drug2cell/chembl.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | def filter_activities(dataframe, 5 | drug_max_phase=None, 6 | add_drug_mechanism=True, 7 | assay_type=None, 8 | remove_inactive=True, 9 | include_active=True, 10 | pchembl_target_column=None, 11 | pchembl_threshold=None 12 | ): 13 | ''' 14 | Perform a sequential set of filtering operations on the provided ChEMBL data frame. 15 | The order of the filters matches the order of the arguments in the input description. 16 | Returns a data frame with the rows fulfilling the resulting criteria. 17 | 18 | Input 19 | ----- 20 | dataframe : ``pd.DataFrame`` 21 | The ChEMBL data frame to perform filtering operations on. 22 | drug_max_phase : ``int`` or list of ``int``, optional (default: ``None``) 23 | Subset the data frame to drugs in the provided clinical stages: 24 | 25 | - Phase 1: Testing of drug on healthy volunteers for dose-ranging 26 | - Phase 2: Initial testing of drug on patients to assess efficacy and safety 27 | - Phase 3: Testing of drug on patients to assess efficacy, effectiveness and safety (larger test group) 28 | - Phase 4: approved 29 | 30 | add_drug_mechanism : ``bool``, optional (default: ``True``) 31 | Grant subsequent filtering immunity to rows with drug mechanism information 32 | present. 33 | assay_type : ``str`` or list of ``str``, optional (default: ``None``) 34 | Subset the data frame based on assay type information: 35 | 36 | - Binding (B) - Data measuring binding of compound to a molecular target, e.g. Ki, IC50, Kd. 37 | - Functional (F) - Data measuring the biological effect of a compound, e.g. %cell death in a cell line, rat weight. 38 | - ADMET (A) - ADME data e.g. t1/2, oral bioavailability. 
39 | - Toxicity (T) - Data measuring toxicity of a compound, e.g., cytotoxicity. 40 | - Physicochemical (P) - Assays measuring physicochemical properties of the compounds in the absence of biological material e.g., chemical stability, solubility. 41 | - Unclassified (U) - A small proportion of assays cannot be classified into one of the above categories e.g., ratio of binding vs efficacy. 42 | 43 | remove_inactive : ``bool``, optional (default: ``True``) 44 | Subset the data frame to remove inactive drug-target interactions. 45 | include_active : ``bool``, optional (default: ``True``) 46 | Grant subsequent filtering immunity to active drug-target interactions. 47 | pchembl_target_column : ``str``, optional (default: ``None``) 48 | Use the selected column in the data frame to dictate custom pChEMBL thresholds 49 | for each unique value in the column. 50 | pchembl_threshold : ``float`` or ``dict`` of ``float``, optional (default: ``None``) 51 | Subset the data frame to this pChEMBL minimum. If a single ``float``, use that 52 | value. If a ``dict`` provided in conjunction with a ``pchembl_target_column``, 53 | have the unique values of the specified column as keys of the dictionary, with 54 | entries being the desired threshold for that category. 
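Example (an illustrative, standalone sketch that is not part of the package): the snippet below mimics the per-class thresholding requested by a ``dict``-valued ``pchembl_threshold``, using plain Python and made-up rows instead of a real ChEMBL data frame. The drug names, classes and values are hypothetical, and the comparison is inclusive to match the ``dict`` branch of this function.

```python
# Standalone sketch of per-class pChEMBL thresholding.
# It does not call filter_activities(); the rows are invented for illustration.
rows = [
    {"drug": "drug_a", "target_class": "Kinase", "pchembl": 8.1},
    {"drug": "drug_b", "target_class": "Kinase", "pchembl": 7.0},
    {"drug": "drug_c", "target_class": "GPCR", "pchembl": 7.0},
    {"drug": "drug_d", "target_class": "GPCR", "pchembl": 6.2},
]

# one pChEMBL minimum per unique value of the class column
thresholds = {"Kinase": 7.53, "GPCR": 7.0}


def passes(row, thresholds):
    # a row is retained when it meets the minimum set for its own class
    return row["pchembl"] >= thresholds[row["target_class"]]


kept = [row["drug"] for row in rows if passes(row, thresholds)]
print(kept)
```

In the function itself, rows already flagged in the ``keep`` helper column bypass this check.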
55 | ''' 56 | 57 | #create a helper column that will be set to True as needed to protect rows from deletion 58 | dataframe['keep']=False 59 | 60 | # check boolean inputs 61 | if type(add_drug_mechanism) != bool: 62 | raise TypeError("argument 'add_drug_mechanism' should be 'bool' type") 63 | if type(remove_inactive) != bool: 64 | raise TypeError("argument 'remove_inactive' should be 'bool' type") 65 | if type(include_active) != bool: 66 | raise TypeError("argument 'include_active' should be 'bool' type") 67 | 68 | # filter based on drug max phase 69 | if drug_max_phase!=None: 70 | if type(drug_max_phase)==list: 71 | dataframe = dataframe[dataframe['molecule_dictionary|max_phase'].isin(drug_max_phase)] 72 | elif type(drug_max_phase)==int: 73 | dataframe = dataframe[dataframe['molecule_dictionary|max_phase']==drug_max_phase] 74 | else: 75 | raise TypeError("argument 'drug_max_phase' should be 'int' or 'list' type") 76 | 77 | # grant subsequent filtering immunity in the event of drug mechanism presence 78 | if add_drug_mechanism: 79 | drugmech_index = dataframe[dataframe['drug_mechanism|molregno'].notnull()].index 80 | dataframe.loc[drugmech_index,'keep']=True # to keep 'drug_mechanism' activity 81 | 82 | # filter based on assay type 83 | if assay_type!=None: 84 | if type(assay_type)==list: 85 | dataframe = dataframe[(dataframe['assays|assay_type'].isin(assay_type))| \ 86 | (dataframe['keep'])] 87 | elif type(assay_type)==str: 88 | dataframe = dataframe[(dataframe['assays|assay_type']==assay_type)| \ 89 | (dataframe['keep'])] 90 | else: 91 | raise TypeError("argument 'assay_type' should be 'str' or 'list' type") 92 | 93 | # filter based on activity 94 | ## remove inactive 95 | if remove_inactive: 96 | dataframe = dataframe[(dataframe['activities|activity_comment'].isin(['inactive', 97 | 'Not Active', 98 | 'Not Active (inhibition < 50% @ 10 uM and thus dose-reponse curve not measured)', 99 | 'Inactive'])==False)| \ 100 | (dataframe['keep'])] # keep 'drug_mechanism' 
activity if 'add_drug_mechanism' is True 101 | 102 | ## grant subsequent filtering immunity to active 103 | if include_active: 104 | active_index = dataframe[dataframe['activities|activity_comment'].isin(['active','Active'])].index 105 | dataframe.loc[active_index,'keep']=True # to keep 'active' activity 106 | 107 | ## pChEMBL thresholding 108 | if pchembl_threshold!=None: 109 | if type(pchembl_threshold) == dict: 110 | if set(dataframe[pchembl_target_column])==set(pchembl_threshold.keys()): 111 | dataframe['pchembl_active']=False 112 | for k in pchembl_threshold.keys(): 113 | eachclass_df = dataframe[dataframe[pchembl_target_column]==k] 114 | dataframe.loc[eachclass_df.index,'pchembl_active']=eachclass_df['activities|pchembl_value']>=pchembl_threshold[k] 115 | del eachclass_df 116 | else: 117 | raise KeyError("argument 'pchembl_threshold' should have all the classes in key") 118 | else: 119 | #single value, do a single thresholding 120 | dataframe['pchembl_active'] = dataframe['activities|pchembl_value']>pchembl_threshold 121 | dataframe = dataframe[dataframe['pchembl_active'] | dataframe['keep']] # keep 'active' and 'drug_mechanism' activity if those options are True 122 | 123 | return dataframe 124 | 125 | 126 | def create_dict(dataframe): 127 | 128 | dic = {} 129 | for c in set(dataframe['molecule_dictionary|chembl_id']): 130 | table_c = dataframe[dataframe['molecule_dictionary|chembl_id']==c] 131 | c_name = list(set(table_c['molecule_dictionary|pref_name'])) 132 | 133 | if len(c_name)>1: 134 | raise ValueError(f'{c} has multiple names: {c_name}') 135 | else: 136 | c_name = c_name[0] 137 | 138 | l=[] 139 | for x in table_c['component_synonyms|component_synonym']: 140 | l=l+x.split('|') 141 | dic[f'{c}|{c_name}']=list(set(l)) 142 | del l, table_c, c_name 143 | 144 | return dic 145 | 146 | 147 | def create_drug_dictionary(dataframe, 148 | drug_grouping=None, 149 | atc_level=['level1','level2'] 150 | ): 151 | ''' 152 | This function creates drug:target-genenames
dictionary 153 | * dataframe: drug-target activity dataframe 154 | * drug_grouping: 'ATC_level','drug_max_phase',or None (default: None) 155 | * atc_level: specify level to categorise drugs and make drug-target dictionary. list. (default: ['level1','level2']) 156 | ''' 157 | 158 | drug_target_dict={} 159 | 160 | if drug_grouping=='ATC_level': 161 | for l in atc_level: 162 | dataframe.fillna(value={f'atc_classification|{l}':'No-category'},inplace=True) 163 | for group in dataframe[f'atc_classification|{l}'].unique(): 164 | # filtering category and creating dictionary 165 | group_df = dataframe[dataframe[f'atc_classification|{l}']==group] 166 | drug_target_dict[group] = create_dict(group_df) 167 | del group_df 168 | elif drug_grouping=='drug_max_phase': 169 | dataframe.fillna(value={'molecule_dictionary|max_phase':'No-phase'},inplace=True) 170 | for group in set(dataframe['molecule_dictionary|max_phase']): 171 | group_df = dataframe[dataframe['molecule_dictionary|max_phase']==group] 172 | drug_target_dict[f'maxphase_{str(group)}'] = create_dict(group_df) 173 | del group_df 174 | elif drug_grouping!=None: 175 | raise KeyError("invalid key ") 176 | else: 177 | #no grouping, just make a drug:[targets] dict 178 | drug_target_dict = create_dict(dataframe) 179 | return drug_target_dict 180 | 181 | -------------------------------------------------------------------------------- /drug2cell/cpdb_dict.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Teichlab/drug2cell/a786c632fc521b94f597e452e2f09c45fd451640/drug2cell/cpdb_dict.pkl -------------------------------------------------------------------------------- /drug2cell/data.py: -------------------------------------------------------------------------------- 1 | import pkg_resources 2 | import pandas as pd 3 | 4 | def chembl(): 5 | ''' 6 | Load the default ChEMBL drug target dictionary distributed with the package. 
7 | 8 | Returns the drug target dictionary - ATC categories as keys, with each ATC category 9 | a dictionary with corresponding drugs as keys. 10 | ''' 11 | #this picks up the pickle shipped with the package 12 | stream = pkg_resources.resource_stream(__name__, 'drug-target_dicts.pkl') 13 | targets = pd.read_pickle(stream) 14 | return targets 15 | 16 | def consensuspathdb(): 17 | ''' 18 | Load the ConsensusPathDB pathway gene memberships distributed with the package. 19 | 20 | Returns a dictionary with pathway names as keys and memberships as items. 21 | ''' 22 | #this picks up the pickle shipped with the package 23 | stream = pkg_resources.resource_stream(__name__, 'cpdb_dict.pkl') 24 | targets = pd.read_pickle(stream) 25 | return targets -------------------------------------------------------------------------------- /drug2cell/drug-target_dicts.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Teichlab/drug2cell/a786c632fc521b94f597e452e2f09c45fd451640/drug2cell/drug-target_dicts.pkl -------------------------------------------------------------------------------- /drug2cell/util.py: -------------------------------------------------------------------------------- 1 | import blitzgsea 2 | import numpy as np 3 | 4 | from . import data 5 | 6 | from collections import ChainMap 7 | 8 | def reconstruct_targets(adata, sep=","): 9 | ''' 10 | Reconstruct the targets dictionary from an AnnData with ``d2c.score()`` output stored. 11 | ''' 12 | targets = {} 13 | #loop over the group names, i.e. 
this object's feature space 14 | for group in adata.uns['drug2cell'].var_names: 15 | #split the stored gene list 16 | targets[group] = adata.uns['drug2cell'].var.loc[group,'all_genes'].split(sep) 17 | return targets 18 | 19 | def prepare_targets(adata, targets=None, nested=False, categories=None, sep=",", reconstruct=False): 20 | ''' 21 | A helper function that ensures the function calling it gets a {group:[targets]} 22 | dictionary back. 23 | ''' 24 | #no provided explicit target set 25 | if targets is None: 26 | #if there's .uns['drug2cell'] and we're instructed to reconstruct, do so 27 | if reconstruct and ('drug2cell' in adata.uns): 28 | targets = reconstruct_targets(adata, sep=sep) 29 | else: 30 | #load ChEMBL 31 | targets = data.chembl() 32 | #this is nested, regardless of what the input may say 33 | nested = True 34 | else: 35 | #copy so the original is unaltered 36 | targets = targets.copy() 37 | #subset to provided categories. this should only be provided for nested versions 38 | if categories is not None: 39 | #wrap a lone string in a list so it is not split into characters 40 | if isinstance(categories, str): 41 | categories = [categories] 42 | #so that they can be iterated over here for subsetting 43 | targets = {k:targets[k] for k in categories} 44 | #do we have a dict of dicts, with the categories as keys? 45 | if nested: 46 | #turn the dict of dicts structure to just a basic dictionary 47 | #get all the drugs for all the remaining categories, making a list of dicts 48 | #turn that list to a single dict via ChainMap magic 49 | targets = dict(ChainMap(*[targets[cat] for cat in targets])) 50 | #at this point we have the desired form of {group:[targets]}, return it 51 | return targets 52 | 53 | def prepare_plot_args(adata, targets=None, categories=None): 54 | ''' 55 | Prepare the ``var_names``, ``var_group_positions`` and ``var_group_labels`` arguments 56 | for scanpy plotting functions to display scored gene groups and group them nicely.
57 | Returns ``plot_args``, a dictionary of the values that can be used with scanpy 58 | plotting as ``**plot_args``. 59 | 60 | Input: 61 | ------ 62 | adata : ``AnnData`` 63 | Point the function to the ``.uns['drug2cell']`` slot computed by the ``score()`` 64 | function earlier. It's required to remove gene groups that were not represented 65 | in the data. 66 | targets : ``dict`` of lists of ``str``, optional (default: ``None``) 67 | The gene groups to evaluate. Can be targets of known drugs, GO terms, pathway 68 | memberships, anything you can assign genes to. If ``None``, will load the 69 | ChEMBL-derived drug target sets distributed with the package. Must be the 70 | ``nested=True`` version of the input as described in the score function. 71 | categories : ``str`` or list of ``str``, optional (default: ``None``) 72 | If ``targets=None`` or ``nested=True``, this argument can be used to subset the 73 | gene groups to one or more categories (keys of the original dictionary). In case 74 | of the ChEMBL drug targets, these are ATC level 1/level 2 category codes. 75 | ''' 76 | #load ChEMBL dictionary if not specified 77 | if targets is None: 78 | targets = data.chembl() 79 | else: 80 | #copy so the original argument is unaltered 81 | targets = targets.copy() 82 | #subset to provided categories 83 | if categories is not None: 84 | #wrap a lone string in a list so it is not split into characters 85 | if isinstance(categories, str): 86 | categories = [categories] 87 | #so that they can be iterated over here for subsetting 88 | targets = {k:targets[k] for k in categories} 89 | #this time we only care about the group names. kick out their targets 90 | for group in targets: 91 | targets[group] = list(targets[group].keys()) 92 | #can commence constructing the plotting arguments now 93 | var_names = [] 94 | var_group_positions = [] 95 | var_group_labels = [] 96 | #we'll begin at the first feature.
this is zero indexed 97 | start = 0 98 | for group in targets: 99 | #intersect the gene group names with what's available in the object 100 | #i.e. what was actually scored and is available for plotting 101 | targets[group] = list(adata.var_names[np.isin(adata.var_names, targets[group])]) 102 | #skip if empty 103 | if len(targets[group]) == 0: 104 | continue 105 | #if not empty, store! 106 | #append group names to list 107 | var_names = var_names + targets[group] 108 | #the newest group starts at start and ends at the end of the var_names 109 | var_group_positions = var_group_positions + [(start, len(var_names)-1)] 110 | var_group_labels = var_group_labels + [group] 111 | #the new start will be the next feature that goes in the list 112 | start = len(var_names) 113 | #that's it, return the things as a dict 114 | plot_args = {'var_names':var_names, 115 | 'var_group_positions':var_group_positions, 116 | 'var_group_labels':var_group_labels} 117 | return plot_args 118 | 119 | def plot_gsea(enrichment, targets, scores, n=10, interactive_plot=True, **kwargs): 120 | ''' 121 | Display the output of ``d2c.gsea()`` with blitzgsea's ``top_table()`` plot. 122 | 123 | The first ``d2c.gsea()`` output variable is ``enrichment``, and passing the second 124 | ``d2c.gsea()`` output variable with a ``**`` in front of it provides ``targets`` and 125 | ``scores``. 126 | 127 | Input 128 | ----- 129 | enrichment : ``dict`` of ``pd.DataFrame`` 130 | Cluster names as keys, blitzgsea's ``gsea()`` output as values 131 | targets : ``dict`` of list of ``str`` 132 | The gene group memberships that were used to compute GSEA 133 | scores : ``dict`` of ``pd.DataFrame`` 134 | Cluster names as keys, the input to blitzgsea 135 | n : ``int``, optional (default: ``10``) 136 | How many top scores to show for each group 137 | interactive_plot : ``bool``, optional (default: ``True``) 138 | If ``True``, will display the plots within a Jupyter Notebook. 
If ``False``, 139 | will collect the figures into a list and return it at the end. 140 | kwargs 141 | Any additional arguments to pass to ``blitzgsea.plot.top_table()``. 142 | ''' 143 | #optionally save output 144 | if not interactive_plot: 145 | figs = [] 146 | #make a top_table() plot for each cluster 147 | for cluster in enrichment: 148 | #get a figure, passing the various arguments 149 | #interactive_plot prepares the figure so that it .show()s in a notebook 150 | fig = blitzgsea.plot.top_table(scores[cluster], 151 | targets, 152 | enrichment[cluster], 153 | n=n, 154 | interactive_plot=interactive_plot, 155 | **kwargs 156 | ) 157 | #retitle accordingly 158 | fig.suptitle(cluster) 159 | #either display the plot or stash it in output 160 | if interactive_plot: 161 | fig.show() 162 | else: 163 | figs.append(fig) 164 | if not interactive_plot: 165 | return figs -------------------------------------------------------------------------------- /notebooks/chembl/filtering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "7df6b6c4", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import pandas as pd\n", 11 | "import drug2cell as d2c" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "id": "409592b1", 17 | "metadata": {}, 18 | "source": [ 19 | "Load the human targets ChEMBL data frame created in the initial parsing notebook." 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "id": "78aecb53", 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "original = pd.read_pickle(\"chembl_30_merged_genesymbols_humans.pkl\")" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "67ee796c", 35 | "metadata": {}, 36 | "source": [ 37 | "Drug2cell's filtering functions allow for subsetting the pchembl threshold for each category of a column of choice. 
We'll be using the `target_class` column, and basing our values on the Illuminating the Druggable Genome (IDG) project (ref: https://peerj.com/articles/15153/#:~:text=30%20nM%20for,other%20target%20families)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "id": "fe3f6848", 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "#pChEMBL is -log10() as per https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions#what-is-pchembl\n", 48 | "#the threshold values were updated on 21Dec2024\n", 49 | "thresholds_dict={\n", 50 | " 'none':6, #1uM\n", 51 | " 'NHR':7, #100nM\n", 52 | " 'GPCR':7, #100nM\n", 53 | " 'Ion Channel':5, #10uM\n", 54 | " 'Kinase':7.53, #30nM\n", 55 | "}" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "id": "3e88f0c3", 61 | "metadata": {}, 62 | "source": [ 63 | "We'll add some more criteria to the filtering. For a comprehensive list of available options, consult the documentation." 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "id": "974f2309", 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "name": "stdout", 74 | "output_type": "stream", 75 | "text": [ 76 | "(39660, 55)\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "filtered_df = d2c.chembl.filter_activities(\n", 82 | " dataframe=original,\n", 83 | " drug_max_phase=4,\n", 84 | " assay_type='F',\n", 85 | " add_drug_mechanism=True,\n", 86 | " remove_inactive=True,\n", 87 | " include_active=True,\n", 88 | " pchembl_target_column=\"target_class\",\n", 89 | " pchembl_threshold=thresholds_dict\n", 90 | ")\n", 91 | "print(filtered_df.shape)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "id": "c1d8d459", 97 | "metadata": {}, 98 | "source": [ 99 | "Now that we have our data frame subset to the drugs and targets of interest, we can convert them into a dictionary that can be used by drug2cell. 
The exact form distributed with the package was created like so:" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "id": "03d9dfd1", 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "{'CHEMBL1201566|DARBEPOETIN ALFA': ['EPOR'],\n", 112 | " 'CHEMBL3707314|METHOXY POLYETHYLENE GLYCOL-EPOETIN BETA': ['EPOR'],\n", 113 | " 'CHEMBL2109092|EPOETIN BETA': ['EPOR'],\n", 114 | " 'CHEMBL1622|FOLIC ACID': ['HSD17B10'],\n", 115 | " 'CHEMBL3039545|LUSPATERCEPT': ['TGFB2', 'TGFB1', 'TGFB3'],\n", 116 | " 'CHEMBL1963684|PEGINESATIDE ACETATE': ['EPOR'],\n", 117 | " 'CHEMBL1705709|SODIUM FEREDETATE': ['NFE2L2']}" 118 | ] 119 | }, 120 | "execution_count": 6, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "chembldict = d2c.chembl.create_drug_dictionary(\n", 127 | " filtered_df,\n", 128 | " drug_grouping='ATC_level'\n", 129 | ")\n", 130 | "chembldict['B03']" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "id": "b0b1582d", 136 | "metadata": {}, 137 | "source": [ 138 | "This results in a nested dictionary structure - a dictionary of categories, holding dictionaries of drugs, holding lists of targets. Drug2cell knows how to operate with this sort of structure as well as its normal groups:targets dictionary, but you need to specify `nested=True` in the scoring/enrichment/overrepresentation functions whenever you pass this structure." 
139 | ] 140 | } 141 | ], 142 | "metadata": { 143 | "kernelspec": { 144 | "display_name": "drug2cell_test_env", 145 | "language": "python", 146 | "name": "drug2cell_test_env" 147 | }, 148 | "language_info": { 149 | "codemirror_mode": { 150 | "name": "ipython", 151 | "version": 3 152 | }, 153 | "file_extension": ".py", 154 | "mimetype": "text/x-python", 155 | "name": "python", 156 | "nbconvert_exporter": "python", 157 | "pygments_lexer": "ipython3", 158 | "version": "3.8.20" 159 | } 160 | }, 161 | "nbformat": 4, 162 | "nbformat_minor": 5 163 | } 164 | -------------------------------------------------------------------------------- /notebooks/chembl/initial_database_parsing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "aerial-liquid", 6 | "metadata": {}, 7 | "source": [ 8 | "## Download the database" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "id": "arbitrary-feature", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "wget -O chembl_30_sqlite.tar.gz https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30_sqlite.tar.gz" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "german-doctrine", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "# at farm\n", 29 | "# download checksums file\n", 30 | "wget -O db30_checksums.txt https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/checksums.txt" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "id": "handmade-canada", 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "!cat db30_checksums.txt" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "bizarre-reservation", 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "!sha256sum chembl_30_sqlite.tar.gz" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | 
"execution_count": null, 56 | "id": "biblical-banana", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# decompress\n", 61 | "tar -xzvf chembl_30_sqlite.tar.gz" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "id": "raising-devon", 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "consistent-andorra", 75 | "metadata": {}, 76 | "source": [ 77 | "## Import required modules" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 28, 83 | "id": "confidential-reaction", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "import pandas as pd\n", 88 | "pd.set_option('display.max_rows', 600)\n", 89 | "import numpy as np\n", 90 | "import sqlite3" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "following-absolute", 96 | "metadata": {}, 97 | "source": [ 98 | "## Connect to the downloaded database" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 2, 104 | "id": "patient-digit", 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "con = sqlite3.connect('chembl_30.db')\n", 109 | "cur = con.cursor()" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 3, 115 | "id": "unnecessary-wonder", 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "name": "stdout", 120 | "output_type": "stream", 121 | "text": [ 122 | "[('assay_type',), ('chembl_id_lookup',), ('target_type',), ('go_classification',), ('atc_classification',), ('source',), ('action_type',), ('variant_sequences',), ('protein_classification',), ('activity_smid',), ('irac_classification',), ('bioassay_ontology',), ('structural_alert_sets',), ('relationship_type',), ('confidence_score_lookup',), ('curation_lookup',), ('assay_classification',), ('data_validity_lookup',), ('products',), ('patent_use_codes',), ('research_stem',), ('hrac_classification',), ('component_sequences',), ('domains',), 
('bio_component_sequences',), ('protein_family_classification',), ('version',), ('activity_stds_lookup',), ('frac_classification',), ('usan_stems',), ('organism_class',), ('target_dictionary',), ('molecule_dictionary',), ('docs',), ('protein_class_synonyms',), ('structural_alerts',), ('cell_dictionary',), ('tissue_dictionary',), ('product_patents',), ('component_synonyms',), ('component_domains',), ('research_companies',), ('component_go',), ('defined_daily_dose',), ('activity_supp',), ('component_class',), ('target_relations',), ('biotherapeutics',), ('compound_records',), ('binding_sites',), ('compound_structural_alerts',), ('assays',), ('molecule_hierarchy',), ('molecule_synonyms',), ('molecule_hrac_classification',), ('target_components',), ('compound_properties',), ('molecule_frac_classification',), ('molecule_irac_classification',), ('molecule_atc_classification',), ('compound_structures',), ('drug_indication',), ('drug_mechanism',), ('assay_parameters',), ('activities',), ('drug_warning',), ('metabolism',), ('biotherapeutic_components',), ('assay_class_map',), ('formulations',), ('site_components',), ('indication_refs',), ('mechanism_refs',), ('activity_properties',), ('warning_refs',), ('metabolism_refs',), ('predicted_binding_domains',), ('activity_supp_map',), ('ligand_eff',), ('sqlite_stat1',)]\n" 123 | ] 124 | } 125 | ], 126 | "source": [ 127 | "cur. 
execute(\"SELECT name FROM sqlite_master WHERE type='table';\")\n", 128 | "print(cur.fetchall())" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "driving-surrey", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "id": "spoken-wallace", 142 | "metadata": {}, 143 | "source": [ 144 | "## Activity data loading" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 4, 150 | "id": "uniform-assurance", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "(19286751, 27)\n", 158 | "(19286751, 27)\n", 159 | "CPU times: user 3min 39s, sys: 20.6 s, total: 4min\n", 160 | "Wall time: 4min 5s\n" 161 | ] 162 | }, 163 | { 164 | "data": { 165 | "text/html": [ 166 | "
" 332 | ], 333 | "text/plain": [ 334 | " activities|activity_id activities|assay_id activities|doc_id \\\n", 335 | "0 31863 54505 6424 \n", 336 | "1 31864 83907 6432 \n", 337 | "2 31865 88152 6432 \n", 338 | "3 31866 83907 6432 \n", 339 | "4 31867 88153 6432 \n", 340 | "\n", 341 | " activities|record_id activities|molregno activities|standard_relation \\\n", 342 | "0 206172 180094 > \n", 343 | "1 208970 182268 = \n", 344 | "2 208970 182268 > \n", 345 | "3 208987 182855 = \n", 346 | "4 208987 182855 None \n", 347 | "\n", 348 | " activities|standard_value activities|standard_units \\\n", 349 | "0 100000.0 nM \n", 350 | "1 2500.0 nM \n", 351 | "2 50000.0 nM \n", 352 | "3 9000.0 nM \n", 353 | "4 NaN nM \n", 354 | "\n", 355 | " activities|standard_flag activities|standard_type ... activities|toid \\\n", 356 | "0 1 IC50 ... NaN \n", 357 | "1 1 IC50 ... NaN \n", 358 | "2 1 IC50 ... NaN \n", 359 | "3 1 IC50 ... NaN \n", 360 | "4 0 IC50 ... NaN \n", 361 | "\n", 362 | " activities|upper_value activities|standard_upper_value activities|src_id \\\n", 363 | "0 NaN None 1 \n", 364 | "1 NaN None 1 \n", 365 | "2 NaN None 1 \n", 366 | "3 NaN None 1 \n", 367 | "4 NaN None 1 \n", 368 | "\n", 369 | " activities|type activities|relation activities|value activities|units \\\n", 370 | "0 IC50 > 100.0 uM \n", 371 | "1 IC50 = 2.5 uM \n", 372 | "2 IC50 > 50.0 uM \n", 373 | "3 IC50 = 9.0 uM \n", 374 | "4 IC50 None NaN uM \n", 375 | "\n", 376 | " activities|text_value activities|standard_text_value \n", 377 | "0 None None \n", 378 | "1 None None \n", 379 | "2 None None \n", 380 | "3 None None \n", 381 | "4 None None \n", 382 | "\n", 383 | "[5 rows x 27 columns]" 384 | ] 385 | }, 386 | "execution_count": 4, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "%%time\n", 393 | "activities = pd.read_sql_query(\"SELECT * from activities\", con)\n", 394 | "print(activities.shape)\n", 395 | "\n", 396 | "# rename columns to be able to track back\n", 
397 | "activities.columns=[f'activities|{x}' for x in activities.columns]\n", 398 | "\n", 399 | "print(activities.shape)\n", 400 | "activities.head()" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "id": "above-police", 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "id": "unexpected-packet", 414 | "metadata": {}, 415 | "source": [ 416 | "## Assay data loading" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": 5, 422 | "id": "actual-incidence", 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "name": "stdout", 427 | "output_type": "stream", 428 | "text": [ 429 | "(1458215, 24)\n", 430 | "(1458215, 24)\n" 431 | ] 432 | }, 433 | { 434 | "data": { 435 | "text/html": [ 436 | "
\n", 437 | "\n", 450 | "\n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | "
\n", 600 | "


\n", 601 | "
" 602 | ], 603 | "text/plain": [ 604 | " assays|assay_id assays|doc_id \\\n", 605 | "0 1 11087 \n", 606 | "1 2 684 \n", 607 | "2 3 15453 \n", 608 | "3 4 17841 \n", 609 | "4 5 17430 \n", 610 | "\n", 611 | " assays|description assays|assay_type \\\n", 612 | "0 The compound was tested for the in vitro inhib... B \n", 613 | "1 Compound was evaluated for its ability to mobi... F \n", 614 | "2 None B \n", 615 | "3 Binding affinity against A2 adenosine receptor... B \n", 616 | "4 In vitro cell cytotoxicity against 143-B cell ... F \n", 617 | "\n", 618 | " assays|assay_test_type assays|assay_category assays|assay_organism \\\n", 619 | "0 None None None \n", 620 | "1 None None None \n", 621 | "2 None None None \n", 622 | "3 None None Bos taurus \n", 623 | "4 None None Homo sapiens \n", 624 | "\n", 625 | " assays|assay_tax_id assays|assay_strain assays|assay_tissue ... \\\n", 626 | "0 NaN None None ... \n", 627 | "1 NaN None None ... \n", 628 | "2 NaN None None ... \n", 629 | "3 9913.0 None Striatum ... \n", 630 | "4 9606.0 None None ... 
\n", 631 | "\n", 632 | " assays|confidence_score assays|curated_by assays|src_id \\\n", 633 | "0 8 Autocuration 1 \n", 634 | "1 0 Autocuration 1 \n", 635 | "2 0 Autocuration 1 \n", 636 | "3 4 Autocuration 1 \n", 637 | "4 1 Intermediate 1 \n", 638 | "\n", 639 | " assays|src_assay_id assays|chembl_id assays|cell_id assays|bao_format \\\n", 640 | "0 None CHEMBL615117 NaN BAO_0000019 \n", 641 | "1 None CHEMBL615118 NaN BAO_0000219 \n", 642 | "2 None CHEMBL615119 NaN BAO_0000019 \n", 643 | "3 None CHEMBL615120 NaN BAO_0000249 \n", 644 | "4 None CHEMBL615121 163.0 BAO_0000219 \n", 645 | "\n", 646 | " assays|tissue_id assays|variant_id assays|aidx \n", 647 | "0 NaN NaN CLD0 \n", 648 | "1 NaN NaN CLD0 \n", 649 | "2 NaN NaN CLD0 \n", 650 | "3 2435.0 NaN CLD0 \n", 651 | "4 NaN NaN CLD0 \n", 652 | "\n", 653 | "[5 rows x 24 columns]" 654 | ] 655 | }, 656 | "execution_count": 5, 657 | "metadata": {}, 658 | "output_type": "execute_result" 659 | } 660 | ], 661 | "source": [ 662 | "# assay data\n", 663 | "assays = pd.read_sql_query(\"SELECT * from assays\", con)\n", 664 | "print(assays.shape)\n", 665 | "\n", 666 | "# rename columns to be able to track back\n", 667 | "assays.columns=[f'assays|{x}' for x in assays.columns]\n", 668 | "\n", 669 | "print(assays.shape)\n", 670 | "assays.head()" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "id": "touched-insulation", 677 | "metadata": {}, 678 | "outputs": [], 679 | "source": [] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "id": "independent-slave", 684 | "metadata": {}, 685 | "source": [ 686 | "## Target data loading" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 6, 692 | "id": "numeric-warrant", 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "name": "stdout", 697 | "output_type": "stream", 698 | "text": [ 699 | "(14855, 7)\n" 700 | ] 701 | }, 702 | { 703 | "data": { 704 | "text/html": [ 705 | "
\n", 706 | "\n", 719 | "\n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | "
\n", 785 | "
" 786 | ], 787 | "text/plain": [ 788 | " target_dictionary|tid target_dictionary|target_type \\\n", 789 | "0 1 SINGLE PROTEIN \n", 790 | "1 2 SINGLE PROTEIN \n", 791 | "2 3 SINGLE PROTEIN \n", 792 | "3 4 SINGLE PROTEIN \n", 793 | "4 5 SINGLE PROTEIN \n", 794 | "\n", 795 | " target_dictionary|pref_name \\\n", 796 | "0 Maltase-glucoamylase \n", 797 | "1 Sulfonylurea receptor 2 \n", 798 | "2 Phosphodiesterase 5A \n", 799 | "3 Voltage-gated T-type calcium channel alpha-1H ... \n", 800 | "4 Nicotinic acetylcholine receptor alpha subunit \n", 801 | "\n", 802 | " target_dictionary|tax_id target_dictionary|organism \\\n", 803 | "0 9606.0 Homo sapiens \n", 804 | "1 9606.0 Homo sapiens \n", 805 | "2 9606.0 Homo sapiens \n", 806 | "3 9606.0 Homo sapiens \n", 807 | "4 6253.0 Ascaris suum \n", 808 | "\n", 809 | " target_dictionary|chembl_id target_dictionary|species_group_flag \n", 810 | "0 CHEMBL2074 0 \n", 811 | "1 CHEMBL1971 0 \n", 812 | "2 CHEMBL1827 0 \n", 813 | "3 CHEMBL1859 0 \n", 814 | "4 CHEMBL1884 0 " 815 | ] 816 | }, 817 | "execution_count": 6, 818 | "metadata": {}, 819 | "output_type": "execute_result" 820 | } 821 | ], 822 | "source": [ 823 | "target_dictionary = pd.read_sql_query(\"SELECT * from target_dictionary\", con)\n", 824 | "print(target_dictionary.shape)\n", 825 | "\n", 826 | "# rename columns to be able to track back\n", 827 | "target_dictionary.columns=[f'target_dictionary|{x}' for x in target_dictionary.columns]\n", 828 | "\n", 829 | "target_dictionary.head()" 830 | ] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "execution_count": 7, 835 | "id": "athletic-producer", 836 | "metadata": {}, 837 | "outputs": [ 838 | { 839 | "name": "stdout", 840 | "output_type": "stream", 841 | "text": [ 842 | "(13558, 4)\n" 843 | ] 844 | }, 845 | { 846 | "data": { 847 | "text/html": [ 848 | "
\n", 849 | "\n", 862 | "\n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | "
\n", 910 | "
" 911 | ], 912 | "text/plain": [ 913 | " target_components|tid target_components|component_id \\\n", 914 | "0 11004 3090 \n", 915 | "1 11028 1166 \n", 916 | "2 11037 1888 \n", 917 | "3 11043 1294 \n", 918 | "4 11056 2159 \n", 919 | "\n", 920 | " target_components|targcomp_id target_components|homologue \n", 921 | "0 1 0 \n", 922 | "1 4 0 \n", 923 | "2 5 0 \n", 924 | "3 7 0 \n", 925 | "4 8 0 " 926 | ] 927 | }, 928 | "execution_count": 7, 929 | "metadata": {}, 930 | "output_type": "execute_result" 931 | } 932 | ], 933 | "source": [ 934 | "target_components = pd.read_sql_query(\"SELECT * from target_components\", con)\n", 935 | "print(target_components.shape)\n", 936 | "\n", 937 | "# rename columns to be able to track back\n", 938 | "target_components.columns=[f'target_components|{x}' for x in target_components.columns]\n", 939 | "\n", 940 | "target_components.head()" 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 8, 946 | "id": "immune-animal", 947 | "metadata": {}, 948 | "outputs": [ 949 | { 950 | "name": "stdout", 951 | "output_type": "stream", 952 | "text": [ 953 | "(97172, 4)\n" 954 | ] 955 | }, 956 | { 957 | "data": { 958 | "text/html": [ 959 | "
\n", 960 | "\n", 973 | "\n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | "
\n", 1021 | "
" 1022 | ], 1023 | "text/plain": [ 1024 | " component_synonyms|compsyn_id component_synonyms|component_id \\\n", 1025 | "0 860862 48 \n", 1026 | "1 860867 49 \n", 1027 | "2 860872 50 \n", 1028 | "3 860877 51 \n", 1029 | "4 860915 56 \n", 1030 | "\n", 1031 | " component_synonyms|component_synonym component_synonyms|syn_type \n", 1032 | "0 Gabra-1 GENE_SYMBOL_OTHER \n", 1033 | "1 Gabrb-3 GENE_SYMBOL_OTHER \n", 1034 | "2 Gabrb-2 GENE_SYMBOL_OTHER \n", 1035 | "3 CDKN5 GENE_SYMBOL_OTHER \n", 1036 | "4 NMDAR1 GENE_SYMBOL_OTHER " 1037 | ] 1038 | }, 1039 | "execution_count": 8, 1040 | "metadata": {}, 1041 | "output_type": "execute_result" 1042 | } 1043 | ], 1044 | "source": [ 1045 | "component_synonyms = pd.read_sql_query(\"SELECT * from component_synonyms\", con)\n", 1046 | "print(component_synonyms.shape)\n", 1047 | "\n", 1048 | "# rename columns to be able to track back\n", 1049 | "component_synonyms.columns=[f'component_synonyms|{x}' for x in component_synonyms.columns]\n", 1050 | "\n", 1051 | "component_synonyms.head()" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "code", 1056 | "execution_count": null, 1057 | "id": "alternate-stable", 1058 | "metadata": {}, 1059 | "outputs": [], 1060 | "source": [] 1061 | }, 1062 | { 1063 | "cell_type": "code", 1064 | "execution_count": null, 1065 | "id": "sound-receiver", 1066 | "metadata": {}, 1067 | "outputs": [], 1068 | "source": [] 1069 | }, 1070 | { 1071 | "cell_type": "markdown", 1072 | "id": "molecular-finland", 1073 | "metadata": {}, 1074 | "source": [ 1075 | "## Compound data loading" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": 9, 1081 | "id": "responsible-architect", 1082 | "metadata": {}, 1083 | "outputs": [ 1084 | { 1085 | "name": "stdout", 1086 | "output_type": "stream", 1087 | "text": [ 1088 | "(6656, 14)\n" 1089 | ] 1090 | }, 1091 | { 1092 | "data": { 1093 | "text/html": [ 1094 | "
\n", 1095 | "\n", 1108 | "\n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | "
\n", 1216 | "
" 1217 | ], 1218 | "text/plain": [ 1219 | " drug_mechanism|mec_id drug_mechanism|record_id drug_mechanism|molregno \\\n", 1220 | "0 13 1343810 1124 \n", 1221 | "1 14 1344053 675068 \n", 1222 | "2 15 1344649 674765 \n", 1223 | "3 16 1343255 1085 \n", 1224 | "4 17 1344903 1125 \n", 1225 | "\n", 1226 | " drug_mechanism|mechanism_of_action drug_mechanism|tid \\\n", 1227 | "0 Carbonic anhydrase VII inhibitor 11060.0 \n", 1228 | "1 Carbonic anhydrase I inhibitor 10193.0 \n", 1229 | "2 Carbonic anhydrase I inhibitor 10193.0 \n", 1230 | "3 Carbonic anhydrase I inhibitor 10193.0 \n", 1231 | "4 Carbonic anhydrase I inhibitor 10193.0 \n", 1232 | "\n", 1233 | " drug_mechanism|site_id drug_mechanism|action_type \\\n", 1234 | "0 NaN INHIBITOR \n", 1235 | "1 NaN INHIBITOR \n", 1236 | "2 NaN INHIBITOR \n", 1237 | "3 NaN INHIBITOR \n", 1238 | "4 NaN INHIBITOR \n", 1239 | "\n", 1240 | " drug_mechanism|direct_interaction drug_mechanism|molecular_mechanism \\\n", 1241 | "0 1 1 \n", 1242 | "1 1 1 \n", 1243 | "2 1 1 \n", 1244 | "3 1 1 \n", 1245 | "4 1 1 \n", 1246 | "\n", 1247 | " drug_mechanism|disease_efficacy drug_mechanism|mechanism_comment \\\n", 1248 | "0 1 None \n", 1249 | "1 1 None \n", 1250 | "2 1 Expressed in eye \n", 1251 | "3 1 None \n", 1252 | "4 1 Expressed in eye \n", 1253 | "\n", 1254 | " drug_mechanism|selectivity_comment drug_mechanism|binding_site_comment \\\n", 1255 | "0 None None \n", 1256 | "1 None None \n", 1257 | "2 None None \n", 1258 | "3 None None \n", 1259 | "4 None None \n", 1260 | "\n", 1261 | " drug_mechanism|variant_id \n", 1262 | "0 NaN \n", 1263 | "1 NaN \n", 1264 | "2 NaN \n", 1265 | "3 NaN \n", 1266 | "4 NaN " 1267 | ] 1268 | }, 1269 | "execution_count": 9, 1270 | "metadata": {}, 1271 | "output_type": "execute_result" 1272 | } 1273 | ], 1274 | "source": [ 1275 | "drug_mechanism = pd.read_sql_query(\"SELECT * from drug_mechanism\", con)\n", 1276 | "print(drug_mechanism.shape)\n", 1277 | "\n", 1278 | "# rename columns to be able to track back\n", 1279 | 
"drug_mechanism.columns=[f'drug_mechanism|{x}' for x in drug_mechanism.columns]\n", 1280 | "\n", 1281 | "drug_mechanism.head()" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": 10, 1287 | "id": "challenging-swing", 1288 | "metadata": {}, 1289 | "outputs": [ 1290 | { 1291 | "name": "stdout", 1292 | "output_type": "stream", 1293 | "text": [ 1294 | "(2157379, 31)\n" 1295 | ] 1296 | }, 1297 | { 1298 | "data": { 1299 | "text/html": [ 1300 | "
\n", 1301 | "\n", 1314 | "\n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " 
\n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | "
\n", 1464 | "


\n", 1465 | "
" 1466 | ], 1467 | "text/plain": [ 1468 | " molecule_dictionary|molregno molecule_dictionary|pref_name \\\n", 1469 | "0 1 None \n", 1470 | "1 2 None \n", 1471 | "2 3 None \n", 1472 | "3 4 None \n", 1473 | "4 5 None \n", 1474 | "\n", 1475 | " molecule_dictionary|chembl_id molecule_dictionary|max_phase \\\n", 1476 | "0 CHEMBL6329 0 \n", 1477 | "1 CHEMBL6328 0 \n", 1478 | "2 CHEMBL265667 0 \n", 1479 | "3 CHEMBL6362 0 \n", 1480 | "4 CHEMBL267864 0 \n", 1481 | "\n", 1482 | " molecule_dictionary|therapeutic_flag molecule_dictionary|dosed_ingredient \\\n", 1483 | "0 0 0 \n", 1484 | "1 0 0 \n", 1485 | "2 0 0 \n", 1486 | "3 0 0 \n", 1487 | "4 0 0 \n", 1488 | "\n", 1489 | " molecule_dictionary|structure_type molecule_dictionary|chebi_par_id \\\n", 1490 | "0 MOL NaN \n", 1491 | "1 MOL NaN \n", 1492 | "2 MOL NaN \n", 1493 | "3 MOL NaN \n", 1494 | "4 MOL NaN \n", 1495 | "\n", 1496 | " molecule_dictionary|molecule_type molecule_dictionary|first_approval ... \\\n", 1497 | "0 Small molecule NaN ... \n", 1498 | "1 Small molecule NaN ... \n", 1499 | "2 Small molecule NaN ... \n", 1500 | "3 Small molecule NaN ... \n", 1501 | "4 Small molecule NaN ... 
\n", 1502 | "\n", 1503 | " molecule_dictionary|usan_stem molecule_dictionary|polymer_flag \\\n", 1504 | "0 None 0 \n", 1505 | "1 None 0 \n", 1506 | "2 None 0 \n", 1507 | "3 None 0 \n", 1508 | "4 None 0 \n", 1509 | "\n", 1510 | " molecule_dictionary|usan_substem molecule_dictionary|usan_stem_definition \\\n", 1511 | "0 None None \n", 1512 | "1 None None \n", 1513 | "2 None None \n", 1514 | "3 None None \n", 1515 | "4 None None \n", 1516 | "\n", 1517 | " molecule_dictionary|indication_class molecule_dictionary|withdrawn_flag \\\n", 1518 | "0 None 0 \n", 1519 | "1 None 0 \n", 1520 | "2 None 0 \n", 1521 | "3 None 0 \n", 1522 | "4 None 0 \n", 1523 | "\n", 1524 | " molecule_dictionary|withdrawn_year molecule_dictionary|withdrawn_country \\\n", 1525 | "0 NaN None \n", 1526 | "1 NaN None \n", 1527 | "2 NaN None \n", 1528 | "3 NaN None \n", 1529 | "4 NaN None \n", 1530 | "\n", 1531 | " molecule_dictionary|withdrawn_reason molecule_dictionary|withdrawn_class \n", 1532 | "0 None None \n", 1533 | "1 None None \n", 1534 | "2 None None \n", 1535 | "3 None None \n", 1536 | "4 None None \n", 1537 | "\n", 1538 | "[5 rows x 31 columns]" 1539 | ] 1540 | }, 1541 | "execution_count": 10, 1542 | "metadata": {}, 1543 | "output_type": "execute_result" 1544 | } 1545 | ], 1546 | "source": [ 1547 | "molecule_dictionary = pd.read_sql_query(\"SELECT * from molecule_dictionary\", con)\n", 1548 | "print(molecule_dictionary.shape)\n", 1549 | "\n", 1550 | "# rename columns to be able to track back\n", 1551 | "molecule_dictionary.columns=[f'molecule_dictionary|{x}' for x in molecule_dictionary.columns]\n", 1552 | "\n", 1553 | "molecule_dictionary.head()" 1554 | ] 1555 | }, 1556 | { 1557 | "cell_type": "code", 1558 | "execution_count": 11, 1559 | "id": "gothic-fantasy", 1560 | "metadata": {}, 1561 | "outputs": [ 1562 | { 1563 | "name": "stdout", 1564 | "output_type": "stream", 1565 | "text": [ 1566 | "(4470, 3)\n" 1567 | ] 1568 | }, 1569 | { 1570 | "data": { 1571 | "text/html": [ 1572 | "
\n", 1573 | "\n", 1586 | "\n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | "
\n", 1628 | "
" 1629 | ], 1630 | "text/plain": [ 1631 | " molecule_atc_classification|mol_atc_id molecule_atc_classification|level5 \\\n", 1632 | "0 59409 L01EX15 \n", 1633 | "1 59410 L01EX10 \n", 1634 | "2 59411 L01EM03 \n", 1635 | "3 59412 D06BX03 \n", 1636 | "4 59413 L01EX13 \n", 1637 | "\n", 1638 | " molecule_atc_classification|molregno \n", 1639 | "0 2089491 \n", 1640 | "1 608601 \n", 1641 | "2 1567700 \n", 1642 | "3 579824 \n", 1643 | "4 1763584 " 1644 | ] 1645 | }, 1646 | "execution_count": 11, 1647 | "metadata": {}, 1648 | "output_type": "execute_result" 1649 | } 1650 | ], 1651 | "source": [ 1652 | "molecule_atc_classification = pd.read_sql_query(\"SELECT * from molecule_atc_classification\", con)\n", 1653 | "print(molecule_atc_classification.shape)\n", 1654 | "\n", 1655 | "# rename columns to be able to track back\n", 1656 | "molecule_atc_classification.columns=[f'molecule_atc_classification|{x}' for x in molecule_atc_classification.columns]\n", 1657 | "\n", 1658 | "molecule_atc_classification.head()" 1659 | ] 1660 | }, 1661 | { 1662 | "cell_type": "code", 1663 | "execution_count": 12, 1664 | "id": "spatial-creation", 1665 | "metadata": {}, 1666 | "outputs": [ 1667 | { 1668 | "name": "stdout", 1669 | "output_type": "stream", 1670 | "text": [ 1671 | "(5148, 10)\n" 1672 | ] 1673 | }, 1674 | { 1675 | "data": { 1676 | "text/html": [ 1677 | "
\n", 1678 | "\n", 1691 | "\n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | "
\n", 1775 | "
" 1776 | ], 1777 | "text/plain": [ 1778 | " atc_classification|who_name atc_classification|level1 \\\n", 1779 | "0 sodium fluoride A \n", 1780 | "1 sodium monofluorophosphate A \n", 1781 | "2 olaflur A \n", 1782 | "3 stannous fluoride A \n", 1783 | "4 combinations A \n", 1784 | "\n", 1785 | " atc_classification|level2 atc_classification|level3 \\\n", 1786 | "0 A01 A01A \n", 1787 | "1 A01 A01A \n", 1788 | "2 A01 A01A \n", 1789 | "3 A01 A01A \n", 1790 | "4 A01 A01A \n", 1791 | "\n", 1792 | " atc_classification|level4 atc_classification|level5 \\\n", 1793 | "0 A01AA A01AA01 \n", 1794 | "1 A01AA A01AA02 \n", 1795 | "2 A01AA A01AA03 \n", 1796 | "3 A01AA A01AA04 \n", 1797 | "4 A01AA A01AA30 \n", 1798 | "\n", 1799 | " atc_classification|level1_description atc_classification|level2_description \\\n", 1800 | "0 ALIMENTARY TRACT AND METABOLISM STOMATOLOGICAL PREPARATIONS \n", 1801 | "1 ALIMENTARY TRACT AND METABOLISM STOMATOLOGICAL PREPARATIONS \n", 1802 | "2 ALIMENTARY TRACT AND METABOLISM STOMATOLOGICAL PREPARATIONS \n", 1803 | "3 ALIMENTARY TRACT AND METABOLISM STOMATOLOGICAL PREPARATIONS \n", 1804 | "4 ALIMENTARY TRACT AND METABOLISM STOMATOLOGICAL PREPARATIONS \n", 1805 | "\n", 1806 | " atc_classification|level3_description atc_classification|level4_description \n", 1807 | "0 STOMATOLOGICAL PREPARATIONS Caries prophylactic agents \n", 1808 | "1 STOMATOLOGICAL PREPARATIONS Caries prophylactic agents \n", 1809 | "2 STOMATOLOGICAL PREPARATIONS Caries prophylactic agents \n", 1810 | "3 STOMATOLOGICAL PREPARATIONS Caries prophylactic agents \n", 1811 | "4 STOMATOLOGICAL PREPARATIONS Caries prophylactic agents " 1812 | ] 1813 | }, 1814 | "execution_count": 12, 1815 | "metadata": {}, 1816 | "output_type": "execute_result" 1817 | } 1818 | ], 1819 | "source": [ 1820 | "atc_classification = pd.read_sql_query(\"SELECT * from atc_classification\", con)\n", 1821 | "print(atc_classification.shape)\n", 1822 | "\n", 1823 | "# rename columns to be able to track back\n", 1824 | 
"atc_classification.columns=[f'atc_classification|{x}' for x in atc_classification.columns]\n", 1825 | "\n", 1826 | "atc_classification.head()" 1827 | ] 1828 | }, 1829 | { 1830 | "cell_type": "code", 1831 | "execution_count": null, 1832 | "id": "renewable-engineering", 1833 | "metadata": {}, 1834 | "outputs": [], 1835 | "source": [ 1836 | "# atc_classification = pd.read_sql_query(\"SELECT * from atc_classification\", con)\n", 1837 | "# atc_classification.to_csv('/home/jovyan/projects/P50_ChEMBL/csv/atc_classification_db30.csv',index=False)" 1838 | ] 1839 | }, 1840 | { 1841 | "cell_type": "code", 1842 | "execution_count": null, 1843 | "id": "expressed-program", 1844 | "metadata": {}, 1845 | "outputs": [], 1846 | "source": [] 1847 | }, 1848 | { 1849 | "cell_type": "markdown", 1850 | "id": "plain-indonesian", 1851 | "metadata": {}, 1852 | "source": [ 1853 | "## Concatenate information" 1854 | ] 1855 | }, 1856 | { 1857 | "cell_type": "markdown", 1858 | "id": "rubber-compromise", 1859 | "metadata": {}, 1860 | "source": [ 1861 | "* activities_final:
\n", 1862 | "'activities|activity_id', 'activities|assay_id', 'activities|molregno',\n", 1863 | "'activities|pchembl_value', 'activities|type', 'activities|standard_relation', 'activities|standard_value',\n", 1864 | "'activities|standard_units', 'activities|standard_flag',\n", 1865 | "'activities|standard_type', 'activities|activity_comment',\n", 1866 | "'assays|description','assays|assay_type','assays|tid', 'assays|confidence_score','assays|curated_by','assays|chembl_id',
\n", 1867 | "
\n", 1868 | "* targets_final:
\n", 1869 | "'target_dictionary|tid','target_dictionary|target_type','target_dictionary|pref_name','target_dictionary|organism','target_dictionary|chembl_id',\n", 1870 | "'component_synonyms|component_synonym', 'component_synonyms|syn_type'
\n", 1871 | "
\n", 1872 | "* molecule_dictionary:
\n", 1873 | "'molecule_dictionary|molregno', 'molecule_dictionary|pref_name','molecule_dictionary|chembl_id', 'molecule_dictionary|max_phase', 'molecule_dictionary|molecule_type','molecule_dictionary|oral',\n", 1874 | "'molecule_dictionary|parenteral', 'molecule_dictionary|topical','molecule_dictionary|black_box_warning','molecule_dictionary|natural_product'
\n", 1875 | "
\n", 1876 | "* drug_mechanism:
\n", 1877 | "'drug_mechanism|molregno','drug_mechanism|mechanism_of_action','drug_mechanism|tid','drug_mechanism|action_type',\n", 1878 | "
\n", 1879 | "* molecule_atc_classification:
\n", 1880 | "'molecule_atc_classification|mol_atc_id','molecule_atc_classification|level5','molecule_atc_classification|molregno'\n", 1881 | "
\n", 1882 | "* atc_classification:
\n", 1883 | "'atc_classification|who_name', 'atc_classification|level1','atc_classification|level2', 'atc_classification|level3',\n", 1884 | "'atc_classification|level4', 'atc_classification|level5','atc_classification|level1_description','atc_classification|level2_description',\n", 1885 | "'atc_classification|level3_description','atc_classification|level4_description'" 1886 | ] 1887 | }, 1888 | { 1889 | "cell_type": "code", 1890 | "execution_count": null, 1891 | "id": "spread-programmer", 1892 | "metadata": {}, 1893 | "outputs": [], 1894 | "source": [] 1895 | }, 1896 | { 1897 | "cell_type": "code", 1898 | "execution_count": 13, 1899 | "id": "accurate-document", 1900 | "metadata": {}, 1901 | "outputs": [ 1902 | { 1903 | "name": "stdout", 1904 | "output_type": "stream", 1905 | "text": [ 1906 | "activities+assays: (19286751, 17)\n", 1907 | "added drug mechanism: (19291779, 21)\n", 1908 | "19291779\n", 1909 | "19291382\n", 1910 | "added compound info: (20181918, 45)\n", 1911 | "UNIPROT 55495\n", 1912 | "GENE_SYMBOL_OTHER 39293\n", 1913 | "GENE_SYMBOL 13028\n", 1914 | "EC_NUMBER 7943\n", 1915 | "Name: component_synonyms|syn_type, dtype: int64\n", 1916 | "added targets info: (21292684, 52)\n" 1917 | ] 1918 | }, 1919 | { 1920 | "data": { 1921 | "text/html": [ 1922 | "
5 rows × 52 columns
" 2088 | ], 2089 | "text/plain": [ 2090 | " activities|activity_id activities|assay_id activities|molregno \\\n", 2091 | "0 31863.0 54505.0 180094.0 \n", 2092 | "1 31864.0 83907.0 182268.0 \n", 2093 | "2 2224237.0 531583.0 182268.0 \n", 2094 | "3 31865.0 88152.0 182268.0 \n", 2095 | "4 31866.0 83907.0 182855.0 \n", 2096 | "\n", 2097 | " activities|pchembl_value activities|type activities|standard_relation \\\n", 2098 | "0 NaN IC50 > \n", 2099 | "1 5.60 IC50 = \n", 2100 | "2 5.60 pIC50 = \n", 2101 | "3 NaN IC50 > \n", 2102 | "4 5.05 IC50 = \n", 2103 | "\n", 2104 | " activities|standard_value activities|standard_units \\\n", 2105 | "0 100000.00 nM \n", 2106 | "1 2500.00 nM \n", 2107 | "2 2511.89 nM \n", 2108 | "3 50000.00 nM \n", 2109 | "4 9000.00 nM \n", 2110 | "\n", 2111 | " activities|standard_flag activities|standard_type ... \\\n", 2112 | "0 1.0 IC50 ... \n", 2113 | "1 1.0 IC50 ... \n", 2114 | "2 1.0 IC50 ... \n", 2115 | "3 1.0 IC50 ... \n", 2116 | "4 1.0 IC50 ... \n", 2117 | "\n", 2118 | " atc_classification|level3_description atc_classification|level4_description \\\n", 2119 | "0 NaN NaN \n", 2120 | "1 NaN NaN \n", 2121 | "2 NaN NaN \n", 2122 | "3 NaN NaN \n", 2123 | "4 NaN NaN \n", 2124 | "\n", 2125 | " atc_classification|who_name target_dictionary|tid \\\n", 2126 | "0 NaN 63.0 \n", 2127 | "1 NaN 11653.0 \n", 2128 | "2 NaN 11653.0 \n", 2129 | "3 NaN NaN \n", 2130 | "4 NaN 11653.0 \n", 2131 | "\n", 2132 | " target_dictionary|target_type target_dictionary|pref_name \\\n", 2133 | "0 SINGLE PROTEIN DNA topoisomerase II alpha \n", 2134 | "1 SINGLE PROTEIN Heparanase \n", 2135 | "2 SINGLE PROTEIN Heparanase \n", 2136 | "3 NaN NaN \n", 2137 | "4 SINGLE PROTEIN Heparanase \n", 2138 | "\n", 2139 | " target_dictionary|organism target_dictionary|chembl_id \\\n", 2140 | "0 Homo sapiens CHEMBL1806 \n", 2141 | "1 Homo sapiens CHEMBL3921 \n", 2142 | "2 Homo sapiens CHEMBL3921 \n", 2143 | "3 NaN NaN \n", 2144 | "4 Homo sapiens CHEMBL3921 \n", 2145 | "\n", 2146 | " 
component_synonyms|component_synonym component_synonyms|syn_type \n", 2147 | "0 TOP2A GENE_SYMBOL \n", 2148 | "1 HPSE GENE_SYMBOL \n", 2149 | "2 HPSE GENE_SYMBOL \n", 2150 | "3 NaN NaN \n", 2151 | "4 HPSE GENE_SYMBOL \n", 2152 | "\n", 2153 | "[5 rows x 52 columns]" 2154 | ] 2155 | }, 2156 | "execution_count": 13, 2157 | "metadata": {}, 2158 | "output_type": "execute_result" 2159 | } 2160 | ], 2161 | "source": [ 2162 | "# merge activities and assays data\n", 2163 | "final_df = activities.merge(assays,how='left',left_on='activities|assay_id',right_on='assays|assay_id')\n", 2164 | "final_df = final_df[['activities|activity_id', 'activities|assay_id', 'activities|molregno',\n", 2165 | " 'activities|pchembl_value', 'activities|type', 'activities|standard_relation', 'activities|standard_value',\n", 2166 | " 'activities|standard_units', 'activities|standard_flag','activities|standard_type', 'activities|activity_comment',\n", 2167 | " 'assays|description','assays|assay_type','assays|tid', 'assays|confidence_score','assays|curated_by','assays|chembl_id']]\n", 2168 | "print(f'activities+assays: {final_df.shape}')\n", 2169 | "\n", 2170 | "# merge activities and drug_mechanism on 'molregno' and 'tid': how='outer' to capture all\n", 2171 | "final_df = final_df.merge(drug_mechanism[['drug_mechanism|molregno','drug_mechanism|mechanism_of_action','drug_mechanism|tid','drug_mechanism|action_type',]],\n", 2172 | " how='outer',left_on=['activities|molregno','assays|tid'],right_on=['drug_mechanism|molregno','drug_mechanism|tid'])\n", 2173 | "print(f'added drug mechanism: {final_df.shape}')\n", 2174 | "\n", 2175 | "## remake molregno column: fill missing activities molregno from drug_mechanism (x==x is False only for NaN)\n", 2176 | "ind = final_df[(final_df['drug_mechanism|molregno']==final_df['drug_mechanism|molregno'])& \\\n", 2177 | " (final_df['activities|molregno']!=final_df['activities|molregno'])].index\n", 2178 | "final_df['activities_drug_mechanism|molregno']=final_df['activities|molregno'].copy()\n", 2179 | 
"final_df.loc[ind,'activities_drug_mechanism|molregno']=final_df.loc[ind,'drug_mechanism|molregno']\n", 2180 | "print(sum(final_df['activities_drug_mechanism|molregno']==final_df['activities_drug_mechanism|molregno']))\n", 2181 | "del ind\n", 2182 | "\n", 2183 | "## remake tid column: fill missing assays tid from drug_mechanism (same NaN check as above)\n", 2184 | "ind = final_df[(final_df['drug_mechanism|tid']==final_df['drug_mechanism|tid'])& \\\n", 2185 | " (final_df['assays|tid']!=final_df['assays|tid'])].index\n", 2186 | "final_df['assays_drug_mechanism|tid']=final_df['assays|tid'].copy()\n", 2187 | "final_df.loc[ind,'assays_drug_mechanism|tid']=final_df.loc[ind,'drug_mechanism|tid']\n", 2188 | "print(sum(final_df['assays_drug_mechanism|tid']==final_df['assays_drug_mechanism|tid']))\n", 2189 | "\n", 2190 | "\n", 2191 | "# merge compound information\n", 2192 | "final_df = final_df.merge(molecule_dictionary[['molecule_dictionary|molregno', 'molecule_dictionary|pref_name','molecule_dictionary|chembl_id', 'molecule_dictionary|max_phase', \n", 2193 | " 'molecule_dictionary|molecule_type','molecule_dictionary|oral','molecule_dictionary|parenteral', 'molecule_dictionary|topical',\n", 2194 | " 'molecule_dictionary|black_box_warning','molecule_dictionary|natural_product']],\n", 2195 | " how='left',left_on='activities_drug_mechanism|molregno',right_on='molecule_dictionary|molregno')\n", 2196 | "\n", 2197 | "final_df = final_df.merge(molecule_atc_classification[['molecule_atc_classification|molregno','molecule_atc_classification|level5']],\n", 2198 | " how='left',left_on='activities_drug_mechanism|molregno',right_on='molecule_atc_classification|molregno')\n", 2199 | "\n", 2200 | "final_df = final_df.merge(atc_classification[['atc_classification|level1','atc_classification|level2','atc_classification|level3','atc_classification|level4','atc_classification|level5',\n", 2201 | " 'atc_classification|level1_description','atc_classification|level2_description',\n", 2202 | 
'atc_classification|level3_description','atc_classification|level4_description','atc_classification|who_name']],\n", 2203 | " how='left',left_on='molecule_atc_classification|level5',right_on='atc_classification|level5')\n", 2204 | "print(f'added compound info: {final_df.shape}')\n", 2205 | "\n", 2206 | "# merge targets\n", 2207 | "## merging target dataframes first (to link tid with gene symbol)\n", 2208 | "targets_final = target_dictionary.merge(target_components,how='left',left_on='target_dictionary|tid',right_on='target_components|tid')\n", 2209 | "targets_final = targets_final.merge(component_synonyms,how='left',left_on='target_components|component_id',right_on='component_synonyms|component_id')\n", 2210 | "print(targets_final['component_synonyms|syn_type'].value_counts())\n", 2211 | "\n", 2212 | "## selecting targets which have 'gene symbol'\n", 2213 | "## syn_type == 'GENE_SYMBOL'\n", 2214 | "targets_final = targets_final[targets_final['component_synonyms|syn_type']=='GENE_SYMBOL']\n", 2215 | "\n", 2216 | "## merge\n", 2217 | "final_df = final_df.merge(targets_final[['target_dictionary|tid','target_dictionary|target_type','target_dictionary|pref_name','target_dictionary|organism','target_dictionary|chembl_id',\n", 2218 | " 'component_synonyms|component_synonym', 'component_synonyms|syn_type']],\n", 2219 | " how='left',left_on='assays_drug_mechanism|tid',right_on='target_dictionary|tid')\n", 2220 | "\n", 2221 | "print(f'added targets info: {final_df.shape}')\n", 2222 | "final_df.head()" 2223 | ] 2224 | }, 2225 | { 2226 | "cell_type": "markdown", 2227 | "id": "piano-gibraltar", 2228 | "metadata": {}, 2229 | "source": [ 2230 | "## Save the data frame above for access to non-human entries" 2231 | ] 2232 | }, 2233 | { 2234 | "cell_type": "markdown", 2235 | "id": "vertical-spiritual", 2236 | "metadata": {}, 2237 | "source": [ 2238 | "## Filtering drugs targeting human molecules" 2239 | ] 2240 | }, 2241 | { 2242 | "cell_type": "code", 2243 | "execution_count": 30, 
2244 | "id": "handed-bracket", 2245 | "metadata": {}, 2246 | "outputs": [ 2247 | { 2248 | "data": { 2249 | "text/plain": [ 2250 | "(6708016, 53)" 2251 | ] 2252 | }, 2253 | "execution_count": 30, 2254 | "metadata": {}, 2255 | "output_type": "execute_result" 2256 | } 2257 | ], 2258 | "source": [ 2259 | "final_df = final_df[final_df['target_dictionary|organism']=='Homo sapiens']\n", 2260 | "final_df.shape" 2261 | ] 2262 | }, 2263 | { 2264 | "cell_type": "code", 2265 | "execution_count": null, 2266 | "id": "identified-logan", 2267 | "metadata": {}, 2268 | "outputs": [], 2269 | "source": [] 2270 | }, 2271 | { 2272 | "cell_type": "markdown", 2273 | "id": "available-sharp", 2274 | "metadata": {}, 2275 | "source": [ 2276 | "## Add target class" 2277 | ] 2278 | }, 2279 | { 2280 | "cell_type": "markdown", 2281 | "id": "dependent-oregon", 2282 | "metadata": {}, 2283 | "source": [ 2284 | "- Based on [IDG Protein list](https://druggablegenome.net/IDGProteinList)\n", 2285 | "- add 'Ion channel' from [HGNC, GID:177](https://www.genenames.org/data/genegroup/#!/group/177)\n", 2286 | "- add 'GPCR' from [HGNC, GID:139](https://www.genenames.org/data/genegroup/#!/group/139)\n", 2287 | "- add 'NHR' (Nuclear Hormone Receptors) from [HGNC, GID:71](https://www.genenames.org/data/genegroup/#!/group/71)" 2288 | ] 2289 | }, 2290 | { 2291 | "cell_type": "code", 2292 | "execution_count": 31, 2293 | "id": "expensive-occurrence", 2294 | "metadata": {}, 2295 | "outputs": [ 2296 | { 2297 | "data": { 2298 | "text/plain": [ 2299 | "dict_keys(['Kinase', 'GPCR', 'Ion Channel', 'NHR'])" 2300 | ] 2301 | }, 2302 | "execution_count": 31, 2303 | "metadata": {}, 2304 | "output_type": "execute_result" 2305 | } 2306 | ], 2307 | "source": [ 2308 | "# create dictionary for protein classes\n", 2309 | "idg = pd.read_csv('IDG_TargetList_Y4.csv')\n", 2310 | "\n", 2311 | "targetclass_dict={}\n", 2312 | "for c in set(idg['IDGFamily']):\n", 2313 | "    targetclass_dict[c]=list(idg[idg['IDGFamily']==c]['Gene'])\n", 2314 | 
"\n", 2315 | "ion = pd.read_csv('HGNC_GID177_Ion-channels.txt',sep='\\t')\n", 2316 | "gpcr = pd.read_csv('HGNC_GID139_G-protein-coupled-receptors.txt',sep='\\t')\n", 2317 | "nr = pd.read_csv('HGNC_GID71_Nuclear-hormone-receptors.txt',sep='\\t')\n", 2318 | "\n", 2319 | "targetclass_dict['Ion Channel']=list(set(targetclass_dict['Ion Channel']+list(ion['Approved symbol'])))\n", 2320 | "targetclass_dict['GPCR']=list(set(targetclass_dict['GPCR']+list(gpcr['Approved symbol'])))\n", 2321 | "targetclass_dict['NHR']=list(nr['Approved symbol'].unique())\n", 2322 | "targetclass_dict.keys()" 2323 | ] 2324 | }, 2325 | { 2326 | "cell_type": "code", 2327 | "execution_count": 32, 2328 | "id": "understood-graduate", 2329 | "metadata": {}, 2330 | "outputs": [ 2331 | { 2332 | "name": "stderr", 2333 | "output_type": "stream", 2334 | "text": [ 2335 | "/home/jovyan/my-conda-envs/cellpymc/lib/python3.7/site-packages/ipykernel_launcher.py:13: SettingWithCopyWarning: \n", 2336 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 2337 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 2338 | "\n", 2339 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 2340 | " del sys.path[0]\n", 2341 | "/home/jovyan/my-conda-envs/cellpymc/lib/python3.7/site-packages/ipykernel_launcher.py:14: SettingWithCopyWarning: \n", 2342 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 2343 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 2344 | "\n", 2345 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 2346 | " \n" 2347 | ] 2348 | }, 2349 | { 2350 | "data": { 2351 | "text/plain": [ 2352 | "none 5347947\n", 2353 | "GPCR 903731\n", 2354 | "NHR 177300\n", 2355 | "Ion Channel 174983\n", 2356 | "Kinase 104055\n", 2357 | "Name: target_class, dtype: int64" 2358 | 
] 2359 | }, 2360 | "execution_count": 32, 2361 | "metadata": {}, 2362 | "output_type": "execute_result" 2363 | } 2364 | ], 2365 | "source": [ 2366 | "# assign protein class to each target\n", 2367 | "def which_class(dictionary, value):\n", 2368 | "    out='none'\n", 2369 | "    for k in dictionary.keys():\n", 2370 | "        if value in dictionary[k]:\n", 2371 | "            if out=='none':\n", 2372 | "                out=k\n", 2373 | "            else:\n", 2374 | "                out=f'{out};{k}'\n", 2375 | "    return out\n", 2376 | "\n", 2377 | "# add target class\n", 2378 | "final_df['target_class']=final_df['component_synonyms|component_synonym'].copy()\n", 2379 | "final_df['target_class']=[which_class(targetclass_dict,t) for t in final_df['target_class']]\n", 2380 | "final_df['target_class'].value_counts()" 2381 | ] 2382 | }, 2383 | { 2384 | "cell_type": "code", 2385 | "execution_count": null, 2386 | "id": "continued-brick", 2387 | "metadata": {}, 2388 | "outputs": [], 2389 | "source": [] 2390 | }, 2391 | { 2392 | "cell_type": "markdown", 2393 | "id": "fatal-ranch", 2394 | "metadata": {}, 2395 | "source": [ 2396 | "## Save" 2397 | ] 2398 | }, 2399 | { 2400 | "cell_type": "code", 2401 | "execution_count": 34, 2402 | "id": "metric-arnold", 2403 | "metadata": {}, 2404 | "outputs": [ 2405 | { 2406 | "name": "stdout", 2407 | "output_type": "stream", 2408 | "text": [ 2409 | "CPU times: user 25.7 s, sys: 5.28 s, total: 31 s\n", 2410 | "Wall time: 50.3 s\n" 2411 | ] 2412 | } 2413 | ], 2414 | "source": [ 2415 | "%%time\n", 2416 | "final_df.to_pickle('chembl_30_merged_genesymbols_humans.pkl')" 2417 | ] 2418 | }, 2419 | { 2420 | "cell_type": "code", 2421 | "execution_count": null, 2422 | "id": "southern-furniture", 2423 | "metadata": {}, 2424 | "outputs": [], 2425 | "source": [] 2426 | }, 2427 | { 2428 | "cell_type": "code", 2429 | "execution_count": null, 2430 | "id": "apart-scheme", 2431 | "metadata": {}, 2432 | "outputs": [], 2433 | "source": [] 2434 | }, 2435 | { 2436 | "cell_type": "code", 2437 | "execution_count": null, 2438 | 
"id": "herbal-baltimore", 2439 | "metadata": {}, 2440 | "outputs": [], 2441 | "source": [] 2442 | }, 2443 | { 2444 | "cell_type": "markdown", 2445 | "id": "compatible-intersection", 2446 | "metadata": {}, 2447 | "source": [ 2448 | "## Session info" 2449 | ] 2450 | }, 2451 | { 2452 | "cell_type": "code", 2453 | "execution_count": 35, 2454 | "id": "bored-structure", 2455 | "metadata": {}, 2456 | "outputs": [ 2457 | { 2458 | "data": { 2459 | "text/html": [ 2460 | "
\n", 2461 | "Click to view session information\n", 2462 | "
\n",
2463 |        "-----\n",
2464 |        "numpy               1.21.5\n",
2465 |        "pandas              1.2.0\n",
2466 |        "session_info        1.0.0\n",
2467 |        "-----\n",
2468 |        "
\n", 2469 | "
\n", 2470 | "Click to view modules imported as dependencies\n", 2471 | "
\n",
2472 |        "backcall            0.2.0\n",
2473 |        "colorama            0.4.4\n",
2474 |        "cython_runtime      NA\n",
2475 |        "dateutil            2.8.1\n",
2476 |        "decorator           4.4.2\n",
2477 |        "google              NA\n",
2478 |        "ipykernel           5.4.2\n",
2479 |        "ipython_genutils    0.2.0\n",
2480 |        "jedi                0.17.2\n",
2481 |        "mpl_toolkits        NA\n",
2482 |        "numexpr             2.7.2\n",
2483 |        "parso               0.7.1\n",
2484 |        "pexpect             4.8.0\n",
2485 |        "pickleshare         0.7.5\n",
2486 |        "prompt_toolkit      3.0.10\n",
2487 |        "ptyprocess          0.7.0\n",
2488 |        "pygments            2.7.4\n",
2489 |        "pytz                2020.5\n",
2490 |        "six                 1.14.0\n",
2491 |        "sphinxcontrib       NA\n",
2492 |        "storemagic          NA\n",
2493 |        "tornado             6.1\n",
2494 |        "traitlets           5.0.5\n",
2495 |        "wcwidth             0.2.5\n",
2496 |        "zmq                 21.0.1\n",
2497 |        "zope                NA\n",
2498 |        "
\n", 2499 | "
\n", 2500 | "
\n",
2501 |        "-----\n",
2502 |        "IPython             7.19.0\n",
2503 |        "jupyter_client      6.1.11\n",
2504 |        "jupyter_core        4.7.0\n",
2505 |        "notebook            6.2.0\n",
2506 |        "-----\n",
2507 |        "Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 21:08:20) [GCC 9.3.0]\n",
2508 |        "Linux-4.15.0-112-generic-x86_64-with-debian-bullseye-sid\n",
2509 |        "-----\n",
2510 |        "Session information updated at 2022-07-12 21:29\n",
2511 |        "
\n", 2512 | "
" 2513 | ], 2514 | "text/plain": [ 2515 | "" 2516 | ] 2517 | }, 2518 | "execution_count": 35, 2519 | "metadata": {}, 2520 | "output_type": "execute_result" 2521 | } 2522 | ], 2523 | "source": [ 2524 | "import session_info\n", 2525 | "session_info.show()" 2526 | ] 2527 | }, 2528 | { 2529 | "cell_type": "code", 2530 | "execution_count": null, 2531 | "id": "opposed-memory", 2532 | "metadata": {}, 2533 | "outputs": [], 2534 | "source": [] 2535 | }, 2536 | { 2537 | "cell_type": "code", 2538 | "execution_count": null, 2539 | "id": "heard-newman", 2540 | "metadata": {}, 2541 | "outputs": [], 2542 | "source": [] 2543 | } 2544 | ], 2545 | "metadata": { 2546 | "kernelspec": { 2547 | "display_name": "cellpymc", 2548 | "language": "python", 2549 | "name": "cellpymc" 2550 | }, 2551 | "language_info": { 2552 | "codemirror_mode": { 2553 | "name": "ipython", 2554 | "version": 3 2555 | }, 2556 | "file_extension": ".py", 2557 | "mimetype": "text/x-python", 2558 | "name": "python", 2559 | "nbconvert_exporter": "python", 2560 | "pygments_lexer": "ipython3", 2561 | "version": "3.7.9" 2562 | } 2563 | }, 2564 | "nbformat": 4, 2565 | "nbformat_minor": 5 2566 | } 2567 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='drug2cell', 5 | version='0.1.2', 6 | description='Gene group activity utility functions for scanpy', 7 | url='https://github.com/Teichlab/drug2cell', 8 | packages=find_packages(exclude=['docs', 'notebooks']), 9 | install_requires=[ 10 | 'anndata', 11 | 'pandas', 12 | 'numpy', 13 | 'statsmodels', 14 | 'scipy', 15 | 'blitzgsea' 16 | ], 17 | package_data={ 18 | "drug2cell": ["*.pkl"] 19 | }, 20 | author='Krzysztof Polanski, Kazumasa Kanemaru', 21 | author_email='kp9@sanger.ac.uk', 22 | license='non-commercial license' 23 | ) 24 | 
--------------------------------------------------------------------------------