├── LICENSE ├── README.md ├── examples ├── demuxlet_example.ipynb └── scrublet_basics.ipynb ├── old_versions └── v0.1 │ ├── LICENSE │ ├── README.md │ ├── examples │ ├── 10X_PBMC-8k_example.ipynb │ ├── 10X_PBMC-8k_scanpy_example.ipynb │ ├── demuxlet_PBMC_example.ipynb │ └── old │ │ └── 180306_basic_example.ipynb │ ├── requirements.txt │ ├── setup.py │ └── src │ └── scrublet │ ├── __init__.py │ ├── helper_functions.py │ └── scrublet.py ├── requirements.txt ├── setup.py └── src └── scrublet ├── __init__.py ├── helper_functions.py └── scrublet.py /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | Copyright (c) 2018 Samuel Wolock 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 5 | 6 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 7 | 8 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Scrublet 2 | **S**ingle-**C**ell **R**emover of Do**ublet**s 3 | 4 | Python code for identifying doublets in single-cell RNA-seq data. For details and validation of the method, see our paper in [Cell Systems](https://www.sciencedirect.com/science/article/pii/S2405471218304745) or the preprint on [bioRxiv](https://www.biorxiv.org/content/early/2018/07/09/357368). 5 | 6 | #### Quick start: 7 | For a typical workflow, including interpretation of predicted doublet scores, see the example [notebook](./examples/scrublet_basics.ipynb). 8 | 9 | Given a raw (unnormalized) UMI counts matrix `counts_matrix` with cells as rows and genes as columns, calculate a doublet score for each cell: 10 | ```python 11 | import scrublet as scr 12 | scrub = scr.Scrublet(counts_matrix) 13 | doublet_scores, predicted_doublets = scrub.scrub_doublets() 14 | ``` 15 | `scrub.scrub_doublets()` simulates doublets from the observed data and uses a k-nearest-neighbor classifier to calculate a continuous `doublet_score` (between 0 and 1) for each transcriptome. The score is automatically thresholded to generate `predicted_doublets`, a boolean array that is `True` for predicted doublets and `False` otherwise. 16 | 17 | #### Best practices: 18 | - When working with data from multiple samples, run Scrublet on each sample separately. Because Scrublet is designed to detect technical doublets formed by the random co-encapsulation of two cells, it may perform poorly on merged datasets where the cell type proportions are not representative of any single sample. 19 | - Check that the doublet score threshold is reasonable (in an ideal case, separating the two peaks of a bimodal simulated doublet score histogram, as in [this example](./examples/scrublet_basics.ipynb)), and adjust manually if necessary. 
20 | - Visualize the doublet predictions in a 2-D embedding (e.g., UMAP or t-SNE). Predicted doublets should mostly co-localize (possibly in multiple clusters). If they do not, you may need to adjust the doublet score threshold, or change the pre-processing parameters to better resolve the cell states present in your data. 21 | 22 | #### Installation: 23 | To install with PyPI: 24 | ```bash 25 | pip install scrublet 26 | ``` 27 | 28 | To install from source: 29 | ```bash 30 | git clone https://github.com/AllonKleinLab/scrublet.git 31 | cd scrublet 32 | pip install -r requirements.txt 33 | pip install --upgrade . 34 | ``` 35 | 36 | #### Old versions: 37 | Previous versions can be found [here](./old_versions/). 38 | 39 | #### Other doublet detection tools: 40 | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) 41 | [DoubletDecon](https://github.com/EDePasquale/DoubletDecon) 42 | [DoubletDetection](https://github.com/JonathanShor/DoubletDetection) 43 | -------------------------------------------------------------------------------- /examples/scrublet_basics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This example shows how to: \n", 8 | "1. Load a counts matrix (10X Chromium data from human peripheral blood cells)\n", 9 | "2. Run the default Scrublet pipeline \n", 10 | "3. 
Check that doublet predictions make sense" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "%matplotlib inline\n", 20 | "import scrublet as scr\n", 21 | "import scipy.io\n", 22 | "import matplotlib.pyplot as plt\n", 23 | "import numpy as np\n", 24 | "import os" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 2, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "plt.rcParams['font.family'] = 'sans-serif'\n", 34 | "plt.rcParams['font.sans-serif'] = 'Arial'\n", 35 | "plt.rc('font', size=14)\n", 36 | "plt.rcParams['pdf.fonttype'] = 42" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "#### Download 8k PBMC data set from 10X Genomics\n", 44 | "Download raw data from this link:\n", 45 | "http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz\n", 46 | "\n", 47 | "\n", 48 | "Or use wget:" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stdout", 58 | "output_type": "stream", 59 | "text": [ 60 | "--2018-10-04 11:21:04-- http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz\n", 61 | "Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 13.35.78.24, 13.35.78.116, 13.35.78.82, ...\n", 62 | "Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|13.35.78.24|:80... connected.\n", 63 | "HTTP request sent, awaiting response... 
200 OK\n", 64 | "Length: 37558165 (36M) [application/x-tar]\n", 65 | "Saving to: ‘pbmc8k_filtered_gene_bc_matrices.tar.gz’\n", 66 | "\n", 67 | "pbmc8k_filtered_gen 100%[===================>] 35.82M 16.5MB/s in 2.2s \n", 68 | "\n", 69 | "2018-10-04 11:21:07 (16.5 MB/s) - ‘pbmc8k_filtered_gene_bc_matrices.tar.gz’ saved [37558165/37558165]\n", 70 | "\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "!wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Uncompress:" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 4, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "!tar xfz pbmc8k_filtered_gene_bc_matrices.tar.gz" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "#### Load counts matrix and gene list\n", 99 | "Load the raw counts matrix as a scipy sparse matrix with cells as rows and genes as columns." 
100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "Counts matrix shape: 8381 rows, 33694 columns\n", 112 | "Number of genes in gene list: 33694\n" 113 | ] 114 | } 115 | ], 116 | "source": [ 117 | "input_dir = 'filtered_gene_bc_matrices/GRCh38/'\n", 118 | "counts_matrix = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()\n", 119 | "genes = np.array(scr.load_genes(input_dir + 'genes.tsv', delimiter='\\t', column=1))\n", 120 | "\n", 121 | "print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))\n", 122 | "print('Number of genes in gene list: {}'.format(len(genes)))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "#### Initialize Scrublet object\n", 130 | "The relevant parameters are:\n", 131 | "- *expected_doublet_rate*: the expected fraction of transcriptomes that are doublets, typically 0.05-0.1. Results are not particularly sensitive to this parameter. For this example, the expected doublet rate comes from the Chromium User Guide: https://support.10xgenomics.com/permalink/3vzDu3zQjY0o2AqkkkI4CC\n", 132 | "- *sim_doublet_ratio*: the number of doublets to simulate, relative to the number of observed transcriptomes. This should be high enough that all doublet states are well-represented by simulated doublets. Setting it too high is computationally expensive. The default value is 2, though values as low as 0.5 give very similar results for the datasets that have been tested.\n", 133 | "- *n_neighbors*: Number of neighbors used to construct the KNN classifier of observed transcriptomes and simulated doublets. 
The default value of `round(0.5*sqrt(n_cells))` generally works well.\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=0.06)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "#### Run the default pipeline, which includes:\n", 150 | "1. Doublet simulation\n", 151 | "2. Normalization, gene filtering, rescaling, PCA\n", 152 | "3. Doublet score calculation \n", 153 | "4. Doublet score threshold detection and doublet calling\n" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 7, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "name": "stdout", 163 | "output_type": "stream", 164 | "text": [ 165 | "Preprocessing...\n", 166 | "Simulating doublets...\n", 167 | "Embedding transcriptomes using PCA...\n", 168 | "Calculating doublet scores...\n", 169 | "Automatically set threshold at doublet score = 0.22\n", 170 | "Detected doublet rate = 4.3%\n", 171 | "Estimated detectable doublet fraction = 61.5%\n", 172 | "Overall doublet rate:\n", 173 | "\tExpected = 6.0%\n", 174 | "\tEstimated = 7.0%\n", 175 | "Elapsed time: 11.2 seconds\n" 176 | ] 177 | } 178 | ], 179 | "source": [ 180 | "doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, \n", 181 | " min_cells=3, \n", 182 | " min_gene_variability_pctl=85, \n", 183 | " n_prin_comps=30)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "#### Plot doublet score histograms for observed transcriptomes and simulated doublets\n", 191 | "The simulated doublet histogram is typically bimodal. The left mode corresponds to \"embedded\" doublets generated by two cells with similar gene expression. 
The right mode corresponds to \"neotypic\" doublets, which are generated by cells with distinct gene expression (e.g., different cell types) and are expected to introduce more artifacts in downstream analyses. Scrublet can only detect neotypic doublets. \n", 192 | " \n", 193 | "To call doublets vs. singlets, we must set a threshold doublet score, ideally at the minimum between the two modes of the simulated doublet histogram. `scrub_doublets()` attempts to identify this point automatically and has done a good job in this example. However, if automatic threshold detection doesn't work well, you can adjust the threshold with the `call_doublets()` function. For example:\n", 194 | "```python\n", 195 | "scrub.call_doublets(threshold=0.25)\n", 196 | "```" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 8, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "image/png": "\n", 207 | "text/plain": [ 208 | "
" 209 | ] 210 | }, 211 | "metadata": {}, 212 | "output_type": "display_data" 213 | } 214 | ], 215 | "source": [ 216 | "scrub.plot_histogram();" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "#### Get 2-D embedding to visualize the results" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 9, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": "stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "Running UMAP...\n", 236 | "Done.\n" 237 | ] 238 | } 239 | ], 240 | "source": [ 241 | "print('Running UMAP...')\n", 242 | "scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))\n", 243 | "\n", 244 | "# # Uncomment to run tSNE - slow\n", 245 | "# print('Running tSNE...')\n", 246 | "# scrub.set_embedding('tSNE', scr.get_tsne(scrub.manifold_obs_, angle=0.9))\n", 247 | "\n", 248 | "# # Uncomment to run force layout - slow\n", 249 | "# print('Running ForceAtlas2...')\n", 250 | "# scrub.set_embedding('FA', scr.get_force_layout(scrub.manifold_obs_, n_neighbors=5, n_iter=1000))\n", 251 | " \n", 252 | "print('Done.')" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "#### Plot doublet predictions on 2-D embedding\n", 260 | "Predicted doublets should co-localize in distinct states." 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 10, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "image/png": "\n", 271 | "text/plain": [ 272 | "
" 273 | ] 274 | }, 275 | "metadata": {}, 276 | "output_type": "display_data" 277 | } 278 | ], 279 | "source": [ 280 | "scrub.plot_embedding('UMAP', order_points=True);\n", 281 | "\n", 282 | "# scrub.plot_embedding('tSNE', order_points=True);\n", 283 | "# scrub.plot_embedding('FA', order_points=True);" 284 | ] 285 | } 286 | ], 287 | "metadata": { 288 | "kernelspec": { 289 | "display_name": "Python 3", 290 | "language": "python", 291 | "name": "python3" 292 | }, 293 | "language_info": { 294 | "codemirror_mode": { 295 | "name": "ipython", 296 | "version": 3 297 | }, 298 | "file_extension": ".py", 299 | "mimetype": "text/x-python", 300 | "name": "python", 301 | "nbconvert_exporter": "python", 302 | "pygments_lexer": "ipython3", 303 | "version": "3.6.4" 304 | } 305 | }, 306 | "nbformat": 4, 307 | "nbformat_minor": 2 308 | } 309 | -------------------------------------------------------------------------------- /old_versions/v0.1/LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | Copyright (c) 2018 Samuel Wolock 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 5 | 6 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 7 | 8 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 9 | -------------------------------------------------------------------------------- /old_versions/v0.1/README.md: -------------------------------------------------------------------------------- 1 | # Scrublet 2 | **S**ingle-**C**ell **R**emover of Do**ublet**s 3 | 4 | Python code for identifying doublets in single-cell RNA-seq data. For details and validation of the method, see our preprint on [bioRxiv](https://www.biorxiv.org/content/early/2018/07/09/357368). 5 | 6 | 7 | Given a raw counts matrix `E`, with cells as rows and genes as columns, you can calculate a doublet score for each cell with the following command: 8 | ```python 9 | import scrublet as scr 10 | scrublet_results = scr.compute_doublet_scores(E) 11 | doublet_scores = scrublet_results['doublet_scores_observed_cells'] 12 | ``` 13 | 14 | There are a number of optional parameters that can have major effects on the quality of results. For a typical workflow, including interpretation of predicted doublet scores, see the example [ipython notebook](./examples/10X_PBMC-8k_example.ipynb). 15 | 16 | We are also working on a [scanpy](https://github.com/theislab/scanpy) implementation. See https://github.com/swolock/scanpy for a fork implementing Scrublet and the example [here](./examples/10X_PBMC-8k_scanpy_example.ipynb). 17 | 18 | #### To install: 19 | ```bash 20 | git clone https://github.com/AllonKleinLab/scrublet.git 21 | cd scrublet 22 | pip install -r requirements.txt 23 | pip install --upgrade . 
24 | ``` 25 | #### Other doublet detection tools: 26 | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) 27 | [DoubletDecon](https://github.com/EDePasquale/DoubletDecon) 28 | [DoubletDetection](https://github.com/JonathanShor/DoubletDetection) 29 | -------------------------------------------------------------------------------- /old_versions/v0.1/examples/10X_PBMC-8k_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%matplotlib inline\n", 10 | "import scrublet as scr\n", 11 | "import scipy.io\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "import numpy as np\n", 14 | "import os\n", 15 | "import time\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### Download 8k PBMC data set from 10X Genomics\n", 23 | "(Only need to do this once)\n", 24 | "\n", 25 | "Download raw data by navigating to the following URL in your web browser:\n", 26 | "http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz\n", 27 | "\n", 28 | "Or use wget:" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "name": "stdout", 38 | "output_type": "stream", 39 | "text": [ 40 | "--2018-07-16 16:49:19-- http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz\n", 41 | "Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 13.33.35.84, 13.33.35.97, 13.33.35.175, ...\n", 42 | "Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|13.33.35.84|:80... connected.\n", 43 | "HTTP request sent, awaiting response... 
200 OK\n", 44 | "Length: 37558165 (36M) [application/x-tar]\n", 45 | "Saving to: ‘pbmc8k_filtered_gene_bc_matrices.tar.gz’\n", 46 | "\n", 47 | "pbmc8k_filtered_gen 100%[===================>] 35.82M 7.21MB/s in 6.4s \n", 48 | "\n", 49 | "2018-07-16 16:49:25 (5.60 MB/s) - ‘pbmc8k_filtered_gene_bc_matrices.tar.gz’ saved [37558165/37558165]\n", 50 | "\n" 51 | ] 52 | } 53 | ], 54 | "source": [ 55 | "!wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "Uncompress:" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 3, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "!tar xfz pbmc8k_filtered_gene_bc_matrices.tar.gz" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "#### Load counts matrix and gene list\n", 79 | "The first time this is run, the counts matrix is loaded from the mtx file. An npz file is saved for fast loading in the future." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "name": "stdout", 89 | "output_type": "stream", 90 | "text": [ 91 | "Expression matrix shape: 8381 rows, 33694 columns\n", 92 | "Number of genes in gene list: 33694\n" 93 | ] 94 | } 95 | ], 96 | "source": [ 97 | "input_dir = 'filtered_gene_bc_matrices/GRCh38/'\n", 98 | "\n", 99 | "# The raw counts matrix (E) should be a scipy sparse CSC matrix\n", 100 | "# with cells as rows and genes as columns\n", 101 | "\n", 102 | "if os.path.isfile(input_dir + '/matrix.npz'):\n", 103 | " E = scipy.sparse.load_npz(input_dir + '/matrix.npz')\n", 104 | "else:\n", 105 | " E = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()\n", 106 | " scipy.sparse.save_npz(input_dir + '/matrix.npz', E, compressed=True)\n", 107 | "\n", 108 | "genes = np.array(scr.load_genes(input_dir + 'genes.tsv', delimiter='\\t', column=1))\n", 109 | "\n", 110 | "print('Expression matrix shape: {} rows, {} columns'.format(E.shape[0], E.shape[1]))\n", 111 | "print('Number of genes in gene list: {}'.format(len(genes)))" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "#### Check that the distribution of total counts per cell looks reasonable (i.e., background has been filtered out)\n" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 5, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/plain": [ 129 | "Text(0,0.5,'Number of cells')" 130 | ] 131 | }, 132 | "execution_count": 5, 133 | "metadata": {}, 134 | "output_type": "execute_result" 135 | }, 136 | { 137 | "data": { 138 | "image/png": 
"\n", 139 | "text/plain": [ 140 | "
" 141 | ] 142 | }, 143 | "metadata": {}, 144 | "output_type": "display_data" 145 | } 146 | ], 147 | "source": [ 148 | "total_counts = E.sum(1).A.squeeze()\n", 149 | "\n", 150 | "fig, ax = plt.subplots(figsize = (5, 3))\n", 151 | "ax.hist(total_counts, bins = np.logspace(3, 5, 40))\n", 152 | "ax.set_xscale('log')\n", 153 | "ax.set_xlabel('Total counts')\n", 154 | "ax.set_ylabel('Number of cells')\n" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "#### Calculate doublet scores\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "Simulating doublets\n", 174 | "Total counts normalizing\n", 175 | "Finding highly variable genes\n", 176 | "Filtering genes from 33694 to 1697\n", 177 | "Applying z-score normalization\n", 178 | "Running PCA\n", 179 | "Building kNN graph and calculating doublet scores\n", 180 | "Elapsed time: 19.0 seconds\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "# filtering/preprocessing parameters:\n", 186 | "min_counts = 2\n", 187 | "min_cells = 3\n", 188 | "vscore_percentile = 85\n", 189 | "n_pc = 30\n", 190 | "\n", 191 | "# doublet detector parameters:\n", 192 | "expected_doublet_rate = 0.06 \n", 193 | "sim_doublet_ratio = 3\n", 194 | "n_neighbors = 50\n", 195 | "\n", 196 | "t0 = time.time()\n", 197 | "\n", 198 | "scrublet_results = scr.compute_doublet_scores(\n", 199 | " E, \n", 200 | " min_counts = min_counts, \n", 201 | " min_cells = min_cells, \n", 202 | " vscore_percentile = vscore_percentile, \n", 203 | " n_prin_comps = n_pc,\n", 204 | " scaling_method = 'zscore',\n", 205 | " expected_doublet_rate = expected_doublet_rate,\n", 206 | " sim_doublet_ratio = sim_doublet_ratio,\n", 207 | " n_neighbors = n_neighbors, \n", 208 | " use_approx_neighbors = True, \n", 209 | " get_doublet_neighbor_parents = False\n", 210 | ")\n", 211 | "\n", 
212 | "\n", 213 | "t1 = time.time()\n", 214 | "print('Elapsed time: {:.1f} seconds'.format(t1 - t0))" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "#### Get UMAP embedding to help visualize the results" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 7, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# UMAP: https://github.com/lmcinnes/umap\n", 231 | "import umap\n", 232 | "\n", 233 | "embedding = umap.UMAP(n_neighbors=10).fit_transform(scrublet_results['pca_observed_cells'])\n" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "#### Set doublet score threshold and visualize results\n", 241 | "To call doublets, manually set a threshold between the two peaks of the simulated doublet histogram." 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 8, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "name": "stdout", 251 | "output_type": "stream", 252 | "text": [ 253 | "349/8381 = 4.2% of cells are predicted doublets.\n", 254 | "60.0% of doublets are predicted to be detectable.\n", 255 | "Predicted overall doublet rate = 6.9%\n" 256 | ] 257 | }, 258 | { 259 | "data": { 260 | "image/png": "\n", 261 | "text/plain": [ 262 | "
" 263 | ] 264 | }, 265 | "metadata": {}, 266 | "output_type": "display_data" 267 | } 268 | ], 269 | "source": [ 270 | "score_threshold = 0.25\n", 271 | "\n", 272 | "fig, axs = scr.plot_scrublet_results(embedding, \n", 273 | " scrublet_results['doublet_scores_observed_cells'], \n", 274 | " scrublet_results['doublet_scores_simulated_doublets'], \n", 275 | " score_threshold, \n", 276 | " order_points = True, \n", 277 | " marker_size = 4)\n" 278 | ] 279 | } 280 | ], 281 | "metadata": { 282 | "kernelspec": { 283 | "display_name": "Python 3", 284 | "language": "python", 285 | "name": "python3" 286 | }, 287 | "language_info": { 288 | "codemirror_mode": { 289 | "name": "ipython", 290 | "version": 3 291 | }, 292 | "file_extension": ".py", 293 | "mimetype": "text/x-python", 294 | "name": "python", 295 | "nbconvert_exporter": "python", 296 | "pygments_lexer": "ipython3", 297 | "version": "3.6.4" 298 | } 299 | }, 300 | "nbformat": 4, 301 | "nbformat_minor": 2 302 | } 303 | -------------------------------------------------------------------------------- /old_versions/v0.1/requirements.txt: -------------------------------------------------------------------------------- 1 | annoy 2 | matplotlib 3 | numpy 4 | scikit-learn 5 | scipy 6 | -------------------------------------------------------------------------------- /old_versions/v0.1/setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | 3 | setup( 4 | name = "scrublet", 5 | packages = ['scrublet'], 6 | package_dir={'': 'src'}, 7 | version = '0.1', 8 | description = 'Doublet prediction in single-cell RNA-sequencing data', 9 | author = 'Samuel L. 
Wolock', 10 | author_email = 'swolock@g.harvard.edu', 11 | url = 'https://github.com/swolock/scrublet', 12 | download_url = 'https://github.com/swolock/scrublet/tarball/0.1', 13 | install_requires=['numpy', 'scipy', 'scikit-learn', 'matplotlib', 'annoy'], 14 | ) 15 | -------------------------------------------------------------------------------- /old_versions/v0.1/src/scrublet/__init__.py: -------------------------------------------------------------------------------- 1 | from .scrublet import compute_doublet_scores, plot_scrublet_results 2 | from .helper_functions import * 3 | -------------------------------------------------------------------------------- /old_versions/v0.1/src/scrublet/helper_functions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy 3 | import scipy.stats 4 | import scipy.sparse 5 | from sklearn.decomposition import PCA,TruncatedSVD 6 | from sklearn.neighbors import NearestNeighbors 7 | import time 8 | 9 | ########## LOADING DATA 10 | def load_genes(filename, delimiter='\t', column=0, skip_rows=0): 11 | gene_list = [] 12 | gene_dict = {} 13 | 14 | with open(filename) as f: 15 | for iL in range(skip_rows): 16 | f.readline() 17 | for l in f: 18 | gene = l.strip('\n').split(delimiter)[column] 19 | if gene in gene_dict: 20 | gene_dict[gene] += 1 21 | gene_list.append(gene + '__' + str(gene_dict[gene])) 22 | if gene_dict[gene] == 2: 23 | i = gene_list.index(gene) 24 | gene_list[i] = gene + '__1' 25 | else: 26 | gene_dict[gene] = 1 27 | gene_list.append(gene) 28 | return gene_list 29 | 30 | 31 | def make_genes_unique(orig_gene_list): 32 | gene_list = [] 33 | gene_dict = {} 34 | 35 | for gene in orig_gene_list: 36 | if gene in gene_dict: 37 | gene_dict[gene] += 1 38 | gene_list.append(gene + '__' + str(gene_dict[gene])) 39 | if gene_dict[gene] == 2: 40 | i = gene_list.index(gene) 41 | gene_list[i] = gene + '__1' 42 | else: 43 | gene_dict[gene] = 1 44 | gene_list.append(gene) 
45 | return gene_list 46 | 47 | ########## USEFUL SPARSE FUNCTIONS 48 | 49 | def sparse_var(E, axis=0): 50 | ''' variance across the specified axis ''' 51 | 52 | mean_gene = E.mean(axis=axis).A.squeeze() 53 | tmp = E.copy() 54 | tmp.data **= 2 55 | return tmp.mean(axis=axis).A.squeeze() - mean_gene ** 2 56 | 57 | def sparse_multiply(E, a): 58 | ''' multiply each row of E by a scalar ''' 59 | 60 | nrow = E.shape[0] 61 | w = scipy.sparse.lil_matrix((nrow, nrow)) 62 | w.setdiag(a) 63 | return w * E 64 | 65 | def sparse_zscore(E, gene_mean=None, gene_stdev=None): 66 | ''' z-score normalize each column of E ''' 67 | 68 | if gene_mean is None: 69 | gene_mean = E.mean(0) 70 | if gene_stdev is None: 71 | gene_stdev = np.sqrt(sparse_var(E)) 72 | return sparse_multiply((E - gene_mean).T, 1/gene_stdev).T 73 | 74 | ########## GENE FILTERING 75 | 76 | def runningquantile(x, y, p, nBins): 77 | 78 | ind = np.argsort(x) 79 | x = x[ind] 80 | y = y[ind] 81 | 82 | dx = (x[-1] - x[0]) / nBins 83 | xOut = np.linspace(x[0]+dx/2, x[-1]-dx/2, nBins) 84 | 85 | yOut = np.zeros(xOut.shape) 86 | 87 | for i in range(len(xOut)): 88 | ind = np.nonzero((x >= xOut[i]-dx/2) & (x < xOut[i]+dx/2))[0] 89 | if len(ind) > 0: 90 | yOut[i] = np.percentile(y[ind], p) 91 | else: 92 | if i > 0: 93 | yOut[i] = yOut[i-1] 94 | else: 95 | yOut[i] = np.nan 96 | 97 | return xOut, yOut 98 | 99 | 100 | def get_vscores(E, min_mean=0, nBins=50, fit_percentile=0.1, error_wt=1): 101 | ''' 102 | Calculate v-score (above-Poisson noise statistic) for genes in the input counts matrix 103 | Return v-scores and other stats 104 | ''' 105 | 106 | ncell = E.shape[0] 107 | 108 | mu_gene = E.mean(axis=0).A.squeeze() 109 | gene_ix = np.nonzero(mu_gene > min_mean)[0] 110 | mu_gene = mu_gene[gene_ix] 111 | 112 | tmp = E[:,gene_ix] 113 | tmp.data **= 2 114 | var_gene = tmp.mean(axis=0).A.squeeze() - mu_gene ** 2 115 | del tmp 116 | FF_gene = var_gene / mu_gene 117 | 118 | data_x = np.log(mu_gene) 119 | data_y = np.log(FF_gene / 
mu_gene) 120 | 121 | x, y = runningquantile(data_x, data_y, fit_percentile, nBins) 122 | x = x[~np.isnan(y)] 123 | y = y[~np.isnan(y)] 124 | 125 | gLog = lambda input: np.log(input[1] * np.exp(-input[0]) + input[2]) 126 | h,b = np.histogram(np.log(FF_gene[mu_gene>0]), bins=200) 127 | b = b[:-1] + np.diff(b)/2 128 | max_ix = np.argmax(h) 129 | c = np.max((np.exp(b[max_ix]), 1)) 130 | errFun = lambda b2: np.sum(abs(gLog([x,c,b2])-y) ** error_wt) 131 | b0 = 0.1 132 | b = scipy.optimize.fmin(func = errFun, x0=[b0], disp=False) 133 | a = c / (1+b) - 1 134 | 135 | 136 | v_scores = FF_gene / ((1+a)*(1+b) + b * mu_gene); 137 | CV_eff = np.sqrt((1+a)*(1+b) - 1); 138 | CV_input = np.sqrt(b); 139 | 140 | return v_scores, CV_eff, CV_input, gene_ix, mu_gene, FF_gene, a, b 141 | 142 | def filter_genes(E, base_ix = [], min_vscore_pctl = 85, min_counts = 3, min_cells = 3, show_vscore_plot = False, sample_name = ''): 143 | ''' 144 | Filter genes by expression level and variability 145 | Return list of filtered gene indices 146 | ''' 147 | 148 | if len(base_ix) == 0: 149 | base_ix = np.arange(E.shape[0]) 150 | 151 | Vscores, CV_eff, CV_input, gene_ix, mu_gene, FF_gene, a, b = get_vscores(E[base_ix, :]) 152 | ix2 = Vscores>0 153 | Vscores = Vscores[ix2] 154 | gene_ix = gene_ix[ix2] 155 | mu_gene = mu_gene[ix2] 156 | FF_gene = FF_gene[ix2] 157 | min_vscore = np.percentile(Vscores, min_vscore_pctl) 158 | ix = (((E[:,gene_ix] >= min_counts).sum(0).A.squeeze() >= min_cells) & (Vscores >= min_vscore)) 159 | 160 | if show_vscore_plot: 161 | import matplotlib.pyplot as plt 162 | x_min = 0.5*np.min(mu_gene) 163 | x_max = 2*np.max(mu_gene) 164 | xTh = x_min * np.exp(np.log(x_max/x_min)*np.linspace(0,1,100)) 165 | yTh = (1 + a)*(1+b) + b * xTh 166 | plt.figure(figsize=(8, 6)); 167 | plt.scatter(np.log10(mu_gene), np.log10(FF_gene), c = [.8,.8,.8], alpha = 0.3, edgecolors=''); 168 | plt.scatter(np.log10(mu_gene)[ix], np.log10(FF_gene)[ix], c = [0,0,0], alpha = 0.3, edgecolors=''); 169 | 
plt.plot(np.log10(xTh),np.log10(yTh)); 170 | plt.title(sample_name) 171 | plt.xlabel('log10(mean)'); 172 | plt.ylabel('log10(Fano factor)'); 173 | plt.show() 174 | 175 | return gene_ix[ix] 176 | 177 | ########## CELL NORMALIZATION 178 | 179 | def tot_counts_norm(E, exclude_dominant_frac = 1, included = [], target_mean = 0): 180 | ''' 181 | Cell-level total counts normalization of input counts matrix, excluding overly abundant genes if desired. 182 | Return normalized counts, average total counts, and (if exclude_dominant_frac < 1) list of genes used to calculate total counts 183 | ''' 184 | 185 | E = E.tocsc() 186 | ncell = E.shape[0] 187 | if len(included) == 0: 188 | if exclude_dominant_frac == 1: 189 | tots_use = E.sum(axis=1) 190 | else: 191 | tots = E.sum(axis=1) 192 | wtmp = scipy.sparse.lil_matrix((ncell, ncell)) 193 | wtmp.setdiag(1. / tots) 194 | included = np.asarray(~(((wtmp * E) > exclude_dominant_frac).sum(axis=0) > 0))[0,:] 195 | tots_use = E[:,included].sum(axis = 1) 196 | print('Excluded %i genes from normalization' %(np.sum(~included))) 197 | else: 198 | tots_use = E[:,included].sum(axis = 1) 199 | 200 | if target_mean == 0: 201 | target_mean = np.mean(tots_use) 202 | 203 | w = scipy.sparse.lil_matrix((ncell, ncell)) 204 | w.setdiag(float(target_mean) / tots_use) 205 | Enorm = w * E 206 | 207 | return Enorm.tocsc(), target_mean, included 208 | 209 | ########## DIMENSIONALITY REDUCTION 210 | 211 | 212 | def get_pca(E, base_ix=[], numpc=50, keep_sparse=False, normalize=True): 213 | ''' 214 | Run PCA on the counts matrix E, gene-level normalizing if desired 215 | Return PCA coordinates 216 | ''' 217 | # If keep_sparse is True, gene-level normalization maintains sparsity 218 | # (no centering) and TruncatedSVD is used instead of normal PCA. 
219 | 220 | if len(base_ix) == 0: 221 | base_ix = np.arange(E.shape[0]) 222 | 223 | if keep_sparse: 224 | if normalize: 225 | zstd = np.sqrt(sparse_var(E[base_ix,:])) 226 | Z = sparse_multiply(E.T, 1 / zstd).T 227 | else: 228 | Z = E 229 | pca = TruncatedSVD(n_components=numpc) 230 | 231 | else: 232 | if normalize: 233 | zmean = E[base_ix,:].mean(0) 234 | zstd = np.sqrt(sparse_var(E[base_ix,:])) 235 | Z = sparse_multiply((E - zmean).T, 1/zstd).T 236 | else: 237 | Z = E 238 | pca = PCA(n_components=numpc) 239 | 240 | pca.fit(Z[base_ix,:]) 241 | return pca.transform(Z) 242 | 243 | 244 | def preprocess_and_pca(E, total_counts_normalize=True, norm_exclude_abundant_gene_frac=1, min_counts=3, min_cells=5, min_vscore_pctl=85, gene_filter=None, num_pc=50, sparse_pca=False, zscore_normalize=True, show_vscore_plot=False): 245 | ''' 246 | Total counts normalize, filter genes, run PCA 247 | Return PCA coordinates and filtered gene indices 248 | ''' 249 | 250 | if total_counts_normalize: 251 | print('Total count normalizing') 252 | E = tot_counts_norm(E, exclude_dominant_frac = norm_exclude_abundant_gene_frac)[0] 253 | 254 | if gene_filter is None: 255 | print('Finding highly variable genes') 256 | gene_filter = filter_genes(E, min_vscore_pctl=min_vscore_pctl, min_counts=min_counts, min_cells=min_cells, show_vscore_plot=show_vscore_plot) 257 | 258 | print('Using %i genes for PCA' %len(gene_filter)) 259 | PCdat = get_pca(E[:,gene_filter], numpc=num_pc, keep_sparse=sparse_pca, normalize=zscore_normalize) 260 | 261 | return PCdat, gene_filter 262 | 263 | ########## GRAPH CONSTRUCTION 264 | 265 | def get_knn_graph(X, k=5, dist_metric='euclidean', approx=False, return_edges=True): 266 | ''' 267 | Build k-nearest-neighbor graph 268 | Return edge list and nearest neighbor matrix 269 | ''' 270 | 271 | t0 = time.time() 272 | if approx: 273 | try: 274 | from annoy import AnnoyIndex 275 | except: 276 | approx = False 277 | print('Could not find library "annoy" for approx. 
nearest neighbor search') 278 | if approx: 279 | #print('Using approximate nearest neighbor search') 280 | 281 | if dist_metric == 'cosine': 282 | dist_metric = 'angular' 283 | npc = X.shape[1] 284 | ncell = X.shape[0] 285 | annoy_index = AnnoyIndex(npc, metric=dist_metric) 286 | 287 | for i in range(ncell): 288 | annoy_index.add_item(i, list(X[i,:])) 289 | annoy_index.build(10) # 10 trees 290 | 291 | knn = [] 292 | for iCell in range(ncell): 293 | knn.append(annoy_index.get_nns_by_item(iCell, k + 1)[1:]) 294 | knn = np.array(knn, dtype=int) 295 | 296 | else: 297 | #print('Using sklearn NearestNeighbors') 298 | 299 | if dist_metric == 'cosine': 300 | nbrs = NearestNeighbors(n_neighbors=k, metric=dist_metric, algorithm='brute').fit(X) 301 | else: 302 | nbrs = NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(X) 303 | knn = nbrs.kneighbors(return_distance=False) 304 | 305 | if return_edges: 306 | links = set([]) 307 | for i in range(knn.shape[0]): 308 | for j in knn[i,:]: 309 | links.add(tuple(sorted((i,j)))) 310 | 311 | t_elapse = time.time() - t0 312 | #print('kNN graph built in %.3f sec' %(t_elapse)) 313 | 314 | return links, knn 315 | return knn 316 | 317 | def build_adj_mat(edges, n_nodes): 318 | A = scipy.sparse.lil_matrix((n_nodes, n_nodes)) 319 | for e in edges: 320 | i, j = e 321 | A[i,j] = 1 322 | A[j,i] = 1 323 | return A.tocsc() 324 | 325 | ########## FORCE LAYOUT 326 | 327 | def get_force_layout(links, n_cells, n_iter=100, edgeWeightInfluence=1, barnesHutTheta=2, scalingRatio=1, gravity=0.05, jitterTolerance=1, verbose=False): 328 | from fa2 import ForceAtlas2 329 | import networkx as nx 330 | 331 | G = nx.Graph() 332 | G.add_nodes_from(range(n_cells)) 333 | G.add_edges_from(list(links)) 334 | 335 | forceatlas2 = ForceAtlas2( 336 | # Behavior alternatives 337 | outboundAttractionDistribution=False, # Dissuade hubs 338 | linLogMode=False, # NOT IMPLEMENTED 339 | adjustSizes=False, # Prevent overlap (NOT IMPLEMENTED) 340 | 
edgeWeightInfluence=edgeWeightInfluence, 341 | 342 | # Performance 343 | jitterTolerance=jitterTolerance, # Tolerance 344 | barnesHutOptimize=True, 345 | barnesHutTheta=barnesHutTheta, 346 | multiThreaded=False, # NOT IMPLEMENTED 347 | 348 | # Tuning 349 | scalingRatio=scalingRatio, 350 | strongGravityMode=False, 351 | gravity=gravity, 352 | # Log 353 | verbose=verbose) 354 | 355 | positions = forceatlas2.forceatlas2_networkx_layout(G, pos=None, iterations=n_iter) 356 | positions = np.array([positions[i] for i in sorted(positions.keys())]) 357 | return positions 358 | 359 | ########## CLUSTERING 360 | 361 | def get_spectral_clusters(A, k): 362 | from sklearn.cluster import SpectralClustering 363 | spec = SpectralClustering(n_clusters=k, random_state = 0, affinity = 'precomputed', assign_labels = 'discretize') 364 | return spec.fit_predict(A) 365 | 366 | 367 | def get_louvain_clusters(nodes, edges): 368 | import networkx as nx 369 | import community 370 | 371 | G = nx.Graph() 372 | G.add_nodes_from(nodes) 373 | G.add_edges_from(edges) 374 | 375 | return np.array(list(community.best_partition(G).values())) 376 | 377 | ########## GENE ENRICHMENT 378 | 379 | def rank_enriched_genes(E, gene_list, cell_mask, min_counts=3, min_cells=3): 380 | gix = (E[cell_mask,:]>=min_counts).sum(0).A.squeeze() >= min_cells 381 | print('%i cells in group' %(sum(cell_mask))) 382 | print('Considering %i genes' %(sum(gix))) 383 | 384 | gene_list = gene_list[gix] 385 | 386 | z = sparse_zscore(E[:,gix]) 387 | scores = z[cell_mask,:].mean(0).A.squeeze() 388 | o = np.argsort(-scores) 389 | 390 | return gene_list[o], scores[o] 391 | 392 | 393 | ########## PLOTTING STUFF 394 | 395 | def darken_cmap(cmap, scale_factor): 396 | cdat = np.zeros((cmap.N, 4)) 397 | for ii in range(cdat.shape[0]): 398 | curcol = cmap(ii) 399 | cdat[ii,0] = curcol[0] * scale_factor 400 | cdat[ii,1] = curcol[1] * scale_factor 401 | cdat[ii,2] = curcol[2] * scale_factor 402 | cdat[ii,3] = 1 403 | cmap = cmap.from_list(cmap.N, 
cdat) 404 | return cmap 405 | 406 | def custom_cmap(rgb_list): 407 | import matplotlib.pyplot as plt 408 | rgb_list = np.array(rgb_list) 409 | cmap = plt.cm.Reds 410 | cmap = cmap.from_list(rgb_list.shape[0],rgb_list) 411 | return cmap 412 | 413 | def plot_groups(x, y, groups, lim_buffer = 50, saving = False, fig_dir = './', fig_name = 'fig', res = 300, close_after = False, title_size = 12, point_size = 3, ncol = 5): 414 | import matplotlib.pyplot as plt 415 | import os 416 | ncol = int(ncol) 417 | ngroup = len(np.unique(groups)) 418 | nrow = int(np.ceil(ngroup / float(ncol))) 419 | fig = plt.figure(figsize = (14, 3 * nrow)) 420 | for ii, c in enumerate(np.unique(groups)): 421 | ax = plt.subplot(nrow, ncol, ii+1) 422 | ix = groups == c 423 | 424 | ax.scatter(x[~ix], y[~ix], s = point_size, c = [.8,.8,.8], edgecolors = 'none') 425 | ax.scatter(x[ix], y[ix], s = point_size, c = [0,0,0], edgecolors = 'none') 426 | ax.set_xticks([]) 427 | ax.set_yticks([]) 428 | ax.set_xlim([min(x) - lim_buffer, max(x) + lim_buffer]) 429 | ax.set_ylim([min(y) - lim_buffer, max(y) + lim_buffer]) 430 | 431 | ax.set_title(str(c), fontsize = title_size) 432 | 433 | fig.tight_layout() 434 | 435 | if saving: 436 | if not os.path.exists(fig_dir): 437 | os.makedirs(fig_dir) 438 | plt.savefig(fig_dir + '/' + fig_name + '.png', dpi=res) 439 | 440 | if close_after: 441 | plt.close() 442 | -------------------------------------------------------------------------------- /old_versions/v0.1/src/scrublet/scrublet.py: -------------------------------------------------------------------------------- 1 | from .helper_functions import * 2 | from sklearn.decomposition import PCA, TruncatedSVD 3 | 4 | def compute_doublet_scores(E, n_neighbors=50, sim_doublet_ratio=3, expected_doublet_rate = 0.1, use_approx_neighbors=True, total_counts_normalize=True, min_counts=3, min_cells=3, vscore_percentile=85, gene_filter=None, scaling_method = 'zscore', n_prin_comps=30, get_doublet_neighbor_parents = False): 5 | ''' Predict cell 
doublets 6 | 7 | Given a counts matrix `E`, calculates a doublet score between 0 and 1 by 8 | simulating doublets, performing PCA, building a k-nearest-neighbor graph, 9 | and finding the fraction of each observed transcriptome's neighbors that are 10 | simulated doublets. 11 | 12 | Required inputs: 13 | - E: 2-D matrix with shape (n_cells, n_genes) 14 | scipy.sparse matrix or numpy array containing raw (unnormalized) 15 | transcript counts. E[i,j] is the number of copies of transcript j 16 | detected in cell i. 17 | 18 | Optional parameters: 19 | - n_neighbors: int (default = 50) 20 | Number of neighbors used to construct the kNN graph of observed 21 | transcriptomes and simulated doublets. 22 | - sim_doublet_ratio: float (default = 3) 23 | Number of doublets to simulate relative to the number of observed 24 | transcriptomes. 25 | - expected_doublet_rate: float (default = 0.1) 26 | The estimated doublet rate for the experiment. 27 | - use_approx_neighbors: boolean (default = True) 28 | If true, use approximate nearest neighbors algorithm to construct the 29 | kNN graph. 30 | - total_counts_normalize: boolean (default = True) 31 | If true, apply total counts normalization prior to PCA. 32 | - gene_filter: list (default = None) 33 | For gene filtering prior to PCA. List of gene indices (columns of `E`) 34 | to use in PCA. 35 | - min_counts and min_cells: float (default = 3 for both) 36 | For gene filtering prior to PCA. Keep genes with at least `min_counts` 37 | counts in at least `min_cells` cells. Ignored if `gene_filter` is not 38 | `None`. 39 | - vscore_percentile: float (default = 85) 40 | For gene filtering prior to PCA. Keep only highly variable genes, i.e., 41 | those in the `vscore_percentile`-th percentile or above. Ignored if 42 | `gene_filter` is not `None`. 43 | - scaling_method: str (default = "zscore") 44 | Method for gene-level scaling of transcript counts prior to PCA. Options 45 | are "zscore", "log", and "none". 
46 | - n_prin_comps: int (default = 30) 47 | Number of principal components to use for embedding cells prior to 48 | building kNN graph. 49 | - get_doublet_neighbor_parents: boolean (default = False) 50 | If true, returns the list of parent cells used to generate each observed 51 | transcriptome's doublet neighbors. 52 | 53 | Returns: a dictionary with the following items: 54 | - "doublet_scores_observed_cells": 1-D `array` 55 | List of doublet scores for each observed transcriptome 56 | - "doublet_scores_simulated_doublets": 1-D `array` 57 | List of doublet scores for each synthetic doublet 58 | - "pca_observed_cells": 2-D `array` 59 | Principal component embedding of the observed transcriptomes 60 | - "pca_simulated_cells": 2-D `array` 61 | Principal component embedding of the simulated doublets 62 | - "gene_filter": 63 | List of gene indices used for PCA. 64 | - "doublet_neighbor_parents": list of numpy arrays 65 | `None` if `get_doublet_neighbor_parents` is `False`. Otherwise, entry i 66 | is the list (1-D numpy array) of parent cells that generated the doublet 67 | neighbors of cell i. Cells with no doublet neighbors have an empty list. 
68 | ''' 69 | 70 | # Initialize output: dictionary to store results 71 | # and useful intermediate variables 72 | output = {} 73 | 74 | # Check that input is valid 75 | valid_scaling_methods = ['zscore', 'log', 'none'] 76 | if not scaling_method in valid_scaling_methods: 77 | print('Select one of the following scaling methods:', valid_scaling_methods) 78 | return 79 | 80 | # Convert counts matrix to sparse format if necessary 81 | if not scipy.sparse.issparse(E): 82 | print('Converting to scipy.sparse.csc_matrix') 83 | E = scipy.sparse.csc_matrix(E) 84 | elif not scipy.sparse.isspmatrix_csc(E): 85 | print('Converting to scipy.sparse.csc_matrix') 86 | E = E.tocsc() 87 | 88 | # Simulate doublets 89 | print('Simulating doublets') 90 | E_doub, parent_ix = simulate_doublets_from_counts(E, sim_doublet_ratio) 91 | 92 | # Total counts normalize observed cells and simulated doublets 93 | if total_counts_normalize: 94 | print('Total counts normalizing') 95 | E = tot_counts_norm(E)[0] 96 | E_doub = tot_counts_norm(E_doub, target_mean=1e5)[0] 97 | 98 | # Filter genes (highly variable, expressed above minimum level) 99 | if gene_filter is None: 100 | print('Finding highly variable genes') 101 | gene_filter = filter_genes(E, min_vscore_pctl=vscore_percentile, min_counts=min_counts, min_cells=min_cells) 102 | else: 103 | gene_filter = np.array(gene_filter) 104 | 105 | # Total counts normalize observed cells to the same total as doublets 106 | if total_counts_normalize: 107 | E = tot_counts_norm(E, target_mean=1e5)[0] 108 | 109 | # Apply gene filter 110 | print('Filtering genes from {} to {}'.format(E.shape[1], len(gene_filter))) 111 | E = E[:, gene_filter] 112 | E_doub = E_doub[:, gene_filter] 113 | output['gene_filter'] = gene_filter 114 | 115 | # Rescale counts 116 | if scaling_method == 'log': 117 | print('Applying log normalization') 118 | E.data = np.log10(1 + E.data) 119 | E_doub.data = np.log10(1 + E_doub.data) 120 | # to do: option of TruncatedSVD to preserve sparsity 121 
| pca = PCA(n_components = n_prin_comps) 122 | E_doub = E_doub.toarray() 123 | E = E.toarray() 124 | elif scaling_method == 'zscore': 125 | print('Applying z-score normalization') 126 | gene_means = E.mean(0) 127 | gene_stdevs = np.sqrt(sparse_var(E)) 128 | E = sparse_zscore(E, gene_means, gene_stdevs) 129 | E_doub = sparse_zscore(E_doub, gene_means, gene_stdevs) 130 | pca = PCA(n_components = n_prin_comps) 131 | else: 132 | # to do: option of TruncatedSVD to preserve sparsity 133 | pca = PCA(n_components = n_prin_comps) 134 | E_doub = E_doub.toarray() 135 | E = E.toarray() 136 | 137 | # Fit PCA to observed cells, then apply same transformation to 138 | # simulated doublets. 139 | print('Running PCA') 140 | pca.fit(E) 141 | E_pca = pca.transform(E) 142 | E_doub_pca = pca.transform(E_doub) 143 | doub_labels = np.concatenate((np.zeros(E_pca.shape[0], dtype=int), 144 | np.ones(E_doub_pca.shape[0], dtype=int))) 145 | 146 | output['pca_observed_cells'] = E_pca 147 | output['pca_simulated_cells'] = E_doub_pca 148 | 149 | # Calculate doublet scores using k-nearest-neighbor classifier 150 | print('Building kNN graph and calculating doublet scores') 151 | nn_outs = nearest_neighbor_classifier(np.vstack((E_pca, E_doub_pca)), 152 | doub_labels, 153 | k=n_neighbors, 154 | use_approx_nn=use_approx_neighbors, 155 | exp_doub_rate = expected_doublet_rate, 156 | get_neighbor_parents = get_doublet_neighbor_parents, 157 | parent_cells = parent_ix 158 | ) 159 | output['doublet_scores_observed_cells'] = nn_outs[0] 160 | output['doublet_scores_simulated_doublets'] = nn_outs[1] 161 | output['doublet_neighbor_parents'] = nn_outs[2] 162 | return output 163 | 164 | #========================================================================================# 165 | 166 | def plot_scrublet_results(coords, doublet_scores_obs, doublet_scores_sim, score_threshold, marker_size=5, order_points=False, scale_hist_obs='log', scale_hist_sim='linear', fig_size = (8,6)): 167 | 168 | import matplotlib.pyplot 
as plt 169 | from matplotlib.lines import Line2D 170 | 171 | called_doubs = doublet_scores_obs > score_threshold 172 | called_doubs_sim = doublet_scores_sim > score_threshold 173 | predictable_doub_frac = sum(called_doubs_sim) / float(len(called_doubs_sim)) 174 | called_frac = sum(called_doubs) / float(len(called_doubs)) 175 | 176 | print('{}/{} = {:.1f}% of cells are predicted doublets.'.format(sum(called_doubs), len(called_doubs), 177 | 100 * called_frac)) 178 | print('{:.1f}% of doublets are predicted to be detectable.'.format(100 * predictable_doub_frac)) 179 | print('Predicted overall doublet rate = {:.1f}%'.format(100 * called_frac / predictable_doub_frac)) 180 | 181 | fig, axs = plt.subplots(2, 2, figsize = fig_size) 182 | 183 | ax = axs[0,0] 184 | ax.hist(doublet_scores_obs, np.linspace(0, 1, 50), color = 'gray', linewidth = 0, density=True) 185 | ax.set_yscale(scale_hist_obs) 186 | yl = ax.get_ylim() 187 | ax.set_ylim(yl) 188 | ax.plot([score_threshold, score_threshold], yl, c = 'black', linewidth = 1) 189 | ax.set_title('Observed cells') 190 | ax.set_xlabel('Doublet score') 191 | ax.set_ylabel('Prob. density') 192 | 193 | ax = axs[0,1] 194 | ax.hist(doublet_scores_sim, np.linspace(0, 1, 50), color = 'gray', linewidth = 0, density=True) 195 | ax.set_yscale(scale_hist_sim) 196 | yl = ax.get_ylim() 197 | ax.set_ylim(yl) 198 | ax.plot([score_threshold, score_threshold], yl, c = 'black', linewidth = 1) 199 | ax.set_title('Simulated doublets') 200 | ax.set_xlabel('Doublet score') 201 | ax.set_ylabel('Prob. 
density') 202 | 203 | x = coords[:,0] 204 | y = coords[:,1] 205 | xl = (x.min() - np.ptp(x) * .05, x.max() + np.ptp(x) * 0.05) 206 | yl = (y.min() - np.ptp(y) * .05, y.max() + np.ptp(y) * 0.05) 207 | 208 | if order_points: 209 | o = np.argsort(doublet_scores_obs) 210 | else: 211 | o = np.arange(len(doublet_scores_obs)) 212 | 213 | ax = axs[1,0] 214 | pp = ax.scatter(x[o], y[o], s=marker_size, edgecolors='none', c = doublet_scores_obs[o], cmap=darken_cmap(plt.cm.Reds, 0.9)) 215 | ax.set_xlim(xl) 216 | ax.set_ylim(yl) 217 | ax.set_xticks([]) 218 | ax.set_yticks([]) 219 | ax.set_title('Doublet score') 220 | 221 | ax = axs[1,1] 222 | ax.scatter(x[o], y[o], s=marker_size, edgecolors='none', c=doublet_scores_obs[o] > score_threshold, cmap = custom_cmap([[.7,.7,.7], [0,0,0]])) 223 | ax.set_xlim(xl) 224 | ax.set_ylim(yl) 225 | ax.set_xticks([]) 226 | ax.set_yticks([]) 227 | ax.set_title('Predicted doublets') 228 | singlet_marker = Line2D([], [], color=[.7,.7,.7], marker='o', markersize=5, label='Singlet', linewidth=0) 229 | doublet_marker = Line2D([], [], color=[.0,.0,.0], marker='o', markersize=5, label='Doublet', linewidth=0) 230 | ax.legend(handles = [singlet_marker, doublet_marker]) 231 | 232 | fig.tight_layout() 233 | 234 | return fig, axs 235 | 236 | #========================================================================================# 237 | 238 | def simulate_doublets_from_counts(E, sim_doublet_ratio=1): 239 | ''' 240 | Simulate doublets by summing the counts of random cell pairs. 241 | 242 | Inputs: 243 | E (numpy or scipy matrix of size (num_cells, num_genes)): counts matrix, ideally without total-counts normalization. 244 | sim_doublet_ratio (float): number of doublets to simulate, as a fraction of the number of cells in E. 245 | A total of num_sim_doubs = int(sim_doublet_ratio * E.shape[0]) doublets will be simulated. 
246 | 247 | Returns: 248 | - Edoub (scipy sparse CSC matrix of size (num_sim_doubs, num_genes)): counts matrix of the simulated doublets only; 249 | the observed cells in E are not included, and no label array is returned. 250 | - pair_ix (matrix of size (num_sim_doubs, 2)): each row gives the indices of the parent cells from E used to generate the simulated doublet 251 | ''' 252 | 253 | if not scipy.sparse.issparse(E): 254 | E = scipy.sparse.csc_matrix(E) 255 | elif not scipy.sparse.isspmatrix_csc(E): 256 | E = E.tocsc() 257 | 258 | n_obs = E.shape[0] 259 | n_doub = int(n_obs * sim_doublet_ratio) 260 | pair_ix = np.random.randint(0, n_obs, size=(n_doub, 2)) 261 | Edoub = E[pair_ix[:, 0],:] + E[pair_ix[:, 1],:] 262 | 263 | return Edoub, pair_ix 264 | 265 | #========================================================================================# 266 | 267 | def simulate_doublets_from_pca(PCdat, total_counts=None, sim_doublet_ratio=1): 268 | ''' 269 | Simulate doublets by averaging PCA coordinates of random cell pairs. 270 | Average is weighted by total counts of each parent cell, if provided. 271 | 272 | Returns: 273 | PCdoub (matrix of size (num_cells+num_sim_doubs, num_pcs)): PCA matrix with the simulated doublet PCA coordinates appended to the original data matrix PCdat. 
274 | doub_labels (array of size (num_cells+num_sim_doubs)): 0 if observed cell, 1 if simulated doublet 275 | pair_ix (matrix of size(num_sim_doubs, 2)): each row gives the indices of the parent cells used to generate the simulated doublet 276 | ''' 277 | 278 | n_obs = PCdat.shape[0] 279 | n_doub = int(n_obs * sim_doublet_ratio) 280 | 281 | if total_counts is None: 282 | total_counts = np.ones(n_obs) 283 | 284 | pair_ix = np.random.randint(0, n_obs, size=(n_doub, 2)) 285 | 286 | pair_tots = np.hstack((total_counts[pair_ix[:, 0]][:,None], total_counts[pair_ix[:, 1]][:,None])) 287 | pair_tots = np.array(pair_tots, dtype=float) 288 | pair_fracs = pair_tots / np.sum(pair_tots, axis=1)[:,None] 289 | 290 | PCdoub = PCdat[pair_ix[:, 0],:] * pair_fracs[:, 0][:,None] + PCdat[pair_ix[:, 1],:] * pair_fracs[:, 1][:,None] 291 | 292 | PCdoub = np.vstack((PCdat, PCdoub)) 293 | doub_labels = np.concatenate((np.zeros(n_obs, dtype=int), np.ones(n_doub, dtype=int))) 294 | 295 | return PCdoub, doub_labels, pair_ix 296 | 297 | #========================================================================================# 298 | 299 | def nearest_neighbor_classifier(embedding, doub_labels, k=50, use_approx_nn=True, exp_doub_rate = 1.0, get_neighbor_parents = False, parent_cells = None): 300 | n_obs = sum(doub_labels == 0) 301 | n_sim = sum(doub_labels == 1) 302 | 303 | # Adjust k (number of nearest neighbors) based on the ratio of simulated to observed cells 304 | k_adj = int(round(k * (1+n_sim/float(n_obs)))) 305 | 306 | # Find k_adj nearest neighbors 307 | neighbors = get_knn_graph(embedding, k=k_adj, dist_metric='euclidean', approx=use_approx_nn, return_edges = False) 308 | 309 | # Calculate doublet score based on ratio of simulated cell neighbors vs. 
observed cell neighbors 310 | doub_neigh_mask = doub_labels[neighbors] == 1 311 | n_sim_neigh = doub_neigh_mask.sum(1) 312 | n_obs_neigh = doub_neigh_mask.shape[1] - n_sim_neigh 313 | doub_score = n_sim_neigh / (n_sim_neigh + n_obs_neigh * n_sim / float(n_obs) / exp_doub_rate) 314 | 315 | # get parents of doublet neighbors, if requested 316 | neighbor_parents = None 317 | if get_neighbor_parents and parent_cells is not None: 318 | neighbors = neighbors - n_obs 319 | neighbor_parents = [] 320 | for iCell in range(n_obs): 321 | this_doub_neigh = neighbors[iCell,:][neighbors[iCell,:] > -1] 322 | if len(this_doub_neigh) > 0: 323 | this_doub_neigh_parents = np.unique(parent_cells[this_doub_neigh,:].flatten()) 324 | neighbor_parents.append(this_doub_neigh_parents) 325 | else: 326 | neighbor_parents.append([]) 327 | neighbor_parents = np.array(neighbor_parents) 328 | 329 | 330 | return doub_score[doub_labels == 0], doub_score[doub_labels == 1], neighbor_parents 331 | 332 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | scipy 3 | scikit-learn 4 | scikit-image 5 | matplotlib 6 | numba 7 | cython 8 | pandas 9 | annoy 10 | umap-learn 11 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md", "r") as fh: 4 | long_description = fh.read() 5 | 6 | setuptools.setup( 7 | name = "scrublet", 8 | packages = ['scrublet'], 9 | package_dir={'': 'src'}, 10 | version = '0.2.1', 11 | description = 'Doublet prediction in single-cell RNA-sequencing data', 12 | long_description=long_description, 13 | long_description_content_type="text/markdown", 14 | author = 'Samuel L. 
Wolock', 15 | author_email = 'swolock@g.harvard.edu', 16 | url = 'https://github.com/allonkleinlab/scrublet', 17 | install_requires=['cython', 'numpy', 'scipy', 'scikit-learn', 'scikit-image', 'matplotlib', 'annoy', 'numba', 'pandas', 'umap-learn'], 18 | ) 19 | -------------------------------------------------------------------------------- /src/scrublet/__init__.py: -------------------------------------------------------------------------------- 1 | from .scrublet import Scrublet 2 | from .helper_functions import * 3 | -------------------------------------------------------------------------------- /src/scrublet/helper_functions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy 3 | import scipy.stats 4 | import scipy.sparse 5 | from sklearn.decomposition import PCA,TruncatedSVD 6 | from sklearn.neighbors import NearestNeighbors 7 | import time 8 | 9 | ########## PREPROCESSING PIPELINE 10 | 11 | def print_optional(string, verbose=True): 12 | if verbose: 13 | print(string) 14 | return 15 | 16 | def pipeline_normalize(self, postnorm_total=None): 17 | ''' Total counts normalization ''' 18 | if postnorm_total is None: 19 | postnorm_total = self._total_counts_obs.mean() 20 | 21 | self._E_obs_norm = tot_counts_norm(self._E_obs, target_total=postnorm_total, total_counts=self._total_counts_obs) 22 | 23 | if self._E_sim is not None: 24 | self._E_sim_norm = tot_counts_norm(self._E_sim, target_total=postnorm_total, total_counts=self._total_counts_sim) 25 | return 26 | 27 | def pipeline_get_gene_filter(self, min_counts=3, min_cells=3, min_gene_variability_pctl=85): 28 | ''' Identify highly variable genes expressed above a minimum level ''' 29 | self._gene_filter = filter_genes(self._E_obs_norm, 30 | min_counts=min_counts, 31 | min_cells=min_cells, 32 | min_vscore_pctl=min_gene_variability_pctl) 33 | return 34 | 35 | def pipeline_apply_gene_filter(self): 36 | if self._E_obs is not None: 37 | self._E_obs 
= self._E_obs[:,self._gene_filter] 38 | if self._E_obs_norm is not None: 39 | self._E_obs_norm = self._E_obs_norm[:,self._gene_filter] 40 | if self._E_sim is not None: 41 | self._E_sim = self._E_sim[:,self._gene_filter] 42 | if self._E_sim_norm is not None: 43 | self._E_sim_norm = self._E_sim_norm[:,self._gene_filter] 44 | return 45 | 46 | def pipeline_mean_center(self): 47 | gene_means = self._E_obs_norm.mean(0) 48 | self._E_obs_norm = self._E_obs_norm - gene_means 49 | if self._E_sim_norm is not None: 50 | self._E_sim_norm = self._E_sim_norm - gene_means 51 | return 52 | 53 | def pipeline_normalize_variance(self): 54 | gene_stdevs = np.sqrt(sparse_var(self._E_obs_norm)) 55 | self._E_obs_norm = sparse_multiply(self._E_obs_norm.T, 1/gene_stdevs).T 56 | if self._E_sim_norm is not None: 57 | self._E_sim_norm = sparse_multiply(self._E_sim_norm.T, 1/gene_stdevs).T 58 | return 59 | 60 | def pipeline_zscore(self): 61 | gene_means = self._E_obs_norm.mean(0) 62 | gene_stdevs = np.sqrt(sparse_var(self._E_obs_norm)) 63 | self._E_obs_norm = np.array(sparse_zscore(self._E_obs_norm, gene_means, gene_stdevs)) 64 | if self._E_sim_norm is not None: 65 | self._E_sim_norm = np.array(sparse_zscore(self._E_sim_norm, gene_means, gene_stdevs)) 66 | return 67 | 68 | def pipeline_log_transform(self, pseudocount=1): 69 | self._E_obs_norm = log_normalize(self._E_obs_norm, pseudocount) 70 | if self._E_sim_norm is not None: 71 | self._E_sim_norm = log_normalize(self._E_sim_norm, pseudocount) 72 | return 73 | 74 | def pipeline_truncated_svd(self, n_prin_comps=30, random_state=0): 75 | svd = TruncatedSVD(n_components=n_prin_comps, random_state=random_state).fit(self._E_obs_norm) 76 | self.set_manifold(svd.transform(self._E_obs_norm), svd.transform(self._E_sim_norm)) 77 | return 78 | 79 | def pipeline_pca(self, n_prin_comps=50, random_state=0): 80 | if scipy.sparse.issparse(self._E_obs_norm): 81 | X_obs = self._E_obs_norm.toarray() 82 | else: 83 | X_obs = self._E_obs_norm 84 | if 
scipy.sparse.issparse(self._E_sim_norm): 85 | X_sim = self._E_sim_norm.toarray() 86 | else: 87 | X_sim = self._E_sim_norm 88 | 89 | pca = PCA(n_components=n_prin_comps, random_state=random_state).fit(X_obs) 90 | self.set_manifold(pca.transform(X_obs), pca.transform(X_sim)) 91 | return 92 | 93 | def matrix_multiply(X, Y): 94 | if not type(X) == np.ndarray: 95 | if scipy.sparse.issparse(X): 96 | X = X.toarray() 97 | else: 98 | X = np.array(X) 99 | if not type(Y) == np.ndarray: 100 | if scipy.sparse.issparse(Y): 101 | Y = Y.toarray() 102 | else: 103 | Y = np.array(Y) 104 | return np.dot(X,Y) 105 | 106 | def log_normalize(X,pseudocount=1): 107 | X.data = np.log10(X.data + pseudocount) 108 | return X 109 | 110 | ########## LOADING DATA 111 | def load_genes(filename, delimiter='\t', column=0, skip_rows=0): 112 | gene_list = [] 113 | gene_dict = {} 114 | 115 | with open(filename) as f: 116 | for iL in range(skip_rows): 117 | f.readline() 118 | for l in f: 119 | gene = l.strip('\n').split(delimiter)[column] 120 | if gene in gene_dict: 121 | gene_dict[gene] += 1 122 | gene_list.append(gene + '__' + str(gene_dict[gene])) 123 | if gene_dict[gene] == 2: 124 | i = gene_list.index(gene) 125 | gene_list[i] = gene + '__1' 126 | else: 127 | gene_dict[gene] = 1 128 | gene_list.append(gene) 129 | return gene_list 130 | 131 | 132 | def make_genes_unique(orig_gene_list): 133 | gene_list = [] 134 | gene_dict = {} 135 | 136 | for gene in orig_gene_list: 137 | if gene in gene_dict: 138 | gene_dict[gene] += 1 139 | gene_list.append(gene + '__' + str(gene_dict[gene])) 140 | if gene_dict[gene] == 2: 141 | i = gene_list.index(gene) 142 | gene_list[i] = gene + '__1' 143 | else: 144 | gene_dict[gene] = 1 145 | gene_list.append(gene) 146 | return gene_list 147 | 148 | ########## USEFUL SPARSE FUNCTIONS 149 | 150 | def sparse_var(E, axis=0): 151 | ''' variance across the specified axis ''' 152 | 153 | mean_gene = E.mean(axis=axis).A.squeeze() 154 | tmp = E.copy() 155 | tmp.data **= 2 156 | return 
tmp.mean(axis=axis).A.squeeze() - mean_gene ** 2 157 | 158 | def sparse_multiply(E, a): 159 | ''' multiply each row of E by a scalar ''' 160 | 161 | nrow = E.shape[0] 162 | w = scipy.sparse.lil_matrix((nrow, nrow)) 163 | w.setdiag(a) 164 | return w * E 165 | 166 | def sparse_zscore(E, gene_mean=None, gene_stdev=None): 167 | ''' z-score normalize each column of E ''' 168 | 169 | if gene_mean is None: 170 | gene_mean = E.mean(0) 171 | if gene_stdev is None: 172 | gene_stdev = np.sqrt(sparse_var(E)) 173 | return sparse_multiply((E - gene_mean).T, 1/gene_stdev).T 174 | 175 | def subsample_counts(E, rate, original_totals, random_seed=0): 176 | if rate < 1: 177 | np.random.seed(random_seed) 178 | E.data = np.random.binomial(np.round(E.data).astype(int), rate) 179 | current_totals = E.sum(1).A.squeeze() 180 | unsampled_orig_totals = original_totals - current_totals 181 | unsampled_downsamp_totals = np.random.binomial(np.round(unsampled_orig_totals).astype(int), rate) 182 | final_downsamp_totals = current_totals + unsampled_downsamp_totals 183 | else: 184 | final_downsamp_totals = original_totals 185 | return E, final_downsamp_totals 186 | 187 | 188 | ########## GENE FILTERING 189 | 190 | def runningquantile(x, y, p, nBins): 191 | 192 | ind = np.argsort(x) 193 | x = x[ind] 194 | y = y[ind] 195 | 196 | dx = (x[-1] - x[0]) / nBins 197 | xOut = np.linspace(x[0]+dx/2, x[-1]-dx/2, nBins) 198 | 199 | yOut = np.zeros(xOut.shape) 200 | 201 | for i in range(len(xOut)): 202 | ind = np.nonzero((x >= xOut[i]-dx/2) & (x < xOut[i]+dx/2))[0] 203 | if len(ind) > 0: 204 | yOut[i] = np.percentile(y[ind], p) 205 | else: 206 | if i > 0: 207 | yOut[i] = yOut[i-1] 208 | else: 209 | yOut[i] = np.nan 210 | 211 | return xOut, yOut 212 | 213 | 214 | def get_vscores(E, min_mean=0, nBins=50, fit_percentile=0.1, error_wt=1): 215 | ''' 216 | Calculate v-score (above-Poisson noise statistic) for genes in the input counts matrix 217 | Return v-scores and other stats 218 | ''' 219 | 220 | ncell = 
E.shape[0] 221 | 222 | mu_gene = E.mean(axis=0).A.squeeze() 223 | gene_ix = np.nonzero(mu_gene > min_mean)[0] 224 | mu_gene = mu_gene[gene_ix] 225 | 226 | tmp = E[:,gene_ix] 227 | tmp.data **= 2 228 | var_gene = tmp.mean(axis=0).A.squeeze() - mu_gene ** 2 229 | del tmp 230 | FF_gene = var_gene / mu_gene 231 | 232 | data_x = np.log(mu_gene) 233 | data_y = np.log(FF_gene / mu_gene) 234 | 235 | x, y = runningquantile(data_x, data_y, fit_percentile, nBins) 236 | x = x[~np.isnan(y)] 237 | y = y[~np.isnan(y)] 238 | 239 | gLog = lambda input: np.log(input[1] * np.exp(-input[0]) + input[2]) 240 | h,b = np.histogram(np.log(FF_gene[mu_gene>0]), bins=200) 241 | b = b[:-1] + np.diff(b)/2 242 | max_ix = np.argmax(h) 243 | c = np.max((np.exp(b[max_ix]), 1)) 244 | errFun = lambda b2: np.sum(abs(gLog([x,c,b2])-y) ** error_wt) 245 | b0 = 0.1 246 | b = scipy.optimize.fmin(func = errFun, x0=[b0], disp=False) 247 | a = c / (1+b) - 1 248 | 249 | 250 | v_scores = FF_gene / ((1+a)*(1+b) + b * mu_gene); 251 | CV_eff = np.sqrt((1+a)*(1+b) - 1); 252 | CV_input = np.sqrt(b); 253 | 254 | return v_scores, CV_eff, CV_input, gene_ix, mu_gene, FF_gene, a, b 255 | 256 | def filter_genes(E, base_ix = [], min_vscore_pctl = 85, min_counts = 3, min_cells = 3, show_vscore_plot = False, sample_name = ''): 257 | ''' 258 | Filter genes by expression level and variability 259 | Return list of filtered gene indices 260 | ''' 261 | 262 | if len(base_ix) == 0: 263 | base_ix = np.arange(E.shape[0]) 264 | 265 | Vscores, CV_eff, CV_input, gene_ix, mu_gene, FF_gene, a, b = get_vscores(E[base_ix, :]) 266 | ix2 = Vscores>0 267 | Vscores = Vscores[ix2] 268 | gene_ix = gene_ix[ix2] 269 | mu_gene = mu_gene[ix2] 270 | FF_gene = FF_gene[ix2] 271 | min_vscore = np.percentile(Vscores, min_vscore_pctl) 272 | ix = (((E[:,gene_ix] >= min_counts).sum(0).A.squeeze() >= min_cells) & (Vscores >= min_vscore)) 273 | 274 | if show_vscore_plot: 275 | import matplotlib.pyplot as plt 276 | x_min = 0.5*np.min(mu_gene) 277 | x_max = 
2*np.max(mu_gene) 278 | xTh = x_min * np.exp(np.log(x_max/x_min)*np.linspace(0,1,100)) 279 | yTh = (1 + a)*(1+b) + b * xTh 280 | plt.figure(figsize=(8, 6)); 281 | plt.scatter(np.log10(mu_gene), np.log10(FF_gene), c = [.8,.8,.8], alpha = 0.3, edgecolors=''); 282 | plt.scatter(np.log10(mu_gene)[ix], np.log10(FF_gene)[ix], c = [0,0,0], alpha = 0.3, edgecolors=''); 283 | plt.plot(np.log10(xTh),np.log10(yTh)); 284 | plt.title(sample_name) 285 | plt.xlabel('log10(mean)'); 286 | plt.ylabel('log10(Fano factor)'); 287 | plt.show() 288 | 289 | return gene_ix[ix] 290 | 291 | ########## CELL NORMALIZATION 292 | 293 | def tot_counts_norm(E, total_counts = None, exclude_dominant_frac = 1, included = [], target_total = None): 294 | ''' 295 | Cell-level total counts normalization of input counts matrix, excluding overly abundant genes if desired. 296 | Return the normalized counts matrix (scipy.sparse.csc_matrix); no other values are returned 297 | ''' 298 | 299 | E = E.tocsc() 300 | ncell = E.shape[0] 301 | if total_counts is None: 302 | if len(included) == 0: 303 | if exclude_dominant_frac == 1: 304 | tots_use = E.sum(axis=1) 305 | else: 306 | tots = E.sum(axis=1) 307 | wtmp = scipy.sparse.lil_matrix((ncell, ncell)) 308 | wtmp.setdiag(1.
/ tots) 309 | included = np.asarray(~(((wtmp * E) > exclude_dominant_frac).sum(axis=0) > 0))[0,:] 310 | tots_use = E[:,included].sum(axis = 1) 311 | print('Excluded %i genes from normalization' %(np.sum(~included))) 312 | else: 313 | tots_use = E[:,included].sum(axis = 1) 314 | else: 315 | tots_use = total_counts.copy() 316 | 317 | if target_total is None: 318 | target_total = np.mean(tots_use) 319 | 320 | w = scipy.sparse.lil_matrix((ncell, ncell)) 321 | w.setdiag(float(target_total) / tots_use) 322 | Enorm = w * E 323 | 324 | return Enorm.tocsc() 325 | 326 | ########## DIMENSIONALITY REDUCTION 327 | 328 | def get_pca(E, base_ix=[], numpc=50, keep_sparse=False, normalize=True, random_state=0): 329 | ''' 330 | Run PCA on the counts matrix E, gene-level normalizing if desired 331 | Return PCA coordinates 332 | ''' 333 | # If keep_sparse is True, gene-level normalization maintains sparsity 334 | # (no centering) and TruncatedSVD is used instead of normal PCA. 335 | 336 | if len(base_ix) == 0: 337 | base_ix = np.arange(E.shape[0]) 338 | 339 | if keep_sparse: 340 | if normalize: 341 | zstd = np.sqrt(sparse_var(E[base_ix,:])) 342 | Z = sparse_multiply(E.T, 1 / zstd).T 343 | else: 344 | Z = E 345 | pca = TruncatedSVD(n_components=numpc, random_state=random_state) 346 | 347 | else: 348 | if normalize: 349 | zmean = E[base_ix,:].mean(0) 350 | zstd = np.sqrt(sparse_var(E[base_ix,:])) 351 | Z = sparse_multiply((E - zmean).T, 1/zstd).T 352 | else: 353 | Z = E 354 | pca = PCA(n_components=numpc, random_state=random_state) 355 | 356 | pca.fit(Z[base_ix,:]) 357 | return pca.transform(Z) 358 | 359 | 360 | def preprocess_and_pca(E, total_counts_normalize=True, norm_exclude_abundant_gene_frac=1, min_counts=3, min_cells=5, min_vscore_pctl=85, gene_filter=None, num_pc=50, sparse_pca=False, zscore_normalize=True, show_vscore_plot=False): 361 | ''' 362 | Total counts normalize, filter genes, run PCA 363 | Return PCA coordinates and filtered gene indices 364 | ''' 365 | 366 | if 
total_counts_normalize: 367 | print('Total count normalizing') 368 | E = tot_counts_norm(E, exclude_dominant_frac = norm_exclude_abundant_gene_frac) 369 | 370 | if gene_filter is None: 371 | print('Finding highly variable genes') 372 | gene_filter = filter_genes(E, min_vscore_pctl=min_vscore_pctl, min_counts=min_counts, min_cells=min_cells, show_vscore_plot=show_vscore_plot) 373 | 374 | print('Using %i genes for PCA' %len(gene_filter)) 375 | PCdat = get_pca(E[:,gene_filter], numpc=num_pc, keep_sparse=sparse_pca, normalize=zscore_normalize) 376 | 377 | return PCdat, gene_filter 378 | 379 | ########## GRAPH CONSTRUCTION 380 | 381 | def get_knn_graph(X, k=5, dist_metric='euclidean', approx=False, return_edges=True, random_seed=0): 382 | ''' 383 | Build k-nearest-neighbor graph 384 | Return edge list and nearest neighbor matrix 385 | ''' 386 | 387 | t0 = time.time() 388 | if approx: 389 | try: 390 | from annoy import AnnoyIndex 391 | except ImportError: 392 | approx = False 393 | print('Could not find library "annoy" for approx.
nearest neighbor search') 394 | if approx: 395 | #print('Using approximate nearest neighbor search') 396 | 397 | if dist_metric == 'cosine': 398 | dist_metric = 'angular' 399 | npc = X.shape[1] 400 | ncell = X.shape[0] 401 | annoy_index = AnnoyIndex(npc, metric=dist_metric) 402 | annoy_index.set_seed(random_seed) 403 | 404 | for i in range(ncell): 405 | annoy_index.add_item(i, list(X[i,:])) 406 | annoy_index.build(10) # 10 trees 407 | 408 | knn = [] 409 | for iCell in range(ncell): 410 | knn.append(annoy_index.get_nns_by_item(iCell, k + 1)[1:]) 411 | knn = np.array(knn, dtype=int) 412 | 413 | else: 414 | #print('Using sklearn NearestNeighbors') 415 | 416 | if dist_metric == 'cosine': 417 | nbrs = NearestNeighbors(n_neighbors=k, metric=dist_metric, algorithm='brute').fit(X) 418 | else: 419 | nbrs = NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(X) 420 | knn = nbrs.kneighbors(return_distance=False) 421 | 422 | if return_edges: 423 | links = set([]) 424 | for i in range(knn.shape[0]): 425 | for j in knn[i,:]: 426 | links.add(tuple(sorted((i,j)))) 427 | 428 | t_elapse = time.time() - t0 429 | #print('kNN graph built in %.3f sec' %(t_elapse)) 430 | 431 | return links, knn 432 | return knn 433 | 434 | def build_adj_mat(edges, n_nodes): 435 | A = scipy.sparse.lil_matrix((n_nodes, n_nodes)) 436 | for e in edges: 437 | i, j = e 438 | A[i,j] = 1 439 | A[j,i] = 1 440 | return A.tocsc() 441 | 442 | ########## 2-D EMBEDDINGS 443 | 444 | def get_umap(X, n_neighbors=10, min_dist=0.1, metric='euclidean', random_state=0): 445 | import umap 446 | return umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, metric=metric, random_state=random_state).fit_transform(X) 447 | 448 | def get_tsne(X, angle=0.5, perplexity=30, random_state=0, verbose=False): 449 | from sklearn.manifold import TSNE 450 | return TSNE(angle=angle, perplexity=perplexity, random_state=random_state, verbose=verbose).fit_transform(X) 451 | 452 | def get_force_layout(X, n_neighbors=5, approx_neighbors=False, 
n_iter=300, verbose=False): 453 | edges = get_knn_graph(X, k=n_neighbors, approx=approx_neighbors, return_edges=True)[0] 454 | return run_force_layout(edges, X.shape[0], verbose=verbose) 455 | 456 | def run_force_layout(links, n_cells, n_iter=100, edgeWeightInfluence=1, barnesHutTheta=2, scalingRatio=1, gravity=0.05, jitterTolerance=1, verbose=False): 457 | from fa2 import ForceAtlas2 458 | import networkx as nx 459 | 460 | G = nx.Graph() 461 | G.add_nodes_from(range(n_cells)) 462 | G.add_edges_from(list(links)) 463 | 464 | forceatlas2 = ForceAtlas2( 465 | # Behavior alternatives 466 | outboundAttractionDistribution=False, # Dissuade hubs 467 | linLogMode=False, # NOT IMPLEMENTED 468 | adjustSizes=False, # Prevent overlap (NOT IMPLEMENTED) 469 | edgeWeightInfluence=edgeWeightInfluence, 470 | 471 | # Performance 472 | jitterTolerance=jitterTolerance, # Tolerance 473 | barnesHutOptimize=True, 474 | barnesHutTheta=barnesHutTheta, 475 | multiThreaded=False, # NOT IMPLEMENTED 476 | 477 | # Tuning 478 | scalingRatio=scalingRatio, 479 | strongGravityMode=False, 480 | gravity=gravity, 481 | # Log 482 | verbose=verbose) 483 | 484 | positions = forceatlas2.forceatlas2_networkx_layout(G, pos=None, iterations=n_iter) 485 | positions = np.array([positions[i] for i in sorted(positions.keys())]) 486 | return positions 487 | 488 | ########## CLUSTERING 489 | 490 | def get_spectral_clusters(A, k): 491 | from sklearn.cluster import SpectralClustering 492 | spec = SpectralClustering(n_clusters=k, random_state = 0, affinity = 'precomputed', assign_labels = 'discretize') 493 | return spec.fit_predict(A) 494 | 495 | 496 | def get_louvain_clusters(nodes, edges): 497 | import networkx as nx 498 | import community 499 | 500 | G = nx.Graph() 501 | G.add_nodes_from(nodes) 502 | G.add_edges_from(edges) 503 | 504 | return np.array(list(community.best_partition(G).values())) 505 | 506 | ########## GENE ENRICHMENT 507 | 508 | def rank_enriched_genes(E, gene_list, cell_mask, min_counts=3, 
min_cells=3, verbose=False): 509 | gix = (E[cell_mask,:]>=min_counts).sum(0).A.squeeze() >= min_cells 510 | print_optional('%i cells in group' %(sum(cell_mask)), verbose) 511 | print_optional('Considering %i genes' %(sum(gix)), verbose) 512 | 513 | gene_list = gene_list[gix] 514 | 515 | z = sparse_zscore(E[:,gix]) 516 | scores = z[cell_mask,:].mean(0).A.squeeze() 517 | o = np.argsort(-scores) 518 | 519 | return gene_list[o], scores[o] 520 | 521 | 522 | ########## PLOTTING STUFF 523 | 524 | def darken_cmap(cmap, scale_factor): 525 | cdat = np.zeros((cmap.N, 4)) 526 | for ii in range(cdat.shape[0]): 527 | curcol = cmap(ii) 528 | cdat[ii,0] = curcol[0] * scale_factor 529 | cdat[ii,1] = curcol[1] * scale_factor 530 | cdat[ii,2] = curcol[2] * scale_factor 531 | cdat[ii,3] = 1 532 | cmap = cmap.from_list(cmap.N, cdat) 533 | return cmap 534 | 535 | def custom_cmap(rgb_list): 536 | import matplotlib.pyplot as plt 537 | rgb_list = np.array(rgb_list) 538 | cmap = plt.cm.Reds 539 | cmap = cmap.from_list(rgb_list.shape[0],rgb_list) 540 | return cmap 541 | 542 | def plot_groups(x, y, groups, lim_buffer = 50, saving = False, fig_dir = './', fig_name = 'fig', res = 300, close_after = False, title_size = 12, point_size = 3, ncol = 5): 543 | import matplotlib.pyplot as plt 544 | import os # needed below when saving = True 545 | ncol = int(ncol) 546 | ngroup = len(np.unique(groups)) 547 | nrow = int(np.ceil(ngroup / float(ncol))) 548 | fig = plt.figure(figsize = (14, 3 * nrow)) 549 | for ii, c in enumerate(np.unique(groups)): 550 | ax = plt.subplot(nrow, ncol, ii+1) 551 | ix = groups == c 552 | 553 | ax.scatter(x[~ix], y[~ix], s = point_size, c = [.8,.8,.8], edgecolors = '') 554 | ax.scatter(x[ix], y[ix], s = point_size, c = [0,0,0], edgecolors = '') 555 | ax.set_xticks([]) 556 | ax.set_yticks([]) 557 | ax.set_xlim([min(x) - lim_buffer, max(x) + lim_buffer]) 558 | ax.set_ylim([min(y) - lim_buffer, max(y) + lim_buffer]) 559 | 560 | ax.set_title(str(c), fontsize = title_size) 561 | 562 | fig.tight_layout() 563 | 564 | if
saving: 565 | if not os.path.exists(fig_dir): 566 | os.makedirs(fig_dir) 567 | plt.savefig(fig_dir + '/' + fig_name + '.png', dpi=res) 568 | 569 | if close_after: 570 | plt.close() 571 | -------------------------------------------------------------------------------- /src/scrublet/scrublet.py: -------------------------------------------------------------------------------- 1 | from .helper_functions import * 2 | from sklearn.decomposition import PCA, TruncatedSVD 3 | import matplotlib.pyplot as plt 4 | 5 | class Scrublet(): 6 | def __init__(self, counts_matrix, total_counts=None, sim_doublet_ratio=2.0, n_neighbors=None, expected_doublet_rate=0.1, stdev_doublet_rate=0.02, random_state=0): 7 | ''' Initialize Scrublet object with counts matrix and doublet prediction parameters 8 | 9 | Parameters 10 | ---------- 11 | counts_matrix : scipy sparse matrix or ndarray, shape (n_cells, n_genes) 12 | Matrix containing raw (unnormalized) UMI-based transcript counts. 13 | Converted into a scipy.sparse.csc_matrix. 14 | 15 | total_counts : ndarray, shape (n_cells,), optional (default: None) 16 | Array of total UMI counts per cell. If `None`, this is calculated 17 | as the row sums of `counts_matrix`. 18 | 19 | sim_doublet_ratio : float, optional (default: 2.0) 20 | Number of doublets to simulate relative to the number of observed 21 | transcriptomes. 22 | 23 | n_neighbors : int, optional (default: None) 24 | Number of neighbors used to construct the KNN graph of observed 25 | transcriptomes and simulated doublets. If `None`, this is 26 | set to round(0.5 * sqrt(n_cells)) 27 | 28 | expected_doublet_rate : float, optional (default: 0.1) 29 | The estimated doublet rate for the experiment. 30 | 31 | stdev_doublet_rate : float, optional (default: 0.02) 32 | Uncertainty in the expected doublet rate. 33 | 34 | random_state : int, optional (default: 0) 35 | Random state for doublet simulation, approximate 36 | nearest neighbor search, and PCA/TruncatedSVD. 
37 | 38 | Attributes 39 | ---------- 40 | predicted_doublets_ : ndarray, shape (n_cells,) 41 | Boolean mask of predicted doublets in the observed 42 | transcriptomes. 43 | 44 | doublet_scores_obs_ : ndarray, shape (n_cells,) 45 | Doublet scores for observed transcriptomes. 46 | 47 | doublet_scores_sim_ : ndarray, shape (n_doublets,) 48 | Doublet scores for simulated doublets. 49 | 50 | doublet_errors_obs_ : ndarray, shape (n_cells,) 51 | Standard error in the doublet scores for observed 52 | transcriptomes. 53 | 54 | doublet_errors_sim_ : ndarray, shape (n_doublets,) 55 | Standard error in the doublet scores for simulated 56 | doublets. 57 | 58 | threshold_: float 59 | Doublet score threshold for calling a transcriptome 60 | a doublet. 61 | 62 | z_scores_ : ndarray, shape (n_cells,) 63 | Z-score conveying confidence in doublet calls. 64 | Z = `(doublet_scores_obs_ - threshold_) / doublet_errors_obs_` 65 | 66 | detected_doublet_rate_: float 67 | Fraction of observed transcriptomes that have been called 68 | doublets. 69 | 70 | detectable_doublet_fraction_: float 71 | Estimated fraction of doublets that are detectable, i.e., 72 | fraction of simulated doublets with doublet scores above 73 | `threshold_`. 74 | 75 | overall_doublet_rate_: float 76 | Estimated overall doublet rate, 77 | `detected_doublet_rate_ / detectable_doublet_fraction_`. 78 | Should agree (roughly) with `expected_doublet_rate`. 79 | 80 | manifold_obs_: ndarray, shape (n_cells, n_features) 81 | The single-cell "manifold" coordinates (e.g., PCA coordinates) 82 | for observed transcriptomes. Nearest neighbors are found using 83 | the union of `manifold_obs_` and `manifold_sim_` (see below). 84 | 85 | manifold_sim_: ndarray, shape (n_doublets, n_features) 86 | The single-cell "manifold" coordinates (e.g., PCA coordinates) 87 | for simulated doublets. Nearest neighbors are found using 88 | the union of `manifold_obs_` (see above) and `manifold_sim_`.
89 | 90 | doublet_parents_ : ndarray, shape (n_doublets, 2) 91 | Indices of the observed transcriptomes used to generate the 92 | simulated doublets. 93 | 94 | doublet_neighbor_parents_ : list, length n_cells 95 | A list of arrays of the indices of the doublet neighbors of 96 | each observed transcriptome (the ith entry is an array of 97 | the doublet neighbors of transcriptome i). 98 | ''' 99 | 100 | if not scipy.sparse.issparse(counts_matrix): 101 | counts_matrix = scipy.sparse.csc_matrix(counts_matrix) 102 | elif not scipy.sparse.isspmatrix_csc(counts_matrix): 103 | counts_matrix = counts_matrix.tocsc() 104 | 105 | # initialize counts matrices 106 | self._E_obs = counts_matrix 107 | self._E_sim = None 108 | self._E_obs_norm = None 109 | self._E_sim_norm = None 110 | 111 | if total_counts is None: 112 | self._total_counts_obs = self._E_obs.sum(1).A.squeeze() 113 | else: 114 | self._total_counts_obs = total_counts 115 | 116 | self._gene_filter = np.arange(self._E_obs.shape[1]) 117 | self._embeddings = {} 118 | 119 | self.sim_doublet_ratio = sim_doublet_ratio 120 | self.n_neighbors = n_neighbors 121 | self.expected_doublet_rate = expected_doublet_rate 122 | self.stdev_doublet_rate = stdev_doublet_rate 123 | self.random_state = random_state 124 | 125 | if self.n_neighbors is None: 126 | self.n_neighbors = int(round(0.5*np.sqrt(self._E_obs.shape[0]))) 127 | 128 | ######## Core Scrublet functions ######## 129 | 130 | def scrub_doublets(self, synthetic_doublet_umi_subsampling=1.0, use_approx_neighbors=True, distance_metric='euclidean', get_doublet_neighbor_parents=False, min_counts=3, min_cells=3, min_gene_variability_pctl=85, log_transform=False, mean_center=True, normalize_variance=True, n_prin_comps=30, verbose=True): 131 | ''' Standard pipeline for preprocessing, doublet simulation, and doublet prediction 132 | 133 | Automatically sets a threshold for calling doublets, but it's best to check 134 | this by running plot_histogram() afterwards and adjusting threshold 
135 | with call_doublets(threshold=new_threshold) if necessary. 136 | 137 | Arguments 138 | --------- 139 | synthetic_doublet_umi_subsampling : float, optional (default: 1.0) 140 | Rate for sampling UMIs when creating synthetic doublets. If 1.0, 141 | each doublet is created by simply adding the UMIs from two randomly 142 | sampled observed transcriptomes. For values less than 1, the 143 | UMI counts are added and then randomly sampled at the specified 144 | rate. 145 | 146 | use_approx_neighbors : bool, optional (default: True) 147 | Use approximate nearest neighbor method (annoy) for the KNN 148 | classifier. 149 | 150 | distance_metric : str, optional (default: 'euclidean') 151 | Distance metric used when finding nearest neighbors. For a list of 152 | valid values, see the documentation for annoy (if `use_approx_neighbors` 153 | is True) or sklearn.neighbors.NearestNeighbors (if `use_approx_neighbors` 154 | is False). 155 | 156 | get_doublet_neighbor_parents : bool, optional (default: False) 157 | If True, return the parent transcriptomes that generated the 158 | doublet neighbors of each observed transcriptome. This information can 159 | be used to infer the cell states that generated a given 160 | doublet state. 161 | 162 | min_counts : float, optional (default: 3) 163 | Used for gene filtering prior to PCA. Genes with at least 164 | `min_counts` counts in fewer than `min_cells` cells (see below) are excluded. 165 | 166 | min_cells : int, optional (default: 3) 167 | Used for gene filtering prior to PCA. Genes with at least 168 | `min_counts` counts (see above) in fewer than `min_cells` cells are excluded. 169 | 170 | min_gene_variability_pctl : float, optional (default: 85.0) 171 | Used for gene filtering prior to PCA. Keep the most highly variable genes 172 | (those at or above the `min_gene_variability_pctl` percentile), as measured by 173 | the v-statistic [Klein et al., Cell 2015].
174 | 175 | log_transform : bool, optional (default: False) 176 | If True, log-transform the counts matrix (log10(1+TPM)). 177 | `sklearn.decomposition.TruncatedSVD` will be used for dimensionality 178 | reduction, unless `mean_center` is True. 179 | 180 | mean_center : bool, optional (default: True) 181 | If True, center the data such that each gene has a mean of 0. 182 | `sklearn.decomposition.PCA` will be used for dimensionality 183 | reduction. 184 | 185 | normalize_variance : bool, optional (default: True) 186 | If True, normalize the data such that each gene has a variance of 1. 187 | `sklearn.decomposition.TruncatedSVD` will be used for dimensionality 188 | reduction, unless `mean_center` is True. 189 | 190 | n_prin_comps : int, optional (default: 30) 191 | Number of principal components used to embed the transcriptomes prior 192 | to k-nearest-neighbor graph construction. 193 | 194 | verbose : bool, optional (default: True) 195 | If True, print progress updates. 196 | 197 | Sets 198 | ---- 199 | doublet_scores_obs_, doublet_errors_obs_, 200 | doublet_scores_sim_, doublet_errors_sim_, 201 | predicted_doublets_, z_scores_ 202 | threshold_, detected_doublet_rate_, 203 | detectable_doublet_fraction_, overall_doublet_rate_, 204 | doublet_parents_, doublet_neighbor_parents_ 205 | ''' 206 | t0 = time.time() 207 | 208 | self._E_sim = None 209 | self._E_obs_norm = None 210 | self._E_sim_norm = None 211 | self._gene_filter = np.arange(self._E_obs.shape[1]) 212 | 213 | print_optional('Preprocessing...', verbose) 214 | pipeline_normalize(self) 215 | pipeline_get_gene_filter(self, min_counts=min_counts, min_cells=min_cells, min_gene_variability_pctl=min_gene_variability_pctl) 216 | pipeline_apply_gene_filter(self) 217 | 218 | print_optional('Simulating doublets...', verbose) 219 | self.simulate_doublets(sim_doublet_ratio=self.sim_doublet_ratio, synthetic_doublet_umi_subsampling=synthetic_doublet_umi_subsampling) 220 | pipeline_normalize(self, postnorm_total=1e6) 221 | 
if log_transform: 222 | pipeline_log_transform(self) 223 | if mean_center and normalize_variance: 224 | pipeline_zscore(self) 225 | elif mean_center: 226 | pipeline_mean_center(self) 227 | elif normalize_variance: 228 | pipeline_normalize_variance(self) 229 | 230 | if mean_center: 231 | print_optional('Embedding transcriptomes using PCA...', verbose) 232 | pipeline_pca(self, n_prin_comps=n_prin_comps, random_state=self.random_state) 233 | else: 234 | print_optional('Embedding transcriptomes using Truncated SVD...', verbose) 235 | pipeline_truncated_svd(self, n_prin_comps=n_prin_comps, random_state=self.random_state) 236 | 237 | print_optional('Calculating doublet scores...', verbose) 238 | self.calculate_doublet_scores( 239 | use_approx_neighbors=use_approx_neighbors, 240 | distance_metric=distance_metric, 241 | get_doublet_neighbor_parents=get_doublet_neighbor_parents 242 | ) 243 | self.call_doublets(verbose=verbose) 244 | 245 | t1=time.time() 246 | print_optional('Elapsed time: {:.1f} seconds'.format(t1 - t0), verbose) 247 | return self.doublet_scores_obs_, self.predicted_doublets_ 248 | 249 | def simulate_doublets(self, sim_doublet_ratio=None, synthetic_doublet_umi_subsampling=1.0): 250 | ''' Simulate doublets by adding the counts of random observed transcriptome pairs. 251 | 252 | Arguments 253 | --------- 254 | sim_doublet_ratio : float, optional (default: None) 255 | Number of doublets to simulate relative to the number of observed 256 | transcriptomes. If `None`, self.sim_doublet_ratio is used. 257 | 258 | synthetic_doublet_umi_subsampling : float, optional (default: 1.0) 259 | Rate for sampling UMIs when creating synthetic doublets. If 1.0, 260 | each doublet is created by simply adding the UMIs from two randomly 261 | sampled observed transcriptomes. For values less than 1, the 262 | UMI counts are added and then randomly sampled at the specified 263 | rate.
264 | 265 | Sets 266 | ---- 267 | doublet_parents_ 268 | ''' 269 | 270 | if sim_doublet_ratio is None: 271 | sim_doublet_ratio = self.sim_doublet_ratio 272 | else: 273 | self.sim_doublet_ratio = sim_doublet_ratio 274 | 275 | n_obs = self._E_obs.shape[0] 276 | n_sim = int(n_obs * sim_doublet_ratio) 277 | 278 | np.random.seed(self.random_state) 279 | pair_ix = np.random.randint(0, n_obs, size=(n_sim, 2)) 280 | 281 | E1 = self._E_obs[pair_ix[:,0],:] 282 | E2 = self._E_obs[pair_ix[:,1],:] 283 | tots1 = self._total_counts_obs[pair_ix[:,0]] 284 | tots2 = self._total_counts_obs[pair_ix[:,1]] 285 | if synthetic_doublet_umi_subsampling < 1: 286 | self._E_sim, self._total_counts_sim = subsample_counts(E1+E2, synthetic_doublet_umi_subsampling, tots1+tots2, random_seed=self.random_state) 287 | else: 288 | self._E_sim = E1+E2 289 | self._total_counts_sim = tots1+tots2 290 | self.doublet_parents_ = pair_ix 291 | return 292 | 293 | def set_manifold(self, manifold_obs, manifold_sim): 294 | ''' Set the manifold coordinates used in k-nearest-neighbor graph construction 295 | 296 | Arguments 297 | --------- 298 | manifold_obs: ndarray, shape (n_cells, n_features) 299 | The single-cell "manifold" coordinates (e.g., PCA coordinates) 300 | for observed transcriptomes. Nearest neighbors are found using 301 | the union of `manifold_obs` and `manifold_sim` (see below). 302 | 303 | manifold_sim: ndarray, shape (n_doublets, n_features) 304 | The single-cell "manifold" coordinates (e.g., PCA coordinates) 305 | for simulated doublets. Nearest neighbors are found using 306 | the union of `manifold_obs` (see above) and `manifold_sim`. 

        Sets
        ----
        manifold_obs_, manifold_sim_
        '''

        self.manifold_obs_ = manifold_obs
        self.manifold_sim_ = manifold_sim
        return

    def calculate_doublet_scores(self, use_approx_neighbors=True, distance_metric='euclidean', get_doublet_neighbor_parents=False):
        ''' Calculate doublet scores for observed transcriptomes and simulated doublets

        Requires that manifold_obs_ and manifold_sim_ have already been set.

        Arguments
        ---------
        use_approx_neighbors : bool, optional (default: True)
            Use approximate nearest neighbor method (annoy) for the KNN
            classifier.

        distance_metric : str, optional (default: 'euclidean')
            Distance metric used when finding nearest neighbors. For list of
            valid values, see the documentation for annoy (if `use_approx_neighbors`
            is True) or sklearn.neighbors.NearestNeighbors (if `use_approx_neighbors`
            is False).

        get_doublet_neighbor_parents : bool, optional (default: False)
            If True, store in `doublet_neighbor_parents_` the parent
            transcriptomes that generated the doublet neighbors of each
            observed transcriptome. This information can be used to infer
            the cell states that generated a given doublet state.

        Sets
        ----
        doublet_scores_obs_, doublet_scores_sim_,
        doublet_errors_obs_, doublet_errors_sim_,
        doublet_neighbor_parents_

        '''

        self._nearest_neighbor_classifier(
            k=self.n_neighbors,
            exp_doub_rate=self.expected_doublet_rate,
            stdev_doub_rate=self.stdev_doublet_rate,
            use_approx_nn=use_approx_neighbors,
            distance_metric=distance_metric,
            get_neighbor_parents=get_doublet_neighbor_parents
            )
        return self.doublet_scores_obs_

    def _nearest_neighbor_classifier(self, k=40, use_approx_nn=True, distance_metric='euclidean', exp_doub_rate=0.1, stdev_doub_rate=0.03, get_neighbor_parents=False):
        manifold = np.vstack((self.manifold_obs_, self.manifold_sim_))
        doub_labels = np.concatenate((np.zeros(self.manifold_obs_.shape[0], dtype=int),
                                      np.ones(self.manifold_sim_.shape[0], dtype=int)))

        n_obs = np.sum(doub_labels == 0)
        n_sim = np.sum(doub_labels == 1)

        # Adjust k (number of nearest neighbors) based on the ratio of simulated to observed cells
        k_adj = int(round(k * (1+n_sim/float(n_obs))))

        # Find k_adj nearest neighbors
        neighbors = get_knn_graph(manifold, k=k_adj, dist_metric=distance_metric, approx=use_approx_nn, return_edges=False, random_seed=self.random_state)

        # Calculate doublet score based on ratio of simulated cell neighbors vs. observed cell neighbors
        doub_neigh_mask = doub_labels[neighbors] == 1
        n_sim_neigh = doub_neigh_mask.sum(1)
        n_obs_neigh = doub_neigh_mask.shape[1] - n_sim_neigh

        rho = exp_doub_rate
        r = n_sim / float(n_obs)
        nd = n_sim_neigh.astype(float)
        ns = n_obs_neigh.astype(float)
        N = float(k_adj)

        # Bayesian
        # q is the smoothed fraction of simulated-doublet neighbors (add-one smoothing)
        q=(nd+1)/(N+2)
        Ld = q*rho/r/(1-rho-q*(1-rho-rho/r))

        se_q = np.sqrt(q*(1-q)/(N+3))
        se_rho = stdev_doub_rate

        se_Ld = q*rho/r / (1-rho-q*(1-rho-rho/r))**2 * np.sqrt((se_q/q*(1-rho))**2 + (se_rho/rho*(1-q))**2)

        self.doublet_scores_obs_ = Ld[doub_labels == 0]
        self.doublet_scores_sim_ = Ld[doub_labels == 1]
        self.doublet_errors_obs_ = se_Ld[doub_labels==0]
        self.doublet_errors_sim_ = se_Ld[doub_labels==1]

        # get parents of doublet neighbors, if requested
        neighbor_parents = None
        if get_neighbor_parents:
            parent_cells = self.doublet_parents_
            neighbors = neighbors - n_obs
            neighbor_parents = []
            for iCell in range(n_obs):
                this_doub_neigh = neighbors[iCell,:][neighbors[iCell,:] > -1]
                if len(this_doub_neigh) > 0:
                    this_doub_neigh_parents = np.unique(parent_cells[this_doub_neigh,:].flatten())
                    neighbor_parents.append(this_doub_neigh_parents)
                else:
                    neighbor_parents.append([])
        self.doublet_neighbor_parents_ = np.array(neighbor_parents)
        return

    def call_doublets(self, threshold=None, verbose=True):
        ''' Call transcriptomes as doublets or singlets

        Arguments
        ---------
        threshold : float, optional (default: None)
            Doublet score threshold for calling a transcriptome
            a doublet. If `None`, this is set automatically by looking
            for the minimum between the two modes of the `doublet_scores_sim_`
            histogram. It is best practice to check the threshold visually
            using the `doublet_scores_sim_` histogram and/or based on
            co-localization of predicted doublets in a 2-D embedding.

        verbose : bool, optional (default: True)
            If True, print summary statistics.

        Sets
        ----
        predicted_doublets_, z_scores_, threshold_,
        detected_doublet_rate_, detectable_doublet_fraction_,
        overall_doublet_rate_
        '''

        if threshold is None:
            # automatic threshold detection
            # http://scikit-image.org/docs/dev/api/skimage.filters.html
            from skimage.filters import threshold_minimum
            try:
                threshold = threshold_minimum(self.doublet_scores_sim_)
                if verbose:
                    print("Automatically set threshold at doublet score = {:.2f}".format(threshold))
            except Exception:
                self.predicted_doublets_ = None
                if verbose:
                    print("Warning: failed to automatically identify doublet score threshold. Run `call_doublets` with user-specified threshold.")
                return self.predicted_doublets_

        Ld_obs = self.doublet_scores_obs_
        Ld_sim = self.doublet_scores_sim_
        se_obs = self.doublet_errors_obs_
        Z = (Ld_obs - threshold) / se_obs
        self.predicted_doublets_ = Ld_obs > threshold
        self.z_scores_ = Z
        self.threshold_ = threshold
        self.detected_doublet_rate_ = (Ld_obs>threshold).sum() / float(len(Ld_obs))
        self.detectable_doublet_fraction_ = (Ld_sim>threshold).sum() / float(len(Ld_sim))
        self.overall_doublet_rate_ = self.detected_doublet_rate_ / self.detectable_doublet_fraction_

        if verbose:
            print('Detected doublet rate = {:.1f}%'.format(100*self.detected_doublet_rate_))
            print('Estimated detectable doublet fraction = {:.1f}%'.format(100*self.detectable_doublet_fraction_))
            print('Overall doublet rate:')
            print('\tExpected = {:.1f}%'.format(100*self.expected_doublet_rate))
            print('\tEstimated = {:.1f}%'.format(100*self.overall_doublet_rate_))

        return self.predicted_doublets_

    ######## Viz functions ########

    def plot_histogram(self, scale_hist_obs='log', scale_hist_sim='linear', fig_size = (8,3)):
        ''' Plot histogram of doublet scores for observed transcriptomes and simulated doublets

        The histogram for simulated doublets is useful for determining the correct doublet
        score threshold. To set threshold to a new value, T, run call_doublets(threshold=T).

        '''

        fig, axs = plt.subplots(1, 2, figsize = fig_size)

        ax = axs[0]
        ax.hist(self.doublet_scores_obs_, np.linspace(0, 1, 50), color='gray', linewidth=0, density=True)
        ax.set_yscale(scale_hist_obs)
        yl = ax.get_ylim()
        ax.set_ylim(yl)
        ax.plot(self.threshold_ * np.ones(2), yl, c='black', linewidth=1)
        ax.set_title('Observed transcriptomes')
        ax.set_xlabel('Doublet score')
        ax.set_ylabel('Prob. density')

        ax = axs[1]
        ax.hist(self.doublet_scores_sim_, np.linspace(0, 1, 50), color='gray', linewidth=0, density=True)
        ax.set_yscale(scale_hist_sim)
        yl = ax.get_ylim()
        ax.set_ylim(yl)
        ax.plot(self.threshold_ * np.ones(2), yl, c = 'black', linewidth = 1)
        ax.set_title('Simulated doublets')
        ax.set_xlabel('Doublet score')
        ax.set_ylabel('Prob. density')

        fig.tight_layout()

        return fig, axs

    def set_embedding(self, embedding_name, coordinates):
        ''' Add a 2-D embedding for the observed transcriptomes '''
        self._embeddings[embedding_name] = coordinates
        return

    def plot_embedding(self, embedding_name, score='raw', marker_size=5, order_points=False, fig_size=(8,4), color_map=None):
        ''' Plot doublet predictions on 2-D embedding of observed transcriptomes '''

        #from matplotlib.lines import Line2D
        if embedding_name not in self._embeddings:
            print('Cannot find "{}" in embeddings. First add the embedding using `set_embedding`.'.format(embedding_name))
            return

        # TO DO: check if self.predicted_doublets exists; plot raw scores only if it doesn't

        fig, axs = plt.subplots(1, 2, figsize = fig_size)

        x = self._embeddings[embedding_name][:,0]
        y = self._embeddings[embedding_name][:,1]
        # np.ptp() is used because the ndarray.ptp() method was removed in NumPy 2.0
        xl = (x.min() - np.ptp(x) * .05, x.max() + np.ptp(x) * 0.05)
        yl = (y.min() - np.ptp(y) * .05, y.max() + np.ptp(y) * 0.05)

        ax = axs[1]
        if score == 'raw':
            color_dat = self.doublet_scores_obs_
            vmin = color_dat.min()
            vmax = color_dat.max()
            if color_map is None:
                cmap_use = darken_cmap(plt.cm.Reds, 0.9)
            else:
                cmap_use = color_map
        elif score == 'zscore':
            color_dat = self.z_scores_
            vmin = -color_dat.max()
            vmax = color_dat.max()
            if color_map is None:
                cmap_use = darken_cmap(plt.cm.RdBu_r, 0.9)
            else:
                cmap_use = color_map
        if order_points:
            o = np.argsort(color_dat)
        else:
            o = np.arange(len(color_dat))
        # edgecolors='none' draws no marker edges; the empty string is rejected by newer matplotlib
        pp = ax.scatter(x[o], y[o], s=marker_size, edgecolors='none', c = color_dat[o],
            cmap=cmap_use, vmin=vmin, vmax=vmax)
        ax.set_xlim(xl)
        ax.set_ylim(yl)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Doublet score')
        ax.set_xlabel(embedding_name + ' 1')
        ax.set_ylabel(embedding_name + ' 2')
        fig.colorbar(pp, ax=ax)

        ax = axs[0]
        called_doubs = self.predicted_doublets_
        # edgecolors='none' draws no marker edges; the empty string is rejected by newer matplotlib
        ax.scatter(x[o], y[o], s=marker_size, edgecolors='none', c=called_doubs[o], cmap=custom_cmap([[.7,.7,.7], [0,0,0]]))
        ax.set_xlim(xl)
        ax.set_ylim(yl)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Predicted doublets')
        #singlet_marker = Line2D([], [], color=[.7,.7,.7], marker='o', markersize=5, label='Singlet', linewidth=0)
        #doublet_marker = Line2D([], [], color=[.0,.0,.0], marker='o', markersize=5, label='Doublet', linewidth=0)
        #ax.legend(handles = [singlet_marker, doublet_marker])
        ax.set_xlabel(embedding_name + ' 1')
        ax.set_ylabel(embedding_name + ' 2')

        fig.tight_layout()

        return fig, axs

--------------------------------------------------------------------------------
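The scoring arithmetic in `_nearest_neighbor_classifier` and the rate estimates in `call_doublets` can be followed more easily on scalar inputs. The block below is a minimal pure-Python sketch of those two steps, not part of the Scrublet API: the function names `adjusted_k`, `doublet_score`, and `summarize_calls` are hypothetical, the inputs are plain numbers and lists rather than NumPy arrays, and the guard against a zero detectable fraction is an addition that the original code does not have.

```python
def adjusted_k(k, n_obs, n_sim):
    """Scale k by the simulated-to-observed ratio, mirroring `k_adj` above."""
    return int(round(k * (1 + n_sim / float(n_obs))))


def doublet_score(n_sim_neigh, k_adj, rho, r):
    """Doublet score for one transcriptome.

    n_sim_neigh : number of its k_adj neighbors that are simulated doublets
    rho         : expected doublet rate
    r           : ratio of simulated doublets to observed transcriptomes
    """
    # Smoothed fraction of simulated-doublet neighbors (add-one smoothing)
    q = (n_sim_neigh + 1.0) / (k_adj + 2.0)
    return q * rho / r / (1 - rho - q * (1 - rho - rho / r))


def summarize_calls(scores_obs, scores_sim, threshold):
    """Threshold the scores and estimate the three rates set by `call_doublets`."""
    called = [s > threshold for s in scores_obs]
    detected = sum(called) / float(len(scores_obs))
    detectable = sum(s > threshold for s in scores_sim) / float(len(scores_sim))
    # Assumption of this sketch: report NaN when no simulated doublet passes the threshold
    overall = detected / detectable if detectable > 0 else float('nan')
    return called, detected, detectable, overall


# Example: k=40 with twice as many simulated doublets as observed cells
k_adj = adjusted_k(k=40, n_obs=1000, n_sim=2000)
low = doublet_score(10, k_adj, rho=0.1, r=2.0)    # few simulated neighbors
high = doublet_score(115, k_adj, rho=0.1, r=2.0)  # mostly simulated neighbors
called, detected, detectable, overall = summarize_calls(
    [0.1, 0.6, 0.05, 0.2], [0.7, 0.8, 0.1, 0.9], threshold=0.5)
```

The score rises monotonically with the number of simulated-doublet neighbors, and the overall doublet rate is the detected rate divided by the fraction of simulated doublets that exceed the same threshold.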