├── COVID19.csv
├── step0_map_raw_data_using_cellranger.sh
├── LICENSE
├── README.md
├── step2_integrate_patients_and_healthy_controls.R
├── step4_find_DEGs_between_disease_stages.ipynb
├── .ipynb_checkpoints
│   ├── step4_find_DEGs_between_disease_stages-checkpoint.ipynb
│   └── step1_detect_doublets_using_scrublet-checkpoint.ipynb
└── step1_detect_doublets_using_scrublet.ipynb

/COVID19.csv:
--------------------------------------------------------------------------------
library_id,molecule_h5,batch
P1-1r1,P1-1r1_outs/outs/molecule_info.h5,P1-1r1
P1-1r2,P1-1r2_outs/outs/molecule_info.h5,P1-1r2
P1-2r1,P1-2r1_outs/outs/molecule_info.h5,P1-2r1
P1-2r2,P1-2r2_outs/outs/molecule_info.h5,P1-2r2
P2-1,P2-1_outs/outs/molecule_info.h5,P2-1
P2-2,P2-2_outs/outs/molecule_info.h5,P2-2
P2-3,P2-3_outs/outs/molecule_info.h5,P2-3

--------------------------------------------------------------------------------
/step0_map_raw_data_using_cellranger.sh:
--------------------------------------------------------------------------------
# Map each library to the GRCh38 reference.
# "$HOME" is used instead of "~" because bash does not expand a tilde after "--option=".
cellranger count --id=P1-1r1_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P1-1r1/
cellranger count --id=P1-1r2_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P1-1r2/
cellranger count --id=P1-2r1_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P1-2r1/
cellranger count --id=P1-2r2_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P1-2r2/
cellranger count --id=P2-1_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P2-1/
cellranger count --id=P2-2_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P2-2/
cellranger count --id=P2-3_outs --transcriptome="$HOME"/refdata-cellranger-GRCh38-3.0.0/ --fastqs=P2-3/
# Aggregate the per-library outputs listed in COVID19.csv.
cellranger aggr --id=COVID19 --csv=COVID19.csv --normalize=mapped

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
BSD 2-Clause License

Copyright (c) 2020, QuKunLab
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# COVID-19
R and Python scripts for the analysis of single-cell RNA-seq data from COVID-19 patients.

## 1. Requirements
We analyzed the scRNA-seq data on a Linux system with R (version 3.6.1) and Python (version 3.6.8). The following software and packages are also required:

software|version|environment
-|-|-
Cellranger|3.1.0|Linux
Seurat|3.1.4|R
dplyr|0.8.4|R
patchwork|1.0.0|R
scrublet|0.2.1|Python
scipy|1.0.0|Python
pandas|0.24.2|Python
matplotlib|3.0.3|Python
seaborn|0.9.0|Python

## 2. Installation
Copy the scripts into the directory that contains the raw data folders (i.e. the "P1-1r1/", "P1-1r2/", "P1-2r1/", "P1-2r2/", "P2-1/", "P2-2/", and "P2-3/" folders), then run them in RStudio or Jupyter Notebook.

## 3. Step-by-step analysis

### 3.1 Map the raw sequencing data to the reference genome

    bash step0_map_raw_data_using_cellranger.sh

### 3.2 Detect doublets using Scrublet
Run **step1_detect_doublets_using_scrublet.ipynb** in Jupyter Notebook. This notebook detects doublets separately in the cells of patients and of healthy controls.

### 3.3 Integrate patients with healthy controls
Run **step2_integrate_patients_and_healthy_controls.R** in RStudio. This script also clusters the cells and performs UMAP analysis on the scRNA-seq data.

### 3.4 Plot the UMAP diagram and marker gene expression violin plots
Run **step3_plot_umap_and_marker_gene_expression.ipynb** in Jupyter Notebook. This notebook draws the UMAP diagram of the cells and violin plots of marker gene expression. All diagrams are displayed in the Jupyter interface.

### 3.5 Generate DEGs between different disease stages
Run **step4_find_DEGs_between_disease_stages.ipynb** in Jupyter Notebook. This notebook generates DEGs (differentially expressed genes) between disease stages, separately for CD14 monocytes and effector CD8 T cells. It also saves an expression heatmap of the DEGs as a PNG file.
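
The DEG criteria used in step 4 can be sketched on synthetic data. This is a minimal sketch rather than the notebook's exact code: it mirrors the defaults of `find_de_genes` (absolute log2 fold change above 1, expressing fraction above 0.2, Mann-Whitney U test with a cutoff of 0.001). The helper name `is_de_gene`, the explicit two-sided alternative, the `p < pvalue` comparison, and the toy Poisson expression vectors are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def is_de_gene(expr1, expr2, logfc=1.0, pvalue=0.001, pct=0.2):
    """Return (is_de, log2_fc, p) for one gene measured in two groups of cells."""
    expr1 = np.asarray(expr1, dtype=float)
    expr2 = np.asarray(expr2, dtype=float)
    # A pseudocount of 1e-6 avoids division by zero, as in the notebook.
    log2_fc = np.log2((expr1.mean() + 1e-6) / (expr2.mean() + 1e-6))
    # Fraction of cells in each group that express the gene at all.
    pct1 = np.count_nonzero(expr1 > 0) / expr1.size
    pct2 = np.count_nonzero(expr2 > 0) / expr2.size
    up = log2_fc > logfc and pct1 > pct
    down = log2_fc < -logfc and pct2 > pct
    if not (up or down):
        return False, log2_fc, None
    # Assumption: two-sided test, with the cutoff applied as p < pvalue.
    _, p = stats.mannwhitneyu(expr1, expr2, alternative='two-sided')
    return bool(p < pvalue), log2_fc, p


rng = np.random.default_rng(0)
high = rng.poisson(8.0, size=200)  # toy "severe stage" cells
low = rng.poisson(1.0, size=200)   # toy "healthy people" cells
flag, fc, p = is_de_gene(high, low)
print(flag, round(fc, 2))
```

Testing the fold-change and expressing-fraction filters first, as the notebook does, means the rank test only runs on genes that could pass, which keeps the loop over all genes cheaper.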
--------------------------------------------------------------------------------
/step2_integrate_patients_and_healthy_controls.R:
--------------------------------------------------------------------------------
library(dplyr)
library(Seurat)
library(patchwork)
#
#
#### import scRNA-seq data of healthy controls ####
healthy.data <- Read10X(data.dir = "cellranger_output_for_healthy_controls/")
healthy <- CreateSeuratObject(counts = healthy.data, project = "healthy", min.cells = 3, min.features = 200)
healthy[["percent.mt"]] <- PercentageFeatureSet(healthy, pattern = "^MT-")
healthy_filtered <- subset(healthy, subset = nFeature_RNA > 300 & nFeature_RNA < 5000 & percent.mt < 10)
# header=TRUE: the Scrublet step writes a header row (barcode,doublet_scores,predicted_doublets)
healthy_doublet <- read.table('healthy_controls_doublets.txt', sep=',', header=TRUE, check.names=F, row.names=1)
# compare case-insensitively: R may parse the pandas "True"/"False" strings as logical TRUE/FALSE
healthy_doublet <- healthy_doublet[toupper(as.character(healthy_doublet$predicted_doublets))=='TRUE',]
healthy_cells <- setdiff(colnames(healthy_filtered), rownames(healthy_doublet[1]))
healthy_filtered <- SubsetData(healthy_filtered, cells=healthy_cells)
healthy_filtered@meta.data['batch'] <- healthy_filtered@meta.data['orig.ident']
#
#
#### import scRNA-seq data of COVID-19 patients ####
ncov.data <- Read10X(data.dir = "cellranger_output_for_COVID19_patients/")
ncov <- CreateSeuratObject(counts = ncov.data, project = "covid19", min.cells = 3, min.features = 200)
ncov[["percent.mt"]] <- PercentageFeatureSet(ncov, pattern = "^MT-")
ncov_filtered <- subset(ncov, subset = nFeature_RNA > 500 & nFeature_RNA < 6000 & percent.mt < 10)
cellinfo <- read.table("COVID19_cells_info.csv", sep='\t', header=TRUE, quote="", check.names=FALSE, row.names=1)
ncov_filtered <- AddMetaData(ncov_filtered, metadata = cellinfo[colnames(ncov_filtered),], col.name=c('batch'))
# same header and True/False handling as for the healthy controls
ncov_doublet <- read.table('COVID19_patients_doublets.txt', sep=',', header=TRUE, check.names=F, row.names=1)
ncov_doublet <- ncov_doublet[toupper(as.character(ncov_doublet$predicted_doublets))=='TRUE',]
ncov_cells <- setdiff(colnames(ncov_filtered), rownames(ncov_doublet[1]))
ncov_filtered <- SubsetData(ncov_filtered, cells=ncov_cells)
#
#
#### integrate cells from healthy controls and COVID-19 patients ####
overlap_genes <- intersect(rownames(ncov_filtered), rownames(healthy_filtered))
pbmc.list <- c(healthy_filtered, ncov_filtered)
names(pbmc.list) = c('healthy', 'covid19')
pbmc.anchors <- FindIntegrationAnchors(object.list=pbmc.list, dims = 1:40)
pbmc.integrated <- IntegrateData(anchorset = pbmc.anchors, dims = 1:40, features.to.integrate=overlap_genes)
#
#
#### normalization, clustering, UMAP, and plotting marker genes ####
pbmc.integrated <- ScaleData(pbmc.integrated, features = rownames(pbmc.integrated))
pbmc.integrated <- RunPCA(pbmc.integrated, npcs=50, verbose=F)
pbmc.integrated <- FindNeighbors(pbmc.integrated, dims = 1:50)
pbmc.integrated <- FindClusters(pbmc.integrated, resolution = 0.3)
pbmc.integrated <- RunUMAP(pbmc.integrated, reduction = "pca", dims = 1:50)
DimPlot(pbmc.integrated, reduction = "umap", label=TRUE)
FeaturePlot(pbmc.integrated, features=c('PTPRC', 'CD14', 'FCGR3A', 'CD3D', 'CD4', 'IL7R', 'CCR7', 'CD8A', 'PRDM1', 'MKI67',
                                        'TRGC1', 'NKG7', 'CD79A', 'CD38', 'CD1C', 'CLEC4C', 'PPBP', 'CD34'),
            slot='data', min.cutoff=0, max.cutoff='q90')
#
#
#### save data ####
save(pbmc.integrated, file='integrated.allgenes.RData')
write.table(pbmc.integrated[['RNA']]@data, file='RNA.data.csv',sep=',', quote=F)
write.table(pbmc.integrated[['integrated']]@data, file='integrated.data.csv', sep=',', quote=F)
write.table(pbmc.integrated@reductions$umap@cell.embeddings, file='umap.csv', sep=',', quote=F)
write.table(pbmc.integrated@meta.data, file='meta_data.csv', sep=',')
#
#
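
The tables saved by the script above can be loaded in Python for the later steps. The following is a minimal sketch, assuming pandas and a hypothetical two-cell excerpt of `umap.csv` (the barcodes and coordinates are made up). It relies on the layout that `write.table` produces when row names are written: the header row has one field fewer than the data rows, and `pandas.read_csv` then uses the first column as the index.

```python
import io

import pandas as pd

# Hypothetical excerpt in the layout produced by write.table(..., sep=',', quote=F):
# the header carries no name for the row-name (barcode) column.
umap_csv = io.StringIO(
    "UMAP_1,UMAP_2\n"
    "AAACCTGAGAAACCAT-1,1.5,-2.0\n"
    "AAACCTGAGAAACGCC-1,0.25,3.0\n"
)

# With one more data field than header names, pandas takes the first
# column as the row index, so the barcodes become the index.
umap_df = pd.read_csv(umap_csv)
print(umap_df.index.tolist())
print(umap_df.columns.tolist())
```

Keeping the barcodes as the index lets the expression, UMAP, and metadata tables be aligned with `.loc` on the same cell identifiers.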
--------------------------------------------------------------------------------
/step4_find_DEGs_between_disease_stages.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import numpy,pandas,scipy.stats\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn\n",
    "#\n",
    "#\n",
    "def find_de_genes(data_df, meta_df, clusters, stage1, stage2, logfc=1, pvalue=0.001, pct=0.2):\n",
    "    cells_df = meta_df.loc[meta_df['seurat_clusters'].isin(clusters)]\n",
    "    cells1 = cells_df.loc[cells_df['patient stage']==stage1].index.values\n",
    "    cells2 = cells_df.loc[cells_df['patient stage']==stage2].index.values\n",
    "    degenes, matrix = [], []\n",
    "    for igene,gene in enumerate(data_df.index.values):\n",
    "        expr1 = data_df.loc[gene, cells1]\n",
    "        expr2 = data_df.loc[gene, cells2]\n",
    "        log2_fc = numpy.log2((expr1.mean()+1e-6) / (expr2.mean()+1e-6))\n",
    "        pct1 = len(numpy.where(expr1>0)[0]) / len(expr1)\n",
    "        pct2 = len(numpy.where(expr2>0)[0]) / len(expr2)\n",
    "        if ((log2_fc>logfc)&(pct1>pct))|((log2_fc<-logfc)&(pct2>pct)):\n",
    "            ww, pp = scipy.stats.mannwhitneyu(expr1, expr2)\n",
    "            if pplogfc].index.values\n",
    "    t2_t1 = df1.loc[df1['log2fc']<-logfc].index.values\n",
    "    t1_h = df2.loc[df2['log2fc']>logfc].index.values\n",
    "    h_t1 = df2.loc[df2['log2fc']<-logfc].index.values\n",
    "    t2_h = df3.loc[df3['log2fc']>logfc].index.values\n",
    "    h_t2 = df3.loc[df3['log2fc']<-logfc].index.values\n",
    "    p321 = list(set(t1_t2).intersection(set(t2_h)))\n",
    "    p311 = list(set(t1_t2).intersection(set(t1_h))-set(t2_h)-set(h_t2))\n",
    "    p331 = list(set(t2_h).intersection(set(t1_h))-set(t1_t2)-set(t2_t1))\n",
    "    p313 = list(set(t1_t2).intersection(set(h_t2)))\n",
    "    p123 = list(set(t2_t1).intersection(set(h_t2)))\n",
    "    p113 = list(set(h_t2).intersection(set(h_t1))-set(t1_t2)-set(t2_t1))\n",
    "    p133 = list(set(t2_t1).intersection(set(h_t1))-set(h_t2)-set(t2_h))\n",
    "    p131 = list(set(t2_t1).intersection(set(t2_h)))\n",
    "    print(len(p321), len(p311), len(p331), len(p313), len(p123), len(p113), len(p133), len(p131))\n",
    "    print(len(set(list(t1_t2)+list(t2_t1)+list(t1_h)+list(h_t1)+list(t2_h)+list(h_t2))))\n",
    "    return p321, p311, p331, p313, p123, p113, p133, p131\n",
    "#\n",
    "def plot_de_genes(clusters, batches, scaled_df, meta_df, figsize, prefix, cmap='Reds'):\n",
    "    for ibatch,batch in enumerate(batches):\n",
    "        with open(prefix+'_geneset_'+str(ibatch)+'.txt', 'w') as outfile:\n",
    "            for gg in batch:\n",
    "                outfile.write(gg+'\\n')\n",
    "    batch_length = numpy.array([len(x) for x in batches])\n",
    "    bars = [batch_length[:ix].sum() for ix,x in enumerate(batch_length)]\n",
    "#\n",
    "    cells_df = meta_df.loc[meta_df['seurat_clusters'].isin(clusters)]\n",
    "    cells_df = cells_df.sample(frac=1)\n",
    "    cells1 = cells_df.loc[cells_df['patient stage']=='severe stage'].index.values\n",
    "    cells2 = cells_df.loc[cells_df['patient stage']=='remisson stage'].index.values\n",
    "    cells3 = cells_df.loc[cells_df['patient stage']=='healthy people'].index.values\n",
    "    cells = list(cells1) + list(cells2) + list(cells3)\n",
    "    colors = ['darkorange']*len(cells1)+['yellowgreen']*len(cells2)+['steelblue']*len(cells3)\n",
    "    genes = []\n",
    "    for batch in batches:\n",
    "        genes.extend(batch)\n",
    "    sub_df = scaled_df.loc[genes, cells]\n",
    "    zscore = scipy.stats.zscore(sub_df.values, axis=1)\n",
    "    zscore_df = pandas.DataFrame(zscore, index=sub_df.index, columns=sub_df.columns)\n",
    "    im = seaborn.clustermap(zscore_df, col_cluster=False, row_cluster=False, cmap=cmap, xticklabels=False,\n",
    "                            method='ward', vmin=0, vmax=1, col_colors=colors, figsize=figsize)\n",
    "    im.ax_heatmap.hlines(bars, xmin=0, xmax=zscore_df.shape[1], colors='grey')\n",
    "    im.ax_heatmap.vlines([len(cells1), len(cells1)+len(cells2)], ymin=0, ymax=zscore_df.shape[0],\n",
    "                         colors='grey')\n",
    "    plt.savefig(prefix+'_heatmap.png', bbox_inches='tight')\n",
    "    plt.close()\n",
    "    return\n",
    "#\n",
    "#\n",
    "p321, p311, p331, p313, p123, p113, p133, p131 = organize_genes(\n",
    "    'DEgenes_cd14mono.severe_vs_remission.csv',\n",
    "    'DEgenes_cd14mono.severe_vs_healthy.csv',\n",
    "    'DEgenes_cd14mono.remission_vs_healthy.csv',\n",
    "    meta_df, logfc=1)\n",
    "cluster_color = {1:'skyblue', 8:'crimson', 15:'royalblue', 12:'green'}\n",
    "plot_de_genes([1,8,15,12], [p321+p311,p331,p133,p131], integrated_df, meta_df, (10,20),\n",
    "              'cd14mono')\n",
    "#\n",
    "#\n",
    "p321, p311, p331, p313, p123, p113, p133, p131 = organize_genes(\n",
    "    'DEgenes_cd8effector.severe_vs_remission.csv',\n",
    "    'DEgenes_cd8effector.severe_vs_healthy.csv',\n",
    "    'DEgenes_cd8effector.remission_vs_healthy.csv',\n",
    "    meta_df, logfc=1)\n",
    "plot_de_genes([5], [p321+p311,p331,p131], integrated_df, meta_df, (10,20), 'cd8effector')\n",
    "#\n",
    "#"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

--------------------------------------------------------------------------------
/step1_detect_doublets_using_scrublet.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counts matrix shape: 66457 rows, 33538 columns\n",
      "Number of genes in gene list: 33538\n"
     ]
    }
   ],
   "source": [
    "import numpy,pandas,scipy.io,scipy.sparse\n",
    "import scrublet\n",
    "#\n",
    "#\n",
    "input_dir = 'cellranger_output_for_healthy_controls/'\n",
    "counts_matrix = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()\n",
    "genes = numpy.array(scrublet.load_genes(input_dir + '/features.tsv', delimiter='\\t', column=1))\n",
    "out_df = pandas.read_csv(input_dir + '/barcodes.tsv', header = None, index_col=None, names=['barcode'])\n",
    "print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))\n",
    "print('Number of genes in gene list: {}'.format(len(genes)))\n",
    "#"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "scrub = scrublet.Scrublet(counts_matrix, expected_doublet_rate=0.06)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Preprocessing...\n",
      "Simulating doublets...\n",
      "Embedding transcriptomes using PCA...\n",
      "Calculating doublet scores...\n",
      "Automatically set threshold at doublet score = 0.35\n",
      "Detected doublet rate = 1.7%\n",
      "Estimated detectable doublet fraction = 47.8%\n",
      "Overall doublet rate:\n",
      "\tExpected   = 6.0%\n",
      "\tEstimated  = 3.6%\n",
      "Elapsed time: 236.5 seconds\n"
     ]
    }
   ],
   "source": [
    "doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85,\n",
    "                                                          n_prin_comps=30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Detected doublet rate = 2.2%\n",
      "Estimated detectable doublet fraction = 51.4%\n",
      "Overall doublet rate:\n",
      "\tExpected   = 6.0%\n",
      "\tEstimated  = 4.3%\n"
     ]
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "scrub.call_doublets(threshold=0.25)\n",
    "scrub.plot_histogram()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.021908903501512256\n"
     ]
    }
   ],
   "source": [
    "print(scrub.detected_doublet_rate_)\n",
    "out_df['doublet_scores'] = doublet_scores\n",
    "out_df['predicted_doublets'] = predicted_doublets\n",
    "out_df.to_csv('healthy_controls_doublets.txt', index=False,header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counts matrix shape: 18752 rows, 33538 columns\n",
      "Number of genes in gene list: 33538\n",
      "Preprocessing...\n",
      "Simulating doublets...\n",
      "Embedding transcriptomes using PCA...\n",
      "Calculating doublet scores...\n",
      "Automatically set threshold at doublet score = 0.58\n",
      "Detected doublet rate = 0.4%\n",
      "Estimated detectable doublet fraction = 18.6%\n",
      "Overall doublet rate:\n",
      "\tExpected   = 6.0%\n",
      "\tEstimated  = 2.3%\n",
      "Elapsed time: 27.7 seconds\n",
      "Detected doublet rate = 3.7%\n",
      "Estimated detectable doublet fraction = 44.1%\n",
      "Overall doublet rate:\n",
      "\tExpected   = 6.0%\n",
      "\tEstimated  = 8.4%\n",
      "0.037116040955631396\n"
     ]
    }
   ],
   "source": [
    "input_dir = 'cellranger_output_for_COVID19_patients/'\n",
    "counts_matrix = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()\n",
    "genes = numpy.array(scrublet.load_genes(input_dir + '/features.tsv', delimiter='\\t', column=1))\n",
    "out_df = pandas.read_csv(input_dir + '/barcodes.tsv', header = None, index_col=None, names=['barcode'])\n",
    "print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))\n",
    "print('Number of genes in gene list: {}'.format(len(genes)))\n",
    "scrub = scrublet.Scrublet(counts_matrix, expected_doublet_rate=0.06)\n",
    "doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85,\n",
    "                                                          n_prin_comps=30)\n",
    "scrub.call_doublets(threshold=0.25)\n",
    "scrub.plot_histogram()\n",
    "print(scrub.detected_doublet_rate_)\n",
    "out_df['doublet_scores'] = doublet_scores\n",
    "out_df['predicted_doublets'] = predicted_doublets\n",
    "out_df.to_csv('COVID19_patients_doublets.txt', index=False,header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
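
The summary lines printed in the Scrublet notebook are related by simple arithmetic: in each run the estimated overall doublet rate is consistent with the detected rate divided by the detectable fraction (for example, 2.2% / 51.4% ≈ 4.3% for the healthy controls at threshold 0.25). The following is a minimal NumPy sketch of that relationship, not Scrublet's implementation. The function name `summarize_doublets`, the score-greater-than-threshold rule, and the toy scores are assumptions for illustration.

```python
import numpy as np


def summarize_doublets(obs_scores, sim_scores, threshold=0.25):
    """Toy re-computation of the three rates from doublet scores."""
    obs_scores = np.asarray(obs_scores, dtype=float)
    sim_scores = np.asarray(sim_scores, dtype=float)
    # Assumption: a cell is called a doublet when its score exceeds the threshold.
    predicted = obs_scores > threshold
    # Fraction of observed cells called doublets.
    detected_rate = predicted.mean()
    # Fraction of simulated doublets that score above the same threshold.
    detectable_fraction = (sim_scores > threshold).mean()
    # Correct the detected rate for doublets that cannot be detected.
    overall_rate = detected_rate / detectable_fraction
    return predicted, detected_rate, detectable_fraction, overall_rate


obs = [0.05, 0.10, 0.30, 0.02, 0.08, 0.12, 0.40, 0.01, 0.03, 0.07]  # toy observed scores
sim = [0.50, 0.20, 0.60, 0.10]                                      # toy simulated-doublet scores
predicted, detected, detectable, overall = summarize_doublets(obs, sim)
print(detected, detectable, overall)
```

Lowering the threshold from the automatic value to 0.25, as both runs in the notebook do, raises the detected rate and the detectable fraction together, which is why the estimated overall rate changes less than the detected rate does.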