├── .gitignore ├── README.md ├── overlapping_genes ├── overlap.rmd ├── overlap_functions.R ├── scREAD_overlapping_DEGs_from_multiple_comparisons.ipynb └── scread_db.rdata └── workflow ├── README.md ├── Snakefile ├── build_control_atlas.R ├── custom_marker.csv ├── example_control.csv ├── example_disease.csv ├── functions.R ├── run_analysis.R └── transfer_cell_type.R /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | workflow/.Rhistory 3 | 4 | overlapping_genes/.Rhistory 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # scREAD protocol 2 | 3 | 4 | This is the repository for the *STAR protocols* manuscript: **Use of scREAD to Explore and Analyze Single-cell and Single-nucleus RNA-Seq data for Alzheimer’s Disease**. 5 | 6 | The protocol is baed on **scREAD** (single-cell RNA-Seq database for Alzheimer's Disease). It is a first-of-its-kind database to provide comprehensive analysis results of all the existing single-cell RNA-Seq and single-nucleus RNA-Seq data of Alzheimer's Disease in the public domain. The database is freely available at: [http://osubmi.com/scread](http://osubmi.com/scread/) 7 | 8 | The original scREAD paper was published in *iScience*: [scREAD: A Single-Cell RNA-Seq Database for Alzheimer's Disease](https://www.sciencedirect.com/science/article/pii/S2589004220309664) 9 | 10 | If you have any questions or feedback regarding this notebook, please contact Cankun Wang . 11 | 12 | ## How to use the protocols? 13 | 14 | - Calculating overlapping DEGs from the same cell type across datasets (Optional section 6 in the manuscript) 15 | - Run /overlapping_genes/overlap.rmd locally 16 | - Use Google Colab version: [https://colab.research.google.com/drive/1lInXa6jD4yc7RGJc0EWDfy5NNoXT1qye?usp=sharing](https://colab.research.google.com/drive/1lInXa6jD4yc7RGJc0EWDfy5NNoXT1qye?usp=sharing) 17 | - Optional section 7: Running scREAD backend analysis workflow locally (Optional section 7 in the manuscript) 18 | - open [workflow readme](https://github.com/OSU-BMBL/scread-protocol/tree/master/workflow) 19 | 20 | ## Directory structure 21 | 22 | - overlapping_genes 23 | - scread_db.rdata (91MB): scREAD dataset information, differential gene expression analysis results. 24 | - overlap_functions.R: functions to obtain overlapping genes. 25 | - overlap.rmd: R markdown version to calculate overlapping genes. 26 | - workflow 27 | - custom_marker.csv: A manually created marker gene list file used for identified cell types. 28 | - functions.R: Visualization functions used in R. 29 | - build_control_atlas.R: build control cells atlas Seurat object from count matrix file. 30 | - transfer_cell_type.R: filter out control-like cells in disease dataset 31 | - run_analysis.R: run analysis workflow, and export tables in scREAD database format. 32 | - example_control.csv. The example control dataset. 33 | - example_disease.csv. The example disease dataset. 34 | 35 | ## Authors 36 | 37 | - [Cankun Wang](https://github.com/Wang-Cankun) 38 | - [Yujia Xiang](https://github.com/Candlelight-XYJ) 39 | 40 | -------------------------------------------------------------------------------- /overlapping_genes/overlap.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "scREAD overlapping DEGs from multiple comparisons" 3 | author: "Cankun Wang" 4 | date: "`r format(Sys.time(), '%m/%d/%Y')`" 5 | output: 6 | html_document: 7 | toc: true 8 | toc_float: false 9 | number_sections: true 10 | df_print: paged 11 | --- 12 | 13 | ```{r setup, include=FALSE} 14 | knitr::opts_chunk$set(echo = F) 15 | library(tidyverse) 16 | library(rlist) 17 | library(tools) 18 | library(RVenn) 19 | library(rjson) 20 | 21 | # Please install DT package for interactive table 22 | # library(DT) 23 | 24 | # Critical: set your working directory below 25 | knitr::opts_knit$set(root.dir = "C:/Users/flyku/Documents/GitHub/scread-protocol/overlapping_genes") 26 | setwd("C:/Users/flyku/Documents/GitHub/scread-protocol/overlapping_genes") 27 | 28 | ``` 29 | 30 | # Introduction 31 | 32 | This notebook is a section to the protocol, *Use of scREAD to Explore and Analyze Single-cell and Single-nucleus RNA-Seq data for Alzheimer’s Disease* 33 | 34 | If you have any questions or feedback regarding this notebook, please contact Cankun Wang . 35 | 36 | ## Outline 37 | 38 | 0. How to use this notebook 39 | 1. Background 40 | 2. Install dependencies and curate the data from scREAD 41 | 3. Specify parameters settings 42 | 4. Interpret the result table 43 | 44 | ## 0. How to use this notebook 45 | This notebook utilizes Google Colab , which is an interactive computational enviroment that combines live code, visualizations, and explanatory text. To run this notebook, you may first need to make a copy by choosing File > Save a Copy in Drive from the menu bar (may take a few moments to save). 46 | 47 | The notebook is organized into a series of cells. You can modify the R command and execute each cell as you would a Jupyter notebook/R notebook. To run all of the cells at once, choose **Runtime > Run all** from the menu bar. 48 | 49 | ## 1. Background 50 | 51 | The notebook provide table to mainly answer the question: 52 | 53 | For all differentially expressed genes (DEGs) in Alzheimer’s Disease (AD) vs control datasets comparisons at a cell type of interest from scREAD, what genes are commanly ranked at top positions by the log-foldchanges? 54 | 55 | ## 2. Install dependencies and curate the data from scREAD 56 | 57 | First, let's install some necessary dependencies in this project, this should take about **10 minutes**: 58 | 59 | install.packages('RVenn', repos='http://cran.rstudio.com/') 60 | 61 | install.packages('rlist', repos='http://cran.rstudio.com/') 62 | 63 | ```{r,echo=F,eval=T,message=F,warning=F,error=F} 64 | 65 | 66 | tryCatch({ 67 | load("scread_db.rdata") 68 | }, error = { 69 | load( 70 | url( 71 | 'https://bmbl.bmi.osumc.edu/downloadFiles/scread/protocol/scread_db.rdata' 72 | ) 73 | ) 74 | }) 75 | 76 | 77 | source("overlap_functions.R") 78 | 79 | ``` 80 | 81 | ## 3. Specify parameters settings 82 | 83 | To calculate overlapping genes, these parameters are needed: 84 | 85 | - The number of top genes in each AD vs control DE results (default=100) 86 | 87 | - Species (default=Human) 88 | 89 | - Brain region (e.g,Entorhinal Cortex) 90 | 91 | - DE direction (e.g, up) 92 | 93 | - Overlap threshold (for example, a gene is an overlapping gene if a should at least appeared 3 times in total 4 comparisons. Here the threshold=3) 94 | 95 | By default two tables will be generated: 96 | 97 | 1. The overlapping genes in the selected brain region 98 | 99 | 2. The detailed information, including rankings, log-foldchange, dataset source information from the overlapping genes 100 | 101 | 102 | 103 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 104 | # GLOBAL LIST 105 | 106 | REGION_LIST <- sort(unique(dataset$region)) 107 | CT_LIST <- sort(unique(cell_type_meta$cell_type)) 108 | CT_SHORT_LIST <- CT_LIST 109 | CT_SHORT_LIST[CT_LIST=="Oligodendrocyte precursor cells"] <- "opc" 110 | CT_SHORT_LIST <- tolower(substr(CT_SHORT_LIST, 1, 3)) 111 | 112 | 113 | ``` 114 | 115 | # Overview 116 | 117 | To calculate overlapping genes, these parameters are needed 118 | 119 | The number of top genes in each AD vs control DE results (default=100) 120 | 121 | Species (default=Human) 122 | 123 | Brain region (e.g,Entorhinal Cortex) 124 | 125 | DE direction (e.g, up) 126 | 127 | Overlap threshold (for example, a gene is an overlapping gene if a should at least appeared 3 times in total 4 comparisons. Here the threshold=3) 128 | 129 | Each section has two table: 130 | 131 | 1. The overlapping genes in the selected region 132 | 133 | 2. The ranking information from the overlapping genes 134 | 135 | # Human - Up-regulated 136 | 137 | ## Entorhinal Cortex 138 | 139 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 140 | # Settings 141 | TOP <- 100 142 | this_species <- 'Human' 143 | this_region <- REGION_LIST[5] 144 | this_direction <- 'up' 145 | OVERLAP_THRES <- 3 146 | res <- calc_overlap_list() 147 | DT::datatable(res$list, filter = "top") 148 | DT::datatable(res$rank, filter = "top") 149 | 150 | ``` 151 | 152 | 153 | ## Prefrontal cortex 154 | 155 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 156 | 157 | # Settings 158 | TOP <- 100 159 | this_species <- 'Human' 160 | this_region <- REGION_LIST[7] 161 | this_direction <- 'up' 162 | OVERLAP_THRES <- 4 163 | res <- calc_overlap_list() 164 | DT::datatable(res$list, filter = "top") 165 | DT::datatable(res$rank, filter = "top") 166 | ``` 167 | 168 | 169 | 170 | ## Superior frontal gyrus (BA8) 171 | 172 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 173 | 174 | # Settings 175 | TOP <- 100 176 | this_species <- 'Human' 177 | this_region <- REGION_LIST[9] 178 | this_direction <- 'up' 179 | OVERLAP_THRES <- 2 180 | res <- calc_overlap_list() 181 | DT::datatable(res$list, filter = "top") 182 | DT::datatable(res$rank, filter = "top") 183 | 184 | ``` 185 | 186 | 187 | ## Superior parietal lobe 188 | 189 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 190 | 191 | # Settings 192 | TOP <- 100 193 | this_species <- 'Human' 194 | this_region <- REGION_LIST[10] 195 | this_direction <- 'up' 196 | OVERLAP_THRES <- 2 197 | res <- calc_overlap_list() 198 | DT::datatable(res$list, filter = "top") 199 | DT::datatable(res$rank, filter = "top") 200 | 201 | ``` 202 | 203 | 204 | 205 | # Human - Down-regulated 206 | 207 | ## Entorhinal Cortex 208 | 209 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 210 | # Settings 211 | TOP <- 100 212 | this_species <- 'Human' 213 | this_region <- REGION_LIST[5] 214 | this_direction <- 'down' 215 | OVERLAP_THRES <- 3 216 | res <- calc_overlap_list() 217 | DT::datatable(res$list, filter = "top") 218 | DT::datatable(res$rank, filter = "top") 219 | 220 | ``` 221 | 222 | 223 | ## Prefrontal cortex 224 | 225 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 226 | 227 | # Settings 228 | TOP <- 100 229 | this_species <- 'Human' 230 | this_region <- REGION_LIST[7] 231 | this_direction <- 'down' 232 | OVERLAP_THRES <- 4 233 | res <- calc_overlap_list() 234 | DT::datatable(res$list, filter = "top") 235 | DT::datatable(res$rank, filter = "top") 236 | ``` 237 | 238 | 239 | 240 | ## Superior frontal gyrus (BA8) 241 | 242 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 243 | 244 | # Settings 245 | TOP <- 100 246 | this_species <- 'Human' 247 | this_region <- REGION_LIST[9] 248 | this_direction <- 'down' 249 | OVERLAP_THRES <- 2 250 | res <- calc_overlap_list() 251 | DT::datatable(res$list, filter = "top") 252 | DT::datatable(res$rank, filter = "top") 253 | 254 | ``` 255 | 256 | 257 | ## Superior parietal lobe 258 | 259 | ```{r,echo=F,eval=T,message=FALSE,warning=F} 260 | 261 | # Settings 262 | TOP <- 100 263 | this_species <- 'Human' 264 | this_region <- REGION_LIST[10] 265 | this_direction <- 'down' 266 | OVERLAP_THRES <- 2 267 | res <- calc_overlap_list() 268 | DT::datatable(res$list, filter = "top") 269 | DT::datatable(res$rank, filter = "top") 270 | 271 | ``` 272 | 273 | 274 | 275 | # Session Infomation 276 | 277 | ```{r} 278 | 279 | #rm(mydb) 280 | #rm(de_query) 281 | save.image("scread_db1.rdata") 282 | sessionInfo() 283 | ``` 284 | -------------------------------------------------------------------------------- /overlapping_genes/overlap_functions.R: -------------------------------------------------------------------------------- 1 | 2 | calc_overlap_list <- function(this_disease_ids, this_control_ids) { 3 | 4 | # Options based on current settings 5 | this_ids <- dataset %>% 6 | filter(region == this_region & species == this_species) %>% 7 | pull(data_id) 8 | 9 | this_control_ids <- dataset %>% 10 | filter(region == this_region & species == this_species & condition == 'Control') %>% 11 | pull(data_id) 12 | 13 | this_disease_ids <- dataset %>% 14 | filter(region == this_region & species == this_species & condition == 'Disease') %>% 15 | pull(data_id) 16 | 17 | overlap_dataset_thres <- function(n) { 18 | half <- floor(n/2) + 1 19 | #half <- length(n) - 1 20 | return(half) 21 | } 22 | 23 | 24 | this_overlap_result <- list() 25 | this_deg_rank_result <- data.frame() 26 | # iterate cell types 27 | ctIndex = 1 28 | for (ctIndex in 1:length(CT_LIST)) { 29 | this_ct <- CT_LIST[ctIndex] 30 | this_ct_short <- CT_SHORT_LIST[ctIndex] 31 | #dIndex = 2 32 | this_deg_rank <- data.frame() 33 | this_deg <- list() 34 | #iterate disease IDs 35 | for (dIndex in 1:length(this_disease_ids)) { 36 | this_disease_id <- this_disease_ids[dIndex] 37 | this_control_id <- de_meta %>% 38 | filter(data_id == this_disease_id & description == "Disease vs control (same region)") %>% 39 | pull(b_data_id) 40 | this_comparison_deg <- de %>% 41 | filter(cluster == this_ct & ct == this_ct_short & a_data_id == this_disease_id & b_data_id %in% this_control_id) %>% 42 | arrange(p_val_adj) %>% 43 | filter(p_val_adj < 0.05) %>% 44 | filter(case_when( 45 | this_direction == 'up' ~ avg_logFC > 0.5, 46 | this_direction == 'down' ~ avg_logFC < -0.5 47 | )) %>% 48 | head(TOP) %>% 49 | rownames_to_column('rank') %>% 50 | mutate( 51 | total_comparison = length(this_disease_ids) 52 | ) %>% 53 | select(ct, gene, avg_logFC, a_data_id, b_data_id,rank, total_comparison) 54 | 55 | this_deg_rank <- rbind(this_deg_rank, this_comparison_deg) 56 | this_deg <- list.append(this_deg, this_comparison_deg$gene) 57 | } 58 | ### Test a gene in deg list 59 | 60 | #sapply(this_deg_result, function(x){ 61 | # 'LINC01481' %in% x 62 | #}) 63 | 64 | 65 | ########### Calculate unioned overlaps from one cell type in all datasets 66 | this_deg_overlap <- vector() 67 | half <- OVERLAP_THRES 68 | # Combined 69 | data_combines <- combn(length(this_disease_ids),half) 70 | 71 | #cIndex <- 1 72 | for (cIndex in 1:ncol(data_combines)) { 73 | this_combine <- as.vector(data_combines[,cIndex]) 74 | venn_set <- Venn(this_deg) 75 | tmp_venn <- overlap(venn_set, this_combine) 76 | this_deg_overlap <- union(this_deg_overlap, tmp_venn) 77 | } 78 | ########### Calculate unioned overlaps from one cell type in all datasets 79 | this_deg_rank_result <- this_deg_rank %>% 80 | filter(gene %in% this_deg_overlap) %>% 81 | rbind(this_deg_rank_result) 82 | this_overlap_result <- list.append(this_overlap_result, this_deg_overlap) 83 | 84 | } 85 | 86 | names(this_overlap_result) <- CT_SHORT_LIST 87 | 88 | freq_deg <- this_deg_rank_result %>% 89 | group_by(ct) %>% 90 | count(gene) %>% 91 | group_by(gene)%>% 92 | count(gene,sort = T) 93 | 94 | #i=1 95 | freq_comparisons <- this_deg_rank_result %>% 96 | filter(gene %in% freq_deg$gene) %>% 97 | group_by(ct,gene) %>% 98 | count(gene,sort = T,name = "overlapping_comparison") 99 | 100 | overlap_list_result <- tibble() 101 | for (i in 1:nrow(freq_deg)) { 102 | this_gene <- freq_deg %>% 103 | pull(gene) %>% 104 | nth(i) 105 | overlap_list_result <- this_deg_rank_result %>% 106 | filter(gene == this_gene) %>% 107 | pull(ct) %>% 108 | unique() %>% 109 | sort() %>% 110 | paste(collapse = ",") %>% 111 | tibble(ct=.,gene=this_gene) %>% 112 | rbind(overlap_list_result) %>% 113 | arrange(ct) %>% 114 | arrange(desc(nchar(ct))) %>% 115 | mutate(across(where(is_character),as_factor)) 116 | } 117 | 118 | overlap_rank_result <- this_deg_rank_result %>% 119 | filter(gene %in% freq_deg$gene) %>% 120 | group_by(gene) %>% 121 | mutate( 122 | mean_rank = mean(as.numeric(rank)) 123 | ) %>% 124 | arrange(gene) %>% 125 | left_join(freq_comparisons, by="gene") %>% 126 | tibble() %>% 127 | mutate(across(where(is_character),as_factor)) %>% 128 | select(ct.x, gene, avg_logFC, a_data_id, b_data_id, rank, overlapping_comparison, total_comparison, mean_rank) %>% 129 | rename(ct = ct.x, disease_id = a_data_id, control_id = b_data_id) %>% 130 | filter(!duplicated(avg_logFC)) 131 | 132 | return(list(list=overlap_list_result, rank=overlap_rank_result)) 133 | } 134 | -------------------------------------------------------------------------------- /overlapping_genes/scREAD_overlapping_DEGs_from_multiple_comparisons.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "scREAD overlapping DEGs from multiple comparisons", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "display_name": "R", 12 | "name": "ir" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "tU-p9Jki-dWG" 20 | }, 21 | "source": [ 22 | "This notebook is a section to the protocol, *Use of scREAD to Explore and Analyze Single-cell and Single-nucleus RNA-Seq data for Alzheimer’s Disease*\r\n", 23 | "\r\n", 24 | "If you have any questions or feedback regarding this notebook, please contact Cankun Wang ." 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "id": "SCQEG4YV-tmD" 31 | }, 32 | "source": [ 33 | "## Outline \r\n", 34 | "\r\n", 35 | "0. How to use this notebook\r\n", 36 | "1. Background\r\n", 37 | "2. Install dependencies and curate the data from scREAD\r\n", 38 | "3. Specify parameters settings \r\n", 39 | "4. Interpret the result table" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": { 45 | "id": "14UbSD_U_CHv" 46 | }, 47 | "source": [ 48 | "## 0. How to use this notebook\r\n", 49 | "This notebook utilizes Google Colab , which is an interactive computational enviroment that combines live code, visualizations, and explanatory text. To run this notebook, you may first need to make a copy by choosing File > Save a Copy in Drive from the menu bar (may take a few moments to save).\r\n", 50 | "\r\n", 51 | "The notebook is organized into a series of cells. You can modify the R command and execute each cell as you would a Jupyter notebook/R notebook. To run all of the cells at once, choose **Runtime > Run all** from the menu bar." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": { 57 | "id": "Th830cNz_P1Z" 58 | }, 59 | "source": [ 60 | "## 1. Background\r\n", 61 | "\r\n", 62 | "The notebook provide table to mainly answer the question:\r\n", 63 | "\r\n", 64 | "For all differentially expressed genes (DEGs) in Alzheimer’s Disease (AD) vs control datasets comparisons at a cell type of interest from scREAD, what genes are commanly ranked at top positions by the log-foldchanges? " 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "id": "qOjS5Ugv8jiV" 71 | }, 72 | "source": [ 73 | "## 2. Install dependencies and curate the data from scREAD\r\n", 74 | "\r\n", 75 | "First, let's install some necessary dependencies in this project, this should take about **10 minutes**:" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": { 81 | "id": "OXw0k8nBkpHP" 82 | }, 83 | "source": [ 84 | "" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "metadata": { 90 | "colab": { 91 | "base_uri": "https://localhost:8080/" 92 | }, 93 | "id": "VNwzbCdQ5Hq8", 94 | "outputId": "f1d1eb66-5d55-4b22-8209-d361bb8f0b1c" 95 | }, 96 | "source": [ 97 | "install.packages('RVenn', repos='http://cran.rstudio.com/')\r\n", 98 | "install.packages('rlist', repos='http://cran.rstudio.com/')" 99 | ], 100 | "execution_count": 1, 101 | "outputs": [ 102 | { 103 | "output_type": "stream", 104 | "text": [ 105 | "Installing package into ‘/usr/local/lib/R/site-library’\n", 106 | "(as ‘lib’ is unspecified)\n", 107 | "\n", 108 | "also installing the dependencies ‘tweenr’, ‘polyclip’, ‘RcppEigen’, ‘permute’, ‘ggforce’, ‘vegan’, ‘pheatmap’\n", 109 | "\n", 110 | "\n", 111 | "Installing package into ‘/usr/local/lib/R/site-library’\n", 112 | "(as ‘lib’ is unspecified)\n", 113 | "\n", 114 | "also installing the dependencies ‘XML’, ‘data.table’\n", 115 | "\n", 116 | "\n" 117 | ], 118 | "name": "stderr" 119 | } 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": { 125 | "id": "sxIZlHAN-LCE" 126 | }, 127 | "source": [ 128 | "Next, we will load the R packages, scREAD data (~70MB), and pre-defined functions to calculate the overlapping genes:" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "metadata": { 134 | "colab": { 135 | "base_uri": "https://localhost:8080/" 136 | }, 137 | "id": "bMF_kFRn4O0B", 138 | "outputId": "84a40949-f5dd-404b-f0a0-68f374f0cba0" 139 | }, 140 | "source": [ 141 | "\r\n", 142 | "library(tidyverse)\r\n", 143 | "library(RVenn)\r\n", 144 | "library(rlist)\r\n", 145 | "library(knitr)\r\n", 146 | "\r\n", 147 | "## Load data from GitHub or use the alternative server\r\n", 148 | "\r\n", 149 | "tryCatch({\r\n", 150 | " load(url('https://raw.githubusercontent.com/OSU-BMBL/scread-protocol/master/overlapping_genes/scread_db.rdata'))\r\n", 151 | "}, error= function(e) {\r\n", 152 | " message(\"Failed to load data from GitHub , trying alternative server...\")\r\n", 153 | " load(url('https://bmbl.bmi.osumc.edu/downloadFiles/scread/protocol/scread_db.rdata'))\r\n", 154 | "})\r\n", 155 | "## The scread_db.rdata contains scREAD database DEGs data, a function named 'calc_overlap_list' to perform the analysis" 156 | ], 157 | "execution_count": 12, 158 | "outputs": [ 159 | { 160 | "output_type": "stream", 161 | "text": [ 162 | "Warning message in load(url(\"https://raw.githubusercontent.com/OSU-BMBL/scread-protocol/master/overlapping_genes/scread_db3.rdata\")):\n", 163 | "“cannot open URL 'https://raw.githubusercontent.com/OSU-BMBL/scread-protocol/master/overlapping_genes/scread_db3.rdata': HTTP status was '404 Not Found'”\n", 164 | "Failed to load data from GitHub , trying alternative server...\n", 165 | "\n", 166 | "Warning message in load(url(\"https://bmbl.bmi.osumc.edu/downloadFiles/scread/protocol/scread_db.rdata\")):\n", 167 | "“input string '10<96>12 months' cannot be translated to UTF-8, is it valid in 'CP936'?”\n" 168 | ], 169 | "name": "stderr" 170 | } 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": { 176 | "id": "ZIkp3_h07l7S" 177 | }, 178 | "source": [ 179 | "## 3. Specify parameters settings \r\n", 180 | "\r\n", 181 | "To calculate overlapping genes, these parameters are needed:\r\n", 182 | "\r\n", 183 | "- The number of top genes in each AD vs control DE results (default=100)\r\n", 184 | "\r\n", 185 | "- Species (default=Human)\r\n", 186 | "\r\n", 187 | "- Brain region (e.g,Entorhinal Cortex)\r\n", 188 | "\r\n", 189 | "- DE direction (e.g, up)\r\n", 190 | "\r\n", 191 | "- Overlap threshold (for example, a gene is an overlapping gene if a should at least appeared 3 times in total 4 comparisons. Here the threshold=3)\r\n", 192 | "\r\n", 193 | "By default two tables will be generated: \r\n", 194 | "\r\n", 195 | "1. The overlapping genes in the selected brain region\r\n", 196 | "\r\n", 197 | "2. The detailed information, including rankings, log-foldchange, dataset source information from the overlapping genes\r\n" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": { 203 | "id": "c-nzktNu_yre" 204 | }, 205 | "source": [ 206 | "First, we can process some of our metadata:" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "metadata": { 212 | "id": "7aHHOpSM7jay" 213 | }, 214 | "source": [ 215 | "REGION_LIST <- sort(unique(dataset$region))\r\n", 216 | "CT_LIST <- sort(unique(cell_type_meta$cell_type))\r\n", 217 | "CT_SHORT_LIST <- CT_LIST\r\n", 218 | "CT_SHORT_LIST[CT_LIST==\"Oligodendrocyte precursor cells\"] <- \"opc\"\r\n", 219 | "CT_SHORT_LIST <- tolower(substr(CT_SHORT_LIST, 1, 3))\r\n" 220 | ], 221 | "execution_count": 14, 222 | "outputs": [] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": { 227 | "id": "cPvxs1VB9GG1" 228 | }, 229 | "source": [ 230 | "We can take a look at what brain regions, cell types, and cell types abbrevations are included in scREAD:" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "metadata": { 236 | "colab": { 237 | "base_uri": "https://localhost:8080/", 238 | "height": 146 239 | }, 240 | "id": "u7apTqrN9TK4", 241 | "outputId": "431e8545-518f-49a6-f08a-ac9e845512c5" 242 | }, 243 | "source": [ 244 | "list(brain_region=REGION_LIST, cell_type=CT_LIST, short_name=CT_SHORT_LIST)" 245 | ], 246 | "execution_count": 15, 247 | "outputs": [ 248 | { 249 | "output_type": "display_data", 250 | "data": { 251 | "text/plain": [ 252 | "$brain_region\n", 253 | " [1] \"Cerebellum\" \"Cerebral cortex\" \n", 254 | " [3] \"Cortex\" \"Cortex and hippocampus\" \n", 255 | " [5] \"Entorhinal Cortex\" \"Hippocampus\" \n", 256 | " [7] \"Prefrontal cortex\" \"Subventricular zone and hippocampus\"\n", 257 | " [9] \"Superior frontal gyrus (BA8)\" \"Superior parietal lobe\" \n", 258 | "\n", 259 | "$cell_type\n", 260 | "[1] \"Astrocytes\" \"Endothelial cells\" \n", 261 | "[3] \"Excitatory neurons\" \"Inhibitory neurons\" \n", 262 | "[5] \"Microglia\" \"Oligodendrocyte precursor cells\"\n", 263 | "[7] \"Oligodendrocytes\" \n", 264 | "\n", 265 | "$short_name\n", 266 | "[1] \"ast\" \"end\" \"exc\" \"inh\" \"mic\" \"opc\" \"oli\"\n" 267 | ], 268 | "text/latex": "\\begin{description}\n\\item[\\$brain\\_region] \\begin{enumerate*}\n\\item 'Cerebellum'\n\\item 'Cerebral cortex'\n\\item 'Cortex'\n\\item 'Cortex and hippocampus'\n\\item 'Entorhinal Cortex'\n\\item 'Hippocampus'\n\\item 'Prefrontal cortex'\n\\item 'Subventricular zone and hippocampus'\n\\item 'Superior frontal gyrus (BA8)'\n\\item 'Superior parietal lobe'\n\\end{enumerate*}\n\n\\item[\\$cell\\_type] \\begin{enumerate*}\n\\item 'Astrocytes'\n\\item 'Endothelial cells'\n\\item 'Excitatory neurons'\n\\item 'Inhibitory neurons'\n\\item 'Microglia'\n\\item 'Oligodendrocyte precursor cells'\n\\item 'Oligodendrocytes'\n\\end{enumerate*}\n\n\\item[\\$short\\_name] \\begin{enumerate*}\n\\item 'ast'\n\\item 'end'\n\\item 'exc'\n\\item 'inh'\n\\item 'mic'\n\\item 'opc'\n\\item 'oli'\n\\end{enumerate*}\n\n\\end{description}\n", 269 | "text/markdown": "$brain_region\n: 1. 'Cerebellum'\n2. 'Cerebral cortex'\n3. 'Cortex'\n4. 'Cortex and hippocampus'\n5. 'Entorhinal Cortex'\n6. 'Hippocampus'\n7. 'Prefrontal cortex'\n8. 'Subventricular zone and hippocampus'\n9. 'Superior frontal gyrus (BA8)'\n10. 'Superior parietal lobe'\n\n\n\n$cell_type\n: 1. 'Astrocytes'\n2. 'Endothelial cells'\n3. 'Excitatory neurons'\n4. 'Inhibitory neurons'\n5. 'Microglia'\n6. 'Oligodendrocyte precursor cells'\n7. 'Oligodendrocytes'\n\n\n\n$short_name\n: 1. 'ast'\n2. 'end'\n3. 'exc'\n4. 'inh'\n5. 'mic'\n6. 'opc'\n7. 'oli'\n\n\n\n\n\n", 270 | "text/html": [ 271 | "
\n", 272 | "\t
$brain_region
\n", 273 | "\t\t
\n", 278 | "
  1. 'Cerebellum'
  2. 'Cerebral cortex'
  3. 'Cortex'
  4. 'Cortex and hippocampus'
  5. 'Entorhinal Cortex'
  6. 'Hippocampus'
  7. 'Prefrontal cortex'
  8. 'Subventricular zone and hippocampus'
  9. 'Superior frontal gyrus (BA8)'
  10. 'Superior parietal lobe'
\n", 279 | "
\n", 280 | "\t
$cell_type
\n", 281 | "\t\t
\n", 286 | "
  1. 'Astrocytes'
  2. 'Endothelial cells'
  3. 'Excitatory neurons'
  4. 'Inhibitory neurons'
  5. 'Microglia'
  6. 'Oligodendrocyte precursor cells'
  7. 'Oligodendrocytes'
\n", 287 | "
\n", 288 | "\t
$short_name
\n", 289 | "\t\t
\n", 294 | "
  1. 'ast'
  2. 'end'
  3. 'exc'
  4. 'inh'
  5. 'mic'
  6. 'opc'
  7. 'oli'
\n", 295 | "
\n", 296 | "
\n" 297 | ] 298 | }, 299 | "metadata": { 300 | "tags": [] 301 | } 302 | } 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": { 308 | "id": "r_swFi5J9p1t" 309 | }, 310 | "source": [ 311 | "Below are the necessary settings in order to calculate the overlapping genes. **Feel free to change these parameters**!" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "metadata": { 317 | "id": "Qn9pXpOQ7sp3" 318 | }, 319 | "source": [ 320 | "# We use top 100 DE genes in each AD vs control comparison\r\n", 321 | "TOP <- 100\r\n", 322 | "\r\n", 323 | "# Species should be either 'Human' or 'Mouse'\r\n", 324 | "this_species <- 'Human'\r\n", 325 | "\r\n", 326 | "# Specify our brain region of interest, here we selected the 5th brain region in REGION_LIST variable, i.e, Entorhinal Cortex'\r\n", 327 | "this_region <- REGION_LIST[5]\r\n", 328 | "\r\n", 329 | "# DE direction should either 'up' or 'down', 'up' means we select DE genes that are expessed higher in the disease dataset (the first group)\r\n", 330 | "this_direction <- 'up'\r\n", 331 | "\r\n", 332 | "# The OVERLAP_THRES should be manually defined based on your interest and total number of comparisons in scREAD. \r\n", 333 | "# For example, scREAD have 4 total AD vs control datasets comparisons, we set the threshold to 3, meaning that we want to find overlapping genes that are at least appeared in 3 comparisons\r\n", 334 | "OVERLAP_THRES <- 3\r\n", 335 | "\r\n", 336 | "# Now, we can calculate the overlapping genes based on the parameters above, the results are stored in a list variable:\r\n", 337 | "result <- calc_overlap_list()\r\n" 338 | ], 339 | "execution_count": 16, 340 | "outputs": [] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": { 345 | "id": "LC2425wECC06" 346 | }, 347 | "source": [ 348 | "## 4. Interpret the result table\r\n", 349 | "\r\n", 350 | "### 1. Overlapping DEGs among cell types" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": { 356 | "id": "jEClGE0iCxec" 357 | }, 358 | "source": [ 359 | "If you are using the default settings, the second row in the table below can be interpreted as:\r\n", 360 | "\r\n", 361 | "'For all AD vs control datasets comparisons in Human Entorhinal Cortex Astrocytes (ast), the *GFAP* gene ranked top 100 by log-foldchange values in at least 3 comparisons'.\r\n" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "metadata": { 367 | "colab": { 368 | "base_uri": "https://localhost:8080/", 369 | "height": 298 370 | }, 371 | "id": "uxfH6jTZ8NTX", 372 | "outputId": "695d5d4f-4308-4a8c-fae8-dfc8cb01a8f0" 373 | }, 374 | "source": [ 375 | "kable(result$list)" 376 | ], 377 | "execution_count": 17, 378 | "outputs": [ 379 | { 380 | "output_type": "display_data", 381 | "data": { 382 | "text/plain": [ 383 | "\n", 384 | "\n", 385 | "|ct |gene |\n", 386 | "|:-----------|:---------|\n", 387 | "|ast,mic,oli |NEAT1 |\n", 388 | "|ast |ITGB4 |\n", 389 | "|ast |HSPA1B |\n", 390 | "|ast |GFAP |\n", 391 | "|end |IFITM3 |\n", 392 | "|end |IFITM2 |\n", 393 | "|exc |PLP1 |\n", 394 | "|exc |FTH1 |\n", 395 | "|mic |SPP1 |\n", 396 | "|mic |DPYD |\n", 397 | "|mic |ACSL1 |\n", 398 | "|oli |LINC00486 |" 399 | ] 400 | }, 401 | "metadata": { 402 | "tags": [] 403 | } 404 | } 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": { 410 | "id": "eq2HECA1COmb" 411 | }, 412 | "source": [ 413 | "### 2. The details of the overlapping gene from all cell types" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": { 419 | "id": "PXhbVDSAERwS" 420 | }, 421 | "source": [ 422 | "The table below shows the details of the overlapping genes. " 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "metadata": { 428 | "colab": { 429 | "base_uri": "https://localhost:8080/", 430 | "height": 862 431 | }, 432 | "id": "J8TFoML073Ie", 433 | "outputId": "1a4e4e59-11ff-4fb0-e7f8-28e7e0eb253a" 434 | }, 435 | "source": [ 436 | "kable(result$rank[,c(1:6,9)])" 437 | ], 438 | "execution_count": 18, 439 | "outputs": [ 440 | { 441 | "output_type": "display_data", 442 | "data": { 443 | "text/plain": [ 444 | "\n", 445 | "\n", 446 | "|ct |gene | avg_logFC|disease_id |control_id |rank | mean_rank|\n", 447 | "|:---|:---------|---------:|:----------|:----------|:----|---------:|\n", 448 | "|mic |ACSL1 | 1.295940|AD00203 |AD00201 |19 | 27.000000|\n", 449 | "|mic |ACSL1 | 0.557061|AD00205 |AD00201 |52 | 27.000000|\n", 450 | "|mic |ACSL1 | 1.650250|AD00206 |AD00201 |10 | 27.000000|\n", 451 | "|mic |DPYD | 1.822040|AD00203 |AD00201 |4 | 9.500000|\n", 452 | "|mic |DPYD | 1.536450|AD00204 |AD00202 |3 | 9.500000|\n", 453 | "|mic |DPYD | 0.910131|AD00205 |AD00201 |24 | 9.500000|\n", 454 | "|mic |DPYD | 1.188850|AD00206 |AD00201 |7 | 9.500000|\n", 455 | "|exc |FTH1 | 1.749390|AD00203 |AD00201 |16 | 12.000000|\n", 456 | "|exc |FTH1 | 0.673986|AD00205 |AD00201 |15 | 12.000000|\n", 457 | "|exc |FTH1 | 1.201300|AD00206 |AD00201 |5 | 12.000000|\n", 458 | "|ast |GFAP | 1.217740|AD00203 |AD00201 |8 | 13.333333|\n", 459 | "|ast |GFAP | 1.080320|AD00204 |AD00202 |4 | 13.333333|\n", 460 | "|ast |GFAP | 0.553044|AD00206 |AD00201 |28 | 13.333333|\n", 461 | "|ast |HSPA1B | 1.369810|AD00203 |AD00201 |22 | 39.000000|\n", 462 | "|ast |HSPA1B | 0.745331|AD00205 |AD00201 |30 | 39.000000|\n", 463 | "|ast |HSPA1B | 0.987627|AD00206 |AD00201 |65 | 39.000000|\n", 464 | "|end |IFITM2 | 1.377880|AD00203 |AD00201 |23 | 60.000000|\n", 465 | "|end |IFITM2 | 0.942100|AD00205 |AD00201 |95 | 60.000000|\n", 466 | "|end |IFITM2 | 1.093290|AD00206 |AD00201 |62 | 60.000000|\n", 467 | "|end |IFITM3 | 1.477250|AD00203 |AD00201 |12 | 21.333333|\n", 468 | "|end |IFITM3 | 1.261220|AD00205 |AD00201 |9 | 21.333333|\n", 469 | "|end |IFITM3 | 0.888270|AD00206 |AD00201 |43 | 21.333333|\n", 470 | "|ast |ITGB4 | 0.866848|AD00203 |AD00201 |66 | 47.333333|\n", 471 | "|ast |ITGB4 | 0.807770|AD00204 |AD00202 |36 | 47.333333|\n", 472 | "|ast |ITGB4 | 0.532055|AD00205 |AD00201 |40 | 47.333333|\n", 473 | "|oli |LINC00486 | 1.285020|AD00203 |AD00201 |6 | 10.000000|\n", 474 | "|oli |LINC00486 | 0.543539|AD00204 |AD00202 |7 | 10.000000|\n", 475 | "|oli |LINC00486 | 0.639483|AD00205 |AD00201 |17 | 10.000000|\n", 476 | "|oli |NEAT1 | 0.798519|AD00203 |AD00201 |24 | 13.600000|\n", 477 | "|oli |NEAT1 | 1.949210|AD00204 |AD00202 |1 | 13.600000|\n", 478 | "|oli |NEAT1 | 0.521790|AD00205 |AD00201 |47 | 13.600000|\n", 479 | "|mic |NEAT1 | 0.859384|AD00203 |AD00201 |17 | 13.600000|\n", 480 | "|mic |NEAT1 | 1.041330|AD00204 |AD00202 |4 | 13.600000|\n", 481 | "|mic |NEAT1 | 0.507087|AD00205 |AD00201 |21 | 13.600000|\n", 482 | "|ast |NEAT1 | 1.275860|AD00203 |AD00201 |3 | 13.600000|\n", 483 | "|ast |NEAT1 | 1.653700|AD00204 |AD00202 |1 | 13.600000|\n", 484 | "|ast |NEAT1 | 0.580371|AD00205 |AD00201 |5 | 13.600000|\n", 485 | "|ast |NEAT1 | 0.759543|AD00206 |AD00201 |13 | 13.600000|\n", 486 | "|exc |PLP1 | 2.695190|AD00203 |AD00201 |4 | 9.333333|\n", 487 | "|exc |PLP1 | 0.825406|AD00205 |AD00201 |5 | 9.333333|\n", 488 | "|exc |PLP1 | 0.990474|AD00206 |AD00201 |19 | 9.333333|\n", 489 | "|mic |SPP1 | 1.285540|AD00203 |AD00201 |10 | 23.666667|\n", 490 | "|mic |SPP1 | 0.578141|AD00204 |AD00202 |8 | 23.666667|\n", 491 | "|mic |SPP1 | 0.805587|AD00206 |AD00201 |53 | 23.666667|" 492 | ] 493 | }, 494 | "metadata": { 495 | "tags": [] 496 | } 497 | } 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": { 503 | "id": "NkYnVvgSFnOy" 504 | }, 505 | "source": [ 506 | "Take *GFAP* gene as an example, we know *GFAP* is an overlapping gene in Human Entorhinal Cortex Astrocytes. From the section of the table below, we know the *GFAP* ranked top 100 in 3 comparisons of 4 total comparisons, the mean rank of the gene is 13, and the average log-foldchanges of *GFAP* in each comparisons are also listed. " 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "metadata": { 512 | "id": "-HmI8O-0Fl28" 513 | }, 514 | "source": [ 515 | "#|ct |gene | avg_logFC|disease_id |control_id |rank | overlapping_comparison| total_comparison| mean_rank|\r\n", 516 | "#|:---|:---------|---------:|:----------|:----------|:----|----------------------:|----------------:|---------:|\r\n", 517 | "#|ast |GFAP | 1.217740|AD00203 |AD00201 |8 | 3| 4| 13.333333|\r\n", 518 | "#|ast |GFAP | 1.080320|AD00204 |AD00202 |4 | 3| 4| 13.333333|\r\n", 519 | "#|ast |GFAP | 0.553044|AD00206 |AD00201 |28 | 3| 4| 13.333333|" 520 | ], 521 | "execution_count": 19, 522 | "outputs": [] 523 | } 524 | ] 525 | } -------------------------------------------------------------------------------- /overlapping_genes/scread_db.rdata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OSU-BMBL/scread-protocol/3992a2c460de13a1ac43f3329ddd6e851fe50da3/overlapping_genes/scread_db.rdata -------------------------------------------------------------------------------- /workflow/README.md: -------------------------------------------------------------------------------- 1 | ## How to run ssREAD single cell RNA-seq backend workflow locally 2 | 3 | The workflow in R can be found in https://github.com/OSU-BMBL/scread-protocol/tree/master/workflow, the folder contains the following files: 4 | 1. custom_marker.csv. A manually created marker gene list file used for identified cell types. 5 | 2. functions.R. Visualization functions used in R. 6 | 3. build_control_atlas.R: build control cells atlas Seurat object from count matrix file. 7 | 4. transfer_cell_type.R: filter out control-like cells in disease dataset 8 | 5. run_analysis.R: run analysis workflow, and export tables in scREAD database format. 9 | 10 | ### Build control atlas 11 | 1. Goal: Build the control atlas file from raw gene expression matrix. 12 | 2. Prepare your control gene expression data in csv format In the data frame, the first column should be gene symbols, and other columns as cell labels. Put all code and data in a working directory. (e.g PATH_TO_WD), in this tutorial, we will run example_control.csv. 13 | 3. build_control_atlas.R takes three parameters: 14 | - Working directory path. 15 | - Control data path. 16 | - Output data ID 17 | ```{R} 18 | cd PATH_TO_WD 19 | Rscript build_control_atlas.R PATH_TO_WD example_control.csv control_example 20 | ``` 21 | 22 | 4. The output should contain four files: 23 | 24 | - control_example.rds. The Seurat R object storing example control data. 25 | - control_example_expr.txt. Filtered gene expression matrix. 26 | - control_example_cell_label.txt. The first column is the cell names, the second column is the cell type information. 27 | - control_example_umap.png. UMAP plot of example control data colored by cell types. 28 | 29 | ## Transfer cell types based on control atlas 30 | 1. Goal: Annotate cell type using control atlas as the reference, onto the disease gene expression matrix file. 31 | 2. Put all code and data in a working directory. (e.g PATH_TO_WD), after you have generated the control atlas file (control_example.rds). 32 | 3. build_control_atlas.R takes four parameters: 1. Working directory path; 2. Control atlas Seurat object file name; 3. Disease gene expression matrix name; 4. Output disease data ID. 33 | 34 | ```{r} 35 | cd PATH_TO_WD 36 | Rscript transfer_cell_type.R PATH_TO_WD control_example.rds example_disease.csv disease_example 37 | ``` 38 | 4. The output should contain four files: 39 | - disease_example.rds. The Seurat R object storing example disease data. 40 | - disease_example_expr.txt. Filtered gene expression matrix. 41 | - disease_example_cell_label.txt. The first column is the cell names, the second column is the cell type information. 42 | - disease_example_umap.png. UMAP plot for both control and disease disease data colored by cell types. 43 | 44 | ## Run data analysis 45 | 46 | 1. Goal: Perform analysis between disease and control data 47 | 2. Put all code and data in a working directory. (e.g PATH_TO_WD), after you have generated the control atlas file (control_example.rds), and the disease file (disease_example.rds) 48 | 3. run_analysis.R takes three parameters: 1. Working directory path; 2. Control Seurat object file name. 3. Disease Seurat object file name. 49 | 50 | ```{r} 51 | cd PATH_TO_WD 52 | Rscript run_analysis.R PATH_TO_WD control_example disease_example 53 | ``` 54 | 55 | 4. The output should be stored in three folders: 56 | - /de. Differential gene expression analysis results. 1. Cell-type-specific genes; 2. Sub-cluster specific genes; 3. Cell type DE genes between two conditions. 57 | - /dimension. UMAP coordinates for two datasets. 58 | - /subcluster_dimension. UMAP coordinates for each sub-clusters in two datasets. 59 | 60 | 61 | More information about the pipeline can be found in [What is the scREAD overall pipeline?](https://bmbls.bmi.osumc.edu/scread/help/faq#what-is-the-scread-overall-pipeline%3F) 62 | -------------------------------------------------------------------------------- /workflow/Snakefile: -------------------------------------------------------------------------------- 1 | rule all: 2 | input: 3 | "./de/control_example_de_cts.csv" 4 | 5 | rule control_atlas: 6 | input: 7 | "example_control.csv" 8 | params: 9 | wd = ".", name = "control_example" 10 | output: 11 | "control_example.rds" 12 | shell: 13 | "Rscript build_control_atlas.R {params.wd} {input} {params.name}" 14 | 15 | rule transfer_atlas: 16 | input: 17 | "control_example.rds", "example_disease.csv" 18 | params: 19 | wd = ".", name = "disease_example" 20 | output: 21 | "disease_example.rds" 22 | shell: 23 | "Rscript transfer_cell_type.R {params.wd} {input} {params.name}" 24 | 25 | rule data_analysis: 26 | input: 27 | "disease_example.rds" 28 | params: 29 | wd = ".", name1 = "control_example", name2 = "disease_example" 30 | output: 31 | "./de/control_example_de_cts.csv" 32 | shell: 33 | "Rscript run_analysis.R {params.wd} {params.name1} {params.name2}" 34 | -------------------------------------------------------------------------------- /workflow/build_control_atlas.R: -------------------------------------------------------------------------------- 1 | # Goal 2 | # This document aims to build control (healthy) cells atlas Seurat object from count matrix file, 3 | options(check.names=FALSE) 4 | options(future.globals.maxSize = 8000 * 1024^2) 5 | #suppressPackageStartupMessages(library(fst)) 6 | suppressPackageStartupMessages(library(Seurat)) 7 | suppressPackageStartupMessages(library(RColorBrewer)) 8 | suppressPackageStartupMessages(library(Polychrome)) 9 | suppressPackageStartupMessages(library(ggplot2)) 10 | suppressPackageStartupMessages(library(tidyverse)) 11 | suppressPackageStartupMessages(library(future)) 12 | suppressPackageStartupMessages(library(SCINA)) 13 | suppressPackageStartupMessages(library(preprocessCore)) 14 | suppressPackageStartupMessages(library(cowplot)) 15 | 16 | 17 | ## Do not use it, not working in OSC clusters 18 | ## Set multi-thread for Seurat 19 | #plan("multiprocess", workers = 16) 20 | #plan() 21 | 22 | 23 | args <- commandArgs(TRUE) 24 | wd <- args[1] # working directory 25 | expr_file <- args[2] # raw user filename 26 | data_id <- args[3] # unique data id 27 | 28 | load_test_data <- function(){ 29 | # This function is used for testing, set wd to your working directory 30 | rm(list = ls(all = TRUE)) 31 | wd <- 'C:/Users/flyku/Documents/GitHub/scread-protocol/workflow' 32 | expr_file = "example_control.csv" 33 | data_id <- 'control_example' 34 | } 35 | 36 | setwd(wd) 37 | source("functions.R") 38 | signatures <- preprocess.signatures('custom_marker.csv') 39 | cell_type_name <- c('Astrocytes', 'Endothelial cells','Excitatory neurons','Inhibitory neurons','Microglia','Oligodendrocytes','Oligodendrocyte precursor cells','Pericytes') 40 | names(signatures) <- cell_type_name 41 | 42 | expr_matrix <- read.csv(expr_file) 43 | rownames(expr_matrix) <- NULL 44 | expr_matrix <- column_to_rownames(expr_matrix, var = "X") 45 | 46 | health.obj <- CreateSeuratObject(counts = expr_matrix, project = "healthy", min.cells = 3, min.features = 200) 47 | health.obj <- FindVariableFeatures(health.obj, selection.method = "vst", nfeatures = 2000) 48 | health.all.genes <- rownames(health.obj) 49 | health.obj <- NormalizeData(health.obj) 50 | health.obj <- ScaleData(health.obj, features = health.all.genes) 51 | health.obj <- RunPCA(health.obj, features = VariableFeatures(object = health.obj)) 52 | health.obj <- FindNeighbors(health.obj, dims = 1:25) 53 | health.obj <- Seurat::RunUMAP(health.obj, dims = 1:25) 54 | if(ncol(health.obj) < 2000){ 55 | health.obj <- Seurat::FindClusters(health.obj, resolution = 1.5) 56 | } else if (ncol(health.obj) < 6000){ 57 | health.obj <- Seurat::FindClusters(health.obj, resolution = 1.2) 58 | } else { 59 | health.obj <- Seurat::FindClusters(health.obj, resolution = 0.8) 60 | } 61 | 62 | # The clustering is used for comparison with cell type predictions results 63 | Idents(health.obj) <- health.obj$seurat_clusters 64 | health_markers <- FindAllMarkers(health.obj, return.thresh = 0.05, only.pos = T) 65 | 66 | cell_type_result <- data.frame() 67 | for (i in 1:length(levels(Idents(health.obj)))) { 68 | this_cell_type <- data.frame() 69 | for (j in 1:8) { 70 | this_marker <- health_markers[health_markers$cluster == (i-1) & health_markers$p_val_adj < 0.05,7] 71 | this_overlap <- length(which(tolower(unlist(signatures[j])) %in% tolower(this_marker))) 72 | 73 | tmp_cell_type <- data.frame(ct = names(signatures[j]),overlap = this_overlap) 74 | this_cell_type <- rbind(this_cell_type, tmp_cell_type) 75 | } 76 | cell_type_result <- rbind(cell_type_result,this_cell_type[which.max(this_cell_type$overlap),]) 77 | } 78 | 79 | marker_cell_type <- health.obj$seurat_clusters 80 | levels(marker_cell_type) <- cell_type_result$ct 81 | health.obj <- AddMetaData(health.obj, marker_cell_type, col.name = 'marker_cell_type') 82 | 83 | Idents(health.obj) <- health.obj$marker_cell_type 84 | p1 <- Plot.cluster2D(health.obj,reduction.method = "umap",pt_size = 0.5, txt = "marker_cell_type") 85 | 86 | 87 | exp <- as.matrix(GetAssayData(object = health.obj[['RNA']],slot="data")) 88 | exp <- log1p(exp) 89 | exp[] <- normalize.quantiles(exp) 90 | rownames(exp) <- toupper(rownames(exp)) 91 | 92 | keep_scina_cell_type <- as.numeric(which(!sapply(signatures, function(x){ 93 | all(!x %in% rownames(exp)) 94 | }))) 95 | signatures <- signatures[keep_scina_cell_type] 96 | 97 | results <- SCINA(exp, signatures, max_iter = 1000, convergence_n = 20, 98 | convergence_rate = 0.99, sensitivity_cutoff = 0.9, rm_overlap=F, allow_unknown=T, log_file='SCINA.log') 99 | health.obj <- AddMetaData(health.obj, as.factor(results$cell_labels), col.name = 'scina_cell_type') 100 | 101 | Idents(health.obj) <- health.obj$seurat_clusters 102 | annotate_cell_type <- as.factor(health.obj$seurat_clusters) 103 | annotate_cell_type_name <- vector() 104 | # iterate i as all seurat cell clusters index 105 | for (i in 1:length(levels(Idents(health.obj)))) { 106 | this_cluster_cells <- WhichCells(health.obj, idents = levels(Idents(health.obj))[i]) 107 | this_scina_obj <- health.obj[,this_cluster_cells] 108 | Idents(this_scina_obj) <- this_scina_obj$scina_cell_type 109 | # select the largest overlapped cell type name as annotation of this seurat cluster, the unknown is ignored 110 | if('unknown' %in% levels(Idents(this_scina_obj))) { 111 | scina_obj_ident_length <- length(levels(Idents(this_scina_obj))) - 1 112 | max_ct <- names(which.max(table(Idents(this_scina_obj))[1:scina_obj_ident_length])) 113 | } else { 114 | scina_obj_ident_length <- length(levels(Idents(this_scina_obj))) 115 | max_ct <- names(which.max(table(Idents(this_scina_obj))[1:scina_obj_ident_length])) 116 | } 117 | annotate_cell_type_name <- c(annotate_cell_type_name,max_ct) 118 | } 119 | 120 | 121 | Idents(health.obj) <- health.obj$seurat_clusters 122 | p2 <- Plot.cluster2D(health.obj,reduction.method = "umap",pt_size = 0.5, txt = "Predict cluster") 123 | Idents(health.obj) <- health.obj$scina_cell_type 124 | p3 <- Plot.cluster2D(health.obj,reduction.method = "umap",pt_size = 0.5, txt = "scina_cell_type") 125 | 126 | levels(annotate_cell_type) <- annotate_cell_type_name 127 | health.obj <- AddMetaData(health.obj, as.factor(annotate_cell_type), col.name = 'predicted.id') 128 | Idents(health.obj) <- health.obj$predicted.id 129 | p4 <- Plot.cluster2D(health.obj,reduction.method = "umap",pt_size = 0.5, txt = "predicted.id") 130 | 131 | png(paste(data_id,"_umap.png",sep = ""),width=3000, height=1500,res = 300) 132 | plot_grid(p4) 133 | dev.off() 134 | 135 | # Save Seurat object 136 | saveRDS(health.obj, file = paste0(data_id,'.rds')) 137 | 138 | # Save raw counts rather than normalized values 139 | exp_data <- GetAssayData(object = health.obj,slot = "counts") 140 | 141 | write.table(data.frame("Gene"=rownames(exp_data),exp_data,check.names = F),paste(data_id,"_expr.txt",sep = ""), row.names = F,sep="\t",quote=FALSE) 142 | 143 | # Save cell type labels 144 | cell_info <- health.obj$predicted.id 145 | cell_label <- cbind(colnames(health.obj),as.character(cell_info)) 146 | colnames(cell_label) <- c("cell_name","label") 147 | cell_label <- cell_label[order(cell_label[,1]),] 148 | write.table(cell_label,paste(data_id,"_cell_label.txt",sep = ""),quote = F,row.names = F,sep = "\t") 149 | 150 | # Session Infomation 151 | sessionInfo() 152 | 153 | -------------------------------------------------------------------------------- /workflow/custom_marker.csv: -------------------------------------------------------------------------------- 1 | Astrocytes,Endothelial cells,Excitatory neurons,Inhibitory neurons,Microglia,Oligodendrocytes,Oligodendrocyte precursor cells,Pericytes 2 | GFAP,FLT1,SLC17A6,SLC32A1,IBA-1,OLIG2,VCAN,AMBP 3 | EAAT1,CLDN5,SLC17A7,GAD1,P2RY12,MBP,CSPG4,HIGD1B 4 | AQP4,VTN,NRGN,GAD2,CSF1R,MOBP,PDGFRA,COX4I2 5 | LCN2,ITM2A,CAMK2A,TAC1,CD74,PLP1,SOX10,AOC3 6 | GJA1,VWF,SATB2,PENK,C3,MOG,NEU4,PDE5A 7 | SLC1A2,FAM167B,COL5A1,SST,CST3,CLDN11,PCDH15,PTH1R 8 | FGFR3,BMX,SDK2,NPY,HEXB,MYRF,GPR37L1,P2RY14 9 | NKAIN4,CLEC1B,NEFM,MYBPC1,C1QA,GALC,C1QL1,ABCC9 10 | AGT,CD93,GRIN1,PVALB,CD74,ERMN,CDO1,KCNJ8 11 | PLXNB1,CDH5,GLS,GABBR2,CX3CR1,MAG,EPN2,CD248 12 | SLC1A3,,,,AIF-1,,, 13 | ,,,,TMEM119,,, 14 | -------------------------------------------------------------------------------- /workflow/functions.R: -------------------------------------------------------------------------------- 1 | ######### 2 | # Load useful functions, do not print in the final report 3 | ######### 4 | quiet <- function(x) { 5 | sink(tempfile()) 6 | on.exit(sink()) 7 | invisible(force(x)) 8 | } 9 | # point size function from test datasets 10 | x <- c(0,90,124,317,1000,2368,3005,4816,8298,50000,500000,5000000) 11 | y <- c(1,1,0.89,0.33,0.30,0.25,0.235,0.205,0.18,0.1,0.1,0.1) 12 | get_point_size <- approxfun(x, y) 13 | 14 | Plot.GeneUMAP<-function(object=my.object,gene.name=NULL,pt_size=0.5){ 15 | tmp.gene.expression<- GetAssayData(object) 16 | tmp.dim<-as.data.frame(object@reductions$umap@cell.embeddings) 17 | tmp.MatchIndex<- match(colnames(tmp.gene.expression),rownames(tmp.dim)) 18 | tmp.dim<-tmp.dim[tmp.MatchIndex,] 19 | tmp.gene.name<-paste0("^",gene.name,"$") 20 | tmp.One.gene.value<-tmp.gene.expression[grep(tmp.gene.name,rownames(tmp.gene.expression)),] 21 | tmp.dim.df<-cbind.data.frame(tmp.dim,Gene=tmp.One.gene.value) 22 | g<-ggplot(tmp.dim.df,aes(x=UMAP_1,y=UMAP_2,color=Gene)) 23 | g<-g+geom_point(stroke=pt_size,size=pt_size) + scale_color_gradient(low="grey",high = "red") 24 | g<-g+theme_classic()+labs(color=paste0(gene.name,"\nexpression\nvalue")) + coord_fixed(1) 25 | g 26 | } 27 | 28 | Plot.GeneTSNE<-function(object=my.object,gene.name=NULL,pt_size=0.5){ 29 | tmp.gene.expression<- object@assays$SCT@data 30 | tmp.dim<-as.data.frame(object@reductions$tsne@cell.embeddings) 31 | tmp.MatchIndex<- match(colnames(tmp.gene.expression),rownames(tmp.dim)) 32 | tmp.dim<-tmp.dim[tmp.MatchIndex,] 33 | tmp.gene.name<-paste0("^",gene.name,"$") 34 | tmp.One.gene.value<-tmp.gene.expression[grep(tmp.gene.name,rownames(tmp.gene.expression)),] 35 | tmp.dim.df<-cbind.data.frame(tmp.dim,Gene=tmp.One.gene.value) 36 | g<-ggplot(tmp.dim.df,aes(x=tSNE_1,y=tSNE_2,color=Gene)) 37 | g<-g+geom_point(stroke=pt_size,size=pt_size)+scale_color_gradient(low="grey",high = "red") 38 | g<-g+theme_classic()+labs(color=paste0(gene.name,"\nexpression\nvalue")) + coord_fixed(1) 39 | g 40 | } 41 | 42 | 43 | Plot.cluster2D<-function(object=combined,reduction.method="umap",customized=T,pt_size=1,txt="Cell type",...){ 44 | 45 | my.plot.all.source<-cbind.data.frame(Embeddings(object,reduction = reduction.method), 46 | Cell_type=Idents(object)) 47 | 48 | tmp.celltype <- levels(unique(my.plot.all.source$Cell_type)) 49 | p.cluster <- ggplot(my.plot.all.source, 50 | aes(x=my.plot.all.source[,1],y=my.plot.all.source[,2]))+xlab(colnames(my.plot.all.source)[1])+ylab(colnames(my.plot.all.source)[2]) 51 | p.cluster <- p.cluster+geom_point(stroke=pt_size,size=pt_size,aes(col=my.plot.all.source[,"Cell_type"])) 52 | 53 | p.cluster <- p.cluster + guides(colour = guide_legend(override.aes = list(size=5))) 54 | 55 | if (length(tmp.celltype) >= 5){ 56 | p.cluster <- p.cluster + scale_colour_manual(name =paste(txt,":(Cells)",sep = ""),values = as.character(palette36.colors(36)[-2][1:length(tmp.celltype)]), 57 | breaks=tmp.celltype, 58 | labels=paste0(tmp.celltype,":(",as.character(summary(my.plot.all.source$Cell_type)),")")) 59 | } else if (length(tmp.celltype) < 5){ 60 | p.cluster <- p.cluster + scale_colour_manual(name =paste(txt,":(Cells)",sep = ""),values = brewer.pal(4,"Spectral")[c(2,1,3,4)], 61 | breaks=tmp.celltype, 62 | labels=paste0(tmp.celltype,":(",as.character(summary(my.plot.all.source$Cell_type)),")")) 63 | } else{ 64 | p.cluster <- p.cluster + scale_colour_manual(name =paste(txt,":(Cells)",sep = ""),values = brewer.pal(5,"Spectral")[c(1,5)], 65 | breaks=tmp.celltype, 66 | labels=paste0(tmp.celltype,":(",as.character(summary(my.plot.all.source$Cell_type)),")")) 67 | 68 | } 69 | 70 | 71 | # + labs(col="cell type") 72 | p.cluster <- p.cluster + theme_classic() 73 | p.cluster <- p.cluster + coord_fixed(ratio=1) 74 | p.cluster 75 | } 76 | 77 | 78 | ############################## 79 | # define a fucntion for reading in 10X hd5f data and cell gene matrix by input (TenX) or (CellGene) 80 | read_data<-function(x=NULL,read.method=NULL,sep="\t",...){ 81 | if(!is.null(x)){ 82 | if(!is.null(read.method)){ 83 | if(read.method !="TenX.h5"&&read.method!="CellGene"&&read.method!="TenX.folder"){ 84 | stop("wrong 'read.method' argument, please choose 'TenX.h5','TenX.folder', or 'CellGene'!")} 85 | if(read.method == "TenX.h5"){ 86 | tmp_x<-Read10X_h5(x) 87 | return(tmp_x) 88 | }else if(read.method =="TenX.folder"){ 89 | 90 | all_files <- list.files(getwd()) 91 | barcode_file <- grep("barcodes",all_files) 92 | matrix_file <- grep("matrix",all_files) 93 | gene_file <- grep("genes",all_files) 94 | feature_file <- grep("features",all_files) 95 | 96 | #Check users upload single zipped file, by counting detected filename, if less than 3 we think users uploads zipped file 97 | if((length(barcode_file) + length(matrix_file) + length(gene_file) + length(feature_file)) < 3) { 98 | dir.create("tmp",showWarnings = F) 99 | if (file_ext(x) == "7z") { 100 | try(system(paste("7za x", x, "-aoa -otmp")),silent = T) 101 | } 102 | try(system(paste("unzip -o", x, "-d tmp")),silent = T) 103 | try(system(paste("tar xzvf", x, "--directory tmp")),silent = T) 104 | 105 | # check if the file is gz instead of tar.gz 106 | max_file <- which.max(file.info(list.files("tmp",full.names = T,recursive = T))[,1]) 107 | this_files <- list.files("tmp",full.names = T,recursive = T)[max_file] 108 | if(is.na(this_files) || file_ext(this_files) == "tar" || length(this_files) == 0) { 109 | system("rm -R tmp/*") 110 | this_filename <- gsub(".gz","",basename(x)) 111 | try(system(paste("gunzip -c ", x, " > tmp/",this_filename,sep="")),silent = T) 112 | max_file <- which.max(file.info(list.files("tmp",full.names = T,recursive = T))[,1]) 113 | this_files <- list.files("tmp",full.names = T,recursive = T)[max_file] 114 | this_delim <- reader::get.delim(this_files) 115 | tmp_z <- tryCatch(read.delim(paste0(this_files),header = T,row.names = NULL,check.names = F,sep=this_delim),error = function(e) 0) 116 | upload_type <<- "CellGene" 117 | return(tmp_z) 118 | } 119 | 120 | max_file <- which.max(file.info(list.files("tmp",full.names = T,recursive = T))[,1]) 121 | this_files <- list.files("tmp",full.names = T,recursive = T)[max_file] 122 | 123 | 124 | # incase folder contains 10X files 125 | tmp_x <- tryCatch(Read10X(gsub(basename(this_files),"",this_files)),error = function(e) 0) 126 | 127 | if (typeof(tmp_x) == "S4") { 128 | system("rm -R tmp/*") 129 | return(tmp_x) 130 | } else if(file_ext(this_files) == "h5" || file_ext(this_files) == "hdf5") { 131 | tmp_y <- tryCatch(Read10X_h5(this_files),error = function(e) 0) 132 | upload_type <<- "TenX.h5" 133 | system("rm -R tmp/*") 134 | return(tmp_y) 135 | } else { 136 | this_delim <- reader::get.delim(this_files) 137 | tmp_z <- tryCatch(read.delim(paste0(this_files),header = T,row.names = NULL,check.names = F,sep=this_delim),error = function(e) 0) 138 | upload_type <<- "CellGene" 139 | system("rm -R tmp/*") 140 | return(tmp_z) 141 | } 142 | 143 | } 144 | 145 | tryCatch(file.rename(all_files[barcode_file],paste("barcodes",gsub(".*barcodes","",all_files[barcode_file]),sep = "")),error = function(e) 0) 146 | tryCatch(file.rename(all_files[matrix_file],paste("matrix",gsub(".*matrix","",all_files[matrix_file]),sep = "")),error = function(e) 0) 147 | tryCatch(file.rename(all_files[gene_file],paste("genes",gsub(".*genes","",all_files[gene_file]),sep = "")),error = function(e) 0) 148 | tryCatch(file.rename(all_files[feature_file],paste("features",gsub(".*features","",all_files[features]),sep = "")),error = function(e) 0) 149 | 150 | tmp_x<-tryCatch(Read10X(getwd()),error = function(e) { 151 | all_files <- list.files(getwd()) 152 | barcode_file <- grep("barcodes",all_files) 153 | matrix_file <- grep("matrix",all_files) 154 | gene_file <- grep("genes",all_files) 155 | feature_file <- grep("features",all_files) 156 | try(system(paste("gunzip",(all_files[barcode_file]))),silent = T) 157 | try(system(paste("gunzip",(all_files[matrix_file]))),silent = T) 158 | try(system(paste("gunzip",(all_files[gene_file]))),silent = T) 159 | try(system(paste("gunzip",(all_files[feature_file]))),silent = T) 160 | try(system(paste("unzip",(all_files[barcode_file]))),silent = T) 161 | try(system(paste("unzip",(all_files[matrix_file]))),silent = T) 162 | try(system(paste("unzip",(all_files[gene_file]))),silent = T) 163 | try(system(paste("unzip",(all_files[feature_file]))),silent = T) 164 | }) 165 | tmp_x<-tryCatch(Read10X(getwd()),error = function(e){ 166 | 0 167 | }) 168 | return(tmp_x) 169 | } else if(read.method == "CellGene"){# read in cell * gene matrix, if there is error report, back to 18 line to run again. 170 | tmp_x<-read.delim(x,header = T,row.names = NULL,check.names = F,sep=sep,...) 171 | 172 | return(tmp_x) 173 | } 174 | } 175 | }else {stop("Missing 'x' argument, please input correct data")} 176 | } 177 | 178 | 179 | -------------------------------------------------------------------------------- /workflow/run_analysis.R: -------------------------------------------------------------------------------- 1 | # Goal 2 | # This document aims to run Seurat analysis workflow, and export tables in scREAD database format. 3 | 4 | # Important!! Install MAST first 5 | #BiocManager::install("MAST") 6 | 7 | options(future.globals.maxSize = 8000 * 1024^2) 8 | suppressPackageStartupMessages(library(fst)) 9 | suppressPackageStartupMessages(library(Seurat)) 10 | suppressPackageStartupMessages(library(RColorBrewer)) 11 | suppressPackageStartupMessages(library(Polychrome)) 12 | suppressPackageStartupMessages(library(ggplot2)) 13 | suppressPackageStartupMessages(library(tidyverse)) 14 | suppressPackageStartupMessages(library(harmony)) 15 | suppressPackageStartupMessages(library(cowplot)) 16 | suppressPackageStartupMessages(library(future)) 17 | suppressPackageStartupMessages(library(MAST)) 18 | 19 | ## Do not use it, not working in OSC clusters 20 | ## Set multi-thread for Seurat 21 | #plan("multiprocess", workers = 16) 22 | #plan() 23 | 24 | args <- commandArgs(TRUE) 25 | wd <- args[1] # working directory 26 | a_data_id <- args[2] # raw user filename 27 | b_data_id <- args[3] # raw user filename 28 | 29 | load_test_data <- function(){ 30 | # This function is used for testing, set wd to your working directory 31 | rm(list = ls(all = TRUE)) 32 | wd <- 'C:/Users/flyku/Desktop/script' 33 | #A is usually disease object, B is healthy (control) object 34 | a_data_id <- "disease_example" 35 | b_data_id <- "control_example" 36 | 37 | } 38 | 39 | setwd(wd) 40 | a_expr_file <- paste0(a_data_id,".rds") 41 | b_expr_file <- paste0(b_data_id,".rds") 42 | 43 | ####### Load raw files 44 | a.obj <- readRDS(a_expr_file) 45 | b.obj <- readRDS(b_expr_file) 46 | 47 | # Altough all objects should have already been normalized 48 | a.obj <- NormalizeData(a.obj) 49 | b.obj <- NormalizeData(b.obj) 50 | 51 | a_total_ct <- length(levels(as.factor(a.obj$predicted.id))) 52 | b_total_ct <- length(levels(as.factor(b.obj$predicted.id))) 53 | 54 | dir.create("de",showWarnings = F) 55 | dir.create("dimension",showWarnings = F) 56 | dir.create("subcluster_dimension",showWarnings = F) 57 | ####### Find Cell-type-specific DE genes for data A and B (if result not exist) 58 | 59 | ## Cell type specific DE genes on A dataset 60 | this_out_name <- paste0("de/",a_data_id,"_de_cts.csv") 61 | if(!file.exists(this_out_name) && length(levels(as.factor(a.obj$predicted.id))) > 1) { 62 | Idents(a.obj) <- a.obj$predicted.id 63 | cts_markers <- FindAllMarkers(a.obj, test.use = "MAST") 64 | if(nrow(cts_markers) >1){ 65 | cts_markers["a_data_id"] <- a_data_id 66 | cts_markers["b_data_id"] <- a_data_id 67 | tmp_cluster <- cts_markers$cluster 68 | levels(tmp_cluster)[levels(tmp_cluster)=="Oligodendrocyte precursor cells"] <- "opc" 69 | cts_markers["ct"] <- tolower(substr(tmp_cluster, 1, 3)) 70 | cts_markers["type"] <- "cell_type_specific" 71 | write.csv(cts_markers,this_out_name, row.names = F) 72 | } 73 | } 74 | 75 | 76 | ## Cell type specific DE genes on B dataset 77 | this_out_name <- paste0("de/",b_data_id,"_de_cts.csv") 78 | if(!file.exists(this_out_name) && length(levels(as.factor(b.obj$predicted.id))) > 1) { 79 | Idents(b.obj) <- b.obj$predicted.id 80 | cts_markers <- FindAllMarkers(b.obj, test.use = "MAST") 81 | if(nrow(cts_markers) >1){ 82 | cts_markers["a_data_id"] <- b_data_id 83 | cts_markers["b_data_id"] <- b_data_id 84 | tmp_cluster <- cts_markers$cluster 85 | levels(tmp_cluster)[levels(tmp_cluster)=="Oligodendrocyte precursor cells"] <- "opc" 86 | cts_markers["ct"] <- tolower(substr(tmp_cluster, 1, 3)) 87 | cts_markers["type"] <- "cell_type_specific" 88 | write.csv(cts_markers,this_out_name, row.names = F) 89 | } 90 | } 91 | 92 | 93 | ## subcluster analysis - A data 94 | 95 | for(i in 1:a_total_ct){ 96 | # this_ct is the cell type name 97 | this_ct <- levels(as.factor(a.obj$predicted.id))[i] 98 | if(this_ct == "Oligodendrocyte precursor cells") { 99 | abbr_this_ct <- "opc" 100 | } else { 101 | abbr_this_ct <- tolower(substr(this_ct, 1, 3)) 102 | } 103 | this_out_name <- paste0("de/",a_data_id,"_de_subcluster_",abbr_this_ct,".csv") 104 | this_subcluster_dimension_name <- paste0("subcluster_dimension/",a_data_id,"_dimension_subcluster_",abbr_this_ct,".csv") 105 | 106 | if(!file.exists(this_out_name) | !file.exists(this_subcluster_dimension_name)) { 107 | this_obj <- subset(a.obj, subset = predicted.id == this_ct) 108 | # If cells too few, PCA will fail 109 | 110 | if(ncol(this_obj) < 50) { 111 | dim_to_use <- ncol(this_obj) - 1 112 | pc_to_use <- ncol(this_obj) - 1 113 | } else { 114 | dim_to_use <- 25 115 | pc_to_use <- 50 116 | } 117 | if(ncol(this_obj) > 5) { 118 | this_obj <- NormalizeData(this_obj, normalization.method = "LogNormalize", scale.factor = 10000) 119 | this_obj <- Seurat::FindVariableFeatures(this_obj, selection.method = "vst", nfeatures = 2000) 120 | this_obj <- Seurat::ScaleData(this_obj, features = rownames(this_obj)) 121 | this_obj <- Seurat::RunPCA(this_obj, features = Seurat::VariableFeatures(object = this_obj), npcs = pc_to_use) 122 | this_obj <- Seurat::FindNeighbors(this_obj, dims = 1:dim_to_use) 123 | this_obj <- tryCatch(Seurat::RunUMAP(this_obj, dims = 1:dim_to_use), error=function(e) this_obj) 124 | this_obj <- Seurat::FindClusters(this_obj, resolution = 0.8) 125 | 126 | # Try to increase resolution 127 | if (length(levels(this_obj$seurat_clusters)) <= 1) { 128 | this_obj <- Seurat::FindClusters(this_obj, resolution = 1) 129 | } 130 | 131 | # Only proceed if sub-cluster is found, else create an empty table (will be useful in MYSQL) 132 | if (length(levels(this_obj$seurat_clusters)) <= 1) { 133 | this_markers <- data.frame(p_val=1,avg_logFC=0,pct.1=0,pct.2=0,p_val_adj=1,cluster=0,gene=0) 134 | } else { 135 | this_markers <- Seurat::FindAllMarkers(this_obj, test.use = "MAST") 136 | } 137 | this_markers["a_data_id"] <- a_data_id 138 | this_markers["b_data_id"] <- a_data_id 139 | this_markers["ct"] <- abbr_this_ct 140 | this_markers["type"] <- "subcluster" 141 | write.csv(this_markers,this_out_name, row.names = F,quote = F) 142 | 143 | #DefaultAssay(this_obj) 144 | #FeaturePlot(this_obj, features = "LINC00982") 145 | #DimPlot(this_obj,reduction = "umap") 146 | 147 | this_umap_df <- Embeddings(object = this_obj[['umap']]) 148 | this_pca_df <- Embeddings(object = this_obj[['pca']])[,1:3] 149 | this_sub_cluster_df <- as.matrix(paste0(this_obj$seurat_clusters)) 150 | dimension_df <- as.data.frame(cbind(this_sub_cluster_df,this_umap_df,this_pca_df)) 151 | dimension_df <- rownames_to_column(dimension_df, "cell_name") 152 | dimension_df['cell_type'] <- this_ct 153 | dimension_df['data_id'] <- a_data_id 154 | colnames(dimension_df)[2] <- "subcluster" 155 | write.csv(dimension_df,this_subcluster_dimension_name, row.names = F,quote = F) 156 | } 157 | } 158 | } 159 | 160 | 161 | ## subcluster analysis - B data 162 | for(i in 1:b_total_ct){ 163 | # this_ct is the cell type name 164 | this_ct <- levels(as.factor(b.obj$predicted.id))[i] 165 | if(this_ct == "Oligodendrocyte precursor cells") { 166 | abbr_this_ct <- "opc" 167 | } else { 168 | abbr_this_ct <- tolower(substr(this_ct, 1, 3)) 169 | } 170 | this_out_name <- paste0("de/",b_data_id,"_de_subcluster_",abbr_this_ct,".csv") 171 | this_subcluster_dimension_name <- paste0("subcluster_dimension/",b_data_id,"_dimension_subcluster_",abbr_this_ct,".csv") 172 | 173 | if(!file.exists(this_out_name) || !file.exists(this_subcluster_dimension_name)) { 174 | this_obj <- subset(b.obj, subset = predicted.id == this_ct) 175 | # If cells too few, PCA will fail 176 | if(ncol(this_obj) < 50) { 177 | dim_to_use <- ncol(this_obj) - 1 178 | pc_to_use <- ncol(this_obj) - 1 179 | } else { 180 | dim_to_use <- 25 181 | pc_to_use <- 50 182 | } 183 | 184 | if(ncol(this_obj) > 5) { 185 | 186 | this_obj <- NormalizeData(this_obj, normalization.method = "LogNormalize", scale.factor = 10000) 187 | this_obj <- Seurat::FindVariableFeatures(this_obj, selection.method = "vst", nfeatures = 2000) 188 | this_obj <- Seurat::ScaleData(this_obj, features = rownames(this_obj)) 189 | this_obj <- Seurat::RunPCA(this_obj, features = Seurat::VariableFeatures(object = this_obj), npcs = pc_to_use) 190 | this_obj <- Seurat::FindNeighbors(this_obj, dims = 1:dim_to_use) 191 | this_obj <- tryCatch(Seurat::RunUMAP(this_obj, dims = 1:dim_to_use), error=function(e) this_obj) 192 | this_obj <- Seurat::FindClusters(this_obj, resolution = 0.8) 193 | 194 | # Try to increase resolution 195 | if (length(levels(this_obj$seurat_clusters)) <= 1) { 196 | this_obj <- Seurat::FindClusters(this_obj, resolution = 1) 197 | } 198 | 199 | # Only proceed if sub-cluster is found, else create an empty table (will be useful in MYSQL) 200 | if (length(levels(this_obj$seurat_clusters)) <= 1) { 201 | this_markers <- data.frame(p_val=1,avg_logFC=0,pct.1=0,pct.2=0,p_val_adj=1,cluster=0,gene=0) 202 | } else { 203 | this_markers <- Seurat::FindAllMarkers(this_obj, test.use = "MAST") 204 | } 205 | this_markers["a_data_id"] <- b_data_id 206 | this_markers["b_data_id"] <- b_data_id 207 | this_markers["ct"] <- abbr_this_ct 208 | this_markers["type"] <- "subcluster" 209 | write.csv(this_markers,this_out_name, row.names = F,quote = F) 210 | 211 | #DefaultAssay(this_obj) 212 | #FeaturePlot(this_obj, features = "LINC00982") 213 | #DimPlot(this_obj,reduction = "umap") 214 | 215 | this_umap_df <- Embeddings(object = this_obj[['umap']]) 216 | this_pca_df <- Embeddings(object = this_obj[['pca']])[,1:3] 217 | this_sub_cluster_df <- as.matrix(paste0(this_obj$seurat_clusters)) 218 | dimension_df <- as.data.frame(cbind(this_sub_cluster_df,this_umap_df,this_pca_df)) 219 | dimension_df <- rownames_to_column(dimension_df, "cell_name") 220 | dimension_df['cell_type'] <- this_ct 221 | dimension_df['data_id'] <- b_data_id 222 | colnames(dimension_df)[2] <- "subcluster" 223 | write.csv(dimension_df,this_subcluster_dimension_name, row.names = F,quote = F) 224 | } 225 | 226 | } 227 | } 228 | 229 | 230 | ####### Compare A data cell type with B data cell type 231 | # Compare disease vs. control, 232 | for(i in 1:b_total_ct){ 233 | # this_ct is the cell type name 234 | this_ct <- levels(as.factor(b.obj$predicted.id))[i] 235 | if(this_ct == "Oligodendrocyte precursor cells") { 236 | abbr_this_ct <- "opc" 237 | } else { 238 | abbr_this_ct <- tolower(substr(this_ct, 1, 3)) 239 | } 240 | this_out_name <- paste0("de/",a_data_id,"_vs_",b_data_id,"_de_",abbr_this_ct,".csv") 241 | 242 | # Make sure both datasets have same cell type 243 | if(!file.exists(this_out_name) && this_ct %in% levels(as.factor(a.obj$predicted.id))) { 244 | this_a_obj <- subset(a.obj, subset = predicted.id == this_ct) 245 | this_a_obj[['condition']] <- 1 246 | this_b_obj <- subset(b.obj, subset = predicted.id == this_ct) 247 | this_b_obj[['condition']] <- 2 248 | if(ncol(this_a_obj) > 3 && ncol(this_b_obj) > 3) { 249 | this_combined <- merge(this_a_obj, y = this_b_obj, add.cell.ids = c("a", "b"), project = "combined") 250 | this_combined <- NormalizeData(this_combined) 251 | 252 | Idents(this_combined) <- this_combined$condition 253 | this_markers <- FindMarkers(this_combined, ident.1 = 1, ident.2 = 2, test.use = "MAST", latent.vars = 'condition') 254 | 255 | this_markers["cluster"] <- this_ct 256 | this_markers["gene"] <- rownames(this_markers) 257 | this_markers["a_data_id"] <- a_data_id 258 | this_markers["b_data_id"] <- b_data_id 259 | this_markers["ct"] <- abbr_this_ct 260 | this_markers["type"] <- "a_vs_b" 261 | write.csv(this_markers,this_out_name, row.names = F,quote = F) 262 | } 263 | } 264 | } 265 | 266 | 267 | ####### Export dimension reduction table, A data 268 | this_out_name <- paste0("dimension/",a_data_id,"_dimension_reduction.csv") 269 | if(!file.exists(this_out_name)) { 270 | if(class(a.obj[['umap']])[1] != "DimReduc") { 271 | a.obj <- Seurat::RunUMAP(a.obj, dims = 1:25) 272 | } 273 | umap_df <- Embeddings(object = a.obj[['umap']]) 274 | pca_df <- Embeddings(object = a.obj[['pca']])[,1:3] 275 | cell_type_df <- as.matrix(a.obj$predicted.id) 276 | dimension_df <- as.data.frame(cbind(cell_type_df,umap_df,pca_df)) 277 | dimension_df <- rownames_to_column(dimension_df, "cell_name") 278 | colnames(dimension_df) <- c("cell_name","cell_type","umap_1","umap_2","pca_1","pca2","pca3") 279 | dimension_df['data_id'] <- a_data_id 280 | write.csv(dimension_df,this_out_name, row.names = F,quote = F) 281 | } 282 | 283 | ####### Export dimension reduction table, B data 284 | this_out_name <- paste0("dimension/",b_data_id,"_dimension_reduction.csv") 285 | if(!file.exists(this_out_name)) { 286 | if(class(b.obj[['umap']])[1] != "DimReduc") { 287 | b.obj <- Seurat::RunUMAP(b.obj, dims = 1:25) 288 | } 289 | umap_df <- Embeddings(object = b.obj[['umap']]) 290 | pca_df <- Embeddings(object = b.obj[['pca']])[,1:3] 291 | cell_type_df <- as.matrix(b.obj$predicted.id) 292 | dimension_df <- as.data.frame(cbind(cell_type_df,umap_df,pca_df)) 293 | dimension_df <- rownames_to_column(dimension_df, "cell_name") 294 | colnames(dimension_df) <- c("cell_name","cell_type","umap_1","umap_2","pca_1","pca2","pca3") 295 | dimension_df['data_id'] <- b_data_id 296 | write.csv(dimension_df,this_out_name, row.names = F,quote = F) 297 | } 298 | 299 | 300 | #Session Infomation 301 | #sessionInfo() 302 | 303 | -------------------------------------------------------------------------------- /workflow/transfer_cell_type.R: -------------------------------------------------------------------------------- 1 | # Goal 2 | # This document aims to filter out control-like cells in disease stage dataset 3 | 4 | # Why use Harmony for integration? 5 | # 1. Fast and best performances among 14 tools: A benchmark of batch-effect correction methods for single-cell RNA sequencing data 6 | # 2. Seurat cannot handle 500k+ cells. 7 | 8 | # Why use PCA+PCA for label transfering? 9 | # 1. PCA results are better: https://docs.google.com/spreadsheets/d/1IJBT95FGIXBP05bNOUlFtKlM95aWaVUStCE9EpZ0rgA/edit 10 | # 2. Seurat recommendation: FindTransferAnchors: We recommend using PCA when reference and query datasets are from scRNA-seq 11 | # https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/FindTransferAnchors 12 | # 3. PCA is much faster since it is already calculated. 13 | 14 | 15 | options(future.globals.maxSize = 8000 * 1024^2) 16 | suppressPackageStartupMessages(library(fst)) 17 | suppressPackageStartupMessages(library(Seurat)) 18 | suppressPackageStartupMessages(library(RColorBrewer)) 19 | suppressPackageStartupMessages(library(Polychrome)) 20 | suppressPackageStartupMessages(library(ggplot2)) 21 | suppressPackageStartupMessages(library(tidyverse)) 22 | suppressPackageStartupMessages(library(harmony)) 23 | suppressPackageStartupMessages(library(cowplot)) 24 | suppressPackageStartupMessages(library(future)) 25 | 26 | 27 | ## Do not use it, not working in OSC clusters 28 | ## Set multi-thread for Seurat 29 | #plan("multiprocess", workers = 16) 30 | #plan() 31 | 32 | 33 | args <- commandArgs(TRUE) 34 | wd <- args[1] # working directory 35 | control_filename <- args[2] # rds seurat object 36 | disease_filename <- args[3] # raw filename 37 | disease_data_id <- args[4] # disease data ID 38 | 39 | load_test_data <- function(){ 40 | # This function is used for testing, set wd to your working directory 41 | rm(list = ls(all = TRUE)) 42 | wd <- 'C:/Users/flyku/Documents/GitHub/scread-protocol/workflow' 43 | control_filename <- "control_example.rds" 44 | disease_filename <- "example_disease.csv" 45 | disease_data_id <- "disease_example" 46 | } 47 | 48 | 49 | setwd(wd) 50 | source("functions.R") 51 | 52 | ####### Load raw files 53 | health.obj <- read_rds(control_filename) 54 | disease_matrix <- read.csv(disease_filename) 55 | 56 | rownames(disease_matrix) <- NULL 57 | disease_matrix <- column_to_rownames(disease_matrix, var = "X") 58 | disease.obj <- CreateSeuratObject(disease_matrix, project = "all", min.cells = 5) 59 | 60 | 61 | # Preview control object cell types 62 | #Idents(health.obj) <- health.obj$predicted.id 63 | #Plot.cluster2D(health.obj,reduction.method = "umap",pt_size = 0.1, txt = "Predicted.id") 64 | 65 | ####### Load dataset to Seurat object 66 | all.obj <- merge(health.obj, disease.obj) 67 | #all.obj <- CreateSeuratObject(counts = cbind(health.obj, disease_matrix), project = "all", min.cells = 5, meta.data = this_meta) 68 | all.obj <- NormalizeData(all.obj, verbose = T) 69 | all.obj <- FindVariableFeatures(all.obj, selection.method = "vst", nfeatures = 2000) 70 | all.obj <- ScaleData(all.obj, verbose = FALSE) 71 | all.obj <- RunPCA(all.obj, pc.genes = all.obj@var.genes, npcs = 25, verbose = T) 72 | all.obj@meta.data$group <- c(rep("control", ncol(health.obj)), rep("disease", ncol(disease.obj))) 73 | all.obj <- RunHarmony(all.obj, "group", plot_convergence = F) 74 | all.obj <- RunUMAP(all.obj, reduction = "harmony", dims = 1:25) 75 | all.obj <- FindNeighbors(all.obj, reduction = "harmony", dims = 1:25) 76 | all.obj <- FindClusters(all.obj, resolution = 4) 77 | 78 | ####### Preview before integration 79 | #options(repr.plot.height = 5, repr.plot.width = 12) 80 | #p1 <- DimPlot(object = all.obj, reduction = "pca", pt.size = .1, group.by = "group") 81 | #p2 <- VlnPlot(object = all.obj, features = "PC_1", group.by = "group", pt.size = .1) 82 | #plot_grid(p1,p2) 83 | 84 | ####### Preview after integration 85 | #options(repr.plot.height = 5, repr.plot.width = 12) 86 | #p1 <- DimPlot(object = all.obj, reduction = "harmony", pt.size = .1, group.by = "group") 87 | #p2 <- VlnPlot(object = all.obj, features = "harmony_1", group.by = "group", pt.size = .1) 88 | #plot_grid(p1,p2) 89 | 90 | ####### Identify control atlas, control-like cells, and disease like cells (hypergeometric test) 91 | clusters <- as.data.frame(all.obj$seurat_clusters) 92 | clusters <- rownames_to_column(clusters, "cell") 93 | colnames(clusters) <- c("cell","cluster") 94 | groups <- as.data.frame(all.obj$group) 95 | groups <- rownames_to_column(groups, "TAG") 96 | colnames(groups) <- c("cell","group") 97 | 98 | ### five columns: cell name, seurat cluster,cell type, subcluster, condition 99 | cluster_condition <- merge(clusters,groups, by = "cell", all = FALSE) 100 | #cluster_condition <- merge(cluster_condition,meta_file, by = "cell", all = FALSE) 101 | 102 | control_disease_like_result <- NULL 103 | for (i in levels(all.obj$seurat_clusters)) { 104 | this_cluster <- cluster_condition[which(cluster_condition$cluster == i),] 105 | this_cluster_control <- this_cluster[which(this_cluster$group %in% "control"),] 106 | n_this_cluster_control <- length(this_cluster_control$cell) 107 | cluster_control.percentage <- n_this_cluster_control/length(this_cluster$cell) 108 | 109 | q = n_this_cluster_control - 1 110 | m = ncol(health.obj) 111 | n = ncol(disease.obj) 112 | k = length(this_cluster$cell) 113 | pval <- phyper(q,m,n,k,lower.tail=F) 114 | 115 | this_result <- data.frame(cluster=i, percent_control=cluster_control.percentage, pval=pval) 116 | control_disease_like_result <- rbind(control_disease_like_result,this_result) 117 | 118 | } 119 | 120 | control_disease_like_result$pval <- p.adjust(control_disease_like_result$pval,"BH") 121 | control_disease_like_result$group <- ifelse(control_disease_like_result$pval < 0.0001, "control_cluster", "disease_cluster") 122 | 123 | ### control_disease_like_result: cluster group table 124 | #print(control_disease_like_result) 125 | 126 | ####### Annotate control cluster, disease cluster 127 | tmp_result_group <- all.obj$seurat_clusters 128 | levels(tmp_result_group) <- control_disease_like_result$group 129 | all.obj <- AddMetaData(all.obj,tmp_result_group, col.name = "cluster_group") 130 | 131 | ####### Annotate control cells percentage in each Seurat cluster 132 | result_percent <- all.obj$seurat_clusters 133 | levels(result_percent) <- paste(round(control_disease_like_result$percent_control, digits = 4), formatC(control_disease_like_result$pval, format = "e", digits = 4),sep = "-") 134 | all.obj <- AddMetaData(all.obj,result_percent, col.name = "healhy_cells_percent") 135 | 136 | 137 | ####### Annotate control cluster, disease cluster 138 | combine_group_pathlogy <- paste(tmp_result_group,as.factor(all.obj$group)) 139 | names(combine_group_pathlogy) <- names(tmp_result_group) 140 | combine_group_pathlogy <- as.factor(combine_group_pathlogy) 141 | all.obj <- AddMetaData(all.obj,combine_group_pathlogy, col.name = "combine_group_pathlogy") 142 | 143 | ####### Annotate control-like, disease-like cells 144 | associate_cells <- combine_group_pathlogy 145 | 146 | #In most cases the cells split to these four groups: 147 | #levels(associate_cells) <- c("control cells atlas","control cells atlas","control-like cells","control cells atlas") 148 | 149 | # Sometimes the output don't have four levels, i.e, no cells are control-like, thus we need to iterate every case 150 | new_levels <- vector() 151 | for (i in levels(associate_cells)) { 152 | this_level <- '' 153 | if(i == "control_cluster control") { 154 | this_level <- "control cells atlas" 155 | new_levels <- append(new_levels, this_level) 156 | 157 | } else if (i == "control_cluster disease") { 158 | this_level <- "control-like cells" 159 | new_levels <- append(new_levels, this_level) 160 | 161 | } else if (i == "disease_cluster control") { 162 | this_level <- "control cells atlas" 163 | new_levels <- append(new_levels, this_level) 164 | 165 | } else if (i == "disease_cluster disease") { 166 | this_level <- "disease-like cells" 167 | new_levels <- append(new_levels, this_level) 168 | 169 | } 170 | } 171 | 172 | levels(associate_cells) <- new_levels 173 | 174 | all.obj <- AddMetaData(all.obj,associate_cells, col.name = "associate_cells") 175 | 176 | #Idents(all.obj) <- all.obj$group 177 | #Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "orig.ident") 178 | # 179 | ####### Visualize disease-like cells 180 | #Idents(all.obj) <- all.obj$cell_type 181 | #p1 <- Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "Provided cell type") 182 | # 183 | #Idents(all.obj) <- all.obj$seurat_clusters 184 | #p2 <- Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "Seurat cluster") 185 | # 186 | # 187 | #Idents(all.obj) <- all.obj$healhy_cells_percent 188 | #p3 <- Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "control cells percentage-pvalue") 189 | # 190 | #Idents(all.obj) <- all.obj$cluster_group 191 | #p4 <- Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "Associate group") 192 | # 193 | #plot_grid(p1,p2,p3,p4) 194 | #Idents(all.obj) <- all.obj$combine_group_pathlogy 195 | #Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "Associated group + data source") 196 | # 197 | #Idents(all.obj) <- all.obj$associate_cells 198 | #Plot.cluster2D(all.obj,reduction.method = "umap",pt_size = 0.1, txt = "Associated cells") 199 | 200 | 201 | Idents(all.obj) <- all.obj$associate_cells 202 | dstage.obj <- subset(all.obj,subset = associate_cells == "disease-like cells") 203 | 204 | dstage.obj <- FindVariableFeatures(dstage.obj, selection.method = "vst", nfeatures = 2000) 205 | dstage.obj.gene <- rownames(dstage.obj) 206 | dstage.obj <- ScaleData(dstage.obj, features = dstage.obj.gene) 207 | dstage.obj <- RunPCA(dstage.obj, features = VariableFeatures(object = dstage.obj)) 208 | dstage.obj <- RunUMAP(dstage.obj, reduction = "pca", dims = 1:25) 209 | 210 | 211 | ## FindTransferAnchors: We recommend using PCA when reference and query datasets are from scRNA-seq 212 | transfer.anchors <- FindTransferAnchors(reference = health.obj, query = dstage.obj, features = VariableFeatures(object = health.obj), reduction = "pcaproject",verbose = TRUE) 213 | 214 | if(nrow(transfer.anchors@anchors) > 30) { 215 | celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = health.obj$predicted.id, weight.reduction = dstage.obj[["pca"]],l2.norm = FALSE,dims = 1:25, k.weight = 30) 216 | } else{ 217 | celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = health.obj$predicted.id, weight.reduction = dstage.obj[["pca"]],l2.norm = FALSE,dims = 1:25, k.weight = (nrow(transfer.anchors@anchors)-1)) 218 | } 219 | 220 | dstage.obj <- AddMetaData(dstage.obj, metadata = celltype.predictions) 221 | 222 | Idents(health.obj) <- health.obj$predicted.id 223 | p1 <- Plot.cluster2D(health.obj, reduction.method = "umap",pt_size = 0.4,txt = "Control cell type") 224 | 225 | Idents(dstage.obj) <- dstage.obj$predicted.id 226 | p2 <- Plot.cluster2D(dstage.obj, reduction.method = "umap",pt_size = 0.4,txt = "Disease cell type") 227 | 228 | png(paste(disease_data_id,"_transfer_umap.png",sep = ""),width=4000, height=2000,res=300) 229 | plot_grid(p1,p2) 230 | dev.off() 231 | 232 | # Save Seurat object 233 | Idents(dstage.obj) <- dstage.obj$predicted.id 234 | saveRDS(dstage.obj, paste0(disease_data_id,".rds")) 235 | 236 | # Save raw counts rather than normalized values 237 | exp_data <- GetAssayData(object = dstage.obj,slot = "counts") 238 | 239 | write.table(data.frame("Gene"=rownames(exp_data),exp_data,check.names = F),paste(disease_data_id,"_expr.txt",sep = ""), row.names = F,sep="\t",quote=FALSE) 240 | 241 | # Save cell type labels 242 | cell_info <- dstage.obj$predicted.id 243 | cell_label <- cbind(colnames(dstage.obj),as.character(cell_info)) 244 | colnames(cell_label) <- c("cell_name","label") 245 | cell_label <- cell_label[order(cell_label[,1]),] 246 | write.table(cell_label,paste(disease_data_id,"_cell_label.txt",sep = ""),quote = F,row.names = F,sep = "\t") 247 | 248 | 249 | # Session Infomation 250 | #sessionInfo() 251 | 252 | --------------------------------------------------------------------------------