├── DESCRIPTION
├── NAMESPACE
├── R
    ├── DEsingle.R
    ├── DEtype.R
    └── TestData.R
├── README.md
├── data
    └── TestData.rda
├── inst
    ├── CITATION
    └── NEWS
├── man
    ├── DEsingle.Rd
    ├── DEtype.Rd
    └── TestData.Rd
└── vignettes
    ├── DEsingle.Rmd
    └── DEsingle_LOGO.png


/DESCRIPTION:
--------------------------------------------------------------------------------
Package: DEsingle
Type: Package
Title: DEsingle for detecting three types of differential expression in single-cell RNA-seq data
Version: 1.19.1
Date: 2018-12-01
Author: Zhun Miao
Maintainer: Zhun Miao
Description: DEsingle is an R package for differential expression (DE) analysis of
    single-cell RNA-seq (scRNA-seq) data. It defines and detects 3 types of differentially
    expressed genes between two groups of single cells, with regard to different expression
    status (DEs), differential expression abundance (DEa), and general differential expression
    (DEg). DEsingle employs Zero-Inflated Negative Binomial model to estimate the proportion
    of real and dropout zeros and to define and detect the 3 types of DE genes. Results showed
    that DEsingle outperforms existing methods for scRNA-seq DE analysis, and can reveal
    different types of DE genes that are enriched in different biological functions.
License: GPL-2
Encoding: UTF-8
LazyData: true
Depends: R (>= 3.4.0)
Imports:
    stats,
    Matrix (>= 1.2-14),
    MASS (>= 7.3-45),
    VGAM (>= 1.0-2),
    bbmle (>= 1.0.18),
    gamlss (>= 4.4-0),
    maxLik (>= 1.3-4),
    pscl (>= 1.4.9),
    BiocParallel (>= 1.12.0),
Suggests:
    knitr,
    rmarkdown,
    SingleCellExperiment
VignetteBuilder: knitr
URL: https://miaozhun.github.io/DEsingle/
biocViews: DifferentialExpression, GeneExpression, SingleCell, ImmunoOncology, RNASeq, Transcriptomics, Sequencing, Preprocessing, Software
RoxygenNote: 6.0.1


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
# Generated by roxygen2: do not edit by hand

export(DEsingle)
export(DEtype)
import(stats)
importFrom(BiocParallel,bplapply)
importFrom(BiocParallel,bpparam)
importFrom(MASS,fitdistr)
importFrom(MASS,glm.nb)
importFrom(Matrix,Matrix)
importFrom(VGAM,dzinegbin)
importFrom(bbmle,mle2)
importFrom(gamlss,gamlssML)
importFrom(maxLik,maxLik)
importFrom(pscl,zeroinfl)
importMethodsFrom(Matrix,colSums)


--------------------------------------------------------------------------------
/R/DEsingle.R:
--------------------------------------------------------------------------------
#' DEsingle: Detecting differentially expressed genes from scRNA-seq data
#'
#' This function detects differentially expressed genes between two specified groups of cells in a raw read counts matrix of single-cell RNA-seq (scRNA-seq) data. It takes a non-negative integer matrix of scRNA-seq raw read counts or a \code{SingleCellExperiment} object as input. Users should therefore map the reads (obtained from sequencing libraries of the samples) to the corresponding genome and count the reads mapped to each gene according to the gene annotation in advance to obtain the raw read counts matrix.
#'
#' @param counts A non-negative integer matrix of scRNA-seq raw read counts or a \code{SingleCellExperiment} object which contains the read counts matrix. The rows of the matrix are genes and columns are samples/cells.
#' @param group A factor vector that specifies the two groups to be compared, corresponding to the columns in the counts matrix.
#' @param parallel If FALSE (default), no parallel computation is used; if TRUE, parallel computation is performed using \code{BiocParallel}, with argument \code{BPPARAM}.
#' @param BPPARAM An optional parameter object passed internally to \code{\link{bplapply}} when \code{parallel=TRUE}. If not specified, \code{\link{bpparam}()} (default) will be used.
#' @return
#' A data frame containing the differential expression (DE) analysis results. Rows are genes and columns contain the following items:
#' \itemize{
#'   \item theta_1, theta_2, mu_1, mu_2, size_1, size_2, prob_1, prob_2: MLE of the zero-inflated negative binomial distribution's parameters of group 1 and group 2.
#'   \item total_mean_1, total_mean_2: Mean of read counts of group 1 and group 2.
#'   \item foldChange: total_mean_1/total_mean_2.
#'   \item norm_total_mean_1, norm_total_mean_2: Mean of normalized read counts of group 1 and group 2.
#'   \item norm_foldChange: norm_total_mean_1/norm_total_mean_2.
#'   \item chi2LR1: Chi-square statistic for hypothesis testing of H0.
#'   \item pvalue_LR2: P value of hypothesis testing of H20 (Used to determine the type of a DE gene).
#'   \item pvalue_LR3: P value of hypothesis testing of H30 (Used to determine the type of a DE gene).
#'   \item FDR_LR2: Adjusted P value of pvalue_LR2 using Benjamini & Hochberg's method (Used to determine the type of a DE gene).
#'   \item FDR_LR3: Adjusted P value of pvalue_LR3 using Benjamini & Hochberg's method (Used to determine the type of a DE gene).
#'   \item pvalue: P value of hypothesis testing of H0 (Used to determine whether a gene is a DE gene).
#'   \item pvalue.adj.FDR: Adjusted P value of H0's pvalue using Benjamini & Hochberg's method (Used to determine whether a gene is a DE gene).
#'   \item Remark: Record of abnormal program information.
#' }
#'
#' @author Zhun Miao.
#' @seealso
#' \code{\link{DEtype}}, for the classification of differentially expressed genes found by \code{\link{DEsingle}}.
#'
#' \code{\link{TestData}}, a test dataset for DEsingle.
#'
#' @examples
#' # Load test data for DEsingle
#' data(TestData)
#'
#' # Specifying the two groups to be compared
#' # The sample number in group 1 and group 2 is 50 and 100 respectively
#' group <- factor(c(rep(1,50), rep(2,100)))
#'
#' # Detecting the differentially expressed genes
#' results <- DEsingle(counts = counts, group = group)
#'
#' # Dividing the differentially expressed genes into 3 categories
#' results.classified <- DEtype(results = results, threshold = 0.05)
#'
#' @import stats
#' @importFrom BiocParallel bpparam bplapply
#' @importFrom Matrix Matrix
#' @importFrom MASS glm.nb fitdistr
#' @importFrom VGAM dzinegbin
#' @importFrom bbmle mle2
#' @importFrom gamlss gamlssML
#' @importFrom maxLik maxLik
#' @importFrom pscl zeroinfl
#' @importMethodsFrom Matrix colSums
#' @export



DEsingle <- function(counts, group, parallel = FALSE, BPPARAM = bpparam()){

  # Handle SingleCellExperiment
  if(class(counts)[1] == "SingleCellExperiment"){
    if(!require(SingleCellExperiment))
      stop("To use SingleCellExperiment as input, you should install the package firstly")
    counts <- counts(counts)
  }

  # Invalid input control
  if(!is.matrix(counts) & !is.data.frame(counts) & class(counts)[1] != "dgCMatrix")
    stop("Wrong data type of 'counts'")
  if(sum(is.na(counts)) > 0)
    stop("NA detected in 'counts'");gc();
  if(sum(counts < 0) > 0)
    stop("Negative value detected in 'counts'");gc();
  if(all(counts == 0))
    stop("All elements of 'counts' are zero");gc();
  if(any(colSums(counts) == 0))
    warning("Library size of zero detected in 'counts'");gc();

  if(!is.factor(group))
    stop("Data type of 'group' is not factor")
  if(length(levels(group)) != 2)
    stop("Levels number of 'group' is not two")
  if(table(group)[1] < 2 | table(group)[2] < 2)
    stop("Too few samples (< 2) in a group")
  if(ncol(counts) != length(group))
    stop("Length of 'group' must equal to column number of 'counts'")

  if(!is.logical(parallel))
    stop("Data type of 'parallel' is not logical")
  if(length(parallel) != 1)
    stop("Length of 'parallel' is not one")

  # Preprocessing
  counts <- round(as.matrix(counts))
  storage.mode(counts) <- "integer"
  if(any(rowSums(counts) == 0))
    message("Removing ", sum(rowSums(counts) == 0), " rows of genes with all zero counts")
  counts <- counts[rowSums(counts) != 0,]
  geneNum <- nrow(counts)
  sampleNum <- ncol(counts)
  gc()

  # Normalization
  message("Normalizing the data")
  GEOmean <- rep(NA,geneNum)
  for (i in 1:geneNum)
  {
    gene_NZ <- counts[i,counts[i,] > 0]
    GEOmean[i] <- exp(sum(log(gene_NZ), na.rm=TRUE) / length(gene_NZ))
  }
  S <- rep(NA, sampleNum)
  counts_norm <- counts
  for (j in 1:sampleNum)
  {
    sample_j <- counts[,j]/GEOmean
    S[j] <- median(sample_j[which(sample_j != 0)])
    counts_norm[,j] <- counts[,j]/S[j]
  }
  counts_norm <- ceiling(counts_norm)
  remove(GEOmean, gene_NZ, S, sample_j, i, j)
  gc()

  # Cache totalMean and foldChange for each gene
  totalMean_1 <- rowMeans(counts[row.names(counts_norm), group == levels(group)[1]])
  totalMean_2 <- rowMeans(counts[row.names(counts_norm), group == levels(group)[2]])
  foldChange <- totalMean_1/totalMean_2
  All_Mean_FC <- cbind(totalMean_1, totalMean_2, foldChange)

  # Memory management
  remove(counts, totalMean_1, totalMean_2, foldChange)
  counts_norm <- Matrix(counts_norm, sparse = TRUE)
  gc()


  # Function of testing homogeneity of two ZINB populations
  CallDE <- function(i){

    # Memory management
    if(i %% 100 == 0)
      gc()

    # Function input and output
    counts_1 <- counts_norm[i, group == levels(group)[1]]
    counts_2 <- counts_norm[i, group == levels(group)[2]]
    results_gene <- data.frame(row.names = row.names(counts_norm)[i], theta_1 = NA, theta_2 = NA, mu_1 = NA, mu_2 = NA, size_1 = NA, size_2 = NA, prob_1 = NA, prob_2 = NA, total_mean_1 = NA, total_mean_2 = NA, foldChange = NA, norm_total_mean_1 = NA, norm_total_mean_2 = NA, norm_foldChange = NA, chi2LR1 = NA, pvalue_LR2 = NA, pvalue_LR3 = NA, FDR_LR2 = NA, FDR_LR3 = NA, pvalue = NA, pvalue.adj.FDR = NA, Remark = NA)

    # Log likelihood functions
    logL <- function(counts_1, theta_1, size_1, prob_1, counts_2, theta_2, size_2, prob_2){
      logL_1 <- sum(dzinegbin(counts_1, size = size_1, prob = prob_1, pstr0 = theta_1, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_2, prob = prob_2, pstr0 = theta_2, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL2 <- function(param){
      theta_resL2 <- param[1]
      size_1_resL2 <- param[2]
      prob_1_resL2 <- param[3]
      size_2_resL2 <- param[4]
      prob_2_resL2 <- param[5]
      logL_1 <- sum(dzinegbin(counts_1, size = size_1_resL2, prob = prob_1_resL2, pstr0 = theta_resL2, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_2_resL2, prob = prob_2_resL2, pstr0 = theta_resL2, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL2NZ <- function(param){
      theta_resL2 <- 0
      size_1_resL2 <- param[1]
      prob_1_resL2 <- param[2]
      size_2_resL2 <- param[3]
      prob_2_resL2 <- param[4]
      logL_1 <- sum(dzinegbin(counts_1, size = size_1_resL2, prob = prob_1_resL2, pstr0 = theta_resL2, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_2_resL2, prob = prob_2_resL2, pstr0 = theta_resL2, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3 <- function(param){
      theta_1_resL3 <- param[1]
      size_resL3 <- param[2]
      prob_resL3 <- param[3]
      theta_2_resL3 <- param[4]
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3NZ1 <- function(param){
      theta_1_resL3 <- 0
      size_resL3 <- param[1]
      prob_resL3 <- param[2]
      theta_2_resL3 <- param[3]
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3NZ2 <- function(param){
      theta_1_resL3 <- param[1]
      size_resL3 <- param[2]
      prob_resL3 <- param[3]
      theta_2_resL3 <- 0
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3AZ1 <- function(param){
      theta_1_resL3 <- 1
      size_resL3 <- param[1]
      prob_resL3 <- param[2]
      theta_2_resL3 <- param[3]
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3AZ2 <- function(param){
      theta_1_resL3 <- param[1]
      size_resL3 <- param[2]
      prob_resL3 <- param[3]
      theta_2_resL3 <- 1
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3NZ1AZ2 <- function(param){
      theta_1_resL3 <- 0
      size_resL3 <- param[1]
      prob_resL3 <- param[2]
      theta_2_resL3 <- 1
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    logL3NZ2AZ1 <- function(param){
      theta_1_resL3 <- 1
      size_resL3 <- param[1]
      prob_resL3 <- param[2]
      theta_2_resL3 <- 0
      logL_1 <- sum(dzinegbin(counts_1, size = size_resL3, prob = prob_resL3, pstr0 = theta_1_resL3, log = TRUE))
      logL_2 <- sum(dzinegbin(counts_2, size = size_resL3, prob = prob_resL3, pstr0 = theta_2_resL3, log = TRUE))
      logL <- logL_1 + logL_2
      logL
    }
    judgeParam <- function(param){
      if((param >= 0) & (param <= 1))
        res <- TRUE
      else
        res <- FALSE
      res
    }

    # MLE of parameters of ZINB counts_1
    if(sum(counts_1 == 0) > 0){
      if(sum(counts_1 == 0) == length(counts_1)){
        theta_1 <- 1
        mu_1 <- 0
        size_1 <- 1
        prob_1 <- size_1/(size_1 + mu_1)
      }else{
        options(show.error.messages = FALSE)
        zinb_try <- try(gamlssML(counts_1, family="ZINBI"), silent=TRUE)
        options(show.error.messages = TRUE)
        if('try-error' %in% class(zinb_try)){
          zinb_try_twice <- try(zeroinfl(formula = counts_1 ~ 1 | 1, dist = "negbin"), silent=TRUE)
          if('try-error' %in% class(zinb_try_twice)){
            print("MLE of ZINB failed!");
            results_gene[1,"Remark"] <- "ZINB failed!"
            return(results_gene)
          }else{
            zinb_1 <- zinb_try_twice
            theta_1 <- plogis(zinb_1$coefficients$zero);names(theta_1) <- NULL
            mu_1 <- exp(zinb_1$coefficients$count);names(mu_1) <- NULL
            size_1 <- zinb_1$theta;names(size_1) <- NULL
            prob_1 <- size_1/(size_1 + mu_1);names(prob_1) <- NULL
          }
        }else{
          zinb_1 <- zinb_try
          theta_1 <- zinb_1$nu;names(theta_1) <- NULL
          mu_1 <- zinb_1$mu;names(mu_1) <- NULL
          size_1 <- 1/zinb_1$sigma;names(size_1) <- NULL
          prob_1 <- size_1/(size_1 + mu_1);names(prob_1) <- NULL
        }
      }
    }else{
      op <- options(warn=2)
      nb_try <- try(glm.nb(formula = counts_1 ~ 1), silent=TRUE)
      options(op)
      if('try-error' %in% class(nb_try)){
        nb_try_twice <- try(fitdistr(counts_1, "Negative Binomial"), silent=TRUE)
        if('try-error' %in% class(nb_try_twice)){
          nb_try_again <- try(mle2(counts_1~dnbinom(mu=exp(logmu),size=1/invk), data=data.frame(counts_1), start=list(logmu=0,invk=1), method="L-BFGS-B", lower=c(logmu=-Inf,invk=1e-8)), silent=TRUE)
          if('try-error' %in% class(nb_try_again)){
            nb_try_fourth <- try(glm.nb(formula = counts_1 ~ 1), silent=TRUE)
            if('try-error' %in% class(nb_try_fourth)){
              print("MLE of NB failed!");
              results_gene[1,"Remark"] <- "NB failed!"
              return(results_gene)
            }else{
              nb_1 <- nb_try_fourth
              theta_1 <- 0
              mu_1 <- exp(nb_1$coefficients);names(mu_1) <- NULL
              size_1 <- nb_1$theta;names(size_1) <- NULL
              prob_1 <- size_1/(size_1 + mu_1);names(prob_1) <- NULL
            }
          }else{
            nb_1 <- nb_try_again
            theta_1 <- 0
            mu_1 <- exp(nb_1@coef["logmu"]);names(mu_1) <- NULL
            size_1 <- 1/nb_1@coef["invk"];names(size_1) <- NULL
            prob_1 <- size_1/(size_1 + mu_1);names(prob_1) <- NULL
          }
        }else{
          nb_1 <- nb_try_twice
          theta_1 <- 0
          mu_1 <- nb_1$estimate["mu"];names(mu_1) <- NULL
          size_1 <- nb_1$estimate["size"];names(size_1) <- NULL
          prob_1 <- size_1/(size_1 + mu_1);names(prob_1) <- NULL
        }
      }else{
        nb_1 <- nb_try
        theta_1 <- 0
        mu_1 <- exp(nb_1$coefficients);names(mu_1) <- NULL
        size_1 <- nb_1$theta;names(size_1) <- NULL
        prob_1 <- size_1/(size_1 + mu_1);names(prob_1) <- NULL
      }
    }

    # MLE of parameters of ZINB counts_2
    if(sum(counts_2 == 0) > 0){
      if(sum(counts_2 == 0) == length(counts_2)){
        theta_2 <- 1
        mu_2 <- 0
        size_2 <- 1
        prob_2 <- size_2/(size_2 + mu_2)
      }else{
        options(show.error.messages = FALSE)
        zinb_try <- try(gamlssML(counts_2, family="ZINBI"), silent=TRUE)
        options(show.error.messages = TRUE)
        if('try-error' %in% class(zinb_try)){
          zinb_try_twice <- try(zeroinfl(formula = counts_2 ~ 1 | 1, dist = "negbin"), silent=TRUE)
          if('try-error' %in% class(zinb_try_twice)){
            print("MLE of ZINB failed!");
            results_gene[1,"Remark"] <- "ZINB failed!"
            return(results_gene)
          }else{
            zinb_2 <- zinb_try_twice
            theta_2 <- plogis(zinb_2$coefficients$zero);names(theta_2) <- NULL
            mu_2 <- exp(zinb_2$coefficients$count);names(mu_2) <- NULL
            size_2 <- zinb_2$theta;names(size_2) <- NULL
            prob_2 <- size_2/(size_2 + mu_2);names(prob_2) <- NULL
          }
        }else{
          zinb_2 <- zinb_try
          theta_2 <- zinb_2$nu;names(theta_2) <- NULL
          mu_2 <- zinb_2$mu;names(mu_2) <- NULL
          size_2 <- 1/zinb_2$sigma;names(size_2) <- NULL
          prob_2 <- size_2/(size_2 + mu_2);names(prob_2) <- NULL
        }
      }
    }else{
      op <- options(warn=2)
      nb_try <- try(glm.nb(formula = counts_2 ~ 1), silent=TRUE)
      options(op)
      if('try-error' %in% class(nb_try)){
        nb_try_twice <- try(fitdistr(counts_2, "Negative Binomial"), silent=TRUE)
        if('try-error' %in% class(nb_try_twice)){
          nb_try_again <- try(mle2(counts_2~dnbinom(mu=exp(logmu),size=1/invk), data=data.frame(counts_2), start=list(logmu=0,invk=1), method="L-BFGS-B", lower=c(logmu=-Inf,invk=1e-8)), silent=TRUE)
          if('try-error' %in% class(nb_try_again)){
            nb_try_fourth <- try(glm.nb(formula = counts_2 ~ 1), silent=TRUE)
            if('try-error' %in% class(nb_try_fourth)){
              print("MLE of NB failed!");
              results_gene[1,"Remark"] <- "NB failed!"
              return(results_gene)
            }else{
              nb_2 <- nb_try_fourth
              theta_2 <- 0
              mu_2 <- exp(nb_2$coefficients);names(mu_2) <- NULL
              size_2 <- nb_2$theta;names(size_2) <- NULL
              prob_2 <- size_2/(size_2 + mu_2);names(prob_2) <- NULL
            }
          }else{
            nb_2 <- nb_try_again
            theta_2 <- 0
            mu_2 <- exp(nb_2@coef["logmu"]);names(mu_2) <- NULL
            size_2 <- 1/nb_2@coef["invk"];names(size_2) <- NULL
            prob_2 <- size_2/(size_2 + mu_2);names(prob_2) <- NULL
          }
        }else{
          nb_2 <- nb_try_twice
          theta_2 <- 0
          mu_2 <- nb_2$estimate["mu"];names(mu_2) <- NULL
          size_2 <- nb_2$estimate["size"];names(size_2) <- NULL
          prob_2 <- size_2/(size_2 + mu_2);names(prob_2) <- NULL
        }
      }else{
        nb_2 <- nb_try
        theta_2 <- 0
        mu_2 <- exp(nb_2$coefficients);names(mu_2) <- NULL
        size_2 <- nb_2$theta;names(size_2) <- NULL
        prob_2 <- size_2/(size_2 + mu_2);names(prob_2) <- NULL
      }
    }

    # Restricted MLE under H0 (MLE of c(counts_1, counts_2))
    if(sum(c(counts_1, counts_2) == 0) > 0){
      options(show.error.messages = FALSE)
      zinb_try <- try(gamlssML(c(counts_1, counts_2), family="ZINBI"), silent=TRUE)
      options(show.error.messages = TRUE)
      if('try-error' %in% class(zinb_try)){
        zinb_try_twice <- try(zeroinfl(formula = c(counts_1, counts_2) ~ 1 | 1, dist = "negbin"), silent=TRUE)
        if('try-error' %in% class(zinb_try_twice)){
          print("MLE of ZINB failed!");
          results_gene[1,"Remark"] <- "ZINB failed!"
          return(results_gene)
        }else{
          zinb_res <- zinb_try_twice
          theta_res <- plogis(zinb_res$coefficients$zero);names(theta_res) <- NULL
          mu_res <- exp(zinb_res$coefficients$count);names(mu_res) <- NULL
          size_res <- zinb_res$theta;names(size_res) <- NULL
          prob_res <- size_res/(size_res + mu_res);names(prob_res) <- NULL
        }
      }else{
        zinb_res <- zinb_try
        theta_res <- zinb_res$nu;names(theta_res) <- NULL
        mu_res <- zinb_res$mu;names(mu_res) <- NULL
        size_res <- 1/zinb_res$sigma;names(size_res) <- NULL
        prob_res <- size_res/(size_res + mu_res);names(prob_res) <- NULL
      }
    }else{
      op <- options(warn=2)
      nb_try <- try(glm.nb(formula = c(counts_1, counts_2) ~ 1), silent=TRUE)
      options(op)
      if('try-error' %in% class(nb_try)){
        nb_try_twice <- try(fitdistr(c(counts_1, counts_2), "Negative Binomial"), silent=TRUE)
        if('try-error' %in% class(nb_try_twice)){
          nb_try_again <- try(mle2(c(counts_1, counts_2)~dnbinom(mu=exp(logmu),size=1/invk), data=data.frame(c(counts_1, counts_2)), start=list(logmu=0,invk=1), method="L-BFGS-B", lower=c(logmu=-Inf,invk=1e-8)), silent=TRUE)
          if('try-error' %in% class(nb_try_again)){
            nb_try_fourth <- try(glm.nb(formula = c(counts_1, counts_2) ~ 1), silent=TRUE)
            if('try-error' %in% class(nb_try_fourth)){
              print("MLE of NB failed!");
              results_gene[1,"Remark"] <- "NB failed!"
              return(results_gene)
            }else{
              nb_res <- nb_try_fourth
              theta_res <- 0
              mu_res <- exp(nb_res$coefficients);names(mu_res) <- NULL
              size_res <- nb_res$theta;names(size_res) <- NULL
              prob_res <- size_res/(size_res + mu_res);names(prob_res) <- NULL
            }
          }else{
            nb_res <- nb_try_again
            theta_res <- 0
            mu_res <- exp(nb_res@coef["logmu"]);names(mu_res) <- NULL
            size_res <- 1/nb_res@coef["invk"];names(size_res) <- NULL
            prob_res <- size_res/(size_res + mu_res);names(prob_res) <- NULL
          }
        }else{
          nb_res <- nb_try_twice
          theta_res <- 0
          mu_res <- nb_res$estimate["mu"];names(mu_res) <- NULL
          size_res <- nb_res$estimate["size"];names(size_res) <- NULL
          prob_res <- size_res/(size_res + mu_res);names(prob_res) <- NULL
        }
      }else{
        nb_res <- nb_try
        theta_res <- 0
        mu_res <- exp(nb_res$coefficients);names(mu_res) <- NULL
        size_res <- nb_res$theta;names(size_res) <- NULL
        prob_res <- size_res/(size_res + mu_res);names(prob_res) <- NULL
      }
    }

    # # LRT test of H0
    chi2LR1 <- 2 *(logL(counts_1, theta_1, size_1, prob_1, counts_2, theta_2, size_2, prob_2) - logL(counts_1, theta_res, size_res, prob_res, counts_2, theta_res, size_res, prob_res))
    pvalue <- 1 - pchisq(chi2LR1, df = 3)

    # Format output
    results_gene[1,"theta_1"] <- theta_1
    results_gene[1,"theta_2"] <- theta_2
    results_gene[1,"mu_1"] <- mu_1
    results_gene[1,"mu_2"] <- mu_2
    results_gene[1,"size_1"] <- size_1
    results_gene[1,"size_2"] <- size_2
    results_gene[1,"prob_1"] <- prob_1
    results_gene[1,"prob_2"] <- prob_2
    results_gene[1,"norm_total_mean_1"] <- mean(counts_1)
    results_gene[1,"norm_total_mean_2"] <- mean(counts_2)
    results_gene[1,"norm_foldChange"] <- results_gene[1,"norm_total_mean_1"] / results_gene[1,"norm_total_mean_2"]
    results_gene[1,"chi2LR1"] <- chi2LR1
    results_gene[1,"pvalue"] <- pvalue

    # Restricted MLE of logL2 and logL3 under H20 and H30 when pvalue <= 0.05
    if(pvalue <= 0.05){
      if(sum(c(counts_1, counts_2) == 0) > 0){
        options(warn=-1)
        # Restricted MLE of logL2
        A <- matrix(rbind(c(1, 0, 0, 0, 0), c(-1, 0, 0, 0, 0), c(0, 0, 1, 0 ,0), c(0, 0, -1, 0 ,0), c(0, 0, 0, 0 ,1), c(0, 0, 0, 0 ,-1)), 6, 5)
        B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10, 1e-10, 1+1e-10)
        mleL2 <- try(maxLik(logLik = logL2, start = c(theta_resL2 = 0.5, size_1_resL2 = 1, prob_1_resL2 = 0.5, size_2_resL2 = 1, prob_2_resL2 = 0.5), constraints=list(ineqA=A, ineqB=B)), silent=TRUE)
        if('try-error' %in% class(mleL2)){
          mleL2 <- try(maxLik(logLik = logL2, start = c(theta_resL2 = 0, size_1_resL2 = 1, prob_1_resL2 = 0.5, size_2_resL2 = 1, prob_2_resL2 = 0.5), constraints=list(ineqA=A, ineqB=B)), silent=TRUE)
        }
        if('try-error' %in% class(mleL2)){
          mleL2 <- try(maxLik(logLik = logL2, start = c(theta_resL2 = 1, size_1_resL2 = 1, prob_1_resL2 = 0.5, size_2_resL2 = 1, prob_2_resL2 = 0.5), constraints=list(ineqA=A, ineqB=B)), silent=TRUE)
        }
        if('try-error' %in% class(mleL2)){
          A <- matrix(rbind(c(0, 1, 0, 0), c(0, -1, 0, 0), c(0, 0, 0 ,1), c(0, 0, 0 ,-1)), 4, 4)
          B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10)
          mleL2 <- maxLik(logLik = logL2NZ, start = c(size_1_resL2 = 1, prob_1_resL2 = 0.5, size_2_resL2 = 1, prob_2_resL2 = 0.5), constraints=list(ineqA=A, ineqB=B))
          theta_resL2 <- 0
          size_1_resL2 <- mleL2$estimate["size_1_resL2"];names(size_1_resL2) <- NULL
          prob_1_resL2 <- mleL2$estimate["prob_1_resL2"];names(prob_1_resL2) <- NULL
          size_2_resL2 <- mleL2$estimate["size_2_resL2"];names(size_2_resL2) <- NULL
          prob_2_resL2 <- mleL2$estimate["prob_2_resL2"];names(prob_2_resL2) <- NULL
        }else{
          theta_resL2 <- mleL2$estimate["theta_resL2"];names(theta_resL2) <- NULL
          size_1_resL2 <- mleL2$estimate["size_1_resL2"];names(size_1_resL2) <- NULL
          prob_1_resL2 <- mleL2$estimate["prob_1_resL2"];names(prob_1_resL2) <- NULL
          size_2_resL2 <- mleL2$estimate["size_2_resL2"];names(size_2_resL2) <- NULL
          prob_2_resL2 <- mleL2$estimate["prob_2_resL2"];names(prob_2_resL2) <- NULL
        }

        # Restricted MLE of logL3
        if((sum(counts_1 == 0) > 0) & (sum(counts_2 == 0) > 0)){
          # logL3
          if(sum(counts_1 == 0) == length(counts_1)){
            A <- matrix(rbind(c(0, 1, 0), c(0, -1, 0), c(0, 0 ,1), c(0, 0 ,-1)), 4, 3)
            B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3AZ1, start = c(size_resL3 = 1, prob_resL3 = 0.5, theta_2_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- 1
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- mleL3$estimate["theta_2_resL3"];names(theta_2_resL3) <- NULL
          }else if(sum(counts_2 == 0) == length(counts_2)){
            A <- matrix(rbind(c(1, 0, 0), c(-1, 0, 0), c(0, 0 ,1), c(0, 0 ,-1)), 4, 3)
            B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3AZ2, start = c(theta_1_resL3 = 0.5, size_resL3 = 1, prob_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- mleL3$estimate["theta_1_resL3"];names(theta_1_resL3) <- NULL
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- 1
          }else{
            A <- matrix(rbind(c(1, 0, 0, 0), c(-1, 0, 0, 0), c(0, 0, 1, 0), c(0, 0, -1, 0), c(0, 0, 0 ,1), c(0, 0, 0 ,-1)), 6, 4)
            B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10, 1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3, start = c(theta_1_resL3 = 0.5, size_resL3 = 1, prob_resL3 = 0.5, theta_2_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- mleL3$estimate["theta_1_resL3"];names(theta_1_resL3) <- NULL
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- mleL3$estimate["theta_2_resL3"];names(theta_2_resL3) <- NULL
          }
        }else if(sum(counts_1 == 0) == 0){
          # logL3
          if(sum(counts_2 == 0) == length(counts_2)){
            A <- matrix(rbind(c(0, 1), c(0, -1)), 2, 2)
            B <- c(1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3NZ1AZ2, start = c(size_resL3 = 1, prob_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- 0
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- 1
          }else{
            A <- matrix(rbind(c(0, 1, 0), c(0, -1, 0), c(0, 0 ,1), c(0, 0 ,-1)), 4, 3)
            B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3NZ1, start = c(size_resL3 = 1, prob_resL3 = 0.5, theta_2_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- 0
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- mleL3$estimate["theta_2_resL3"];names(theta_2_resL3) <- NULL
          }
        }else if(sum(counts_2 == 0) == 0){
          # logL3
          if(sum(counts_1 == 0) == length(counts_1)){
            A <- matrix(rbind(c(0, 1), c(0, -1)), 2, 2)
            B <- c(1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3NZ2AZ1, start = c(size_resL3 = 1, prob_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- 1
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- 0
          }else{
            A <- matrix(rbind(c(1, 0, 0), c(-1, 0, 0), c(0, 0 ,1), c(0, 0 ,-1)), 4, 3)
            B <- c(1e-10, 1+1e-10, 1e-10, 1+1e-10)
            mleL3 <- maxLik(logLik = logL3NZ2, start = c(theta_1_resL3 = 0.5, size_resL3 = 1, prob_resL3 = 0.5), constraints=list(ineqA=A, ineqB=B))
            theta_1_resL3 <- mleL3$estimate["theta_1_resL3"];names(theta_1_resL3) <- NULL
            size_resL3 <- mleL3$estimate["size_resL3"];names(size_resL3) <- NULL
            prob_resL3 <- mleL3$estimate["prob_resL3"];names(prob_resL3) <- NULL
            theta_2_resL3 <- 0
          }
        }
        options(warn=0)
      }else{
        # Restricted MLE of logL2
        theta_resL2 <- 0
        size_1_resL2 <- size_1
        prob_1_resL2 <- prob_1
        size_2_resL2 <- size_2
        prob_2_resL2 <- prob_2

        # Restricted MLE of logL3
        theta_1_resL3 <- 0
        size_resL3 <- size_res
        prob_resL3 <- prob_res
        theta_2_resL3 <- 0
      }

      # Judge parameters
      if(!(judgeParam(theta_resL2) & judgeParam(prob_1_resL2) & judgeParam(prob_2_resL2)))
        results_gene[1,"Remark"] <- "logL2 failed!"
      if(!(judgeParam(theta_1_resL3) & judgeParam(theta_2_resL3) & judgeParam(prob_resL3)))
        results_gene[1,"Remark"] <- "logL3 failed!"
616 | 617 | # LRT test of H20 and H30 618 | chi2LR2 <- 2 *(logL(counts_1, theta_1, size_1, prob_1, counts_2, theta_2, size_2, prob_2) - logL(counts_1, theta_resL2, size_1_resL2, prob_1_resL2, counts_2, theta_resL2, size_2_resL2, prob_2_resL2)) 619 | pvalue_LR2 <- 1 - pchisq(chi2LR2, df = 1) 620 | chi2LR3 <- 2 *(logL(counts_1, theta_1, size_1, prob_1, counts_2, theta_2, size_2, prob_2) - logL(counts_1, theta_1_resL3, size_resL3, prob_resL3, counts_2, theta_2_resL3, size_resL3, prob_resL3)) 621 | pvalue_LR3 <- 1 - pchisq(chi2LR3, df = 2) 622 | 623 | # Format output 624 | results_gene[1,"pvalue_LR2"] <- pvalue_LR2 625 | results_gene[1,"pvalue_LR3"] <- pvalue_LR3 626 | } 627 | 628 | # Return results_gene 629 | return(results_gene) 630 | } 631 | 632 | 633 | # Call DEG gene by gene 634 | if(!parallel){ 635 | results <- matrix(data=NA, nrow = geneNum, ncol = 22, dimnames = list(row.names(counts_norm), c("theta_1", "theta_2", "mu_1", "mu_2", "size_1", "size_2", "prob_1", "prob_2", "total_mean_1", "total_mean_2", "foldChange", "norm_total_mean_1", "norm_total_mean_2", "norm_foldChange", "chi2LR1", "pvalue_LR2", "pvalue_LR3", "FDR_LR2", "FDR_LR3", "pvalue", "pvalue.adj.FDR", "Remark"))) 636 | results <- as.data.frame(results) 637 | for(i in 1:geneNum){ 638 | cat("\r",paste0("DEsingle is analyzing ", i," of ",geneNum," expressed genes")) 639 | results[i,] <- CallDE(i) 640 | } 641 | }else{ 642 | message("DEsingle is analyzing ", geneNum, " expressed genes in parallel") 643 | results <- do.call(rbind, bplapply(1:geneNum, CallDE, BPPARAM = BPPARAM)) 644 | } 645 | 646 | # Format output results 647 | results[, c("total_mean_1", "total_mean_2", "foldChange")] <- All_Mean_FC 648 | results[,"FDR_LR2"] <- p.adjust(results[,"pvalue_LR2"], method="fdr") 649 | results[,"FDR_LR3"] <- p.adjust(results[,"pvalue_LR3"], method="fdr") 650 | results[,"pvalue.adj.FDR"] <- p.adjust(results[,"pvalue"], method="fdr") 651 | results <- results[order(results[,"chi2LR1"], decreasing = TRUE),] 652 | 653 
| # Abnormity control 654 | if(exists("lastFuncGrad") && exists("lastFuncParam")) 655 | remove(lastFuncGrad, lastFuncParam, envir=.GlobalEnv) 656 | if(sum(!is.na(results[,"Remark"])) != 0) 657 | cat(paste0("\n\n ",sum(!is.na(results[,"Remark"])), " gene failed.\n\n")) 658 | 659 | return(results) 660 | 661 | 662 | } 663 | 664 | 665 | 666 | 667 | -------------------------------------------------------------------------------- /R/DEtype.R: -------------------------------------------------------------------------------- 1 | #' DEtype: Classifying differentially expressed genes from DEsingle 2 | #' 3 | #' This function is used to classify the differentially expressed genes of single-cell RNA-seq (scRNA-seq) data found by \code{DEsingle}. It takes the output data frame from \code{DEsingle} as input. 4 | #' 5 | #' @param results An output data frame from \code{DEsingle}, which contains the unclassified differential expression analysis results. 6 | #' @param threshold A number in (0,1) to specify the threshold of FDR. 7 | #' @return 8 | #' A data frame containing the differential expression (DE) analysis results and DE gene types and states. 9 | #' \itemize{ 10 | #' \item theta_1, theta_2, mu_1, mu_2, size_1, size_2, prob_1, prob_2: MLE of the zero-inflated negative binomial distribution's parameters of group 1 and group 2. 11 | #' \item total_mean_1, total_mean_2: Mean of read counts of group 1 and group 2. 12 | #' \item foldChange: total_mean_1/total_mean_2. 13 | #' \item norm_total_mean_1, norm_total_mean_2: Mean of normalized read counts of group 1 and group 2. 14 | #' \item norm_foldChange: norm_total_mean_1/norm_total_mean_2. 15 | #' \item chi2LR1: Chi-square statistic for hypothesis testing of H0. 16 | #' \item pvalue_LR2: P value of hypothesis testing of H20 (Used to determine the type of a DE gene). 17 | #' \item pvalue_LR3: P value of hypothesis testing of H30 (Used to determine the type of a DE gene).
18 | #' \item FDR_LR2: Adjusted P value of pvalue_LR2 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 19 | #' \item FDR_LR3: Adjusted P value of pvalue_LR3 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 20 | #' \item pvalue: P value of hypothesis testing of H0 (Used to determine whether a gene is a DE gene). 21 | #' \item pvalue.adj.FDR: Adjusted P value of H0's pvalue using Benjamini & Hochberg's method (Used to determine whether a gene is a DE gene). 22 | #' \item Remark: Record of abnormal program information. 23 | #' \item Type: Types of DE genes. DEs represents different expression status; DEa represents differential expression abundance; DEg represents general differential expression. 24 | #' \item State: State of DE genes, up represents up-regulated; down represents down-regulated. 25 | #' } 26 | #' 27 | #' @author Zhun Miao. 28 | #' @seealso 29 | #' \code{\link{DEsingle}}, for the detection of differentially expressed genes from scRNA-seq data. 30 | #' 31 | #' \code{\link{TestData}}, a test dataset for DEsingle. 
32 | #' 33 | #' @examples 34 | #' # Load test data for DEsingle 35 | #' data(TestData) 36 | #' 37 | #' # Specifying the two groups to be compared 38 | #' # The sample number in group 1 and group 2 is 50 and 100 respectively 39 | #' group <- factor(c(rep(1,50), rep(2,100))) 40 | #' 41 | #' # Detecting the differentially expressed genes 42 | #' results <- DEsingle(counts = counts, group = group) 43 | #' 44 | #' # Dividing the differentially expressed genes into 3 categories 45 | #' results.classified <- DEtype(results = results, threshold = 0.05) 46 | #' 47 | #' @import stats 48 | #' @importFrom BiocParallel bpparam bplapply 49 | #' @importFrom Matrix Matrix 50 | #' @importFrom MASS glm.nb fitdistr 51 | #' @importFrom VGAM dzinegbin 52 | #' @importFrom bbmle mle2 53 | #' @importFrom gamlss gamlssML 54 | #' @importFrom maxLik maxLik 55 | #' @importFrom pscl zeroinfl 56 | #' @importMethodsFrom Matrix colSums 57 | #' @export 58 | 59 | 60 | 61 | DEtype <- function(results, threshold){ 62 | # Invalid input judge 63 | if(class(results) != "data.frame") 64 | stop("Invalid input of wrong data type of results") 65 | if(ncol(results) != 22) 66 | stop("Invalid input of wrong column number of results") 67 | if(colnames(results)[21] != "pvalue.adj.FDR" | colnames(results)[16] != "pvalue_LR2" | colnames(results)[17] != "pvalue_LR3") 68 | stop("Invalid input of wrong column name of results") 69 | if(class(threshold) != "numeric") 70 | stop("Invalid input of wrong data type of threshold") 71 | if(threshold <= 0 | threshold > 0.1) 72 | stop("Invalid input of wrong range of threshold") 73 | 74 | # Classify the types of DE genes 75 | results <- cbind(results, NA, NA) 76 | colnames(results)[c(ncol(results)-1, ncol(results))] <- c("Type", "State") 77 | for(i in 1:nrow(results)){ 78 | if(results[i,"pvalue.adj.FDR"] < threshold) 79 | { 80 | if(results[i,"pvalue_LR2"] < threshold & results[i,"pvalue_LR3"] < threshold){ 81 | results[i,"Type"] <- "DEg" 82 | if(results[i,"mu_1"] * (1 - 
results[i,"theta_1"]) >= results[i,"mu_2"] * (1 - results[i,"theta_2"])) 83 | results[i,"State"] <- "up" 84 | else 85 | results[i,"State"] <- "down" 86 | } 87 | else if(results[i,"pvalue_LR2"] < threshold){ 88 | results[i,"Type"] <- "DEs" 89 | if(results[i,"theta_1"] <= results[i,"theta_2"]) 90 | results[i,"State"] <- "up" 91 | else 92 | results[i,"State"] <- "down" 93 | } 94 | else if(results[i,"pvalue_LR3"] < threshold){ 95 | results[i,"Type"] <- "DEa" 96 | if(results[i,"mu_1"] >= results[i,"mu_2"]) 97 | results[i,"State"] <- "up" 98 | else 99 | results[i,"State"] <- "down" 100 | } 101 | else{ 102 | results[i,"Type"] <- "DEg" 103 | if(results[i,"mu_1"] * (1 - results[i,"theta_1"]) >= results[i,"mu_2"] * (1 - results[i,"theta_2"])) 104 | results[i,"State"] <- "up" 105 | else 106 | results[i,"State"] <- "down" 107 | } 108 | } 109 | else 110 | next; 111 | } 112 | results 113 | } 114 | 115 | 116 | 117 | 118 | -------------------------------------------------------------------------------- /R/TestData.R: -------------------------------------------------------------------------------- 1 | #' TestData: A test dataset for DEsingle 2 | #' 3 | #' A toy dataset containing a single-cell RNA-seq (scRNA-seq) read counts matrix and its grouping information. 4 | #' 5 | #' \itemize{ 6 | #' \item counts. A matrix of raw read counts of scRNA-seq data which has 200 genes (rows) and 150 cells (columns). 7 | #' \item group. A vector of factor specifying the two groups to be compared in \code{counts}. Also could be generated by: \code{group <- factor(c(rep(1,50), rep(2,100)))} 8 | #' } 9 | #' 10 | #' @name TestData 11 | #' @aliases counts group 12 | #' @docType data 13 | #' @keywords data 14 | #' @usage data(TestData) 15 | #' @format 16 | #' \itemize{ 17 | #' \item counts. A non-negative integer matrix of scRNA-seq raw read counts, rows are genes and columns are cells. 18 | #' \item group. 
A vector of factor specifying the two groups to be compared, corresponding to the columns of the \code{counts}. 19 | #' } 20 | #' @source Petropoulos S, et al. Cell, 2016, 165(4): 1012-1026. 21 | #' @seealso 22 | #' \code{\link{DEsingle}}, for the detection of differentially expressed genes from scRNA-seq data. 23 | #' 24 | #' \code{\link{DEtype}}, for the classification of differentially expressed genes found by \code{\link{DEsingle}}. 25 | #' 26 | #' @examples 27 | #' # Load test data for DEsingle 28 | #' data(TestData) 29 | #' 30 | #' # Specifying the two groups to be compared 31 | #' # The sample number in group 1 and group 2 is 50 and 100 respectively 32 | #' group <- factor(c(rep(1,50), rep(2,100))) 33 | #' 34 | #' # Detecting the differentially expressed genes 35 | #' results <- DEsingle(counts = counts, group = group) 36 | #' 37 | #' # Dividing the differentially expressed genes into 3 categories 38 | #' results.classified <- DEtype(results = results, threshold = 0.05) 39 | #' 40 | NULL 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DEsingle 2 | 3 | *Zhun Miao* 4 | 5 | *2018-06-21* 6 | 7 | [![build](https://bioconductor.org/shields/build/release/bioc/DEsingle.svg)](http://bioconductor.org/checkResults/release/bioc-LATEST/DEsingle/) 8 | [![platform](https://bioconductor.org/shields/availability/3.7/DEsingle.svg)](https://miaozhun.github.io/DEsingle/#downloads) 9 | [![downloads](https://bioconductor.org/shields/downloads/DEsingle.svg)](https://bioconductor.org/packages/release/bioc/src/contrib/DEsingle_1.0.5.tar.gz) 10 | 11 | ![Logo](https://github.com/miaozhun/DEsingle/blob/master/vignettes/DEsingle_LOGO.png?raw=true) 12 | 13 | 14 | ## Introduction 15 | 16 | **`DEsingle`** is an R package for **differential expression (DE) analysis of single-cell RNA-seq (scRNA-seq) data**. 
It will detect differentially expressed genes between two groups of cells in an scRNA-seq raw read counts matrix. 17 | 18 | **`DEsingle`** employs the Zero-Inflated Negative Binomial model for differential expression analysis. By estimating the proportion of real and dropout zeros, it not only detects DE genes **at higher accuracy** but also **subdivides three types of differential expression with different regulatory and functional mechanisms**. 19 | 20 | For more information, please refer to the [manuscript](https://doi.org/10.1093/bioinformatics/bty332) by *Zhun Miao, Ke Deng, Xiaowo Wang and Xuegong Zhang*. 21 | 22 | 23 | ## Citation 24 | 25 | If you use **`DEsingle`** in published research, please cite: 26 | 27 | > Zhun Miao, Ke Deng, Xiaowo Wang, Xuegong Zhang (2018). DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics, bty332. [10.1093/bioinformatics/bty332.](https://doi.org/10.1093/bioinformatics/bty332) 28 | 29 | 30 | ## Installation 31 | 32 | To install **`DEsingle`** from [**Bioconductor**](http://bioconductor.org/packages/DEsingle/): 33 | 34 | ```{r Installation from Bioconductor, eval = FALSE} 35 | if(!require(BiocManager)) install.packages("BiocManager") 36 | BiocManager::install("DEsingle") 37 | ``` 38 | 39 | To install the *development version* from [**GitHub**](https://github.com/miaozhun/DEsingle/): 40 | 41 | ```{r Installation from GitHub, eval = FALSE} 42 | if(!require(devtools)) install.packages("devtools") 43 | devtools::install_github("miaozhun/DEsingle", build_vignettes = TRUE) 44 | ``` 45 | 46 | To load the installed **`DEsingle`** in R: 47 | 48 | ```{r Load DEsingle, eval = FALSE} 49 | library(DEsingle) 50 | ``` 51 | 52 | 53 | ## Input 54 | 55 | **`DEsingle`** takes two inputs: `counts` and `group`. 56 | 57 | The input `counts` is an scRNA-seq **raw read counts matrix** or a **`SingleCellExperiment`** object which contains the read counts matrix.
The rows of the matrix are genes and columns are cells. 58 | 59 | The other input `group` is a vector of factor which specifies the two groups in the matrix to be compared, corresponding to the columns in `counts`. 60 | 61 | 62 | ## Test data 63 | 64 | Users can load the test data in **`DEsingle`** by 65 | 66 | ```{r Load TestData} 67 | library(DEsingle) 68 | data(TestData) 69 | ``` 70 | 71 | The toy data `counts` in `TestData` is a scRNA-seq read counts matrix which has 200 genes (rows) and 150 cells (columns). 72 | 73 | ```{r counts} 74 | dim(counts) 75 | counts[1:6, 1:6] 76 | ``` 77 | 78 | The object `group` in `TestData` is a vector of factor which has two levels and equal length to the column number of `counts`. 79 | 80 | ```{r group} 81 | length(group) 82 | summary(group) 83 | ``` 84 | 85 | 86 | ## Usage 87 | 88 | ### With read counts matrix input 89 | 90 | Here is an example to run **`DEsingle`** with read counts matrix input: 91 | 92 | ```{r demo1, eval = FALSE} 93 | # Load library and the test data for DEsingle 94 | library(DEsingle) 95 | data(TestData) 96 | 97 | # Specifying the two groups to be compared 98 | # The sample number in group 1 and group 2 is 50 and 100 respectively 99 | group <- factor(c(rep(1,50), rep(2,100))) 100 | 101 | # Detecting the DE genes 102 | results <- DEsingle(counts = counts, group = group) 103 | 104 | # Dividing the DE genes into 3 categories at threshold of FDR < 0.05 105 | results.classified <- DEtype(results = results, threshold = 0.05) 106 | ``` 107 | 108 | ### With SingleCellExperiment input 109 | 110 | The [`SingleCellExperiment`](http://bioconductor.org/packages/SingleCellExperiment/) class is a widely used S4 class for storing single-cell genomics data. **`DEsingle`** also could take the `SingleCellExperiment` data representation as input. 
111 | 112 | Here is an example to run **`DEsingle`** with `SingleCellExperiment` input: 113 | 114 | ```{r demo2, eval = FALSE} 115 | # Load library and the test data for DEsingle 116 | library(DEsingle) 117 | library(SingleCellExperiment) 118 | data(TestData) 119 | 120 | # Convert the test data in DEsingle to SingleCellExperiment data representation 121 | sce <- SingleCellExperiment(assays = list(counts = as.matrix(counts))) 122 | 123 | # Specifying the two groups to be compared 124 | # The sample number in group 1 and group 2 is 50 and 100 respectively 125 | group <- factor(c(rep(1,50), rep(2,100))) 126 | 127 | # Detecting the DE genes with SingleCellExperiment input sce 128 | results <- DEsingle(counts = sce, group = group) 129 | 130 | # Dividing the DE genes into 3 categories at threshold of FDR < 0.05 131 | results.classified <- DEtype(results = results, threshold = 0.05) 132 | ``` 133 | 134 | 135 | ## Output 136 | 137 | `DEtype` subdivides the DE genes found by `DEsingle` into 3 types: **`DEs`**, **`DEa`** and **`DEg`**. 138 | 139 | * **`DEs`** refers to ***“different expression status”***. It is the type of genes that show significant difference in the proportion of real zeros in the two groups, but do not have significant difference in the other cells. 140 | 141 | * **`DEa`** is for ***“differential expression abundance”***, which refers to genes that are significantly differentially expressed between the groups without significant difference in the proportion of real zeros. 142 | 143 | * **`DEg`** or ***“general differential expression”*** refers to genes that have significant difference in both the proportions of real zeros and the expression abundances between the two groups. 
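The three definitions above can be read as a small decision rule. The following is a minimal sketch of that rule in base R, following the branching shown in `R/DEtype.R`; it is not the package's own code. The helper `classify_de()`, its argument names, and the toy p-values are all hypothetical and exist only for illustration. As in `DEtype`, a gene that passes the overall FDR threshold but neither sub-test is also labelled `DEg`.

```r
# Hypothetical helper (not part of DEsingle): assigns a DE type from
# the overall adjusted p-value and the two sub-test p-values.
classify_de <- function(fdr, p_lr2, p_lr3, threshold = 0.05) {
  lr2 <- p_lr2 < threshold   # evidence for a difference in the proportion of real zeros
  lr3 <- p_lr3 < threshold   # evidence for a difference in expression abundance
  type <- ifelse(lr2 & !lr3, "DEs",
          ifelse(lr3 & !lr2, "DEa", "DEg"))
  type[!(fdr < threshold)] <- NA   # genes that are not DE get no type
  type
}

# Made-up values, one gene per row
toy <- data.frame(
  fdr   = c(0.001, 0.001, 0.001, 0.001, 0.50),
  p_lr2 = c(0.001, 0.001, 0.40,  0.40,  0.001),
  p_lr3 = c(0.001, 0.40,  0.001, 0.40,  0.001)
)
toy$Type <- classify_de(toy$fdr, toy$p_lr2, toy$p_lr3)
print(toy)
```

In real analyses, call `DEtype()` itself; the sketch only makes the mapping from test results to `DEs`, `DEa` and `DEg` explicit.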
144 | 145 | The output of `DEtype` is a data frame containing the DE analysis results, whose rows are genes and columns contain the following items: 146 | 147 | * `theta_1`, `theta_2`, `mu_1`, `mu_2`, `size_1`, `size_2`, `prob_1`, `prob_2`: MLE of the zero-inflated negative binomial distribution's parameters of group 1 and group 2. 148 | * `total_mean_1`, `total_mean_2`: Mean of read counts of group 1 and group 2. 149 | * `foldChange`: total_mean_1/total_mean_2. 150 | * `norm_total_mean_1`, `norm_total_mean_2`: Mean of normalized read counts of group 1 and group 2. 151 | * `norm_foldChange`: norm_total_mean_1/norm_total_mean_2. 152 | * `chi2LR1`: Chi-square statistic for hypothesis testing of H0. 153 | * `pvalue_LR2`: P value of hypothesis testing of H20 (Used to determine the type of a DE gene). 154 | * `pvalue_LR3`: P value of hypothesis testing of H30 (Used to determine the type of a DE gene). 155 | * `FDR_LR2`: Adjusted P value of pvalue_LR2 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 156 | * `FDR_LR3`: Adjusted P value of pvalue_LR3 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 157 | * `pvalue`: P value of hypothesis testing of H0 (Used to determine whether a gene is a DE gene). 158 | * `pvalue.adj.FDR`: Adjusted P value of H0's pvalue using Benjamini & Hochberg's method (Used to determine whether a gene is a DE gene). 159 | * `Remark`: Record of abnormal program information. 160 | * `Type`: Types of DE genes. *DEs* represents different expression status; *DEa* represents differential expression abundance; *DEg* represents general differential expression. 161 | * `State`: State of DE genes, *up* represents up-regulated; *down* represents down-regulated.
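A quick way to get an overview of the `Type` and `State` columns is to cross-tabulate them. The sketch below uses a small made-up data frame, `toy.classified`, as a stand-in for the real output of `DEtype`; the gene names and values are invented for illustration, and only three of the columns listed above are included.

```r
# Toy stand-in for the output of DEtype(); all values are made up
toy.classified <- data.frame(
  pvalue.adj.FDR = c(0.001, 0.01, 0.02, 0.30, 0.004),
  Type  = c("DEs", "DEa", "DEg", NA, "DEa"),
  State = c("up", "down", "up", NA, "up"),
  row.names = paste0("gene", 1:5),
  stringsAsFactors = FALSE
)

# Count DE genes by type and direction; table() drops the NA rows,
# i.e. the genes that were not called differentially expressed
type_by_state <- table(toy.classified$Type, toy.classified$State)
print(type_by_state)
```

The same `table()` call should work unchanged on a real `DEtype` result, assuming its `Type` and `State` columns are filled in as described above.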
162 | 163 | To extract the significantly differentially expressed genes from the output of `DEtype` (**note that the same threshold of FDR should be used in this step as in `DEtype`**): 164 | 165 | ```{r extract DE, eval = FALSE} 166 | # Extract DE genes at threshold of FDR < 0.05 167 | results.sig <- results.classified[results.classified$pvalue.adj.FDR < 0.05, ] 168 | ``` 169 | 170 | To further extract the three types of DE genes separately: 171 | 172 | ```{r extract subtypes, eval = FALSE} 173 | # Extract three types of DE genes separately 174 | results.DEs <- results.sig[results.sig$Type == "DEs", ] 175 | results.DEa <- results.sig[results.sig$Type == "DEa", ] 176 | results.DEg <- results.sig[results.sig$Type == "DEg", ] 177 | ``` 178 | 179 | 180 | ## Parallelization 181 | 182 | **`DEsingle`** integrates parallel computing function with [`BiocParallel`](http://bioconductor.org/packages/BiocParallel/) package. Users could just set `parallel = TRUE` in function `DEsingle` to enable parallelization and leave the `BPPARAM` parameter alone. 183 | 184 | ```{r demo3, eval = FALSE} 185 | # Load library 186 | library(DEsingle) 187 | 188 | # Detecting the DE genes in parallelization 189 | results <- DEsingle(counts = counts, group = group, parallel = TRUE) 190 | ``` 191 | 192 | Advanced users could use a `BiocParallelParam` object from package `BiocParallel` to fill in the `BPPARAM` parameter to specify the parallel back-end to be used and its configuration parameters. 
193 | 194 | ### For Unix and Mac users 195 | 196 | The best choice for Unix and Mac users is to use `MulticoreParam` to configure a multicore parallel back-end: 197 | 198 | ```{r demo4, eval = FALSE} 199 | # Load library 200 | library(DEsingle) 201 | library(BiocParallel) 202 | 203 | # Set the parameters and register the back-end to be used 204 | param <- MulticoreParam(workers = 18, progressbar = TRUE) 205 | register(param) 206 | 207 | # Detecting the DE genes in parallel with 18 cores 208 | results <- DEsingle(counts = counts, group = group, parallel = TRUE, BPPARAM = param) 209 | ``` 210 | 211 | ### For Windows users 212 | 213 | For Windows users, using `SnowParam` to configure a Snow back-end is a good choice: 214 | 215 | ```{r demo5, eval = FALSE} 216 | # Load library 217 | library(DEsingle) 218 | library(BiocParallel) 219 | 220 | # Set the parameters and register the back-end to be used 221 | param <- SnowParam(workers = 8, type = "SOCK", progressbar = TRUE) 222 | register(param) 223 | 224 | # Detecting the DE genes in parallel with 8 cores 225 | results <- DEsingle(counts = counts, group = group, parallel = TRUE, BPPARAM = param) 226 | ``` 227 | 228 | See the [*Reference Manual*](https://bioconductor.org/packages/release/bioc/manuals/BiocParallel/man/BiocParallel.pdf) of the [`BiocParallel`](http://bioconductor.org/packages/BiocParallel/) package for more details of the `BiocParallelParam` class. 229 | 230 | 231 | ## Visualization of results 232 | 233 | Users could use the `heatmap()` function in `stats` or the `heatmap.2` function in `gplots` to plot a heatmap of the DE genes found by DEsingle, as we did in Figure S3 of the [*manuscript*](https://doi.org/10.1093/bioinformatics/bty332).
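As a rough illustration of the `heatmap()` route, here is a minimal sketch using base R only. It does not reproduce Figure S3: the counts are simulated, and `toy_counts`, `toy_group` and the colour choices are assumptions made for this example. With real data, one would presumably pass the rows of `counts` that correspond to the DE genes instead.

```r
# Simulated stand-in for the counts of 20 DE genes in 12 cells
set.seed(1)
toy_counts <- matrix(rpois(20 * 12, lambda = 5), nrow = 20,
                     dimnames = list(paste0("gene", 1:20),
                                     paste0("cell", 1:12)))
toy_group <- factor(rep(c(1, 2), each = 6))

# Log-transform to damp the influence of very large counts
log_counts <- log2(toy_counts + 1)

# Cluster genes only (Colv = NA keeps cells in their original order),
# scale each gene across cells, and mark the two groups with a colour bar
hm <- heatmap(log_counts, Colv = NA, scale = "row",
              ColSideColors = c("grey30", "orange")[toy_group],
              xlab = "Cells", ylab = "Genes")
```

Keeping the column order fixed makes the two groups appear as contiguous blocks, which is usually easier to read than a fully clustered layout when the grouping is known in advance.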
234 | 235 | 236 | ## Interpretation of results 237 | 238 | For the interpretation of results when **`DEsingle`** is applied to real data, please refer to the *Three types of DE genes between E3 and E4 of human embryonic cells* part in the [*Supplementary Materials*](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty332/4983067#supplementary-data) of our [*manuscript*](https://doi.org/10.1093/bioinformatics/bty332). 239 | 240 | 241 | ## Help 242 | 243 | Use `browseVignettes("DEsingle")` to see the vignettes of **`DEsingle`** in R after installation. 244 | 245 | Use the following code in R to access the help documentation for **`DEsingle`**: 246 | 247 | ```{r help1, eval = FALSE} 248 | # Documentation for DEsingle 249 | ?DEsingle 250 | ``` 251 | 252 | ```{r help2, eval = FALSE} 253 | # Documentation for DEtype 254 | ?DEtype 255 | ``` 256 | 257 | ```{r help3, eval = FALSE} 258 | # Documentation for TestData 259 | ?TestData 260 | ?counts 261 | ?group 262 | ``` 263 | 264 | You are also welcome to view and post *DEsingle* tagged questions on the [Bioconductor Support Site of DEsingle](https://support.bioconductor.org/t/desingle/) or contact the author by email for help. 265 | 266 | 267 | ## Author 268 | 269 | *Zhun Miao* <> 270 | 271 | MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China.
272 | 273 | -------------------------------------------------------------------------------- /data/TestData.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/miaozhun/DEsingle/3cf7d9b3b35d6282f782b8c02116a1b3efde2ba3/data/TestData.rda -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | citEntry(entry="article", 2 | title = "DEsingle for detecting three types of differential expression in single-cell RNA-seq data", 3 | author = personList( as.person("Zhun Miao"), 4 | as.person("Ke Deng"), 5 | as.person("Xiaowo Wang"), 6 | as.person("Xuegong Zhang")), 7 | year = 2018, 8 | journal = "Bioinformatics", 9 | doi = "10.1093/bioinformatics/bty332", 10 | pages = "bty332", 11 | textVersion = 12 | paste("Zhun Miao, Ke Deng, Xiaowo Wang, Xuegong Zhang.", 13 | "DEsingle for detecting three types of differential expression in single-cell RNA-seq data.", 14 | "Bioinformatics (2018): bty332.")) 15 | -------------------------------------------------------------------------------- /inst/NEWS: -------------------------------------------------------------------------------- 1 | VERSION 1.0.5 2 | ------------------------- 3 | o Optimization of speed and memory. 4 | 5 | VERSION 1.0.1 6 | ------------------------- 7 | o Optimization of memory management. 8 | 9 | VERSION 1.0.0 10 | ------------------------- 11 | o Package released in Bioconductor. 12 | 13 | VERSION 0.99.12 14 | ------------------------- 15 | o Documentation improvements. 16 | 17 | VERSION 0.99.9 18 | ------------------------- 19 | o Add Parallelization. 20 | 21 | VERSION 0.99.0 22 | ------------------------- 23 | o Package released. 
24 | -------------------------------------------------------------------------------- /man/DEsingle.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/DEsingle.R 3 | \name{DEsingle} 4 | \alias{DEsingle} 5 | \title{DEsingle: Detecting differentially expressed genes from scRNA-seq data} 6 | \usage{ 7 | DEsingle(counts, group, parallel = FALSE, BPPARAM = bpparam()) 8 | } 9 | \arguments{ 10 | \item{counts}{A non-negative integer matrix of scRNA-seq raw read counts or a \code{SingleCellExperiment} object which contains the read counts matrix. The rows of the matrix are genes and columns are samples/cells.} 11 | 12 | \item{group}{A vector of factor which specifies the two groups to be compared, corresponding to the columns in the counts matrix.} 13 | 14 | \item{parallel}{If FALSE (default), no parallel computation is used; if TRUE, parallel computation using \code{BiocParallel}, with argument \code{BPPARAM}.} 15 | 16 | \item{BPPARAM}{An optional parameter object passed internally to \code{\link{bplapply}} when \code{parallel=TRUE}. If not specified, \code{\link{bpparam}()} (default) will be used.} 17 | } 18 | \value{ 19 | A data frame containing the differential expression (DE) analysis results, rows are genes and columns contain the following items: 20 | \itemize{ 21 | \item theta_1, theta_2, mu_1, mu_2, size_1, size_2, prob_1, prob_2: MLE of the zero-inflated negative binomial distribution's parameters of group 1 and group 2. 22 | \item total_mean_1, total_mean_2: Mean of read counts of group 1 and group 2. 23 | \item foldChange: total_mean_1/total_mean_2. 24 | \item norm_total_mean_1, norm_total_mean_2: Mean of normalized read counts of group 1 and group 2. 25 | \item norm_foldChange: norm_total_mean_1/norm_total_mean_2. 26 | \item chi2LR1: Chi-square statistic for hypothesis testing of H0. 
27 | \item pvalue_LR2: P value of hypothesis testing of H20 (Used to determine the type of a DE gene). 28 | \item pvalue_LR3: P value of hypothesis testing of H30 (Used to determine the type of a DE gene). 29 | \item FDR_LR2: Adjusted P value of pvalue_LR2 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 30 | \item FDR_LR3: Adjusted P value of pvalue_LR3 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 31 | \item pvalue: P value of hypothesis testing of H0 (Used to determine whether a gene is a DE gene). 32 | \item pvalue.adj.FDR: Adjusted P value of H0's pvalue using Benjamini & Hochberg's method (Used to determine whether a gene is a DE gene). 33 | \item Remark: Record of abnormal program information. 34 | } 35 | } 36 | \description{ 37 | This function is used to detect differentially expressed genes between two specified groups of cells in a raw read counts matrix of single-cell RNA-seq (scRNA-seq) data. It takes a non-negative integer matrix of scRNA-seq raw read counts or a \code{SingleCellExperiment} object as input. So users should map the reads (obtained from sequencing libraries of the samples) to the corresponding genome and count the reads mapped to each gene according to the gene annotation to get the raw read counts matrix in advance. 38 | } 39 | \examples{ 40 | # Load test data for DEsingle 41 | data(TestData) 42 | 43 | # Specifying the two groups to be compared 44 | # The sample number in group 1 and group 2 is 50 and 100 respectively 45 | group <- factor(c(rep(1,50), rep(2,100))) 46 | 47 | # Detecting the differentially expressed genes 48 | results <- DEsingle(counts = counts, group = group) 49 | 50 | # Dividing the differentially expressed genes into 3 categories 51 | results.classified <- DEtype(results = results, threshold = 0.05) 52 | 53 | } 54 | \seealso{ 55 | \code{\link{DEtype}}, for the classification of differentially expressed genes found by \code{\link{DEsingle}}. 
56 | 57 | \code{\link{TestData}}, a test dataset for DEsingle. 58 | } 59 | \author{ 60 | Zhun Miao. 61 | } 62 | -------------------------------------------------------------------------------- /man/DEtype.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/DEtype.R 3 | \name{DEtype} 4 | \alias{DEtype} 5 | \title{DEtype: Classifying differentially expressed genes from DEsingle} 6 | \usage{ 7 | DEtype(results, threshold) 8 | } 9 | \arguments{ 10 | \item{results}{A output data frame from \code{DEsingle}, which contains the unclassified differential expression analysis results.} 11 | 12 | \item{threshold}{A number of (0,1) to specify the threshold of FDR.} 13 | } 14 | \value{ 15 | A data frame containing the differential expression (DE) analysis results and DE gene types and states. 16 | \itemize{ 17 | \item theta_1, theta_2, mu_1, mu_2, size_1, size_2, prob_1, prob_2: MLE of the zero-inflated negative binomial distribution's parameters of group 1 and group 2. 18 | \item total_mean_1, total_mean_2: Mean of read counts of group 1 and group 2. 19 | \item foldChange: total_mean_1/total_mean_2. 20 | \item norm_total_mean_1, norm_total_mean_2: Mean of normalized read counts of group 1 and group 2. 21 | \item norm_foldChange: norm_total_mean_1/norm_total_mean_2. 22 | \item chi2LR1: Chi-square statistic for hypothesis testing of H0. 23 | \item pvalue_LR2: P value of hypothesis testing of H20 (Used to determine the type of a DE gene). 24 | \item pvalue_LR3: P value of hypothesis testing of H30 (Used to determine the type of a DE gene). 25 | \item FDR_LR2: Adjusted P value of pvalue_LR2 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 26 | \item FDR_LR3: Adjusted P value of pvalue_LR3 using Benjamini & Hochberg's method (Used to determine the type of a DE gene). 
27 | \item pvalue: P value of hypothesis testing of H0 (Used to determine whether a gene is a DE gene). 28 | \item pvalue.adj.FDR: Adjusted P value of H0's pvalue using Benjamini & Hochberg's method (Used to determine whether a gene is a DE gene). 29 | \item Remark: Record of abnormal program information. 30 | \item Type: Types of DE genes. DEs represents different expression status; DEa represents differential expression abundance; DEg represents general differential expression. 31 | \item State: State of DE genes, up represents up-regulated; down represents down-regulated. 32 | } 33 | } 34 | \description{ 35 | This function is used to classify the differentially expressed genes of single-cell RNA-seq (scRNA-seq) data found by \code{DEsingle}. It takes the output data frame from \code{DEsingle} as input. 36 | } 37 | \examples{ 38 | # Load test data for DEsingle 39 | data(TestData) 40 | 41 | # Specifying the two groups to be compared 42 | # The sample number in group 1 and group 2 is 50 and 100 respectively 43 | group <- factor(c(rep(1,50), rep(2,100))) 44 | 45 | # Detecting the differentially expressed genes 46 | results <- DEsingle(counts = counts, group = group) 47 | 48 | # Dividing the differentially expressed genes into 3 categories 49 | results.classified <- DEtype(results = results, threshold = 0.05) 50 | 51 | } 52 | \seealso{ 53 | \code{\link{DEsingle}}, for the detection of differentially expressed genes from scRNA-seq data. 54 | 55 | \code{\link{TestData}}, a test dataset for DEsingle. 56 | } 57 | \author{ 58 | Zhun Miao. 
59 | } 60 | -------------------------------------------------------------------------------- /man/TestData.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/TestData.R 3 | \docType{data} 4 | \name{TestData} 5 | \alias{TestData} 6 | \alias{counts} 7 | \alias{group} 8 | \title{TestData: A test dataset for DEsingle} 9 | \format{\itemize{ 10 | \item counts. A non-negative integer matrix of scRNA-seq raw read counts, rows are genes and columns are cells. 11 | \item group. A vector of factor specifying the two groups to be compared, corresponding to the columns of the \code{counts}. 12 | }} 13 | \source{ 14 | Petropoulos S, et al. Cell, 2016, 165(4): 1012-1026. 15 | } 16 | \usage{ 17 | data(TestData) 18 | } 19 | \description{ 20 | A toy dataset containing a single-cell RNA-seq (scRNA-seq) read counts matrix and its grouping information. 21 | } 22 | \details{ 23 | \itemize{ 24 | \item counts. A matrix of raw read counts of scRNA-seq data which has 200 genes (rows) and 150 cells (columns). 25 | \item group. A vector of factor specifying the two groups to be compared in \code{counts}. Also could be generated by: \code{group <- factor(c(rep(1,50), rep(2,100)))} 26 | } 27 | } 28 | \examples{ 29 | # Load test data for DEsingle 30 | data(TestData) 31 | 32 | # Specifying the two groups to be compared 33 | # The sample number in group 1 and group 2 is 50 and 100 respectively 34 | group <- factor(c(rep(1,50), rep(2,100))) 35 | 36 | # Detecting the differentially expressed genes 37 | results <- DEsingle(counts = counts, group = group) 38 | 39 | # Dividing the differentially expressed genes into 3 categories 40 | results.classified <- DEtype(results = results, threshold = 0.05) 41 | 42 | } 43 | \seealso{ 44 | \code{\link{DEsingle}}, for the detection of differentially expressed genes from scRNA-seq data. 
45 | 46 | \code{\link{DEtype}}, for the classification of differentially expressed genes found by \code{\link{DEsingle}}. 47 | } 48 | \keyword{data} 49 | -------------------------------------------------------------------------------- /vignettes/DEsingle.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "DEsingle" 3 | author: "Zhun Miao" 4 | date: "2018-06-21" 5 | output: 6 | html_document: 7 | toc: TRUE 8 | toc_depth: 3 9 | toc_float: TRUE 10 | collapsed: TRUE 11 | vignette: > 12 | %\VignetteIndexEntry{DEsingle} 13 | %\VignetteEngine{knitr::rmarkdown} 14 | %\VignetteEncoding{UTF-8} 15 | --- 16 | 17 | ```{r setup, include = FALSE} 18 | knitr::opts_chunk$set( 19 | echo = TRUE, 20 | collapse = TRUE, 21 | comment = "#>" 22 | ) 23 | ``` 24 | 25 | ![](DEsingle_LOGO.png) 26 | 27 | 28 | ## Introduction 29 | 30 | **`DEsingle`** is an R package for **differential expression (DE) analysis of single-cell RNA-seq (scRNA-seq) data**. It will detect differentially expressed genes between two groups of cells in a scRNA-seq raw read counts matrix. 31 | 32 | **`DEsingle`** employs the Zero-Inflated Negative Binomial model for differential expression analysis. By estimating the proportion of real and dropout zeros, it not only detects DE genes **at higher accuracy** but also **subdivides three types of differential expression with different regulatory and functional mechanisms**. 33 | 34 | For more information, please refer to the [manuscript](https://doi.org/10.1093/bioinformatics/bty332) by *Zhun Miao, Ke Deng, Xiaowo Wang and Xuegong Zhang*. 35 | 36 | 37 | ## Citation 38 | 39 | If you use **`DEsingle`** in published research, please cite: 40 | 41 | > Zhun Miao, Ke Deng, Xiaowo Wang, Xuegong Zhang (2018). DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics, bty332. 
[10.1093/bioinformatics/bty332.](https://doi.org/10.1093/bioinformatics/bty332) 42 | 43 | 44 | ## Installation 45 | 46 | To install **`DEsingle`** from [**Bioconductor**](http://bioconductor.org/packages/DEsingle/): 47 | 48 | ```{r Installation from Bioconductor, eval = FALSE} 49 | if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") 50 | BiocManager::install("DEsingle") 51 | ``` 52 | 53 | To install the *development version* from [**GitHub**](https://github.com/miaozhun/DEsingle/): 54 | 55 | ```{r Installation from GitHub, eval = FALSE} 56 | if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") 57 | devtools::install_github("miaozhun/DEsingle", build_vignettes = TRUE) 58 | ``` 59 | 60 | To load the installed **`DEsingle`** in R: 61 | 62 | ```{r Load DEsingle, eval = FALSE} 63 | library(DEsingle) 64 | ``` 65 | 66 | 67 | ## Input 68 | 69 | **`DEsingle`** takes two inputs: `counts` and `group`. 70 | 71 | The input `counts` is an scRNA-seq **raw read counts matrix** or a **`SingleCellExperiment`** object which contains the read counts matrix. The rows of the matrix are genes and the columns are cells. 72 | 73 | The other input, `group`, is a factor which specifies the two groups in the matrix to be compared. Its elements correspond to the columns of `counts`. 74 | 75 | 76 | ## Test data 77 | 78 | Users can load the test data in **`DEsingle`** with: 79 | 80 | ```{r Load TestData} 81 | library(DEsingle) 82 | data(TestData) 83 | ``` 84 | 85 | The toy data `counts` in `TestData` is an scRNA-seq read counts matrix which has 200 genes (rows) and 150 cells (columns). 86 | 87 | ```{r counts} 88 | dim(counts) 89 | counts[1:6, 1:6] 90 | ``` 91 | 92 | The object `group` in `TestData` is a factor which has two levels and whose length equals the number of columns of `counts`. 
93 | 94 | ```{r group} 95 | length(group) 96 | summary(group) 97 | ``` 98 | 99 | 100 | ## Usage 101 | 102 | ### With read counts matrix input 103 | 104 | Here is an example of running **`DEsingle`** with read counts matrix input: 105 | 106 | ```{r demo1, eval = FALSE} 107 | # Load library and the test data for DEsingle 108 | library(DEsingle) 109 | data(TestData) 110 | 111 | # Specify the two groups to be compared 112 | # Group 1 contains 50 cells and group 2 contains 100 cells 113 | group <- factor(c(rep(1,50), rep(2,100))) 114 | 115 | # Detect the DE genes 116 | results <- DEsingle(counts = counts, group = group) 117 | 118 | # Divide the DE genes into 3 categories at a threshold of FDR < 0.05 119 | results.classified <- DEtype(results = results, threshold = 0.05) 120 | ``` 121 | 122 | ### With SingleCellExperiment input 123 | 124 | The [`SingleCellExperiment`](http://bioconductor.org/packages/SingleCellExperiment/) class is a widely used S4 class for storing single-cell genomics data. **`DEsingle`** can also take the `SingleCellExperiment` data representation as input. 
125 | 126 | Here is an example to run **`DEsingle`** with `SingleCellExperiment` input: 127 | 128 | ```{r demo2, eval = FALSE} 129 | # Load library and the test data for DEsingle 130 | library(DEsingle) 131 | library(SingleCellExperiment) 132 | data(TestData) 133 | 134 | # Convert the test data in DEsingle to SingleCellExperiment data representation 135 | sce <- SingleCellExperiment(assays = list(counts = as.matrix(counts))) 136 | 137 | # Specifying the two groups to be compared 138 | # The sample number in group 1 and group 2 is 50 and 100 respectively 139 | group <- factor(c(rep(1,50), rep(2,100))) 140 | 141 | # Detecting the DE genes with SingleCellExperiment input sce 142 | results <- DEsingle(counts = sce, group = group) 143 | 144 | # Dividing the DE genes into 3 categories at threshold of FDR < 0.05 145 | results.classified <- DEtype(results = results, threshold = 0.05) 146 | ``` 147 | 148 | 149 | ## Output 150 | 151 | `DEtype` subdivides the DE genes found by `DEsingle` into 3 types: **`DEs`**, **`DEa`** and **`DEg`**. 152 | 153 | * **`DEs`** refers to ***“different expression status”***. It is the type of genes that show significant difference in the proportion of real zeros in the two groups, but do not have significant difference in the other cells. 154 | 155 | * **`DEa`** is for ***“differential expression abundance”***, which refers to genes that are significantly differentially expressed between the groups without significant difference in the proportion of real zeros. 156 | 157 | * **`DEg`** or ***“general differential expression”*** refers to genes that have significant difference in both the proportions of real zeros and the expression abundances between the two groups. 
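The difference between the three types can be pictured with a toy zero-inflated negative binomial (ZINB) density. The sketch below is only an illustration and not the code used by `DEsingle`. The helper `dzinb` and all parameter values are hypothetical, and it assumes the common parameterisation in which `theta` is the probability of an extra (zero-inflation) zero and the remaining counts follow a negative binomial distribution with mean `mu` and dispersion `size`.

```r
# Hypothetical helper: ZINB probability mass function.
# theta = zero-inflation probability; size, mu = negative binomial parameters.
dzinb <- function(x, theta, size, mu) {
  theta * (x == 0) + (1 - theta) * stats::dnbinom(x, size = size, mu = mu)
}

x <- 0:2000  # wide enough that the truncated tail is negligible

# Two toy groups that differ only in the zero-inflation probability
group1 <- dzinb(x, theta = 0.2, size = 2, mu = 10)
group2 <- dzinb(x, theta = 0.6, size = 2, mu = 10)

# Probability of observing a zero in each group
p_zero <- c(group1 = group1[1], group2 = group2[1])

# Mean of the negative binomial part, which is identical in both groups
nb_mean <- sum(x * stats::dnbinom(x, size = 2, mu = 10))

print(p_zero)
print(nb_mean)
```

Under these assumptions, two groups that differ only in `theta` resemble the *DEs* situation, two groups that differ only in `mu` would resemble *DEa*, and a difference in both would resemble *DEg*.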
158 | 159 | The output of `DEtype` is a data frame containing the DE analysis results, whose rows are genes and whose columns contain the following items: 160 | 161 | * `theta_1`, `theta_2`, `mu_1`, `mu_2`, `size_1`, `size_2`, `prob_1`, `prob_2`: Maximum likelihood estimates (MLE) of the zero-inflated negative binomial distribution's parameters for group 1 and group 2. 162 | * `total_mean_1`, `total_mean_2`: Mean of read counts of group 1 and group 2. 163 | * `foldChange`: `total_mean_1/total_mean_2`. 164 | * `norm_total_mean_1`, `norm_total_mean_2`: Mean of normalized read counts of group 1 and group 2. 165 | * `norm_foldChange`: `norm_total_mean_1/norm_total_mean_2`. 166 | * `chi2LR1`: Chi-square statistic for hypothesis testing of H0. 167 | * `pvalue_LR2`: P value of hypothesis testing of H20 (used to determine the type of a DE gene). 168 | * `pvalue_LR3`: P value of hypothesis testing of H30 (used to determine the type of a DE gene). 169 | * `FDR_LR2`: Adjusted P value of `pvalue_LR2` using Benjamini & Hochberg's method (used to determine the type of a DE gene). 170 | * `FDR_LR3`: Adjusted P value of `pvalue_LR3` using Benjamini & Hochberg's method (used to determine the type of a DE gene). 171 | * `pvalue`: P value of hypothesis testing of H0 (used to determine whether a gene is a DE gene). 172 | * `pvalue.adj.FDR`: Adjusted P value of H0's `pvalue` using Benjamini & Hochberg's method (used to determine whether a gene is a DE gene). 173 | * `Remark`: Record of abnormal program information. 174 | * `Type`: Type of the DE gene. *DEs* represents different expression status; *DEa* represents differential expression abundance; *DEg* represents general differential expression. 175 | * `State`: State of the DE gene. *up* represents up-regulated; *down* represents down-regulated. 
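As a rough illustration of how a few of these columns relate to one another, the sketch below computes group means, their ratio, and Benjamini-Hochberg adjusted P values on made-up data. It does not call `DEsingle`. The objects `toy_counts`, `toy_group` and `toy_pvalue` are hypothetical, and the package may compute its values differently in detail (normalization, for example, is not shown).

```r
set.seed(1)

# Toy read counts: 5 genes (rows) x 6 cells (columns)
toy_counts <- matrix(
  rnbinom(5 * 6, size = 2, mu = 10), nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:6))
)
toy_group <- factor(c(1, 1, 1, 2, 2, 2))

# Mean read count of each gene within each group
in_group1 <- toy_group == levels(toy_group)[1]
total_mean_1 <- rowMeans(toy_counts[, in_group1, drop = FALSE])
total_mean_2 <- rowMeans(toy_counts[, !in_group1, drop = FALSE])

# Fold change defined as in the list above: group 1 mean over group 2 mean
fold_change <- total_mean_1 / total_mean_2

# Made-up P values, adjusted with Benjamini & Hochberg's method
toy_pvalue <- c(0.001, 0.02, 0.04, 0.30, 0.80)
toy_fdr <- stats::p.adjust(toy_pvalue, method = "BH")

print(data.frame(total_mean_1, total_mean_2, fold_change, toy_pvalue, toy_fdr))
```

Because the fold change is group 1 over group 2, values above 1 indicate a higher mean in group 1.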
176 | 177 | To extract the significantly differentially expressed genes from the output of `DEtype` (**note that the same FDR threshold should be used in this step as in `DEtype`**): 178 | 179 | ```{r extract DE, eval = FALSE} 180 | # Extract DE genes at threshold of FDR < 0.05 181 | results.sig <- results.classified[results.classified$pvalue.adj.FDR < 0.05, ] 182 | ``` 183 | 184 | To further extract the three types of DE genes separately: 185 | 186 | ```{r extract subtypes, eval = FALSE} 187 | # Extract three types of DE genes separately 188 | results.DEs <- results.sig[results.sig$Type == "DEs", ] 189 | results.DEa <- results.sig[results.sig$Type == "DEa", ] 190 | results.DEg <- results.sig[results.sig$Type == "DEg", ] 191 | ``` 192 | 193 | 194 | ## Parallelization 195 | 196 | **`DEsingle`** supports parallel computing through the [`BiocParallel`](http://bioconductor.org/packages/BiocParallel/) package. Users can simply set `parallel = TRUE` in the function `DEsingle` to enable parallelization and leave the `BPPARAM` parameter alone. 197 | 198 | ```{r demo3, eval = FALSE} 199 | # Load library 200 | library(DEsingle) 201 | 202 | # Detect the DE genes in parallel 203 | results <- DEsingle(counts = counts, group = group, parallel = TRUE) 204 | ``` 205 | 206 | Advanced users can pass a `BiocParallelParam` object from the package `BiocParallel` as the `BPPARAM` parameter to specify the parallel back-end to be used and its configuration parameters. 
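One way to keep a script portable across operating systems is to choose the back-end at run time. The following is a minimal sketch, not part of `DEsingle`: the helper `choose_backend` is hypothetical and the worker count of 2 is arbitrary. `MulticoreParam` and `SnowParam` are the `BiocParallel` constructors used in the platform-specific examples of this vignette.

```r
# Hypothetical helper: pick a BiocParallel back-end name from the OS type
choose_backend <- function(os_type = .Platform$OS.type) {
  if (identical(os_type, "windows")) "SnowParam" else "MulticoreParam"
}

backend <- choose_backend()

# Build the parameter object only if BiocParallel is installed
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  param <- if (backend == "SnowParam") {
    BiocParallel::SnowParam(workers = 2)
  } else {
    BiocParallel::MulticoreParam(workers = 2)
  }
  # The object would then be passed on, for example:
  # results <- DEsingle(counts = counts, group = group,
  #                     parallel = TRUE, BPPARAM = param)
}
```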
207 | 208 | ### For Unix and Mac users 209 | 210 | The best choice for Unix and Mac users is to use `MulticoreParam` to configure a multicore parallel back-end: 211 | 212 | ```{r demo4, eval = FALSE} 213 | # Load library 214 | library(DEsingle) 215 | library(BiocParallel) 216 | 217 | # Set the parameters and register the back-end to be used 218 | param <- MulticoreParam(workers = 18, progressbar = TRUE) 219 | register(param) 220 | 221 | # Detect the DE genes in parallel with 18 cores 222 | results <- DEsingle(counts = counts, group = group, parallel = TRUE, BPPARAM = param) 223 | ``` 224 | 225 | ### For Windows users 226 | 227 | For Windows users, using `SnowParam` to configure a Snow back-end is a good choice: 228 | 229 | ```{r demo5, eval = FALSE} 230 | # Load library 231 | library(DEsingle) 232 | library(BiocParallel) 233 | 234 | # Set the parameters and register the back-end to be used 235 | param <- SnowParam(workers = 8, type = "SOCK", progressbar = TRUE) 236 | register(param) 237 | 238 | # Detect the DE genes in parallel with 8 cores 239 | results <- DEsingle(counts = counts, group = group, parallel = TRUE, BPPARAM = param) 240 | ``` 241 | 242 | See the [*Reference Manual*](https://bioconductor.org/packages/release/bioc/manuals/BiocParallel/man/BiocParallel.pdf) of the [`BiocParallel`](http://bioconductor.org/packages/BiocParallel/) package for more details of the `BiocParallelParam` class. 243 | 244 | 245 | ## Visualization of results 246 | 247 | Users can use the `heatmap()` function in `stats` or the `heatmap.2()` function in `gplots` to plot a heatmap of the DE genes that DEsingle found, as we did in Figure S3 of the [*manuscript*](https://doi.org/10.1093/bioinformatics/bty332). 
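A minimal sketch of such a heatmap with `stats::heatmap()` is given below. It runs on made-up data so that it is self-contained: `toy_counts`, `toy_group` and `de_genes` are hypothetical stand-ins for `counts`, `group` and the genes in `results.sig` (assuming the row names of `results.sig` are gene identifiers). The log transform and the colour choices are illustrative and not necessarily those used for the manuscript figure.

```r
set.seed(1)

# Toy stand-ins for `counts` (genes x cells) and `group`
toy_counts <- matrix(
  rnbinom(20 * 12, size = 2, mu = 20), nrow = 20,
  dimnames = list(paste0("gene", 1:20), paste0("cell", 1:12))
)
toy_group <- factor(rep(c(1, 2), each = 6))

# Hypothetical DE gene names; in practice, e.g. rownames(results.sig)
de_genes <- paste0("gene", 1:8)

# Log-transform the counts of the DE genes to compress the dynamic range
heat_data <- log2(toy_counts[de_genes, , drop = FALSE] + 1)

# Colour bar marking the group of each cell
group_colours <- c("steelblue", "tomato")[as.integer(toy_group)]

# Cluster genes and cells, and draw the heatmap
stats::heatmap(heat_data, ColSideColors = group_colours, scale = "none")
```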
248 | 249 | 250 | ## Interpretation of results 251 | 252 | For the interpretation of results when **`DEsingle`** is applied to real data, please refer to the *Three types of DE genes between E3 and E4 of human embryonic cells* section in the [*Supplementary Materials*](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty332/4983067#supplementary-data) of our [*manuscript*](https://doi.org/10.1093/bioinformatics/bty332). 253 | 254 | 255 | ## Help 256 | 257 | Use `browseVignettes("DEsingle")` to see the vignettes of **`DEsingle`** in R after installation. 258 | 259 | Use the following code in R to access the help documentation for **`DEsingle`**: 260 | 261 | ```{r help1, eval = FALSE} 262 | # Documentation for DEsingle 263 | ?DEsingle 264 | ``` 265 | 266 | ```{r help2, eval = FALSE} 267 | # Documentation for DEtype 268 | ?DEtype 269 | ``` 270 | 271 | ```{r help3, eval = FALSE} 272 | # Documentation for TestData 273 | ?TestData 274 | ?counts 275 | ?group 276 | ``` 277 | 278 | You are also welcome to view and post *DEsingle* tagged questions on the [Bioconductor Support Site of DEsingle](https://support.bioconductor.org/t/desingle/) or to contact the author by email for help. 279 | 280 | 281 | ## Author 282 | 283 | *Zhun Miao* <> 284 | 285 | MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China. 286 | 287 | 288 | ## Session info 289 | 290 | ```{r sessionInfo} 291 | sessionInfo() 292 | ``` 293 | 294 | -------------------------------------------------------------------------------- /vignettes/DEsingle_LOGO.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/miaozhun/DEsingle/3cf7d9b3b35d6282f782b8c02116a1b3efde2ba3/vignettes/DEsingle_LOGO.png --------------------------------------------------------------------------------