├── .gitignore
├── ALDEx_comparison.Rmd
├── ALDEx_comparison.pdf
├── Exploratory_biplot.Rmd
├── Exploratory_biplot.pdf
├── Gloor_bio.docx
├── Outline.Rmd
├── Outline.pdf
├── Outline.tex
├── README.txt
├── chunk
│   ├── R
│   │   ├── propr-functions.R
│   │   └── propr.R
│   ├── explore_schurch.R
│   └── setup.R
├── coda_seq.Rmd
├── coda_seq.pdf
├── data
│   ├── ERP004763_sample_mapping.tsv
│   ├── ak_vs_op.txt
│   ├── ak_vs_op_aldex.txt
│   ├── ak_vs_op_taxon.txt
│   ├── barton_agg.tsv
│   ├── countfinal2.tsv
│   ├── d.all.l
│   ├── filtered_table.txt
│   ├── mouth_genus.txt
│   ├── up_vs_op.txt
│   ├── up_vs_op_aldex.txt
│   └── up_vs_op_taxon.txt
├── download.jpg
├── effect.mbf.pdf
├── first_association.Rmd
├── first_association.pdf
├── make_interpret_biplot.Rmd
├── make_interpret_biplot.pdf
├── multi_comp.Rmd
├── multi_comp.pdf
├── prop_and_diffprop_with_propr.Rmd
├── prop_and_diffprop_with_propr.html
├── replicate_Evalue.pdf
├── replicate_lowEvalue.pdf
├── zero.Rmd
└── zero.pdf

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
.Rhistory

--------------------------------------------------------------------------------
/ALDEx_comparison.Rmd:
--------------------------------------------------------------------------------
---
title: "First comparison"
author: "gg"
date: '`r format(Sys.time(), "%d %B, %Y")`'
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
fig_caption: true
output:
  pdf_document:
    fig_caption: yes
---

To run this file:
Rscript -e "rmarkdown::render('ALDEx_comparison.Rmd')"

## Types of data

Comparison of 'differential abundance' is problematic for compositional data [@fernandes:2013;@fernandes:2014]. Since the apparent abundance of every value depends on the apparent abundance of every other value, we can get into real difficulties if we are not careful. Take the simple example where we have two samples. The samples contain the following counts for four features:

\begin{math}
A = [1000, 100, 50, 250]
\end{math} and
\begin{math}
B = [10, 500, 250, 1250]
\end{math}.

We want to answer the question: have the abundances of the features changed?

We sequence, and obtain a total count of about 100 (it is a first-generation machine!).

So we get:
\begin{math}A_s = [71, 7, 4, 18], B_s = [1, 25, 12, 62]
\end{math}

Note that these values appear to be very different between the groups. However, if we take one feature as a reference, say feature 4, and determine a ratio, i.e.:

\begin{math}
A_r = [ 71/18, 7/18, 4/18 ] = [ 3.9, 0.39, 0.22 ]
\end{math}

\begin{math}
B_r = [ 1/62, 25/62, 12/62 ] = [ 0.02, 0.40, 0.19 ]
\end{math}

Here we can see that if we assume one feature is constant (feature 4), then the second and third features are seen to be very similar in abundance. Now we can infer that the majority of change is in the first feature. We cannot compare the reference feature itself because it is assumed to be constant; that is, its assumed change is 0. This approach is the one used by ANCOM, a recently developed tool to assess change in microbiome datasets [@ancom:2015].

Since we cannot know which feature, if any, is constant, we can assume that a large number of the features exhibit only random change. Then, rather than using one feature as a reference, we can use the geometric mean abundance of all features.
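The arithmetic above is easy to verify. The short sketch below recomputes the reference-feature ratios, and also shows the equivalent calculation with the geometric mean of all features as the reference (the centred log-ratio used by ALDEx2); the only inputs are the example counts.

```{r, echo=TRUE, eval=FALSE}
# counts for the four features in the two underlying environments
A <- c(1000, 100, 50, 250)
B <- c(10, 500, 250, 1250)

# sequencing to a depth of ~100 reads keeps only the relative information
A_s <- c(71, 7, 4, 18)   # approximately round(A / sum(A) * 100)
B_s <- c(1, 25, 12, 62)  # approximately round(B / sum(B) * 100)

# ratios to the reference feature (feature 4)
A_s[1:3] / A_s[4]   # 3.94 0.39 0.22
B_s[1:3] / B_s[4]   # 0.02 0.40 0.19

# the same idea with the geometric mean of all features as the reference:
# the centred log-ratio (clr)
log2(A_s) - mean(log2(A_s))
log2(B_s) - mean(log2(B_s))
```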
Note: this approach works poorly if there are only a small number of features (fewer than about 50) or if the features are very asymmetrically distributed between groups. This approach is the one used by ALDEx2 [@fernandes:2013;@fernandes:2014], and is the method that we will use.

One complication is that a geometric mean cannot be determined if any of the values have a count of 0. For pairwise comparisons, ALDEx2 sidesteps this by drawing Monte-Carlo instances from a Dirichlet distribution with a uniform prior of 0.5 added to each count, so that no value in any instance is exactly 0.

```{r, echo=TRUE, eval=TRUE}
library(ALDEx2)

# read the dataset
d <- read.table("data/barton_agg.tsv", row.names=1, header=T)

# make a vector containing the two names of the conditions
# in the same order as the sample (row) names of d
d.conds <- c(rep("SNF", length(grep("SNF", rownames(d))) ),
    rep("WT", length(grep("WT", rownames(d)))) )

# generate Monte-Carlo instances of the probability of observing each count
# given the observed counts and the total read depth of each sample.
# use a prior of 0.5, corresponding to maximal uncertainty about the read count
# this returns a set of clr values, one for each mc instance
# this workflow can take several minutes

# note that the latest version of ALDEx2 requires conditions explicitly
d.x <- aldex.clr(t(d), conds=d.conds, mc.samples=128)

# calculate effect sizes for each mc instance, report the expected value
d.eff <- aldex.effect(d.x, d.conds, include.sample.summary=TRUE)

# perform parametric or non-parametric tests for difference
# report the expected value of the raw and BH-corrected P value
d.tt <- aldex.ttest(d.x, d.conds)

# concatenate everything into one data frame
x.all <- data.frame(d.eff, d.tt)
```
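Before plotting, it is worth a quick look at the output columns that the plots below depend on: the between-group difference (diff.btw), the within-group dispersion (diff.win), the mean clr abundance (rab.all), the standardized effect size (effect), and the expected raw and BH-corrected Wilcoxon p values (wi.ep, wi.eBH).

```{r, echo=TRUE, eval=FALSE}
# the columns of x.all used in the plots that follow
head(x.all[, c("diff.btw", "diff.win", "rab.all", "effect", "wi.ep", "wi.eBH")])
```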
We will display the results using a number of different plots to show how each plot gives a different way of exploring the data. The mainstay that we advocate is the effect plot [@Gloor:2015], which plots the constituents of normalized change, or effect size.

```{r, echo=FALSE, eval=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Plotted here are features with no difference between groups (grey), a statistically significant difference between groups (red), and an effect larger than 2 (blue circles). These are plotted using different plots (described clockwise from top left). The effect plot [@Gloor:2015] illustrates the difference between groups vs. the dispersion (variance) within groups. If the effect is greater than one (outside the grey lines), then on average the features are obviously separable by eye when plotted; roughly, they have a greater difference between groups than the pooled standard deviation. Effect is a more robust measure of difference than are P values, since the latter depend on sample size; large sample sizes will always give low P values [@Halsey:2015aa]. We can see this here, where the large sample size means that even highly variable genes are significantly different. The Bland-Altman plot [@altman:1983] compares difference and abundance, and is often seen in RNA-Seq data. The Volcano plot [@Cui:2003aa] shows the association between difference and P value, and the final plot shows the association between effect and P value."}

# x.all <- read.table("data/barton_agg.tsv", header=T, row.names=1)

# get the 'significant' set
sig <- x.all$wi.eBH < 0.05
eff <- abs(x.all$effect) > 2

# plot all in transparent grey
# low BH-corrected p values as red
# effect sizes > 2 as blue+red
par(mfrow=c(2,2))

plot(x.all$diff.win, x.all$diff.btw, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, ylim=c(-6,6), xlim=c(0,6), xlab="dispersion", ylab="difference",
    main="Effect plot")
points(x.all$diff.win[sig], x.all$diff.btw[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$diff.win[eff], x.all$diff.btw[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )
abline(0,1, lty=2, lwd=2, col=rgb(0,0,0,0.4))
abline(0,-1, lty=2, lwd=2, col=rgb(0,0,0,0.4))

plot(x.all$rab.all, x.all$diff.btw, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, ylim=c(-6,6), xlab="clr abundance", ylab="difference",
    main="Bland-Altman plot")
points(x.all$rab.all[sig], x.all$diff.btw[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$rab.all[eff], x.all$diff.btw[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )

plot(x.all$diff.btw, x.all$wi.ep, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, xlab="difference", ylab="log p value",
    main="Difference vs. p plot", log="y")
points(x.all$diff.btw[sig], x.all$wi.ep[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$diff.btw[eff], x.all$wi.ep[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )

plot(x.all$effect, x.all$wi.ep, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, xlab="effect", ylab="log p value",
    main="Effect vs. p plot", log="y")
points(x.all$effect[sig], x.all$wi.ep[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$effect[eff], x.all$wi.ep[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )

```

\newpage

The effect sizes can be understood as a measure of separability between groups for each feature. Plotted in the figure are features with different effect sizes; the corresponding adjusted p value is included. In RNA-seq data, we have found that an effect size cutoff between 1 and 2 [@macklaim:2013] gives reliable results in meta-transcriptome analyses.
```{r, echo=FALSE, eval=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Histograms showing the separation between groups when choosing features with differing effect sizes. Features with the largest effect are the most reliably different between groups, and should be chosen over those that are most significantly different whenever possible. Note that an effect size of 0.5 is more than sufficient to give a significant difference, but the separation between groups is marginal."}

par(mfrow=c(2,2))

hist(as.numeric(x.all["PWR1",4:51]), breaks=11, xlim=c(-8,-2), col=rgb(1,0,0,0.3),
    xlab="rAB PWR1", main="Effect: 0.5 wi.eBH: 4e-4", ylim=c(0,12) )
hist(as.numeric(x.all["PWR1",52:99]), add=T, breaks=20, col=rgb(0,0,1,0.3))

hist(as.numeric(x.all["NTS2.1",4:51]), breaks=20, col=rgb(1,0,0,0.3),
    xlab="rAB NTS2.1", main="Effect: 1.1 wi.eBH: 2e-11", ylim=c(0,15) )
hist(as.numeric(x.all["NTS2.1",52:99]), add=T, breaks=11, col=rgb(0,0,1,0.3))

hist(as.numeric(x.all["YPR145W",4:51]), xlim=c(2,6), breaks=11, col=rgb(1,0,0,0.3),
    xlab="rAB YPR145W", main="Effect: 2 wi.eBH: 6.5e-26", ylim=c(0,12) )
hist(as.numeric(x.all["YPR145W",52:99]), add=T, breaks=20, col=rgb(0,0,1,0.3))

hist(as.numeric(x.all["YOR290C",4:51]), xlim=c(-11,5), breaks=20, col=rgb(1,0,0,0.3),
    xlab="rAB YOR290C", main="Effect: 5.6 wi.eBH: 1e-26", ylim=c(0,40) )
hist(as.numeric(x.all["YOR290C",52:99]), add=T, breaks=3, col=rgb(0,0,1,0.3))
```

We can also plot these as strip-charts to see the overlap in variability.

```{r, echo=FALSE, eval=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Strip charts showing the separation between groups when choosing features with differing effect sizes."}

par(mfrow=c(2,2))

# one panel per gene; columns 4:51 and 52:99 hold the per-sample clr values
# for the two groups (from include.sample.summary=TRUE above)
genes <- c(PWR1="Effect: 0.5", NTS2.1="Effect: 1.1",
    YPR145W="Effect: 2", YOR290C="Effect: 5.6")

for(g in names(genes)){
    grp1 <- as.numeric(x.all[g, 4:51])
    grp2 <- as.numeric(x.all[g, 52:99])
    stripchart(list(grp1), pch=19, method="jitter", vertical=T,
        col=rgb(1,0,0,0.3), main=genes[[g]], ylim=range(c(grp1, grp2)))
    stripchart(list(grp2), pch=19, method="jitter", add=TRUE,
        vertical=T, col=rgb(0,0,1,0.3))
}
```

\newpage
# References


--------------------------------------------------------------------------------
/ALDEx_comparison.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/ALDEx_comparison.pdf
--------------------------------------------------------------------------------
/Exploratory_biplot.Rmd:
--------------------------------------------------------------------------------
---
title: "Exploratory biplot"
author: "gg"
date: '`r format(Sys.time(), "%d %B, %Y")`'
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
fig_caption: true
output:
  pdf_document:
    fig_caption: yes
---

To run this file:
Rscript -e "rmarkdown::render('Exploratory_biplot.Rmd')"

## R packages needed

`zCompositions, ppclust, factoextra, cluster, fclust` CRAN

## The dataset and first biplot

We will use as an example a transcriptome dataset [@Schurch:2016aa;@Gierlinski:2015aa] containing 96 samples, 48 each from the WT and the SNF2 knockout strain. These data have been filtered to include only those features that are present with a mean count of at least 0.1 across all samples.

The compositional biplot is the first exploratory data analysis tool that should be used whenever exploring a dataset. It shows, in one plot, the essence of your results. Do my samples separate into groups? Which features are driving this separation? Which features are irrelevant to the analysis?

Compositional biplots appear to be complex and intimidating, but with a little patience and practice they are easily interpretable [@aitchison2002biplots]. They are based on the variance of the ratios of the parts, and are substantially more informative than the commonly used PCoA plots, which are driven largely by abundance [@Gorvitovskaia:2016aa].

```{r outlier_function, echo=F}
codaSeq.outlier <- function(x, plot.me=TRUE, col=rgb(1,0,0,0.3)){

  pcx <- prcomp(x)
  mv <- sum(pcx$sdev^2)

  # proportion of the total variance contributed by each sample
  sample.var <- apply(pcx$x, 1, function(y){sum(y^2/mv)})

  # outlier cut: median plus twice the IQR of the per-sample contributions
  cut <- median(sample.var) + 2 * IQR(sample.var)

  bad <- names(which(sample.var > cut))
  good <- names(which(sample.var <= cut))
  if(plot.me == TRUE){
    hist(sample.var, breaks=100)
    boxplot(sample.var, horizontal=TRUE, col=col, add=TRUE)
    abline(v=cut, lty=2)
  }
  return(list(sample.var=sample.var, bad=bad, good=good))
}
```
```{r biplot, echo=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="The compositional biplot is the workhorse tool for CoDa. This plot summarizes the entire analysis in a qualitative manner. We can see that the SNF2 and WT samples separate very well, although the proportion of variance explained on component 1 is small. Furthermore, we can see the names of some of the genes that are driving this divide. Finally, component 1 has substantially more variance than does component 2, and we can explain this experiment as a simple two-part comparison with the largest variance along the axis of the comparison."}

# read in the dataset
# samples by row - now correct: thx VGalata
d.agg <- read.table("data/barton_agg.tsv", sep="\t", header=T, row.names=1)

# load the zCompositions library to perform 0 replacement
library(zCompositions)

# it is important to first filter to remove features (columns) that are
# exclusively 0 values
d.filt <- d.agg[, colSums(d.agg) > 0]

# we are using the Count Zero Multiplicative approach
d.n0 <- cmultRepl(d.filt, method="CZM", label=0)

# generate the centered log-ratio transformed data
# input is samples by row; the result has features by row
d.clr <- apply(d.n0, 1, function(x) log(x) - mean(log(x)))

# apply a singular value decomposition to the dataset
# do not use the princomp function in R!!
pcx <- prcomp(t(d.clr))

# get the labels for the first two components
PC1 <- paste("PC1: ", round(pcx$sdev[1]^2/sum(pcx$sdev^2),3), sep="")
PC2 <- paste("PC2: ", round(pcx$sdev[2]^2/sum(pcx$sdev^2),3), sep="")

par(fig=c(0,1,0,1), new=TRUE)
# draw the biplot in the left 80% of the plotting region
par(fig=c(0,0.8,0,1), new=TRUE)
biplot(pcx, cex=c(0.6,0.6), col=c("black", rgb(1,0,0,0.2)), var.axes=F, scale=0,
    xlab=PC1, ylab=PC2)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

# add a scree plot on the right
par(fig=c(0.8,1,0,1), new=TRUE)
plot(pcx, main="scree")

```

\newpage

### Rules for interpreting compositional biplots:

- All interpretations are up to the limit of the variance explained. We can think of this as a shadow of the multidimensional dataset (4545 dimensions!) projected onto two dimensions. If the variance explained is high (> 0.8) then the edges of the shadows are sharp; however, if the variance explained is low, as it is here, then we have little confidence in the exact placement of any individual sample or feature.

- The distance between samples is related to their multivariate similarity of the parts as ratios. If all components are relatively the same (i.e., the ratios between all parts are identical), then two samples are in the same location.

- We must interpret the features as ratios. Abundance information is not directly available on these plots.

- The distance and direction of a feature from the origin is the standard deviation of the ratio of that feature to the geometric mean of all features.

- The line between any set of features is called a link. Links that pass through more than one feature are permitted and do not change the interpretation.

- Short links indicate a constant or near-constant ratio between the two (or more) linked features in the dataset. This dataset is too complex to identify links easily (see the sketch below).

- Long links indicate a non-constant ratio between the joined features, and define a ratio relationship that can be inverse or random. There is no principled method to determine which is the case.
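The link rule is directly checkable in R: a short link between two features corresponds to a small variance of the log-ratio of those features. A minimal sketch, using the zero-replaced table `d.n0` from above; the column indices are placeholders for any pair of features of interest.

```{r, eval=FALSE}
# variance (over samples) of the log-ratio of two features;
# small values indicate a near-constant ratio (a short link),
# large values a long link
var(log(d.n0[, 1] / d.n0[, 2]))
```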
\newpage

## Finding outliers

We can see that there are a number of samples that appear to be outliers. Should we include SNF2.6 in the analysis or not? One of the messages of the Barton papers [@Schurch:2016aa;@Gierlinski:2015aa] was that about 10% of samples, even carefully prepared ones, can be outliers for unknown methodological reasons. We approach outliers by finding those samples that contribute more variance than expected to the total variance of the group. Outliers are defined as those samples whose contribution to the total variance of the group is greater than the median plus twice the interquartile range of the per-sample variance contributions.


```{r outlier, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outliers', fig.height=4, fig.width=4}
# get the outliers from each group. See the codaSeq.outlier function
# get WT indices
WT <- grep("WT", rownames(d.agg))
# subset
WT.agg <- d.agg[WT,]

# filter, samples by row
wt.gt0 <- WT.agg[, colSums(WT.agg) > 0]

# estimate 0 values (zCompositions)
# samples by row
wt.agg.n0 <- cmultRepl(wt.gt0, method="CZM", label=0)

# clr transform
wt.agg.n0.clr <- t(apply(wt.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.wt <- prcomp(wt.agg.n0.clr)
mvar.wt.clr <- sum(pcx.wt$sdev^2)

# plot
par(mfrow=c(1,1))
biplot(pcx.wt, var.axes=FALSE, scale=0, cex=c(1,.05))

# make a list of names to keep, found in $good
WT.g <- codaSeq.outlier(wt.agg.n0.clr, plot.me=TRUE)

# repeat for the SNF2 group
SNF <- grep("SNF", rownames(d.agg))
# subset
SNF.agg <- d.agg[SNF,]

# filter, samples by row
SNF.gt0 <- SNF.agg[, colSums(SNF.agg) > 0]

# estimate 0 values (zCompositions)
# samples by row
SNF.agg.n0 <- cmultRepl(SNF.gt0, method="CZM", label=0)

# clr transform
SNF.agg.n0.clr <- t(apply(SNF.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.SNF <- prcomp(SNF.agg.n0.clr)
mvar.SNF.clr <- sum(pcx.SNF$sdev^2)

# plot
par(mfrow=c(1,1))
biplot(pcx.SNF, var.axes=FALSE, scale=0, cex=c(1,.05))

# make a list of names to keep, found in $good
SNF.g <- codaSeq.outlier(SNF.agg.n0.clr, plot.me=TRUE)

```
\clearpage
\newpage

## Biplot of non-outlier samples only

Now we can make a biplot of only those samples that are non-outliers. We see that the SNF2 KO group is more homogeneous than the WT group. This is almost certainly because the SNF2 group is clonal, while the WT group was likely grown from a frozen culture.
```{r good_data_pca, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='non-outlier samples', fig.height=7, fig.width=7}

# make a dataset of only the non-outlier samples
d.good <- rbind(d.agg[SNF.g$good,], d.agg[WT.g$good,])

# filter
#d.good.gt0 <- codaSeq.filter(d.good, min.count=1, samples.by.row=TRUE)
d.good.gt0 <- d.good[, colSums(d.good) > 0]

# estimate 0 values (zCompositions)
d.good.agg.n0 <- cmultRepl(d.good.gt0, method="CZM", label=0)

# clr transform
d.good.agg.n0.clr <- t(apply(d.good.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.good <- prcomp(d.good.agg.n0.clr)
mvar.good <- sum(pcx.good$sdev^2)
# get the labels for the first two components
PC1.g <- paste("PC1: ", round(pcx.good$sdev[1]^2/sum(pcx.good$sdev^2),3), sep="")
PC2.g <- paste("PC2: ", round(pcx.good$sdev[2]^2/sum(pcx.good$sdev^2),3), sep="")

# plot and save
par(mfrow=c(1,1))
biplot(pcx.good, var.axes=FALSE, scale=0, cex=c(1,.5), xlab=PC1.g, ylab=PC2.g)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

# optionally save
# write.table(d.good.gt0, file="data/filtered_table.txt", sep="\t", quote=F, col.names=NA)
```
\clearpage

## Biplot of the removed samples

We can also examine those samples that were deemed to be outliers. Here we can see that the outlier samples have the axis of the experiment on PC2, which means that whatever problems led to these samples being outliers overwhelm the actual signal in the dataset. This confirms that these samples should be removed, as they add either noise or some systematic distortion to the dataset.

```{r bad_data_pca, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outlier samples', fig.height=7, fig.width=7}

# make a dataset of only the outlier samples
d.bad <- rbind(d.agg[SNF.g$bad,], d.agg[WT.g$bad,])

# filter
d.bad.gt0 <- d.bad[, colSums(d.bad) > 0]

# estimate 0 values (zCompositions)
d.bad.agg.n0 <- cmultRepl(d.bad.gt0, method="CZM", label=0)

# clr transform
d.bad.agg.n0.clr <- t(apply(d.bad.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.bad <- prcomp(d.bad.agg.n0.clr)
mvar.bad <- sum(pcx.bad$sdev^2)
# get the labels for the first two components
PC1.g <- paste("PC1: ", round(pcx.bad$sdev[1]^2/sum(pcx.bad$sdev^2),3), sep="")
PC2.g <- paste("PC2: ", round(pcx.bad$sdev[2]^2/sum(pcx.bad$sdev^2),3), sep="")

# plot
par(mfrow=c(1,1))
biplot(pcx.bad, var.axes=FALSE, scale=0, cex=c(1,.5), xlab=PC1.g, ylab=PC2.g)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
```

\clearpage

## Additional filtering does not change the conclusions

We can do additional filtering. Examining the features, most contribute little, if anything, to the separation. These can be removed by filtering out low-variance features. Note that we lose some resolution, but we recapitulate the dataset with only half the features. We could do this iteratively.
```{r lowvar, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='low-variance features removed', fig.height=7, fig.width=7}
# keep only the features with above-median variance in the clr-transformed data
var.clr <- apply(d.good.agg.n0.clr, 2, var)
nms <- which(var.clr > median(var.clr))

# filter
d.lv <- d.good[, names(nms)]

# estimate 0 values (zCompositions)
d.lv.agg.n0 <- cmultRepl(d.lv, method="CZM", label=0)

# clr transform
d.lv.agg.n0.clr <- t(apply(d.lv.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.lv <- prcomp(d.lv.agg.n0.clr)
mvar.lv <- sum(pcx.lv$sdev^2)
# get the labels for the first two components
PC1.lv <- paste("PC1: ", round(pcx.lv$sdev[1]^2/sum(pcx.lv$sdev^2),3), sep="")
PC2.lv <- paste("PC2: ", round(pcx.lv$sdev[2]^2/sum(pcx.lv$sdev^2),3), sep="")

# plot
par(mfrow=c(1,1))
biplot(pcx.lv, var.axes=FALSE, scale=0, cex=c(1,.5), xlab=PC1.lv, ylab=PC2.lv)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
```
\clearpage

## FUZZY CLUSTERING

We can plot the samples according to their fuzzy c-means cluster membership. For this we are using the fuzzy clustering package ppclust [@fuzzy:2018]. There is a good introduction to fuzzy clustering in [@fernandez:2012]. Essentially, we are using a probabilistic (or possibilistic) approach to determine the number of clusters and the cluster memberships. The vignette for this approach is at: https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html. As noted in the workshop, we get two clusters if we choose centers=2 or centers=3, but the SNF2 and WT groups split if we choose centers=4; a sketch for comparing these choices follows the code below.

```{r fuzzy}
library(ppclust)
library(factoextra)
library(cluster)
library(fclust)


res.fcm <- fcm(d.lv.agg.n0.clr, centers=2)
#as.data.frame(res.fcm$u)
#summary(res.fcm)

res.fcm2 <- ppclust2(res.fcm, "kmeans")

factoextra::fviz_cluster(res.fcm2, data = d.lv.agg.n0.clr,
    ellipse.type = "norm", labelsize=10, palette = "jco",
    repel = TRUE)
```
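The choice of centers above was made by inspection. One way to compare candidate cluster numbers more formally is to score each fuzzy partition with a cluster validity index; the sketch below is our suggestion (not part of the original workshop code) and uses the fuzzy silhouette index `SIL.F` from the fclust package loaded above, where higher values indicate a crisper partition.

```{r fuzzy_centers, eval=FALSE}
# refit fuzzy c-means for 2 to 4 centers and score each partition with
# the fuzzy silhouette index; res.k$u is the membership matrix
for (k in 2:4) {
    res.k <- fcm(d.lv.agg.n0.clr, centers=k)
    cat("centers =", k,
        "fuzzy silhouette =", round(SIL.F(d.lv.agg.n0.clr, res.k$u), 3), "\n")
}
```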
## References

--------------------------------------------------------------------------------
/Exploratory_biplot.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/Exploratory_biplot.pdf
--------------------------------------------------------------------------------
/Gloor_bio.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/Gloor_bio.docx
--------------------------------------------------------------------------------
/Outline.Rmd:
--------------------------------------------------------------------------------
---
title: "CoDa HTS Workshop Proposal"
author: "Greg Gloor"
output:
  pdf_document:
    fig_caption: yes
    includes:
      in_header: /Users/ggloor/Documents/0_git/templates/header.tex
    keep_tex: yes
    pandoc_args:
    - -V
    - classoption=twocolumn
    toc: yes
    toc_depth: 3
geometry: margin=2cm
csl: /Users/ggloor/Documents/0_git/csl_styles/frontiers.csl
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
---

## Analyzing data as compositions

Website: https://github.com/ggloor/CoDa_microbiome_tutorial. This will serve as the central repository for the demonstrated tools and workflows. The repository will be set as release 2.0 at the end of the workshop so that participants will have a permanent public record of what was covered.

The Powerpoint presentation can be found at: https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx.

We have adapted and developed tools and protocols for the analysis of HTS as compositional count data [@fernandes:2013;@fernandes:2014;@Quinn206425;@erb:2016]. Analyses conducted under this paradigm are reproducible and robust, and allow conclusions about the relative relationships between features (genes, OTUs, etc) in the underlying environment [@bian:2017;@gloorFrontiers:2017].

It is possible to replace almost all steps in traditional RNA-seq, metagenomics or 16S rRNA gene sequencing analysis with compositionally appropriate methods [@gloorFrontiers:2017] that are robust to data manipulations and that provide reproducible insights into the underlying biology and composition of the system.

## Objectives and outcomes

The workshop will enable participants to:

1. be able to identify when biological datasets are compositional, and understand the root problems that arise when interrogating compositional datasets.

2. understand why HTS data should be analyzed in a compositionally-appropriate framework.

3. know how to install, use and interpret the output from the basic HTS compositional toolkit that consists of compositional biplots, the `propr R` package and the `ALDEx2 R` package.

4. have a frame of reference for more complex compositional tools such as `philr` and concepts such as b-association and balance dendrograms.

## Outline

The workshop will be delivered as mixed didactic and participation sessions, with about a 1:4 mixture. Each session will be introduced by a short didactic introduction and demonstration. The remainder of the session will be hands-on learning exercises in the `R` programming environment.

We will demonstrate a test dataset [@Schurch:2016aa;@Gierlinski:2015aa] from the lab of Dr. Geoffrey Barton that examined the effect of a SNF2 gene knockout on \emph{Saccharomyces cerevisiae} transcription. This dataset is nearly ideal and simple to understand. However, participants are invited (expected) to bring their own dataset in the form of a count table with associated metadata for examination.
The outline of this 1-day workshop is:

### Start time 9am - Introduction (Gloor, didactic lecture)

https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx

- demonstrate and understand the geometry of high throughput sequencing data and how this constrains the analyses
- demonstrate the pathologies associated with HTS data analyzed using standard methods
- enable participants to understand why and when the usual methods of analysis are likely to be misleading
- understand the importance of subcompositional coherence and subcompositional dominance, and how these concepts lead to robust analyses

### Start time 9:45 - Probabilities and ratio transformations (Gloor, hands on)

source: zero.Rmd

- provide an overview of sequencing as a probabilistic process, and the manipulation of probability vectors using compositional data methods
- how to generate probability distributions from count data using ALDEx2
- how to generate and interpret compositionally appropriate data transformations
- zero replacement strategies for sparse data with the zCompositions R package
- why count normalization is futile

### Break 10:30

### Start time 11 - Dimension reduction, outlier identification and clustering (Gloor, hands on)

source: make_interpret_biplot.Rmd, Exploratory_biplot.Rmd

- demonstrate dimension reduction of compositional data
- the production and interpretation of a compositional PCA biplot
- identifying outlier samples
- learn how to conduct and interpret clustering and discriminant analysis in compositional data
- fuzzy clustering


### Start time 12 - Correlation and compositional association (Erb)

source: prop_and_diffprop_with_propr

- demonstrate compositionally appropriate identification of correlated (compositionally associated) features using the `propr R` package [@Quinn:2017]
- an introduction to compositional association

### Start time 1 - Lunch break


### Start time 2 - Correlation and compositional association, continued (Erb)

### Start time 2:30 - Differential abundance with ALDEx2 (Gloor)

source: ALDEx_comparison.Rmd

- demonstrate compositionally appropriate identification of differentially relatively abundant features using the `ALDEx2 R` package
- learn how to generate and interpret posterior expected values for differential relative abundance
- learn how to generate and use standardized effect sizes for differential relative abundance
- learn how to interpret effect plots as an adjunct to volcano and Bland-Altman plots

### Start time 3:30 - Working with users' data (Gloor, Erb)

- analyzing users' own data
- troubleshooting users' own datasets
- common problems from the participants will be highlighted and solutions demonstrated

### Start time 4:30 - Wrapup (Gloor, Erb)

- review of concepts and strategies
- understand the congruence between the results obtained by the compositional biplot, compositional association and compositional differential relative abundance
- provide guidance and sources on the proper interpretation of HTS datasets using a compositional paradigm

### Finish time 5 pm

## Requirements

1. a reasonably up-to-date laptop computer with at least 8 GB RAM
2. familiarity with scripting or programming languages, proficiency in the `R` programming environment

3. the current version of the `R` programming language installed

4. a number of `R` packages will be used during the workshop. Participants should be familiar with installation of packages from both `Bioconductor` and `CRAN`

## Intended Audience and Level

The intended audience for this session is bioinformaticians or computational biologists who use high throughput sequencing with experimental designs that include tag sequencing (eg. 16S rRNA gene sequencing), metagenomics, transcriptomics or meta-transcriptomics.

This is not intended to be an introduction to R for bioinformaticians: attendees should be relatively proficient with R, either using RStudio or on the command line, and should have a plain text editor available. Attendees will use R markdown documents to keep track of their work, and templates will be provided for use. Attendees will be expected to have a laptop with R installed and the following packages and their dependencies: propr (CRAN), ALDEx2 (Bioconductor), omicplotR (Bioconductor), zCompositions (CRAN). Attendees are encouraged to bring their own datasets for analysis, but should be aware that only pairwise (i.e., two condition) experiments will be demonstrated.

Compositional concepts will be at an introductory-intermediate level suitable for participants of any background, but will be more intuitive to those with a grounding in probability and linear algebra.

The practical aspects will be at an intermediate level, suitable for participants with pre-existing competency in `R`.

Attendance should be capped at no more than 40 participants.


## Organizers and Presenters

Greg Gloor is a Professor of Biochemistry at The University of Western Ontario. He is one of the pioneers in using compositional data analysis to analyze HTS datasets. He is the maintainer of the `ALDEx2 R` package on Bioconductor used for differential relative abundance analysis. He has published original research, methods papers, and reviews that use compositional data analysis methods to interpret HTS datasets using transcriptome, microbiome and meta-transcriptome datasets [@bian:2017;@gloorFrontiers:2017;@Wolfs:2016aa;@gloorAJS:2016;@Gloor:2016cjm;@gloor2016s;@gloor:effect;@Goneau:2015ab;@McMillan:2015aa;@fernandes:2014;@macklaim:2013;@fernandes:2013]. He has taught undergraduate and graduate courses in computational biology for almost two decades, and has won awards from both student groups and from faculty-wide competitions. His homepage and CV are at ggloor.github.io.

Ionas Erb is a postdoctoral fellow and bioinformatician at the Centre for Genomic Regulation. He is an active developer of tools to determine compositional association and is a contributor to the `propr R` package on CRAN used to explore correlation in a compositionally appropriate manner.
He is an advocate for, and active developer of, compositionally-appropriate methods to examine correlation [@Quinn206425;@Erb134536;@erb:2016].

## References

--------------------------------------------------------------------------------
/Outline.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/Outline.pdf
--------------------------------------------------------------------------------
/Outline.tex:
--------------------------------------------------------------------------------
\documentclass[twocolumn]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\else % if luatex or xelatex
  \ifxetex
    \usepackage{mathspec}
  \else
    \usepackage{fontspec}
  \fi
  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\usepackage[margin=2cm]{geometry}
\usepackage{hyperref}
\hypersetup{unicode=true,
            pdftitle={CoDa HTS Workshop Proposal},
            pdfauthor={Greg Gloor},
            pdfborder={0 0 0},
            breaklinks=true}
\urlstyle{same}  % don't use monospace font for urls
\usepackage{graphicx,grffile}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
}
\setlength{\emergencystretch}{3em}  % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{0}
% Redefines (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi

%%% Use protect on footnotes to avoid problems with footnotes in titles
\let\rmarkdownfootnote\footnote%
\def\footnote{\protect\rmarkdownfootnote}

%%% Change title format to be more compact
\usepackage{titling}

% Create subtitle command for use in maketitle
\newcommand{\subtitle}[1]{
  \posttitle{
    \begin{center}\large#1\end{center}
    }
}

\setlength{\droptitle}{-2em}
\title{CoDa HTS Workshop Proposal}
\pretitle{\vspace{\droptitle}\centering\huge}
\posttitle{\par}
\author{Greg Gloor}
\preauthor{\centering\large\emph}
\postauthor{\par}
\date{}
\predate{}\postdate{}

\usepackage{geometry}
\usepackage{amsmath}
\newcommand{\ith}[1]{ #1\textsuperscript{th}\ }
\newcommand{\vect}[1]{\vec{\textbf{#1}}}
\setlength{\columnsep}{18pt}

\setlength\textwidth{5.5in}
\setlength\marginparwidth{1.5in}

\begin{document}
\maketitle

{
\setcounter{tocdepth}{3}
\tableofcontents
}
\hypertarget{analyzing-data-as-compositions}{%
\subsection{Analyzing data as
compositions}\label{analyzing-data-as-compositions}}

Website: \url{https://github.com/ggloor/CoDa_microbiome_tutorial}. This
will serve as the central repository for the demonstrated tools and
workflows. The repository will be set as release 2.0 at the end of the
workshop so that participants will have a permanent public record of
what was covered.

The Powerpoint presentation can be found at:
\url{https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx}.

We have adapted and developed tools and protocols for the analysis of
HTS as compositional count data (Erb and Notredame, 2016; Fernandes et
al., 2013, 2014; Quinn et al., 2017a). Analyses conducted under this
paradigm are reproducible and robust, and allow conclusions about the
relative relationships between features (genes, OTUs, etc) in the
underlying environment (Bian et al.; Gloor et al., 2017).

It is possible to replace almost all steps in traditional RNA-seq,
metagenomics or 16S rRNA gene sequencing analysis with compositionally
appropriate methods (Gloor et al., 2017) that are robust to data
manipulations and that provide reproducible insights into the underlying
biology and composition of the system.

\hypertarget{objectives-and-outcomes}{%
\subsection{Objectives and outcomes}\label{objectives-and-outcomes}}

The workshop will enable participants to:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
  be able to identify when biological datasets are compositional, and
  understand the root problems that arise when interrogating
  compositional datasets.
\item
  understand why HTS data should be analyzed in a
  compositionally-appropriate framework.
\item
  know how to install, use and interpret the output from the basic HTS
  compositional toolkit that consists of compositional biplots, the
  \texttt{propr\ R} package and the \texttt{ALDEx2\ R} package.
\item
  have a frame of reference for more complex compositional tools such as
  \texttt{philr} and concepts such as b-association and balance
  dendrograms.
\end{enumerate}

\hypertarget{outline}{%
\subsection{Outline}\label{outline}}

The workshop will be delivered as mixed didactic and participation
sessions, with about a 1:4 mixture. Each session will be introduced by a
short didactic introduction and demonstration. The remainder of the
session will be hands-on learning exercises in the \texttt{R}
programming environment.
We will demonstrate a test dataset (Schurch et al., 2016; Gierliński et
al., 2015) from the lab of Dr.~Geoffrey Barton that examined the effect
of a SNF2 gene knockout on \emph{Saccharomyces cerevisiae}
transcription. This dataset is nearly ideal and simple to understand.
However, participants are invited (expected) to bring their own dataset
in the form of a count table with associated metadata for examination.

The outline of this 1-day workshop is:

\hypertarget{start-time-9am---introduction-gloor-didactic-lecture}{%
\subsubsection{Start time 9am - Introduction (Gloor, didactic
lecture)}\label{start-time-9am---introduction-gloor-didactic-lecture}}

\url{https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx}

\begin{itemize}
\tightlist
\item
  demonstrate and understand the geometry of high throughput sequencing
  data and how this constrains the analyses

  \begin{itemize}
  \tightlist
  \item
    demonstrate the pathologies associated with HTS data analyzed using
    standard methods
  \item
    enable participants to understand why and when the usual methods of
    analysis are likely to be misleading
  \item
    understand the importance of subcompositional coherence and
    subcompositional dominance, and how these concepts lead to robust
    analyses
  \end{itemize}
\end{itemize}

\hypertarget{start-time-945---probabilities-and-ratio-transformations-gloor-hands-on}{%
\subsubsection{Start time 9:45 - Probabilities and ratio transformations
(Gloor, hands
on)}\label{start-time-945---probabilities-and-ratio-transformations-gloor-hands-on}}

\begin{itemize}
\tightlist
\item
  provide an overview of sequencing as a probabilistic process, and the
  manipulation of probability vectors using compositional data methods

  \begin{itemize}
  \tightlist
  \item
    how to generate probability distributions from count data using
    ALDEx2
  \item
    how to generate and interpret compositionally appropriate data
    transformations
  \item
    zero replacement strategies for sparse data with the zCompositions R
    package
  \item
    why count normalization is futile
  \end{itemize}
\end{itemize}

\hypertarget{break-1030}{%
\subsubsection{Break 10:30}\label{break-1030}}

\hypertarget{start-time-11---dimension-reduction-outlier-identification-and-clustering-gloor-hands-on}{%
\subsubsection{Start time 11 - Dimension reduction, outlier
identification and clustering (Gloor, hands
on)}\label{start-time-11---dimension-reduction-outlier-identification-and-clustering-gloor-hands-on}}

\begin{itemize}
\tightlist
\item
  demonstrate dimension reduction of compositional data

  \begin{itemize}
  \tightlist
  \item
    the production and interpretation of a compositional PCA biplot
  \item
    identifying outlier samples
  \item
    learn how to conduct and interpret clustering and discriminant
    analysis in compositional data
  \item
    fuzzy clustering
  \end{itemize}
\end{itemize}

\hypertarget{start-time-12-correlation-and-compositional-association-erb}{%
\subsubsection{Start time 12: Correlation and compositional
association
(Erb)}\label{start-time-12-correlation-and-compositional-association-erb}}

\begin{itemize}
\tightlist
\item
  demonstrate compositionally appropriate identification of correlated
  (compositionally associated) features using the \texttt{propr\ R}
  package (Quinn et al., 2017b)

  \begin{itemize}
  \tightlist
  \item
    an introduction to compositional association
  \end{itemize}
\end{itemize}

\hypertarget{start-time-1-lunch-break}{%
\subsubsection{Start time: 1 lunch
break}\label{start-time-1-lunch-break}}

\hypertarget{start-time-2---correlation-and-compositional-association-continued-erb}{%
\subsubsection{Start time 2: Correlation and compositional
association continued
(Erb)}\label{start-time-2---correlation-and-compositional-association-continued-erb}}

\hypertarget{start-time-230-differential-abundance-with-aldex2-gloor}{%
\subsubsection{Start time 2:30 Differential abundance with ALDEx2
(Gloor)}\label{start-time-230-differential-abundance-with-aldex2-gloor}}

\begin{itemize}
\tightlist
\item
  demonstrate compositionally appropriate identification of
  differentially relatively abundant features using the
  \texttt{ALDEx2\ R} package

  \begin{itemize}
  \tightlist
  \item
    learn how to generate and interpret posterior expected values for
    differential relative abundance
  \item
    learn how to generate and use standardized effect sizes for
    differential relative abundance
  \item
    learn how to interpret effect plots as an adjunct to volcano and
    Bland-Altman plots
  \end{itemize}
\end{itemize}

\hypertarget{start-time-330---working-with-users-data-gloor-erb}{%
\subsubsection{Start time: 3:30 - Working with users' data (Gloor,
Erb)}\label{start-time-330---working-with-users-data-gloor-erb}}

\begin{itemize}
\tightlist
\item
  analyzing users' own data
\item
  troubleshooting users' own datasets
\item
  common problems from the participants will be highlighted and
  solutions demonstrated
\end{itemize}

\hypertarget{start-time-430--wrapup-gloor-erb}{%
\subsubsection{Start time: 4:30- Wrapup (Gloor,
Erb)}\label{start-time-430--wrapup-gloor-erb}}

\begin{itemize}
\tightlist
\item
  review of concepts and strategies
\item
  understand the congruence between the results obtained by the
  compositional biplot, compositional association and compositional
  differential relative abundance
\item
  provide guidance and sources on the proper interpretation of HTS
  datasets using a compositional paradigm
\end{itemize}

\hypertarget{finish-time-5-pm}{%
\subsubsection{Finish time 5 pm}\label{finish-time-5-pm}}

\hypertarget{requirements}{%
\subsection{Requirements}\label{requirements}}

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
  a reasonably up-to-date laptop computer with at least 8 GB RAM
\item
  familiarity with scripting or programming languages, proficiency in
  the \texttt{R} programming environment
\item
  the current version of the \texttt{R} programming language installed
\item
  a number of \texttt{R} packages will be used during the workshop.
  Participants should be familiar with installation of packages from
  both \texttt{Bioconductor} and \texttt{CRAN}
\end{enumerate}

\hypertarget{intended-audience-and-level}{%
\subsection{Intended Audience and
Level}\label{intended-audience-and-level}}

The intended audience for this session is bioinformaticians or
computational biologists who use high throughput sequencing with
experimental designs that include tag sequencing (eg. 16S rRNA gene
sequencing), metagenomics, transcriptomics or meta-transcriptomics.

This is not intended to be an introduction to R for bioinformaticians:
attendees should be relatively proficient with R, either using RStudio,
or on the command line and should have a plain text editor available.
Attendees will use R markdown documents to keep track of their work, and
templates will be provided for use. Attendees will be expected to have a
laptop with R installed and the following packages and their
dependencies: propr (CRAN), ALDEx2 (Bioconductor), omicplotR
(Bioconductor), zCompositions (CRAN). Attendees are encouraged to bring
their own datasets for analysis, but should be aware that only pairwise
(i.e., two condition) experiments will be demonstrated.

Compositional concepts will be at an introductory-intermediate level
suitable for participants of any background, but will be more intuitive
to those with a grounding in probability and linear algebra.

The practical aspects will be at an intermediate level, suitable for
participants with pre-existing competency in \texttt{R}.

Attendance should be capped at no more than 40 participants.

\hypertarget{organizers-and-presenters}{%
\subsection{Organizers and Presenters}\label{organizers-and-presenters}}

Greg Gloor is a Professor of Biochemistry at The University of Western
Ontario. He is one of the pioneers in using compositional data analysis
to analyze HTS datasets. He is the maintainer of the \texttt{ALDEx2\ R}
package on Bioconductor used for differential relative abundance
analysis. He has published original research, methods papers, and
reviews that use compositional data analysis methods to interpret HTS
datasets using transcriptome, microbiome and meta-transcriptome datasets
(Bian et al.; Fernandes et al., 2013, 2014; Gloor et al., 2016a, 2017,
2016b, 2016c; Gloor and Reid, 2016; Goneau et al., 2015; Macklaim et
al., 2013; McMillan et al., 2015; Wolfs et al., 2016). He has taught
undergraduate and graduate courses in computational biology for almost
two decades, and has won awards from both student groups and from
faculty-wide competitions. His homepage and CV are at ggloor.github.io

Ionas Erb is a postdoctoral fellow and bioinformatician at the Centre
for Genomic Regulation. He is an active developer of tools to determine
compositional association and is a contributor to the \texttt{propr\ R}
package on CRAN used to explore correlation in a compositionally
appropriate manner.
He is an advocate for, and active developer of,
compositionally-appropriate methods to examine correlation (Erb and
Notredame, 2016; Erb et al., 2017; Quinn et al., 2017a).

\hypertarget{references}{%
\subsection*{References}\label{references}}
\addcontentsline{toc}{subsection}{References}

\hypertarget{refs}{}
\leavevmode\hypertarget{ref-bian:2017}{}%
Bian, G., Gloor, G. B., Gong, A., Jia, C., Zhang, W., Hu, J., et al. The
gut microbiota of healthy aged Chinese is similar to that of the healthy
young. \emph{mSphere} 2, e00327--17.
doi:\href{https://doi.org/10.1128/mSphere.00327-17}{10.1128/mSphere.00327-17}.

\leavevmode\hypertarget{ref-erb:2016}{}%
Erb, I., and Notredame, C. (2016). How should we measure proportionality
on relative gene expression data? \emph{Theory in Biosciences} 135,
21--36.

\leavevmode\hypertarget{ref-Erb134536}{}%
Erb, I., Quinn, T., Lovell, D., and Notredame, C. (2017). Differential
proportionality - a normalization-free approach to differential gene
expression. \emph{bioRxiv}.
doi:\href{https://doi.org/10.1101/134536}{10.1101/134536}.

\leavevmode\hypertarget{ref-fernandes:2013}{}%
Fernandes, A. D., Macklaim, J. M., Linn, T. G., Reid, G., and Gloor, G.
B. (2013). ANOVA-like differential expression (ALDEx) analysis for mixed
population RNA-seq. \emph{PLoS One} 8, e67019.
doi:\href{https://doi.org/10.1371/journal.pone.0067019}{10.1371/journal.pone.0067019}.

\leavevmode\hypertarget{ref-fernandes:2014}{}%
Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A.,
Edgell, D. R., and Gloor, G. B. (2014). Unifying the analysis of
high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA
gene sequencing and selective growth experiments by compositional data
analysis. \emph{Microbiome} 2, 15.1--15.13.
doi:\href{https://doi.org/10.1186/2049-2618-2-15}{10.1186/2049-2618-2-15}.

\leavevmode\hypertarget{ref-gloor:effect}{}%
Gloor, G. B., Macklaim, J. M., and Fernandes, A. D. (2016a). Displaying
variation in large datasets: Plotting a visual summary of effect sizes.
\emph{Journal of Computational and Graphical Statistics} 25, 971--979.
doi:\href{https://doi.org/10.1080/10618600.2015.1131161}{10.1080/10618600.2015.1131161}.

\leavevmode\hypertarget{ref-gloorFrontiers:2017}{}%
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J.
(2017). Microbiome datasets are compositional: And this is not optional.
\emph{Frontiers in Microbiology} 8, 2224.
doi:\href{https://doi.org/10.3389/fmicb.2017.02224}{10.3389/fmicb.2017.02224}.

\leavevmode\hypertarget{ref-gloorAJS:2016}{}%
Gloor, G. B., Macklaim, J. M., Vu, M., and Fernandes, A. D. (2016b).
Compositional uncertainty should not be ignored in high-throughput
sequencing data analysis. \emph{Austrian Journal of Statistics} 45,
73--87.
doi:\href{https://doi.org/doi:10.17713/ajs.v45i4.122}{doi:10.17713/ajs.v45i4.122}.

\leavevmode\hypertarget{ref-Gloor:2016cjm}{}%
Gloor, G. B., and Reid, G. (2016). Compositional analysis: A valid
approach to analyze microbiome high-throughput sequencing data.
\emph{Can J Microbiol} 62, 692--703.
doi:\href{https://doi.org/10.1139/cjm-2015-0821}{10.1139/cjm-2015-0821}.
470 |
471 | \leavevmode\hypertarget{ref-gloor2016s}{}%
472 | Gloor, G. B., Wu, J. R., Pawlowsky-Glahn, V., and Egozcue, J. J.
473 | (2016c). It's all relative: Analyzing microbiome data as compositions.
474 | \emph{Ann Epidemiol} 26, 322--9.
475 | doi:\href{https://doi.org/10.1016/j.annepidem.2016.03.003}{10.1016/j.annepidem.2016.03.003}.
476 |
477 | \leavevmode\hypertarget{ref-Goneau:2015ab}{}%
478 | Goneau, L. W., Hannan, T. J., MacPhee, R. A., Schwartz, D. J., Macklaim,
479 | J. M., Gloor, G. B., et al. (2015). Subinhibitory antibiotic therapy
480 | alters recurrent urinary tract infection pathogenesis through modulation
481 | of bacterial virulence and host immunity. \emph{MBio} 6.
482 | doi:\href{https://doi.org/10.1128/mBio.00356-15}{10.1128/mBio.00356-15}.
483 |
484 | \leavevmode\hypertarget{ref-macklaim:2013}{}%
485 | Macklaim, J. M., Fernandes, A. D., Di Bella, M. J., Hammond, J.-A.,
486 | Reid, G., and Gloor, G. B. (2013). Comparative meta-RNA-seq of the
487 | vaginal microbiota and differential expression by \emph{Lactobacillus
488 | iners} in health and dysbiosis. \emph{Microbiome} 1, 15.
489 | doi:\href{https://doi.org/10.1186/2049-2618-1-12}{10.1186/2049-2618-1-12}.
490 |
491 | \leavevmode\hypertarget{ref-McMillan:2015aa}{}%
492 | McMillan, A., Rulisa, S., Sumarah, M., Macklaim, J. M., Renaud, J.,
493 | Bisanz, J. E., et al. (2015). A multi-platform metabolomics approach
494 | identifies highly specific biomarkers of bacterial diversity in the
495 | vagina of pregnant and non-pregnant women. \emph{Sci Rep} 5, 14174.
496 | doi:\href{https://doi.org/10.1038/srep14174}{10.1038/srep14174}.
497 |
498 | \leavevmode\hypertarget{ref-Quinn206425}{}%
499 | Quinn, T. P., Erb, I., Richardson, M. F., and Crowley, T. M. (2017a).
500 | Understanding sequencing data as compositions: An outlook and review.
501 | \emph{bioRxiv}.
502 | doi:\href{https://doi.org/10.1101/206425}{10.1101/206425}.
503 |
504 | \leavevmode\hypertarget{ref-Quinn:2017}{}%
505 | Quinn, T., Richardson, M. F., Lovell, D., and Crowley, T. (2017b).
506 | propr: An R package for identifying proportionally abundant features
507 | using compositional data analysis. \emph{bioRxiv}.
508 | doi:\href{https://doi.org/10.1101/104935}{10.1101/104935}.
509 |
510 | \leavevmode\hypertarget{ref-Wolfs:2016aa}{}%
511 | Wolfs, J. M., Hamilton, T. A., Lant, J. T., Laforet, M., Zhang, J.,
512 | Salemi, L. M., et al. (2016). Biasing genome-editing events toward
513 | precise length deletions with an RNA-guided TevCas9 dual nuclease.
514 | \emph{Proc Natl Acad Sci U S A}.
515 | doi:\href{https://doi.org/10.1073/pnas.1616343114}{10.1073/pnas.1616343114}.
516 |
517 |
518 | \end{document}
519 |
-------------------------------------------------------------------------------- /README.txt: --------------------------------------------------------------------------------
1 | This directory contains the readings, materials, and examples for a workshop originally offered at the Exploring Human Host-Microbiome Interactions in Health and Disease 2016 conference.
2 |
3 | The workshop was subsequently modified for the NGS'18 conference in Barcelona. For this offering we examine an RNA-seq dataset, and both outlier detection and the use of the propr R package (Ionas Erb) were added.
4 |
5 | The primary focus will be on explaining why transcriptome data are compositional, the limits and caveats on the analysis of compositional data, and finally the proper methods to analyze such data.
Practical methods to deal with the sparse nature of the data will also be presented and demonstrated.
6 |
7 | The workshop will focus on hands-on application of three main methods:
8 |
9 | Exploratory data analysis using the compositional biplot, which shows the variance in both the samples and the genes on one plot, followed by outlier detection and removal.
10 |
11 | Proper correlation and association analysis using the latest methods, including the rho metric in propr and other tools for correlation.
12 |
13 | Examination of differential abundance under a compositional approach using the ALDEx2 tool.
14 |
15 | Workshop participants should read the following:
16 |
17 | multi_comp.pdf and coda_seq.pdf: brief introductions to the problem from different perspectives
18 | make_interpret_biplot.pdf: a simple biplot with random data
19 | first_biplot.pdf: exploratory data analysis using compositional biplots, and outlier detection and removal in the example dataset
20 | first_comparison.pdf: finding maximally different genes using ALDEx2
21 | first_association.pdf: finding maximally associated genes using propr
-------------------------------------------------------------------------------- /chunk/R/propr-functions.R: --------------------------------------------------------------------------------
1 | #' @title Closure
2 | #'
3 | #' @description
4 | #' \code{propr.clo} divides each row of \code{X} by its row sum.
5 | #'
6 | #' @details If \code{check} is \code{TRUE} then this function will stop if
7 | #' there are any negative or \code{NA} values in \code{X}.
8 | #' @param X A matrix or dataframe of positive numeric values
9 | #' @param check A logical scalar
10 | #' @return A version of \code{X} where each row has been scaled so that it sums to 1.
11 | #' @examples
12 | #' X <- matrix(1:12, nrow=3)
13 | #' x <- propr.clo(X)
14 | #' rowSums(x)
15 | #' @export
16 | propr.clo <- function(X, check=FALSE){
17 |   if(check){
18 |     if(any(X < 0)) stop("negative values found")
19 |     if(any(is.na(X))) stop("NA values found")
20 |   }
21 |   return(sweep(X, 1, rowSums(X), "/"))
22 | }
23 |
24 | #######################################################################################
25 | #' @title Centred logratio transformation
26 | #'
27 | #' @description
28 | #' \code{propr.clr} takes the log of each row of X and centres it (i.e., subtracts the mean).
29 | #'
30 | #' @details If \code{check} is \code{TRUE} then this function will stop if
31 | #' there are any negative or \code{NA} values in \code{X}.
32 | #' @param X A matrix or dataframe of positive numeric values
33 | #' @param check A logical scalar
34 | #' @return The logarithm of \code{X} where each row has been shifted to have mean 0.
35 | #' \deqn{\mathrm{clr}(x)_i = \log x_i - \frac{1}{D}\sum_{j=1}^D \log x_j}
36 | #' @examples
37 | #' X <- matrix(1:12, nrow=3)
38 | #' x <- propr.clr(X)
39 | #' rowSums(x) # Pretty close to zero
40 | #' apply(exp(x), 1, prod) # The row products of exp(x) will be 1
41 | #' @export
42 | propr.clr <- function(X, check=FALSE){
43 |   if(check){
44 |     if(any(X < 0)) stop("negative values found")
45 |     if(any(is.na(X))) stop("NA values found")
46 |   }
47 |   logX <- log(X)
48 |   return(sweep(logX, 1, rowMeans(logX), "-"))
49 | }
50 |
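# added note: the clr is scale-invariant -- multiplying a sample (row) by
# any constant leaves its clr values unchanged, which is why clr-based
# analyses do not depend on the total read count; a quick check:
#   X <- matrix(1:12, nrow=3)
#   all.equal(propr.clr(X), propr.clr(10 * X))   # TRUE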
51 | #######################################################################################
52 | #' @title Variance of logratios
53 | #'
54 | #' @description
55 | #' \code{propr.vlr} returns a matrix where element (i,j) is
56 | #' the variance (over rows) of the log of the ratio of columns i and j.
57 | #'
58 | #' @details If \code{check} is \code{TRUE} then this function will stop if
59 | #' there are any negative or \code{NA} values in \code{X}.
60 | #' @param X A matrix or dataframe of positive numeric values
61 | #' @param check A logical scalar
62 | #' @return The symmetric matrix
63 | #' \eqn{\mathrm{Var}{\log(X_i/X_j)}}{Var(log(X_i/X_j))} where \eqn{X_i} and \eqn{X_j}
64 | #' denote \emph{columns} \eqn{i} and \eqn{j} of \eqn{X}.
65 | #' @examples
66 | #' N <- 10 # Number of observations
67 | #' # Make a data frame with columns a and b roughly proportional
68 | #' # and columns c and d roughly proportional
69 | #' X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
70 | #'                 c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
71 | #' round(propr.vlr(X),2)
72 | #' @export
73 | propr.vlr <- function(X, check=FALSE){
74 |   if(check){
75 |     if(any(X < 0)) stop("negative values found")
76 |     if(any(is.na(X))) stop("NA values found")
77 |   }
78 |   logX <- log(X)
79 |   Cov <- stats::var(logX) ## Note the avoidance of compositions::var
80 |   D <- ncol(logX)
81 |   VarCol <- matrix(rep(diag(Cov), D), ncol = D)
82 |   # Var(log Xi/Xj) = Var(log Xi) + Var(log Xj) - 2*Cov(log Xi, log Xj)
83 |   return(-2 * Cov + VarCol + t(VarCol))
84 | }
85 |
86 | #######################################################################################
87 | #' @title Symmetric phi statistic
88 | #'
89 | #' @description
90 | #' \code{propr.phisym} returns a matrix where element (i,j) is
91 | #' the symmetric phi statistic between columns i and j of \code{X}.
92 | #' @details \code{X} should be the result of a centred logratio transformation.
93 | #' @param X A matrix or dataframe
94 | #' @return The symmetric matrix of phi values between the columns of \code{X}.
95 | #' @examples
96 | #' N <- 10 # Number of observations
97 | #' # Make a data frame with columns a and b roughly proportional
98 | #' # and columns c and d roughly proportional
99 | #' X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
100 | #'                 c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
101 | #' round(propr.phisym(propr.clr(X)),2)
102 | #' @export
103 | propr.phisym <- function(X)
104 | {
105 |   Cov <- stats::var(X)
106 |   tmp <- 2 * Cov / outer(diag(Cov), diag(Cov), "+")
107 |   return((1-tmp)/(1+tmp))
108 | }
109 |
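# added note (illustration): for clr-transformed columns i and j, the
# symmetric phi equals vlr(i,j) / Var(clr_i + clr_j), so phi is small
# exactly when the log-ratio variance is small relative to the total
# variation of the pair; a quick numerical check with synthetic data:
#   set.seed(1)
#   X <- matrix(rlnorm(200), ncol=4)
#   Z <- propr.clr(X)
#   all.equal(propr.phisym(Z)[1,2],
#             propr.vlr(X)[1,2] / stats::var(Z[,1] + Z[,2]))   # TRUE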
110 | #######################################################################################
111 | #' @title Expected value of phi from Dirichlet log-ratio distributions
112 | #'
113 | #' @description
114 | #' By default returns a dataframe of the lower triangle of the symmetrical phi metric;
115 | #' alternatively returns a matrix of the symmetrical phi metric.
116 | #' In either case, the value of phi is the expected value over a number of Dirichlet
117 | #' Monte-Carlo replicates of the data. This reduces the problem of
118 | #' 0-count and low-count features being highly variable: their phi values
119 | #' range wildly, and so their expected value is always large.
120 | #' @details Requires the aldex.clr function from ALDEx2.
121 | #' @param aldex.clr An S3 object from the aldex.clr function
122 | #' @param return If "df" (the default), returns a dataframe of the expected value of
123 | #'   the lower triangle of the propr.phisym function; if "mat", returns the symmetric
124 | #'   matrix. The number of Dirichlet Monte-Carlo replicates is obtained from the
125 | #'   aldex.clr object. We ignore all the other measures that are used for
126 | #'   trouble-shooting phi; the sma.df function in particular is very time and
127 | #'   memory intensive.
128 | #' @examples
129 | #' # use a count table where the samples are by column, features by row
130 | #' x <- aldex.clr(count.table)
131 | #' phi.df <- propr.aldex.phi(x, return="df")
132 |
133 | propr.aldex.phi <- function(aldex.clr, return="df"){
134 |
135 |   # calculate expected value of phi
136 |   # a single high phi value will push the component out of consideration
137 |   # a median is right out for memory considerations
138 |
139 |   # get first value
140 |   sym.phi <- propr.phisym(t(sapply(getMonteCarloInstances(aldex.clr),
141 |     function(y){y[,1]})))
142 |
143 |   # sum the rest of the values as we proceed through the DIR MC instances
144 |   for(i in 2:numMCInstances(aldex.clr)){
145 |     #print(i)
146 |     sym.phi <- sym.phi + propr.phisym(t(sapply(getMonteCarloInstances(aldex.clr),
147 |       function(y){y[,i]})))
148 |   }
149 |   ##### Done ALDEx2 stuff
150 |
151 |   # make indices of the correct size (lower triangle)
152 |   lt <- which(col(sym.phi) < row(sym.phi))
194 | points(x.all$wi.eBH[x.all$effect > 4], x.all$diff.btw[x.all$effect > 4], col="red", pch=19)
195 | points(x.all$wi.eBH[x.all$effect < -4],x.all$diff.btw[x.all$effect < -4], col="blue",pch=19)
196 |
197 | # effect size plot
198 | aldex.plot(x.all, type="MW")
199 |
200 | # we typically use effect sizes of 2 or more
201 |
202 | # subset to find those with very large effect sizes
203 | rownames(x.all)[abs(x.all$effect) > 4]
204 | ```
205 | ```{r edgeR}
206 |
207 | # https://bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
208 |
209 | library(edgeR)
210 |
211 | # from ALDEx above
212 | group <- factor(conds)
213 |
214 | y <- DGEList(counts=d.good.gt0, group=group)
215 |
216 | y <- calcNormFactors(y)
217 |
218 | y <- estimateDisp(y)
219 |
220 | # plot a multidimensional scaling (MDS) plot of the data
221 | # by default uses the 500 genes with the largest difference between samples
222 | # this is a classical MDS on leading log-fold-change distances rather than actual variances
223 | # note that in this dataset, it looks very similar to the plot from the clr values
224 |
225 | plotMDS(y)
226 |
227 | # how well does the data fit the model?
228 | # looking for the blue and red lines to be coincident
229 | plotBCV(y)
230 |
231 | # this is the differential abundance test
232 | # exact test using the negative binomial
233 | et <- exactTest(y)
234 | # hack to get the data for plotting later; n is set to the number of genes so all are returned
235 | tt <- topTags(et, sort.by="none", n=6236)
236 |
237 | summary(et <- decideTestsDGE(et))
238 | detags <- rownames(y)[as.logical(et)]
239 |
240 |
241 | # you can plot the MA plot
242 | # compare to the MA plot from ALDEx, they are rather similar for this dataset
243 | plotSmear(y, de.tags=detags)
244 | abline(h=c(-1, 1), col="blue", lty=2)
245 |
246 | ```
247 |
248 | ```{r DESeq}
249 | # https://www.bioconductor.org/packages/3.3/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
250 |
251 | library(DESeq)
252 |
253 | # we will use the same input table
254 | # set up the metadata
255 | design <- data.frame(
256 |   row.names = colnames(d.good.gt0),
257 |   condition = conds,
258 |   libType="single-end"
259 | )
260 |
261 | # for convenience so I can cut and paste
262 | countTable <- d.good.gt0
263 | condition <- design$condition
264 | condition <- factor(condition)
265 |
266 | # make the base data table
267 | cds = newCountDataSet( countTable, condition )
268 |
269 | # normalize read counts
270 | cds = estimateSizeFactors( cds )
271 | # observe
272 | sizeFactors( cds )
273 |
274 | # estimate the variance and fit to a negative binomial
275 | cds = estimateDispersions( cds )
276 |
277 | # observe how well it fits the data
278 | # how does this differ from, and what is similar to, edgeR?
279 | plotDispEsts( cds )
280 |
281 | # find DE genes
282 | res = nbinomTest( cds, "SNF", "WT" )
283 |
284 | # plot the MA plot, compare to ALDEx2 and edgeR
285 | plotMA(res)
286 |
287 | # it is useful to look at the p value distribution
288 | par(mfrow=c(1,3))
289 | hist(res$pval, breaks=100, main="DESeq") # DESeq
290 | hist(tt[[1]]$PValue, breaks=100, col=rgb(1,0,0,0.1), main="edgeR") # edgeR
291 | hist(x.all$we.ep, breaks=100, col=rgb(0,0,1,0.1), main="ALDEx2") # ALDEx2
292 |
293 | par(mfrow=c(1,2))
294 | plot(res$pval, tt[[1]]$PValue, log="xy")
295 | plot(res$pval, x.all$we.ep, log="xy")
296 |
297 |
298 | # get lists of differential genes by tool
299 | # use cutoff of BH of 0.05
300 |
301 | sig.ald <- rownames(x.all)[x.all$we.eBH < 0.05]
302 | sig.des <- res$id[res$padj < 0.05]
303 | sig.edg <- detags
304 |
305 | write.table(sig.ald, file="sig.ald.txt", quote=FALSE, row.names=F, col.names=F)
306 | write.table(sig.des, file="sig.des.txt", quote=FALSE, row.names=F, col.names=F)
307 | write.table(sig.edg, file="sig.edg.txt", quote=FALSE, row.names=F, col.names=F)
308 |
309 |
310 | ```
311 |
312 |
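Having written the per-tool significant-gene lists, a natural check (not part of the original workflow) is how much the three tools agree; this sketch assumes the `sig.*` vectors from the chunk above:

```{r overlap, eval=FALSE}
# overlap between the significant-gene lists (BH < 0.05) from each tool
length(intersect(sig.ald, sig.edg))
length(intersect(sig.ald, sig.des))
length(Reduce(intersect, list(sig.ald, sig.des, sig.edg)))
```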
313 | ```{r, annotation}
314 | # you can supply lists of genes of interest, or call directly from within edgeR for this
315 | # see the edgeR documentation for an example of GO terms, only metazoan models
316 | # we can do KEGG term annotation though
317 |
318 | de.aldex <- rownames(x.all)[x.all$effect > 2]
319 | write.table(de.aldex, file="aldex_sig.effect.txt", sep="\t", col.names=F, row.names=F, quote=F)
320 |
321 | # http://www.kegg.jp/kegg/tool/map_pathway1.html
322 |
323 | # kegga queries an external database (KEGG)
324 | kegg.aldex <- kegga(de.aldex, species.KEGG="sce")
325 |
326 | ```
327 |
328 | ```{r, assoc}
329 |
330 | library(ALDEx2)
331 | library(CoDaSeq)
332 | library(zCompositions)
333 | library(igraph)
334 | library(car)
335 | source("http://michael.hahsler.net/SMU/ScientificCompR/code/map.R")
336 |
337 |
338 | # make the conditions vector first, then generate the clr object
339 |
340 | conds <- c(rep("SNF", length(SNF.g$good)), rep("WT", length(WT.g$good)))
341 |
342 | x <- aldex.clr(d.good.gt0, conds, mc.samples=16)
343 |
344 | x.phi <- codaSeq.phi(x)
345 |
346 | phi.cutoff <- .005
347 | x.lo.phi <- subset(x.phi, phi <= phi.cutoff)
348 |
349 | # generate a graphical object
350 | g <- graph.data.frame(x.lo.phi, directed=FALSE)
351 |
352 | # overview of all the proportional relationships
353 | # this can take a long time!!!
354 | V(g)$label.cex <- 1
355 | OTU.names <- V(g)$name
356 |
357 | # V(g)$name <- as.character(hmpgenera[V(g)$name, "genus"])
358 |
359 |
360 | plot(g, layout=layout.fruchterman.reingold.grid(g, weight=0.05/E(g)$phi), vertex.size=1, vertex.color="black")
361 |
362 | # http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html
363 | layout <- layout.fruchterman.reingold(g)
364 | plot(g, layout=layout)
365 |
366 | # map centrality
367 | plot(g, layout=layout,
368 |   vertex.size=map(betweenness(g), c(1,15)),
369 |   edge.width=map(edge.betweenness(g), c(1,10)))
370 |
371 | # plot page rank (connectedness)
372 | pr <- page.rank(g)$vector
373 | plot(g, layout=layout, vertex.size=map(pr, c(1,20)), edge.arrow.size=.2)
374 |
375 | # break into component sub-groups
376 | dg <- decompose.graph(g)
377 | plot(dg[[1]], layout=layout.fruchterman.reingold(dg[[1]]),
378 |   vertex.size=map(page.rank(dg[[1]])$vector, c(1,10)),
379 |   edge.arrow.size=.2)
380 |
381 |
382 | ## communities
383 | # https://users.dimi.uniud.it/~massimo.franceschet/R/communities.html
384 | c1 = cluster_fast_greedy(dg[[1]])
385 | plot(dg[[1]], vertex.color=membership(c1), layout=layout_with_fr(dg[[1]]))
386 |
387 | c2 = cluster_leading_eigen(g)
388 | plot(g, vertex.color=membership(c2), layout=layout_with_fr(g), vertex.size=map(pr, c(1,20)), edge.arrow.size=.2)
389 |
390 | c3 = cluster_edge_betweenness(g)
391 | plot(g, vertex.color=membership(c3), layout=layout_with_fr(g), vertex.size=map(page.rank(g)$vector, c(1,20)), edge.arrow.size=.2)
392 |
393 | plot(g, vertex.color=membership(c3), layout=layout_with_fr(g), vertex.size=map(betweenness(g), c(1,15)),
394 |   edge.width=map(edge.betweenness(g), c(1,10)))
395 |
396 |
397 |
398 |
399 | pdf("phi_graph.pdf", height=5, width=5)
400 | plot(dg[[21]]) # component 21 was chosen by inspection; adjust for your graph
401 | dev.off()
402 |
403 | # get the clusters from the graph object
404 | g.clust <- clusters(g)
405 |
406 | # data frame containing the names and group memberships of each cluster
407 | # g.df.u <- data.frame(Systematic.name=V(g)$name, cluster=g.clust$membership, cluster.size=g.clust$csize[g.clust$membership], stringsAsFactors=FALSE)
408 | g.df.u <- data.frame(Name=V(g)$name, cluster=g.clust$membership, stringsAsFactors=FALSE)
409 |
410 | # generate the list of features and clusters in a data frame
411 | g.df <- g.df.u[order(g.df.u[,"cluster"]),]
412 | ```
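The community-detection calls above will generally partition the graph differently; modularity gives a rough score for comparing partitions of the same graph (a sketch, assuming the objects computed above):

```{r modularity, eval=FALSE}
# higher modularity indicates more cleanly separated communities (both on g)
modularity(c2) # leading-eigenvector partition
modularity(c3) # edge-betweenness partition
```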
-------------------------------------------------------------------------------- /chunk/setup.R: --------------------------------------------------------------------------------
1 |
2 | # this is the dataset I will present in Cambridge
3 | # for the tutorial, we will use a random subset of 50 samples from each
4 |
5 | # The dataset contains 229 genera and 1457 samples. Samples are named thusly:
6 | # ids have leading 700 stripped off
7 | # samples named as follows:
8 | # td_ Tongue Dorsum
9 | # bm_ Buccal mucosa
10 | # ak_ Attached Keratinized Gingiva
11 | # hp_ Hard Palate
12 | # pt_ Palatine Tonsils
13 | # sa_ Saliva
14 | # up_ Subgingival Plaque (under plaque)
15 | # op_ Supragingival Plaque (over plaque)
16 |
17 | # read in the example table
18 | # this is a subset of the HMP oral microbiome with 24 TD and 18 BM samples
19 | # no real reason to choose these except they separate
20 |
21 | # HMP oral V3-5 microbiome dataset
22 | mouth <- read.table("~/git/compositions/oral/data/mouth_otu.txt", header=T, row.names=1, sep="\t")
23 | taxon <- read.table("~/git/compositions/oral/data/taxon_names.txt", header=T, row.names=1, sep="\t")
24 |
25 | # fix unknown taxa
26 | # all p__, c__, o__ named
27 | notF <- grep("f__", taxon[,1], invert=T)
28 | fillF <- "f__Unknown;g__Unknown"
29 |
30 | for(i in 1:length(notF)){
31 |   taxon[notF[i],1] <- ( paste(taxon[notF[i],1], fillF, sep=";") )
32 | }
33 | notG <- grep("g__", taxon[,1], invert=T)
34 | fillG <- "g__Unknown"
35 |
36 | for(i in 1:length(notG)){
37 |   taxon[notG[i],1] <- ( paste(taxon[notG[i],1], fillG, sep=";") )
38 | }
39 |
40 | z <- strsplit(as.character(taxon[,1]), ';')
41 | tax.df <- as.data.frame(do.call(rbind,z))
42 |
43 | rownames(tax.df) <- rownames(mouth)
44 | colnames(tax.df) <- c("root", "phylum", "class", "order", "family", "genus")
45 | tax.df$phylum <- gsub("p__", "", tax.df$phylum)
46 | tax.df$genus <- gsub("g__", "", tax.df$genus)
47 |
48 |
49 | d <- data.frame(mouth[,grep("ak_", colnames(mouth))], mouth[,grep("op_", colnames(mouth))])
50 |
51 | # keep samples with > 1000 reads
52 | d.col <- d[, which(apply(d,2,sum) > 1000)]
53 |
54 | # many ways to filter
55 | # keep OTUs with mean > 0.1 reads across all samples
56 | d.subset <- d[ which(apply(d.col,1,mean) > .1),]
57 | taxon.1 <- tax.df[ which(apply(d.col,1,mean) > .1),]
58 |
59 | # this is the base dataset that everyone can use
60 | write.table(d.subset, file="~/git/Host_Microbe_2016/data/ak_vs_op.txt", sep="\t", col.names=NA, quote=F)
61 | write.table(taxon.1, file="~/git/Host_Microbe_2016/data/ak_vs_op_taxon.txt", sep="\t", col.names=NA, quote=F)
62 |
63 | d <- data.frame(mouth[,grep("up_", colnames(mouth))], mouth[,grep("op_", colnames(mouth))])
64 |
65 | # keep samples with > 1000 reads
66 | d.col <- d[, which(apply(d,2,sum) > 1000)]
67 |
68 | # many ways to filter
69 | # keep OTUs with mean > 0.1 reads across all samples
70 | d.subset <- d[ which(apply(d.col,1,mean) > .1),]
71 | taxon.1 <- tax.df[ which(apply(d.col,1,mean) > .1),]
72 |
73 | # this is the base dataset that everyone can use
74 | write.table(d.subset, file="~/git/Host_Microbe_2016/data/up_vs_op.txt", sep="\t", col.names=NA, quote=F)
75 | write.table(taxon.1, file="~/git/Host_Microbe_2016/data/up_vs_op_taxon.txt", sep="\t", col.names=NA, quote=F)
76 |
77 | #############
78 |
79 | ################ OTU LEVEL
80 | # some initial characterization of the whole dataset
81 |
82 |
83 | d.subset <- read.table("~/git/Host_Microbe_2016/data/ak_vs_op.txt", sep="\t", header=T, row.names=1)
84 | taxon.1 <- read.table("~/git/Host_Microbe_2016/data/ak_vs_op_taxon.txt", sep="\t", header=T, row.names=1)
85 |
86 | library(zCompositions)
87 | d.n0 <- cmultRepl(t(d.subset), method="CZM", label=0)
88 | d.clr <- t(apply(d.n0, 1, function(x) log(x) - mean(log(x))))
89 | pcx <- prcomp(d.clr)
90 | rownames(pcx$rotation) <- taxon.1[rownames(pcx$rotation), "genus"]
91 |
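# added note: cmultRepl performs count zero multiplicative (CZM) replacement --
# zeros are replaced with small pseudo-counts so that logarithms can be taken --
# and the apply() call above is the clr: log(x) minus the mean log(x) of each
# sample (each row of d.n0)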
92 | PC1 <- paste("PC1: ", round(pcx$sdev[1]^2/sum(pcx$sdev^2),3), sep="")
93 | PC2 <- paste("PC2: ", round(pcx$sdev[2]^2/sum(pcx$sdev^2),3), sep="")
94 |
95 |
96 | biplot(pcx, cex=c(0.5,0.4), col=c("black", rgb(1,0,0,0.2)), var.axes=F, scale=0, xlab=PC1, ylab=PC2)
97 |
98 | d.rand <- cbind(d.subset[,sample(grep("ak", colnames(d.subset)), 25)], d.subset[,sample(grep("op", colnames(d.subset)), 25)])
99 | d.rand.filt <- d.rand[rowSums(d.rand) > 0,]
100 |
101 |
102 | n.rand = 25 # 25 samples of each group were drawn above
103 | conds <- c(rep("ak", n.rand), rep("op", n.rand))
104 |
105 |
106 | x <- aldex.clr(d.rand.filt, conds) # use the zero-row filtered table and supply the conditions
107 |
108 | x.e <- aldex.effect(x, conds)
109 | x.t <- aldex.ttest(x, conds)
110 | x.all <- data.frame(x.e,x.t)
111 |
112 | tax <- "genus"
113 | all.plot <- data.frame(x.all, tax.df[rownames(x.all), tax])
114 | colnames(all.plot)[ncol(all.plot)] <- "taxon.level"
115 | all.plot <- droplevels(all.plot)
116 |
117 | no.sig <- all.plot$wi.eBH > 0.01
118 | sig.pos <- all.plot$wi.eBH < 0.01 & all.plot$effect > 0
119 | sig.neg <- all.plot$wi.eBH < 0.01 & all.plot$effect < 0
120 |
121 | groups <- unique(all.plot$taxon.level)
122 | ylim <- c(length(groups) - (length(groups)+0.5), length(groups) + 0.5)
123 |
124 | xlim <- c(-max(abs(all.plot$effect)), max(all.plot$effect))
125 | pdf("genus_stripchart.pdf")
126 | par(mar=c(5,15,5,1), las=1, cex=0.7)
127 | stripchart(effect ~ taxon.level, data=all.plot[no.sig,], col=rgb(0,0,0,0.3), method="jitter", jitter=0.2, pch=19, xlim=xlim, xlab="effect", main=tax)
128 | stripchart(effect ~ taxon.level, data=all.plot[sig.pos,], col=rgb(0,0,1,0.3), method="jitter", jitter=0.2, pch=19, xlim=xlim, xlab="effect", add=T)
129 | stripchart(effect ~ taxon.level, data=all.plot[sig.neg,], col=rgb(1,0,0,0.3), method="jitter", jitter=0.2, pch=19, xlim=xlim, xlab="effect", add=T)
130 | abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
131 | # draw horizontal lines
132 | for (j in 0.5:(length(groups)+0.5)){
133 |   abline(h=j, lty=3, col=rgb(0,0,0,0.3))
134 | }
135 | dev.off()
136 |
137 |
138 | ### MAKE A FULL COMPARISON FOR THE UP OP SAMPLE PAIR, and the ak op sample pair
139 |
140 | e.subset <- read.table("data/up_vs_op.txt", row.names=1, header=T)
141 | e.conds <- c(rep("up", length(grep("up", colnames(e.subset))) ), rep("op", length(grep("op", colnames(e.subset)))) )
142 | e.x <- aldex.clr(e.subset, e.conds)
143 | e.eff <- aldex.effect(e.x, e.conds)
144 | e.tt <- aldex.ttest(e.x, e.conds)
145 | e.all <- data.frame(e.eff,e.tt)
146 |
147 |
148 | # SAVE DATA IN ak_vs_op_aldex.txt AND up_vs_op_aldex.txt
149 | write.table(x.all, file="~/git/Host_Microbe_2016/data/ak_vs_op_aldex.txt", sep="\t", col.names=NA, quote=F)
150 | write.table(e.all, file="~/git/Host_Microbe_2016/data/up_vs_op_aldex.txt", sep="\t", col.names=NA, quote=F)
151 |
152 | ############# GENUS LEVEL
153 | # aggregate the data by genus count
154 | mouth.genus <- aggregate(mouth, by=list(tax.df$genus), FUN=sum)
155 | rownames(mouth.genus) <- mouth.genus$Group.1
156 | mouth.genus$Group.1 <- NULL
157 | mouth.genus.subset <- mouth.genus[rowSums(mouth.genus) > 0,]
158 | write.table(mouth.genus.subset, file="~/git/Host_Microbe_2016/data/mouth_genus.txt", sep="\t", col.names=NA, quote=F)
159 |
160 |
161 | gen.n0 <- cmultRepl(t(mouth.genus.subset), label=0, method="CZM")
162 |
163 | gen.clr <- t(apply(gen.n0, 1, function(x) log(x) - mean(log(x)) ))
164 | pcx.gen <- prcomp(gen.clr)
165 |
166 | PC1 <- paste("PC1: ", round(pcx.gen$sdev[1]^2/sum(pcx.gen$sdev^2),3), sep="")
167 | PC2 <- paste("PC2: ", round(pcx.gen$sdev[2]^2/sum(pcx.gen$sdev^2),3), sep="")
168 | biplot(pcx.gen, cex=c(0.6,0.5), var.axes=F, scale=0,
xlab=PC1, ylab=PC2)
169 | abline(v=0)
170 | abline(h=0)
171 |
172 | d.g <- data.frame(mouth.genus.subset[,grep("ak_", colnames(mouth.genus.subset))], mouth.genus.subset[,grep("op_", colnames(mouth.genus.subset))])
173 | d.g.subset <- d.g[apply(d.g, 1, mean) > 0.1, ]
174 | d.g.n0 <- cmultRepl(t(d.g.subset), label=0, method="CZM")
175 | d.g.clr <- t(apply(d.g.n0, 1, function(x) log(x) - mean(log(x)) ))
176 | pcx.d.g <- prcomp(d.g.clr)
177 |
178 | PC1 <- paste("PC1: ", round(pcx.d.g$sdev[1]^2/sum(pcx.d.g$sdev^2),3), sep="")
179 | PC2 <- paste("PC2: ", round(pcx.d.g$sdev[2]^2/sum(pcx.d.g$sdev^2),3), sep="")
180 | biplot(pcx.d.g, cex=c(0.9,0.7), var.axes=F, scale=0, xlab=PC1, ylab=PC2)
181 | abline(v=0)
182 | abline(h=0)
183 |
184 | conds.g <- c(rep("A", length(grep("ak", colnames(d.g.subset))) ), rep("O", length(grep("op", colnames(d.g.subset)))) )
185 | x.g <- aldex.clr(d.g.subset, conds.g)
186 | x.g.e <- aldex.effect(x.g, conds.g)
187 | x.t.t <- aldex.ttest(x.g, conds.g)
188 | x.g.all <- data.frame(x.g.e, x.t.t)
189 |
190 |
-------------------------------------------------------------------------------- /coda_seq.Rmd: --------------------------------------------------------------------------------
1 |
2 | ---
3 | title: "CoDa and sequencing"
4 | author: "gg"
5 | date: '`r format(Sys.time(), "%d %B, %Y")`'
6 | bibliography: ~/Library/texmf/bibtex/bib/bibdesk_refs.bib
7 | fig_caption: true
8 | output:
9 |     pdf_document:
10 |         fig_caption: yes
11 | ---
12 |
13 | To run this file:
14 | Rscript -e "rmarkdown::render('coda_seq.Rmd')"
15 |
16 | It is assumed that the output from a high-throughput sequencing experiment represents in some way the underlying abundance of the input DNA molecules. This is not necessarily the case, as explained by the following thought experiment.
17 |
18 | ```{r, echo=F, fig.width=10, results='asis', fig.cap="High-throughput sequencing affects the shape of the data differently on constrained and unconstrained data. The two left panels show the absolute number of reads in the input tube for 20 steps where the black OTU is changing abundance by 2-fold each step. The gray, blue and red OTUs are held at a constant number in each step in both cases. The second column shows the output in proportions (or ppm, or FPKM) after random sampling to a constant sum, as occurs on the sequencer. The orange OTU in the constrained data set is much more abundant than any other, and is changing to maintain a constant number of input molecules. The panels in the bottom row are the same values plotted on a log scale on the Y-axis for convenience.
Note how the constrained data is the same before and after sequencing while the unconstrained data is severely distorted."}
19 | rdirichlet <- function (n, alpha)
20 | {
21 |   if(length(n) > 1) n <- length(n)
22 |   #if(length(n) == 0 || as.integer(n) == 0) return(numeric(0))
23 |   #n <- as.integer(n)
24 |   if(n < 0) stop("value(n) cannot be negative in rdirichlet")
25 |
26 |   if(is.vector(alpha)) alpha <- t(alpha)
27 |   l <- dim(alpha)[2]
28 |   x <- matrix(rgamma(l * n, t(alpha)), ncol = l, byrow=TRUE) # handles recycling
29 |   return(x / rowSums(x))
30 | }
31 |
32 | num.one = 100 # the number of rare-counts in the dataset
33 |
34 | mat.double <- matrix(data=NA, nrow=20, ncol=num.one + 10)
35 | prop.mat <- matrix(data=NA, nrow=20, ncol=num.one + 10)
36 | clr.mat <- matrix(data=NA, nrow=20, ncol=num.one + 10)
37 |
38 | mat.double.u <- matrix(data=NA, nrow=20, ncol=num.one + 10)
39 | prop.mat.u <- matrix(data=NA, nrow=20, ncol=num.one + 10)
40 | clr.mat.u <- matrix(data=NA, nrow=20, ncol=num.one + 10)
41 |
42 | # constant sum input
43 | minimum.count <- 1 # multiplier to set minimum count for in.put
44 | # non-constant sum input with one big increase
45 | in.put <- c(10,20971,1,1,5,10,20,50,100,200,1000) * minimum.count
46 |
47 | total.sum <- sum(in.put + 1) * 1000
48 |
49 | for(i in 0:19){
50 |   # constant sum input
51 |   junk <- in.put * c(2^i, rep(1,num.one + 9))
52 |   junk[3] <- total.sum - sum(junk)
53 |   mat.double[i+1,] <- junk
54 |   prop.mat[i+1,] <- as.numeric( rdirichlet(1, junk) )
55 |   clr.mat[i+1,] <- log2(prop.mat[i+1,]) - mean(log2(prop.mat[i+1,]))
56 | }
57 |
58 | for(i in 0:19){
59 |   # non-constant sum input
60 |   #junk <- in.put * c(2^i, rep(1,num.one + 9))
61 |   junk <- in.put * c(2^i, rep(1,num.one + 9))
62 |   mat.double.u[i+1,] <- junk
63 |   prop.mat.u[i+1,] <- as.numeric( rdirichlet(1, junk) )
64 |   clr.mat.u[i+1,] <- 2^(log2(prop.mat.u[i+1,]) - mean(log2(prop.mat.u[i+1,])))
65 | }
66 |
67 | par(mfrow=c(2,4), mar=c(4,4,3,1) )
68 |
69 | plot(mat.double[,1], pch=20, type="b", ylim=c(min(mat.double), max(mat.double)), xlab="time point", ylab="raw count")
70 | title( main="Constrained\ninput (linear)", adj=0.5)
71 | points(mat.double[,2], type="b",pch=21, col="gray")
72 | points(mat.double[,3], type="b",pch=22, col="orange")
73 | points(mat.double[,num.one + 10], type="b",pch=23, col="blue")
74 | points(mat.double[,num.one+4], type="b",pch=24, col="red")
75 |
76 | plot(prop.mat[,1], pch=20, type="b", ylim=c(min(prop.mat[,num.one+4]), max(prop.mat)), xlab="time point", ylab="raw proportion")
77 | title( main="Constrained\nproportion (linear)", adj=0.5)
78 | points(prop.mat[,2], type="b", pch=21, col="gray")
79 | points(prop.mat[,3], type="b", pch=22, col="orange")
80 | points(prop.mat[,num.one+10], type="b", pch=23, col="blue")
81 | points(prop.mat[,num.one+4], type="b", pch=24, col="red")
82 |
83 | plot(mat.double.u[,1], pch=20, type="b", ylim=c(min(mat.double.u), max(mat.double.u)), xlab="time point", ylab="raw count")
84 | title( main="Unconstrained\ninput (linear)", adj=0.5)
85 | points(mat.double.u[,2], type="b",pch=21, col="gray")
86 | points(mat.double.u[,num.one + 10], type="b",pch=23, col="blue")
87 | points(mat.double.u[,num.one+4], type="b",pch=24, col="red")
88 |
89 | plot(prop.mat.u[,1], pch=20, type="b", ylim=c(min(prop.mat.u[,num.one+4]), max(prop.mat.u)), xlab="time point", ylab="raw proportion")
90 | title( main="Unconstrained\nproportion (linear)", adj=0.5)
91 | points(prop.mat.u[,2], type="b", pch=21, col="gray")
92 | points(prop.mat.u[,num.one+10], type="b", pch=23,
col="blue") 93 | points(prop.mat.u[,num.one+4], type="b", pch=24, col="red") 94 | 95 | plot(mat.double[,1], pch=20, type="b", log="y", ylim=c(min(mat.double), max(mat.double)), xlab="time point", ylab="log10 count") 96 | title( main="Constrained\ninput (log)", adj=0.5) 97 | points(mat.double[,2], type="b",pch=21, col="gray") 98 | points(mat.double[,3], type="b",pch=22, col="orange") 99 | points(mat.double[,num.one + 10], type="b",pch=23, col="blue") 100 | points(mat.double[,num.one+4], type="b",pch=24, col="red") 101 | 102 | plot(prop.mat[,1], pch=20, type="b", ylim=c(min(prop.mat[,num.one+4]), max(prop.mat)), xlab="time point", log="y", ylab="log10 proportion") 103 | title( main="Constrained\nproportion (log)", adj=0.5) 104 | points(prop.mat[,2], type="b", pch=21, col="gray") 105 | points(prop.mat[,3], type="b", pch=22, col="orange") 106 | points(prop.mat[,num.one+10], type="b", pch=23, col="blue") 107 | points(clr.mat[,num.one+4], type="b", pch=24, col="red") 108 | 109 | plot(mat.double.u[,1], pch=20, type="b", log="y", ylim=c(min(mat.double.u), max(mat.double.u)), xlab="time point", ylab="log10 count") 110 | title( main="Unconstrained\ninput (log)", adj=0.5) 111 | points(mat.double.u[,2], type="b",pch=21, col="gray") 112 | points(mat.double.u[,num.one + 10], type="b",pch=23, col="blue") 113 | points(mat.double.u[,num.one+4], type="b",pch=24, col="red") 114 | 115 | 116 | plot(prop.mat.u[,1], pch=20, type="b", ylim=c(min(prop.mat.u[,num.one+4]), max(prop.mat.u)),log="y", xlab="time point", ylab="log10 proportion") 117 | title( main="Unconstrained\nproportion (log)", adj=0.5) 118 | points(prop.mat.u[,2], type="b", pch=21, col="gray") 119 | points(prop.mat.u[,num.one+10], type="b", pch=23, col="blue") 120 | points(prop.mat.u[,num.one+4], type="b", pch=24, col="red") 121 | ``` 122 | 123 | 124 | 125 | Figure 1 shows two idealized experiments with four different ways of looking at the exact same data: the top row shows the data as linear points, the bottom row shows the same data after a log transform. The constrained input shows the case where the total count of all nucleic acid species in the input is constrained to a constant sum. The unconstrained input shows the case where the total sum is not constrained to a constant sum. These are modelled as a time series, but any process would produce the same results, and in practice we will likely only be comparing the first and last points (before and after some intervention). These are shown in two different ways. The first is linear counts "input", the second is proportions. The bottome row 126 | 127 | Constrained datasets occur if the increase or decrease in any component is exactly compensated by the increase or decrease of one or more others. Here the total count remains constant across all experimental conditions. Examples of constrained datasets would include allele frequencies at a locus where the total has to be 1, and the RNA-seq where the induction of genes occurs in a steady- state cell culture. In this case, any process, such as sequencing that generates a proportion simply recapitulates the data with sampling error. The unspoken assumption in most high throughput experimental designs is that this assumption is true— it is not! 128 | 129 | An unconstrained dataset results if the total count is free to vary. Examples of unconstrained datasets would include ChIP-Seq, RNA-seq where we are examining two different conditions or cell populations, metagenomics, etc. 
Importantly, the total count in 16S rRNA gene sequencing experiments is almost always free to vary; that is, the total bacterial load is rarely constant in an environment. Thus, unconstrained data would be the predominant type of data expected.
130 |
131 | The proportion panels of Figure 1 (the second and fourth columns) show the result of random sampling to a defined maximum value in these two types of datasets. This random sampling reflects the data that result from high-throughput sequencing, where the total number of reads is constrained by the instrument capacity. The data are represented as proportions, but scale to parts per million or parts per billion without changing shape. Here we see that the shape of the data after sequencing is very similar to the input data in the constrained case, but is very different in the unconstrained case. In the unconstrained dataset, observe how the blue and red features appear to be constant over the first 10 time points, but then appear to decrease in abundance at later time points. Conversely, the black feature appears to increase linearly at early time points, but appears to become constant at late time points. Obviously, we would misinterpret what is happening if we compared early and late time points in the unconstrained dataset. It is also worth noting how the act of random sampling makes the proportional abundance of the rare OTU species uncertain in both the constrained and unconstrained data, but has little effect on the apparent relative abundance of OTUs with high counts.
132 |
133 |
-------------------------------------------------------------------------------- /coda_seq.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/coda_seq.pdf -------------------------------------------------------------------------------- /data/ERP004763_sample_mapping.tsv: --------------------------------------------------------------------------------
1 | RunAccession	Lane	Sample	BiolRep
2 | ERR458493	1	WT	1
3 | ERR458494	2	WT	1
4 | ERR458495	3	WT	1
5 | ERR458496	4	WT	1
6 | ERR458497	5	WT	1
7 | ERR458498	6	WT	1
8 | ERR458499	7	WT	1
9 | ERR458500	1	SNF2	1
10 | ERR458501	2	SNF2	1
11 | ERR458502	3	SNF2	1
12 | ERR458503	4	SNF2	1
13 | ERR458504	5	SNF2	1
14 | ERR458505	6	SNF2	1
15 | ERR458506	7	SNF2	1
16 | ERR458507	1	SNF2	2
17 | ERR458508	2	SNF2	2
18 | ERR458509	3	SNF2	2
19 | ERR458510	4	SNF2	2
20 | ERR458511	5	SNF2	2
21 | ERR458512	6	SNF2	2
22 | ERR458513	7	SNF2	2
23 | ERR458514	1	SNF2	3
24 | ERR458515	2	SNF2	3
25 | ERR458516	3	SNF2	3
26 | ERR458517	4	SNF2	3
27 | ERR458518	5	SNF2	3
28 | ERR458519	6	SNF2	3
29 | ERR458520	7	SNF2	3
30 | ERR458521	1	SNF2	4
31 | ERR458522	2	SNF2	4
32 | ERR458523	3	SNF2	4
33 | ERR458524	4	SNF2	4
34 | ERR458525	5	SNF2	4
35 | ERR458526	6	SNF2	4
36 | ERR458527	7	SNF2	4
37 | ERR458528	1	SNF2	5
38 | ERR458529	2	SNF2	5
39 | ERR458530	3	SNF2	5
40 | ERR458531	4	SNF2	5
41 | ERR458532	5	SNF2	5
42 | ERR458533	6	SNF2	5
43 | ERR458534	7	SNF2	5
44 | ERR458535	1	SNF2	6
45 | ERR458536	2	SNF2	6
46 | ERR458537	3	SNF2	6
47 | ERR458538	4	SNF2	6
48 | ERR458539	5	SNF2	6
49 | ERR458540	6	SNF2	6
50 | ERR458541	7	SNF2	6
51 | ERR458542	1	SNF2	7
52 | ERR458543	2	SNF2	7
53 | ERR458544	3	SNF2	7
54 | ERR458545	4	SNF2	7
55 | ERR458546	5	SNF2	7
56 | ERR458547	6	SNF2	7
57 | ERR458548	7	SNF2	7
58 | ERR458549	1	SNF2	8
59 | ERR458550	2	SNF2	8
60 | ERR458551	3	SNF2	8
61 | ERR458552	4	SNF2	8
62 | ERR458553 5 SNF2 8 63 | ERR458554 6 SNF2 8 64 | ERR458555 7 SNF2 8 65 | ERR458556 1 SNF2 9 66 | ERR458557 2 SNF2 9 67 | ERR458558 3 SNF2 9 68 | ERR458559 4 SNF2 9 69 | ERR458560 5 SNF2 9 70 | ERR458561 6 SNF2 9 71 | ERR458562 7 SNF2 9 72 | ERR458563 1 SNF2 10 73 | ERR458564 2 SNF2 10 74 | ERR458565 3 SNF2 10 75 | ERR458566 4 SNF2 10 76 | ERR458567 5 SNF2 10 77 | ERR458568 6 SNF2 10 78 | ERR458569 7 SNF2 10 79 | ERR458570 1 SNF2 11 80 | ERR458571 2 SNF2 11 81 | ERR458572 3 SNF2 11 82 | ERR458573 4 SNF2 11 83 | ERR458574 5 SNF2 11 84 | ERR458575 6 SNF2 11 85 | ERR458576 7 SNF2 11 86 | ERR458577 1 SNF2 12 87 | ERR458578 2 SNF2 12 88 | ERR458579 3 SNF2 12 89 | ERR458580 4 SNF2 12 90 | ERR458581 5 SNF2 12 91 | ERR458582 6 SNF2 12 92 | ERR458583 7 SNF2 12 93 | ERR458584 1 SNF2 13 94 | ERR458585 2 SNF2 13 95 | ERR458586 3 SNF2 13 96 | ERR458587 4 SNF2 13 97 | ERR458588 5 SNF2 13 98 | ERR458589 6 SNF2 13 99 | ERR458590 7 SNF2 13 100 | ERR458591 1 SNF2 14 101 | ERR458592 2 SNF2 14 102 | ERR458593 3 SNF2 14 103 | ERR458594 4 SNF2 14 104 | ERR458595 5 SNF2 14 105 | ERR458596 6 SNF2 14 106 | ERR458597 7 SNF2 14 107 | ERR458598 1 SNF2 15 108 | ERR458599 2 SNF2 15 109 | ERR458600 3 SNF2 15 110 | ERR458601 4 SNF2 15 111 | ERR458602 5 SNF2 15 112 | ERR458603 6 SNF2 15 113 | ERR458604 7 SNF2 15 114 | ERR458605 1 SNF2 16 115 | ERR458606 2 SNF2 16 116 | ERR458607 3 SNF2 16 117 | ERR458608 4 SNF2 16 118 | ERR458609 5 SNF2 16 119 | ERR458610 6 SNF2 16 120 | ERR458611 7 SNF2 16 121 | ERR458612 1 SNF2 17 122 | ERR458613 2 SNF2 17 123 | ERR458614 3 SNF2 17 124 | ERR458615 4 SNF2 17 125 | ERR458616 5 SNF2 17 126 | ERR458617 6 SNF2 17 127 | ERR458618 7 SNF2 17 128 | ERR458619 1 SNF2 18 129 | ERR458620 2 SNF2 18 130 | ERR458621 3 SNF2 18 131 | ERR458622 4 SNF2 18 132 | ERR458623 5 SNF2 18 133 | ERR458624 6 SNF2 18 134 | ERR458625 7 SNF2 18 135 | ERR458626 1 SNF2 19 136 | ERR458627 2 SNF2 19 137 | ERR458628 3 SNF2 19 138 | ERR458629 4 SNF2 19 139 | ERR458630 5 SNF2 19 140 | ERR458631 6 SNF2 19 141 | ERR458632 7 SNF2 19 142 | ERR458633 1 SNF2 20 143 | ERR458634 2 SNF2 20 144 | ERR458635 3 SNF2 20 145 | ERR458636 4 SNF2 20 146 | ERR458637 5 SNF2 20 147 | ERR458638 6 SNF2 20 148 | ERR458639 7 SNF2 20 149 | ERR458640 1 SNF2 21 150 | ERR458641 2 SNF2 21 151 | ERR458642 3 SNF2 21 152 | ERR458643 4 SNF2 21 153 | ERR458644 5 SNF2 21 154 | ERR458645 6 SNF2 21 155 | ERR458646 7 SNF2 21 156 | ERR458647 1 SNF2 22 157 | ERR458648 2 SNF2 22 158 | ERR458649 3 SNF2 22 159 | ERR458650 4 SNF2 22 160 | ERR458651 5 SNF2 22 161 | ERR458652 6 SNF2 22 162 | ERR458653 7 SNF2 22 163 | ERR458654 1 SNF2 23 164 | ERR458655 2 SNF2 23 165 | ERR458656 3 SNF2 23 166 | ERR458657 4 SNF2 23 167 | ERR458658 5 SNF2 23 168 | ERR458659 6 SNF2 23 169 | ERR458660 7 SNF2 23 170 | ERR458661 1 SNF2 24 171 | ERR458662 2 SNF2 24 172 | ERR458663 3 SNF2 24 173 | ERR458664 4 SNF2 24 174 | ERR458665 5 SNF2 24 175 | ERR458666 6 SNF2 24 176 | ERR458667 7 SNF2 24 177 | ERR458668 1 SNF2 25 178 | ERR458669 2 SNF2 25 179 | ERR458670 3 SNF2 25 180 | ERR458671 4 SNF2 25 181 | ERR458672 5 SNF2 25 182 | ERR458673 6 SNF2 25 183 | ERR458674 7 SNF2 25 184 | ERR458675 1 SNF2 26 185 | ERR458676 2 SNF2 26 186 | ERR458677 3 SNF2 26 187 | ERR458678 4 SNF2 26 188 | ERR458679 5 SNF2 26 189 | ERR458680 6 SNF2 26 190 | ERR458681 7 SNF2 26 191 | ERR458682 1 SNF2 27 192 | ERR458683 2 SNF2 27 193 | ERR458684 3 SNF2 27 194 | ERR458685 4 SNF2 27 195 | ERR458686 5 SNF2 27 196 | ERR458687 6 SNF2 27 197 | ERR458688 7 SNF2 27 198 | ERR458689 1 SNF2 28 199 | ERR458690 2 SNF2 28 200 | 
ERR458691 3 SNF2 28 201 | ERR458692 4 SNF2 28 202 | ERR458693 5 SNF2 28 203 | ERR458694 6 SNF2 28 204 | ERR458695 7 SNF2 28 205 | ERR458696 1 SNF2 29 206 | ERR458697 2 SNF2 29 207 | ERR458698 3 SNF2 29 208 | ERR458699 4 SNF2 29 209 | ERR458700 5 SNF2 29 210 | ERR458701 6 SNF2 29 211 | ERR458702 7 SNF2 29 212 | ERR458703 1 SNF2 30 213 | ERR458704 2 SNF2 30 214 | ERR458705 3 SNF2 30 215 | ERR458706 4 SNF2 30 216 | ERR458707 5 SNF2 30 217 | ERR458708 6 SNF2 30 218 | ERR458709 7 SNF2 30 219 | ERR458710 1 SNF2 31 220 | ERR458711 2 SNF2 31 221 | ERR458712 3 SNF2 31 222 | ERR458713 4 SNF2 31 223 | ERR458714 5 SNF2 31 224 | ERR458715 6 SNF2 31 225 | ERR458716 7 SNF2 31 226 | ERR458717 1 SNF2 32 227 | ERR458718 2 SNF2 32 228 | ERR458719 3 SNF2 32 229 | ERR458720 4 SNF2 32 230 | ERR458721 5 SNF2 32 231 | ERR458722 6 SNF2 32 232 | ERR458723 7 SNF2 32 233 | ERR458724 1 SNF2 33 234 | ERR458725 2 SNF2 33 235 | ERR458726 3 SNF2 33 236 | ERR458727 4 SNF2 33 237 | ERR458728 5 SNF2 33 238 | ERR458729 6 SNF2 33 239 | ERR458730 7 SNF2 33 240 | ERR458731 1 SNF2 34 241 | ERR458732 2 SNF2 34 242 | ERR458733 3 SNF2 34 243 | ERR458734 4 SNF2 34 244 | ERR458735 5 SNF2 34 245 | ERR458736 6 SNF2 34 246 | ERR458737 7 SNF2 34 247 | ERR458738 1 SNF2 35 248 | ERR458739 2 SNF2 35 249 | ERR458740 3 SNF2 35 250 | ERR458741 4 SNF2 35 251 | ERR458742 5 SNF2 35 252 | ERR458743 6 SNF2 35 253 | ERR458744 7 SNF2 35 254 | ERR458745 1 SNF2 36 255 | ERR458746 2 SNF2 36 256 | ERR458747 3 SNF2 36 257 | ERR458748 4 SNF2 36 258 | ERR458749 5 SNF2 36 259 | ERR458750 6 SNF2 36 260 | ERR458751 7 SNF2 36 261 | ERR458752 1 SNF2 37 262 | ERR458753 2 SNF2 37 263 | ERR458754 3 SNF2 37 264 | ERR458755 4 SNF2 37 265 | ERR458756 5 SNF2 37 266 | ERR458757 6 SNF2 37 267 | ERR458758 7 SNF2 37 268 | ERR458759 1 SNF2 38 269 | ERR458760 2 SNF2 38 270 | ERR458761 3 SNF2 38 271 | ERR458762 4 SNF2 38 272 | ERR458763 5 SNF2 38 273 | ERR458764 6 SNF2 38 274 | ERR458765 7 SNF2 38 275 | ERR458766 1 SNF2 39 276 | ERR458767 2 SNF2 39 277 | ERR458768 3 SNF2 39 278 | ERR458769 4 SNF2 39 279 | ERR458770 5 SNF2 39 280 | ERR458771 6 SNF2 39 281 | ERR458772 7 SNF2 39 282 | ERR458773 1 SNF2 40 283 | ERR458774 2 SNF2 40 284 | ERR458775 3 SNF2 40 285 | ERR458776 4 SNF2 40 286 | ERR458777 5 SNF2 40 287 | ERR458778 6 SNF2 40 288 | ERR458779 7 SNF2 40 289 | ERR458780 1 SNF2 41 290 | ERR458781 2 SNF2 41 291 | ERR458782 3 SNF2 41 292 | ERR458783 4 SNF2 41 293 | ERR458784 5 SNF2 41 294 | ERR458785 6 SNF2 41 295 | ERR458786 7 SNF2 41 296 | ERR458787 1 SNF2 42 297 | ERR458788 2 SNF2 42 298 | ERR458789 3 SNF2 42 299 | ERR458790 4 SNF2 42 300 | ERR458791 5 SNF2 42 301 | ERR458792 6 SNF2 42 302 | ERR458793 7 SNF2 42 303 | ERR458794 1 SNF2 43 304 | ERR458795 2 SNF2 43 305 | ERR458796 3 SNF2 43 306 | ERR458797 4 SNF2 43 307 | ERR458798 5 SNF2 43 308 | ERR458799 6 SNF2 43 309 | ERR458800 7 SNF2 43 310 | ERR458801 1 SNF2 44 311 | ERR458802 2 SNF2 44 312 | ERR458803 3 SNF2 44 313 | ERR458804 4 SNF2 44 314 | ERR458805 5 SNF2 44 315 | ERR458806 6 SNF2 44 316 | ERR458807 7 SNF2 44 317 | ERR458808 1 SNF2 45 318 | ERR458809 2 SNF2 45 319 | ERR458810 3 SNF2 45 320 | ERR458811 4 SNF2 45 321 | ERR458812 5 SNF2 45 322 | ERR458813 6 SNF2 45 323 | ERR458814 7 SNF2 45 324 | ERR458815 1 SNF2 46 325 | ERR458816 2 SNF2 46 326 | ERR458817 3 SNF2 46 327 | ERR458818 4 SNF2 46 328 | ERR458819 5 SNF2 46 329 | ERR458820 6 SNF2 46 330 | ERR458821 7 SNF2 46 331 | ERR458822 1 SNF2 47 332 | ERR458823 2 SNF2 47 333 | ERR458824 3 SNF2 47 334 | ERR458825 4 SNF2 47 335 | ERR458826 5 SNF2 47 336 | ERR458827 6 SNF2 
47 337 | ERR458828 7 SNF2 47 338 | ERR458829 1 SNF2 48 339 | ERR458830 2 SNF2 48 340 | ERR458831 3 SNF2 48 341 | ERR458832 4 SNF2 48 342 | ERR458833 5 SNF2 48 343 | ERR458834 6 SNF2 48 344 | ERR458835 7 SNF2 48 345 | ERR458878 1 WT 2 346 | ERR458879 2 WT 2 347 | ERR458880 3 WT 2 348 | ERR458881 4 WT 2 349 | ERR458882 5 WT 2 350 | ERR458883 6 WT 2 351 | ERR458884 7 WT 2 352 | ERR458885 1 WT 3 353 | ERR458886 2 WT 3 354 | ERR458887 3 WT 3 355 | ERR458888 4 WT 3 356 | ERR458889 5 WT 3 357 | ERR458890 6 WT 3 358 | ERR458891 7 WT 3 359 | ERR458892 1 WT 4 360 | ERR458893 2 WT 4 361 | ERR458894 3 WT 4 362 | ERR458895 4 WT 4 363 | ERR458896 5 WT 4 364 | ERR458897 6 WT 4 365 | ERR458898 7 WT 4 366 | ERR458899 1 WT 5 367 | ERR458900 2 WT 5 368 | ERR458901 3 WT 5 369 | ERR458902 4 WT 5 370 | ERR458903 5 WT 5 371 | ERR458904 6 WT 5 372 | ERR458905 7 WT 5 373 | ERR458906 1 WT 6 374 | ERR458907 2 WT 6 375 | ERR458908 3 WT 6 376 | ERR458909 4 WT 6 377 | ERR458910 5 WT 6 378 | ERR458911 6 WT 6 379 | ERR458912 7 WT 6 380 | ERR458913 1 WT 7 381 | ERR458914 2 WT 7 382 | ERR458915 3 WT 7 383 | ERR458916 4 WT 7 384 | ERR458917 5 WT 7 385 | ERR458918 6 WT 7 386 | ERR458919 7 WT 7 387 | ERR458920 1 WT 8 388 | ERR458921 2 WT 8 389 | ERR458922 3 WT 8 390 | ERR458923 4 WT 8 391 | ERR458924 5 WT 8 392 | ERR458925 6 WT 8 393 | ERR458926 7 WT 8 394 | ERR458927 1 WT 9 395 | ERR458928 2 WT 9 396 | ERR458929 3 WT 9 397 | ERR458930 4 WT 9 398 | ERR458931 5 WT 9 399 | ERR458932 6 WT 9 400 | ERR458933 7 WT 9 401 | ERR458934 1 WT 10 402 | ERR458935 2 WT 10 403 | ERR458936 3 WT 10 404 | ERR458937 4 WT 10 405 | ERR458938 5 WT 10 406 | ERR458939 6 WT 10 407 | ERR458940 7 WT 10 408 | ERR458941 1 WT 11 409 | ERR458942 2 WT 11 410 | ERR458943 3 WT 11 411 | ERR458944 4 WT 11 412 | ERR458945 5 WT 11 413 | ERR458946 6 WT 11 414 | ERR458947 7 WT 11 415 | ERR458948 1 WT 12 416 | ERR458949 2 WT 12 417 | ERR458950 3 WT 12 418 | ERR458951 4 WT 12 419 | ERR458952 5 WT 12 420 | ERR458953 6 WT 12 421 | ERR458954 7 WT 12 422 | ERR458955 1 WT 13 423 | ERR458956 2 WT 13 424 | ERR458957 3 WT 13 425 | ERR458958 4 WT 13 426 | ERR458959 5 WT 13 427 | ERR458960 6 WT 13 428 | ERR458961 7 WT 13 429 | ERR458962 1 WT 14 430 | ERR458963 2 WT 14 431 | ERR458964 3 WT 14 432 | ERR458965 4 WT 14 433 | ERR458966 5 WT 14 434 | ERR458967 6 WT 14 435 | ERR458968 7 WT 14 436 | ERR458969 1 WT 15 437 | ERR458970 2 WT 15 438 | ERR458971 3 WT 15 439 | ERR458972 4 WT 15 440 | ERR458973 5 WT 15 441 | ERR458974 6 WT 15 442 | ERR458975 7 WT 15 443 | ERR458976 1 WT 16 444 | ERR458977 2 WT 16 445 | ERR458978 3 WT 16 446 | ERR458979 4 WT 16 447 | ERR458980 5 WT 16 448 | ERR458981 6 WT 16 449 | ERR458982 7 WT 16 450 | ERR458983 1 WT 17 451 | ERR458984 2 WT 17 452 | ERR458985 3 WT 17 453 | ERR458986 4 WT 17 454 | ERR458987 5 WT 17 455 | ERR458988 6 WT 17 456 | ERR458989 7 WT 17 457 | ERR458990 1 WT 18 458 | ERR458991 2 WT 18 459 | ERR458992 3 WT 18 460 | ERR458993 4 WT 18 461 | ERR458994 5 WT 18 462 | ERR458995 6 WT 18 463 | ERR458996 7 WT 18 464 | ERR458997 1 WT 19 465 | ERR458998 2 WT 19 466 | ERR458999 3 WT 19 467 | ERR459000 4 WT 19 468 | ERR459001 5 WT 19 469 | ERR459002 6 WT 19 470 | ERR459003 7 WT 19 471 | ERR459004 1 WT 20 472 | ERR459005 2 WT 20 473 | ERR459006 3 WT 20 474 | ERR459007 4 WT 20 475 | ERR459008 5 WT 20 476 | ERR459009 6 WT 20 477 | ERR459010 7 WT 20 478 | ERR459011 1 WT 21 479 | ERR459012 2 WT 21 480 | ERR459013 3 WT 21 481 | ERR459014 4 WT 21 482 | ERR459015 5 WT 21 483 | ERR459016 6 WT 21 484 | ERR459017 7 WT 21 485 | ERR459018 1 WT 22 486 | ERR459019 
2 WT 22 487 | ERR459020 3 WT 22 488 | ERR459021 4 WT 22 489 | ERR459022 5 WT 22 490 | ERR459023 6 WT 22 491 | ERR459024 7 WT 22 492 | ERR459025 1 WT 23 493 | ERR459026 2 WT 23 494 | ERR459027 3 WT 23 495 | ERR459028 4 WT 23 496 | ERR459029 5 WT 23 497 | ERR459030 6 WT 23 498 | ERR459031 7 WT 23 499 | ERR459032 1 WT 24 500 | ERR459033 2 WT 24 501 | ERR459034 3 WT 24 502 | ERR459035 4 WT 24 503 | ERR459036 5 WT 24 504 | ERR459037 6 WT 24 505 | ERR459038 7 WT 24 506 | ERR459039 1 WT 25 507 | ERR459040 2 WT 25 508 | ERR459041 3 WT 25 509 | ERR459042 4 WT 25 510 | ERR459043 5 WT 25 511 | ERR459044 6 WT 25 512 | ERR459045 7 WT 25 513 | ERR459046 1 WT 26 514 | ERR459047 2 WT 26 515 | ERR459048 3 WT 26 516 | ERR459049 4 WT 26 517 | ERR459050 5 WT 26 518 | ERR459051 6 WT 26 519 | ERR459052 7 WT 26 520 | ERR459053 1 WT 27 521 | ERR459054 2 WT 27 522 | ERR459055 3 WT 27 523 | ERR459056 4 WT 27 524 | ERR459057 5 WT 27 525 | ERR459058 6 WT 27 526 | ERR459059 7 WT 27 527 | ERR459060 1 WT 28 528 | ERR459061 2 WT 28 529 | ERR459062 3 WT 28 530 | ERR459063 4 WT 28 531 | ERR459064 5 WT 28 532 | ERR459065 6 WT 28 533 | ERR459066 7 WT 28 534 | ERR459067 1 WT 29 535 | ERR459068 2 WT 29 536 | ERR459069 3 WT 29 537 | ERR459070 4 WT 29 538 | ERR459071 5 WT 29 539 | ERR459072 6 WT 29 540 | ERR459073 7 WT 29 541 | ERR459074 1 WT 30 542 | ERR459075 2 WT 30 543 | ERR459076 3 WT 30 544 | ERR459077 4 WT 30 545 | ERR459078 5 WT 30 546 | ERR459079 6 WT 30 547 | ERR459080 7 WT 30 548 | ERR459081 1 WT 31 549 | ERR459082 2 WT 31 550 | ERR459083 3 WT 31 551 | ERR459084 4 WT 31 552 | ERR459085 5 WT 31 553 | ERR459086 6 WT 31 554 | ERR459087 7 WT 31 555 | ERR459088 1 WT 32 556 | ERR459089 2 WT 32 557 | ERR459090 3 WT 32 558 | ERR459091 4 WT 32 559 | ERR459092 5 WT 32 560 | ERR459093 6 WT 32 561 | ERR459094 7 WT 32 562 | ERR459095 1 WT 33 563 | ERR459096 2 WT 33 564 | ERR459097 3 WT 33 565 | ERR459098 4 WT 33 566 | ERR459099 5 WT 33 567 | ERR459100 6 WT 33 568 | ERR459101 7 WT 33 569 | ERR459102 1 WT 34 570 | ERR459103 2 WT 34 571 | ERR459104 3 WT 34 572 | ERR459105 4 WT 34 573 | ERR459106 5 WT 34 574 | ERR459107 6 WT 34 575 | ERR459108 7 WT 34 576 | ERR459109 1 WT 35 577 | ERR459110 2 WT 35 578 | ERR459111 3 WT 35 579 | ERR459112 4 WT 35 580 | ERR459113 5 WT 35 581 | ERR459114 6 WT 35 582 | ERR459115 7 WT 35 583 | ERR459116 1 WT 36 584 | ERR459117 2 WT 36 585 | ERR459118 3 WT 36 586 | ERR459119 4 WT 36 587 | ERR459120 5 WT 36 588 | ERR459121 6 WT 36 589 | ERR459122 7 WT 36 590 | ERR459123 1 WT 37 591 | ERR459124 2 WT 37 592 | ERR459125 3 WT 37 593 | ERR459126 4 WT 37 594 | ERR459127 5 WT 37 595 | ERR459128 6 WT 37 596 | ERR459129 7 WT 37 597 | ERR459130 1 WT 38 598 | ERR459131 2 WT 38 599 | ERR459132 3 WT 38 600 | ERR459133 4 WT 38 601 | ERR459134 5 WT 38 602 | ERR459135 6 WT 38 603 | ERR459136 7 WT 38 604 | ERR459137 1 WT 39 605 | ERR459138 2 WT 39 606 | ERR459139 3 WT 39 607 | ERR459140 4 WT 39 608 | ERR459141 5 WT 39 609 | ERR459142 6 WT 39 610 | ERR459143 7 WT 39 611 | ERR459144 1 WT 40 612 | ERR459145 2 WT 40 613 | ERR459146 3 WT 40 614 | ERR459147 4 WT 40 615 | ERR459148 5 WT 40 616 | ERR459149 6 WT 40 617 | ERR459150 7 WT 40 618 | ERR459151 1 WT 41 619 | ERR459152 2 WT 41 620 | ERR459153 3 WT 41 621 | ERR459154 4 WT 41 622 | ERR459155 5 WT 41 623 | ERR459156 6 WT 41 624 | ERR459157 7 WT 41 625 | ERR459158 1 WT 42 626 | ERR459159 2 WT 42 627 | ERR459160 3 WT 42 628 | ERR459161 4 WT 42 629 | ERR459162 5 WT 42 630 | ERR459163 6 WT 42 631 | ERR459164 7 WT 42 632 | ERR459165 1 WT 43 633 | ERR459166 2 WT 43 634 | ERR459167 3 
WT	43
635 | ERR459168	4	WT	43
636 | ERR459169	5	WT	43
637 | ERR459170	6	WT	43
638 | ERR459171	7	WT	43
639 | ERR459172	1	WT	44
640 | ERR459173	2	WT	44
641 | ERR459174	3	WT	44
642 | ERR459175	4	WT	44
643 | ERR459176	5	WT	44
644 | ERR459177	6	WT	44
645 | ERR459178	7	WT	44
646 | ERR459179	1	WT	45
647 | ERR459180	2	WT	45
648 | ERR459181	3	WT	45
649 | ERR459182	4	WT	45
650 | ERR459183	5	WT	45
651 | ERR459184	6	WT	45
652 | ERR459185	7	WT	45
653 | ERR459186	1	WT	46
654 | ERR459187	2	WT	46
655 | ERR459188	3	WT	46
656 | ERR459189	4	WT	46
657 | ERR459190	5	WT	46
658 | ERR459191	6	WT	46
659 | ERR459192	7	WT	46
660 | ERR459193	1	WT	47
661 | ERR459194	2	WT	47
662 | ERR459195	3	WT	47
663 | ERR459196	4	WT	47
664 | ERR459197	5	WT	47
665 | ERR459198	6	WT	47
666 | ERR459199	7	WT	47
667 | ERR459200	1	WT	48
668 | ERR459201	2	WT	48
669 | ERR459202	3	WT	48
670 | ERR459203	4	WT	48
671 | ERR459204	5	WT	48
672 | ERR459205	6	WT	48
673 | ERR459206	7	WT	48
674 |
-------------------------------------------------------------------------------- /download.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/download.jpg -------------------------------------------------------------------------------- /effect.mbf.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/effect.mbf.pdf -------------------------------------------------------------------------------- /first_association.Rmd: --------------------------------------------------------------------------------
1 | ---
2 | title: "First association"
3 | author: "gg"
4 | date: '`r format(Sys.time(), "%d %B, %Y")`'
5 | bibliography: ~/Library/texmf/bibtex/bib/bibdesk_refs.bib
6 | fig_caption: true
7 | output:
8 |     pdf_document:
9 |         fig_caption: yes
10 | ---
11 |
12 | To run this file:
13 | Rscript -e "rmarkdown::render('first_association.Rmd')"
14 |
15 | ## Correlation
16 |
17 | Correlation is particularly problematic for compositional data because the abundance of one part affects the abundance of every other part. We can identify positively associated taxa by finding those taxa where the ratio between two or more taxa is approximately constant in all samples [@Lovell:2015]. Such groups of taxa can be said to be associated (or positively correlated). Negative associations are very problematic because they can arise from underlying biology (which we want to detect) or from the negative correlation bias (when one thing goes up, one or more others must go down!).
18 |
19 | Identifying correlated taxa is a very active and open research problem. Two methods have been proposed for datasets with particular sparsity profiles [@Friedman:2012; @Kurtz:2015], and one method has been proposed for general datasets [@Lovell:2015]. We will demonstrate $\phi$, the general method, and give one or two examples of how this can be useful when analyzing microbiome datasets. The $\phi$ metric has been modified to allow us to include taxa with 0 values by using the Monte-Carlo replicates of the data that were generated by ALDEx2.
21 |
22 | It is important to note that $\phi$ is a measure of the strength of association, and not a P value. In that way, it is similar to an effect size. You develop some intuition about how $\phi$ behaves, and choose a cutoff that you are comfortable with.
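For intuition: $\phi$ is built on the variance of the log-ratio between two features, so two features whose ratio is roughly constant across samples have $\phi$ near 0. A quick synthetic sketch (the variable names are illustrative only):

```{r phi_intuition, eval=FALSE}
set.seed(2023)
# two features held in a roughly constant ratio across 50 samples
x <- rlnorm(50, meanlog=2)
y <- 2 * x * rlnorm(50, meanlog=0, sdlog=0.05) # ~2x the first, small noise
var(log(x/y)) # the log-ratio variance, and hence phi, is near 0
```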
Guidance suggests that values less than 0.1 are appropriate for transcriptome datasets where the data are highly reproducible, and that 0.2-0.3 are more appropriate for the noisier microbiome datasets. 22 | 23 | This plots the connectivity graph for $\phi <= 0.3$, and includes additional ways to display the information from $\phi$. Displaying correlation graphs in a useful and intuitive way is a research topic by itself. Interested participants are encouraged to look at the igraph R package. 24 | 25 | 26 | 27 | ```{r, echo=FALSE,eval=TRUE,message=FALSE, warning=FALSE} 28 | library(ALDEx2) 29 | library(zCompositions) 30 | source("chunk/R/propr-functions.R") # phi association functions 31 | library(igraph) 32 | oldcat=cat 33 | cat <- function(...) {} 34 | 35 | ``` 36 | 37 | There is quite a bit of setup, since this is not in any easy-to-use R package as of yet. 38 | 39 | 40 | ```{r, echo=TRUE,eval=TRUE, results='as.is', fig.width=9, fig.height=9, error=FALSE, message=FALSE, warning=FALSE, fig.cap="The connectivity graph for phi <=0.29. In this dataset, for the most part, we have small clusters composed of taxa from the same genus. This could be because of inefficient clustering, or because of similar organisms in the same genus exploiting exactly the same resources. There are a few clusters composed of different genera, and these are most likely members of different genera exploiting resources similarly. "} 41 | 42 | # set a list of defined colours 43 | colours <- c("indianred1", "steelblue3", "skyblue1", "mediumorchid","royalblue4", "olivedrab3", 44 | "pink", "#FFED6F", "mediumorchid3", "ivory2", "tan1", "aquamarine3", "#C0C0C0", 45 | "mediumvioletred", "#999933", "#666699", "#CC9933", "#006666", "#3399FF", 46 | "#993300", "#CCCC99", "#666666", "#FFCC66", "#6699CC", "#663366", "#9999CC", "#CCCCCC", 47 | "#669999", "#CCCC66", "#CC6600", "#9999FF", "#0066CC", "#99CCCC", "#999999", "#FFCC00", 48 | "#009999", "#FF9900", "#999966", "#66CCCC", "#339966", "#CCCC33", "#EDEDED" 49 | ) 50 | 51 | # read in the dataset and associated taxonomy file 52 | d.subset <- read.table("data/ak_vs_op.txt", sep="\t", header=T, row.names=1) 53 | taxon.1 <- read.table("data/ak_vs_op_taxon.txt", sep="\t", header=T, row.names=1) 54 | 55 | # ALDEx2: generate Monte Carlo replicates of our clr data 56 | # use 16 samples of the data to keep this short(ish) 57 | x <- aldex.clr(d.subset, mc.samples=16, verbose=FALSE, useMC=TRUE) 58 | 59 | # propr: calculate the phi statistic. 60 | d.sma.df <- propr.aldex.phi(x) 61 | 62 | # choose a cutoff. In practice this can be very low for transcriptomes and higher 63 | # for microbiomes. 
Somewhere between 0.2 and 0.3 is a good place to start 64 | phi.cutoff <- 0.30 65 | 66 | # get the subset of OTUs that are joined by one or more low phi connections 67 | d.sma.lo.phi <- subset(d.sma.df, phi < phi.cutoff) 68 | 69 | # igraph: convert the connections into a graphical object 70 | # igraph is a full-featured graph analysis and display tool 71 | # the full use of which is well beyond what we can demonstrate here 72 | g <- graph.data.frame(d.sma.lo.phi, directed=FALSE) 73 | 74 | # igraph: group by clusters 75 | g.clust <- clusters(g) 76 | 77 | # make a table to examine the cluster membership by hand 78 | g.df <- data.frame(Systematic.name=V(g)$name, cluster=g.clust$membership, 79 | cluster.size=g.clust$csize[g.clust$membership]) 80 | 81 | # generate a set of clusters larger than some size 82 | # minimum is 2 (obviously) 83 | big <- g.df[which(g.df$cluster.size >= 2),] 84 | colnames(big) <- colnames(g.df) 85 | 86 | # igraph: rename the cluster members by their genus name 87 | V(g)$name <- as.vector(taxon.1[names(V(g)),"genus"]) 88 | 89 | # igraph: plot the connectivity graph 90 | plot(g, vertex.size=5, vertex.color=rgb(0,0,0,0.2), 91 | vertex.frame.color="white") 92 | ``` 93 | 94 | We next want to display the results on top of the compositional biplot to get some sense of how good the projection is, and to see if any of the low-$\phi$ clusters are distributed asymmetrically in the dataset. This can be done with a bit of fiddling. 95 | 96 | We first generate a biplot using scale=0 so that the axis values correspond to the loadings. We color the sample names in light grey, and do not plot the loading values. Then we overplot the clusters and give each cluster a unique color from the list of colours. For this plot, we will include only clusters with more than 5 members, to keep it simple. 97 | 98 | 99 | 100 | ```{r, echo=TRUE,eval=TRUE, results='as.is', fig.width=9, fig.height=9, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Inspection of the location of the large clusters on the biplot shows that taxa with very similar standard deviations on PC1 are clustered, and that the projection of the data is not particularly good. The Haemophilus cluster, in particular, has a large spread on component 2, but all have a similar distance from the origin on PC1. However, there are two clusters that appear to separate the groups strongly. One, associated with the ak group, is composed of Streptococcus; the other, associated with the op group, contains Streptococcus and Actinomyces. Some instances will have Rothia also associated with the latter group.
This is a consequence of examining random instances of the data and indicates that the Rothia exhibit a marginal score."} 101 | 102 | big <- g.df[which(g.df$cluster.size >= 5),] 103 | 104 | # zCompositions: make a pcx object as we did for the biplot 105 | d.n0 <- cmultRepl(t(d.subset), label=0, method="CZM") 106 | 107 | d.clr <- t(apply(d.n0, 1, function(x) log(x) - mean(log(x)) ) ) 108 | pcx.d <- prcomp(d.clr) 109 | 110 | biplot(pcx.d, var.axes=F, cex=c(0.4,0.01), scale=0, 111 | col=c(rgb(0,0,0,0.2), "red"), 112 | xlab=paste("PC1: ", round(sum(pcx.d$sdev[1]^2)/sum(pcx.d$sdev^2), 3), sep=""), 113 | ylab=paste("PC2: ", round(sum(pcx.d$sdev[2]^2)/sum(pcx.d$sdev^2),3),sep="") 114 | ) 115 | 116 | #abline(v=0, lty=2, col="grey") 117 | #abline(h=0, lty=2, col="grey") 118 | 119 | lev <- factor(big$cluster) 120 | for(i in as.numeric(levels(lev))){ 121 | nms <- rownames(big)[big$cluster==i] 122 | 123 | text(pcx.d$rotation[nms,][,1], pcx.d$rotation[nms,][,2], 124 | labels = taxon.1[rownames(big)[big$cluster==i],"genus"],col=colours[i], cex=1) 125 | } 126 | 127 | ``` 128 | 129 | \newpage 130 | 131 | 132 | ```{r, echo=TRUE,eval=TRUE, results='as.is', fig.width=9, fig.height=9, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Separation of groups based on ratios of sets of associated taxa. "} 133 | 134 | clop <- c("5425","34862","27937","34896","11687","36010","32985","35898","35936","36038","11","3760") 135 | 136 | clopc <- c("26066", "28841","22379", "6107","21779") 137 | 138 | clak <- c("38304","39078","39235","39306","39171","39227","37976","38334","38833") 139 | 140 | mean.op <- apply(d.clr[,clop], 1, mean) 141 | mean.ak <- apply(d.clr[,clak], 1, mean) 142 | nms <- rownames(d.clr) 143 | 144 | hist(mean.op[grep("ak", nms)] - mean.ak[grep("ak", nms)], col=rgb(0,0,1,0.4), xlim=c(-8,8), ylim=c(0,80), xlab=("ratio Grp A : Grp B"), main="Ratio") 145 | hist(mean.op[grep("op", nms)] - mean.ak[grep("op", nms)], col=rgb(1,0,0,0.4), add=T) 146 | 147 | 148 | ``` 149 | 150 | \newpage 151 | 152 | #References 153 | -------------------------------------------------------------------------------- /first_association.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/first_association.pdf -------------------------------------------------------------------------------- /make_interpret_biplot.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Simple biplot" 3 | author: "gg" 4 | date: '`r format(Sys.time(), "%d %B, %Y")`' 5 | bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib 6 | fig_caption: true 7 | output: 8 | pdf_document: 9 | fig_caption: yes 10 | --- 11 | 12 | To run this file: 13 | Rscript -e "rmarkdown::render('make_interpret_biplot.Rmd')" 14 | # The compositional biplot 15 | 16 | When analyzing and interpreting compositional data, it is important to remember that we are examining the variance in the ratios of the underlying data, and not directly examining abundance. The first tool that we will use is the compositional biplot. This is generated by the following set of steps, condensed into the code sketch below: 17 | 18 | 1. remove essential 0 values (0s that occur in all samples, i.e. nondetects) 19 | 2. perform any additional filtering (sparsity, minimal abundance, minimum sample count, etc.) 20 | 3. adjust remaining 0 values with the zCompositions package 21 | 4. perform the clr transform on the data 22 | 5. conduct a singular value decomposition using prcomp 23 | 6. display the results in a principal component plot
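As a compact sketch, the six steps above look roughly like this (assuming a count table `d` with samples in rows; the filtering of steps 1-2 is omitted, and `cmultRepl` is the same zCompositions call used below):

```{r, eval=FALSE}
library(zCompositions)
# 3. replace the remaining 0 values with a count zero multiplicative estimate
d.n0 <- cmultRepl(d, label=0, method="CZM")
# 4. clr transform: log of each value minus the mean log of the sample
d.clr <- t(apply(d.n0, 1, function(x) log(x) - mean(log(x))))
# 5. singular value decomposition
pcx <- prcomp(d.clr)
# 6. display as a compositional biplot
biplot(pcx, var.axes=FALSE, scale=0)
```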
24 | 25 | Let us see how this works in principle. We will make a sample dataset that has 30 samples and only eight features. Samples will be in two groups. The first group of 20 will differ from the last group of 10. Feature A will be more abundant in the first 20 samples, and less abundant in the last 10; features C and D will be the opposite, with feature D having a greater difference between groups than feature C. Features B and E to H will be highly variable, but the variation will be associated with samples randomly. For simplicity we will use a random-uniform distribution, but any other distribution would work as well. 26 | 27 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, message=FALSE,fig.cap="Counts were generated randomly for eight features labeled a_h and were free to range from the values indicated for each feature. Thirty samples were generated, and features A, C and D were differentially abundant in the first twenty and last 10 samples. The others were of widely different abundances, but with totally random change between samples. Feature D is a slightly randomized version of feature C that is scaled roughly 1000 fold. From this we can see that the absolute abundance is not represented in the biplot."} 28 | 29 | set.seed(7) 30 | # runif(n, min, max) 31 | A <- c(runif(20, 15, 30), runif(10,5,10)) 32 | B <- runif(30, 1, 50) 33 | C <- c(runif(20, 5,10), runif(10,15,30)) 34 | D <- C * runif(30, 1000, 1250) 35 | # D <- c(runif(20, 10, 20), runif(10,40,80)) 36 | E <- runif(30, 1, 50) 37 | F <- runif(30, 1, 2) 38 | G <- runif(30, 1000, 1200) 39 | H <- runif(30, 1000, 2000) 40 | 41 | a_h <- cbind(A,B,C,D,E,F,G,H) 42 | a_h[a_h<0] <- 0.00001 43 | 44 | # convert to proportions 45 | a_h.prop <- t(apply(a_h, 1, function(x){x/sum(x)})) 46 | 47 | # clr transform 48 | a_h.clr <- t(apply(a_h.prop, 1, function(x){log(x) - mean(log(x))})) 49 | 50 | # extract the principal components and loadings 51 | a_h.pcx <- prcomp(a_h.clr) 52 | 53 | # make a compositional biplot 54 | #biplot(a_h.pcx, var.axes=TRUE, scale=0) # form 55 | biplot(a_h.pcx, var.axes=TRUE, scale=0, 56 | xlab=paste("PC1", round(a_h.pcx$sdev[1]^2 / sum(a_h.pcx$sdev^2),3), sep=": "), 57 | ylab=paste("PC2", round(a_h.pcx$sdev[2]^2 / sum(a_h.pcx$sdev^2),3), sep=": ") 58 | ) # covariance 59 | 60 | ``` 61 | 62 | Principal component plots display the projection (shadow) of your multi-dimensional data onto a lower number of dimensions, and here the maximum number of dimensions possible is seven. The compositional biplot displays the results of your experiment in a semi-quantitative way. 63 | 64 | The first dimension is the one that displays the largest amount of variation in the data: for example, if your data had four features, it would have 3 dimensions (like a rugby ball) and the first dimension would represent the long axis. 65 | 66 | The principal components are arranged in decreasing order of the variance explained. If you have a strong effect driven by a single experimental feature, then you will have a large amount of variation explained on the first axis, and a much smaller amount on the second. 67 | 68 | From this plot we can determine the following: 69 | 70 | 1. The first two principal components explain about 80\% of the variance in the dataset. This is very good. The greater the variance explained, the greater the confidence one can have in the projection.
If we thought of this as a shadow of the data, then most features and samples would have fairly distinct locations. 71 | 2. The samples (in black) partition into two groups: samples 1-20 on the left and samples 21-30 on the right, with a clear separation between them. They are separated on PC1, which indicates a lack of confounding effects on the split between samples. 72 | 3. The length and direction of the arrows (feature locations) are proportional to the standard deviation of the feature in the dataset. So we can see that feature A is highly variable along the same direction as are samples 1-20. We can interpret this as meaning that feature A is relatively more abundant in these samples than in samples 21-30. Likewise for the other features. 73 | 4. Features C and D are very close together; this is referred to as having a short link. The length of a link is proportional to the variance of the ratio between the two features. In other words, the ratio of these two features will be fairly constant across samples; put another way, these features have a high compositional association. These are the types of relationships we aim to identify with the propr package later on. 74 | 5. We would expect that features A, C and D are the most variable between samples, and this is the type of result we would test with ALDEx2 later on. 75 | 76 | 77 | 78 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, warning=FALSE, message=FALSE,fig.cap="The advantage of compositions is that the results are largely invariant to subsetting. That is, we get essentially the same answer with all (A-H), and with a subset of the data (A-G). We also see that in these data, the variance of the clr-transformed values is much more informative than the absolute or proportional values, and better represents the known structure of the data.
"} 79 | 80 | a_h <- cbind(A,B,C,D,E,F,G,H) 81 | a_h.prop <- t(apply(a_h, 1, function(x){x/sum(x)})) 82 | a_h.clr <- t(apply(a_h.prop, 1, function(x){log(x) - mean(log(x))})) 83 | a_h.pcx <- prcomp(a_h.clr) 84 | a_h.p.pcx <- prcomp(a_h.prop) 85 | a_h.c.pcx <- prcomp(a_h) 86 | 87 | a_g <- cbind(A,B,C,D,E,F,G) 88 | a_g.prop <- t(apply(a_g, 1, function(x){x/sum(x)})) 89 | a_g.clr <- t(apply(a_g.prop, 1, function(x){log(x) - mean(log(x))})) 90 | a_g.pcx <- prcomp(a_g.clr) 91 | a_g.p.pcx <- prcomp(a_g.prop) 92 | a_g.c.pcx <- prcomp(a_g) 93 | 94 | a_h.z <- apply(a_h, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 95 | a_h.z.prop <- apply(a_h.prop, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 96 | a_h.z.prop.pcx <- prcomp(a_h.z.prop) 97 | a_h.z.pcx <- prcomp(a_h.z) 98 | 99 | a_g.z <- apply(a_g, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 100 | a_g.z.prop <- apply(a_g.prop, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 101 | a_g.z.prop.pcx <- prcomp(a_g.z.prop) 102 | a_g.z.pcx <- prcomp(a_g.z) 103 | 104 | par(mfrow=c(2,3)) 105 | #biplot(a_h.pcx, var.axes=TRUE, scale=0) # form 106 | biplot(a_h.pcx, var.axes=TRUE, scale=0, 107 | xlab=paste("PC1", round(a_h.pcx$sdev[1]^2 / sum(a_h.pcx$sdev^2),3), sep=": "), 108 | ylab=paste("PC2", round(a_h.pcx$sdev[2]^2 / sum(a_h.pcx$sdev^2),3), sep=": ") 109 | , main="A-H clr" 110 | ) # covariance 111 | 112 | biplot(a_h.p.pcx, var.axes=TRUE, scale=0, 113 | xlab=paste("PC1", round(a_h.p.pcx$sdev[1]^2 / sum(a_h.p.pcx$sdev^2),3), sep=": "), 114 | ylab=paste("PC2", round(a_h.p.pcx$sdev[2]^2 / sum(a_h.p.pcx$sdev^2),3), sep=": ") 115 | , main="A-H prop" 116 | ) # covariance 117 | 118 | biplot(a_h.c.pcx, var.axes=TRUE, scale=0, 119 | xlab=paste("PC1", round(a_h.c.pcx$sdev[1]^2 / sum(a_h.c.pcx$sdev^2),3), sep=": "), 120 | ylab=paste("PC2", round(a_h.c.pcx$sdev[2]^2 / sum(a_h.c.pcx$sdev^2),3), sep=": ") 121 | , main="A-H count" 122 | ) # covariance 123 | 124 | biplot(a_g.pcx, var.axes=TRUE, scale=0, 125 | xlab=paste("PC1", round(a_g.pcx$sdev[1]^2 / sum(a_g.pcx$sdev^2),3), sep=": "), 126 | ylab=paste("PC2", round(a_g.pcx$sdev[2]^2 / sum(a_g.pcx$sdev^2),3), sep=": ") 127 | , main="A-G clr" 128 | ) # covariance 129 | 130 | biplot(a_g.p.pcx, var.axes=TRUE, scale=0, 131 | xlab=paste("PC1", round(a_g.p.pcx$sdev[1]^2 / sum(a_g.p.pcx$sdev^2),3), sep=": "), 132 | ylab=paste("PC2", round(a_g.p.pcx$sdev[2]^2 / sum(a_g.p.pcx$sdev^2),3), sep=": ") 133 | , main="A-G prop" 134 | ) # covariance 135 | 136 | biplot(a_g.c.pcx, var.axes=TRUE, scale=0, 137 | xlab=paste("PC1", round(a_g.c.pcx$sdev[1]^2 / sum(a_g.c.pcx$sdev^2),3), sep=": "), 138 | ylab=paste("PC2", round(a_g.c.pcx$sdev[2]^2 / sum(a_g.c.pcx$sdev^2),3), sep=": ") 139 | , main="A-G count" 140 | ) # covariance 141 | 142 | ``` 143 | -------------------------------------------------------------------------------- /make_interpret_biplot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/make_interpret_biplot.pdf -------------------------------------------------------------------------------- /multi_comp.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The data we have: multivariate compositional data" 3 | author: "gg" 4 | date: '`r format(Sys.time(), "%d %B, %Y")`' 5 | bibliography: ~/Library/texmf/bibtex/bib/bibdesk_refs.bib 6 | fig_caption: true 7 | output: 8 | pdf_document: 9 | fig_caption: yes 10 | --- 11 | 12 | To run this file: 13 | Rscript 
-e "rmarkdown::render('multi_comp.Rmd')" 14 | 15 | #Part 1: relative abundance data 16 | 17 | Whenever we analyze microbiome data, we analyze the data as `relative abundance' data, with or without normalization. What are relative abundance data? 18 | 19 | Relative abundance data are data where the absolute count of something has no meaning. Instead, the only thing we can assess are the relationships between datapoints. That is, their ratios. This is compositional data. 20 | 21 | When analyzing compositional data, or CoDa, it is customary (as we shall see) to log-transform the data in some way, and log-transformed data require a different way of thinking. In linear space, a ratio is $x/y$, while in log space a ratio is $\log(x) - \log(y)$. So whenever we are working with log-ratios, what we are really saying is that we are working with the differences between log-transformed datapoints, and that these differences are equivalent to ratios in linear space. 22 | 23 | A second issue is that of correlation of log-transformed data [@Lovell:2015]. In linear space, perfectly correlated data will follow the familiar formula $y = m \times x + b$, where y and x are variables, m is the slope of the line and b is the intercept. However, after log transformation the multiplier m sets the intercept of the line ($\log(m)$), and a non-zero b is no longer a linear parameter: it introduces curvature. Consider the simple case of plotting y as a function of x from 1:10, with b set to 0, or b set to 2. 24 | 25 | Here we can see that in the case of the intercept being 0, the data are linearly related in both the normal and logarithmic spaces. In normal space, the intercept is 0, and the slope of the line changes. However, the two parts are always in constant ratio. In log space the intercept of the line changes, but the slope stays at 1. 26 | 27 | In the case of the intercept not being 0, in the normal space the lines appear to be perfectly correlated, but in log space we can see that the lines are now curved because of the effect of a non-0 intercept. This curvature indicates that the ratios between x and y are changing, and so x and y are not associated in logarithmic space. 28 | 29 | This problem with correlation is not corrected by using non-parametric correlation measures! It is also worth noting that the recently proposed DESeq count normalization [@McMurdie:2014a] is simply a linear transform of the relative abundance values: i.e., if the count being normalized to is $1 \times 10^6$, then this is simply relative abundance $\times 10^4$.
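A quick numeric check of the point about non-parametric measures (a toy sketch, separate from the plotting code below): a non-zero intercept leaves the rank correlation perfect, yet the ratio, the compositionally meaningful quantity, still drifts.

```{r}
x <- 1:10
y <- 2 * x + 2
cor(x, y, method="spearman") # 1: a non-parametric measure sees no problem
round(y / x, 2)              # but the ratio y/x is not constant
var(log(y / x))              # so the log-ratio variance is non-zero
```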
30 | 31 | 32 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, message=FALSE,fig.cap="Plots to show the effect of log-transformation on simple linear lines. The grey dashed lines in the log space plots are lines of constant ratio."} 33 | y <- function(m,x,b){m * x + b} 34 | 35 | par(mfrow=c(2,2)) 36 | x <- seq(1:10) 37 | b <- 0 38 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(0,10), ylim=c(0,20), main="normal space") 39 | points(x, y(2,x,b), col="blue", pch=19, type="b") 40 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 41 | text(1.5,15, labels="y=2x", col="blue") 42 | text(1.5,12.5, labels="y=x", col="red") 43 | text(1.5,10, labels="y=0.5x", col="orange") 44 | 45 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(1,10), ylim=c(0.5,20), log="xy", main="log space") 46 | points(x, y(2,x,b), col="blue", pch=19, type="b") 47 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 48 | text(1.5,18, labels="y=2x", col="blue") 49 | text(1.5,12, labels="y=x", col="red") 50 | text(1.5,8, labels="y=0.5x", col="orange") 51 | abline(0.1,1, lty=2, col=rgb(0,0,0,0.2)) 52 | 53 | b <- 2 54 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(0,10), ylim=c(0.5,20), main="normal space") 55 | points(x, y(2,x,b), col="blue", pch=19, type="b") 56 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 57 | text(1.5,15, labels="y=2x + 2", col="blue") 58 | text(1.5,12.5, labels="y=x + 2", col="red") 59 | text(1.5,10, labels="y=0.5x + 2", col="orange") 60 | 61 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(1,10), ylim=c(2,20), log="xy", main="log space") 62 | points(x, y(2,x,b), col="blue", pch=19, type="b") 63 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 64 | text(1.5,18, labels="y=2x + 2", col="blue") 65 | text(1.5,12, labels="y=x + 2", col="red") 66 | text(1.5,8, labels="y=0.5x + 2", col="orange") 67 | abline(0.5,0.6, lty=2, col=rgb(0,0,0,0.2)) 68 | 69 | ``` 70 | 71 | \newpage 72 | 73 | 74 | #PART 2: multivariate compositional data 75 | 76 | Univariate data have one variable per sample. This is the typical kind of data generated by biologists. For example, you measure the height of people in a room, divide them into male and female and compare. In this case, we have one variable (height) measured for two groups (male and female). These data are unconstrained, within the bounds of human height, and are in general what traditional statistical tools were developed for. 77 | 78 | Multivariate data have more than one variable per sample. Multivariate data can be independent or dependent. If the variables are truly independent (randomly chosen, not linked) then each may be treated as univariate. For example, if we measure height, hair color and handedness of people, then each person is an observation that holds three variables. We can assume that these are (relatively) independent, and all statistical tests, univariate and multivariate, should be valid and, importantly, multiple test corrections would be valid. 79 | 80 | However, if the variables are dependent, then we have many unappreciated problems. Unfortunately, the typical high-throughput sequencing dataset is multivariate and highly dependent [@gloor2016s;@Gloor:2016cjm]. This dependency is forced upon the data by the sequencing instrument itself. 81 | 82 | We acknowledge this dependence when we call the data `relative abundance' data. What does this really mean? 83 | 84 | Let us set up a thought experiment. Let us imagine we have a very simple dataset composed of 100 samples and three taxa. Note that everything about this example generalizes to datasets with more samples and more taxa; it is just difficult to show this with an n-dimensional graphic.
In this example, we will generate counts for 100 samples from a single experiment where the taxa abundance can range anywhere between 1 and an arbitrary number of counts. We will assume that we are actually counting the number of molecules belonging to each taxon, and that there is no practical upper limit on the number of molecules. 85 | 86 | We will save the special case of 0 for our practical demonstration. Now we have 100 multivariate observations with three variables. 87 | 88 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=3, message=FALSE,fig.cap="Counts were generated randomly for three taxa labeled a, b and c, and were free to range from 1 to 1000 for a, up to 2000 for b, and up to 3000 for c. One hundred samples were generated. The correlation coefficient of each taxon vs. each other taxon is shown above the plots."} 89 | library(compositions) 90 | library(scatterplot3d) 91 | 92 | a <- round(runif(100, min=1, max=1000)) 93 | b <- round(runif(100, min=1, max=2000)) 94 | c <- round(runif(100, min=1, max=3000)) 95 | abc <- rbind(a,b,c) 96 | tot=1 97 | abc.comp <- apply(abc, 2, function(x){x/sum(x) * tot}) 98 | abc.acomp <- acomp(abc, tot=1) 99 | 100 | par(mfrow=c(1,3)) 101 | plot(a,b, main=round(cor(a,b),3)) 102 | plot(a,c, main=round(cor(a,c),3)) 103 | plot(b,c, main=round(cor(b,c),3)) 104 | ``` 105 | 106 | The plots show that the variables in each of the samples are randomly placed. The data in each pair of taxa are essentially uncorrelated, as we expect for randomly generated data where each point is absolutely independent of every other: in other words, this is the best-case scenario, and one that would rarely be seen in a biological context. We shall now constrain the data to a constant sum and see how this affects the shape of the data. 107 | 108 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=3, message=FALSE,fig.cap="The same data as above were constrained to an arbitrary constant sum: in this case, the data were converted to proportions. Any arbitrary total (percentage, ppm) has the same effect. The correlation coefficient of each taxon vs. each other taxon is shown above the plots. We can see very clearly that the datapoints are no longer independent."} 109 | par(mfrow=c(1,3)) 110 | plot(abc.comp[1,], abc.comp[2,], main=round(cor(abc.comp[1,],abc.comp[2,]),3)) 111 | plot(abc.comp[1,], abc.comp[3,], main=round(cor(abc.comp[1,],abc.comp[3,]),3)) 112 | plot(abc.comp[2,], abc.comp[3,], main=round(cor(abc.comp[2,],abc.comp[3,]),3)) 113 | ``` 114 | 115 | The simple act of converting each count to a proportion markedly skews the data. Here we can see that the data now appear much more correlated and constrained. While this is a simple example with three taxa, it is generalizable to any number of samples, and to any number of variables (taxa) in a sample. 116 | 117 | Why is this so? 118 | 119 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=3, message=FALSE,fig.cap="Plotting these data in three dimensions shows the reason for the non-independence. The unconstrained data are on the left, the constrained data are in the middle, and the right side shows the data as if we were looking directly down onto the plane of data seen in the middle box.
Data points are coloured from red to black to show their place on the b axis."} 120 | par(mfrow=c(1,3)) 121 | scatterplot3d(t(abc), scale=1.5, highlight.3d=T) 122 | scatterplot3d(t(abc.comp), xlim=c(0,tot), ylim=c(0,tot), zlim=c(0,tot), scale=1.5, highlight.3d=T) 123 | plot(t(abc.acomp), pch=19, cex=0.5, col=rgb(0,0,0,0.5)) 124 | 125 | ``` 126 | 127 | The essential problem is one of geometry. When the data are unconstrained, as shown in the box on the left, the data points are scattered uniformly within the box. This is a visualization of multivariate data that are unconstrained and uncorrelated. Essentially, in data of this type, knowing information about one datapoint gives no information about any other datapoint. 128 | 129 | When the data are constrained to a constant sum as shown in the middle box, then the datapoints collapse to a flat plane within the box, with limits at the corners. This plane is called a simplex, and occurs because if we know information about all points but one, then we know the information about the last point as well. For example, if the data must sum to 1, then knowing that the previous 10 points have a sum of 0.9, we know that the last point must have a value of 0.1. Thus, we can see intuitively that the data are constrained. We can re-orient our view to look directly perpendicular to the simplex plane and observe how the three dimensions of the data map onto the two-dimensional space on the right. 130 | 131 | The constrained data clearly have a very different geometry than the unconstrained data points. Moving from place to place in the box on the left is an additive process, since the difference between points is a linear distance. In the simplex, moving from point to point is a multiplicative process, and we think of differential taxon abundance in terms of being multiples of the original abundance. 132 | 133 | In order to analyze such data, we must place ourselves on the simplex, or modify the data to move it from the constrained simplex to an unconstrained space. Thus, we need to recast our thinking into a compositional data analysis (CoDa) way of thinking.
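The multiplicative nature of movement on the simplex can be seen in a few lines. This is a minimal sketch (toy proportions, with a hypothetical `clo` helper for the closure operation): perturbing a single taxon and re-closing changes every component, and it changes them by multiples, not by differences.

```{r}
clo <- function(x) x / sum(x)   # closure: rescale a vector to proportions
p1 <- clo(c(10, 20, 70))
p2 <- clo(p1 * c(2, 1, 1))      # 'double' taxon 1, then re-close
p2 / p1                         # every component has moved, multiplicatively
```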
134 | 135 | #References and Further reading 136 | 137 | -------------------------------------------------------------------------------- /multi_comp.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/multi_comp.pdf -------------------------------------------------------------------------------- /prop_and_diffprop_with_propr.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Proportionality and differential proportionality using the propr R package" 3 | author: "Ionas Erb" 4 | date: "`r Sys.Date()`" 5 | output: html_document 6 | --- 7 | 8 | ## Introduction 9 | 10 | In this document, we will introduce two concepts of Compositional Data Analysis that are useful for RNA-seq analysis: Proportionality and differential proportionality. While the former can be seen as a substitute for correlation, the latter can be understood as a differential expression of gene ratios. Both try to circumvent the problems introduced by the constant-sum constraint imposed on the data by the sequencing procedure without making normalization assumptions. As a worked example, we are using the SNF2-knockout yeast data set from the Barton group (Gierlinski et al. 2015). We can load it by 11 | ```{r} 12 | ma=read.table("barton_agg.tsv",row.names=1,header=TRUE) 13 | dim(ma) 14 | ``` 15 | The rows of the matrix are $N=96$ observations of the 6349 genes in the columns. The observations fall in two groups, the SNF2-knockout and the wildtype. Although we will need it only in the final section of this document, let us define the group variable already here: 16 | ```{r} 17 | gr=c(rep(0,48),rep(1,48)) 18 | ``` 19 | Let us now remove some genes that have on average less than one count per sample. 20 | ```{r} 21 | M=ma[,which(apply(ma,2,sum)>95)] 22 | dim(M) 23 | ``` 24 | This leaves us with $D=5958$ genes, some of which will have zero entries somewhere. Note that the statistics that we present below can be modified to explicitly deal with zeros (and much of this is already implemented in propr), but we will not have time to cover this topic here. 25 | 26 | ## Proportionality 27 | 28 | Correlation between variables cannot be defined consistently on compositional data, but positive correlation can be quantified using the log-ratio variance (LRV), see (Aitchison 1986). 29 | $$\textrm{LRV}(x,y) = \textrm{var}\left(\log\frac{x_1}{y_1},\dots,\log\frac{x_N}{y_N}\right)$$ 30 | 31 | Since ratios are taken within each single sample, any (normalization) factor constraining the sample to a given sum will cancel out. The log is needed to make the sample-wise ratios symmetric with their reciprocal values. Now the variance of the sample-wise log-ratios will be close to zero for variables changing in about the same way across the samples. Such genes $x$, $y$ are *proportional*, i.e. $y = \textrm{const}\cdot x$, because a vanishing LRV implies $y/x = \textrm{const}$. It is equivalent to saying that the genes have a linear relationship without $y$-intercept. Let us look at two arbitrary genes now: 32 | ```{r} 33 | plot(M[,1],M[,2],xlim=c(0,range(M[,1])[2]),ylim=c(0,range(M[,2])[2]),xlab=colnames(M)[1],ylab=colnames(M)[2]) 34 | ``` 35 | 36 | Scatter plots like this one are intrinsically problematic for compositional data. This is because they assume that the values of the variables displayed on the axes have an absolute meaning. Compositional variables are parts of a whole and should be understood in relationship to other variables only. This implies that the total sum to which each sample adds up should not play a role in the analysis. To make this clearer, let us multiply each sample by some random prefactors between 0 and 1: 37 | ```{r} 38 | set.seed(123) 39 | fac=sample(seq(0.01,1,0.01),dim(M)[1]) 40 | plot(fac*M[,1],fac*M[,2],xlim=c(0,range(fac*M[,1])[2]),ylim=c(0,range(fac*M[,2])[2]),xlab=colnames(M)[1],ylab=colnames(M)[2]) 41 | ``` 42 | 43 | As we can see already in the plots, the "crude" correlation between the genes changes. The Pearson correlation increases a lot. 44 | ```{r} 45 | cor(M[,1],M[,2]) 46 | cor(fac*M[,1],fac*M[,2]) 47 | ``` 48 | 49 | Now compare with this ratio representation and the unchanged variance of the log-ratio: 50 | ```{r} 51 | plot(c(1:96),M[,1]/M[,2],log="y",xlab="sample",ylab=paste(colnames(M)[1],colnames(M)[2],sep="/")) 52 | var(log(M[,1]/M[,2])) 53 | var((log(fac*M[,1]/(fac*M[,2])))) 54 | ``` 55 | 56 | Is 0.26 a small LRV? Unlike correlation, LRV has no natural scale. We could argue that a value of 0.26 is more impressive for genes that vary a lot by themselves than for those that change little across samples anyway. This idea was introduced by (Lovell et al. 2015). In their paper, the authors set out to scale LRV by the variance of one or both of the individual genes.
Note, however, that the variance of individual genes is problematic because it presupposes that we can have a meaningful comparison between samples based on raw counts. Following compositional data-analysis principles, inter-sample comparisons can only be done for ratios. The individual gene variances used for the scaling of LRV must therefore themselves be based on ratios. This can be done with a reference variable $z$ (in the form of an individual gene or the geometric mean over all genes): 57 | 58 | $$\rho(x,y)=\frac{2~\textrm{cov}(\log(x/z),\log(y/z))}{\textrm{var}(\log(x/z))+\textrm{var}(\log(y/z))}=1-\frac{\textrm{LRV}(x,y)}{\textrm{LRV}(x,z)+\textrm{LRV}(y,z)}$$ 59 | 60 | The first equality shows that $\rho$ is similar to a correlation between the logs, only that the geometric mean of the variances is replaced by the arithmetic mean. $\rho$ punishes both low correlation (between the logs) and a finite $y$-intercept of the scatter. Like normal correlation, it has the nice property that it falls between -1 and 1, and a value of 1 indicates full proportionality (-1 indicates reciprocality).
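The second form of the definition can be computed directly. Here is a toy numerical check (simulated vectors, not genes from the data matrix M):

```{r}
set.seed(2)
n <- 30
x <- rlnorm(n, 3, 1)
y <- x * rlnorm(n, 0, 0.1)   # nearly proportional to x
z <- rlnorm(n, 3, 1)         # an (arbitrary) reference
1 - var(log(x/y)) / (var(log(x/z)) + var(log(y/z))) # close to 1
```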
To return to the example above, and taking as a reference the gene *Rev7* (YIL139C), we can inspect the scaled version of the scatter plot and the proportionality coefficient $\rho$: 61 | ```{r} 62 | z=M[,"YIL139C"] 63 | plot(M[,1]/z,M[,2]/z,xlim=c(0,range(M[,1]/z)[2]),ylim=c(0,range(M[,2]/z)[2]),xlab=paste(colnames(M)[1],"REV7",sep="/"),ylab=paste(colnames(M)[2],"REV7",sep="/")) 64 | abline(a=0,b=sqrt(var(M[,2]/z)/var(M[,1]/z)),col="red") 65 | 1-var(log(M[,1]/M[,2]))/(var(log(M[,1]/z))+var(log(M[,2]/z))) 66 | ``` 67 | The red line indicates the theoretical relationship the genes should follow if they were proportional. However, we can see a rather noisy cloud that is shifted with respect to the zero-intercept line, indicating that these genes cannot be proportional. This is confirmed by the rather modest coefficient $\rho$. Let us now change the reference to the geometric mean of the genes (that do not have zero-entries): 68 | ```{r} 69 | g=exp(apply(log(M[,which(apply(M,2,min)>0)]),1,mean)) 70 | plot(M[,1]/g,M[,2]/g,xlim=c(0,range(M[,1]/g)[2]),ylim=c(0,range(M[,2]/g)[2]),xlab=paste(colnames(M)[1],"gMean",sep="/"),ylab=paste(colnames(M)[2],"gMean",sep="/")) 71 | abline(a=0,b=sqrt(var(M[,2]/g)/var(M[,1]/g)),col="red") 72 | 1-var(log(M[,1]/M[,2]))/(var(log(M[,1]/g))+var(log(M[,2]/g))) 73 | ``` 74 | Clearly, here the result of $\rho$ depends a lot on the reference used. How this can introduce spurious results was worked out in detail in (Erb & Notredame 2016). However, it was found empirically that spurious results seem to affect mainly sensitivity, not specificity (Quinn et al. 2017). Let us now use the propr package to calculate both versions of $\rho$ on our entire data set: 75 | ```{r} 76 | library(propr) 77 | rho.g=perb(M) 78 | rho.z=perb(M,ivar=which(colnames(M)=="YIL139C")) 79 | str(rho.g) 80 | ``` 81 | The object that is created by propr contains, among other things, an object @matrix of the $\rho$ coefficients. We can fill the @pairs object with pairs we are interested in using a cutoff on $\rho$. Let us first look at all of them using a histogram. 82 | ```{r} 83 | rho.g=rho.g[">",-1.1] 84 | hist(rho.g@matrix[rho.g@pairs],main="Rho with geometric mean reference") 85 | ``` 86 | 87 | Let us compare this with the $\rho$ using our reference gene: 88 | ```{r} 89 | rho.z=rho.z[">",-1.1] 90 | hist(rho.z@matrix[rho.z@pairs],main="Rho with Rev7 reference") 91 | ``` 92 | 93 | While the geometric mean forces the distribution to be symmetric, our reference gene might better approximate an unchanged reference, and we could gain some sensitivity using it. Let us now see if the pairs with coefficients close to one are the same for both coefficients: 94 | ```{r} 95 | rho.z@pairs=numeric(0) 96 | rho.z=rho.z[">",0.98] 97 | rho.g@pairs=numeric(0) 98 | rho.g=rho.g[">",0.98] 99 | length(rho.z@pairs) 100 | length(rho.g@pairs) 101 | length(intersect(rho.g@pairs,rho.z@pairs)) 102 | ``` 103 | It appears that for this cut-off of 0.98 we get some 200 pairs more using the *Rev7* reference, but fortunately, the rest of the pairs are in perfect agreement. Let us now recover the gene indices from the consensus pairs and put them into a data frame: 104 | ```{r} 105 | cind=round(dim(M)[2]*(rho.g@pairs/dim(M)[2]-floor(rho.g@pairs/dim(M)[2]))) 106 | rind=round(floor(rho.g@pairs/dim(M)[2])+1) 107 | ppairs=as.data.frame(cbind(sort(cind),rind[order(cind)])) 108 | ``` 109 | 110 | We can now use the igraph package to gain an idea about the connectivity: 111 | ```{r} 112 | library(igraph) 113 | graph=graph_from_data_frame(ppairs,directed=FALSE) 114 | set.seed(123) 115 | plot(graph,vertex.size=2,vertex.label=NA) 116 | sort(degree(graph),decreasing=TRUE)[1:32] 117 | ``` 118 | 119 | Instead of exploring these modules, we will now move on to the section about differential proportionality. Since the calculations for this take a few minutes, let us prepare the object already (here it is commented out for rendering the markdown document quickly): 120 | ```{r} 121 | #res=propd(M,group=gr) 122 | ``` 123 | 124 | 125 | 126 | ## Differential proportionality 127 | 128 | 129 | Differential proportionality as we present it here is a differential expression analysis of gene ratios (Erb et al. 2017). What does it mean for a ratio to be differentially expressed? Let us look at the following example: 130 | ```{r} 131 | plot(M[,101],M[,80],col=gr+2,xlab=colnames(M)[101],ylab=colnames(M)[80]) 132 | abline(a=0,b=sqrt(var(M[1:48,80])/var(M[1:48,101])),col="red") 133 | abline(a=0,b=sqrt(var(M[49:96,80])/var(M[49:96,101])),col="green") 134 | ``` 135 | 136 | We can see that the slope of the scatter is condition-specific. The genes are clearly not proportional, but they appear to be proportional within each of the conditions. The colored lines indicate the theoretical relationship they would follow if the genes were perfectly proportional within each condition. As we can see, the red line fits pretty well, while the green line is a bit off, indicating that proportionality is only approximately true here. 137 | 138 | A plot showing the ratios on a log scale makes the equivalence with differential expression quite clear. Instead of a log-fold change, we have a log-ratio change. The horizontal lines indicate the log-ratio means in both groups.
139 | ```{r} 140 | plot(c(1:96),M[,101]/M[,80],col=gr+2,log="y",xlab="sample",ylab=paste(colnames(M)[101],colnames(M)[80],sep="/")) 141 | abline(h=exp(mean(log(M[1:48,101]/M[1:48,80]))),col="red") 142 | abline(h=exp(mean(log(M[49:96,101]/M[49:96,80]))),col="green") 143 | ``` 144 | 145 | The measure for finding such gene pairs in our data set (where $K=48$) is this: 146 | 147 | $$\vartheta(x,y)=\frac{(K-1)~\textrm{LRV}_{1,...,K}(x,y)+(N-K-1)~\textrm{LRV}_{K+1,...,N}(x,y)}{(N-1)~\textrm{LRV}_{1,...,N}(x,y)}$$ 148 | 149 | The LRV indices are meant to indicate that LRV is only evaluated on the respective subset. The LRV in the denominator is of course the one we were using before, and the indices are only shown for clarity. $\vartheta$ is the ratio of within-group LRV to total LRV and thus, unlike LRV, has the nice property of falling between 0 and 1. Like LRV, it is independent of a reference, and values close to zero indicate the interesting cases. It can be derived from a decomposition of LRV into between- and within-group variance as known from analysis of variance (ANOVA). We are essentially doing a $t$-test on the ratios, only that the direction of the fold change (i.e. the log-ratio change) does not matter (because it does not matter which of the two genes is in the denominator of the ratio). The statistic is thus fully equivalent to the squared $t$-statistic (i.e. the $F$-statistic of one-way ANOVA).
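The formula can be checked numerically on simulated genes (a toy sketch, not the Barton data): a pair whose ratio shifts between two groups of 48 samples gets a small $\vartheta$.

```{r}
set.seed(3)
N <- 96; K <- 48
x <- rlnorm(N, 3, 0.2)
y <- x * rlnorm(N, 0, 0.2) * rep(c(1, 4), each = K) # ratio shifts 4-fold between groups
lrv <- function(u, v) var(log(u / v))
((K - 1) * lrv(x[1:K], y[1:K]) + (N - K - 1) * lrv(x[(K + 1):N], y[(K + 1):N])) /
  ((N - 1) * lrv(x, y))   # well below 1: within-group LRV is small relative to total
```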
150 | 151 | Our calculations before the start of this section provided us with the res object. For the example pair shown before it contains this entry: 152 | ```{r} 153 | res@theta[5030,] 154 | ``` 155 | 156 | $\vartheta$ indicates the significance of the differential proportionality, and we can associate it with a false discovery rate (FDR) using a permutation test. We will not do this here because of time constraints, but the command for 10 permutations would be this: 157 | ```{r} 158 | #res=updateCutoffs(res,cutoff=seq(0.05,0.95,0.01), p=10) 159 | ``` 160 | 161 | This will populate the res@fdr object where FDRs for the specified cut-off sequence are provided. (I can assure you that for our data set even a cut-off as high as 0.95 is still highly significant). What about the effect size then? The equivalent to the log-fold change of differential expression is the log-ratio change. It is the difference of the log-ratio means (LRMs) of the two groups and can be used to calculate the between-group variance: 162 | 163 | $$(1-\vartheta(x,y))~\textrm{LRV}(x,y)=\frac{K(N-K)}{N^2}(\textrm{LRM}_{1,...,K}(x,y)-\textrm{LRM}_{K+1,...,N}(x,y))^2$$ 164 | 165 | The group LRMs can also be directly read off the res@theta object (see example above). Something similar to a volcano plot can now be obtained when we plot the size of the log-ratio change against theta. We are doing this here only for a subset of 100,000 pairs to speed up the plotting: 166 | ```{r} 167 | set.seed(123) 168 | myset=sample(c(1:dim(res@theta)[1]),100000) 169 | plot(res@theta$theta[myset],abs(res@theta$lrm1[myset]-res@theta$lrm2[myset]),col=rgb(0.1,0.1,0.1,0.1),pch=20,xlab="theta",ylab="absolute log-ratio change") 170 | ``` 171 | 172 | There is a conspicuous set of pairs with low $\vartheta$ and high LRMs. We extract it and look at the degrees of the resulting graph nodes: 173 | ```{r} 174 | pairset=which(abs(res@theta$lrm1-res@theta$lrm2)>5) 175 | graph=graph_from_data_frame(res@theta[pairset,1:2],directed=FALSE) 176 | sort(degree(graph),decreasing=TRUE)[1:10] 177 | length(pairset) 178 | colnames(res@counts)[5300] 179 | ``` 180 | 181 | The degree of the first gene is about the size of the pairset, which corresponds almost to the total number of genes. So one gene appears to be connected to all the other genes. This is the knockout gene *SNF2*, whose systematic name is YOR290C. 182 | 183 | In our "volcano" plot we can also see that there are a number of pairs with even lower $\vartheta$ than the ones from this module. To extract the genes involved, we extract the gene indices from the res@theta object. Again, to prioritize genes, it is interesting to look at connectivity. 184 | ```{r} 185 | pairset=which(res@theta$theta<0.01) 186 | geneset=unique(c(res@theta[pairset,1],res@theta[pairset,2])) 187 | length(geneset) 188 | graph=graph_from_data_frame(res@theta[pairset,1:2],directed=FALSE) 189 | sort(degree(graph),decreasing=TRUE)[1:10] 190 | set.seed(123) 191 | plot(graph,vertex.size=2,vertex.label=NA) 192 | ``` 193 | 194 | ## References 195 | 196 | 1. Gierliński, Marek et al. "Statistical Models for RNA-Seq Data Derived from a Two-Condition 48-Replicate Experiment." Bioinformatics 31 (2015): 3625–3630. 197 | 2. Aitchison, John. The Statistical Analysis of Compositional Data. Chapman & Hall (1986); reprinted in 2003, with additional material, by The Blackburn Press. 198 | 3. Lovell, David, et al. "Proportionality: a valid alternative to correlation for relative data." PLoS Computational Biology 11 (2015): e1004075. 199 | 4. Erb, Ionas & Notredame, Cedric. "How should we measure proportionality on relative gene expression data?" Theory in Biosciences 135 (2016): 21–36. 200 | 5. Quinn, Thom et al. "propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis." Scientific Reports 7 (2017): 16252. 201 | 6. Erb, Ionas et al. "Differential proportionality – a normalization-free approach to differential gene expression." Proceedings of CoDaWork 2017, available at bioRxiv (2017): 134536. 202 | -------------------------------------------------------------------------------- /replicate_Evalue.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/replicate_Evalue.pdf -------------------------------------------------------------------------------- /replicate_lowEvalue.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/replicate_lowEvalue.pdf -------------------------------------------------------------------------------- /zero.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "What is a 0 anyway" 3 | author: "gg" 4 | date: "`r Sys.Date()`" 5 | fig_caption: true 6 | output: pdf_document 7 | --- 8 | To run this file: 9 | Rscript -e "rmarkdown::render('zero.Rmd')" 10 | 11 | # 0 is an observation, not a ground truth 12 | 13 | We will explore what a 0, or any other value for that matter, means in a probabilistic sense.
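Before touching the data, a minimal sketch of the idea (assuming simple Poisson counting): an observed 0 is entirely consistent with a non-zero underlying rate.

```{r}
# probability of observing a count of 0 at several underlying mean counts
dpois(0, lambda=c(0.1, 0.5, 1, 2, 3))
# ~0.90 0.61 0.37 0.14 0.05: a feature with a true mean of 2 reads
# will still show a 0 in about 14% of replicates
```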
14 | 15 | We take a dataset that contains 7 technical replicates of the yeast data, and compare one technical replicate with another. 16 | 17 | ```{r, echo=TRUE} 18 | d <- as.matrix(read.table("data/countfinal2.tsv", header=T, row.names=1)) 19 | 20 | # remove 0 sum features 21 | d.n0 <- d[rowSums(d) > 0,] 22 | ``` 23 | At this point we have removed all features that are 0 in all samples. 24 | These features would almost certainly not be 0 if we sequenced 10X more deeply, but they are of no consequence for the present analysis. 25 | 26 | In the context of RNA-seq, 0 is an anomaly since most genes have a low level of stochastic expression. In the context of single-cell seq, a 0 can be a true 0 because expression is somewhat binary in single cells, but averaged over a population RNA expression is continuous. 27 | 28 | We can now examine technical replicates and see what a 0 means in the context of a near-perfect replicate. What happens if one replicate is 0? 29 | 30 | ```{r, echo=TRUE} 31 | # choose the features of replicate 2 where the value of replicate 32 | # 1 is 0, and so on 33 | one_is_0 <- d.n0[,2][d.n0[,1] == 0] 34 | one_is_1 <- d.n0[,2][d.n0[,1] == 1] 35 | one_is_2 <- d.n0[,2][d.n0[,1] == 2] 36 | one_is_32 <- d.n0[,2][d.n0[,1] == 32] 37 | 38 | # plot the distribution of values from replicate 2 where replicate 1 39 | # has a value of 0 or a value of 32 40 | par(mfrow=c(1,2)) 41 | hist(one_is_0, breaks=9, xlim=c(0,20)) 42 | hist(one_is_32, breaks=20) 43 | ``` 44 | 45 | Just for completeness, we can choose a second replicate to compare. Here the first technical replicate is in blue, the second technical replicate is in pink, and the expected value of 0 in the technical replicate is calculated as the mean of the replicate distribution. In this dataset, E(0) is approximately 0.3. Note that the most likely value for features with a count of 1 in the first replicate is 0, but the expected value of these features is about 1.4. 46 | 47 | ```{r, echo=T} 48 | # compare replicates 49 | one_is_0b <- d.n0[,3][d.n0[,1] == 0] 50 | one_is_1b <- d.n0[,3][d.n0[,1] == 1] 51 | one_is_2b <- d.n0[,3][d.n0[,1] == 2] 52 | one_is_32b <- d.n0[,3][d.n0[,1] == 32] 53 | 54 | par(mfrow=c(1,2)) 55 | hist(one_is_0, breaks=9, xlim=c(0,8), col=rgb(0,0,1,0.2)) 56 | hist(one_is_0b, breaks=15, add=T, col=rgb(1,0,0,0.2)) 57 | abline(v=mean(one_is_0)) # the expected value of 0 for pair 58 | abline(v=mean(one_is_0b), col="red") 59 | 60 | hist(one_is_1, breaks=9, xlim=c(0,12)) 61 | hist(one_is_1b, breaks=9, add=T, col=rgb(1,0,0,0.2)) 62 | abline(v=mean(one_is_1)) 63 | abline(v=mean(one_is_1b), col="red") 64 | ``` 65 | We can generate an all vs. all scatter plot of one technical replicate against another (here replicates 1 and 2) in either Euclidean or log-log space. Note where the imprecision is: at the low-count margin. 66 | 67 | ```{r, echo=TRUE} 68 | par(mfrow=c(1,2)) 69 | plot(d.n0[,1] + 0.2, d.n0[,2] + 0.2, pch=19, col=rgb(0,0,0,0.1), xlab="replicate 1", ylab="replicate 2") 70 | plot(d[,1] +0.2, d[,2] +0.2, log="xy", pch=19, col=rgb(0,0,0,0.1), xlab="replicate 1", ylab="replicate 2") 71 | ``` 72 | 73 | Finally, we can plot the expected value for the other replicates as a function of the reference replicate. This shows quite nicely that the expected value is actually higher than the observed value up to a count of 10-15 in this dataset.
74 | 75 | ```{r, echo=F} 76 | output <- matrix(data=NA, nrow=51, ncol=6) 77 | 78 | for(j in 0:50){ 79 | for(i in 1:6){ 80 | output[j+1, i ] <- mean(d.n0[,i][d.n0[,7] == j]) / (sum(d.n0[,i])/sum(d.n0[,7])) 81 | } 82 | } 83 | 84 | x <- seq(0,50, by=1) 85 | 86 | rownames(output) = x 87 | 88 | par(mfrow=c(1,1)) 89 | boxplot(t(output), xlab="R1 count", ylab="E(R count)") 90 | abline(h=seq(0,50, by=5), col=rgb(0,0,0,0.3), lty=2) 91 | abline(v=seq(1,51, by=5), col=rgb(0,0,0,0.3), lty=2) 92 | abline(-1,1) 93 | 94 | ``` 95 | In RNA-seq we typically do not have enough samples to have independent power for each feature. We really need about 10 samples for every feature we want to examine independently. 96 | 97 | The data are discrete probabilities of observing the feature in the DNA sequencing library scaled by the read count; i.e., converted to integers by 'counting'. If we sequenced 10X deeper, then 0 values would be converted to integers between 0 and ~10 (sampling). If we sequenced 1000X deeper, then we would get an even better estimate of the actual underlying value for our 0s. Most of the 0s at some point will be converted to a count---so what is an appropriate value for 0? Somewhere between 0 (the feature can never occur) and 1 (we should always see it) there is a value that, over many experiments, will be least likely to perturb the system: that value is 0.5. Not always the best, usually not the optimum, but in general the least wrong. 98 | 99 | I typically choose either of two methods for zero replacement/estimation: the first is to use the zCompositions package to replace 0 values with a count zero multiplicative estimate. This approach replaces 0 values with a suitable pseudocount, but keeps the relationships between all the other features constant. The pseudocount only replaces the 0 value, not the rest. This approach is useful for exploratory data analysis, such as compositional biplots, clustering, or using the propr package with point estimates. The second approach is instantiated in the ALDEx2 package, where we assume that the prior probability of 0 is 0.5. In the prior approach we are adding 0.5 to \emph{every} value in the matrix. On the surface this seems to be undesirable. However, note that the expected value of low-count features in this dataset is actually larger than the observed value: recall that it is about 0.3 for a count of 0, about 1.3 for a count of 1, about 2.7 for a count of 2, and about 31 for a count of 32. Thus, adjusting each value by a constant begins to make more sense.
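As a compact illustration of the two strategies, here is a minimal sketch on a toy count matrix (hypothetical values; cmultRepl is the same zCompositions call used elsewhere in this tutorial, and the +0.5 prior mimics the ALDEx2 approach):

```{r}
library(zCompositions)
# toy matrix: 2 samples (rows) by 4 features (columns), with a single 0
counts <- matrix(c(0, 5, 10, 85,
                   2, 4, 12, 82), nrow=2, byrow=TRUE)
# method 1: count zero multiplicative replacement; only the 0 is adjusted
# (note that cmultRepl returns proportions by default)
cmultRepl(counts, label=0, method="CZM")
# method 2: a uniform prior of 0.5 added to every value
counts + 0.5
```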
100 | 101 | If we recall that the basis of log-ratio analysis is the ratio between components, then adding a prior probability to each feature starts to make sense. As shown in Table \ref{tab:ratios}, adding the prior is a much better estimate of the ratios of the Expected values than is the ratio of the point observations. The addition of the prior is in fact an even better estimate than is the 0 replacement method in this dataset. 102 | 103 | \begin{table}[!h] 104 | \caption{Count vs E(count) and the ratio relationships. The table shows the ratio between counts, and the expected value of the count in another technical replicate (E ratio), compared to the ratio of the count + a prior of 0.5 (P ratio).}\vspace{0.2cm} 105 | \centering 106 | \resizebox{\columnwidth}{!}{% 107 | \begin{tabular}{l r r r r r r} 108 | \hline 109 | Count 1 & Count 2 & E(count) 1 & E(count) 2 & Count ratio & E ratio & P ratio \\ \hline \hline 110 | 0 & 1 & 0.3 & 1.3 & 0/1 & 0.3/1.3 & 0.5/1.5\\ 111 | 0 & 2 & 0.3 & 2.7 & 0/2 & 0.3/2.7 & 0.5/2.5\\ 112 | 0 & 32 & 0.3 & 31 & 0/32 & 0.3/31 & 0.5/32.5\\ 113 | \end{tabular} 114 | } 115 | \label{tab:ratios} 116 | \end{table} 117 | 118 | 119 | ```{r, echo=FALSE} 120 | 121 | rdirichlet <- function (n, alpha) 122 | # draw n random samples from a Dirichlet distribution with parameter alpha 123 | #-------------------------------------------- 124 | { 125 | if(length(n) > 1) n <- length(n) 126 | if(length(n) == 0 || as.integer(n) == 0) return(numeric(0)) 127 | n <- as.integer(n) 128 | if(n < 0) stop("integer(n) can not be negative in rdirichlet") 129 | 130 | if(is.vector(alpha)) alpha <- t(alpha) 131 | l <- dim(alpha)[2] 132 | x <- matrix(rgamma(l * n, t(alpha)), ncol = l, byrow=TRUE) # handles recycling 133 | return(x / rowSums(x)) 134 | } 135 | ``` 136 | 137 | Let's generate random instances of the data. The maximum likelihood (ML) probability is the count divided by the sum of the sample. This is a discrete probability since we are dealing with finite values. 138 | 139 | Since we know that a pure replicate is not identical and differs by sampling variation, the ML estimate is strongly affected by sampling. We can convert the discrete estimate of the probability to a continuous probability by modelling it as a multivariate Poisson process with a fixed sum: this is a Dirichlet distribution. 140 | 141 | Even though the E(Dir) is very close to the ML estimate, the individual Dir instances are variable. This variation is a reasonable match for the actual underlying technical variation, although the tails are a bit broader in general. 142 | 143 | ```{r, echo=T, fig.height=3, fig.width=7.5} 144 | par(mfrow=c(1,3)) 145 | plot((d.n0[,1] + 0.2) /sum(d.n0[,1] + 0.2), 146 | (d.n0[,2] + 0.2)/sum(d.n0[,2] + 0.2), 147 | log="xy", xlab="P 1", ylab="P 2", 148 | pch=19, col=rgb(0,0,0,0.1)) 149 | 150 | 151 | d.dir <- rdirichlet(16, d.n0[,1]+0.5) 152 | ml <- (d.n0[,1]+0.5) / sum(d.n0[,1]+0.5) 153 | E_dir <- apply(d.dir, 2, mean) 154 | plot(d.dir[1,], d.dir[2,], pch=19, cex=0.2,xlab="Dir 1", 155 | ylab="Dir 2", col=rgb(1,0,0,0.1), log="xy") 156 | points(E_dir, ml, pch=19,cex=0.1, col="blue") 157 | 158 | plot((d.n0[,1] + 0.2) /sum(d.n0[,1] + 0.2), 159 | (d.n0[,2] + 0.2)/sum(d.n0[,2] + 0.2), 160 | log="xy", pch=19, col=rgb(0,0,0,0.1), 161 | xlab="P(D) 1", ylab="P(D) 2") 162 | points(d.dir[1,], d.dir[2,], pch=19, cex=0.2, col=rgb(1,0,0,0.1)) 163 | points(d.dir[1,], d.dir[3,], pch=19, cex=0.2, col=rgb(1,0,0,0.1)) 164 | points(d.dir[1,], d.dir[4,], pch=19, cex=0.2, col=rgb(1,0,0,0.1)) 165 | points(E_dir, ml, pch=19,cex=0.1, col="blue") 166 | 167 | ``` 168 | 169 | In summary, an observed count of 0 does not mean that 0 is what we would observe upon replication. In other words, an observation of 0 does not mean that there is 0 probability of observing a non-zero count on replication. In fact, in this dataset a value of 0 has an Expected value of about 0.3. Thus there is good justification for adjusting 0 values with a pseudocount (from zCompositions) or a prior probability to more accurately capture the real underlying structure of the data at the low-count margin.
170 | -------------------------------------------------------------------------------- /zero.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/zero.pdf --------------------------------------------------------------------------------