├── .gitignore
├── ALDEx_comparison.Rmd
├── ALDEx_comparison.pdf
├── Exploratory_biplot.Rmd
├── Exploratory_biplot.pdf
├── Gloor_bio.docx
├── Outline.Rmd
├── Outline.pdf
├── Outline.tex
├── README.txt
├── chunk
│   ├── R
│   │   ├── propr-functions.R
│   │   └── propr.R
│   ├── explore_schurch.R
│   └── setup.R
├── coda_seq.Rmd
├── coda_seq.pdf
├── data
│   ├── ERP004763_sample_mapping.tsv
│   ├── ak_vs_op.txt
│   ├── ak_vs_op_aldex.txt
│   ├── ak_vs_op_taxon.txt
│   ├── barton_agg.tsv
│   ├── countfinal2.tsv
│   ├── d.all.l
│   ├── filtered_table.txt
│   ├── mouth_genus.txt
│   ├── up_vs_op.txt
│   ├── up_vs_op_aldex.txt
│   └── up_vs_op_taxon.txt
├── download.jpg
├── effect.mbf.pdf
├── first_association.Rmd
├── first_association.pdf
├── make_interpret_biplot.Rmd
├── make_interpret_biplot.pdf
├── multi_comp.Rmd
├── multi_comp.pdf
├── prop_and_diffprop_with_propr.Rmd
├── prop_and_diffprop_with_propr.html
├── replicate_Evalue.pdf
├── replicate_lowEvalue.pdf
├── zero.Rmd
└── zero.pdf

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
.Rhistory

--------------------------------------------------------------------------------
/ALDEx_comparison.Rmd:
--------------------------------------------------------------------------------
---
title: "First comparison"
author: "gg"
date: '`r format(Sys.time(), "%d %B, %Y")`'
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
fig_caption: true
output:
  pdf_document:
    fig_caption: yes
---

To run this file:
Rscript -e "rmarkdown::render('ALDEx_comparison.Rmd')"

## Types of data

Comparison of 'differential abundance' is problematic for compositional data [@fernandes:2013;@fernandes:2014]. Since the apparent abundance of every value depends on the apparent abundance of every other value, we can get into real difficulties if we are not careful. Take the simple example where we have two samples. The samples contain the following counts for four features:

\begin{math}
A = [1000, 100, 50, 250]
\end{math} and
\begin{math}
B = [10, 500, 250, 1250]
\end{math}.

We want to answer the question: have the abundances of the features changed?

We sequence, and obtain a total count of about 100 (it is a first-generation machine!).

So we get:
\begin{math}A_s = [71, 7, 4, 18], B_s = [1, 25, 12, 62]
\end{math}

Note that these values appear to be very different between the groups. However, if we take one feature as a reference, say feature 4, and determine a ratio, i.e.:

\begin{math}
A_r = [ 71/18, 7/18, 4/18 ] = [ 3.9, 0.39, 0.22 ]
\end{math}

\begin{math}
B_r = [ 1/62, 25/62, 12/62 ] = [ 0.02, 0.40, 0.19 ]
\end{math}

Here we can see that if we assume one feature is constant (feature 4), then the second and third features are seen to be very similar in abundance. Now we can infer that the majority of change is in the first feature. We cannot compare the reference feature itself because it is assumed to be constant; that is, its assumed change is 0. This approach is the one used by ANCOM, a recently developed tool to assess change in microbiome datasets [@ancom:2015].

Since we cannot know which feature, if any, is constant, we can assume that a large number of the features exhibit only random change. Then, rather than using one feature as a reference, we can use the geometric mean abundance of all features.
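The arithmetic above is easy to verify. The short sketch below recomputes the reference-feature ratios, and also shows the equivalent calculation with the geometric mean of all features as the reference (the centred log-ratio used by ALDEx2); the only inputs are the example counts.

```{r, echo=TRUE, eval=FALSE}
# counts for the four features in the two underlying environments
A <- c(1000, 100, 50, 250)
B <- c(10, 500, 250, 1250)

# sequencing to a depth of ~100 reads keeps only the relative information
A_s <- c(71, 7, 4, 18)   # approximately round(A / sum(A) * 100)
B_s <- c(1, 25, 12, 62)  # approximately round(B / sum(B) * 100)

# ratios to the reference feature (feature 4)
A_s[1:3] / A_s[4]   # 3.94 0.39 0.22
B_s[1:3] / B_s[4]   # 0.02 0.40 0.19

# the same idea with the geometric mean of all features as the reference:
# the centred log-ratio (clr)
log2(A_s) - mean(log2(A_s))
log2(B_s) - mean(log2(B_s))
```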
Note: this approach works poorly if there are only a small number of features (fewer than about 50) or if the features are very asymmetrically distributed between groups. This approach is the one used by ALDEx2 [@fernandes:2013;@fernandes:2014], and is the method that we will use.

One complication is that a geometric mean cannot be determined if any of the values have a count of 0. For pairwise comparisons, ALDEx2 sidesteps this by drawing Monte-Carlo instances from a Dirichlet distribution with a uniform prior of 0.5 added to each count, so that no value in any instance is exactly 0.

```{r, echo=TRUE, eval=TRUE}
library(ALDEx2)

# read the dataset
d <- read.table("data/barton_agg.tsv", row.names=1, header=T)

# make a vector containing the two names of the conditions
# in the same order as the sample (row) names of d
d.conds <- c(rep("SNF", length(grep("SNF", rownames(d))) ),
    rep("WT", length(grep("WT", rownames(d)))) )

# generate Monte-Carlo instances of the probability of observing each count
# given the observed counts and the total read depth of each sample.
# use a prior of 0.5, corresponding to maximal uncertainty about the read count
# this returns a set of clr values, one for each mc instance
# this workflow can take several minutes

# note that the latest version of ALDEx2 requires conditions explicitly
d.x <- aldex.clr(t(d), conds=d.conds, mc.samples=128)

# calculate effect sizes for each mc instance, report the expected value
d.eff <- aldex.effect(d.x, d.conds, include.sample.summary=TRUE)

# perform parametric or non-parametric tests for difference
# report the expected value of the raw and BH-corrected P value
d.tt <- aldex.ttest(d.x, d.conds)

# concatenate everything into one data frame
x.all <- data.frame(d.eff, d.tt)
```
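Before plotting, it is worth a quick look at the output columns that the plots below depend on: the between-group difference (diff.btw), the within-group dispersion (diff.win), the mean clr abundance (rab.all), the standardized effect size (effect), and the expected raw and BH-corrected Wilcoxon p values (wi.ep, wi.eBH).

```{r, echo=TRUE, eval=FALSE}
# the columns of x.all used in the plots that follow
head(x.all[, c("diff.btw", "diff.win", "rab.all", "effect", "wi.ep", "wi.eBH")])
```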
We will display the results using a number of different plots to show how each plot gives a different way of exploring the data. The mainstay that we advocate is the effect plot [@Gloor:2015], which plots the constituents of normalized change, or effect size.

```{r, echo=FALSE, eval=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Plotted here are features with no difference between groups (grey), a statistically significant difference between groups (red), and an effect larger than 2 (blue circles). These are plotted using different plots (described clockwise from top left). The effect plot [@Gloor:2015] illustrates the difference between groups vs. the dispersion (variance) within groups. If the effect is greater than one (outside the grey lines), then on average the features are obviously separable by eye when plotted; roughly, they have a greater difference between groups than the pooled standard deviation. Effect is a more robust measure of difference than are P values, since the latter depend on sample size; large sample sizes will always give low P values [@Halsey:2015aa]. We can see this here, where the large sample size means that even highly variable genes are significantly different. The Bland-Altman plot [@altman:1983] compares difference and abundance, and is often seen in RNA-Seq data. The Volcano plot [@Cui:2003aa] shows the association between difference and P value, and the final plot shows the association between effect and P value."}

# x.all <- read.table("data/barton_agg.tsv", header=T, row.names=1)

# get the 'significant' set
sig <- x.all$wi.eBH < 0.05
eff <- abs(x.all$effect) > 2

# plot all in transparent grey
# low BH-corrected p values as red
# effect sizes > 2 as blue+red
par(mfrow=c(2,2))

plot(x.all$diff.win, x.all$diff.btw, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, ylim=c(-6,6), xlim=c(0,6), xlab="dispersion", ylab="difference",
    main="Effect plot")
points(x.all$diff.win[sig], x.all$diff.btw[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$diff.win[eff], x.all$diff.btw[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )
abline(0,1, lty=2, lwd=2, col=rgb(0,0,0,0.4))
abline(0,-1, lty=2, lwd=2, col=rgb(0,0,0,0.4))

plot(x.all$rab.all, x.all$diff.btw, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, ylim=c(-6,6), xlab="clr abundance", ylab="difference",
    main="Bland-Altman plot")
points(x.all$rab.all[sig], x.all$diff.btw[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$rab.all[eff], x.all$diff.btw[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )

plot(x.all$diff.btw, x.all$wi.ep, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, xlab="difference", ylab="log p value",
    main="Difference vs. p plot", log="y")
points(x.all$diff.btw[sig], x.all$wi.ep[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$diff.btw[eff], x.all$wi.ep[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )

plot(x.all$effect, x.all$wi.ep, col=rgb(0,0,0,0.3), pch=19,
    cex=0.5, xlab="effect", ylab="log p value",
    main="Effect vs. p plot", log="y")
points(x.all$effect[sig], x.all$wi.ep[sig], col=rgb(1,0,0,0.3), pch=19, cex=0.5 )
points(x.all$effect[eff], x.all$wi.ep[eff], col=rgb(0,0,1,0.6), pch=21, cex=0.7 )

```

\newpage

The effect sizes can be understood as a measure of separability between groups for each feature. Plotted in the figure are features with different effect sizes; the corresponding adjusted p value is included. In RNA-seq data, we have found that an effect size cutoff between 1 and 2 [@macklaim:2013] gives reliable results in meta-transcriptome analyses.
```{r, echo=FALSE, eval=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Histograms showing the separation between groups when choosing features with differing effect sizes. Features with the largest effect are the most reliably different between groups, and should be chosen over those that are most significantly different whenever possible. Note that an effect size of 0.5 is more than sufficient to give a significant difference, but the separation between groups is marginal."}

par(mfrow=c(2,2))

hist(as.numeric(x.all["PWR1",4:51]), breaks=11, xlim=c(-8,-2), col=rgb(1,0,0,0.3),
    xlab="rAB PWR1", main="Effect: 0.5 wi.eBH: 4e-4", ylim=c(0,12) )
hist(as.numeric(x.all["PWR1",52:99]), add=T, breaks=20, col=rgb(0,0,1,0.3))

hist(as.numeric(x.all["NTS2.1",4:51]), breaks=20, col=rgb(1,0,0,0.3),
    xlab="rAB NTS2.1", main="Effect: 1.1 wi.eBH: 2e-11", ylim=c(0,15) )
hist(as.numeric(x.all["NTS2.1",52:99]), add=T, breaks=11, col=rgb(0,0,1,0.3))

hist(as.numeric(x.all["YPR145W",4:51]), xlim=c(2,6), breaks=11, col=rgb(1,0,0,0.3),
    xlab="rAB YPR145W", main="Effect: 2 wi.eBH: 6.5e-26", ylim=c(0,12) )
hist(as.numeric(x.all["YPR145W",52:99]), add=T, breaks=20, col=rgb(0,0,1,0.3))

hist(as.numeric(x.all["YOR290C",4:51]), xlim=c(-11,5), breaks=20, col=rgb(1,0,0,0.3),
    xlab="rAB YOR290C", main="Effect: 5.6 wi.eBH: 1e-26", ylim=c(0,40) )
hist(as.numeric(x.all["YOR290C",52:99]), add=T, breaks=3, col=rgb(0,0,1,0.3))
```

We can also plot these as strip-charts to see the overlap in variability.

```{r, echo=FALSE, eval=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Strip charts showing the separation between groups when choosing features with differing effect sizes."}

par(mfrow=c(2,2))

# one panel per gene; columns 4:51 and 52:99 hold the per-sample clr values
# for the two groups (from include.sample.summary=TRUE above)
genes <- c(PWR1="Effect: 0.5", NTS2.1="Effect: 1.1",
    YPR145W="Effect: 2", YOR290C="Effect: 5.6")

for(g in names(genes)){
    grp1 <- as.numeric(x.all[g, 4:51])
    grp2 <- as.numeric(x.all[g, 52:99])
    stripchart(list(grp1), pch=19, method="jitter", vertical=T,
        col=rgb(1,0,0,0.3), main=genes[[g]], ylim=range(c(grp1, grp2)))
    stripchart(list(grp2), pch=19, method="jitter", add=TRUE,
        vertical=T, col=rgb(0,0,1,0.3))
}
```

\newpage
# References


--------------------------------------------------------------------------------
/ALDEx_comparison.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/ALDEx_comparison.pdf
--------------------------------------------------------------------------------
/Exploratory_biplot.Rmd:
--------------------------------------------------------------------------------
---
title: "Exploratory biplot"
author: "gg"
date: '`r format(Sys.time(), "%d %B, %Y")`'
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
fig_caption: true
output:
  pdf_document:
    fig_caption: yes
---

To run this file:
Rscript -e "rmarkdown::render('Exploratory_biplot.Rmd')"

## R packages needed

`zCompositions, ppclust, factoextra, cluster, fclust` CRAN

## The dataset and first biplot

We will use as an example a transcriptome dataset [@Schurch:2016aa;@Gierlinski:2015aa] containing 96 samples, 48 each from the WT and the SNF2 knockout strain. These data have been filtered to include only those features that are present with a mean count of at least 0.1 across all samples.

The compositional biplot is the first exploratory data analysis tool that should be used whenever exploring a dataset. It shows, in one plot, the essence of your results. Do my samples separate into groups? Which features are driving this separation? Which features are irrelevant to the analysis?

Compositional biplots appear to be complex and intimidating, but with a little patience and practice they are easily interpretable [@aitchison2002biplots]. They are based on the variance of the ratios of the parts, and are substantially more informative than the commonly used PCoA plots, which are driven largely by abundance [@Gorvitovskaia:2016aa].

```{r outlier_function, echo=F}
codaSeq.outlier <- function(x, plot.me=TRUE, col=rgb(1,0,0,0.3)){

  pcx <- prcomp(x)
  mv <- sum(pcx$sdev^2)

  # proportion of the total variance contributed by each sample
  sample.var <- apply(pcx$x, 1, function(y){sum(y^2/mv)})

  # outlier cut: median plus twice the IQR of the per-sample contributions
  cut <- median(sample.var) + 2 * IQR(sample.var)

  bad <- names(which(sample.var > cut))
  good <- names(which(sample.var <= cut))
  if(plot.me == TRUE){
    hist(sample.var, breaks=100)
    boxplot(sample.var, horizontal=TRUE, col=col, add=TRUE)
    abline(v=cut, lty=2)
  }
  return(list(sample.var=sample.var, bad=bad, good=good))
}
```
```{r biplot, echo=TRUE, results='asis', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="The compositional biplot is the workhorse tool for CoDa. This plot summarizes the entire analysis in a qualitative manner. We can see that the SNF2 and WT samples separate very well, although the proportion of variance explained on component 1 is small. Furthermore, we can see the names of some of the genes that are driving this divide. Finally, component 1 has substantially more variance than does component 2, and we can explain this experiment as a simple two-part comparison with the largest variance along the axis of the comparison."}

# read in the dataset
# samples by row - now correct: thx VGalata
d.agg <- read.table("data/barton_agg.tsv", sep="\t", header=T, row.names=1)

# load the zCompositions library to perform 0 replacement
library(zCompositions)

# it is important to first filter to remove features (columns) that are
# exclusively 0 values
d.filt <- d.agg[, colSums(d.agg) > 0]

# we are using the Count Zero Multiplicative approach
d.n0 <- cmultRepl(d.filt, method="CZM", label=0)

# generate the centered log-ratio transformed data
# input is samples by row; the result has features by row
d.clr <- apply(d.n0, 1, function(x) log(x) - mean(log(x)))

# apply a singular value decomposition to the dataset
# do not use the princomp function in R!!
pcx <- prcomp(t(d.clr))

# get the labels for the first two components
PC1 <- paste("PC1: ", round(pcx$sdev[1]^2/sum(pcx$sdev^2),3), sep="")
PC2 <- paste("PC2: ", round(pcx$sdev[2]^2/sum(pcx$sdev^2),3), sep="")

par(fig=c(0,1,0,1), new=TRUE)
# draw the biplot in the left 80% of the plotting region
par(fig=c(0,0.8,0,1), new=TRUE)
biplot(pcx, cex=c(0.6,0.6), col=c("black", rgb(1,0,0,0.2)), var.axes=F, scale=0,
    xlab=PC1, ylab=PC2)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

# add a scree plot on the right
par(fig=c(0.8,1,0,1), new=TRUE)
plot(pcx, main="scree")

```

\newpage

### Rules for interpreting compositional biplots:

- All interpretations are up to the limit of the variance explained. We can think of this as a shadow of the multidimensional dataset (4545 dimensions!) projected onto two dimensions. If the variance explained is high (> 0.8) then the edges of the shadows are sharp; however, if the variance explained is low, as it is here, then we have little confidence in the exact placement of any individual sample or feature.

- The distance between samples is related to their multivariate similarity of the parts as ratios. If all components are relatively the same (i.e., the ratios between all parts are identical), then two samples are in the same location.

- We must interpret the features as ratios. Abundance information is not directly available on these plots.

- The distance and direction of a feature from the origin is the standard deviation of the ratio of that feature to the geometric mean of all features.

- The line between any set of features is called a link. Links that pass through more than one feature are permitted and do not change the interpretation.

- Short links indicate a constant or near-constant ratio between the two (or more) linked features in the dataset. This dataset is too complex to identify links easily (see the sketch below).

- Long links indicate a non-constant ratio between the joined features, and define a ratio relationship that can be inverse or random. There is no principled method to determine which is the case.
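The link rule is directly checkable in R: a short link between two features corresponds to a small variance of the log-ratio of those features. A minimal sketch, using the zero-replaced table `d.n0` from above; the column indices are placeholders for any pair of features of interest.

```{r, eval=FALSE}
# variance (over samples) of the log-ratio of two features;
# small values indicate a near-constant ratio (a short link),
# large values a long link
var(log(d.n0[, 1] / d.n0[, 2]))
```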
\newpage

## Finding outliers

We can see that there are a number of samples that appear to be outliers. Should we include SNF2.6 in the analysis or not? One of the messages of the Barton papers [@Schurch:2016aa;@Gierlinski:2015aa] was that about 10% of samples, even carefully prepared ones, can be outliers for unknown methodological reasons. We approach outliers by finding those samples that contribute more variance than expected to the total variance of the group. Outliers are defined as those samples whose contribution to the total variance of the group is greater than the median plus twice the interquartile range of the per-sample variance contributions.


```{r outlier, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outliers', fig.height=4, fig.width=4}
# get the outliers from each group. See the codaSeq.outlier function
# get WT indices
WT <- grep("WT", rownames(d.agg))
# subset
WT.agg <- d.agg[WT,]

# filter, samples by row
wt.gt0 <- WT.agg[, colSums(WT.agg) > 0]

# estimate 0 values (zCompositions)
# samples by row
wt.agg.n0 <- cmultRepl(wt.gt0, method="CZM", label=0)

# clr transform
wt.agg.n0.clr <- t(apply(wt.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.wt <- prcomp(wt.agg.n0.clr)
mvar.wt.clr <- sum(pcx.wt$sdev^2)

# plot
par(mfrow=c(1,1))
biplot(pcx.wt, var.axes=FALSE, scale=0, cex=c(1,.05))

# make a list of names to keep, found in $good
WT.g <- codaSeq.outlier(wt.agg.n0.clr, plot.me=TRUE)

# repeat for the SNF2 group
SNF <- grep("SNF", rownames(d.agg))
# subset
SNF.agg <- d.agg[SNF,]

# filter, samples by row
SNF.gt0 <- SNF.agg[, colSums(SNF.agg) > 0]

# estimate 0 values (zCompositions)
# samples by row
SNF.agg.n0 <- cmultRepl(SNF.gt0, method="CZM", label=0)

# clr transform
SNF.agg.n0.clr <- t(apply(SNF.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.SNF <- prcomp(SNF.agg.n0.clr)
mvar.SNF.clr <- sum(pcx.SNF$sdev^2)

# plot
par(mfrow=c(1,1))
biplot(pcx.SNF, var.axes=FALSE, scale=0, cex=c(1,.05))

# make a list of names to keep, found in $good
SNF.g <- codaSeq.outlier(SNF.agg.n0.clr, plot.me=TRUE)

```
\clearpage
\newpage

## Biplot of non-outlier samples only

Now we can make a biplot of only those samples that are non-outliers. We see that the SNF2 KO group is more homogeneous than the WT group. This is almost certainly because the SNF2 group is clonal, while the WT group was likely grown from a frozen culture.
```{r good_data_pca, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='non-outlier samples', fig.height=7, fig.width=7}

# make a dataset of only the non-outlier samples
d.good <- rbind(d.agg[SNF.g$good,], d.agg[WT.g$good,])

# filter
#d.good.gt0 <- codaSeq.filter(d.good, min.count=1, samples.by.row=TRUE)
d.good.gt0 <- d.good[, colSums(d.good) > 0]

# estimate 0 values (zCompositions)
d.good.agg.n0 <- cmultRepl(d.good.gt0, method="CZM", label=0)

# clr transform
d.good.agg.n0.clr <- t(apply(d.good.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.good <- prcomp(d.good.agg.n0.clr)
mvar.good <- sum(pcx.good$sdev^2)
# get the labels for the first two components
PC1.g <- paste("PC1: ", round(pcx.good$sdev[1]^2/sum(pcx.good$sdev^2),3), sep="")
PC2.g <- paste("PC2: ", round(pcx.good$sdev[2]^2/sum(pcx.good$sdev^2),3), sep="")

# plot and save
par(mfrow=c(1,1))
biplot(pcx.good, var.axes=FALSE, scale=0, cex=c(1,.5), xlab=PC1.g, ylab=PC2.g)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

# optionally save
# write.table(d.good.gt0, file="data/filtered_table.txt", sep="\t", quote=F, col.names=NA)
```
\clearpage

## Biplot of the removed samples

We can also examine those samples that were deemed to be outliers. Here we can see that the outlier samples have the axis of the experiment on PC2, which means that whatever problems led to these samples being outliers overwhelm the actual signal in the dataset. This confirms that these samples should be removed, as they add either noise or some systematic distortion to the dataset.

```{r bad_data_pca, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outlier samples', fig.height=7, fig.width=7}

# make a dataset of only the outlier samples
d.bad <- rbind(d.agg[SNF.g$bad,], d.agg[WT.g$bad,])

# filter
d.bad.gt0 <- d.bad[, colSums(d.bad) > 0]

# estimate 0 values (zCompositions)
d.bad.agg.n0 <- cmultRepl(d.bad.gt0, method="CZM", label=0)

# clr transform
d.bad.agg.n0.clr <- t(apply(d.bad.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.bad <- prcomp(d.bad.agg.n0.clr)
mvar.bad <- sum(pcx.bad$sdev^2)
# get the labels for the first two components
PC1.g <- paste("PC1: ", round(pcx.bad$sdev[1]^2/sum(pcx.bad$sdev^2),3), sep="")
PC2.g <- paste("PC2: ", round(pcx.bad$sdev[2]^2/sum(pcx.bad$sdev^2),3), sep="")

# plot
par(mfrow=c(1,1))
biplot(pcx.bad, var.axes=FALSE, scale=0, cex=c(1,.5), xlab=PC1.g, ylab=PC2.g)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
```

\clearpage

## Additional filtering does not change the conclusions

We can do additional filtering. Examining the features, most contribute little, if anything, to the separation. These can be removed by filtering out low-variance features. Note that we lose some resolution, but we recapitulate the dataset with only half the features. We could do this iteratively.
```{r lowvar, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='low-variance features removed', fig.height=7, fig.width=7}
# keep only the features with above-median variance in the clr-transformed data
var.clr <- apply(d.good.agg.n0.clr, 2, var)
nms <- which(var.clr > median(var.clr))

# filter
d.lv <- d.good[, names(nms)]

# estimate 0 values (zCompositions)
d.lv.agg.n0 <- cmultRepl(d.lv, method="CZM", label=0)

# clr transform
d.lv.agg.n0.clr <- t(apply(d.lv.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.lv <- prcomp(d.lv.agg.n0.clr)
mvar.lv <- sum(pcx.lv$sdev^2)
# get the labels for the first two components
PC1.lv <- paste("PC1: ", round(pcx.lv$sdev[1]^2/sum(pcx.lv$sdev^2),3), sep="")
PC2.lv <- paste("PC2: ", round(pcx.lv$sdev[2]^2/sum(pcx.lv$sdev^2),3), sep="")

# plot
par(mfrow=c(1,1))
biplot(pcx.lv, var.axes=FALSE, scale=0, cex=c(1,.5), xlab=PC1.lv, ylab=PC2.lv)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
```
\clearpage

## FUZZY CLUSTERING

We can plot the samples according to their fuzzy c-means cluster membership. For this we are using the fuzzy clustering package ppclust [@fuzzy:2018]. There is a good introduction to fuzzy clustering in [@fernandez:2012]. Essentially, we are using a probabilistic (or possibilistic) approach to determine the number of clusters and the cluster memberships. The vignette for this approach is at: https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html. As noted in the workshop, we get two clusters if we choose centers=2 or centers=3, but the SNF2 and WT groups split if we choose centers=4; a sketch for comparing these choices follows the code below.

```{r fuzzy}
library(ppclust)
library(factoextra)
library(cluster)
library(fclust)


res.fcm <- fcm(d.lv.agg.n0.clr, centers=2)
#as.data.frame(res.fcm$u)
#summary(res.fcm)

res.fcm2 <- ppclust2(res.fcm, "kmeans")

factoextra::fviz_cluster(res.fcm2, data = d.lv.agg.n0.clr,
    ellipse.type = "norm", labelsize=10, palette = "jco",
    repel = TRUE)
```
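The choice of centers above was made by inspection. One way to compare candidate cluster numbers more formally is to score each fuzzy partition with a cluster validity index; the sketch below is our suggestion (not part of the original workshop code) and uses the fuzzy silhouette index `SIL.F` from the fclust package loaded above, where higher values indicate a crisper partition.

```{r fuzzy_centers, eval=FALSE}
# refit fuzzy c-means for 2 to 4 centers and score each partition with
# the fuzzy silhouette index; res.k$u is the membership matrix
for (k in 2:4) {
    res.k <- fcm(d.lv.agg.n0.clr, centers=k)
    cat("centers =", k,
        "fuzzy silhouette =", round(SIL.F(d.lv.agg.n0.clr, res.k$u), 3), "\n")
}
```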
## References

--------------------------------------------------------------------------------
/Exploratory_biplot.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/Exploratory_biplot.pdf
--------------------------------------------------------------------------------
/Gloor_bio.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/Gloor_bio.docx
--------------------------------------------------------------------------------
/Outline.Rmd:
--------------------------------------------------------------------------------
---
title: "CoDa HTS Workshop Proposal"
author: "Greg Gloor"
output:
  pdf_document:
    fig_caption: yes
    includes:
      in_header: /Users/ggloor/Documents/0_git/templates/header.tex
    keep_tex: yes
    pandoc_args:
    - -V
    - classoption=twocolumn
    toc: yes
    toc_depth: 3
geometry: margin=2cm
csl: /Users/ggloor/Documents/0_git/csl_styles/frontiers.csl
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
---

## Analyzing data as compositions

Website: https://github.com/ggloor/CoDa_microbiome_tutorial. This will serve as the central repository for the demonstrated tools and workflows. The repository will be set as release 2.0 at the end of the workshop so that participants will have a permanent public record of what was covered.

The Powerpoint presentation can be found at: https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx.

We have adapted and developed tools and protocols for the analysis of HTS as compositional count data [@fernandes:2013;@fernandes:2014;@Quinn206425;@erb:2016]. Analyses conducted under this paradigm are reproducible and robust, and allow conclusions about the relative relationships between features (genes, OTUs, etc) in the underlying environment [@bian:2017;@gloorFrontiers:2017].

It is possible to replace almost all steps in traditional RNA-seq, metagenomics or 16S rRNA gene sequencing analysis with compositionally appropriate methods [@gloorFrontiers:2017] that are robust to data manipulations and that provide reproducible insights into the underlying biology and composition of the system.

## Objectives and outcomes

The workshop will enable participants to:

1. be able to identify when biological datasets are compositional, and understand the root problems that arise when interrogating compositional datasets.

2. understand why HTS data should be analyzed in a compositionally-appropriate framework.

3. know how to install, use and interpret the output from the basic HTS compositional toolkit that consists of compositional biplots, the `propr R` package and the `ALDEx2 R` package.

4. have a frame of reference for more complex compositional tools such as `philr` and concepts such as b-association and balance dendrograms.

## Outline

The workshop will be delivered as mixed didactic and participation sessions, with about a 1:4 mixture. Each session will be introduced by a short didactic introduction and demonstration. The remainder of the session will be hands-on learning exercises in the `R` programming environment.

We will demonstrate a test dataset [@Schurch:2016aa;@Gierlinski:2015aa] from the lab of Dr. Geoffrey Barton that examined the effect of a SNF2 gene knockout on \emph{Saccharomyces cerevisiae} transcription. This dataset is nearly ideal and simple to understand. However, participants are invited (expected) to bring their own dataset in the form of a count table with associated metadata for examination.
The outline of this 1-day workshop is:

### Start time 9am - Introduction (Gloor, didactic lecture)

https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx

- demonstrate and understand the geometry of high throughput sequencing data and how this constrains the analyses
- demonstrate the pathologies associated with HTS data analyzed using standard methods
- enable participants to understand why and when the usual methods of analysis are likely to be misleading
- understand the importance of subcompositional coherence and subcompositional dominance, and how these concepts lead to robust analyses

### Start time 9:45 - Probabilities and ratio transformations (Gloor, hands on)

source: zero.Rmd

- provide an overview of sequencing as a probabilistic process, and the manipulation of probability vectors using compositional data methods
- how to generate probability distributions from count data using ALDEx2
- how to generate and interpret compositionally appropriate data transformations
- zero replacement strategies for sparse data with the zCompositions R package
- why count normalization is futile

### Break 10:30

### Start time 11 - Dimension reduction, outlier identification and clustering (Gloor, hands on)

source: make_interpret_biplot.Rmd, Exploratory_biplot.Rmd

- demonstrate dimension reduction of compositional data
- the production and interpretation of a compositional PCA biplot
- identifying outlier samples
- learn how to conduct and interpret clustering and discriminant analysis in compositional data
- fuzzy clustering


### Start time 12 - Correlation and compositional association (Erb)

source: prop_and_diffprop_with_propr

- demonstrate compositionally appropriate identification of correlated (compositionally associated) features using the `propr R` package [@Quinn:2017]
- an introduction to compositional association

### Start time 1 - Lunch break


### Start time 2 - Correlation and compositional association, continued (Erb)

### Start time 2:30 - Differential abundance with ALDEx2 (Gloor)

source: ALDEx_comparison.Rmd

- demonstrate compositionally appropriate identification of differentially relatively abundant features using the `ALDEx2 R` package
- learn how to generate and interpret posterior expected values for differential relative abundance
- learn how to generate and use standardized effect sizes for differential relative abundance
- learn how to interpret effect plots as an adjunct to volcano and Bland-Altman plots

### Start time 3:30 - Working with users' data (Gloor, Erb)

- analyzing users' own data
- troubleshooting users' own datasets
- common problems from the participants will be highlighted and solutions demonstrated

### Start time 4:30 - Wrapup (Gloor, Erb)

- review of concepts and strategies
- understand the congruence between the results obtained by the compositional biplot, compositional association and compositional differential relative abundance
- provide guidance and sources on the proper interpretation of HTS datasets using a compositional paradigm

### Finish time 5 pm

## Requirements

1. a reasonably up-to-date laptop computer with at least 8 GB RAM
2. familiarity with scripting or programming languages, proficiency in the `R` programming environment

3. the current version of the `R` programming language installed

4. a number of `R` packages will be used during the workshop. Participants should be familiar with installation of packages from both `Bioconductor` and `CRAN`

## Intended Audience and Level

The intended audience for this session is bioinformaticians or computational biologists who use high throughput sequencing with experimental designs that include tag sequencing (eg. 16S rRNA gene sequencing), metagenomics, transcriptomics or meta-transcriptomics.

This is not intended to be an introduction to R for bioinformaticians: attendees should be relatively proficient with R, either using RStudio or on the command line, and should have a plain text editor available. Attendees will use R markdown documents to keep track of their work, and templates will be provided for use. Attendees will be expected to have a laptop with R installed and the following packages and their dependencies: propr (CRAN), ALDEx2 (Bioconductor), omicplotR (Bioconductor), zCompositions (CRAN). Attendees are encouraged to bring their own datasets for analysis, but should be aware that only pairwise (i.e., two condition) experiments will be demonstrated.

Compositional concepts will be at an introductory-intermediate level suitable for participants of any background, but will be more intuitive to those with a grounding in probability and linear algebra.

The practical aspects will be at an intermediate level, suitable for participants with pre-existing competency in `R`.

Attendance should be capped at no more than 40 participants.


## Organizers and Presenters

Greg Gloor is a Professor of Biochemistry at The University of Western Ontario. He is one of the pioneers in using compositional data analysis to analyze HTS datasets. He is the maintainer of the `ALDEx2 R` package on Bioconductor used for differential relative abundance analysis. He has published original research, methods papers, and reviews that use compositional data analysis methods to interpret HTS datasets using transcriptome, microbiome and meta-transcriptome datasets [@bian:2017;@gloorFrontiers:2017;@Wolfs:2016aa;@gloorAJS:2016;@Gloor:2016cjm;@gloor2016s;@gloor:effect;@Goneau:2015ab;@McMillan:2015aa;@fernandes:2014;@macklaim:2013;@fernandes:2013]. He has taught undergraduate and graduate courses in computational biology for almost two decades, and has won awards from both student groups and from faculty-wide competitions. His homepage and CV are at ggloor.github.io.

Ionas Erb is a postdoctoral fellow and bioinformatician at the Centre for Genomic Regulation. He is an active developer of tools to determine compositional association and is a contributor to the `propr R` package on CRAN used to explore correlation in a compositionally appropriate manner.
He is an advocate for, and active developer of, compositionally-appropriate methods to examine correlation [@Quinn206425;@Erb134536;@erb:2016].

## References

--------------------------------------------------------------------------------
/Outline.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/Outline.pdf
--------------------------------------------------------------------------------
/Outline.tex:
--------------------------------------------------------------------------------
\documentclass[twocolumn]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\else % if luatex or xelatex
  \ifxetex
    \usepackage{mathspec}
  \else
    \usepackage{fontspec}
  \fi
  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\usepackage[margin=2cm]{geometry}
\usepackage{hyperref}
\hypersetup{unicode=true,
            pdftitle={CoDa HTS Workshop Proposal},
            pdfauthor={Greg Gloor},
            pdfborder={0 0 0},
            breaklinks=true}
\urlstyle{same}  % don't use monospace font for urls
\usepackage{graphicx,grffile}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
}
\setlength{\emergencystretch}{3em}  % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{0}
% Redefines (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi

%%% Use protect on footnotes to avoid problems with footnotes in titles
\let\rmarkdownfootnote\footnote%
\def\footnote{\protect\rmarkdownfootnote}

%%% Change title format to be more compact
\usepackage{titling}

% Create subtitle command for use in maketitle
\newcommand{\subtitle}[1]{
  \posttitle{
    \begin{center}\large#1\end{center}
    }
}

\setlength{\droptitle}{-2em}
\title{CoDa HTS Workshop Proposal}
\pretitle{\vspace{\droptitle}\centering\huge}
\posttitle{\par}
\author{Greg Gloor}
\preauthor{\centering\large\emph}
\postauthor{\par}
\date{}
\predate{}\postdate{}

\usepackage{geometry}
\usepackage{amsmath}
\newcommand{\ith}[1]{ #1\textsuperscript{th}\ }
\newcommand{\vect}[1]{\vec{\textbf{#1}}}
\setlength{\columnsep}{18pt}

\setlength\textwidth{5.5in}
\setlength\marginparwidth{1.5in}

\begin{document}
\maketitle

{
\setcounter{tocdepth}{3}
\tableofcontents
}
\hypertarget{analyzing-data-as-compositions}{%
\subsection{Analyzing data as
compositions}\label{analyzing-data-as-compositions}}

Website: \url{https://github.com/ggloor/CoDa_microbiome_tutorial}. This
will serve as the central repository for the demonstrated tools and
workflows. The repository will be set as release 2.0 at the end of the
workshop so that participants will have a permanent public record of
what was covered.

The Powerpoint presentation can be found at:
\url{https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx}.

We have adapted and developed tools and protocols for the analysis of
HTS as compositional count data (Erb and Notredame, 2016; Fernandes et
al., 2013, 2014; Quinn et al., 2017a). Analyses conducted under this
paradigm are reproducible and robust, and allow conclusions about the
relative relationships between features (genes, OTUs, etc) in the
underlying environment (Bian et al.; Gloor et al., 2017).

It is possible to replace almost all steps in traditional RNA-seq,
metagenomics or 16S rRNA gene sequencing analysis with compositionally
appropriate methods (Gloor et al., 2017) that are robust to data
manipulations and that provide reproducible insights into the underlying
biology and composition of the system.

\hypertarget{objectives-and-outcomes}{%
\subsection{Objectives and outcomes}\label{objectives-and-outcomes}}

The workshop will enable participants to:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
  be able to identify when biological datasets are compositional, and
  understand the root problems that arise when interrogating
  compositional datasets.
\item
  understand why HTS data should be analyzed in a
  compositionally-appropriate framework.
\item
  know how to install, use and interpret the output from the basic HTS
  compositional toolkit that consists of compositional biplots, the
  \texttt{propr\ R} package and the \texttt{ALDEx2\ R} package.
\item
  have a frame of reference for more complex compositional tools such as
  \texttt{philr} and concepts such as b-association and balance
  dendrograms.
\end{enumerate}

\hypertarget{outline}{%
\subsection{Outline}\label{outline}}

The workshop will be delivered as mixed didactic and participation
sessions, with about a 1:4 mixture. Each session will be introduced by a
short didactic introduction and demonstration. The remainder of the
session will be hands-on learning exercises in the \texttt{R}
programming environment.
We will demonstrate a test dataset (Schurch et al., 2016; Gierliński et
al., 2015) from the lab of Dr.~Geoffrey Barton that examined the effect
of a SNF2 gene knockout on \emph{Saccharomyces cerevisiae}
transcription. This dataset is nearly ideal and simple to understand.
However, participants are invited (expected) to bring their own dataset
in the form of a count table with associated metadata for examination.

The outline of this 1-day workshop is:

\hypertarget{start-time-9am---introduction-gloor-didactic-lecture}{%
\subsubsection{Start time 9am - Introduction (Gloor, didactic
lecture)}\label{start-time-9am---introduction-gloor-didactic-lecture}}

\url{https://github.com/ggloor/compositions/presentations/CoDa_workshop_NGS18.pptx}

\begin{itemize}
\tightlist
\item
  demonstrate and understand the geometry of high throughput sequencing
  data and how this constrains the analyses

  \begin{itemize}
  \tightlist
  \item
    demonstrate the pathologies associated with HTS data analyzed using
    standard methods
  \item
    enable participants to understand why and when the usual methods of
    analysis are likely to be misleading
  \item
    understand the importance of subcompositional coherence and
    subcompositional dominance, and how these concepts lead to robust
    analyses
  \end{itemize}
\end{itemize}

\hypertarget{start-time-945---probabilities-and-ratio-transformations-gloor-hands-on}{%
\subsubsection{Start time 9:45 - Probabilities and ratio transformations
(Gloor, hands
on)}\label{start-time-945---probabilities-and-ratio-transformations-gloor-hands-on}}

\begin{itemize}
\tightlist
\item
  provide an overview of sequencing as a probabilistic process, and the
  manipulation of probability vectors using compositional data methods

  \begin{itemize}
  \tightlist
  \item
    how to generate probability distributions from count data using
    ALDEx2
  \item
    how to generate and interpret compositionally appropriate data
    transformations
  \item
    zero replacement strategies for sparse data with the zCompositions R
    package
  \item
    why count normalization is futile
  \end{itemize}
\end{itemize}

\hypertarget{break-1030}{%
\subsubsection{Break 10:30}\label{break-1030}}

\hypertarget{start-time-11---dimension-reduction-outlier-identification-and-clustering-gloor-hands-on}{%
\subsubsection{Start time 11 - Dimension reduction, outlier
identification and clustering (Gloor, hands
on)}\label{start-time-11---dimension-reduction-outlier-identification-and-clustering-gloor-hands-on}}

\begin{itemize}
\tightlist
\item
  demonstrate dimension reduction of compositional data

  \begin{itemize}
  \tightlist
  \item
    the production and interpretation of a compositional PCA biplot
  \item
    identifying outlier samples
  \item
    learn how to conduct and interpret clustering and discriminant
    analysis in compositional data
  \item
    fuzzy clustering
  \end{itemize}
\end{itemize}

\hypertarget{start-time-12-correlation-and-compositional-association-erb}{%
\subsubsection{Start time 12: Correlation and compositional
association
(Erb)}\label{start-time-12-correlation-and-compositional-association-erb}}

\begin{itemize}
\tightlist
\item
  demonstrate compositionally appropriate identification of correlated
  (compositionally associated) features using the \texttt{propr\ R}
  package (Quinn et al., 2017b)

  \begin{itemize}
  \tightlist
  \item
    an introduction to compositional association
  \end{itemize}
\end{itemize}

\hypertarget{start-time-1-lunch-break}{%
\subsubsection{Start time: 1 lunch
break}\label{start-time-1-lunch-break}}

\hypertarget{start-time-2---correlation-and-compositional-association-continued-erb}{%
\subsubsection{Start time 2: Correlation and compositional
association continued
(Erb)}\label{start-time-2---correlation-and-compositional-association-continued-erb}}

\hypertarget{start-time-230-differential-abundance-with-aldex2-gloor}{%
\subsubsection{Start time 2:30 Differential abundance with ALDEx2
(Gloor)}\label{start-time-230-differential-abundance-with-aldex2-gloor}}

\begin{itemize}
\tightlist
\item
  demonstrate compositionally appropriate identification of
  differentially relatively abundant features using the
  \texttt{ALDEx2\ R} package

  \begin{itemize}
  \tightlist
  \item
    learn how to generate and interpret posterior expected values for
    differential relative abundance
  \item
    learn how to generate and use standardized effect sizes for
    differential relative abundance
  \item
    learn how to interpret effect plots as an adjunct to volcano and
    Bland-Altman plots
  \end{itemize}
\end{itemize}

\hypertarget{start-time-330---working-with-users-data-gloor-erb}{%
\subsubsection{Start time: 3:30 - Working with users' data (Gloor,
Erb)}\label{start-time-330---working-with-users-data-gloor-erb}}

\begin{itemize}
\tightlist
\item
  analyzing users' own data
\item
  troubleshooting users' own datasets
\item
  common problems from the participants will be highlighted and
  solutions demonstrated
\end{itemize}

\hypertarget{start-time-430--wrapup-gloor-erb}{%
\subsubsection{Start time: 4:30- Wrapup (Gloor,
Erb)}\label{start-time-430--wrapup-gloor-erb}}

\begin{itemize}
\tightlist
\item
  review of concepts and strategies
\item
  understand the congruence between the results obtained by the
  compositional biplot, compositional association and compositional
  differential relative abundance
\item
  provide guidance and sources on the proper interpretation of HTS
  datasets using a compositional paradigm
\end{itemize}

\hypertarget{finish-time-5-pm}{%
\subsubsection{Finish time 5 pm}\label{finish-time-5-pm}}

\hypertarget{requirements}{%
\subsection{Requirements}\label{requirements}}

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
  a reasonably up-to-date laptop computer with at least 8 GB RAM
\item
  familiarity with scripting or programming languages, proficiency in
  the \texttt{R} programming environment
\item
  the current version of the \texttt{R} programming language installed
\item
  a number of \texttt{R} packages will be used during the workshop.
  Participants should be familiar with installation of packages from
  both \texttt{Bioconductor} and \texttt{CRAN}
\end{enumerate}

\hypertarget{intended-audience-and-level}{%
\subsection{Intended Audience and
Level}\label{intended-audience-and-level}}

The intended audience for this session is bioinformaticians or
computational biologists who use high throughput sequencing with
experimental designs that include tag sequencing (eg. 16S rRNA gene
sequencing), metagenomics, transcriptomics or meta-transcriptomics.

This is not intended to be an introduction to R for bioinformaticians:
attendees should be relatively proficient with R, either using RStudio,
or on the command line and should have a plain text editor available.
Attendees will use R markdown documents to keep track of their work, and
templates will be provided for use. Attendees will be expected to have a
laptop with R installed and the following packages and their
dependencies: propr (CRAN), ALDEx2 (Bioconductor), omicplotR
(Bioconductor), zCompositions (CRAN). Attendees are encouraged to bring
their own datasets for analysis, but should be aware that only pairwise
(i.e., two condition) experiments will be demonstrated.

Compositional concepts will be at an introductory-intermediate level
suitable for participants of any background, but will be more intuitive
to those with a grounding in probability and linear algebra.

The practical aspects will be at an intermediate level, suitable for
participants with pre-existing competency in \texttt{R}.

Attendance should be capped at no more than 40 participants.

\hypertarget{organizers-and-presenters}{%
\subsection{Organizers and Presenters}\label{organizers-and-presenters}}

Greg Gloor is a Professor of Biochemistry at The University of Western
Ontario. He is one of the pioneers in using compositional data analysis
to analyze HTS datasets. He is the maintainer of the \texttt{ALDEx2\ R}
package on Bioconductor used for differential relative abundance
analysis. He has published original research, methods papers, and
reviews that use compositional data analysis methods to interpret HTS
datasets using transcriptome, microbiome and meta-transcriptome datasets
(Bian et al.; Fernandes et al., 2013, 2014; Gloor et al., 2016a, 2017,
2016b, 2016c; Gloor and Reid, 2016; Goneau et al., 2015; Macklaim et
al., 2013; McMillan et al., 2015; Wolfs et al., 2016). He has taught
undergraduate and graduate courses in computational biology for almost
two decades, and has won awards from both student groups and from
faculty-wide competitions. His homepage and CV are at ggloor.github.io

Ionas Erb is a postdoctoral fellow and bioinformatician at the Centre
for Genomic Regulation. He is an active developer of tools to determine
compositional association and is a contributor to the \texttt{propr\ R}
package on CRAN used to explore correlation in a compositionally
appropriate manner.
He is an advocate for, and active developer of,
compositionally-appropriate methods to examine correlation (Erb and
Notredame, 2016; Erb et al., 2017; Quinn et al., 2017a).

\hypertarget{references}{%
\subsection*{References}\label{references}}
\addcontentsline{toc}{subsection}{References}

\hypertarget{refs}{}
\leavevmode\hypertarget{ref-bian:2017}{}%
Bian, G., Gloor, G. B., Gong, A., Jia, C., Zhang, W., Hu, J., et al. The
gut microbiota of healthy aged Chinese is similar to that of the healthy
young. \emph{mSphere} 2, e00327--17.
doi:\href{https://doi.org/10.1128/mSphere.00327-17}{10.1128/mSphere.00327-17}.

\leavevmode\hypertarget{ref-erb:2016}{}%
Erb, I., and Notredame, C. (2016). How should we measure proportionality
on relative gene expression data? \emph{Theory in Biosciences} 135,
21--36.

\leavevmode\hypertarget{ref-Erb134536}{}%
Erb, I., Quinn, T., Lovell, D., and Notredame, C. (2017). Differential
proportionality - a normalization-free approach to differential gene
expression. \emph{bioRxiv}.
doi:\href{https://doi.org/10.1101/134536}{10.1101/134536}.

\leavevmode\hypertarget{ref-fernandes:2013}{}%
Fernandes, A. D., Macklaim, J. M., Linn, T. G., Reid, G., and Gloor, G.
B. (2013). ANOVA-like differential expression (ALDEx) analysis for mixed
population RNA-seq. \emph{PLoS One} 8, e67019.
doi:\href{https://doi.org/10.1371/journal.pone.0067019}{10.1371/journal.pone.0067019}.

\leavevmode\hypertarget{ref-fernandes:2014}{}%
Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A.,
Edgell, D. R., and Gloor, G. B. (2014). Unifying the analysis of
high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA
gene sequencing and selective growth experiments by compositional data
analysis. \emph{Microbiome} 2, 15.1--15.13.
doi:\href{https://doi.org/10.1186/2049-2618-2-15}{10.1186/2049-2618-2-15}.

\leavevmode\hypertarget{ref-gloor:effect}{}%
Gloor, G. B., Macklaim, J. M., and Fernandes, A. D. (2016a). Displaying
variation in large datasets: Plotting a visual summary of effect sizes.
\emph{Journal of Computational and Graphical Statistics} 25, 971--979.
doi:\href{https://doi.org/10.1080/10618600.2015.1131161}{10.1080/10618600.2015.1131161}.

\leavevmode\hypertarget{ref-gloorFrontiers:2017}{}%
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J.
(2017). Microbiome datasets are compositional: And this is not optional.
\emph{Frontiers in Microbiology} 8, 2224.
doi:\href{https://doi.org/10.3389/fmicb.2017.02224}{10.3389/fmicb.2017.02224}.

\leavevmode\hypertarget{ref-gloorAJS:2016}{}%
Gloor, G. B., Macklaim, J. M., Vu, M., and Fernandes, A. D. (2016b).
Compositional uncertainty should not be ignored in high-throughput
sequencing data analysis. \emph{Austrian Journal of Statistics} 45,
73--87.
doi:\href{https://doi.org/doi:10.17713/ajs.v45i4.122}{doi:10.17713/ajs.v45i4.122}.

\leavevmode\hypertarget{ref-Gloor:2016cjm}{}%
Gloor, G. B., and Reid, G. (2016). Compositional analysis: A valid
approach to analyze microbiome high-throughput sequencing data.
\emph{Can J Microbiol} 62, 692--703.
doi:\href{https://doi.org/10.1139/cjm-2015-0821}{10.1139/cjm-2015-0821}.
470 |
471 | \leavevmode\hypertarget{ref-gloor2016s}{}%
472 | Gloor, G. B., Wu, J. R., Pawlowsky-Glahn, V., and Egozcue, J. J.
473 | (2016c). It's all relative: Analyzing microbiome data as compositions.
474 | \emph{Ann Epidemiol} 26, 322--9.
475 | doi:\href{https://doi.org/10.1016/j.annepidem.2016.03.003}{10.1016/j.annepidem.2016.03.003}.
476 |
477 | \leavevmode\hypertarget{ref-Goneau:2015ab}{}%
478 | Goneau, L. W., Hannan, T. J., MacPhee, R. A., Schwartz, D. J., Macklaim,
479 | J. M., Gloor, G. B., et al. (2015). Subinhibitory antibiotic therapy
480 | alters recurrent urinary tract infection pathogenesis through modulation
481 | of bacterial virulence and host immunity. \emph{MBio} 6.
482 | doi:\href{https://doi.org/10.1128/mBio.00356-15}{10.1128/mBio.00356-15}.
483 |
484 | \leavevmode\hypertarget{ref-macklaim:2013}{}%
485 | Macklaim, J. M., Fernandes, A. D., Di Bella, M. J., Hammond, J.-A.,
486 | Reid, G., and Gloor, G. B. (2013). Comparative meta-RNA-seq of the
487 | vaginal microbiota and differential expression by \emph{Lactobacillus
488 | iners} in health and dysbiosis. \emph{Microbiome} 1, 15.
489 | doi:\href{https://doi.org/10.1186/2049-2618-1-12}{10.1186/2049-2618-1-12}.
490 |
491 | \leavevmode\hypertarget{ref-McMillan:2015aa}{}%
492 | McMillan, A., Rulisa, S., Sumarah, M., Macklaim, J. M., Renaud, J.,
493 | Bisanz, J. E., et al. (2015). A multi-platform metabolomics approach
494 | identifies highly specific biomarkers of bacterial diversity in the
495 | vagina of pregnant and non-pregnant women. \emph{Sci Rep} 5, 14174.
496 | doi:\href{https://doi.org/10.1038/srep14174}{10.1038/srep14174}.
497 |
498 | \leavevmode\hypertarget{ref-Quinn206425}{}%
499 | Quinn, T. P., Erb, I., Richardson, M. F., and Crowley, T. M. (2017a).
500 | Understanding sequencing data as compositions: An outlook and review.
501 | \emph{bioRxiv}.
502 | doi:\href{https://doi.org/10.1101/206425}{10.1101/206425}.
503 |
504 | \leavevmode\hypertarget{ref-Quinn:2017}{}%
505 | Quinn, T., Richardson, M. F., Lovell, D., and Crowley, T. (2017b).
506 | propr: An R package for identifying proportionally abundant features
507 | using compositional data analysis. \emph{bioRxiv}.
508 | doi:\href{https://doi.org/10.1101/104935}{10.1101/104935}.
509 |
510 | \leavevmode\hypertarget{ref-Wolfs:2016aa}{}%
511 | Wolfs, J. M., Hamilton, T. A., Lant, J. T., Laforet, M., Zhang, J.,
512 | Salemi, L. M., et al. (2016). Biasing genome-editing events toward
513 | precise length deletions with an RNA-guided TevCas9 dual nuclease.
514 | \emph{Proc Natl Acad Sci U S A}.
515 | doi:\href{https://doi.org/10.1073/pnas.1616343114}{10.1073/pnas.1616343114}.
516 |
517 |
518 | \end{document}
519 |
-------------------------------------------------------------------------------- /README.txt: --------------------------------------------------------------------------------
1 | This directory contains the readings, materials, and examples for a workshop originally offered at the Exploring Human Host-Microbiome Interactions in Health and Disease 2016 conference.
2 |
3 | The workshop was subsequently modified for the NGS'18 conference in Barcelona. For this offering we examine an RNA-seq dataset, and both outlier detection and the use of the propr R package (Ionas Erb) were added.
4 |
5 | The primary focus will be on explaining why transcriptome data are compositional, the limits and caveats on the analysis of compositional data, and finally the proper methods to analyze such data.
Practical methods to deal with the sparse nature of the data will also be presented and demonstrated.
6 |
7 | The workshop will focus on hands-on application of three main methods:
8 |
9 | Exploratory data analysis using the compositional biplot, which shows the variance in both the samples and the genes on one plot, followed by outlier detection and removal.
10 |
11 | Proper correlation and association analysis using the latest methods, including the rho metric in propr and other tools for correlation.
12 |
13 | Examination of differential abundance under a compositional approach using the ALDEx2 tool.
14 |
15 | Workshop participants should read the following:
16 |
17 | multi_comp.pdf and coda_seq.pdf: brief introductions to the problem from different perspectives
18 | make_interpret_biplot.pdf: a simple biplot with random data
19 | first_biplot.pdf: exploratory data analysis using compositional biplots, and outlier detection and removal in the example dataset
20 | first_comparison.pdf: finding maximally different genes using ALDEx2
21 | first_association.pdf: finding maximally associated genes using propr
-------------------------------------------------------------------------------- /chunk/R/propr-functions.R: --------------------------------------------------------------------------------
1 | #' @title Closure
2 | #'
3 | #' @description
4 | #' \code{propr.clo} divides each row of \code{X} by its row sum.
5 | #'
6 | #' @details If \code{check} is \code{TRUE} then this function will stop if
7 | #' there are any negative or \code{NA} values in \code{X}.
8 | #' @param X A matrix or dataframe of positive numeric values
9 | #' @param check A logical scalar
10 | #' @return A version of \code{X} where each row has been scaled so that it sums to 1.
11 | #' @examples
12 | #' X <- matrix(1:12, nrow=3)
13 | #' x <- propr.clo(X)
14 | #' rowSums(x)
15 | #' @export
16 | propr.clo <- function(X, check=FALSE){
17 |   if(check){
18 |     if(any(X < 0)) stop("negative values found")
19 |     if(any(is.na(X))) stop("NA values found")
20 |   }
21 |   return(sweep(X, 1, rowSums(X), "/"))
22 | }
23 |
24 | #######################################################################################
25 | #' @title Centred logratio transformation
26 | #'
27 | #' @description
28 | #' \code{propr.clr} takes the log of each row of X and centres it (i.e., subtracts the mean).
29 | #'
30 | #' @details If \code{check} is \code{TRUE} then this function will stop if
31 | #' there are any negative or \code{NA} values in \code{X}.
32 | #' @param X A matrix or dataframe of positive numeric values
33 | #' @param check A logical scalar
34 | #' @return The logarithm of \code{X} where each row has been shifted to have mean 0.
35 | #' \deqn{\mathrm{clr}(x)_i = \log x_i - \frac{1}{D}\sum_{j=1}^D \log x_j}
36 | #' @examples
37 | #' X <- matrix(1:12, nrow=3)
38 | #' x <- propr.clr(X)
39 | #' rowSums(x) # Pretty close to zero
40 | #' apply(exp(x), 1, prod) # The row products of exp(x) will be 1
41 | #' @export
42 | propr.clr <- function(X, check=FALSE){
43 |   if(check){
44 |     if(any(X < 0)) stop("negative values found")
45 |     if(any(is.na(X))) stop("NA values found")
46 |   }
47 |   logX <- log(X)
48 |   return(sweep(logX, 1, rowMeans(logX), "-"))
49 | }
50 |
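# added note: the clr is scale-invariant -- multiplying a sample (row) by
# any constant leaves its clr values unchanged, which is why clr-based
# analyses do not depend on the total read count; a quick check:
#   X <- matrix(1:12, nrow=3)
#   all.equal(propr.clr(X), propr.clr(10 * X))   # TRUE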
51 | #######################################################################################
52 | #' @title Variance of logratios
53 | #'
54 | #' @description
55 | #' \code{propr.vlr} returns a matrix where element (i,j) is
56 | #' the variance (over rows) of the log of the ratio of columns i and j.
57 | #'
58 | #' @details If \code{check} is \code{TRUE} then this function will stop if
59 | #' there are any negative or \code{NA} values in \code{X}.
60 | #' @param X A matrix or dataframe of positive numeric values
61 | #' @param check A logical scalar
62 | #' @return The symmetric matrix
63 | #' \eqn{\mathrm{Var}{\log(X_i/X_j)}}{Var(log(X_i/X_j))} where \eqn{X_i} and \eqn{X_j}
64 | #' denote \emph{columns} \eqn{i} and \eqn{j} of \eqn{X}.
65 | #' @examples
66 | #' N <- 10 # Number of observations
67 | #' # Make a data frame with columns a and b roughly proportional
68 | #' # and columns c and d roughly proportional
69 | #' X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
70 | #'                 c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
71 | #' round(propr.vlr(X),2)
72 | #' @export
73 | propr.vlr <- function(X, check=FALSE){
74 |   if(check){
75 |     if(any(X < 0)) stop("negative values found")
76 |     if(any(is.na(X))) stop("NA values found")
77 |   }
78 |   logX <- log(X)
79 |   Cov <- stats::var(logX) ## Note the avoidance of compositions::var
80 |   D <- ncol(logX)
81 |   VarCol <- matrix(rep(diag(Cov), D), ncol = D)
82 |   # Var(log Xi/Xj) = Var(log Xi) + Var(log Xj) - 2*Cov(log Xi, log Xj)
83 |   return(-2 * Cov + VarCol + t(VarCol))
84 | }
85 |
86 | #######################################################################################
87 | #' @title Symmetric phi statistic
88 | #'
89 | #' @description
90 | #' \code{propr.phisym} returns a matrix where element (i,j) is
91 | #' the symmetric phi statistic between columns i and j of \code{X}.
92 | #' @details \code{X} should be the result of a centred logratio transformation.
93 | #' @param X A matrix or dataframe
94 | #' @return The symmetric matrix of phi values between the columns of \code{X}.
95 | #' @examples
96 | #' N <- 10 # Number of observations
97 | #' # Make a data frame with columns a and b roughly proportional
98 | #' # and columns c and d roughly proportional
99 | #' X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
100 | #'                 c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
101 | #' round(propr.phisym(propr.clr(X)),2)
102 | #' @export
103 | propr.phisym <- function(X)
104 | {
105 |   Cov <- stats::var(X)
106 |   tmp <- 2 * Cov / outer(diag(Cov), diag(Cov), "+")
107 |   return((1-tmp)/(1+tmp))
108 | }
109 |
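# added note (illustration): for clr-transformed columns i and j, the
# symmetric phi equals vlr(i,j) / Var(clr_i + clr_j), so phi is small
# exactly when the log-ratio variance is small relative to the total
# variation of the pair; a quick numerical check with synthetic data:
#   set.seed(1)
#   X <- matrix(rlnorm(200), ncol=4)
#   Z <- propr.clr(X)
#   all.equal(propr.phisym(Z)[1,2],
#             propr.vlr(X)[1,2] / stats::var(Z[,1] + Z[,2]))   # TRUE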
110 | #######################################################################################
111 | #' @title Expected value of phi from Dirichlet log-ratio distributions
112 | #'
113 | #' @description
114 | #' By default returns a dataframe of the lower triangle of the symmetrical phi metric;
115 | #' alternatively returns a matrix of the symmetrical phi metric.
116 | #' In either case, the value of phi is the expected value over a number of Dirichlet
117 | #' Monte-Carlo replicates of the data. This reduces the problem of
118 | #' 0-count and low-count features being highly variable: their phi values
119 | #' range wildly, and so their expected value is always large.
120 | #' @details Requires the aldex.clr function from ALDEx2.
121 | #' @param aldex.clr An S3 object from the aldex.clr function
122 | #' @param return If "df" (the default), returns a dataframe of the expected value of
123 | #'   the lower triangle of the propr.phisym function; if "mat", returns the symmetric
124 | #'   matrix. The number of Dirichlet Monte-Carlo replicates is obtained from the
125 | #'   aldex.clr object. We ignore all the other measures that are used for
126 | #'   trouble-shooting phi; the sma.df function in particular is very time and
127 | #'   memory intensive.
128 | #' @examples
129 | #' # use a count table where the samples are by column, features by row
130 | #' x <- aldex.clr(count.table)
131 | #' phi.df <- propr.aldex.phi(x, return="df")
132 |
133 | propr.aldex.phi <- function(aldex.clr, return="df"){
134 |
135 |   # calculate expected value of phi
136 |   # a single high phi value will push the component out of consideration
137 |   # a median is right out for memory considerations
138 |
139 |   # get first value
140 |   sym.phi <- propr.phisym(t(sapply(getMonteCarloInstances(aldex.clr),
141 |     function(y){y[,1]})))
142 |
143 |   # sum the rest of the values as we proceed through the DIR MC instances
144 |   for(i in 2:numMCInstances(aldex.clr)){
145 |     #print(i)
146 |     sym.phi <- sym.phi + propr.phisym(t(sapply(getMonteCarloInstances(aldex.clr),
147 |       function(y){y[,i]})))
148 |   }
149 |   ##### Done ALDEx2 stuff
150 |
151 |   # make indices of the correct size (lower triangle)
152 |   lt <- which(col(sym.phi) < row(sym.phi))
194 | points(x.all$wi.eBH[x.all$effect > 4], x.all$diff.btw[x.all$effect > 4], col="red", pch=19)
195 | points(x.all$wi.eBH[x.all$effect < -4],x.all$diff.btw[x.all$effect < -4], col="blue",pch=19)
196 |
197 | # effect size plot
198 | aldex.plot(x.all, type="MW")
199 |
200 | # we typically use effect sizes of 2 or more
201 |
202 | # subset to find those with very large effect sizes
203 | rownames(x.all)[abs(x.all$effect) > 4]
204 | ```
205 | ```{r edgeR}
206 |
207 | # https://bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
208 |
209 | library(edgeR)
210 |
211 | # from ALDEx above
212 | group <- factor(conds)
213 |
214 | y <- DGEList(counts=d.good.gt0, group=group)
215 |
216 | y <- calcNormFactors(y)
217 |
218 | y <- estimateDisp(y)
219 |
220 | # plot a multidimensional scaling (MDS) plot of the data
221 | # by default uses the 500 genes with the largest difference between samples
222 | # this is a classical MDS on leading log-fold-change distances rather than actual variances
223 | # note that in this dataset, it looks very similar to the plot from the clr values
224 |
225 | plotMDS(y)
226 |
227 | # how well does the data fit the model?
228 | # looking for the blue and red lines to be coincident
229 | plotBCV(y)
230 |
231 | # this is the differential abundance test
232 | # exact test using the negative binomial
233 | et <- exactTest(y)
234 | # hack to get the data for plotting later; n is set to the number of genes so all are returned
235 | tt <- topTags(et, sort.by="none", n=6236)
236 |
237 | summary(et <- decideTestsDGE(et))
238 | detags <- rownames(y)[as.logical(et)]
239 |
240 |
241 | # you can plot the MA plot
242 | # compare to the MA plot from ALDEx, they are rather similar for this dataset
243 | plotSmear(y, de.tags=detags)
244 | abline(h=c(-1, 1), col="blue", lty=2)
245 |
246 | ```
247 |
248 | ```{r DESeq}
249 | # https://www.bioconductor.org/packages/3.3/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
250 |
251 | library(DESeq)
252 |
253 | # we will use the same input table
254 | # set up the metadata
255 | design <- data.frame(
256 |   row.names = colnames(d.good.gt0),
257 |   condition = conds,
258 |   libType="single-end"
259 | )
260 |
261 | # for convenience so I can cut and paste
262 | countTable <- d.good.gt0
263 | condition <- design$condition
264 | condition <- factor(condition)
265 |
266 | # make the base data table
267 | cds = newCountDataSet( countTable, condition )
268 |
269 | # normalize read counts
270 | cds = estimateSizeFactors( cds )
271 | # observe
272 | sizeFactors( cds )
273 |
274 | # estimate the variance and fit to a negative binomial
275 | cds = estimateDispersions( cds )
276 |
277 | # observe how well it fits the data
278 | # how does this differ from, and what is similar to, edgeR?
279 | plotDispEsts( cds )
280 |
281 | # find DE genes
282 | res = nbinomTest( cds, "SNF", "WT" )
283 |
284 | # plot the MA plot, compare to ALDEx2 and edgeR
285 | plotMA(res)
286 |
287 | # it is useful to look at the p value distribution
288 | par(mfrow=c(1,3))
289 | hist(res$pval, breaks=100, main="DESeq") # DESeq
290 | hist(tt[[1]]$PValue, breaks=100, col=rgb(1,0,0,0.1), main="edgeR") # edgeR
291 | hist(x.all$we.ep, breaks=100, col=rgb(0,0,1,0.1), main="ALDEx2") # ALDEx2
292 |
293 | par(mfrow=c(1,2))
294 | plot(res$pval, tt[[1]]$PValue, log="xy")
295 | plot(res$pval, x.all$we.ep, log="xy")
296 |
297 |
298 | # get lists of differential genes by tool
299 | # use cutoff of BH of 0.05
300 |
301 | sig.ald <- rownames(x.all)[x.all$we.eBH < 0.05]
302 | sig.des <- res$id[res$padj < 0.05]
303 | sig.edg <- detags
304 |
305 | write.table(sig.ald, file="sig.ald.txt", quote=FALSE, row.names=F, col.names=F)
306 | write.table(sig.des, file="sig.des.txt", quote=FALSE, row.names=F, col.names=F)
307 | write.table(sig.edg, file="sig.edg.txt", quote=FALSE, row.names=F, col.names=F)
308 |
309 |
310 | ```
311 |
312 |
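Having written the per-tool significant-gene lists, a natural check (not part of the original workflow) is how much the three tools agree; this sketch assumes the `sig.*` vectors from the chunk above:

```{r overlap, eval=FALSE}
# overlap between the significant-gene lists (BH < 0.05) from each tool
length(intersect(sig.ald, sig.edg))
length(intersect(sig.ald, sig.des))
length(Reduce(intersect, list(sig.ald, sig.des, sig.edg)))
```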
313 | ```{r, annotation}
314 | # you can supply lists of genes of interest, or call directly from within edgeR for this
315 | # see the edgeR documentation for an example of GO terms, only metazoan models
316 | # we can do KEGG term annotation though
317 |
318 | de.aldex <- rownames(x.all)[x.all$effect > 2]
319 | write.table(de.aldex, file="aldex_sig.effect.txt", sep="\t", col.names=F, row.names=F, quote=F)
320 |
321 | # http://www.kegg.jp/kegg/tool/map_pathway1.html
322 |
323 | # kegga queries an external database (KEGG)
324 | kegg.aldex <- kegga(de.aldex, species.KEGG="sce")
325 |
326 | ```
327 |
328 | ```{r, assoc}
329 |
330 | library(ALDEx2)
331 | library(CoDaSeq)
332 | library(zCompositions)
333 | library(igraph)
334 | library(car)
335 | source("http://michael.hahsler.net/SMU/ScientificCompR/code/map.R")
336 |
337 |
338 | # make the conditions vector first, then generate the clr object
339 |
340 | conds <- c(rep("SNF", length(SNF.g$good)), rep("WT", length(WT.g$good)))
341 |
342 | x <- aldex.clr(d.good.gt0, conds, mc.samples=16)
343 |
344 | x.phi <- codaSeq.phi(x)
345 |
346 | phi.cutoff <- .005
347 | x.lo.phi <- subset(x.phi, phi <= phi.cutoff)
348 |
349 | # generate a graphical object
350 | g <- graph.data.frame(x.lo.phi, directed=FALSE)
351 |
352 | # overview of all the proportional relationships
353 | # this can take a long time!!!
354 | V(g)$label.cex <- 1
355 | OTU.names <- V(g)$name
356 |
357 | # V(g)$name <- as.character(hmpgenera[V(g)$name, "genus"])
358 |
359 |
360 | plot(g, layout=layout.fruchterman.reingold.grid(g, weight=0.05/E(g)$phi), vertex.size=1, vertex.color="black")
361 |
362 | # http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html
363 | layout <- layout.fruchterman.reingold(g)
364 | plot(g, layout=layout)
365 |
366 | # map centrality
367 | plot(g, layout=layout,
368 |   vertex.size=map(betweenness(g), c(1,15)),
369 |   edge.width=map(edge.betweenness(g), c(1,10)))
370 |
371 | # plot page rank (connectedness)
372 | pr <- page.rank(g)$vector
373 | plot(g, layout=layout, vertex.size=map(pr, c(1,20)), edge.arrow.size=.2)
374 |
375 | # break into component sub-groups
376 | dg <- decompose.graph(g)
377 | plot(dg[[1]], layout=layout.fruchterman.reingold(dg[[1]]),
378 |   vertex.size=map(page.rank(dg[[1]])$vector, c(1,10)),
379 |   edge.arrow.size=.2)
380 |
381 |
382 | ## communities
383 | # https://users.dimi.uniud.it/~massimo.franceschet/R/communities.html
384 | c1 = cluster_fast_greedy(dg[[1]])
385 | plot(dg[[1]], vertex.color=membership(c1), layout=layout_with_fr(dg[[1]]))
386 |
387 | c2 = cluster_leading_eigen(g)
388 | plot(g, vertex.color=membership(c2), layout=layout_with_fr(g), vertex.size=map(pr, c(1,20)), edge.arrow.size=.2)
389 |
390 | c3 = cluster_edge_betweenness(g)
391 | plot(g, vertex.color=membership(c3), layout=layout_with_fr(g), vertex.size=map(page.rank(g)$vector, c(1,20)), edge.arrow.size=.2)
392 |
393 | plot(g, vertex.color=membership(c3), layout=layout_with_fr(g), vertex.size=map(betweenness(g), c(1,15)),
394 |   edge.width=map(edge.betweenness(g), c(1,10)))
395 |
396 |
397 |
398 |
399 | pdf("phi_graph.pdf", height=5, width=5)
400 | plot(dg[[21]]) # component 21 was chosen by inspection; adjust for your graph
401 | dev.off()
402 |
403 | # get the clusters from the graph object
404 | g.clust <- clusters(g)
405 |
406 | # data frame containing the names and group memberships of each cluster
407 | # g.df.u <- data.frame(Systematic.name=V(g)$name, cluster=g.clust$membership, cluster.size=g.clust$csize[g.clust$membership], stringsAsFactors=FALSE)
408 | g.df.u <- data.frame(Name=V(g)$name, cluster=g.clust$membership, stringsAsFactors=FALSE)
409 |
410 | # generate the list of features and clusters in a data frame
411 | g.df <- g.df.u[order(g.df.u[,"cluster"]),]
412 | ```
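The community-detection calls above will generally partition the graph differently; modularity gives a rough score for comparing partitions of the same graph (a sketch, assuming the objects computed above):

```{r modularity, eval=FALSE}
# higher modularity indicates more cleanly separated communities (both on g)
modularity(c2) # leading-eigenvector partition
modularity(c3) # edge-betweenness partition
```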
-------------------------------------------------------------------------------- /chunk/setup.R: --------------------------------------------------------------------------------
1 |
2 | # this is the dataset I will present in Cambridge
3 | # for the tutorial, we will use a random subset of 50 samples from each
4 |
5 | # The dataset contains 229 genera and 1457 samples. Samples are named thusly:
6 | # ids have leading 700 stripped off
7 | # samples named as follows:
8 | # td_ Tongue Dorsum
9 | # bm_ Buccal mucosa
10 | # ak_ Attached Keratinized Gingiva
11 | # hp_ Hard Palate
12 | # pt_ Palatine Tonsils
13 | # sa_ Saliva
14 | # up_ Subgingival Plaque (under plaque)
15 | # op_ Supragingival Plaque (over plaque)
16 |
17 | # read in the example table
18 | # this is a subset of the HMP oral microbiome with 24 TD and 18 BM samples
19 | # no real reason to choose these except they separate
20 |
21 | # HMP oral V3-5 microbiome dataset
22 | mouth <- read.table("~/git/compositions/oral/data/mouth_otu.txt", header=T, row.names=1, sep="\t")
23 | taxon <- read.table("~/git/compositions/oral/data/taxon_names.txt", header=T, row.names=1, sep="\t")
24 |
25 | # fix unknown taxa
26 | # all p__, c__, o__ named
27 | notF <- grep("f__", taxon[,1], invert=T)
28 | fillF <- "f__Unknown;g__Unknown"
29 |
30 | for(i in 1:length(notF)){
31 |   taxon[notF[i],1] <- ( paste(taxon[notF[i],1], fillF, sep=";") )
32 | }
33 | notG <- grep("g__", taxon[,1], invert=T)
34 | fillG <- "g__Unknown"
35 |
36 | for(i in 1:length(notG)){
37 |   taxon[notG[i],1] <- ( paste(taxon[notG[i],1], fillG, sep=";") )
38 | }
39 |
40 | z <- strsplit(as.character(taxon[,1]), ';')
41 | tax.df <- as.data.frame(do.call(rbind,z))
42 |
43 | rownames(tax.df) <- rownames(mouth)
44 | colnames(tax.df) <- c("root", "phylum", "class", "order", "family", "genus")
45 | tax.df$phylum <- gsub("p__", "", tax.df$phylum)
46 | tax.df$genus <- gsub("g__", "", tax.df$genus)
47 |
48 |
49 | d <- data.frame(mouth[,grep("ak_", colnames(mouth))], mouth[,grep("op_", colnames(mouth))])
50 |
51 | # keep samples with > 1000 reads
52 | d.col <- d[, which(apply(d,2,sum) > 1000)]
53 |
54 | # many ways to filter
55 | # keep OTUs with mean > 0.1 reads across all samples
56 | d.subset <- d[ which(apply(d.col,1,mean) > .1),]
57 | taxon.1 <- tax.df[ which(apply(d.col,1,mean) > .1),]
58 |
59 | # this is the base dataset that everyone can use
60 | write.table(d.subset, file="~/git/Host_Microbe_2016/data/ak_vs_op.txt", sep="\t", col.names=NA, quote=F)
61 | write.table(taxon.1, file="~/git/Host_Microbe_2016/data/ak_vs_op_taxon.txt", sep="\t", col.names=NA, quote=F)
62 |
63 | d <- data.frame(mouth[,grep("up_", colnames(mouth))], mouth[,grep("op_", colnames(mouth))])
64 |
65 | # keep samples with > 1000 reads
66 | d.col <- d[, which(apply(d,2,sum) > 1000)]
67 |
68 | # many ways to filter
69 | # keep OTUs with mean > 0.1 reads across all samples
70 | d.subset <- d[ which(apply(d.col,1,mean) > .1),]
71 | taxon.1 <- tax.df[ which(apply(d.col,1,mean) > .1),]
72 |
73 | # this is the base dataset that everyone can use
74 | write.table(d.subset, file="~/git/Host_Microbe_2016/data/up_vs_op.txt", sep="\t", col.names=NA, quote=F)
75 | write.table(taxon.1, file="~/git/Host_Microbe_2016/data/up_vs_op_taxon.txt", sep="\t", col.names=NA, quote=F)
76 |
77 | #############
78 |
79 | ################ OTU LEVEL
80 | # some initial characterization of the whole dataset
81 |
82 |
83 | d.subset <- read.table("~/git/Host_Microbe_2016/data/ak_vs_op.txt", sep="\t", header=T, row.names=1)
84 | taxon.1 <- read.table("~/git/Host_Microbe_2016/data/ak_vs_op_taxon.txt", sep="\t", header=T, row.names=1)
85 |
86 | library(zCompositions)
87 | d.n0 <- cmultRepl(t(d.subset), method="CZM", label=0)
88 | d.clr <- t(apply(d.n0, 1, function(x) log(x) - mean(log(x))))
89 | pcx <- prcomp(d.clr)
90 | rownames(pcx$rotation) <- taxon.1[rownames(pcx$rotation), "genus"]
91 |
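# added note: cmultRepl performs count zero multiplicative (CZM) replacement --
# zeros are replaced with small pseudo-counts so that logarithms can be taken --
# and the apply() call above is the clr: log(x) minus the mean log(x) of each
# sample (each row of d.n0)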
92 | PC1 <- paste("PC1: ", round(pcx$sdev[1]^2/sum(pcx$sdev^2),3), sep="")
93 | PC2 <- paste("PC2: ", round(pcx$sdev[2]^2/sum(pcx$sdev^2),3), sep="")
94 |
95 |
96 | biplot(pcx, cex=c(0.5,0.4), col=c("black", rgb(1,0,0,0.2)), var.axes=F, scale=0, xlab=PC1, ylab=PC2)
97 |
98 | d.rand <- cbind(d.subset[,sample(grep("ak", colnames(d.subset)), 25)], d.subset[,sample(grep("op", colnames(d.subset)), 25)])
99 | d.rand.filt <- d.rand[rowSums(d.rand) > 0,]
100 |
101 |
102 | n.rand = 25 # 25 samples of each group were drawn above
103 | conds <- c(rep("ak", n.rand), rep("op", n.rand))
104 |
105 |
106 | x <- aldex.clr(d.rand.filt, conds) # use the zero-row filtered table and supply the conditions
107 |
108 | x.e <- aldex.effect(x, conds)
109 | x.t <- aldex.ttest(x, conds)
110 | x.all <- data.frame(x.e,x.t)
111 |
112 | tax <- "genus"
113 | all.plot <- data.frame(x.all, tax.df[rownames(x.all), tax])
114 | colnames(all.plot)[ncol(all.plot)] <- "taxon.level"
115 | all.plot <- droplevels(all.plot)
116 |
117 | no.sig <- all.plot$wi.eBH > 0.01
118 | sig.pos <- all.plot$wi.eBH < 0.01 & all.plot$effect > 0
119 | sig.neg <- all.plot$wi.eBH < 0.01 & all.plot$effect < 0
120 |
121 | groups <- unique(all.plot$taxon.level)
122 | ylim <- c(length(groups) - (length(groups)+0.5), length(groups) + 0.5)
123 |
124 | xlim <- c(-max(abs(all.plot$effect)), max(all.plot$effect))
125 | pdf("genus_stripchart.pdf")
126 | par(mar=c(5,15,5,1), las=1, cex=0.7)
127 | stripchart(effect ~ taxon.level, data=all.plot[no.sig,], col=rgb(0,0,0,0.3), method="jitter", jitter=0.2, pch=19, xlim=xlim, xlab="effect", main=tax)
128 | stripchart(effect ~ taxon.level, data=all.plot[sig.pos,], col=rgb(0,0,1,0.3), method="jitter", jitter=0.2, pch=19, xlim=xlim, xlab="effect", add=T)
129 | stripchart(effect ~ taxon.level, data=all.plot[sig.neg,], col=rgb(1,0,0,0.3), method="jitter", jitter=0.2, pch=19, xlim=xlim, xlab="effect", add=T)
130 | abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
131 | # draw horizontal lines
132 | for (j in 0.5:(length(groups)+0.5)){
133 |   abline(h=j, lty=3, col=rgb(0,0,0,0.3))
134 | }
135 | dev.off()
136 |
137 |
138 | ### MAKE A FULL COMPARISON FOR THE UP OP SAMPLE PAIR, and the ak op sample pair
139 |
140 | e.subset <- read.table("data/up_vs_op.txt", row.names=1, header=T)
141 | e.conds <- c(rep("up", length(grep("up", colnames(e.subset))) ), rep("op", length(grep("op", colnames(e.subset)))) )
142 | e.x <- aldex.clr(e.subset, e.conds)
143 | e.eff <- aldex.effect(e.x, e.conds)
144 | e.tt <- aldex.ttest(e.x, e.conds)
145 | e.all <- data.frame(e.eff,e.tt)
146 |
147 |
148 | # SAVE DATA IN ak_vs_op_aldex.txt AND up_vs_op_aldex.txt
149 | write.table(x.all, file="~/git/Host_Microbe_2016/data/ak_vs_op_aldex.txt", sep="\t", col.names=NA, quote=F)
150 | write.table(e.all, file="~/git/Host_Microbe_2016/data/up_vs_op_aldex.txt", sep="\t", col.names=NA, quote=F)
151 |
152 | ############# GENUS LEVEL
153 | # aggregate the data by genus count
154 | mouth.genus <- aggregate(mouth, by=list(tax.df$genus), FUN=sum)
155 | rownames(mouth.genus) <- mouth.genus$Group.1
156 | mouth.genus$Group.1 <- NULL
157 | mouth.genus.subset <- mouth.genus[rowSums(mouth.genus) > 0,]
158 | write.table(mouth.genus.subset, file="~/git/Host_Microbe_2016/data/mouth_genus.txt", sep="\t", col.names=NA, quote=F)
159 |
160 |
161 | gen.n0 <- cmultRepl(t(mouth.genus.subset), label=0, method="CZM")
162 |
163 | gen.clr <- t(apply(gen.n0, 1, function(x) log(x) - mean(log(x)) ))
164 | pcx.gen <- prcomp(gen.clr)
165 |
166 | PC1 <- paste("PC1: ", round(pcx.gen$sdev[1]^2/sum(pcx.gen$sdev^2),3), sep="")
167 | PC2 <- paste("PC2: ", round(pcx.gen$sdev[2]^2/sum(pcx.gen$sdev^2),3), sep="")
168 | biplot(pcx.gen, cex=c(0.6,0.5), var.axes=F, scale=0,
xlab=PC1, ylab=PC2)
169 | abline(v=0)
170 | abline(h=0)
171 |
172 | d.g <- data.frame(mouth.genus.subset[,grep("ak_", colnames(mouth.genus.subset))], mouth.genus.subset[,grep("op_", colnames(mouth.genus.subset))])
173 | d.g.subset <- d.g[apply(d.g, 1, mean) > 0.1, ]
174 | d.g.n0 <- cmultRepl(t(d.g.subset), label=0, method="CZM")
175 | d.g.clr <- t(apply(d.g.n0, 1, function(x) log(x) - mean(log(x)) ))
176 | pcx.d.g <- prcomp(d.g.clr)
177 |
178 | PC1 <- paste("PC1: ", round(pcx.d.g$sdev[1]^2/sum(pcx.d.g$sdev^2),3), sep="")
179 | PC2 <- paste("PC2: ", round(pcx.d.g$sdev[2]^2/sum(pcx.d.g$sdev^2),3), sep="")
180 | biplot(pcx.d.g, cex=c(0.9,0.7), var.axes=F, scale=0, xlab=PC1, ylab=PC2)
181 | abline(v=0)
182 | abline(h=0)
183 |
184 | conds.g <- c(rep("A", length(grep("ak", colnames(d.g.subset))) ), rep("O", length(grep("op", colnames(d.g.subset)))) )
185 | x.g <- aldex.clr(d.g.subset, conds.g)
186 | x.g.e <- aldex.effect(x.g, conds.g)
187 | x.t.t <- aldex.ttest(x.g, conds.g)
188 | x.g.all <- data.frame(x.g.e, x.t.t)
189 |
190 |
-------------------------------------------------------------------------------- /coda_seq.Rmd: --------------------------------------------------------------------------------
1 |
2 | ---
3 | title: "CoDa and sequencing"
4 | author: "gg"
5 | date: '`r format(Sys.time(), "%d %B, %Y")`'
6 | bibliography: ~/Library/texmf/bibtex/bib/bibdesk_refs.bib
7 | fig_caption: true
8 | output:
9 |     pdf_document:
10 |         fig_caption: yes
11 | ---
12 |
13 | To run this file:
14 | Rscript -e "rmarkdown::render('coda_seq.Rmd')"
15 |
16 | It is assumed that the output from a high-throughput sequencing experiment represents in some way the underlying abundance of the input DNA molecules. This is not necessarily the case, as explained by the following thought experiment.
17 |
18 | ```{r, echo=F, fig.width=10, results='asis', fig.cap="High-throughput sequencing affects the shape of the data differently on constrained and unconstrained data. The two left panels show the absolute number of reads in the input tube for 20 steps where the black OTU is changing abundance by 2-fold each step. The gray, blue and red OTUs are held at a constant number in each step in both cases. The second column shows the output in proportions (or ppm, or FPKM) after random sampling to a constant sum, as occurs on the sequencer. The orange OTU in the constrained data set is much more abundant than any other, and is changing to maintain a constant number of input molecules. The panels in the bottom row are the same values plotted on a log scale on the Y-axis for convenience.
Note how the constrained data is the same before and after sequencing while the unconstrained data is severely distorted."}
19 | rdirichlet <- function (n, alpha)
20 | {
21 |   if(length(n) > 1) n <- length(n)
22 |   #if(length(n) == 0 || as.integer(n) == 0) return(numeric(0))
23 |   #n <- as.integer(n)
24 |   if(n < 0) stop("value(n) cannot be negative in rdirichlet")
25 |
26 |   if(is.vector(alpha)) alpha <- t(alpha)
27 |   l <- dim(alpha)[2]
28 |   x <- matrix(rgamma(l * n, t(alpha)), ncol = l, byrow=TRUE) # handles recycling
29 |   return(x / rowSums(x))
30 | }
31 |
32 | num.one = 100 # the number of rare-counts in the dataset
33 |
34 | mat.double <- matrix(data=NA, nrow=20, ncol=num.one + 10)
35 | prop.mat <- matrix(data=NA, nrow=20, ncol=num.one + 10)
36 | clr.mat <- matrix(data=NA, nrow=20, ncol=num.one + 10)
37 |
38 | mat.double.u <- matrix(data=NA, nrow=20, ncol=num.one + 10)
39 | prop.mat.u <- matrix(data=NA, nrow=20, ncol=num.one + 10)
40 | clr.mat.u <- matrix(data=NA, nrow=20, ncol=num.one + 10)
41 |
42 | # constant sum input
43 | minimum.count <- 1 # multiplier to set minimum count for in.put
44 | # non-constant sum input with one big increase
45 | in.put <- c(10,20971,1,1,5,10,20,50,100,200,1000) * minimum.count
46 |
47 | total.sum <- sum(in.put + 1) * 1000
48 |
49 | for(i in 0:19){
50 |   # constant sum input
51 |   junk <- in.put * c(2^i, rep(1,num.one + 9))
52 |   junk[3] <- total.sum - sum(junk)
53 |   mat.double[i+1,] <- junk
54 |   prop.mat[i+1,] <- as.numeric( rdirichlet(1, junk) )
55 |   clr.mat[i+1,] <- log2(prop.mat[i+1,]) - mean(log2(prop.mat[i+1,]))
56 | }
57 |
58 | for(i in 0:19){
59 |   # non-constant sum input
60 |   #junk <- in.put * c(2^i, rep(1,num.one + 9))
61 |   junk <- in.put * c(2^i, rep(1,num.one + 9))
62 |   mat.double.u[i+1,] <- junk
63 |   prop.mat.u[i+1,] <- as.numeric( rdirichlet(1, junk) )
64 |   clr.mat.u[i+1,] <- 2^(log2(prop.mat.u[i+1,]) - mean(log2(prop.mat.u[i+1,])))
65 | }
66 |
67 | par(mfrow=c(2,4), mar=c(4,4,3,1) )
68 |
69 | plot(mat.double[,1], pch=20, type="b", ylim=c(min(mat.double), max(mat.double)), xlab="time point", ylab="raw count")
70 | title( main="Constrained\ninput (linear)", adj=0.5)
71 | points(mat.double[,2], type="b",pch=21, col="gray")
72 | points(mat.double[,3], type="b",pch=22, col="orange")
73 | points(mat.double[,num.one + 10], type="b",pch=23, col="blue")
74 | points(mat.double[,num.one+4], type="b",pch=24, col="red")
75 |
76 | plot(prop.mat[,1], pch=20, type="b", ylim=c(min(prop.mat[,num.one+4]), max(prop.mat)), xlab="time point", ylab="raw proportion")
77 | title( main="Constrained\nproportion (linear)", adj=0.5)
78 | points(prop.mat[,2], type="b", pch=21, col="gray")
79 | points(prop.mat[,3], type="b", pch=22, col="orange")
80 | points(prop.mat[,num.one+10], type="b", pch=23, col="blue")
81 | points(prop.mat[,num.one+4], type="b", pch=24, col="red")
82 |
83 | plot(mat.double.u[,1], pch=20, type="b", ylim=c(min(mat.double.u), max(mat.double.u)), xlab="time point", ylab="raw count")
84 | title( main="Unconstrained\ninput (linear)", adj=0.5)
85 | points(mat.double.u[,2], type="b",pch=21, col="gray")
86 | points(mat.double.u[,num.one + 10], type="b",pch=23, col="blue")
87 | points(mat.double.u[,num.one+4], type="b",pch=24, col="red")
88 |
89 | plot(prop.mat.u[,1], pch=20, type="b", ylim=c(min(prop.mat.u[,num.one+4]), max(prop.mat.u)), xlab="time point", ylab="raw proportion")
90 | title( main="Unconstrained\nproportion (linear)", adj=0.5)
91 | points(prop.mat.u[,2], type="b", pch=21, col="gray")
92 | points(prop.mat.u[,num.one+10], type="b", pch=23,
col="blue") 93 | points(prop.mat.u[,num.one+4], type="b", pch=24, col="red") 94 | 95 | plot(mat.double[,1], pch=20, type="b", log="y", ylim=c(min(mat.double), max(mat.double)), xlab="time point", ylab="log10 count") 96 | title( main="Constrained\ninput (log)", adj=0.5) 97 | points(mat.double[,2], type="b",pch=21, col="gray") 98 | points(mat.double[,3], type="b",pch=22, col="orange") 99 | points(mat.double[,num.one + 10], type="b",pch=23, col="blue") 100 | points(mat.double[,num.one+4], type="b",pch=24, col="red") 101 | 102 | plot(prop.mat[,1], pch=20, type="b", ylim=c(min(prop.mat[,num.one+4]), max(prop.mat)), xlab="time point", log="y", ylab="log10 proportion") 103 | title( main="Constrained\nproportion (log)", adj=0.5) 104 | points(prop.mat[,2], type="b", pch=21, col="gray") 105 | points(prop.mat[,3], type="b", pch=22, col="orange") 106 | points(prop.mat[,num.one+10], type="b", pch=23, col="blue") 107 | points(clr.mat[,num.one+4], type="b", pch=24, col="red") 108 | 109 | plot(mat.double.u[,1], pch=20, type="b", log="y", ylim=c(min(mat.double.u), max(mat.double.u)), xlab="time point", ylab="log10 count") 110 | title( main="Unconstrained\ninput (log)", adj=0.5) 111 | points(mat.double.u[,2], type="b",pch=21, col="gray") 112 | points(mat.double.u[,num.one + 10], type="b",pch=23, col="blue") 113 | points(mat.double.u[,num.one+4], type="b",pch=24, col="red") 114 | 115 | 116 | plot(prop.mat.u[,1], pch=20, type="b", ylim=c(min(prop.mat.u[,num.one+4]), max(prop.mat.u)),log="y", xlab="time point", ylab="log10 proportion") 117 | title( main="Unconstrained\nproportion (log)", adj=0.5) 118 | points(prop.mat.u[,2], type="b", pch=21, col="gray") 119 | points(prop.mat.u[,num.one+10], type="b", pch=23, col="blue") 120 | points(prop.mat.u[,num.one+4], type="b", pch=24, col="red") 121 | ``` 122 | 123 | 124 | 125 | Figure 1 shows two idealized experiments with four different ways of looking at the exact same data: the top row shows the data as linear points, the bottom row shows the same data after a log transform. The constrained input shows the case where the total count of all nucleic acid species in the input is constrained to a constant sum. The unconstrained input shows the case where the total sum is not constrained to a constant sum. These are modelled as a time series, but any process would produce the same results, and in practice we will likely only be comparing the first and last points (before and after some intervention). These are shown in two different ways. The first is linear counts "input", the second is proportions. The bottome row 126 | 127 | Constrained datasets occur if the increase or decrease in any component is exactly compensated by the increase or decrease of one or more others. Here the total count remains constant across all experimental conditions. Examples of constrained datasets would include allele frequencies at a locus where the total has to be 1, and the RNA-seq where the induction of genes occurs in a steady- state cell culture. In this case, any process, such as sequencing that generates a proportion simply recapitulates the data with sampling error. The unspoken assumption in most high throughput experimental designs is that this assumption is true— it is not! 128 | 129 | An unconstrained dataset results if the total count is free to vary. Examples of unconstrained datasets would include ChIP-Seq, RNA-seq where we are examining two different conditions or cell populations, metagenomics, etc. 
Importantly, the total count in 16S rRNA gene sequencing experiments is almost always free to vary; that is, the total bacterial load is rarely constant in an environment. Thus, unconstrained data would be the predominant type of data expected.
130 |
131 | The proportion panels of Figure 1 (the second and fourth columns) show the result of random sampling to a defined maximum value in these two types of datasets. This random sampling reflects the data that result from high-throughput sequencing, where the total number of reads is constrained by the instrument capacity. The data are represented as proportions, but scale to parts per million or parts per billion without changing shape. Here we see that the shape of the data after sequencing is very similar to the input data in the constrained case, but is very different in the unconstrained case. In the unconstrained dataset, observe how the blue and red features appear to be constant over the first 10 time points, but then appear to decrease in abundance at later time points. Conversely, the black feature appears to increase linearly at early time points, but appears to become constant at late time points. Obviously, we would misinterpret what is happening if we compared early and late time points in the unconstrained dataset. It is also worth noting how the act of random sampling makes the proportional abundance of the rare OTU species uncertain in both the constrained and unconstrained data, but has little effect on the apparent relative abundance of OTUs with high counts.
132 |
133 |
-------------------------------------------------------------------------------- /coda_seq.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/coda_seq.pdf -------------------------------------------------------------------------------- /data/ERP004763_sample_mapping.tsv: --------------------------------------------------------------------------------
1 | RunAccession	Lane	Sample	BiolRep
2 | ERR458493	1	WT	1
3 | ERR458494	2	WT	1
4 | ERR458495	3	WT	1
5 | ERR458496	4	WT	1
6 | ERR458497	5	WT	1
7 | ERR458498	6	WT	1
8 | ERR458499	7	WT	1
9 | ERR458500	1	SNF2	1
10 | ERR458501	2	SNF2	1
11 | ERR458502	3	SNF2	1
12 | ERR458503	4	SNF2	1
13 | ERR458504	5	SNF2	1
14 | ERR458505	6	SNF2	1
15 | ERR458506	7	SNF2	1
16 | ERR458507	1	SNF2	2
17 | ERR458508	2	SNF2	2
18 | ERR458509	3	SNF2	2
19 | ERR458510	4	SNF2	2
20 | ERR458511	5	SNF2	2
21 | ERR458512	6	SNF2	2
22 | ERR458513	7	SNF2	2
23 | ERR458514	1	SNF2	3
24 | ERR458515	2	SNF2	3
25 | ERR458516	3	SNF2	3
26 | ERR458517	4	SNF2	3
27 | ERR458518	5	SNF2	3
28 | ERR458519	6	SNF2	3
29 | ERR458520	7	SNF2	3
30 | ERR458521	1	SNF2	4
31 | ERR458522	2	SNF2	4
32 | ERR458523	3	SNF2	4
33 | ERR458524	4	SNF2	4
34 | ERR458525	5	SNF2	4
35 | ERR458526	6	SNF2	4
36 | ERR458527	7	SNF2	4
37 | ERR458528	1	SNF2	5
38 | ERR458529	2	SNF2	5
39 | ERR458530	3	SNF2	5
40 | ERR458531	4	SNF2	5
41 | ERR458532	5	SNF2	5
42 | ERR458533	6	SNF2	5
43 | ERR458534	7	SNF2	5
44 | ERR458535	1	SNF2	6
45 | ERR458536	2	SNF2	6
46 | ERR458537	3	SNF2	6
47 | ERR458538	4	SNF2	6
48 | ERR458539	5	SNF2	6
49 | ERR458540	6	SNF2	6
50 | ERR458541	7	SNF2	6
51 | ERR458542	1	SNF2	7
52 | ERR458543	2	SNF2	7
53 | ERR458544	3	SNF2	7
54 | ERR458545	4	SNF2	7
55 | ERR458546	5	SNF2	7
56 | ERR458547	6	SNF2	7
57 | ERR458548	7	SNF2	7
58 | ERR458549	1	SNF2	8
59 | ERR458550	2	SNF2	8
60 | ERR458551	3	SNF2	8
61 | ERR458552	4	SNF2	8
62 | ERR458553 5 SNF2 8 63 | ERR458554 6 SNF2 8 64 | ERR458555 7 SNF2 8 65 | ERR458556 1 SNF2 9 66 | ERR458557 2 SNF2 9 67 | ERR458558 3 SNF2 9 68 | ERR458559 4 SNF2 9 69 | ERR458560 5 SNF2 9 70 | ERR458561 6 SNF2 9 71 | ERR458562 7 SNF2 9 72 | ERR458563 1 SNF2 10 73 | ERR458564 2 SNF2 10 74 | ERR458565 3 SNF2 10 75 | ERR458566 4 SNF2 10 76 | ERR458567 5 SNF2 10 77 | ERR458568 6 SNF2 10 78 | ERR458569 7 SNF2 10 79 | ERR458570 1 SNF2 11 80 | ERR458571 2 SNF2 11 81 | ERR458572 3 SNF2 11 82 | ERR458573 4 SNF2 11 83 | ERR458574 5 SNF2 11 84 | ERR458575 6 SNF2 11 85 | ERR458576 7 SNF2 11 86 | ERR458577 1 SNF2 12 87 | ERR458578 2 SNF2 12 88 | ERR458579 3 SNF2 12 89 | ERR458580 4 SNF2 12 90 | ERR458581 5 SNF2 12 91 | ERR458582 6 SNF2 12 92 | ERR458583 7 SNF2 12 93 | ERR458584 1 SNF2 13 94 | ERR458585 2 SNF2 13 95 | ERR458586 3 SNF2 13 96 | ERR458587 4 SNF2 13 97 | ERR458588 5 SNF2 13 98 | ERR458589 6 SNF2 13 99 | ERR458590 7 SNF2 13 100 | ERR458591 1 SNF2 14 101 | ERR458592 2 SNF2 14 102 | ERR458593 3 SNF2 14 103 | ERR458594 4 SNF2 14 104 | ERR458595 5 SNF2 14 105 | ERR458596 6 SNF2 14 106 | ERR458597 7 SNF2 14 107 | ERR458598 1 SNF2 15 108 | ERR458599 2 SNF2 15 109 | ERR458600 3 SNF2 15 110 | ERR458601 4 SNF2 15 111 | ERR458602 5 SNF2 15 112 | ERR458603 6 SNF2 15 113 | ERR458604 7 SNF2 15 114 | ERR458605 1 SNF2 16 115 | ERR458606 2 SNF2 16 116 | ERR458607 3 SNF2 16 117 | ERR458608 4 SNF2 16 118 | ERR458609 5 SNF2 16 119 | ERR458610 6 SNF2 16 120 | ERR458611 7 SNF2 16 121 | ERR458612 1 SNF2 17 122 | ERR458613 2 SNF2 17 123 | ERR458614 3 SNF2 17 124 | ERR458615 4 SNF2 17 125 | ERR458616 5 SNF2 17 126 | ERR458617 6 SNF2 17 127 | ERR458618 7 SNF2 17 128 | ERR458619 1 SNF2 18 129 | ERR458620 2 SNF2 18 130 | ERR458621 3 SNF2 18 131 | ERR458622 4 SNF2 18 132 | ERR458623 5 SNF2 18 133 | ERR458624 6 SNF2 18 134 | ERR458625 7 SNF2 18 135 | ERR458626 1 SNF2 19 136 | ERR458627 2 SNF2 19 137 | ERR458628 3 SNF2 19 138 | ERR458629 4 SNF2 19 139 | ERR458630 5 SNF2 19 140 | ERR458631 6 SNF2 19 141 | ERR458632 7 SNF2 19 142 | ERR458633 1 SNF2 20 143 | ERR458634 2 SNF2 20 144 | ERR458635 3 SNF2 20 145 | ERR458636 4 SNF2 20 146 | ERR458637 5 SNF2 20 147 | ERR458638 6 SNF2 20 148 | ERR458639 7 SNF2 20 149 | ERR458640 1 SNF2 21 150 | ERR458641 2 SNF2 21 151 | ERR458642 3 SNF2 21 152 | ERR458643 4 SNF2 21 153 | ERR458644 5 SNF2 21 154 | ERR458645 6 SNF2 21 155 | ERR458646 7 SNF2 21 156 | ERR458647 1 SNF2 22 157 | ERR458648 2 SNF2 22 158 | ERR458649 3 SNF2 22 159 | ERR458650 4 SNF2 22 160 | ERR458651 5 SNF2 22 161 | ERR458652 6 SNF2 22 162 | ERR458653 7 SNF2 22 163 | ERR458654 1 SNF2 23 164 | ERR458655 2 SNF2 23 165 | ERR458656 3 SNF2 23 166 | ERR458657 4 SNF2 23 167 | ERR458658 5 SNF2 23 168 | ERR458659 6 SNF2 23 169 | ERR458660 7 SNF2 23 170 | ERR458661 1 SNF2 24 171 | ERR458662 2 SNF2 24 172 | ERR458663 3 SNF2 24 173 | ERR458664 4 SNF2 24 174 | ERR458665 5 SNF2 24 175 | ERR458666 6 SNF2 24 176 | ERR458667 7 SNF2 24 177 | ERR458668 1 SNF2 25 178 | ERR458669 2 SNF2 25 179 | ERR458670 3 SNF2 25 180 | ERR458671 4 SNF2 25 181 | ERR458672 5 SNF2 25 182 | ERR458673 6 SNF2 25 183 | ERR458674 7 SNF2 25 184 | ERR458675 1 SNF2 26 185 | ERR458676 2 SNF2 26 186 | ERR458677 3 SNF2 26 187 | ERR458678 4 SNF2 26 188 | ERR458679 5 SNF2 26 189 | ERR458680 6 SNF2 26 190 | ERR458681 7 SNF2 26 191 | ERR458682 1 SNF2 27 192 | ERR458683 2 SNF2 27 193 | ERR458684 3 SNF2 27 194 | ERR458685 4 SNF2 27 195 | ERR458686 5 SNF2 27 196 | ERR458687 6 SNF2 27 197 | ERR458688 7 SNF2 27 198 | ERR458689 1 SNF2 28 199 | ERR458690 2 SNF2 28 200 | 
ERR458691 3 SNF2 28 201 | ERR458692 4 SNF2 28 202 | ERR458693 5 SNF2 28 203 | ERR458694 6 SNF2 28 204 | ERR458695 7 SNF2 28 205 | ERR458696 1 SNF2 29 206 | ERR458697 2 SNF2 29 207 | ERR458698 3 SNF2 29 208 | ERR458699 4 SNF2 29 209 | ERR458700 5 SNF2 29 210 | ERR458701 6 SNF2 29 211 | ERR458702 7 SNF2 29 212 | ERR458703 1 SNF2 30 213 | ERR458704 2 SNF2 30 214 | ERR458705 3 SNF2 30 215 | ERR458706 4 SNF2 30 216 | ERR458707 5 SNF2 30 217 | ERR458708 6 SNF2 30 218 | ERR458709 7 SNF2 30 219 | ERR458710 1 SNF2 31 220 | ERR458711 2 SNF2 31 221 | ERR458712 3 SNF2 31 222 | ERR458713 4 SNF2 31 223 | ERR458714 5 SNF2 31 224 | ERR458715 6 SNF2 31 225 | ERR458716 7 SNF2 31 226 | ERR458717 1 SNF2 32 227 | ERR458718 2 SNF2 32 228 | ERR458719 3 SNF2 32 229 | ERR458720 4 SNF2 32 230 | ERR458721 5 SNF2 32 231 | ERR458722 6 SNF2 32 232 | ERR458723 7 SNF2 32 233 | ERR458724 1 SNF2 33 234 | ERR458725 2 SNF2 33 235 | ERR458726 3 SNF2 33 236 | ERR458727 4 SNF2 33 237 | ERR458728 5 SNF2 33 238 | ERR458729 6 SNF2 33 239 | ERR458730 7 SNF2 33 240 | ERR458731 1 SNF2 34 241 | ERR458732 2 SNF2 34 242 | ERR458733 3 SNF2 34 243 | ERR458734 4 SNF2 34 244 | ERR458735 5 SNF2 34 245 | ERR458736 6 SNF2 34 246 | ERR458737 7 SNF2 34 247 | ERR458738 1 SNF2 35 248 | ERR458739 2 SNF2 35 249 | ERR458740 3 SNF2 35 250 | ERR458741 4 SNF2 35 251 | ERR458742 5 SNF2 35 252 | ERR458743 6 SNF2 35 253 | ERR458744 7 SNF2 35 254 | ERR458745 1 SNF2 36 255 | ERR458746 2 SNF2 36 256 | ERR458747 3 SNF2 36 257 | ERR458748 4 SNF2 36 258 | ERR458749 5 SNF2 36 259 | ERR458750 6 SNF2 36 260 | ERR458751 7 SNF2 36 261 | ERR458752 1 SNF2 37 262 | ERR458753 2 SNF2 37 263 | ERR458754 3 SNF2 37 264 | ERR458755 4 SNF2 37 265 | ERR458756 5 SNF2 37 266 | ERR458757 6 SNF2 37 267 | ERR458758 7 SNF2 37 268 | ERR458759 1 SNF2 38 269 | ERR458760 2 SNF2 38 270 | ERR458761 3 SNF2 38 271 | ERR458762 4 SNF2 38 272 | ERR458763 5 SNF2 38 273 | ERR458764 6 SNF2 38 274 | ERR458765 7 SNF2 38 275 | ERR458766 1 SNF2 39 276 | ERR458767 2 SNF2 39 277 | ERR458768 3 SNF2 39 278 | ERR458769 4 SNF2 39 279 | ERR458770 5 SNF2 39 280 | ERR458771 6 SNF2 39 281 | ERR458772 7 SNF2 39 282 | ERR458773 1 SNF2 40 283 | ERR458774 2 SNF2 40 284 | ERR458775 3 SNF2 40 285 | ERR458776 4 SNF2 40 286 | ERR458777 5 SNF2 40 287 | ERR458778 6 SNF2 40 288 | ERR458779 7 SNF2 40 289 | ERR458780 1 SNF2 41 290 | ERR458781 2 SNF2 41 291 | ERR458782 3 SNF2 41 292 | ERR458783 4 SNF2 41 293 | ERR458784 5 SNF2 41 294 | ERR458785 6 SNF2 41 295 | ERR458786 7 SNF2 41 296 | ERR458787 1 SNF2 42 297 | ERR458788 2 SNF2 42 298 | ERR458789 3 SNF2 42 299 | ERR458790 4 SNF2 42 300 | ERR458791 5 SNF2 42 301 | ERR458792 6 SNF2 42 302 | ERR458793 7 SNF2 42 303 | ERR458794 1 SNF2 43 304 | ERR458795 2 SNF2 43 305 | ERR458796 3 SNF2 43 306 | ERR458797 4 SNF2 43 307 | ERR458798 5 SNF2 43 308 | ERR458799 6 SNF2 43 309 | ERR458800 7 SNF2 43 310 | ERR458801 1 SNF2 44 311 | ERR458802 2 SNF2 44 312 | ERR458803 3 SNF2 44 313 | ERR458804 4 SNF2 44 314 | ERR458805 5 SNF2 44 315 | ERR458806 6 SNF2 44 316 | ERR458807 7 SNF2 44 317 | ERR458808 1 SNF2 45 318 | ERR458809 2 SNF2 45 319 | ERR458810 3 SNF2 45 320 | ERR458811 4 SNF2 45 321 | ERR458812 5 SNF2 45 322 | ERR458813 6 SNF2 45 323 | ERR458814 7 SNF2 45 324 | ERR458815 1 SNF2 46 325 | ERR458816 2 SNF2 46 326 | ERR458817 3 SNF2 46 327 | ERR458818 4 SNF2 46 328 | ERR458819 5 SNF2 46 329 | ERR458820 6 SNF2 46 330 | ERR458821 7 SNF2 46 331 | ERR458822 1 SNF2 47 332 | ERR458823 2 SNF2 47 333 | ERR458824 3 SNF2 47 334 | ERR458825 4 SNF2 47 335 | ERR458826 5 SNF2 47 336 | ERR458827 6 SNF2 
47 337 | ERR458828 7 SNF2 47 338 | ERR458829 1 SNF2 48 339 | ERR458830 2 SNF2 48 340 | ERR458831 3 SNF2 48 341 | ERR458832 4 SNF2 48 342 | ERR458833 5 SNF2 48 343 | ERR458834 6 SNF2 48 344 | ERR458835 7 SNF2 48 345 | ERR458878 1 WT 2 346 | ERR458879 2 WT 2 347 | ERR458880 3 WT 2 348 | ERR458881 4 WT 2 349 | ERR458882 5 WT 2 350 | ERR458883 6 WT 2 351 | ERR458884 7 WT 2 352 | ERR458885 1 WT 3 353 | ERR458886 2 WT 3 354 | ERR458887 3 WT 3 355 | ERR458888 4 WT 3 356 | ERR458889 5 WT 3 357 | ERR458890 6 WT 3 358 | ERR458891 7 WT 3 359 | ERR458892 1 WT 4 360 | ERR458893 2 WT 4 361 | ERR458894 3 WT 4 362 | ERR458895 4 WT 4 363 | ERR458896 5 WT 4 364 | ERR458897 6 WT 4 365 | ERR458898 7 WT 4 366 | ERR458899 1 WT 5 367 | ERR458900 2 WT 5 368 | ERR458901 3 WT 5 369 | ERR458902 4 WT 5 370 | ERR458903 5 WT 5 371 | ERR458904 6 WT 5 372 | ERR458905 7 WT 5 373 | ERR458906 1 WT 6 374 | ERR458907 2 WT 6 375 | ERR458908 3 WT 6 376 | ERR458909 4 WT 6 377 | ERR458910 5 WT 6 378 | ERR458911 6 WT 6 379 | ERR458912 7 WT 6 380 | ERR458913 1 WT 7 381 | ERR458914 2 WT 7 382 | ERR458915 3 WT 7 383 | ERR458916 4 WT 7 384 | ERR458917 5 WT 7 385 | ERR458918 6 WT 7 386 | ERR458919 7 WT 7 387 | ERR458920 1 WT 8 388 | ERR458921 2 WT 8 389 | ERR458922 3 WT 8 390 | ERR458923 4 WT 8 391 | ERR458924 5 WT 8 392 | ERR458925 6 WT 8 393 | ERR458926 7 WT 8 394 | ERR458927 1 WT 9 395 | ERR458928 2 WT 9 396 | ERR458929 3 WT 9 397 | ERR458930 4 WT 9 398 | ERR458931 5 WT 9 399 | ERR458932 6 WT 9 400 | ERR458933 7 WT 9 401 | ERR458934 1 WT 10 402 | ERR458935 2 WT 10 403 | ERR458936 3 WT 10 404 | ERR458937 4 WT 10 405 | ERR458938 5 WT 10 406 | ERR458939 6 WT 10 407 | ERR458940 7 WT 10 408 | ERR458941 1 WT 11 409 | ERR458942 2 WT 11 410 | ERR458943 3 WT 11 411 | ERR458944 4 WT 11 412 | ERR458945 5 WT 11 413 | ERR458946 6 WT 11 414 | ERR458947 7 WT 11 415 | ERR458948 1 WT 12 416 | ERR458949 2 WT 12 417 | ERR458950 3 WT 12 418 | ERR458951 4 WT 12 419 | ERR458952 5 WT 12 420 | ERR458953 6 WT 12 421 | ERR458954 7 WT 12 422 | ERR458955 1 WT 13 423 | ERR458956 2 WT 13 424 | ERR458957 3 WT 13 425 | ERR458958 4 WT 13 426 | ERR458959 5 WT 13 427 | ERR458960 6 WT 13 428 | ERR458961 7 WT 13 429 | ERR458962 1 WT 14 430 | ERR458963 2 WT 14 431 | ERR458964 3 WT 14 432 | ERR458965 4 WT 14 433 | ERR458966 5 WT 14 434 | ERR458967 6 WT 14 435 | ERR458968 7 WT 14 436 | ERR458969 1 WT 15 437 | ERR458970 2 WT 15 438 | ERR458971 3 WT 15 439 | ERR458972 4 WT 15 440 | ERR458973 5 WT 15 441 | ERR458974 6 WT 15 442 | ERR458975 7 WT 15 443 | ERR458976 1 WT 16 444 | ERR458977 2 WT 16 445 | ERR458978 3 WT 16 446 | ERR458979 4 WT 16 447 | ERR458980 5 WT 16 448 | ERR458981 6 WT 16 449 | ERR458982 7 WT 16 450 | ERR458983 1 WT 17 451 | ERR458984 2 WT 17 452 | ERR458985 3 WT 17 453 | ERR458986 4 WT 17 454 | ERR458987 5 WT 17 455 | ERR458988 6 WT 17 456 | ERR458989 7 WT 17 457 | ERR458990 1 WT 18 458 | ERR458991 2 WT 18 459 | ERR458992 3 WT 18 460 | ERR458993 4 WT 18 461 | ERR458994 5 WT 18 462 | ERR458995 6 WT 18 463 | ERR458996 7 WT 18 464 | ERR458997 1 WT 19 465 | ERR458998 2 WT 19 466 | ERR458999 3 WT 19 467 | ERR459000 4 WT 19 468 | ERR459001 5 WT 19 469 | ERR459002 6 WT 19 470 | ERR459003 7 WT 19 471 | ERR459004 1 WT 20 472 | ERR459005 2 WT 20 473 | ERR459006 3 WT 20 474 | ERR459007 4 WT 20 475 | ERR459008 5 WT 20 476 | ERR459009 6 WT 20 477 | ERR459010 7 WT 20 478 | ERR459011 1 WT 21 479 | ERR459012 2 WT 21 480 | ERR459013 3 WT 21 481 | ERR459014 4 WT 21 482 | ERR459015 5 WT 21 483 | ERR459016 6 WT 21 484 | ERR459017 7 WT 21 485 | ERR459018 1 WT 22 486 | ERR459019 
2 WT 22 487 | ERR459020 3 WT 22 488 | ERR459021 4 WT 22 489 | ERR459022 5 WT 22 490 | ERR459023 6 WT 22 491 | ERR459024 7 WT 22 492 | ERR459025 1 WT 23 493 | ERR459026 2 WT 23 494 | ERR459027 3 WT 23 495 | ERR459028 4 WT 23 496 | ERR459029 5 WT 23 497 | ERR459030 6 WT 23 498 | ERR459031 7 WT 23 499 | ERR459032 1 WT 24 500 | ERR459033 2 WT 24 501 | ERR459034 3 WT 24 502 | ERR459035 4 WT 24 503 | ERR459036 5 WT 24 504 | ERR459037 6 WT 24 505 | ERR459038 7 WT 24 506 | ERR459039 1 WT 25 507 | ERR459040 2 WT 25 508 | ERR459041 3 WT 25 509 | ERR459042 4 WT 25 510 | ERR459043 5 WT 25 511 | ERR459044 6 WT 25 512 | ERR459045 7 WT 25 513 | ERR459046 1 WT 26 514 | ERR459047 2 WT 26 515 | ERR459048 3 WT 26 516 | ERR459049 4 WT 26 517 | ERR459050 5 WT 26 518 | ERR459051 6 WT 26 519 | ERR459052 7 WT 26 520 | ERR459053 1 WT 27 521 | ERR459054 2 WT 27 522 | ERR459055 3 WT 27 523 | ERR459056 4 WT 27 524 | ERR459057 5 WT 27 525 | ERR459058 6 WT 27 526 | ERR459059 7 WT 27 527 | ERR459060 1 WT 28 528 | ERR459061 2 WT 28 529 | ERR459062 3 WT 28 530 | ERR459063 4 WT 28 531 | ERR459064 5 WT 28 532 | ERR459065 6 WT 28 533 | ERR459066 7 WT 28 534 | ERR459067 1 WT 29 535 | ERR459068 2 WT 29 536 | ERR459069 3 WT 29 537 | ERR459070 4 WT 29 538 | ERR459071 5 WT 29 539 | ERR459072 6 WT 29 540 | ERR459073 7 WT 29 541 | ERR459074 1 WT 30 542 | ERR459075 2 WT 30 543 | ERR459076 3 WT 30 544 | ERR459077 4 WT 30 545 | ERR459078 5 WT 30 546 | ERR459079 6 WT 30 547 | ERR459080 7 WT 30 548 | ERR459081 1 WT 31 549 | ERR459082 2 WT 31 550 | ERR459083 3 WT 31 551 | ERR459084 4 WT 31 552 | ERR459085 5 WT 31 553 | ERR459086 6 WT 31 554 | ERR459087 7 WT 31 555 | ERR459088 1 WT 32 556 | ERR459089 2 WT 32 557 | ERR459090 3 WT 32 558 | ERR459091 4 WT 32 559 | ERR459092 5 WT 32 560 | ERR459093 6 WT 32 561 | ERR459094 7 WT 32 562 | ERR459095 1 WT 33 563 | ERR459096 2 WT 33 564 | ERR459097 3 WT 33 565 | ERR459098 4 WT 33 566 | ERR459099 5 WT 33 567 | ERR459100 6 WT 33 568 | ERR459101 7 WT 33 569 | ERR459102 1 WT 34 570 | ERR459103 2 WT 34 571 | ERR459104 3 WT 34 572 | ERR459105 4 WT 34 573 | ERR459106 5 WT 34 574 | ERR459107 6 WT 34 575 | ERR459108 7 WT 34 576 | ERR459109 1 WT 35 577 | ERR459110 2 WT 35 578 | ERR459111 3 WT 35 579 | ERR459112 4 WT 35 580 | ERR459113 5 WT 35 581 | ERR459114 6 WT 35 582 | ERR459115 7 WT 35 583 | ERR459116 1 WT 36 584 | ERR459117 2 WT 36 585 | ERR459118 3 WT 36 586 | ERR459119 4 WT 36 587 | ERR459120 5 WT 36 588 | ERR459121 6 WT 36 589 | ERR459122 7 WT 36 590 | ERR459123 1 WT 37 591 | ERR459124 2 WT 37 592 | ERR459125 3 WT 37 593 | ERR459126 4 WT 37 594 | ERR459127 5 WT 37 595 | ERR459128 6 WT 37 596 | ERR459129 7 WT 37 597 | ERR459130 1 WT 38 598 | ERR459131 2 WT 38 599 | ERR459132 3 WT 38 600 | ERR459133 4 WT 38 601 | ERR459134 5 WT 38 602 | ERR459135 6 WT 38 603 | ERR459136 7 WT 38 604 | ERR459137 1 WT 39 605 | ERR459138 2 WT 39 606 | ERR459139 3 WT 39 607 | ERR459140 4 WT 39 608 | ERR459141 5 WT 39 609 | ERR459142 6 WT 39 610 | ERR459143 7 WT 39 611 | ERR459144 1 WT 40 612 | ERR459145 2 WT 40 613 | ERR459146 3 WT 40 614 | ERR459147 4 WT 40 615 | ERR459148 5 WT 40 616 | ERR459149 6 WT 40 617 | ERR459150 7 WT 40 618 | ERR459151 1 WT 41 619 | ERR459152 2 WT 41 620 | ERR459153 3 WT 41 621 | ERR459154 4 WT 41 622 | ERR459155 5 WT 41 623 | ERR459156 6 WT 41 624 | ERR459157 7 WT 41 625 | ERR459158 1 WT 42 626 | ERR459159 2 WT 42 627 | ERR459160 3 WT 42 628 | ERR459161 4 WT 42 629 | ERR459162 5 WT 42 630 | ERR459163 6 WT 42 631 | ERR459164 7 WT 42 632 | ERR459165 1 WT 43 633 | ERR459166 2 WT 43 634 | ERR459167 3 
WT	43
635 | ERR459168	4	WT	43
636 | ERR459169	5	WT	43
637 | ERR459170	6	WT	43
638 | ERR459171	7	WT	43
639 | ERR459172	1	WT	44
640 | ERR459173	2	WT	44
641 | ERR459174	3	WT	44
642 | ERR459175	4	WT	44
643 | ERR459176	5	WT	44
644 | ERR459177	6	WT	44
645 | ERR459178	7	WT	44
646 | ERR459179	1	WT	45
647 | ERR459180	2	WT	45
648 | ERR459181	3	WT	45
649 | ERR459182	4	WT	45
650 | ERR459183	5	WT	45
651 | ERR459184	6	WT	45
652 | ERR459185	7	WT	45
653 | ERR459186	1	WT	46
654 | ERR459187	2	WT	46
655 | ERR459188	3	WT	46
656 | ERR459189	4	WT	46
657 | ERR459190	5	WT	46
658 | ERR459191	6	WT	46
659 | ERR459192	7	WT	46
660 | ERR459193	1	WT	47
661 | ERR459194	2	WT	47
662 | ERR459195	3	WT	47
663 | ERR459196	4	WT	47
664 | ERR459197	5	WT	47
665 | ERR459198	6	WT	47
666 | ERR459199	7	WT	47
667 | ERR459200	1	WT	48
668 | ERR459201	2	WT	48
669 | ERR459202	3	WT	48
670 | ERR459203	4	WT	48
671 | ERR459204	5	WT	48
672 | ERR459205	6	WT	48
673 | ERR459206	7	WT	48
674 |
-------------------------------------------------------------------------------- /download.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/download.jpg -------------------------------------------------------------------------------- /effect.mbf.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/effect.mbf.pdf -------------------------------------------------------------------------------- /first_association.Rmd: --------------------------------------------------------------------------------
1 | ---
2 | title: "First association"
3 | author: "gg"
4 | date: '`r format(Sys.time(), "%d %B, %Y")`'
5 | bibliography: ~/Library/texmf/bibtex/bib/bibdesk_refs.bib
6 | fig_caption: true
7 | output:
8 |     pdf_document:
9 |         fig_caption: yes
10 | ---
11 |
12 | To run this file:
13 | Rscript -e "rmarkdown::render('first_association.Rmd')"
14 |
15 | ## Correlation
16 |
17 | Correlation is particularly problematic for compositional data because the abundance of one part affects the abundance of every other part. We can identify positively associated taxa by finding those taxa where the ratio between two or more taxa is approximately constant in all samples [@Lovell:2015]. Such groups of taxa can be said to be associated (or positively correlated). Negative associations are very problematic because they can arise from underlying biology (which we want to detect) or from the negative correlation bias (when one thing goes up, one or more others must go down!).
18 |
19 | Identifying correlated taxa is a very active and open research problem. Two methods have been proposed for datasets with particular sparsity profiles [@Friedman:2012; @Kurtz:2015], and one method has been proposed for general datasets [@Lovell:2015]. We will demonstrate $\phi$, the general method, and give one or two examples of how this can be useful when analyzing microbiome datasets. The $\phi$ metric has been modified to allow us to include taxa with 0 values by using the Monte-Carlo replicates of the data that were generated by ALDEx2.
21 |
22 | It is important to note that $\phi$ is a measure of the strength of association, and not a P value. In that way, it is similar to an effect size. You develop some intuition about how $\phi$ behaves, and choose a cutoff that you are comfortable with.
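For intuition: $\phi$ is built on the variance of the log-ratio between two features, so two features whose ratio is roughly constant across samples have $\phi$ near 0. A quick synthetic sketch (the variable names are illustrative only):

```{r phi_intuition, eval=FALSE}
set.seed(2023)
# two features held in a roughly constant ratio across 50 samples
x <- rlnorm(50, meanlog=2)
y <- 2 * x * rlnorm(50, meanlog=0, sdlog=0.05) # ~2x the first, small noise
var(log(x/y)) # the log-ratio variance, and hence phi, is near 0
```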
Guidance suggests that values less than 0.1 are appropriate for transcriptome datasets where the data are highly reproducible, and that 0.2-0.3 are more appropriate for the noisier microbiome datasets. 22 | 23 | This plots the connectivity graph for $\phi <= 0.3$, and includes additional ways to display the information from $\phi$. Displaying correlation graphs in a useful and intuitive way is a research topic by itself. Interested participants are encouraged to look at the igraph R package. 24 | 25 | 26 | 27 | ```{r, echo=FALSE,eval=TRUE,message=FALSE, warning=FALSE} 28 | library(ALDEx2) 29 | library(zCompositions) 30 | source("chunk/R/propr-functions.R") # phi association functions 31 | library(igraph) 32 | oldcat=cat 33 | cat <- function(...) {} 34 | 35 | ``` 36 | 37 | There is quite a bit of setup, since this is not in any easy-to-use R package as of yet. 38 | 39 | 40 | ```{r, echo=TRUE,eval=TRUE, results='as.is', fig.width=9, fig.height=9, error=FALSE, message=FALSE, warning=FALSE, fig.cap="The connectivity graph for phi <=0.29. In this dataset, for the most part, we have small clusters composed of taxa from the same genus. This could be because of inefficient clustering, or because of similar organisms in the same genus exploiting exactly the same resources. There are a few clusters composed of different genera, and these are most likely members of different genera exploiting resources similarly. "} 41 | 42 | # set a list of defined colours 43 | colours <- c("indianred1", "steelblue3", "skyblue1", "mediumorchid","royalblue4", "olivedrab3", 44 | "pink", "#FFED6F", "mediumorchid3", "ivory2", "tan1", "aquamarine3", "#C0C0C0", 45 | "mediumvioletred", "#999933", "#666699", "#CC9933", "#006666", "#3399FF", 46 | "#993300", "#CCCC99", "#666666", "#FFCC66", "#6699CC", "#663366", "#9999CC", "#CCCCCC", 47 | "#669999", "#CCCC66", "#CC6600", "#9999FF", "#0066CC", "#99CCCC", "#999999", "#FFCC00", 48 | "#009999", "#FF9900", "#999966", "#66CCCC", "#339966", "#CCCC33", "#EDEDED" 49 | ) 50 | 51 | # read in the dataset and associated taxonomy file 52 | d.subset <- read.table("data/ak_vs_op.txt", sep="\t", header=T, row.names=1) 53 | taxon.1 <- read.table("data/ak_vs_op_taxon.txt", sep="\t", header=T, row.names=1) 54 | 55 | # ALDEx2: generate Monte Carlo replicates of our clr data 56 | # use 16 samples of the data to keep this short(ish) 57 | x <- aldex.clr(d.subset, mc.samples=16, verbose=FALSE, useMC=TRUE) 58 | 59 | # propr: calculate the phi statistic. 60 | d.sma.df <- propr.aldex.phi(x) 61 | 62 | # choose a cutoff. In practice this can be very low for transcriptomes and higher 63 | # for microbiomes. 
Somewhere between 0.2 and 0.3 is a good place to start 64 | phi.cutoff <- 0.30 65 | 66 | # get the subset of OTUs that are joined by one or more low phi connections 67 | d.sma.lo.phi <- subset(d.sma.df, phi < phi.cutoff) 68 | 69 | # igraph: convert the connections into a graphical object 70 | # igraph is a full-featured graph analysis and display tool 71 | # the full use of which is well beyond what we can demonstrate here 72 | g <- graph.data.frame(d.sma.lo.phi, directed=FALSE) 73 | 74 | # igraph: group by clusters 75 | g.clust <- clusters(g) 76 | 77 | # make a table to examine the cluster membership by hand 78 | g.df <- data.frame(Systematic.name=V(g)$name, cluster=g.clust$membership, 79 | cluster.size=g.clust$csize[g.clust$membership]) 80 | 81 | # generate a set of clusters larger than some size 82 | # minimum is 2 (obviously) 83 | big <- g.df[which(g.df$cluster.size >= 2),] 84 | colnames(big) <- colnames(g.df) 85 | 86 | # igraph: rename the cluster members by their genus name 87 | V(g)$name <- as.vector(taxon.1[names(V(g)),"genus"]) 88 | 89 | # igraph: plot the connectivity graph 90 | plot(g, vertex.size=5, vertex.color=rgb(0,0,0,0.2), 91 | vertex.frame.color="white") 92 | ``` 93 | 94 | We next want to display the results on top of the compositional biplot to get some sense of how good the projection is, and to see if any of the low-$\phi$ clusters are distributed asymmetrically in the dataset. This can be done with a bit of fiddling. 95 | 96 | We first generate a biplot using scale=0 so that the axis values correspond to the loadings. We color the sample names in light grey, and do not plot the loading values. Then we overplot the clusters and give each cluster a unique color from the list of colours. For this plot, we will include only clusters with more than 5 members, to keep it simple. 97 | 98 | 99 | 100 | ```{r, echo=TRUE,eval=TRUE, results='as.is', fig.width=9, fig.height=9, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Inspection of the location of the large clusters on the biplot shows that taxa with very similar standard deviations on PC1 are clustered, and that the projection of the data is not particularly good. The Haemophilus cluster, in particular, has a large spread on component 2, but all have a similar distance from the origin on PC1. However, there are two clusters that appear to separate the groups strongly. One, associated with the ak group, is composed of Streptococcus; the other, associated with the op group, contains Streptococcus and Actinomyces. Some instances will have Rothia also associated with the latter group.
This is a consequence of examining random instances of the data and indicates that the Rothia exhibit a marginal score."} 101 | 102 | big <- g.df[which(g.df$cluster.size >= 5),] 103 | 104 | # zCompositions: make a pcx object as we did for the biplot 105 | d.n0 <- cmultRepl(t(d.subset), label=0, method="CZM") 106 | 107 | d.clr <- t(apply(d.n0, 1, function(x) log(x) - mean(log(x)) ) ) 108 | pcx.d <- prcomp(d.clr) 109 | 110 | biplot(pcx.d, var.axes=F, cex=c(0.4,0.01), scale=0, 111 | col=c(rgb(0,0,0,0.2), "red"), 112 | xlab=paste("PC1: ", round(sum(pcx.d$sdev[1]^2)/sum(pcx.d$sdev^2), 3), sep=""), 113 | ylab=paste("PC2: ", round(sum(pcx.d$sdev[2]^2)/sum(pcx.d$sdev^2),3),sep="") 114 | ) 115 | 116 | #abline(v=0, lty=2, col="grey") 117 | #abline(h=0, lty=2, col="grey") 118 | 119 | lev <- factor(big$cluster) 120 | for(i in as.numeric(levels(lev))){ 121 | nms <- rownames(big)[big$cluster==i] 122 | 123 | text(pcx.d$rotation[nms,][,1], pcx.d$rotation[nms,][,2], 124 | labels = taxon.1[rownames(big)[big$cluster==i],"genus"],col=colours[i], cex=1) 125 | } 126 | 127 | ``` 128 | 129 | \newpage 130 | 131 | 132 | ```{r, echo=TRUE,eval=TRUE, results='as.is', fig.width=9, fig.height=9, error=FALSE, message=FALSE, warning=FALSE, fig.cap="Separation of groups based on ratios of sets of associated taxa. "} 133 | 134 | clop <- c("5425","34862","27937","34896","11687","36010","32985","35898","35936","36038","11","3760") 135 | 136 | clopc <- c("26066", "28841","22379", "6107","21779") 137 | 138 | clak <- c("38304","39078","39235","39306","39171","39227","37976","38334","38833") 139 | 140 | mean.op <- apply(d.clr[,clop], 1, mean) 141 | mean.ak <- apply(d.clr[,clak], 1, mean) 142 | nms <- rownames(d.clr) 143 | 144 | hist(mean.op[grep("ak", nms)] - mean.ak[grep("ak", nms)], col=rgb(0,0,1,0.4), xlim=c(-8,8), ylim=c(0,80), xlab=("ratio Grp A : Grp B"), main="Ratio") 145 | hist(mean.op[grep("op", nms)] - mean.ak[grep("op", nms)], col=rgb(1,0,0,0.4), add=T) 146 | 147 | 148 | ``` 149 | 150 | \newpage 151 | 152 | #References 153 | -------------------------------------------------------------------------------- /first_association.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/first_association.pdf -------------------------------------------------------------------------------- /make_interpret_biplot.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Simple biplot" 3 | author: "gg" 4 | date: '`r format(Sys.time(), "%d %B, %Y")`' 5 | bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib 6 | fig_caption: true 7 | output: 8 | pdf_document: 9 | fig_caption: yes 10 | --- 11 | 12 | To run this file: 13 | Rscript -e "rmarkdown::render('make_interpret_biplot.Rmd')" 14 | # The compositional biplot 15 | 16 | When analyzing and interpreting compositional data, it is important to remember that we are examining the variance in the ratios of the underlying data, and not directly examining abundance. The first tool that we will use is the compositional biplot. This is generated by the following set of steps, condensed into the code sketch below: 17 | 18 | 1. remove essential 0 values (0s that occur in all samples, i.e. nondetects) 19 | 2. perform any additional filtering (sparsity, minimal abundance, minimum sample count, etc.) 20 | 3. adjust remaining 0 values with the zCompositions package 21 | 4. perform the clr transform on the data 22 | 5. conduct a singular value decomposition using prcomp 23 | 6. display the results in a principal component plot
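As a compact sketch, the six steps above look roughly like this (assuming a count table `d` with samples in rows; the filtering of steps 1-2 is omitted, and `cmultRepl` is the same zCompositions call used below):

```{r, eval=FALSE}
library(zCompositions)
# 3. replace the remaining 0 values with a count zero multiplicative estimate
d.n0 <- cmultRepl(d, label=0, method="CZM")
# 4. clr transform: log of each value minus the mean log of the sample
d.clr <- t(apply(d.n0, 1, function(x) log(x) - mean(log(x))))
# 5. singular value decomposition
pcx <- prcomp(d.clr)
# 6. display as a compositional biplot
biplot(pcx, var.axes=FALSE, scale=0)
```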
24 | 25 | Let us see how this works in principle. We will make a sample dataset that has 30 samples and only eight features. Samples will be in two groups. The first group of 20 will differ from the last group of 10. Feature A will be more abundant in the first 20 samples, and less abundant in the last 10; features C and D will be the opposite, with feature D having a greater difference between groups than feature C. Features B and E to H will be highly variable, but the variation will be associated with samples randomly. For simplicity we will use a random-uniform distribution, but any other distribution would work as well. 26 | 27 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, message=FALSE,fig.cap="Counts were generated randomly for eight features labeled a_h and were free to range from the values indicated for each feature. Thirty samples were generated, and features A, C and D were differentially abundant in the first twenty and last 10 samples. The others were of widely different abundances, but with totally random change between samples. Feature D is a slightly randomized version of feature C that is scaled roughly 1000 fold. From this we can see that the absolute abundance is not represented in the biplot."} 28 | 29 | set.seed(7) 30 | # runif(n, min, max) 31 | A <- c(runif(20, 15, 30), runif(10,5,10)) 32 | B <- runif(30, 1, 50) 33 | C <- c(runif(20, 5,10), runif(10,15,30)) 34 | D <- C * runif(30, 1000, 1250) 35 | # D <- c(runif(20, 10, 20), runif(10,40,80)) 36 | E <- runif(30, 1, 50) 37 | F <- runif(30, 1, 2) 38 | G <- runif(30, 1000, 1200) 39 | H <- runif(30, 1000, 2000) 40 | 41 | a_h <- cbind(A,B,C,D,E,F,G,H) 42 | a_h[a_h<0] <- 0.00001 43 | 44 | # convert to proportions 45 | a_h.prop <- t(apply(a_h, 1, function(x){x/sum(x)})) 46 | 47 | # clr transform 48 | a_h.clr <- t(apply(a_h.prop, 1, function(x){log(x) - mean(log(x))})) 49 | 50 | # extract the principal components and loadings 51 | a_h.pcx <- prcomp(a_h.clr) 52 | 53 | # make a compositional biplot 54 | #biplot(a_h.pcx, var.axes=TRUE, scale=0) # form 55 | biplot(a_h.pcx, var.axes=TRUE, scale=0, 56 | xlab=paste("PC1", round(a_h.pcx$sdev[1]^2 / sum(a_h.pcx$sdev^2),3), sep=": "), 57 | ylab=paste("PC2", round(a_h.pcx$sdev[2]^2 / sum(a_h.pcx$sdev^2),3), sep=": ") 58 | ) # covariance 59 | 60 | ``` 61 | 62 | Principal component plots display the projection (shadow) of your multi-dimensional data onto a lower number of dimensions, and here the maximum number of dimensions possible is seven. The compositional biplot displays the results of your experiment in a semi-quantitative way. 63 | 64 | The first dimension is the one that displays the largest amount of variation in the data: for example, if your data had four features, it would have 3 dimensions (like a rugby ball) and the first dimension would represent the long axis. 65 | 66 | The principal components are arranged in decreasing order of the variance explained. If you have a strong effect driven by a single experimental feature, then you will have a large amount of variation explained on the first axis, and a much smaller amount on the second. 67 | 68 | From this plot we can determine the following: 69 | 70 | 1. The first two principal components explain about 80\% of the variance in the dataset. This is very good. The greater the variance explained, the greater the confidence one can have in the projection.
If we thought of this as a shadow of the data, then most features and samples would have fairly distinct locations. 71 | 2. The samples (in black) partition into two groups: samples 1-20 on the left and samples 21-30 on the right, with a clear separation between them. They are separated on PC1, which indicates a lack of confounding effects on the split between samples. 72 | 3. The length and direction of the arrows (feature locations) are proportional to the standard deviation of the feature in the dataset. So we can see that feature A is highly variable along the same direction as are samples 1-20. We can interpret this as meaning that feature A is relatively more abundant in these samples than in samples 21-30. Likewise for the other features. 73 | 4. Features C and D are very close together; this is referred to as having a short link. The length of a link is proportional to the variance of the ratio between the two features. In other words, the ratio of these two features will be fairly constant across samples; put another way, these features have a high compositional association. These are the types of relationships we aim to identify with the propr package later on. 74 | 5. We would expect that features A, C and D are the most variable between samples, and this is the type of result we would test with ALDEx2 later on. 75 | 76 | 77 | 78 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, warning=FALSE, message=FALSE,fig.cap="The advantage of compositions is that the results are largely invariant to subsetting. That is, we get essentially the same answer with all (A-H), and with a subset of the data (A-G). We also see that in these data, the variance of the clr-transformed values is much more informative than the absolute or proportional values, and better represents the known structure of the data.
"} 79 | 80 | a_h <- cbind(A,B,C,D,E,F,G,H) 81 | a_h.prop <- t(apply(a_h, 1, function(x){x/sum(x)})) 82 | a_h.clr <- t(apply(a_h.prop, 1, function(x){log(x) - mean(log(x))})) 83 | a_h.pcx <- prcomp(a_h.clr) 84 | a_h.p.pcx <- prcomp(a_h.prop) 85 | a_h.c.pcx <- prcomp(a_h) 86 | 87 | a_g <- cbind(A,B,C,D,E,F,G) 88 | a_g.prop <- t(apply(a_g, 1, function(x){x/sum(x)})) 89 | a_g.clr <- t(apply(a_g.prop, 1, function(x){log(x) - mean(log(x))})) 90 | a_g.pcx <- prcomp(a_g.clr) 91 | a_g.p.pcx <- prcomp(a_g.prop) 92 | a_g.c.pcx <- prcomp(a_g) 93 | 94 | a_h.z <- apply(a_h, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 95 | a_h.z.prop <- apply(a_h.prop, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 96 | a_h.z.prop.pcx <- prcomp(a_h.z.prop) 97 | a_h.z.pcx <- prcomp(a_h.z) 98 | 99 | a_g.z <- apply(a_g, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 100 | a_g.z.prop <- apply(a_g.prop, 2, function(x){ (x-mean(x))/sqrt(var(x)) }) 101 | a_g.z.prop.pcx <- prcomp(a_g.z.prop) 102 | a_g.z.pcx <- prcomp(a_g.z) 103 | 104 | par(mfrow=c(2,3)) 105 | #biplot(a_h.pcx, var.axes=TRUE, scale=0) # form 106 | biplot(a_h.pcx, var.axes=TRUE, scale=0, 107 | xlab=paste("PC1", round(a_h.pcx$sdev[1]^2 / sum(a_h.pcx$sdev^2),3), sep=": "), 108 | ylab=paste("PC2", round(a_h.pcx$sdev[2]^2 / sum(a_h.pcx$sdev^2),3), sep=": ") 109 | , main="A-H clr" 110 | ) # covariance 111 | 112 | biplot(a_h.p.pcx, var.axes=TRUE, scale=0, 113 | xlab=paste("PC1", round(a_h.p.pcx$sdev[1]^2 / sum(a_h.p.pcx$sdev^2),3), sep=": "), 114 | ylab=paste("PC2", round(a_h.p.pcx$sdev[2]^2 / sum(a_h.p.pcx$sdev^2),3), sep=": ") 115 | , main="A-H prop" 116 | ) # covariance 117 | 118 | biplot(a_h.c.pcx, var.axes=TRUE, scale=0, 119 | xlab=paste("PC1", round(a_h.c.pcx$sdev[1]^2 / sum(a_h.c.pcx$sdev^2),3), sep=": "), 120 | ylab=paste("PC2", round(a_h.c.pcx$sdev[2]^2 / sum(a_h.c.pcx$sdev^2),3), sep=": ") 121 | , main="A-H count" 122 | ) # covariance 123 | 124 | biplot(a_g.pcx, var.axes=TRUE, scale=0, 125 | xlab=paste("PC1", round(a_g.pcx$sdev[1]^2 / sum(a_g.pcx$sdev^2),3), sep=": "), 126 | ylab=paste("PC2", round(a_g.pcx$sdev[2]^2 / sum(a_g.pcx$sdev^2),3), sep=": ") 127 | , main="A-G clr" 128 | ) # covariance 129 | 130 | biplot(a_g.p.pcx, var.axes=TRUE, scale=0, 131 | xlab=paste("PC1", round(a_g.p.pcx$sdev[1]^2 / sum(a_g.p.pcx$sdev^2),3), sep=": "), 132 | ylab=paste("PC2", round(a_g.p.pcx$sdev[2]^2 / sum(a_g.p.pcx$sdev^2),3), sep=": ") 133 | , main="A-G prop" 134 | ) # covariance 135 | 136 | biplot(a_g.c.pcx, var.axes=TRUE, scale=0, 137 | xlab=paste("PC1", round(a_g.c.pcx$sdev[1]^2 / sum(a_g.c.pcx$sdev^2),3), sep=": "), 138 | ylab=paste("PC2", round(a_g.c.pcx$sdev[2]^2 / sum(a_g.c.pcx$sdev^2),3), sep=": ") 139 | , main="A-G count" 140 | ) # covariance 141 | 142 | ``` 143 | -------------------------------------------------------------------------------- /make_interpret_biplot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/make_interpret_biplot.pdf -------------------------------------------------------------------------------- /multi_comp.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The data we have: multivariate compositional data" 3 | author: "gg" 4 | date: '`r format(Sys.time(), "%d %B, %Y")`' 5 | bibliography: ~/Library/texmf/bibtex/bib/bibdesk_refs.bib 6 | fig_caption: true 7 | output: 8 | pdf_document: 9 | fig_caption: yes 10 | --- 11 | 12 | To run this file: 13 | Rscript 
-e "rmarkdown::render('multi_comp.Rmd')" 14 | 15 | #Part 1: relative abundance data 16 | 17 | Whenever we analyze microbiome data, we analyze the data as `relative abundance' data, with or without normalization. What are relative abundance data? 18 | 19 | Relative abundance data are data where the absolute count of something has no meaning. Instead, the only thing we can assess are the relationships between datapoints. That is, their ratios. This is compositional data. 20 | 21 | When analyzing compositional data, or CoDa, it is customary (as we shall see) to log-transform the data in some way, and log-transformed data require a different way of thinking. In linear space, a ratio is $x/y$, while in log space a ratio is $\log(x) - \log(y)$. So whenever we are working with log-ratios, what we are really saying is that we are working with the differences between log-transformed datapoints, and that these differences are equivalent to ratios in linear space. 22 | 23 | A second issue is that of correlation of log-transformed data [@Lovell:2015]. In linear space, perfectly correlated data will follow the familiar formula $y = m \times x + b$, where y and x are variables, m is the slope of the line and b is the intercept. However, after log transformation the multiplier m sets the intercept of the line ($\log(m)$), and a non-zero b is no longer a linear parameter: it introduces curvature. Consider the simple case of plotting y as a function of x from 1:10, with b set to 0, or b set to 2. 24 | 25 | Here we can see that in the case of the intercept being 0, the data are linearly related in both the normal and logarithmic spaces. In normal space, the intercept is 0, and the slope of the line changes. However, the two parts are always in constant ratio. In log space the intercept of the line changes, but the slope stays at 1. 26 | 27 | In the case of the intercept not being 0, in the normal space the lines appear to be perfectly correlated, but in log space we can see that the lines are now curved because of the effect of a non-0 intercept. This curvature indicates that the ratios between x and y are changing, and so x and y are not associated in logarithmic space. 28 | 29 | This problem with correlation is not corrected by using non-parametric correlation measures! It is also worth noting that the recently proposed DESeq count normalization [@McMurdie:2014a] is simply a linear transform of the relative abundance values: i.e., if the count being normalized to is $1 \times 10^6$, then this is simply relative abundance $\times 10^4$.
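A quick numeric check of the point about non-parametric measures (a toy sketch, separate from the plotting code below): a non-zero intercept leaves the rank correlation perfect, yet the ratio, the compositionally meaningful quantity, still drifts.

```{r}
x <- 1:10
y <- 2 * x + 2
cor(x, y, method="spearman") # 1: a non-parametric measure sees no problem
round(y / x, 2)              # but the ratio y/x is not constant
var(log(y / x))              # so the log-ratio variance is non-zero
```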
30 | 31 | 32 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, message=FALSE,fig.cap="Plots to show the effect of log-transformation on simple linear lines. The grey dashed lines in the log space plots are lines of constant ratio."} 33 | y <- function(m,x,b){m * x + b} 34 | 35 | par(mfrow=c(2,2)) 36 | x <- seq(1:10) 37 | b <- 0 38 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(0,10), ylim=c(0,20), main="normal space") 39 | points(x, y(2,x,b), col="blue", pch=19, type="b") 40 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 41 | text(1.5,15, labels="y=2x", col="blue") 42 | text(1.5,12.5, labels="y=x", col="red") 43 | text(1.5,10, labels="y=0.5x", col="orange") 44 | 45 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(1,10), ylim=c(0.5,20), log="xy", main="log space") 46 | points(x, y(2,x,b), col="blue", pch=19, type="b") 47 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 48 | text(1.5,18, labels="y=2x", col="blue") 49 | text(1.5,12, labels="y=x", col="red") 50 | text(1.5,8, labels="y=0.5x", col="orange") 51 | abline(0.1,1, lty=2, col=rgb(0,0,0,0.2)) 52 | 53 | b <- 2 54 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(0,10), ylim=c(0.5,20), main="normal space") 55 | points(x, y(2,x,b), col="blue", pch=19, type="b") 56 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 57 | text(1.5,15, labels="y=2x + 2", col="blue") 58 | text(1.5,12.5, labels="y=x + 2", col="red") 59 | text(1.5,10, labels="y=0.5x + 2", col="orange") 60 | 61 | plot(x, y(1,x,b), col="red", pch=19, type="b", xlim=c(1,10), ylim=c(2,20), log="xy", main="log space") 62 | points(x, y(2,x,b), col="blue", pch=19, type="b") 63 | points(x, y(0.5,x,b), col="orange", pch=19, type="b") 64 | text(1.5,18, labels="y=2x + 2", col="blue") 65 | text(1.5,12, labels="y=x + 2", col="red") 66 | text(1.5,8, labels="y=0.5x + 2", col="orange") 67 | abline(0.5,0.6, lty=2, col=rgb(0,0,0,0.2)) 68 | 69 | ``` 70 | 71 | \newpage 72 | 73 | 74 | #PART 2: multivariate compositional data 75 | 76 | Univariate data have one variable per sample. This is the typical kind of data generated by biologists. For example, you measure the height of people in a room, divide them into male and female and compare. In this case, we have one variable (height) measured for two groups (male and female). These data are unconstrained, within the bounds of human height, and are in general what traditional statistical tools were developed for. 77 | 78 | Multivariate data have more than one variable per sample. Multivariate data can be independent or dependent. If the variables are truly independent (randomly chosen, not linked) then each may be treated as univariate. For example, if we measure height, hair color and handedness of people, then each person is an observation that holds three variables. We can assume that these are (relatively) independent, and all statistical tests, univariate and multivariate, should be valid and, importantly, multiple test corrections would be valid. 79 | 80 | However, if the variables are dependent, then we have many unappreciated problems. Unfortunately, the typical high-throughput sequencing dataset is multivariate and highly dependent [@gloor2016s;@Gloor:2016cjm]. This dependency is forced upon the data by the sequencing instrument itself. 81 | 82 | We acknowledge this dependence when we call the data `relative abundance' data. What does this really mean? 83 | 84 | Let us set up a thought experiment. Let us imagine we have a very simple dataset composed of 100 samples and three taxa. Note that everything about this example generalizes to datasets with more samples and more taxa; it is just difficult to show this with an n-dimensional graphic.
In this example, we will generate counts for 100 samples from a single experiment where the taxa abundance can range anywhere between 1 and an arbitrary number of counts. We will assume that we are actually counting the number of molecules belonging to each taxon, and that there is no practical upper limit on the number of molecules. 85 | 86 | We will save the special case of 0 for our practical demonstration. Now we have 100 multivariate observations with three variables. 87 | 88 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=3, message=FALSE,fig.cap="Counts were generated randomly for three taxa labeled a, b and c, and were free to range from 1 to 1000 for a, up to 2000 for b, and up to 3000 for c. One hundred samples were generated. The correlation coefficient of each taxon vs. each other taxon is shown above the plots."} 89 | library(compositions) 90 | library(scatterplot3d) 91 | 92 | a <- round(runif(100, min=1, max=1000)) 93 | b <- round(runif(100, min=1, max=2000)) 94 | c <- round(runif(100, min=1, max=3000)) 95 | abc <- rbind(a,b,c) 96 | tot=1 97 | abc.comp <- apply(abc, 2, function(x){x/sum(x) * tot}) 98 | abc.acomp <- acomp(abc, tot=1) 99 | 100 | par(mfrow=c(1,3)) 101 | plot(a,b, main=round(cor(a,b),3)) 102 | plot(a,c, main=round(cor(a,c),3)) 103 | plot(b,c, main=round(cor(b,c),3)) 104 | ``` 105 | 106 | The plots show that the variables in each of the samples are randomly placed. The data in each pair of taxa are essentially uncorrelated, as we expect for randomly generated data where each point is absolutely independent of every other: in other words, this is the best-case scenario, and one that would rarely be seen in a biological context. We shall now constrain the data to a constant sum and see how this affects the shape of the data. 107 | 108 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=3, message=FALSE,fig.cap="The same data as above were constrained to an arbitrary constant sum: in this case, the data were converted to proportions. Any arbitrary total (percentage, ppm) has the same effect. The correlation coefficient of each taxon vs. each other taxon is shown above the plots. We can see very clearly that the datapoints are no longer independent."} 109 | par(mfrow=c(1,3)) 110 | plot(abc.comp[1,], abc.comp[2,], main=round(cor(abc.comp[1,],abc.comp[2,]),3)) 111 | plot(abc.comp[1,], abc.comp[3,], main=round(cor(abc.comp[1,],abc.comp[3,]),3)) 112 | plot(abc.comp[2,], abc.comp[3,], main=round(cor(abc.comp[2,],abc.comp[3,]),3)) 113 | ``` 114 | 115 | The simple act of converting each count to a proportion markedly skews the data. Here we can see that the data now appear much more correlated and constrained. While this is a simple example with three taxa, it is generalizable to any number of samples, and to any number of variables (taxa) in a sample. 116 | 117 | Why is this so? 118 | 119 | ```{r, echo=FALSE, results='show', fig.width=8, fig.height=3, message=FALSE,fig.cap="Plotting these data in three dimensions shows the reason for the non-independence. The unconstrained data are on the left, the constrained data are in the middle, and the right side shows the data as if we were looking directly down onto the plane of data seen in the middle box.
Data points are coloured from red to black to show their place on the b axis."} 120 | par(mfrow=c(1,3)) 121 | scatterplot3d(t(abc), scale=1.5, highlight.3d=T) 122 | scatterplot3d(t(abc.comp), xlim=c(0,tot), ylim=c(0,tot), zlim=c(0,tot), scale=1.5, highlight.3d=T) 123 | plot(t(abc.acomp), pch=19, cex=0.5, col=rgb(0,0,0,0.5)) 124 | 125 | ``` 126 | 127 | The essential problem is one of geometry. When the data are unconstrained, as shown in the box on the left, the data points are scattered uniformly within the box. This is a visualization of multivariate data that are unconstrained and uncorrelated. Essentially, in data of this type, knowing information about one datapoint gives no information about any other datapoint. 128 | 129 | When the data are constrained to a constant sum as shown in the middle box, then the datapoints collapse to a flat plane within the box, with limits at the corners. This plane is called a simplex, and occurs because if we know information about all points but one, then we know the information about the last point as well. For example, if the data must sum to 1, then knowing that the previous 10 points have a sum of 0.9, we know that the last point must have a value of 0.1. Thus, we can see intuitively that the data are constrained. We can re-orient our view to look directly perpendicular to the simplex plane and observe how the three dimensions of the data map onto the two-dimensional space on the right. 130 | 131 | The constrained data clearly have a very different geometry than the unconstrained data points. Moving from place to place in the box on the left is an additive process, since the difference between points is a linear distance. In the simplex, moving from point to point is a multiplicative process, and we think of differential taxon abundance in terms of being multiples of the original abundance. 132 | 133 | In order to analyze such data, we must place ourselves on the simplex, or modify the data to move it from the constrained simplex to an unconstrained space. Thus, we need to recast our thinking into a compositional data analysis (CoDa) way of thinking.
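The multiplicative nature of movement on the simplex can be seen in a few lines. This is a minimal sketch (toy proportions, with a hypothetical `clo` helper for the closure operation): perturbing a single taxon and re-closing changes every component, and it changes them by multiples, not by differences.

```{r}
clo <- function(x) x / sum(x)   # closure: rescale a vector to proportions
p1 <- clo(c(10, 20, 70))
p2 <- clo(p1 * c(2, 1, 1))      # 'double' taxon 1, then re-close
p2 / p1                         # every component has moved, multiplicatively
```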
134 | 135 | #References and Further reading 136 | 137 | -------------------------------------------------------------------------------- /multi_comp.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/multi_comp.pdf -------------------------------------------------------------------------------- /prop_and_diffprop_with_propr.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Proportionality and differential proportionality using the propr R package" 3 | author: "Ionas Erb" 4 | date: "`r Sys.Date()`" 5 | output: html_document 6 | --- 7 | 8 | ## Introduction 9 | 10 | In this document, we will introduce two concepts of Compositional Data Analysis that are useful for RNA-seq analysis: Proportionality and differential proportionality. While the former can be seen as a substitute for correlation, the latter can be understood as a differential expression of gene ratios. Both try to circumvent the problems introduced by the constant-sum constraint imposed on the data by the sequencing procedure without making normalization assumptions. As a worked example, we are using the SNF2-knockout yeast data set from the Barton group (Gierlinski et al. 2015). We can load it by 11 | ```{r} 12 | ma=read.table("barton_agg.tsv",row.names=1,header=TRUE) 13 | dim(ma) 14 | ``` 15 | The rows of the matrix are $N=96$ observations of the 6349 genes in the columns. The observations fall in two groups, the SNF2-knockout and the wildtype. Although we will need it only in the final section of this document, let us define the group variable already here: 16 | ```{r} 17 | gr=c(rep(0,48),rep(1,48)) 18 | ``` 19 | Let us now remove some genes that have on average less than one count per sample. 20 | ```{r} 21 | M=ma[,which(apply(ma,2,sum)>95)] 22 | dim(M) 23 | ``` 24 | This leaves us with $D=5958$ genes, some of which will have zero entries somewhere. Note that the statistics that we present below can be modified to explicitly deal with zeros (and much of this is already implemented in propr), but we will not have time to cover this topic here. 25 | 26 | ## Proportionality 27 | 28 | Correlation between variables cannot be defined consistently on compositional data, but positive correlation can be quantified using the log-ratio variance (LRV), see (Aitchison 1986). 29 | $$\textrm{LRV}(x,y) = \textrm{var}\left(\log\frac{x_1}{y_1},\dots,\log\frac{x_N}{y_N}\right)$$ 30 | 31 | Since ratios are taken within each single sample, any (normalization) factor constraining the sample to a given sum will cancel out. The log is needed to make the sample-wise ratios symmetric with their reciprocal values. Now the variance of the sample-wise log-ratios will be close to zero for variables changing in about the same way across the samples. Such genes $x$, $y$ are *proportional*, i.e. $y = \textrm{const}\cdot x$, because a vanishing LRV implies $y/x = \textrm{const}$. It is equivalent to saying that the genes have a linear relationship without $y$-intercept. Let us look at two arbitrary genes now: 32 | ```{r} 33 | plot(M[,1],M[,2],xlim=c(0,range(M[,1])[2]),ylim=c(0,range(M[,2])[2]),xlab=colnames(M)[1],ylab=colnames(M)[2]) 34 | ``` 35 | 36 | Scatter plots like this one are intrinsically problematic for compositional data. This is because they assume that the values of the variables displayed on the axes have an absolute meaning. Compositional variables are parts of a whole and should be understood in relationship to other variables only. This implies that the total sum to which each sample adds up should not play a role in the analysis. To make this clearer, let us multiply each sample by some random prefactors between 0 and 1: 37 | ```{r} 38 | set.seed(123) 39 | fac=sample(seq(0.01,1,0.01),dim(M)[1]) 40 | plot(fac*M[,1],fac*M[,2],xlim=c(0,range(fac*M[,1])[2]),ylim=c(0,range(fac*M[,2])[2]),xlab=colnames(M)[1],ylab=colnames(M)[2]) 41 | ``` 42 | 43 | As we can see already in the plots, the "crude" correlation between the genes changes. The Pearson correlation increases a lot. 44 | ```{r} 45 | cor(M[,1],M[,2]) 46 | cor(fac*M[,1],fac*M[,2]) 47 | ``` 48 | 49 | Now compare with this ratio representation and the unchanged variance of the log-ratio: 50 | ```{r} 51 | plot(c(1:96),M[,1]/M[,2],log="y",xlab="sample",ylab=paste(colnames(M)[1],colnames(M)[2],sep="/")) 52 | var(log(M[,1]/M[,2])) 53 | var((log(fac*M[,1]/(fac*M[,2])))) 54 | ``` 55 | 56 | Is 0.26 a small LRV? Unlike correlation, LRV has no natural scale. We could argue that a value of 0.26 is more impressive for genes that vary a lot by themselves than for those that change little across samples anyway. This idea was introduced by (Lovell et al. 2015). In their paper, the authors set out to scale LRV by the variance of one or both of the individual genes.
Note, however, that the variance of individual genes is problematic because it presupposes that we can have a meaningful comparison between samples based on raw counts. Following compositional data-analysis principles, inter-sample comparisons can only be done for ratios. The individual gene variances used for the scaling of LRV must therefore themselves be based on ratios. This can be done with a reference variable $z$ (in the form of an individual gene or the geometric mean over all genes): 57 | 58 | $$\rho(x,y)=\frac{2~\textrm{cov}(\log(x/z),\log(y/z))}{\textrm{var}(\log(x/z))+\textrm{var}(\log(y/z))}=1-\frac{\textrm{LRV}(x,y)}{\textrm{LRV}(x,z)+\textrm{LRV}(y,z)}$$ 59 | 60 | The first equality shows that $\rho$ is similar to a correlation between the logs, only that the geometric mean of the variances is replaced by the arithmetic mean. $\rho$ punishes both low correlation (between the logs) and a finite $y$-intercept of the scatter. Like normal correlation, it has the nice property that it falls between -1 and 1, and a value of 1 indicates full proportionality (-1 indicates reciprocality).
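The second form of the definition can be computed directly. Here is a toy numerical check (simulated vectors, not genes from the data matrix M):

```{r}
set.seed(2)
n <- 30
x <- rlnorm(n, 3, 1)
y <- x * rlnorm(n, 0, 0.1)   # nearly proportional to x
z <- rlnorm(n, 3, 1)         # an (arbitrary) reference
1 - var(log(x/y)) / (var(log(x/z)) + var(log(y/z))) # close to 1
```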
To return to the example above, and taking as a reference the gene *Rev7* (YIL139C), we can inspect the scaled version of the scatter plot and the proportionality coefficient $\rho$: 61 | ```{r} 62 | z=M[,"YIL139C"] 63 | plot(M[,1]/z,M[,2]/z,xlim=c(0,range(M[,1]/z)[2]),ylim=c(0,range(M[,2]/z)[2]),xlab=paste(colnames(M)[1],"REV7",sep="/"),ylab=paste(colnames(M)[2],"REV7",sep="/")) 64 | abline(a=0,b=sqrt(var(M[,2]/z)/var(M[,1]/z)),col="red") 65 | 1-var(log(M[,1]/M[,2]))/(var(log(M[,1]/z))+var(log(M[,2]/z))) 66 | ``` 67 | The red line indicates the theoretical relationship the genes should follow if they were proportional. However, we can see a rather noisy cloud that is shifted with respect to the zero-intercept line, indicating that these genes cannot be proportional. This is confirmed by the rather modest coefficient $\rho$. Let us now change the reference to the geometric mean of the genes (that do not have zero-entries): 68 | ```{r} 69 | g=exp(apply(log(M[,which(apply(M,2,min)>0)]),1,mean)) 70 | plot(M[,1]/g,M[,2]/g,xlim=c(0,range(M[,1]/g)[2]),ylim=c(0,range(M[,2]/g)[2]),xlab=paste(colnames(M)[1],"gMean",sep="/"),ylab=paste(colnames(M)[2],"gMean",sep="/")) 71 | abline(a=0,b=sqrt(var(M[,2]/g)/var(M[,1]/g)),col="red") 72 | 1-var(log(M[,1]/M[,2]))/(var(log(M[,1]/g))+var(log(M[,2]/g))) 73 | ``` 74 | Clearly, here the result of $\rho$ depends a lot on the reference used. How this can introduce spurious results was worked out in detail in (Erb & Notredame 2016). However, it was found empirically that spurious results seem to affect mainly sensitivity, not specificity (Quinn et al. 2017). Let us now use the propr package to calculate both versions of $\rho$ on our entire data set: 75 | ```{r} 76 | library(propr) 77 | rho.g=perb(M) 78 | rho.z=perb(M,ivar=which(colnames(M)=="YIL139C")) 79 | str(rho.g) 80 | ``` 81 | The object that is created by propr contains, among other things, an object @matrix of the $\rho$ coefficients. We can fill the @pairs object with pairs we are interested in using a cutoff on $\rho$. Let us first look at all of them using a histogram. 82 | ```{r} 83 | rho.g=rho.g[">",-1.1] 84 | hist(rho.g@matrix[rho.g@pairs],main="Rho with geometric mean reference") 85 | ``` 86 | 87 | Let us compare this with the $\rho$ using our reference gene: 88 | ```{r} 89 | rho.z=rho.z[">",-1.1] 90 | hist(rho.z@matrix[rho.z@pairs],main="Rho with Rev7 reference") 91 | ``` 92 | 93 | While the geometric mean forces the distribution to be symmetric, our reference gene might better approximate an unchanged reference, and we could gain some sensitivity using it. Let us now see if the pairs with coefficients close to one are the same for both coefficients: 94 | ```{r} 95 | rho.z@pairs=numeric(0) 96 | rho.z=rho.z[">",0.98] 97 | rho.g@pairs=numeric(0) 98 | rho.g=rho.g[">",0.98] 99 | length(rho.z@pairs) 100 | length(rho.g@pairs) 101 | length(intersect(rho.g@pairs,rho.z@pairs)) 102 | ``` 103 | It appears that for this cut-off of 0.98 we get some 200 pairs more using the *Rev7* reference, but fortunately, the rest of the pairs are in perfect agreement. Let us now recover the gene indices from the consensus pairs and put them into a data frame: 104 | ```{r} 105 | cind=round(dim(M)[2]*(rho.g@pairs/dim(M)[2]-floor(rho.g@pairs/dim(M)[2]))) 106 | rind=round(floor(rho.g@pairs/dim(M)[2])+1) 107 | ppairs=as.data.frame(cbind(sort(cind),rind[order(cind)])) 108 | ``` 109 | 110 | We can now use the igraph package to gain an idea about the connectivity: 111 | ```{r} 112 | library(igraph) 113 | graph=graph_from_data_frame(ppairs,directed=FALSE) 114 | set.seed(123) 115 | plot(graph,vertex.size=2,vertex.label=NA) 116 | sort(degree(graph),decreasing=TRUE)[1:32] 117 | ``` 118 | 119 | Instead of exploring these modules, we will now move on to the section about differential proportionality. Since the calculations for this take a few minutes, let us prepare the object already (here it is commented out for rendering the markdown document quickly): 120 | ```{r} 121 | #res=propd(M,group=gr) 122 | ``` 123 | 124 | 125 | 126 | ## Differential proportionality 127 | 128 | 129 | Differential proportionality as we present it here is a differential expression analysis of gene ratios (Erb et al. 2017). What does it mean for a ratio to be differentially expressed? Let us look at the following example: 130 | ```{r} 131 | plot(M[,101],M[,80],col=gr+2,xlab=colnames(M)[101],ylab=colnames(M)[80]) 132 | abline(a=0,b=sqrt(var(M[1:48,80])/var(M[1:48,101])),col="red") 133 | abline(a=0,b=sqrt(var(M[49:96,80])/var(M[49:96,101])),col="green") 134 | ``` 135 | 136 | We can see that the slope of the scatter is condition-specific. The genes are clearly not proportional, but they appear to be proportional within each of the conditions. The colored lines indicate the theoretical relationship they would follow if the genes were perfectly proportional within each condition. As we can see, the red line fits pretty well, while the green line is a bit off, indicating that proportionality is only approximately true here. 137 | 138 | A plot showing the ratios on a log scale makes the equivalence with differential expression quite clear. Instead of a log-fold change, we have a log-ratio change. The horizontal lines indicate the log-ratio means in both groups.
139 | ```{r} 140 | plot(c(1:96),M[,101]/M[,80],col=gr+2,log="y",xlab="sample",ylab=paste(colnames(M)[101],colnames(M)[80],sep="/")) 141 | abline(h=exp(mean(log(M[1:48,101]/M[1:48,80]))),col="red") 142 | abline(h=exp(mean(log(M[49:96,101]/M[49:96,80]))),col="green") 143 | ``` 144 | 145 | The measure for finding such gene pairs in our data set (where $K=48$) is this: 146 | 147 | $$\vartheta(x,y)=\frac{(K-1)~\textrm{LRV}_{1,...,K}(x,y)+(N-K-1)~\textrm{LRV}_{K+1,...,N}(x,y)}{(N-1)~\textrm{LRV}_{1,...,N}(x,y)}$$ 148 | 149 | The LRV indices are meant to indicate that LRV is only evaluated on the respective subset. The LRV in the denominator is of course the one we were using before, and the indices are only shown for clarity. $\vartheta$ is the ratio of within-group LRV to total LRV and thus, unlike LRV, has the nice property of falling between 0 and 1. Like LRV, it is independent of a reference, and values close to zero indicate the interesting cases. It can be derived from a decomposition of LRV into between- and within-group variance as known from analysis of variance (ANOVA). We are essentially doing a $t$-test on the ratios, only that the direction of the fold change (i.e. the log-ratio change) does not matter (because it does not matter which of the two genes is in the denominator of the ratio). The statistic is thus fully equivalent to the squared $t$-statistic (i.e. the $F$-statistic of one-way ANOVA).
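The formula can be checked numerically on simulated genes (a toy sketch, not the Barton data): a pair whose ratio shifts between two groups of 48 samples gets a small $\vartheta$.

```{r}
set.seed(3)
N <- 96; K <- 48
x <- rlnorm(N, 3, 0.2)
y <- x * rlnorm(N, 0, 0.2) * rep(c(1, 4), each = K) # ratio shifts 4-fold between groups
lrv <- function(u, v) var(log(u / v))
((K - 1) * lrv(x[1:K], y[1:K]) + (N - K - 1) * lrv(x[(K + 1):N], y[(K + 1):N])) /
  ((N - 1) * lrv(x, y))   # well below 1: within-group LRV is small relative to total
```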
150 | 151 | Our calculations before the start of this section provided us with the res object. For the example pair shown before it contains this entry: 152 | ```{r} 153 | res@theta[5030,] 154 | ``` 155 | 156 | $\vartheta$ indicates the significance of the differential proportionality, and we can associate it with a false discovery rate (FDR) using a permutation test. We will not do this here because of time constraints, but the command for 10 permutations would be this: 157 | ```{r} 158 | #res=updateCutoffs(res,cutoff=seq(0.05,0.95,0.01), p=10) 159 | ``` 160 | 161 | This will populate the res@fdr object where FDRs for the specified cut-off sequence are provided. (I can assure you that for our data set even a cut-off as high as 0.95 is still highly significant). What about the effect size then? The equivalent to the log-fold change of differential expression is the log-ratio change. It is the difference of the log-ratio means (LRMs) of the two groups and can be used to calculate the between-group variance: 162 | 163 | $$(1-\vartheta(x,y))~\textrm{LRV}(x,y)=\frac{K(N-K)}{N^2}(\textrm{LRM}_{1,...,K}(x,y)-\textrm{LRM}_{K+1,...,N}(x,y))^2$$ 164 | 165 | The group LRMs can also be directly read off the res@theta object (see example above). Something similar to a volcano plot can now be obtained when we plot the size of the log-ratio change against theta. We are doing this here only for a subset of 100,000 pairs to speed up the plotting: 166 | ```{r} 167 | set.seed(123) 168 | myset=sample(c(1:dim(res@theta)[1]),100000) 169 | plot(res@theta$theta[myset],abs(res@theta$lrm1[myset]-res@theta$lrm2[myset]),col=rgb(0.1,0.1,0.1,0.1),pch=20,xlab="theta",ylab="absolute log-ratio change") 170 | ``` 171 | 172 | There is a conspicuous set of pairs with low $\vartheta$ and high LRMs. We extract it and look at the degrees of the resulting graph nodes: 173 | ```{r} 174 | pairset=which(abs(res@theta$lrm1-res@theta$lrm2)>5) 175 | graph=graph_from_data_frame(res@theta[pairset,1:2],directed=FALSE) 176 | sort(degree(graph),decreasing=TRUE)[1:10] 177 | length(pairset) 178 | colnames(res@counts)[5300] 179 | ``` 180 | 181 | The degree of the first gene is about the size of the pairset, which corresponds almost to the total number of genes. So one gene appears to be connected to all the other genes. This is the knockout gene *SNF2*, whose systematic name is YOR290C. 182 | 183 | In our "volcano" plot we can also see that there are a number of pairs with even lower $\vartheta$ than the ones from this module. To extract the genes involved, we extract the gene indices from the res@theta object. Again, to prioritize genes, it is interesting to look at connectivity. 184 | ```{r} 185 | pairset=which(res@theta$theta<0.01) 186 | geneset=unique(c(res@theta[pairset,1],res@theta[pairset,2])) 187 | length(geneset) 188 | graph=graph_from_data_frame(res@theta[pairset,1:2],directed=FALSE) 189 | sort(degree(graph),decreasing=TRUE)[1:10] 190 | set.seed(123) 191 | plot(graph,vertex.size=2,vertex.label=NA) 192 | ``` 193 | 194 | ## References 195 | 196 | 1. Gierliński, Marek et al. "Statistical Models for RNA-Seq Data Derived from a Two-Condition 48-Replicate Experiment." Bioinformatics 31 (2015): 3625–3630. 197 | 2. Aitchison, John. The Statistical Analysis of Compositional Data. Chapman & Hall (1986); reprinted in 2003, with additional material, by The Blackburn Press. 198 | 3. Lovell, David, et al. "Proportionality: a valid alternative to correlation for relative data." PLoS Computational Biology 11 (2015): e1004075. 199 | 4. Erb, Ionas & Notredame, Cedric. "How should we measure proportionality on relative gene expression data?" Theory in Biosciences 135 (2016): 21–36. 200 | 5. Quinn, Thom et al. "propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis." Scientific Reports 7 (2017): 16252. 201 | 6. Erb, Ionas et al. "Differential proportionality – a normalization-free approach to differential gene expression." Proceedings of CoDaWork 2017, available at bioRxiv (2017): 134536. 202 | -------------------------------------------------------------------------------- /replicate_Evalue.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/replicate_Evalue.pdf -------------------------------------------------------------------------------- /replicate_lowEvalue.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/replicate_lowEvalue.pdf -------------------------------------------------------------------------------- /zero.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "What is a 0 anyway" 3 | author: "gg" 4 | date: "`r Sys.Date()`" 5 | fig_caption: true 6 | output: pdf_document 7 | --- 8 | To run this file: 9 | Rscript -e "rmarkdown::render('zero.Rmd')" 10 | 11 | # 0 is an observation, not a ground truth 12 | 13 | We will explore what a 0, or any other value for that matter, means in a probabilistic sense.
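Before touching the data, a minimal sketch of the idea (assuming simple Poisson counting): an observed 0 is entirely consistent with a non-zero underlying rate.

```{r}
# probability of observing a count of 0 at several underlying mean counts
dpois(0, lambda=c(0.1, 0.5, 1, 2, 3))
# ~0.90 0.61 0.37 0.14 0.05: a feature with a true mean of 2 reads
# will still show a 0 in about 14% of replicates
```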
14 | 15 | We take a dataset that contains 7 technical replicates of the yeast data, and compare one technical replicate with another. 16 | 17 | ```{r, echo=TRUE} 18 | d <- as.matrix(read.table("data/countfinal2.tsv", header=T, row.names=1)) 19 | 20 | # remove 0 sum features 21 | d.n0 <- d[rowSums(d) > 0,] 22 | ``` 23 | At this point we have removed all features that are 0 in all samples. 24 | These features would almost certainly not be 0 if we sequenced 10X more deeply, but they are of no consequence for the present analysis. 25 | 26 | In the context of RNA-seq, 0 is an anomaly since most genes have a low level of stochastic expression. In the context of single-cell seq, a 0 can be a true 0 because expression is somewhat binary in single cells, but averaged over a population RNA expression is continuous. 27 | 28 | We can now examine technical replicates and see what a 0 means in the context of a near-perfect replicate. What happens if one replicate is 0? 29 | 30 | ```{r, echo=TRUE} 31 | # choose the features of replicate 2 where the value of replicate 32 | # 1 is 0, and so on 33 | one_is_0 <- d.n0[,2][d.n0[,1] == 0] 34 | one_is_1 <- d.n0[,2][d.n0[,1] == 1] 35 | one_is_2 <- d.n0[,2][d.n0[,1] == 2] 36 | one_is_32 <- d.n0[,2][d.n0[,1] == 32] 37 | 38 | # plot the distribution of values from replicate 2 where replicate 1 39 | # has a value of 0 or a value of 32 40 | par(mfrow=c(1,2)) 41 | hist(one_is_0, breaks=9, xlim=c(0,20)) 42 | hist(one_is_32, breaks=20) 43 | ``` 44 | 45 | Just for completeness, we can choose a second replicate to compare. Here the first technical replicate is in blue, the second technical replicate is in pink, and the expected value of 0 in the technical replicate is calculated as the mean of the replicate distribution. In this dataset, E(0) is approximately 0.3. Note that the most likely value for features with a count of 1 in the first replicate is 0, but the expected value of these features is about 1.4. 46 | 47 | ```{r, echo=T} 48 | # compare replicates 49 | one_is_0b <- d.n0[,3][d.n0[,1] == 0] 50 | one_is_1b <- d.n0[,3][d.n0[,1] == 1] 51 | one_is_2b <- d.n0[,3][d.n0[,1] == 2] 52 | one_is_32b <- d.n0[,3][d.n0[,1] == 32] 53 | 54 | par(mfrow=c(1,2)) 55 | hist(one_is_0, breaks=9, xlim=c(0,8), col=rgb(0,0,1,0.2)) 56 | hist(one_is_0b, breaks=15, add=T, col=rgb(1,0,0,0.2)) 57 | abline(v=mean(one_is_0)) # the expected value of 0 for pair 58 | abline(v=mean(one_is_0b), col="red") 59 | 60 | hist(one_is_1, breaks=9, xlim=c(0,12)) 61 | hist(one_is_1b, breaks=9, add=T, col=rgb(1,0,0,0.2)) 62 | abline(v=mean(one_is_1)) 63 | abline(v=mean(one_is_1b), col="red") 64 | ``` 65 | We can generate an all vs. all scatter plot of one technical replicate against another (here replicates 1 and 2) in either Euclidean or log-log space. Note where the imprecision is: at the low-count margin. 66 | 67 | ```{r, echo=TRUE} 68 | par(mfrow=c(1,2)) 69 | plot(d.n0[,1] + 0.2, d.n0[,2] + 0.2, pch=19, col=rgb(0,0,0,0.1), xlab="replicate 1", ylab="replicate 2") 70 | plot(d[,1] +0.2, d[,2] +0.2, log="xy", pch=19, col=rgb(0,0,0,0.1), xlab="replicate 1", ylab="replicate 2") 71 | ``` 72 | 73 | Finally, we can plot the expected value for the other replicates as a function of the reference replicate. This shows quite nicely that the expected value is actually higher than the observed value up to a count of 10-15 in this dataset.
74 | 75 | ```{r, echo=F} 76 | output <- matrix(data=NA, nrow=51, ncol=6) 77 | 78 | for(j in 0:50){ 79 | for(i in 1:6){ 80 | output[j+1, i ] <- mean(d.n0[,i][d.n0[,7] == j]) / (sum(d.n0[,i])/sum(d.n0[,7])) 81 | } 82 | } 83 | 84 | x <- seq(0,50, by=1) 85 | 86 | rownames(output) = x 87 | 88 | par(mfrow=c(1,1)) 89 | boxplot(t(output), xlab="R1 count", ylab="E(R count)") 90 | abline(h=seq(0,50, by=5), col=rgb(0,0,0,0.3), lty=2) 91 | abline(v=seq(1,51, by=5), col=rgb(0,0,0,0.3), lty=2) 92 | abline(-1,1) 93 | 94 | ``` 95 | In RNA-seq we typically do not have enough samples to have independent power for each feature. We really need about 10 samples for every feature we want to examine independently. 96 | 97 | The data are discrete probabilities of observing the feature in the DNA sequencing library scaled by the read count; i.e., converted to integers by 'counting'. If we sequenced 10X deeper, then 0 values would be converted to integers between 0 and ~10 (sampling). If we sequenced 1000X deeper, then we would get an even better estimate of the actual underlying value for our 0s. Most of the 0s at some point will be converted to a count---so what is an appropriate value for 0? Somewhere between 0 (the feature can never occur) and 1 (we should always see it) there is a value that, over many experiments, will be least likely to perturb the system: that value is 0.5. Not always the best, usually not the optimum, but in general the least wrong. 98 | 99 | I typically choose either of two methods for zero replacement/estimation: the first is to use the zCompositions package to replace 0 values with a count zero multiplicative estimate. This approach replaces 0 values with a suitable pseudocount, but keeps the relationships between all the other features constant. The pseudocount only replaces the 0 value, not the rest. This approach is useful for exploratory data analysis, such as compositional biplots, clustering, or using the propr package with point estimates. The second approach is instantiated in the ALDEx2 package, where we assume that the prior probability of 0 is 0.5. In the prior approach we are adding 0.5 to \emph{every} value in the matrix. On the surface this seems to be undesirable. However, note that the expected value of low-count features in this dataset is actually larger than the observed value: recall that it is about 0.3 for a count of 0, about 1.3 for a count of 1, about 2.7 for a count of 2, and about 31 for a count of 32. Thus, adjusting each value by a constant begins to make more sense.
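As a compact illustration of the two strategies, here is a minimal sketch on a toy count matrix (hypothetical values; cmultRepl is the same zCompositions call used elsewhere in this tutorial, and the +0.5 prior mimics the ALDEx2 approach):

```{r}
library(zCompositions)
# toy matrix: 2 samples (rows) by 4 features (columns), with a single 0
counts <- matrix(c(0, 5, 10, 85,
                   2, 4, 12, 82), nrow=2, byrow=TRUE)
# method 1: count zero multiplicative replacement; only the 0 is adjusted
# (note that cmultRepl returns proportions by default)
cmultRepl(counts, label=0, method="CZM")
# method 2: a uniform prior of 0.5 added to every value
counts + 0.5
```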
100 | 101 | If we recall that the basis of log-ratio analysis is the ratio between components, then adding a prior probability to each feature starts to make sense. As shown in Table \ref{tab:ratios}, adding the prior is a much better estimate of the ratios of the Expected values than is the ratio of the point observations. The addition of the prior is in fact an even better estimate than is the 0 replacement method in this dataset. 102 | 103 | \begin{table}[!h] 104 | \caption{Count vs E(count) and the ratio relationships. The table shows the ratio between counts, and the expected value of the count in another technical replicate (E ratio), compared to the ratio of the count + a prior of 0.5 (P ratio).}\vspace{0.2cm} 105 | \centering 106 | \resizebox{\columnwidth}{!}{% 107 | \begin{tabular}{l r r r r r r} 108 | \hline 109 | Count 1 & Count 2 & E(count) 1 & E(count) 2 & Count ratio & E ratio & P ratio \\ \hline \hline 110 | 0 & 1 & 0.3 & 1.3 & 0/1 & 0.3/1.3 & 0.5/1.5\\ 111 | 0 & 2 & 0.3 & 2.7 & 0/2 & 0.3/2.7 & 0.5/2.5\\ 112 | 0 & 32 & 0.3 & 31 & 0/32 & 0.3/31 & 0.5/32.5\\ 113 | \end{tabular} 114 | } 115 | \label{tab:ratios} 116 | \end{table} 117 | 118 | 119 | ```{r, echo=FALSE} 120 | 121 | rdirichlet <- function (n, alpha) 122 | # draw n random samples from a Dirichlet distribution with parameter alpha 123 | #-------------------------------------------- 124 | { 125 | if(length(n) > 1) n <- length(n) 126 | if(length(n) == 0 || as.integer(n) == 0) return(numeric(0)) 127 | n <- as.integer(n) 128 | if(n < 0) stop("integer(n) can not be negative in rdirichlet") 129 | 130 | if(is.vector(alpha)) alpha <- t(alpha) 131 | l <- dim(alpha)[2] 132 | x <- matrix(rgamma(l * n, t(alpha)), ncol = l, byrow=TRUE) # handles recycling 133 | return(x / rowSums(x)) 134 | } 135 | ``` 136 | 137 | Let's generate random instances of the data. The maximum likelihood (ML) probability is the count divided by the sum of the sample. This is a discrete probability since we are dealing with finite values. 138 | 139 | Since we know that a pure replicate is not identical and differs by sampling variation, the ML estimate is strongly affected by sampling. We can convert the discrete estimate of the probability to a continuous probability by modelling it as a multivariate Poisson process with a fixed sum: this is a Dirichlet distribution. 140 | 141 | Even though the E(Dir) is very close to the ML estimate, the individual Dir instances are variable. This variation is a reasonable match for the actual underlying technical variation, although the tails are a bit broader in general. 142 | 143 | ```{r, echo=T, fig.height=3, fig.width=7.5} 144 | par(mfrow=c(1,3)) 145 | plot((d.n0[,1] + 0.2) /sum(d.n0[,1] + 0.2), 146 | (d.n0[,2] + 0.2)/sum(d.n0[,2] + 0.2), 147 | log="xy", xlab="P 1", ylab="P 2", 148 | pch=19, col=rgb(0,0,0,0.1)) 149 | 150 | 151 | d.dir <- rdirichlet(16, d.n0[,1]+0.5) 152 | ml <- (d.n0[,1]+0.5) / sum(d.n0[,1]+0.5) 153 | E_dir <- apply(d.dir, 2, mean) 154 | plot(d.dir[1,], d.dir[2,], pch=19, cex=0.2,xlab="Dir 1", 155 | ylab="Dir 2", col=rgb(1,0,0,0.1), log="xy") 156 | points(E_dir, ml, pch=19,cex=0.1, col="blue") 157 | 158 | plot((d.n0[,1] + 0.2) /sum(d.n0[,1] + 0.2), 159 | (d.n0[,2] + 0.2)/sum(d.n0[,2] + 0.2), 160 | log="xy", pch=19, col=rgb(0,0,0,0.1), 161 | xlab="P(D) 1", ylab="P(D) 2") 162 | points(d.dir[1,], d.dir[2,], pch=19, cex=0.2, col=rgb(1,0,0,0.1)) 163 | points(d.dir[1,], d.dir[3,], pch=19, cex=0.2, col=rgb(1,0,0,0.1)) 164 | points(d.dir[1,], d.dir[4,], pch=19, cex=0.2, col=rgb(1,0,0,0.1)) 165 | points(E_dir, ml, pch=19,cex=0.1, col="blue") 166 | 167 | ``` 168 | 169 | In summary, an observed count of 0 does not mean that 0 is what we would observe upon replication. In other words, an observation of 0 does not mean that there is 0 probability of observing a non-zero count on replication. In fact, in this dataset a value of 0 has an Expected value of about 0.3. Thus there is good justification for adjusting 0 values with a pseudocount (from zCompositions) or a prior probability to more accurately capture the real underlying structure of the data at the low-count margin.
170 | -------------------------------------------------------------------------------- /zero.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ggloor/CoDa_microbiome_tutorial/8c8352813b7ee12f82290e9a661861edae7d166d/zero.pdf --------------------------------------------------------------------------------