├── JC69
│   ├── BayesianMCMC.pdf
│   ├── mcmc.JCd.R
│   └── mcmc.JCrt.R
├── K80
│   ├── mcmc.K80.R
│   ├── surf.png
│   └── traces.png
├── LICENSE
└── README.md

/JC69/BayesianMCMC.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thednainus/Bayesian_tutorial/e542d9281a7e08986bc5accf975882b499c31980/JC69/BayesianMCMC.pdf
--------------------------------------------------------------------------------
/JC69/mcmc.JCd.R:
--------------------------------------------------------------------------------
# A Bayesian phylogenetics inference program
# July 2015
# Authors: Mario dos Reis and Ziheng Yang
# This script was previously published at
# http://abacus.gene.ucl.ac.uk/teach/BayesianMCMCintro/MCMC.JCd.R


rm(list=ls())
# ##########################################
# (A) Simple example, calculated analytically
# ##########################################

# The data are a pairwise sequence alignment of the 12S rRNA gene sequences
# from human and orangutan.
# * The alignment is n = 948 nucleotides long, with x = 90 differences between the sequences.
# (Example 7.1, p.216, Yang 2014 Molecular Evolution: A Statistical Approach)

# Our aim is to estimate the molecular distance d, in substitutions per site,
# between the two sequences. Because the two species are closely related (i.e.,
# the alignment has less than 10% divergence), we use the Jukes and Cantor (1969)
# nucleotide substitution model (JC69).

# The Bayesian estimate of d is given by the posterior distribution:
# p(d | x) = C * p(d) * p(x | d).

# p(d) is the prior on the distance.
# p(x | d) is the (Jukes-Cantor) likelihood, i.e. the probability of observing
# the data x given d, and C is a normalising constant, C = 1 / ∫ p(d) p(x | d) dd

# likelihood under the JC69 model, i.e. L = p(x | d):
L <- function(d, x, n) (3/4 - 3*exp(-4*d/3)/4)^x * (1/4 + 3*exp(-4*d/3)/4)^(n - x)

# plot the likelihood:
curve(L(x, 90, 948), from=0, to=0.2, n=500, xlab="distance, d", ylab="likelihood, p(x | d)")

# The maximum likelihood estimate is the highest point on the curve:
d.mle <- -3/4 * log(1 - 4/3 * 90/948)  # [1] 0.1015060
abline(v=d.mle, lty=2)
title(expression(hat(d) == 0.1015))

# For the prior on d, we use an exponential distribution with mean mu = 0.2:
# p(d) = exp(-d/mu) / mu; mu = 0.2.
d.prior <- function(d, mu) exp(-d/mu) / mu

# un-normalised posterior:
d.upost <- function(d, x, n, mu) d.prior(d, mu=mu) * L(d, x=x, n=n)

# To calculate the normalising constant C, we need to integrate the un-normalised posterior:
integrate(d.upost, lower=0, upper=Inf, x=90, n=948, mu=0.2, abs.tol=0)
# 5.167762e-131 with absolute error < 1.9e-135
C <- 1 / 5.167762e-131

# The posterior distribution function (when x=90, n=948 and mu=0.2) is thus
d.post <- function(d) C * d.upost(d=d, x=90, n=948, mu=0.2)

# Plot the posterior and add the prior:
curve(d.post(x), from=0, to=0.2, n=500, xlab="d", ylab="density")
curve(d.prior(x, mu=0.2), add=TRUE, lty=2)

# EXERCISE 1: Try adding the likelihood curve to the plots above.
# HINT: You may want to scale the likelihood by a constant like C * 3
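# One possible solution, following the hint (the factor 3 * C is just a
# convenient visual scaling that brings the likelihood curve up to the
# height of the posterior; any similar constant would do):
curve(3 * C * L(x, 90, 948), add=TRUE, col="blue")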

# The posterior mean of d is found by integrating ∫ d * p(d | x) dd
f <- function(d) d * d.post(d)
integrate(f, lower=0, upper=Inf, abs.tol=0)
# 0.1021246 with absolute error < 5.5e-06
# Add it to the plot:
abline(v=0.1021246, lty=3)  # posterior mean

# Add the MLE:
abline(v=d.mle, lty=3, col="red")

# The posterior distribution can be used to answer questions such as:
# what is the probability that d > 0.1?
integrate(d.post, lower=0.1, upper=Inf)
# 0.5632372 with absolute error < 1.6e-08, i.e. 56.3%

# The 95% equal-tail credibility interval is (0.08191, 0.12463)
integrate(d.post, lower=0, upper=0.08191)  # 2.5%
integrate(d.post, lower=0, upper=0.12463)  # 97.5%

# EXERCISE 2 (VERY HARD): Can you work out how to calculate the 99% CI?
# HINT: You need to find the upper bounds where the integral is 0.005
# and 0.995 respectively. The integrate and optim functions in R may be useful.


# ###################################################################
# (B) The same example analysed using Markov chain Monte Carlo (MCMC)
# ###################################################################

# We now obtain the posterior distribution by MCMC sampling.
# In most practical problems, the constant C cannot be calculated (either
# analytically or numerically), and so MCMC becomes necessary.

# To avoid numerical problems, we want to calculate the likelihood, prior and
# posterior on the log scale, so we define and use the following functions:
lnL <- function(d, x, n) x*log(3/4 - 3*exp(-4*d/3)/4) + (n - x)*log(1/4 + 3*exp(-4*d/3)/4)
d.lnprior <- function(d, mu) -d/mu - log(mu)  # the log(mu) term is constant & can be deleted
d.lnpost <- function(d, x, n, mu) d.lnprior(d, mu=mu) + lnL(d, x=x, n=n)  # un-normalised


# Draft MCMC algorithm:
# 1. Set the initial state for d.
# 2. Propose a new state d* (from an appropriate proposal density).
# 3. Accept or reject the proposal with probability
#    min(1, p(d*)p(x|d*) / p(d)p(x|d))
#    If the proposal is accepted set d=d*, otherwise d=d.
# 4. Save d.
# 5. Go to step 2.
d.mcmcf <- function(init.d, N, w) {
  # init.d is the initial state
  # N is the number of 'generations' the algorithm is run for.
  # w is the 'width' of the proposal density.
  d <- numeric(N+1)  # Here we will keep the visited values of d.
  d[1] <- init.d
  dnow <- init.d
  acc.prop <- 0  # acceptance proportion
  for (i in 1:N) {
    # here we use a uniform density with reflection to propose a new d*
    dnew <- dnow + runif(1, -w/2, w/2)
    #dnew <- dnow + rnorm(1, 0, w)  # a normal proposal, using w as the SD.

    if (dnew < 0) dnew <- -dnew

    lnalpha <- d.lnpost(dnew, x=90, n=948, mu=0.2) - d.lnpost(dnow, x=90, n=948, mu=0.2)
    # if ru < alpha accept the proposal:
    if (lnalpha > 0 || runif(1, 0, 1) < exp(lnalpha)) { dnow <- dnew; acc.prop <- acc.prop + 1 }
    # else reject it, so that dnow stays as dnow.
    d[i+1] <- dnow
    # Can you think of a way to print the progress of the MCMC to the screen?
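    # For example (one simple option, left commented out so it does not
    # clutter the output of short runs):
    # if (i %% 100 == 0) cat("generation:", i, " d:", dnow, "\n")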
  }
  # print out the proportion of times the proposal was accepted
  print(c("Acceptance proportion:", acc.prop/N))
  return(d)  # return the vector of d values visited during the MCMC
}

# Test the algorithm
d.1 <- d.mcmcf(0.5, 100, 0.08)

# plot the output:
plot(d.1, type='p', xlab="generation", ylab="d")  # This is the 'trace' plot of d.
# The chain wiggles about, spending time at different values of d in
# proportion to their posterior density.

# Add the 95% CI as a reference:
abline(h=c(0.08191, 0.12463), lty=2, col="red")

# It may be useful to calculate the time it takes to run N generations:
system.time(d.mcmcf(0.5, 10000, 0.08))
# ~ 0.6s on my laptop, which means 1e6 generations would take ~60s, or 1 min.


# EXERCISE 2a: Note that each MCMC iteration calls the function for calculating the
# log posterior twice, for dnow and dnew. This wastes computation, since the value
# for dnow was already calculated in the previous iteration. Modify the code (by
# recording the log posterior for dnow) so that you only need to calculate the value
# for dnew in every iteration. Run long chains to see whether this reduces the
# running time.
# Typically, C code runs at least 100 times faster than R code for the same algorithm.


# Get a very long sample:
d.long <- d.mcmcf(0.5, 10000, 0.08)
plot(d.long, type='l')
# The trace plot looks like a dense, hairy caterpillar!
# This is good!

# Remove the initial state and burn-in
d.long <- d.long[-(1:101)]

hist(d.long, prob=TRUE, n=50, xlab="d", main="")  # The sample histogram from the MCMC
curve(d.post(x), add=TRUE, col="red", lwd=2)  # The true posterior distribution.

# Alternatively:
plot(density(d.long, adj=1.5)); rug(d.long)
curve(d.post(x), add=TRUE, col="red", lwd=2)

# EXERCISE 3: What does adj in the density function do? Try different values
# such as adj=0.1 or adj=1.0.
# How long (how many generations, N) do you need to run the MCMC so that
# the density plot is almost identical to the true distribution?

# Now we can get the posterior mean and 95% CI from the sample:
mean(d.long)  # should be close to 0.1021246
quantile(d.long, prob=c(0.025, 0.975))  # should be close to (0.08191, 0.12463)

# Effect of the step size, w:
d.1 <- d.mcmcf(0.5, 100, w=0.08)  # moderate step size
plot(d.1, type='b', col="black", pch=19, cex=.5)  # chain moves about

d.2 <- d.mcmcf(0.5, 100, w=0.01)  # tiny step size
lines(d.2, type='b', col="green", pch=19, cex=.5)  # baby steps, chain moves slowly

d.3 <- d.mcmcf(0.5, 100, w=1)  # large step size
lines(d.3, type='b', col="blue", pch=19, cex=.5)  # chain gets stuck easily

# EXERCISE 4: The optimal acceptance proportion is about 40%. Try various step
# sizes, w, and see the effect on the acceptance proportion (one possible sketch
# follows below). What is the optimal w?
# Finding the best step size is called fine-tuning.
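# One possible sketch for EXERCISE 4: loop over a few candidate step sizes
# (an arbitrary grid) and compare the printed acceptance proportions:
for (w in c(0.005, 0.02, 0.08, 0.32, 1.28)) d.mcmcf(0.1, 1e4, w)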

# ###################################################################
# (C) Autocorrelation, efficiency, effective sample size and thinning
# ###################################################################

# States in an MCMC sample are autocorrelated, because each new state is
# either the previous state or a modification of it.

x <- d.long[-length(d.long)]  # remove the last state
y <- d.long[-1]  # remove the first state
plot(x, y, xlab="current state", ylab="next state", main="lag, k=1")
cor(x, y)  # this is the autocorrelation of the d sample at lag 1, ~50-60%

# We can use the acf function to calculate the autocorrelation for various lags:
d.acf <- acf(d.long)  # The correlation is lost after a lag of about k=10
d.acf$acf  # the actual acf values are here.

# If the autocorrelation is high, the chain will be inefficient,
# i.e. we will need to run the chain for a long time to obtain a good
# approximation to the posterior distribution.
# The efficiency of a chain is defined as:
# eff = 1 / (1 + 2(r1 + r2 + r3 + ...))
# where r_k is the correlation for lag k.

eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))
eff(d.acf)  # ~ 30%. What does that mean?

# Efficiency is related to the effective sample size (ESS) by: ESS = N * eff
N <- length(d.long)  # 9,900
length(d.long) * eff(d.acf)  # ~ 2,700
# That is, our MCMC sample of size 9,900 is as informative about d as an
# independent sample of size ~2,700. In other words, 30% efficiency means
# the estimate (of the posterior mean) from an MCMC sample of size N has the
# same variance as the estimate from an independent sample of size N*0.3.


# EXERCISE 5: Run a long chain (say N=1e4) with step size w=0.01.
# Calculate the efficiency and the ESS.
# Do it at home: Try different values of w, and plot eff against w.
# What value of w gives the highest eff?


# Usually, to save hard drive space, the MCMC is 'thinned', i.e. rather than
# saving every single value sampled, we only save every m-th value. The resulting
# sample will be smaller (thus saving hard drive space) but still has nearly as
# much information as the full sample. Usually we modify the MCMC code to do this.
# Here we reduce the sample to ~1/4 of the original size:
d.reduced <- d.long[seq(from=1, to=N, by=4)]
mean(d.reduced)  # still close to 0.1021246
quantile(d.reduced, c(0.025, 0.975))  # still close to (0.08191, 0.12463)


# ###############################################################
# (D) MCMC diagnostics. How do we know our results are reliable?
# ###############################################################

# (1) Run the chain from different starting values; the chains
# should converge to the same region of parameter space:
d.high <- d.mcmcf(0.4, 1e2, 0.1)
plot(d.high, col="red", ty='l', ylim=c(0, 0.4))

d.middle <- d.mcmcf(0.1, 1e2, 0.1)
lines(d.middle, col="black", ty='l')

d.low <- d.mcmcf(0, 1e2, 0.1)
lines(d.low, col="blue", ty='l')

abline(h=c(0.08191, 0.12463), lty=2)

# (2) Compare the histograms and summary statistics obtained
# from several long, independent runs:
d.long1 <- d.mcmcf(0.4, 1e3, 0.1)[-(1:101)]
d.long2 <- d.mcmcf(0.0, 1e3, 0.1)[-(1:101)]

plot(density(d.long1, adj=1), col="red")
lines(density(d.long2, adj=1), col="blue")

mean(d.long1); mean(d.long2)
# relative difference (%)
abs(mean(d.long1) - mean(d.long2)) / mean(c(d.long1, d.long2)) * 100

# It is extremely important that you run all your MCMC analyses at least twice!
# You cannot assess the robustness of an MCMC analysis with only one run!

# Try again using N=1e5 (this will take a little while to finish)
d.long1 <- d.mcmcf(0.4, 1e5, 0.1)[-(1:101)]
d.long2 <- d.mcmcf(0.4, 1e5, 0.1)[-(1:101)]

# (3) Calculate the ESS. Ideally, we should aim for ESS > 1,000, when
# summary statistics such as the posterior mean and 95% CI can be
# accurately estimated. ESS > 10,000 offers superb performance, but this
# is rarely achieved in phylogenetic problems. ESS < 100 is poor:
length(d.long1) * eff(acf(d.long1))  # > 22,000 (nice!)

# (4) Plot the running means:
rm1 <- cumsum(d.long1) / (1:length(d.long1))
rm2 <- cumsum(d.long2) / (1:length(d.long2))
plot(rm1, ty='l', ylab="posterior mean", xlab="generation"); lines(rm2, col="red")

# There are more diagnostic tests and statistics that can be applied, for
# example the potential scale reduction statistic
# (p.242, Yang 2014 Molecular Evolution: A Statistical Approach).

# R packages such as CODA have been written to perform MCMC
# diagnostics, and they may prove quite useful.


# #################################################
# (E) MCMC in 2 dimensions - Molecular clock dating
# #################################################

# Now consider estimating the time of divergence between human and orangutan
# using the data above, x=90, n=948.
# The molecular distance is the product of geological time, t, and the
# molecular rate, r: d = r * t.

# Our JC69 log-likelihood as a function of r and t is:
lnL.rt <- function(r, t, x, n) lnL(d=r*t, x=x, n=n)

# The log-likelihood is now a 2D surface:
r <- seq(from=2e-3, to=6e-3, len=50)
t <- seq(from=10, to=40, len=50)

# create a table of all combinations of r and t values:
rt <- expand.grid(r=r, t=t)

z <- lnL.rt(r=rt$r, t=rt$t, x=90, n=948)

# The log-likelihood surface
image(r, t, matrix(z, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z, ncol=50), nlevels=5, add=TRUE)

# The log-likelihood looks like a flat, curved, mountainous ridge.
# The top of the ridge is located at d.mle = 0.1015060 = r * t
curve(d.mle/x, add=TRUE, col="blue", lwd=2, lty=2)

# The molecular data are informative about d only, but not about r or t
# separately. We say that r and t are confounded parameters. Any combination
# of r and t that satisfies 0.1015060 = r * t is an ML solution.

# We can make a neat 3D plot of the log-likelihood:
persp(r, t, matrix(z, ncol=50), zlab="log-likelihood", theta=40, phi=40)

# The Bayesian method provides a useful way to estimate t and r, through
# the use of priors.
# The fossil record suggests that the common ancestor of human and orangutan
# lived 33.7-11.2 Ma (see Benton et al. 2009 in the Timetree of Life book).
# Thus the prior on t is a normal distribution with mean 22.45 (the midpoint
# of the fossil calibration) and sd 5.6, so that the 95% range is roughly
# 11.2 to 33.7.
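# A quick check of that claim: the 2.5% and 97.5% quantiles of the
# N(22.45, 5.6^2) prior should be close to the fossil bounds:
qnorm(c(0.025, 0.975), mean=22.45, sd=5.6)  # roughly 11.5 and 33.4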
t.lnprior <- function(t, mu=22.45, sd=5.6) -((t-mu)/sd)^2 / 2  # additive constant ignored
curve(t.lnprior(x), from=5, to=40, n=5e2, xlab="t (Ma)", ylab="log density")

# For the rate, r, we use an exponential distribution with mean
# mu = 0.10 / 22.45 = 0.00445 (based on the distance MLE and the midpoint
# of the calibration)
r.lnprior <- function(r, mu=0.10/22.45) -r/mu - log(mu)
curve(r.lnprior(x), from=0, to=.02, xlab="r (s/My)", ylab="log density")

# We can plot the log joint prior:
z1 <- r.lnprior(rt$r) + t.lnprior(rt$t)
image(r, t, matrix(z1, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z1, ncol=50), nlevels=5, add=TRUE)

# The log un-normalised posterior is thus
rt.lnpost <- function(r, t, x=90, n=948) r.lnprior(r) + t.lnprior(t) + lnL.rt(r, t, x, n)

# We can plot the un-normalised posterior (on the log scale):
z2 <- rt.lnpost(rt$r, rt$t)
image(r, t, matrix(z2, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z2, ncol=50), nlevels=5, add=TRUE)
curve(d.mle/x, add=TRUE, col="blue", lty=2, lwd=2)

# EXERCISE 6: Can you plot the un-normalised posterior as a 3D surface using persp?

# We are now ready to build a two-parameter MCMC!

# The draft algorithm is as follows:
# 1. Set initial states for t and r.
# 2a. Propose a new state t* (from a uniform sliding window of width wt, for example).
# 2b. Accept or reject the proposal with probability
#     min(1, p(r)p(t*)p(x|d*) / p(r)p(t)p(x|d))
#     If the proposal is accepted set t=t*, otherwise t=t.
# 3a. Propose a new state r* (from a uniform sliding window of width wr, for example).
# 3b. Accept or reject the proposal with probability
#     min(1, p(r*)p(t)p(x|d*) / p(r)p(t)p(x|d))
#     If the proposal is accepted set r=r*, otherwise r=r.
# 4. Save r and t.
# 5. Go to step 2.


# EXERCISE 7: Create a function called rt.mcmcf based on the d.mcmcf function.
# Run your newly written phylogenetics MCMC program! You will need two step sizes,
# wt and wr. Vary the step sizes and check the performance of the
# MCMC chain. Run the analysis several times, and calculate posterior means and
# 95% CIs for r and t. Calculate the efficiency and ESS. A 2D density estimate of
# the MCMC posterior sample can be obtained using the kde2d function in the MASS
# package.

# EXERCISE 8: Try using a normal proposal density (rather than uniform) with
# reflection for r and t. Does it make a difference?

# EXERCISE 9 (VERY HARD): The normalising constant in this case is a double integral
# C = 1 / ∫∫ p(r) p(t) p(x | d) dr dt
# Can you calculate C using R's cubature package? (The normal integrate function
# does not work; you need a special multidimensional integration package.)
# The cubature package is not a default package, so you need to use install.packages
# to get it. One possible approach is sketched below.
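# A rough sketch for EXERCISE 9 (one possible approach, not the only one).
# It assumes the cubature package is installed, and integrates over a finite
# box that contains essentially all the posterior mass; the log posterior is
# rescaled by a central value to avoid numerical underflow. Note that the
# dropped prior constants mean C here is correct only up to those constants.
library(cubature)
lnP0 <- rt.lnpost(r=0.0045, t=22.45)  # reference value for rescaling
upost <- function(v) exp(rt.lnpost(r=v[1], t=v[2]) - lnP0)  # v = c(r, t)
int <- adaptIntegrate(upost, lowerLimit=c(0, 0), upperLimit=c(0.05, 100))
1 / (int$integral * exp(lnP0))  # approximate normalising constant, C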
--------------------------------------------------------------------------------
/JC69/mcmc.JCrt.R:
--------------------------------------------------------------------------------
# Solution to EXERCISE 7:
lnL <- function(d, x, n) x*log(3/4 - 3*exp(-4*d/3)/4) + (n - x)*log(1/4 + 3*exp(-4*d/3)/4)
r.lnprior <- function(r, mu=0.10/22.45) -r/mu  # -log(mu) term deleted
t.lnprior <- function(t, mu=22.45, sd=5.6) -((t-mu)/sd)^2 / 2  # additive constant deleted
lnL.rt <- function(r, t, x, n) lnL(d=r*t, x=x, n=n)
rt.lnpost <- function(r, t, x=90, n=948) r.lnprior(r) + t.lnprior(t) + lnL.rt(r, t, x, n)


rt.mcmcf <- function(init.r, init.t, N, w.r, w.t) {
  # init.r and init.t are the initial states
  # N is the number of 'generations' the algorithm is run for.
  # w.r and w.t are the step sizes of the proposal densities.
  r <- t <- numeric(N+1)
  r[1] <- init.r; t[1] <- init.t
  rnow <- init.r; tnow <- init.t
  ap.r <- ap.t <- 0  # acceptance proportions
  for (i in 1:N) {
    # Propose, and accept or reject, a new r:
    rnew <- rnow + runif(1, -w.r/2, w.r/2)
    if (rnew < 0) rnew <- -rnew
    lnalpha <- rt.lnpost(rnew, tnow, x=90, n=948) - rt.lnpost(rnow, tnow, x=90, n=948)
    # if ru < alpha accept the proposal:
    if (lnalpha > 0 || runif(1, 0, 1) < exp(lnalpha)) { rnow <- rnew; ap.r <- ap.r + 1 }
    # else reject it, so that rnow stays as rnow.

    # Propose, and accept or reject, a new t:
    tnew <- tnow + runif(1, -w.t/2, w.t/2)
    if (tnew < 0) tnew <- -tnew
    lnalpha <- rt.lnpost(rnow, tnew, x=90, n=948) - rt.lnpost(rnow, tnow, x=90, n=948)
    if (lnalpha > 0 || runif(1, 0, 1) < exp(lnalpha)) { tnow <- tnew; ap.t <- ap.t + 1 }
    # else reject it, so that tnow stays as tnow.
    r[i+1] <- rnow; t[i+1] <- tnow  # take the sample
  }
  # print out the acceptance proportions
  print(c(ap.r/N, ap.t/N))
  return(list(r=r, t=t))  # return the vectors of r and t visited during the MCMC
}

# Test the chain:
rt.1 <- rt.mcmcf(0.01, 40, 1e2, 0.001, 10)

plot(rt.1$r, rt.1$t, ty='b', pch=19, cex=.5, xlab="rate", ylab="time")

# Do a longer run and fine-tune:
rt.2 <- rt.mcmcf(0.01, 40, 1e4, 0.001, 10)
# If an acceptance proportion is high, the step size is too small and needs to be
# increased. On the other hand, if the acceptance proportion is low, decrease the
# step size. What step sizes above lead to roughly 30% acceptance proportions for
# both parameters?

# Look at the traces:
par(mfrow=c(2,1))
plot(rt.2$r, ty='l', main="rate")
plot(rt.2$t, ty='l', main="time")

require(MASS)  # This package has additional statistical functions
zz <- kde2d(rt.2$r, rt.2$t)
par(mfrow=c(1,1))
image(zz, xlab="rate", ylab="time")
contour(zz, add=TRUE)

# Add the first few points to see the MCMC's progress:
lines(rt.2$r[1:200], rt.2$t[1:200], ty='b', pch=19, cex=.5, col="blue")
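# EXERCISE 7 also asks for the efficiency and ESS. One possible way is to
# reuse the eff function from mcmc.JCd.R on each parameter's sample:
eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))
length(rt.2$r) * eff(acf(rt.2$r, plot=FALSE))  # ESS for the rate
length(rt.2$t) * eff(acf(rt.2$t, plot=FALSE))  # ESS for the time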
--------------------------------------------------------------------------------
/K80/mcmc.K80.R:
--------------------------------------------------------------------------------
# Clean up your R space
rm(list=ls())

#######################
# PART 1: Introduction
#######################

# The data are the 12S rRNA alignment of human and orangutan, with 948 base pairs
# and 90 differences (84 transitions and 6 transversions).
# Example data from Table 1.3, p.7 of Yang (2014) Molecular Evolution: A Statistical
# Approach. Oxford University Press.

n <- 948  # length of the alignment in bp
ns <- 84  # total number of transitions (23+30+10+21)
nv <- 6   # total number of transversions (1+0+2+0+2+1+0+0)

# log-likelihood function, f(D|d,k), using Kimura's (1980) substitution model;
# see p.8 in Yang (2014)

k80.lnL <- function(d, k, n=948, ns=84, nv=6) {

  p0 <- .25 + .25 * exp(-4*d/(k+2)) + .5 * exp(-2*d*(k+1)/(k+2))
  p1 <- .25 + .25 * exp(-4*d/(k+2)) - .5 * exp(-2*d*(k+1)/(k+2))
  p2 <- .25 - .25 * exp(-4*d/(k+2))

  return((n - ns - nv) * log(p0/4) +
         ns * log(p1/4) + nv * log(p2/4))
}

dim <- 100  # dimension for the plot
d.v <- seq(from=0, to=0.3, len=dim)  # vector of d values
k.v <- seq(from=0, to=100, len=dim)  # vector of k values
dk <- expand.grid(d=d.v, k=k.v)

par(mfrow=c(1, 3))

# Prior surface, f(d)f(k)
Pri <- matrix(dgamma(dk$d, 2, 20) * dgamma(dk$k, 2, .1),
              ncol=dim)

image(d.v, k.v, -Pri, las=1, col=heat.colors(50),
      main="Prior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pri, nlev=10, drawlab=FALSE, add=TRUE)

# Likelihood surface, f(D|d,k)
lnL <- matrix(k80.lnL(d=dk$d, k=dk$k), ncol=dim)

# for numerical reasons, we scale the likelihood to be 1
# at the maximum, i.e. we subtract max(lnL)
L <- exp(lnL - max(lnL))

image(d.v, k.v, -L, las=1, col=heat.colors(50),
      main="Likelihood", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, L, nlev=10,
        drawlab=FALSE, add=TRUE)  # creates a contour plot

# Unscaled posterior surface, f(d)f(k)f(D|d,k)
Pos <- Pri * L

image(d.v, k.v, -Pos, las=1, col=heat.colors(50),
      main="Posterior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pos, nlev=10,
        drawlab=FALSE, add=TRUE)


##########################################
# PART 2: Markov Chain Monte Carlo (MCMC)
##########################################

# We now obtain the posterior distribution by MCMC sampling.
# In most practical problems, the normalising constant cannot be calculated
# (either analytically or numerically), and so the MCMC algorithm becomes
# necessary.

# Function that returns the logarithm of the unscaled posterior:
# f(d) * f(k) * f(D|d,k)
# By default we set the priors as:
# f(d) = Gamma(d | 2, 20) and f(k) = Gamma(k | 2, .1)
ulnPf <- function(d, k, n=948, ns=84, nv=6, a.d=2, b.d=20, a.k=2, b.k=.1) {
  # The normalising constants in the prior densities can be ignored:
  lnpriord <- (a.d - 1)*log(d) - b.d * d
  lnpriork <- (a.k - 1)*log(k) - b.k * k

  # log-likelihood (K80 model):
  expd1 <- exp(-4*d/(k+2))
  expd2 <- exp(-2*d*(k+1)/(k+2))
  p0 <- .25 + .25 * expd1 + .5 * expd2
  p1 <- .25 + .25 * expd1 - .5 * expd2
  p2 <- .25 - .25 * expd1
  lnL <- (n - ns - nv) * log(p0/4) + ns * log(p1/4) + nv * log(p2/4)

  # Return the log un-normalised posterior:
  return(lnpriord + lnpriork + lnL)
}

# Draft MCMC algorithm:
# 1. Set initial states for d and k.
# 2. Propose a new state d* (from an appropriate proposal density).
# 3. Accept or reject the proposal with probability
#    min(1, p(d*)p(x|d*) / p(d)p(x|d))
#    If the proposal is accepted set d=d*, otherwise d=d.
# 4. Save d.
# 5. Repeat steps 2-4 for k.
# 6. Go to step 2.

mcmcf <- function(init.d, init.k, N, w.d, w.k) {
  # init.d and init.k are the initial states
  # N is the number of MCMC iterations.
  # w.d is the width of the sliding-window proposal for d.
  # w.k is the width of the sliding-window proposal for k.

  # We keep the visited states (d, k) in sample.d and sample.k
  # for easy plotting. In practical MCMC applications, these
  # are usually written to a file.
  sample.d <- sample.k <- numeric(N+1)

  # STEP 1: Initialise the MCMC chain
  d <- init.d; sample.d[1] <- init.d
  k <- init.k; sample.k[1] <- init.k
  ulnP <- ulnPf(d, k)
  acc.d <- 0; acc.k <- 0  # number of acceptances

  for (i in 1:N) {
    # STEP 2: Propose a new state d*
    # We use a uniform sliding window of width w with reflection
    # to propose new values d* and k*.
    # propose d* and accept/reject the proposal
    dprop <- d + runif(1, -w.d/2, w.d/2)
    # reflect if dprop is negative
    if (dprop < 0) dprop <- -dprop

    ulnPprop <- ulnPf(dprop, k)
    lnalpha <- ulnPprop - ulnP

    # STEP 3: Accept or reject the proposal
    # if ru < alpha accept the proposed d*:
    if (lnalpha > 0 || runif(1) < exp(lnalpha)) {
      d <- dprop; ulnP <- ulnPprop
      acc.d <- acc.d + 1
    }
    # else reject and stay where we are (so that nothing needs
    # to be done).

    # STEP 4: Repeat steps 2-3 for k
    # propose k* and accept/reject the proposal
    kprop <- k + runif(1, -w.k/2, w.k/2)
    # reflect if kprop is negative
    if (kprop < 0) kprop <- -kprop

    ulnPprop <- ulnPf(d, kprop)
    lnalpha <- ulnPprop - ulnP
    # if ru < alpha accept the proposed k*:
    if (lnalpha > 0 || runif(1) < exp(lnalpha)) {
      k <- kprop; ulnP <- ulnPprop
      acc.k <- acc.k + 1
    }
    # else reject and stay where we are.

    # STEP 5: Save the chain's state
    sample.d[i+1] <- d; sample.k[i+1] <- k
  }

  # print out the proportion of times
  # the proposals were accepted
  print("Acceptance proportions (d, k):")
  print(c(acc.d/N, acc.k/N))

  # return the vectors of d and k visited during the MCMC
  return(list(d=sample.d, k=sample.k))
}

# Test the run time:
system.time(mcmcf(0.2, 20, 1e4, .12, 180))  # about 0.3s
# Run again and save the MCMC output:
dk.mcmc <- mcmcf(0.2, 20, 1e4, .12, 180)

par(mfrow=c(1,3))
# trace plot of d
plot(dk.mcmc$d, ty='l', xlab="Iteration", ylab="d", main="Trace of d")
# trace plot of k
plot(dk.mcmc$k, ty='l', xlab="Iteration", ylab="k", main="Trace of k")

# We can also plot the joint sample of d and k
# (points sampled from the posterior surface)
plot(dk.mcmc$d, dk.mcmc$k, pch='.', xlab="d", ylab="k", main="Joint sample of d and k")
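# Not in the original tutorial, but useful: posterior summaries from the
# sample, discarding the first 101 states as burn-in (the cutoff is a
# judgment call; inspect the trace plots first):
mean(dk.mcmc$d[-(1:101)]); quantile(dk.mcmc$d[-(1:101)], c(0.025, 0.975))
mean(dk.mcmc$k[-(1:101)]); quantile(dk.mcmc$k[-(1:101)], c(0.025, 0.975))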

##########################################
# PART 3: Efficiency of the MCMC chain
##########################################

# Values sampled in an MCMC chain are autocorrelated because new states
# are either the previous state or a modification of it.
# The efficiency of an MCMC chain is closely related to the autocorrelation.
# Intuitively, if the autocorrelation is high, the chain will be inefficient,
# i.e. we will need to run the chain for a long time to obtain a good
# approximation to the posterior distribution.
# The efficiency of a chain is defined as:
# eff = 1 / (1 + 2(r1 + r2 + r3 + ...))
# where ri is the correlation for lag i.

# Run a very long chain (1e6 generations take about
# 40s on my MacBook Air) to calculate the efficiency
dk.mcmc2 <- mcmcf(0.2, 20, 1e6, .12, 180)

# R's acf function (for AutoCorrelation Function)
par(mfrow=c(1,2))
acf(dk.mcmc2$d)
acf(dk.mcmc2$k)

# Define the efficiency function
eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))

# the efficiencies are roughly 22% and 20% for d and k respectively:
eff(acf(dk.mcmc2$d))  # [1] 0.2255753  # mcmcf(0.2, 20, 1e7, .12, 180)
eff(acf(dk.mcmc2$k))  # [1] 0.2015054  # mcmcf(0.2, 20, 1e7, .12, 180)


# To illustrate inefficient chains, we run the MCMC again, using a proposal
# density whose step size is too large for d, and one whose step size is
# too small for k.
dk.mcmc3 <- mcmcf(0.2, 20, 1e4, 3, 5)

par(mfrow=c(1,2))
# because the proposal width for d is too large,
# the chain gets stuck at the same values of d:
plot(dk.mcmc3$d, ty='l', main="Trace of d", cex.main=2.0,
     cex.lab=1.5, cex.axis=1.5, ylab="d")

# whereas the proposal width for k is too small,
# so the chain moves slowly:
plot(dk.mcmc3$k, ty='l', main="Trace of k", cex.main=2.0,
     cex.lab=1.5, cex.axis=1.5, ylab="k")

dk.mcmc4 <- mcmcf(0.2, 20, 1e6, 3, 5)

# Efficiencies are roughly 1.5% for d, and 0.35% for k:
eff(acf(dk.mcmc4$d, lag.max=2e3))  # [1] 0.01530385  # mcmcf(0.2, 20, 1e7, 3, 5)
eff(acf(dk.mcmc4$k, lag.max=2e3))  # [1] 0.003493112  # mcmcf(0.2, 20, 1e7, 3, 5)

# plot the traces for the efficient (part 2) and inefficient chains

par(mfrow=c(2,2))

plot(dk.mcmc$d, ty='l', las=1, ylim=c(.05,.2),
     main="Trace of d, efficient chain", xlab='',
     ylab="Distance, d", cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc3$d, ty='l', las=1, ylim=c(.05,.2),
     main="Trace of d, inefficient chain", xlab='',
     ylab='', cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc$k, ty='l', las=1, ylim=c(0,100),
     main="Trace of k, efficient chain",
     xlab='', ylab="ts/tv ratio, k",
     cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc3$k, ty='l', las=1, ylim=c(0,100),
     main="Trace of k, inefficient chain",
     xlab='', ylab='', cex.main=2.0, cex.lab=1.5)

####################################
# PART 4: Checking for convergence
####################################

# We now illustrate the concept of burn-in.
# We will run a chain with a high starting value,
# and another with a low starting value.
dk.mcmc.l <- mcmcf(0.01, 20, 1e4, .12, 180)
dk.mcmc.h <- mcmcf(0.4, 20, 1e4, .12, 180)

plot(dk.mcmc.l$d, xlim=c(1,200), ylim=c(0,0.4), ty="l")
lines(dk.mcmc.h$d, col="red")

# We use the low chain to calculate the mean
# and sd of d. We could have used the high chain
# as well.
mean.d <- mean(dk.mcmc.l$d)
sd.d <- sd(dk.mcmc.l$d)
abline(h=mean.d + 2 * c(-sd.d, sd.d), lty=2)
# The horizontal dashed lines indicate approximately
# the 95% CI. Notice how the chains move from either
# the high or low starting values towards the
# stationary phase (the area within the dashed lines).
# The area before stationarity is reached is the burn-in.
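# As in the JC69 tutorial, running means give another quick convergence
# check (both chains should settle on roughly the same value):
plot(cumsum(dk.mcmc.l$d) / seq_along(dk.mcmc.l$d), ty='l', ylab="running mean of d")
lines(cumsum(dk.mcmc.h$d) / seq_along(dk.mcmc.h$d), col="red")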

# An efficient chain (good proposal step sizes)
dk.mcmc.b <- mcmcf(0.05, 5, 1e4, .12, 180)
# An inefficient chain (bad proposal step sizes)
dk.mcmc3.b <- mcmcf(0.05, 5, 1e4, 3, 5)

# plot and compare the histograms
par(mfrow=c(1,2))
bks <- seq(from=0, to=150, by=1)

hist(dk.mcmc.b$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(0, 0, 1, .5), las=1, xlab="kappa",
     xlim=c(0,100), ylim=c(0,.055))

hist(dk.mcmc$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(.5, .5, .5, .5), add=TRUE)

hist(dk.mcmc3.b$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(0, 0, 1, .5), las=1, xlab="kappa",
     xlim=c(0,100), ylim=c(0,.055))

hist(dk.mcmc3$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(.5, .5, .5, .5), add=TRUE)

# posterior means (similar for the efficient chains):
mean(dk.mcmc$d); mean(dk.mcmc.b$d)
mean(dk.mcmc$k); mean(dk.mcmc.b$k)

# posterior means (not so similar for the inefficient chains):
mean(dk.mcmc3$d); mean(dk.mcmc3.b$d)
mean(dk.mcmc3$k); mean(dk.mcmc3.b$k)

# efficient chain, standard errors of the means:
sqrt(1/1e4 * var(dk.mcmc$d) / 0.23)  # roughly 2.5e-4
sqrt(1/1e4 * var(dk.mcmc$k) / 0.20)  # roughly 0.2

# inefficient chain, standard errors of the means:
sqrt(1/1e4 * var(dk.mcmc3$d) / 0.015)  # roughly 9.7e-4
sqrt(1/1e4 * var(dk.mcmc3$k) / 0.003)  # roughly 1.6

# plot densities (smoothed histograms) for the efficient and inefficient chains
par(mfrow=c(1,2)); adj <- 1.5
# Efficient chains:
plot(density(dk.mcmc.b$k, adj=adj), col="blue", las=1,
     xlim=c(0, 100), ylim=c(0, .05), xaxs="i", yaxs="i")
lines(density(dk.mcmc$k, adj=adj), col="black")

# Inefficient chains:
plot(density(dk.mcmc3.b$k, adj=adj), col="blue", las=1,
     xlim=c(0, 100), ylim=c(0, .05), xaxs="i", yaxs="i")
lines(density(dk.mcmc3$k, adj=adj), col="black")

--------------------------------------------------------------------------------
/K80/surf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thednainus/Bayesian_tutorial/e542d9281a7e08986bc5accf975882b499c31980/K80/surf.png
--------------------------------------------------------------------------------
/K80/traces.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thednainus/Bayesian_tutorial/e542d9281a7e08986bc5accf975882b499c31980/K80/traces.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright 2017 Mario dos Reis, Fabricia Nascimento, and Ziheng Yang

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Bayesian MCMC phylogenetics tutorial in R
Fabricia F. Nascimento, Mario dos Reis and Ziheng Yang

This repository contains two tutorials that guide the user through writing simple MCMC phylogenetics software to estimate the molecular distance (and the transition/transversion ratio) for a pairwise sequence alignment under the Jukes and Cantor (1969) (and Kimura's 1980) substitution models. The tutorials introduce concepts such as burn-in, mixing, convergence, efficiency and autocorrelation of the MCMC chain.

Directory `JC69/` contains the MCMC tutorial to calculate the molecular distance under the JC69 model. The directory contains three files. File `mcmc.JCd.R` contains the main R code with exercises. File `mcmc.JCrt.R` contains the solution to exercise 7 in the previous file. File `BayesianMCMC.pdf` contains a more detailed explanation of the theory used in the tutorial.

Directory `K80/` contains the MCMC tutorial to calculate the molecular distance and the ts/tv ratio under the K80 model. The tutorial is similar to the JC69 one, but focuses on a two-parameter MCMC instead. File `mcmc.K80.R` contains the main R code. A detailed step-by-step explanation of the R code is given at:

https://thednainus.wordpress.com/2017/03/03/tutorial-bayesian-mcmc-phylogenetics-using-r/

In the K80 tutorial the user will be able to reproduce the plots in our review:

**A biologist's guide to Bayesian phylogenetic analysis**
Nascimento FF, dos Reis M, Yang Z. (2017) [Nature Ecology and Evolution, 1:1446-1454.](http://rdcu.be/v4jM)

For example, the plot below shows the prior, likelihood and posterior surfaces for the distance, d, and the ts/tv ratio, k, for the K80 example:

![](K80/surf.png)

And these are the traces for d and k obtained from an MCMC algorithm written to sample from the posterior surface above:

![](K80/traces.png)

We hope you enjoy the tutorials. If you have any questions or comments, please let us know.

Fabricia, Mario and Ziheng.
--------------------------------------------------------------------------------