├── JC69
│   ├── BayesianMCMC.pdf
│   ├── mcmc.JCd.R
│   └── mcmc.JCrt.R
├── K80
│   ├── mcmc.K80.R
│   ├── surf.png
│   └── traces.png
├── LICENSE
└── README.md

/JC69/BayesianMCMC.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thednainus/Bayesian_tutorial/e542d9281a7e08986bc5accf975882b499c31980/JC69/BayesianMCMC.pdf
--------------------------------------------------------------------------------
/JC69/mcmc.JCd.R:
--------------------------------------------------------------------------------
# A Bayesian phylogenetics inference program
# July 2015
# Authors: Mario dos Reis and Ziheng Yang
# This script was previously published at
# http://abacus.gene.ucl.ac.uk/teach/BayesianMCMCintro/MCMC.JCd.R


rm(list=ls())
# ##########################################
# (A) Simple example, calculated analytically
# ##########################################

# The data are a pairwise sequence alignment of the 12S rRNA gene sequences
# from human and orangutan.
# * The alignment is n = 948 nucleotides long, with x = 90 differences between the sequences.
# (Example 7.1, p.216, Yang 2014 Molecular Evolution: A Statistical Approach)

# Our aim is to estimate the molecular distance d, in substitutions per site,
# between the two sequences. Because the two species are closely related (i.e.,
# the alignment has less than 10% divergence), we use the Jukes and Cantor (1969)
# nucleotide substitution model (JC69).

# The Bayesian estimate of d is given by the posterior distribution:
# p(d | x) = C * p(d) * p(x | d).

# p(d) is the prior on the distance.
# p(x | d) is the (Jukes-Cantor) likelihood, i.e. the probability of observing
# the data x given d, and C is a normalising constant, C = 1 / ∫ p(d) p(x | d) dd

# likelihood under the JC69 model, i.e. L = p(x | d):
L <- function(d, x, n) (3/4 - 3*exp(-4*d/3)/4)^x * (1/4 + 3*exp(-4*d/3)/4)^(n - x)

# plot the likelihood:
curve(L(x, 90, 948), from=0, to=0.2, n=500, xlab="distance, d", ylab="likelihood, p(x | d)")

# The maximum likelihood estimate is the highest point on the curve:
d.mle <- -3/4 * log(1 - 4/3 * 90/948)  # [1] 0.1015060
abline(v=d.mle, lty=2)
title(expression(hat(d) == 0.1015))

# For the prior on d, we use an exponential distribution with mean mu = 0.2:
# p(d) = exp(-d/mu) / mu; mu = 0.2.
d.prior <- function(d, mu) exp(-d/mu) / mu

# un-normalised posterior:
d.upost <- function(d, x, n, mu) d.prior(d, mu=mu) * L(d, x=x, n=n)

# To calculate the normalising constant C, we need to integrate the un-normalised posterior:
integrate(d.upost, lower=0, upper=Inf, x=90, n=948, mu=0.2, abs.tol=0)
# 5.167762e-131 with absolute error < 1.9e-135
C <- 1 / 5.167762e-131

# The posterior distribution function (when x=90, n=948 and mu=0.2) is thus
d.post <- function(d) C * d.upost(d=d, x=90, n=948, mu=0.2)

# Plot the posterior and add the prior:
curve(d.post(x), from=0, to=0.2, n=500, xlab="d", ylab="density")
curve(d.prior(x, mu=0.2), add=TRUE, lty=2)

# EXERCISE 1: Try adding the likelihood curve to the plots above.
# HINT: You may want to scale the likelihood by a constant like C * 3
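# One possible solution, following the hint (the factor 3 * C is just a
# convenient visual scaling that brings the likelihood curve up to the
# height of the posterior; any similar constant would do):
curve(3 * C * L(x, 90, 948), add=TRUE, col="blue")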

# The posterior mean of d is found by integrating ∫ d * p(d | x) dd
f <- function(d) d * d.post(d)
integrate(f, lower=0, upper=Inf, abs.tol=0)
# 0.1021246 with absolute error < 5.5e-06
# Add it to the plot:
abline(v=0.1021246, lty=3)  # posterior mean

# Add the MLE:
abline(v=d.mle, lty=3, col="red")

# The posterior distribution can be used to answer questions such as:
# what is the probability that d > 0.1?
integrate(d.post, lower=0.1, upper=Inf)
# 0.5632372 with absolute error < 1.6e-08, i.e. 56.3%

# The 95% equal-tail credibility interval is (0.08191, 0.12463)
integrate(d.post, lower=0, upper=0.08191)  # 2.5%
integrate(d.post, lower=0, upper=0.12463)  # 97.5%

# EXERCISE 2 (VERY HARD): Can you work out how to calculate the 99% CI?
# HINT: You need to find the upper bounds where the integral is 0.005
# and 0.995 respectively. The integrate and optim functions in R may be useful.


# ###################################################################
# (B) The same example analysed using Markov chain Monte Carlo (MCMC)
# ###################################################################

# We now obtain the posterior distribution by MCMC sampling.
# In most practical problems, the constant C cannot be calculated (either
# analytically or numerically), and so MCMC becomes necessary.

# To avoid numerical problems, we want to calculate the likelihood, prior and
# posterior on the log scale, so we define and use the following functions:
lnL <- function(d, x, n) x*log(3/4 - 3*exp(-4*d/3)/4) + (n - x)*log(1/4 + 3*exp(-4*d/3)/4)
d.lnprior <- function(d, mu) -d/mu - log(mu)  # the log(mu) term is constant & can be deleted
d.lnpost <- function(d, x, n, mu) d.lnprior(d, mu=mu) + lnL(d, x=x, n=n)  # un-normalised


# Draft MCMC algorithm:
# 1. Set the initial state for d.
# 2. Propose a new state d* (from an appropriate proposal density).
# 3. Accept or reject the proposal with probability
#    min(1, p(d*)p(x|d*) / p(d)p(x|d))
#    If the proposal is accepted set d=d*, otherwise d=d.
# 4. Save d.
# 5. Go to step 2.
d.mcmcf <- function(init.d, N, w) {
  # init.d is the initial state
  # N is the number of 'generations' the algorithm is run for.
  # w is the 'width' of the proposal density.
  d <- numeric(N+1)  # Here we will keep the visited values of d.
  d[1] <- init.d
  dnow <- init.d
  acc.prop <- 0  # acceptance proportion
  for (i in 1:N) {
    # here we use a uniform density with reflection to propose a new d*
    dnew <- dnow + runif(1, -w/2, w/2)
    #dnew <- dnow + rnorm(1, 0, w)  # a normal proposal, using w as the SD.

    if (dnew < 0) dnew <- -dnew

    lnalpha <- d.lnpost(dnew, x=90, n=948, mu=0.2) - d.lnpost(dnow, x=90, n=948, mu=0.2)
    # if ru < alpha accept the proposal:
    if (lnalpha > 0 || runif(1, 0, 1) < exp(lnalpha)) { dnow <- dnew; acc.prop <- acc.prop + 1 }
    # else reject it, so that dnow stays as dnow.
    d[i+1] <- dnow
    # Can you think of a way to print the progress of the MCMC to the screen?
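    # For example (one simple option, left commented out so it does not
    # clutter the output of short runs):
    # if (i %% 100 == 0) cat("generation:", i, " d:", dnow, "\n")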
  }
  # print out the proportion of times the proposal was accepted
  print(c("Acceptance proportion:", acc.prop/N))
  return(d)  # return the vector of d values visited during the MCMC
}

# Test the algorithm
d.1 <- d.mcmcf(0.5, 100, 0.08)

# plot the output:
plot(d.1, type='p', xlab="generation", ylab="d")  # This is the 'trace' plot of d.
# The chain wiggles about, spending time at different values of d in
# proportion to their posterior density.

# Add the 95% CI as a reference:
abline(h=c(0.08191, 0.12463), lty=2, col="red")

# It may be useful to calculate the time it takes to run N generations:
system.time(d.mcmcf(0.5, 10000, 0.08))
# ~ 0.6s on my laptop, which means 1e6 generations would take ~60s, or 1 min.


# EXERCISE 2a: Note that each MCMC iteration calls the function for calculating the
# log posterior twice, for dnow and dnew. This wastes computation, since the value
# for dnow was already calculated in the previous iteration. Modify the code (by
# recording the log posterior for dnow) so that you only need to calculate the value
# for dnew in every iteration. Run long chains to see whether this reduces the
# running time.
# Typically, C code runs at least 100 times faster than R code for the same algorithm.


# Get a very long sample:
d.long <- d.mcmcf(0.5, 10000, 0.08)
plot(d.long, type='l')
# The trace plot looks like a dense, hairy caterpillar!
# This is good!

# Remove the initial state and burn-in
d.long <- d.long[-(1:101)]

hist(d.long, prob=TRUE, n=50, xlab="d", main="")  # The sample histogram from the MCMC
curve(d.post(x), add=TRUE, col="red", lwd=2)  # The true posterior distribution.

# Alternatively:
plot(density(d.long, adj=1.5)); rug(d.long)
curve(d.post(x), add=TRUE, col="red", lwd=2)

# EXERCISE 3: What does adj in the density function do? Try different values
# such as adj=0.1 or adj=1.0.
# How long (how many generations, N) do you need to run the MCMC so that
# the density plot is almost identical to the true distribution?

# Now we can get the posterior mean and 95% CI from the sample:
mean(d.long)  # should be close to 0.1021246
quantile(d.long, prob=c(0.025, 0.975))  # should be close to (0.08191, 0.12463)

# Effect of the step size, w:
d.1 <- d.mcmcf(0.5, 100, w=0.08)  # moderate step size
plot(d.1, type='b', col="black", pch=19, cex=.5)  # chain moves about

d.2 <- d.mcmcf(0.5, 100, w=0.01)  # tiny step size
lines(d.2, type='b', col="green", pch=19, cex=.5)  # baby steps, chain moves slowly

d.3 <- d.mcmcf(0.5, 100, w=1)  # large step size
lines(d.3, type='b', col="blue", pch=19, cex=.5)  # chain gets stuck easily

# EXERCISE 4: The optimal acceptance proportion is about 40%. Try various step
# sizes, w, and see the effect on the acceptance proportion (one possible sketch
# follows below). What is the optimal w?
# Finding the best step size is called fine-tuning.
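# One possible sketch for EXERCISE 4: loop over a few candidate step sizes
# (an arbitrary grid) and compare the printed acceptance proportions:
for (w in c(0.005, 0.02, 0.08, 0.32, 1.28)) d.mcmcf(0.1, 1e4, w)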

# ###################################################################
# (C) Autocorrelation, efficiency, effective sample size and thinning
# ###################################################################

# States in an MCMC sample are autocorrelated, because each new state is
# either the previous state or a modification of it.

x <- d.long[-length(d.long)]  # remove the last state
y <- d.long[-1]  # remove the first state
plot(x, y, xlab="current state", ylab="next state", main="lag, k=1")
cor(x, y)  # this is the autocorrelation of the d sample at lag 1, ~50-60%

# We can use the acf function to calculate the autocorrelation for various lags:
d.acf <- acf(d.long)  # The correlation is lost after a lag of about k=10
d.acf$acf  # the actual acf values are here.

# If the autocorrelation is high, the chain will be inefficient,
# i.e. we will need to run the chain for a long time to obtain a good
# approximation to the posterior distribution.
# The efficiency of a chain is defined as:
# eff = 1 / (1 + 2(r1 + r2 + r3 + ...))
# where r_k is the correlation for lag k.

eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))
eff(d.acf)  # ~ 30%. What does that mean?

# Efficiency is related to the effective sample size (ESS) by: ESS = N * eff
N <- length(d.long)  # 9,900
length(d.long) * eff(d.acf)  # ~ 2,700
# That is, our MCMC sample of size 9,900 is as informative about d as an
# independent sample of size ~2,700. In other words, 30% efficiency means
# the estimate (of the posterior mean) from an MCMC sample of size N has the
# same variance as the estimate from an independent sample of size N*0.3.


# EXERCISE 5: Run a long chain (say N=1e4) with step size w=0.01.
# Calculate the efficiency and the ESS.
# Do it at home: Try different values of w, and plot eff against w.
# What value of w gives the highest eff?


# Usually, to save hard drive space, the MCMC is 'thinned', i.e. rather than
# saving every single value sampled, we only save every m-th value. The resulting
# sample will be smaller (thus saving hard drive space) but still has nearly as
# much information as the full sample. Usually we modify the MCMC code to do this.
# Here we reduce the sample to ~1/4 of the original size:
d.reduced <- d.long[seq(from=1, to=N, by=4)]
mean(d.reduced)  # still close to 0.1021246
quantile(d.reduced, c(0.025, 0.975))  # still close to (0.08191, 0.12463)


# ###############################################################
# (D) MCMC diagnostics. How do we know our results are reliable?
# ###############################################################

# (1) Run the chain from different starting values; the chains
# should converge to the same region of parameter space:
d.high <- d.mcmcf(0.4, 1e2, 0.1)
plot(d.high, col="red", ty='l', ylim=c(0, 0.4))

d.middle <- d.mcmcf(0.1, 1e2, 0.1)
lines(d.middle, col="black", ty='l')

d.low <- d.mcmcf(0, 1e2, 0.1)
lines(d.low, col="blue", ty='l')

abline(h=c(0.08191, 0.12463), lty=2)

# (2) Compare the histograms and summary statistics obtained
# from several long, independent runs:
d.long1 <- d.mcmcf(0.4, 1e3, 0.1)[-(1:101)]
d.long2 <- d.mcmcf(0.0, 1e3, 0.1)[-(1:101)]

plot(density(d.long1, adj=1), col="red")
lines(density(d.long2, adj=1), col="blue")

mean(d.long1); mean(d.long2)
# relative difference (%)
abs(mean(d.long1) - mean(d.long2)) / mean(c(d.long1, d.long2)) * 100

# It is extremely important that you run all your MCMC analyses at least twice!
# You cannot assess the robustness of an MCMC analysis with only one run!

# Try again using N=1e5 (this will take a little while to finish)
d.long1 <- d.mcmcf(0.4, 1e5, 0.1)[-(1:101)]
d.long2 <- d.mcmcf(0.4, 1e5, 0.1)[-(1:101)]

# (3) Calculate the ESS. Ideally, we should aim for ESS > 1,000, when
# summary statistics such as the posterior mean and 95% CI can be
# accurately estimated. ESS > 10,000 offers superb performance, but this
# is rarely achieved in phylogenetic problems. ESS < 100 is poor:
length(d.long1) * eff(acf(d.long1))  # > 22,000 (nice!)

# (4) Plot the running means:
rm1 <- cumsum(d.long1) / (1:length(d.long1))
rm2 <- cumsum(d.long2) / (1:length(d.long2))
plot(rm1, ty='l', ylab="posterior mean", xlab="generation"); lines(rm2, col="red")

# There are more diagnostic tests and statistics that can be applied, for
# example the potential scale reduction statistic
# (p.242, Yang 2014 Molecular Evolution: A Statistical Approach).

# R packages such as CODA have been written to perform MCMC
# diagnostics, and they may prove quite useful.


# #################################################
# (E) MCMC in 2 dimensions - Molecular clock dating
# #################################################

# Now consider estimating the time of divergence between human and orangutan
# using the data above, x=90, n=948.
# The molecular distance is the product of geological time, t, and the
# molecular rate, r: d = r * t.

# Our JC69 log-likelihood as a function of r and t is:
lnL.rt <- function(r, t, x, n) lnL(d=r*t, x=x, n=n)

# The log-likelihood is now a 2D surface:
r <- seq(from=2e-3, to=6e-3, len=50)
t <- seq(from=10, to=40, len=50)

# create a table of all combinations of r and t values:
rt <- expand.grid(r=r, t=t)

z <- lnL.rt(r=rt$r, t=rt$t, x=90, n=948)

# The log-likelihood surface
image(r, t, matrix(z, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z, ncol=50), nlevels=5, add=TRUE)

# The log-likelihood looks like a flat, curved, mountainous ridge.
# The top of the ridge is located at d.mle = 0.1015060 = r * t
curve(d.mle/x, add=TRUE, col="blue", lwd=2, lty=2)

# The molecular data are informative about d only, but not about r or t
# separately. We say that r and t are confounded parameters. Any combination
# of r and t that satisfies 0.1015060 = r * t is an ML solution.

# We can make a neat 3D plot of the log-likelihood:
persp(r, t, matrix(z, ncol=50), zlab="log-likelihood", theta=40, phi=40)

# The Bayesian method provides a useful way to estimate t and r, through
# the use of priors.
# The fossil record suggests that the common ancestor of human and orangutan
# lived 33.7-11.2 Ma (see Benton et al. 2009 in the Timetree of Life book).
# Thus the prior on t is a normal distribution with mean 22.45 (the midpoint
# of the fossil calibration) and sd 5.6, so that the 95% range is roughly
# 11.2 to 33.7.
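# A quick check of that claim: the 2.5% and 97.5% quantiles of the
# N(22.45, 5.6^2) prior should be close to the fossil bounds:
qnorm(c(0.025, 0.975), mean=22.45, sd=5.6)  # roughly 11.5 and 33.4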
t.lnprior <- function(t, mu=22.45, sd=5.6) -((t-mu)/sd)^2 / 2  # additive constant ignored
curve(t.lnprior(x), from=5, to=40, n=5e2, xlab="t (Ma)", ylab="log density")

# For the rate, r, we use an exponential distribution with mean
# mu = 0.10 / 22.45 = 0.00445 (based on the distance MLE and the midpoint
# of the calibration)
r.lnprior <- function(r, mu=0.10/22.45) -r/mu - log(mu)
curve(r.lnprior(x), from=0, to=.02, xlab="r (s/My)", ylab="log density")

# We can plot the log joint prior:
z1 <- r.lnprior(rt$r) + t.lnprior(rt$t)
image(r, t, matrix(z1, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z1, ncol=50), nlevels=5, add=TRUE)

# The log un-normalised posterior is thus
rt.lnpost <- function(r, t, x=90, n=948) r.lnprior(r) + t.lnprior(t) + lnL.rt(r, t, x, n)

# We can plot the un-normalised posterior (on the log scale):
z2 <- rt.lnpost(rt$r, rt$t)
image(r, t, matrix(z2, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z2, ncol=50), nlevels=5, add=TRUE)
curve(d.mle/x, add=TRUE, col="blue", lty=2, lwd=2)

# EXERCISE 6: Can you plot the un-normalised posterior as a 3D surface using persp?

# We are now ready to build a two-parameter MCMC!

# The draft algorithm is as follows:
# 1. Set initial states for t and r.
# 2a. Propose a new state t* (from a uniform sliding window of width wt, for example).
# 2b. Accept or reject the proposal with probability
#     min(1, p(r)p(t*)p(x|d*) / p(r)p(t)p(x|d))
#     If the proposal is accepted set t=t*, otherwise t=t.
# 3a. Propose a new state r* (from a uniform sliding window of width wr, for example).
# 3b. Accept or reject the proposal with probability
#     min(1, p(r*)p(t)p(x|d*) / p(r)p(t)p(x|d))
#     If the proposal is accepted set r=r*, otherwise r=r.
# 4. Save r and t.
# 5. Go to step 2.


# EXERCISE 7: Create a function called rt.mcmcf based on the d.mcmcf function.
# Run your newly written phylogenetics MCMC program! You will need two step sizes,
# wt and wr. Vary the step sizes and check the performance of the
# MCMC chain. Run the analysis several times, and calculate posterior means and
# 95% CIs for r and t. Calculate the efficiency and ESS. A 2D density estimate of
# the MCMC posterior sample can be obtained using the kde2d function in the MASS
# package.

# EXERCISE 8: Try using a normal proposal density (rather than uniform) with
# reflection for r and t. Does it make a difference?

# EXERCISE 9 (VERY HARD): The normalising constant in this case is a double integral
# C = 1 / ∫∫ p(r) p(t) p(x | d) dr dt
# Can you calculate C using R's cubature package? (The normal integrate function
# does not work; you need a special multidimensional integration package.)
# The cubature package is not a default package, so you need to use install.packages
# to get it. One possible approach is sketched below.
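# A rough sketch for EXERCISE 9 (one possible approach, not the only one).
# It assumes the cubature package is installed, and integrates over a finite
# box that contains essentially all the posterior mass; the log posterior is
# rescaled by a central value to avoid numerical underflow. Note that the
# dropped prior constants mean C here is correct only up to those constants.
library(cubature)
lnP0 <- rt.lnpost(r=0.0045, t=22.45)  # reference value for rescaling
upost <- function(v) exp(rt.lnpost(r=v[1], t=v[2]) - lnP0)  # v = c(r, t)
int <- adaptIntegrate(upost, lowerLimit=c(0, 0), upperLimit=c(0.05, 100))
1 / (int$integral * exp(lnP0))  # approximate normalising constant, C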
--------------------------------------------------------------------------------
/JC69/mcmc.JCrt.R:
--------------------------------------------------------------------------------
# Solution to EXERCISE 7:
lnL <- function(d, x, n) x*log(3/4 - 3*exp(-4*d/3)/4) + (n - x)*log(1/4 + 3*exp(-4*d/3)/4)
r.lnprior <- function(r, mu=0.10/22.45) -r/mu  # -log(mu) term deleted
t.lnprior <- function(t, mu=22.45, sd=5.6) -((t-mu)/sd)^2 / 2  # additive constant deleted
lnL.rt <- function(r, t, x, n) lnL(d=r*t, x=x, n=n)
rt.lnpost <- function(r, t, x=90, n=948) r.lnprior(r) + t.lnprior(t) + lnL.rt(r, t, x, n)


rt.mcmcf <- function(init.r, init.t, N, w.r, w.t) {
  # init.r and init.t are the initial states
  # N is the number of 'generations' the algorithm is run for.
  # w.r and w.t are the step sizes of the proposal densities.
  r <- t <- numeric(N+1)
  r[1] <- init.r; t[1] <- init.t
  rnow <- init.r; tnow <- init.t
  ap.r <- ap.t <- 0  # acceptance proportions
  for (i in 1:N) {
    # Propose, and accept or reject, a new r:
    rnew <- rnow + runif(1, -w.r/2, w.r/2)
    if (rnew < 0) rnew <- -rnew
    lnalpha <- rt.lnpost(rnew, tnow, x=90, n=948) - rt.lnpost(rnow, tnow, x=90, n=948)
    # if ru < alpha accept the proposal:
    if (lnalpha > 0 || runif(1, 0, 1) < exp(lnalpha)) { rnow <- rnew; ap.r <- ap.r + 1 }
    # else reject it, so that rnow stays as rnow.

    # Propose, and accept or reject, a new t:
    tnew <- tnow + runif(1, -w.t/2, w.t/2)
    if (tnew < 0) tnew <- -tnew
    lnalpha <- rt.lnpost(rnow, tnew, x=90, n=948) - rt.lnpost(rnow, tnow, x=90, n=948)
    if (lnalpha > 0 || runif(1, 0, 1) < exp(lnalpha)) { tnow <- tnew; ap.t <- ap.t + 1 }
    # else reject it, so that tnow stays as tnow.
    r[i+1] <- rnow; t[i+1] <- tnow  # take the sample
  }
  # print out the acceptance proportions
  print(c(ap.r/N, ap.t/N))
  return(list(r=r, t=t))  # return the vectors of r and t visited during the MCMC
}

# Test the chain:
rt.1 <- rt.mcmcf(0.01, 40, 1e2, 0.001, 10)

plot(rt.1$r, rt.1$t, ty='b', pch=19, cex=.5, xlab="rate", ylab="time")

# Do a longer run and fine-tune:
rt.2 <- rt.mcmcf(0.01, 40, 1e4, 0.001, 10)
# If an acceptance proportion is high, the step size is too small and needs to be
# increased. On the other hand, if the acceptance proportion is low, decrease the
# step size. What step sizes above lead to roughly 30% acceptance proportions for
# both parameters?

# Look at the traces:
par(mfrow=c(2,1))
plot(rt.2$r, ty='l', main="rate")
plot(rt.2$t, ty='l', main="time")

require(MASS)  # This package has additional statistical functions
zz <- kde2d(rt.2$r, rt.2$t)
par(mfrow=c(1,1))
image(zz, xlab="rate", ylab="time")
contour(zz, add=TRUE)

# Add the first few points to see the MCMC's progress:
lines(rt.2$r[1:200], rt.2$t[1:200], ty='b', pch=19, cex=.5, col="blue")
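# EXERCISE 7 also asks for the efficiency and ESS. One possible way is to
# reuse the eff function from mcmc.JCd.R on each parameter's sample:
eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))
length(rt.2$r) * eff(acf(rt.2$r, plot=FALSE))  # ESS for the rate
length(rt.2$t) * eff(acf(rt.2$t, plot=FALSE))  # ESS for the time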
--------------------------------------------------------------------------------
/K80/mcmc.K80.R:
--------------------------------------------------------------------------------
# Clean up your R space
rm(list=ls())

#######################
# PART 1: Introduction
#######################

# The data are the 12S rRNA alignment of human and orangutan, with 948 base pairs
# and 90 differences (84 transitions and 6 transversions).
# Example data from Table 1.3, p.7 of Yang (2014) Molecular Evolution: A Statistical
# Approach. Oxford University Press.

n <- 948  # length of the alignment in bp
ns <- 84  # total number of transitions (23+30+10+21)
nv <- 6   # total number of transversions (1+0+2+0+2+1+0+0)

# log-likelihood function, f(D|d,k), using Kimura's (1980) substitution model;
# see p.8 in Yang (2014)

k80.lnL <- function(d, k, n=948, ns=84, nv=6) {

  p0 <- .25 + .25 * exp(-4*d/(k+2)) + .5 * exp(-2*d*(k+1)/(k+2))
  p1 <- .25 + .25 * exp(-4*d/(k+2)) - .5 * exp(-2*d*(k+1)/(k+2))
  p2 <- .25 - .25 * exp(-4*d/(k+2))

  return((n - ns - nv) * log(p0/4) +
         ns * log(p1/4) + nv * log(p2/4))
}

dim <- 100  # dimension for the plot
d.v <- seq(from=0, to=0.3, len=dim)  # vector of d values
k.v <- seq(from=0, to=100, len=dim)  # vector of k values
dk <- expand.grid(d=d.v, k=k.v)

par(mfrow=c(1, 3))

# Prior surface, f(d)f(k)
Pri <- matrix(dgamma(dk$d, 2, 20) * dgamma(dk$k, 2, .1),
              ncol=dim)

image(d.v, k.v, -Pri, las=1, col=heat.colors(50),
      main="Prior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pri, nlev=10, drawlab=FALSE, add=TRUE)

# Likelihood surface, f(D|d,k)
lnL <- matrix(k80.lnL(d=dk$d, k=dk$k), ncol=dim)

# for numerical reasons, we scale the likelihood to be 1
# at the maximum, i.e. we subtract max(lnL)
L <- exp(lnL - max(lnL))

image(d.v, k.v, -L, las=1, col=heat.colors(50),
      main="Likelihood", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, L, nlev=10,
        drawlab=FALSE, add=TRUE)  # creates a contour plot

# Unscaled posterior surface, f(d)f(k)f(D|d,k)
Pos <- Pri * L

image(d.v, k.v, -Pos, las=1, col=heat.colors(50),
      main="Posterior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pos, nlev=10,
        drawlab=FALSE, add=TRUE)


##########################################
# PART 2: Markov Chain Monte Carlo (MCMC)
##########################################

# We now obtain the posterior distribution by MCMC sampling.
# In most practical problems, the normalising constant cannot be calculated
# (either analytically or numerically), and so the MCMC algorithm becomes
# necessary.

# Function that returns the logarithm of the unscaled posterior:
# f(d) * f(k) * f(D|d,k)
# By default we set the priors as:
# f(d) = Gamma(d | 2, 20) and f(k) = Gamma(k | 2, .1)
ulnPf <- function(d, k, n=948, ns=84, nv=6, a.d=2, b.d=20, a.k=2, b.k=.1) {
  # The normalising constants in the prior densities can be ignored:
  lnpriord <- (a.d - 1)*log(d) - b.d * d
  lnpriork <- (a.k - 1)*log(k) - b.k * k

  # log-likelihood (K80 model):
  expd1 <- exp(-4*d/(k+2))
  expd2 <- exp(-2*d*(k+1)/(k+2))
  p0 <- .25 + .25 * expd1 + .5 * expd2
  p1 <- .25 + .25 * expd1 - .5 * expd2
  p2 <- .25 - .25 * expd1
  lnL <- (n - ns - nv) * log(p0/4) + ns * log(p1/4) + nv * log(p2/4)

  # Return the log un-normalised posterior:
  return(lnpriord + lnpriork + lnL)
}

# Draft MCMC algorithm:
# 1. Set initial states for d and k.
# 2. Propose a new state d* (from an appropriate proposal density).
# 3. Accept or reject the proposal with probability
#    min(1, p(d*)p(x|d*) / p(d)p(x|d))
#    If the proposal is accepted set d=d*, otherwise d=d.
# 4. Save d.
# 5. Repeat steps 2-4 for k.
# 6. Go to step 2.

mcmcf <- function(init.d, init.k, N, w.d, w.k) {
  # init.d and init.k are the initial states
  # N is the number of MCMC iterations.
  # w.d is the width of the sliding-window proposal for d.
  # w.k is the width of the sliding-window proposal for k.

  # We keep the visited states (d, k) in sample.d and sample.k
  # for easy plotting. In practical MCMC applications, these
  # are usually written to a file.
  sample.d <- sample.k <- numeric(N+1)

  # STEP 1: Initialise the MCMC chain
  d <- init.d; sample.d[1] <- init.d
  k <- init.k; sample.k[1] <- init.k
  ulnP <- ulnPf(d, k)
  acc.d <- 0; acc.k <- 0  # number of acceptances

  for (i in 1:N) {
    # STEP 2: Propose a new state d*
    # We use a uniform sliding window of width w with reflection
    # to propose new values d* and k*.
    # propose d* and accept/reject the proposal
    dprop <- d + runif(1, -w.d/2, w.d/2)
    # reflect if dprop is negative
    if (dprop < 0) dprop <- -dprop

    ulnPprop <- ulnPf(dprop, k)
    lnalpha <- ulnPprop - ulnP

    # STEP 3: Accept or reject the proposal
    # if ru < alpha accept the proposed d*:
    if (lnalpha > 0 || runif(1) < exp(lnalpha)) {
      d <- dprop; ulnP <- ulnPprop
      acc.d <- acc.d + 1
    }
    # else reject and stay where we are (so that nothing needs
    # to be done).

    # STEP 4: Repeat steps 2-3 for k
    # propose k* and accept/reject the proposal
    kprop <- k + runif(1, -w.k/2, w.k/2)
    # reflect if kprop is negative
    if (kprop < 0) kprop <- -kprop

    ulnPprop <- ulnPf(d, kprop)
    lnalpha <- ulnPprop - ulnP
    # if ru < alpha accept the proposed k*:
    if (lnalpha > 0 || runif(1) < exp(lnalpha)) {
      k <- kprop; ulnP <- ulnPprop
      acc.k <- acc.k + 1
    }
    # else reject and stay where we are.

    # STEP 5: Save the chain's state
    sample.d[i+1] <- d; sample.k[i+1] <- k
  }

  # print out the proportion of times
  # the proposals were accepted
  print("Acceptance proportions (d, k):")
  print(c(acc.d/N, acc.k/N))

  # return the vectors of d and k visited during the MCMC
  return(list(d=sample.d, k=sample.k))
}

# Test the run time:
system.time(mcmcf(0.2, 20, 1e4, .12, 180))  # about 0.3s
# Run again and save the MCMC output:
dk.mcmc <- mcmcf(0.2, 20, 1e4, .12, 180)

par(mfrow=c(1,3))
# trace plot of d
plot(dk.mcmc$d, ty='l', xlab="Iteration", ylab="d", main="Trace of d")
# trace plot of k
plot(dk.mcmc$k, ty='l', xlab="Iteration", ylab="k", main="Trace of k")

# We can also plot the joint sample of d and k
# (points sampled from the posterior surface)
plot(dk.mcmc$d, dk.mcmc$k, pch='.', xlab="d", ylab="k", main="Joint sample of d and k")
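# Not in the original tutorial, but useful: posterior summaries from the
# sample, discarding the first 101 states as burn-in (the cutoff is a
# judgment call; inspect the trace plots first):
mean(dk.mcmc$d[-(1:101)]); quantile(dk.mcmc$d[-(1:101)], c(0.025, 0.975))
mean(dk.mcmc$k[-(1:101)]); quantile(dk.mcmc$k[-(1:101)], c(0.025, 0.975))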

##########################################
# PART 3: Efficiency of the MCMC chain
##########################################

# Values sampled in an MCMC chain are autocorrelated because new states
# are either the previous state or a modification of it.
# The efficiency of an MCMC chain is closely related to the autocorrelation.
# Intuitively, if the autocorrelation is high, the chain will be inefficient,
# i.e. we will need to run the chain for a long time to obtain a good
# approximation to the posterior distribution.
# The efficiency of a chain is defined as:
# eff = 1 / (1 + 2(r1 + r2 + r3 + ...))
# where ri is the correlation for lag i.

# Run a very long chain (1e6 generations take about
# 40s on my MacBook Air) to calculate the efficiency
dk.mcmc2 <- mcmcf(0.2, 20, 1e6, .12, 180)

# R's acf function (for AutoCorrelation Function)
par(mfrow=c(1,2))
acf(dk.mcmc2$d)
acf(dk.mcmc2$k)

# Define the efficiency function
eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))

# the efficiencies are roughly 22% and 20% for d and k respectively:
eff(acf(dk.mcmc2$d))  # [1] 0.2255753  # mcmcf(0.2, 20, 1e7, .12, 180)
eff(acf(dk.mcmc2$k))  # [1] 0.2015054  # mcmcf(0.2, 20, 1e7, .12, 180)


# To illustrate inefficient chains, we run the MCMC again, using a proposal
# density whose step size is too large for d, and one whose step size is
# too small for k.
dk.mcmc3 <- mcmcf(0.2, 20, 1e4, 3, 5)

par(mfrow=c(1,2))
# because the proposal width for d is too large,
# the chain gets stuck at the same values of d:
plot(dk.mcmc3$d, ty='l', main="Trace of d", cex.main=2.0,
     cex.lab=1.5, cex.axis=1.5, ylab="d")

# whereas the proposal width for k is too small,
# so the chain moves slowly:
plot(dk.mcmc3$k, ty='l', main="Trace of k", cex.main=2.0,
     cex.lab=1.5, cex.axis=1.5, ylab="k")

dk.mcmc4 <- mcmcf(0.2, 20, 1e6, 3, 5)

# Efficiencies are roughly 1.5% for d, and 0.35% for k:
eff(acf(dk.mcmc4$d, lag.max=2e3))  # [1] 0.01530385  # mcmcf(0.2, 20, 1e7, 3, 5)
eff(acf(dk.mcmc4$k, lag.max=2e3))  # [1] 0.003493112  # mcmcf(0.2, 20, 1e7, 3, 5)

# plot the traces for the efficient (part 2) and inefficient chains

par(mfrow=c(2,2))

plot(dk.mcmc$d, ty='l', las=1, ylim=c(.05,.2),
     main="Trace of d, efficient chain", xlab='',
     ylab="Distance, d", cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc3$d, ty='l', las=1, ylim=c(.05,.2),
     main="Trace of d, inefficient chain", xlab='',
     ylab='', cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc$k, ty='l', las=1, ylim=c(0,100),
     main="Trace of k, efficient chain",
     xlab='', ylab="ts/tv ratio, k",
     cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc3$k, ty='l', las=1, ylim=c(0,100),
     main="Trace of k, inefficient chain",
     xlab='', ylab='', cex.main=2.0, cex.lab=1.5)

####################################
# PART 4: Checking for convergence
####################################

# We now illustrate the concept of burn-in.
# We will run a chain with a high starting value,
# and another with a low starting value.
dk.mcmc.l <- mcmcf(0.01, 20, 1e4, .12, 180)
dk.mcmc.h <- mcmcf(0.4, 20, 1e4, .12, 180)

plot(dk.mcmc.l$d, xlim=c(1,200), ylim=c(0,0.4), ty="l")
lines(dk.mcmc.h$d, col="red")

# We use the low chain to calculate the mean
# and sd of d. We could have used the high chain
# as well.
mean.d <- mean(dk.mcmc.l$d)
sd.d <- sd(dk.mcmc.l$d)
abline(h=mean.d + 2 * c(-sd.d, sd.d), lty=2)
# The horizontal dashed lines indicate approximately
# the 95% CI. Notice how the chains move from either
# the high or low starting values towards the
# stationary phase (the area within the dashed lines).
# The area before stationarity is reached is the burn-in.
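# As in the JC69 tutorial, running means give another quick convergence
# check (both chains should settle on roughly the same value):
plot(cumsum(dk.mcmc.l$d) / seq_along(dk.mcmc.l$d), ty='l', ylab="running mean of d")
lines(cumsum(dk.mcmc.h$d) / seq_along(dk.mcmc.h$d), col="red")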

# An efficient chain (good proposal step sizes)
dk.mcmc.b <- mcmcf(0.05, 5, 1e4, .12, 180)
# An inefficient chain (bad proposal step sizes)
dk.mcmc3.b <- mcmcf(0.05, 5, 1e4, 3, 5)

# plot and compare the histograms
par(mfrow=c(1,2))
bks <- seq(from=0, to=150, by=1)

hist(dk.mcmc.b$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(0, 0, 1, .5), las=1, xlab="kappa",
     xlim=c(0,100), ylim=c(0,.055))

hist(dk.mcmc$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(.5, .5, .5, .5), add=TRUE)

hist(dk.mcmc3.b$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(0, 0, 1, .5), las=1, xlab="kappa",
     xlim=c(0,100), ylim=c(0,.055))

hist(dk.mcmc3$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(.5, .5, .5, .5), add=TRUE)

# posterior means (similar for the efficient chains):
mean(dk.mcmc$d); mean(dk.mcmc.b$d)
mean(dk.mcmc$k); mean(dk.mcmc.b$k)

# posterior means (not so similar for the inefficient chains):
mean(dk.mcmc3$d); mean(dk.mcmc3.b$d)
mean(dk.mcmc3$k); mean(dk.mcmc3.b$k)

# efficient chain, standard errors of the means:
sqrt(1/1e4 * var(dk.mcmc$d) / 0.23)  # roughly 2.5e-4
sqrt(1/1e4 * var(dk.mcmc$k) / 0.20)  # roughly 0.2

# inefficient chain, standard errors of the means:
sqrt(1/1e4 * var(dk.mcmc3$d) / 0.015)  # roughly 9.7e-4
sqrt(1/1e4 * var(dk.mcmc3$k) / 0.003)  # roughly 1.6

# plot densities (smoothed histograms) for the efficient and inefficient chains
par(mfrow=c(1,2)); adj <- 1.5
# Efficient chains:
plot(density(dk.mcmc.b$k, adj=adj), col="blue", las=1,
     xlim=c(0, 100), ylim=c(0, .05), xaxs="i", yaxs="i")
lines(density(dk.mcmc$k, adj=adj), col="black")

# Inefficient chains:
plot(density(dk.mcmc3.b$k, adj=adj), col="blue", las=1,
     xlim=c(0, 100), ylim=c(0, .05), xaxs="i", yaxs="i")
lines(density(dk.mcmc3$k, adj=adj), col="black")

--------------------------------------------------------------------------------
/K80/surf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thednainus/Bayesian_tutorial/e542d9281a7e08986bc5accf975882b499c31980/K80/surf.png
--------------------------------------------------------------------------------
/K80/traces.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thednainus/Bayesian_tutorial/e542d9281a7e08986bc5accf975882b499c31980/K80/traces.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright 2017 Mario dos Reis, Fabricia Nascimento, and Ziheng Yang

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Bayesian MCMC phylogenetics tutorial in R
Fabricia F. Nascimento, Mario dos Reis and Ziheng Yang

This repository contains two tutorials that guide the user through writing simple MCMC phylogenetics software to estimate the molecular distance (and the transition/transversion ratio) for a pairwise sequence alignment under the Jukes and Cantor (1969) (and Kimura's 1980) substitution models. The tutorials introduce concepts such as burn-in, mixing, convergence, efficiency and autocorrelation of the MCMC chain.

Directory `JC69/` contains the MCMC tutorial to calculate the molecular distance under the JC69 model. The directory contains three files. File `mcmc.JCd.R` contains the main R code with exercises. File `mcmc.JCrt.R` contains the solution to exercise 7 in the previous file. File `BayesianMCMC.pdf` contains a more detailed explanation of the theory used in the tutorial.

Directory `K80/` contains the MCMC tutorial to calculate the molecular distance and the ts/tv ratio under the K80 model. The tutorial is similar to the JC69 one, but focuses on a two-parameter MCMC instead. File `mcmc.K80.R` contains the main R code. A detailed step-by-step explanation of the R code is given at:

https://thednainus.wordpress.com/2017/03/03/tutorial-bayesian-mcmc-phylogenetics-using-r/

In the K80 tutorial the user will be able to reproduce the plots in our review:

**A biologist's guide to Bayesian phylogenetic analysis**
Nascimento FF, dos Reis M, Yang Z. (2017) [Nature Ecology and Evolution, 1:1446-1454.](http://rdcu.be/v4jM)

For example, the plot below shows the prior, likelihood and posterior surfaces for the distance, d, and the ts/tv ratio, k, for the K80 example:

![](K80/surf.png)

And these are the traces for d and k obtained from an MCMC algorithm written to sample from the posterior surface above:

![](K80/traces.png)

We hope you enjoy the tutorials. If you have any questions or comments, please let us know.

Fabricia, Mario and Ziheng.
--------------------------------------------------------------------------------