├── ABBA_BABA_whole_genome
│   ├── README.md
│   ├── data
│   │   ├── Hmel2_chrom_lengths.txt
│   │   ├── hel92.DP8MP4BIMAC2HET75dist1K.geno.gz
│   │   ├── hel92.DP8MP4BIMAC2HET75dist250.geno.gz
│   │   ├── hel92.DP8MP4BIMAC2HET75dist500.geno.gz
│   │   └── hel92.pop.txt
│   └── images
│       └── map_and_tree.jpg
├── ABBA_BABA_windows
│   ├── README.md
│   ├── data
│   │   ├── chr18.LDhelmet_MLrho.w100.tsv
│   │   ├── hel92.DP8HET75MP9BIminVar2.chr18.geno.gz
│   │   └── hel92.pop.txt
│   └── images
│       └── map_and_tree.jpg
├── README.md
└── topology_weighting
    └── README.md
/ABBA_BABA_whole_genome/README.md:
--------------------------------------------------------------------------------
1 | # Tutorial: *ABBA* *BABA* statistics using genome wide SNP data
2 |
3 | ___
4 | ## Requirements
5 | * Python 2.7
6 | * Numpy 1.10+
7 | * R 3.0+
8 |
9 | ___
10 | ## Introduction
11 |
12 | ABBA BABA statistics (also called 'D statistics') provide a simple and powerful test for a deviation from a strict bifurcating evolutionary history. They are therefore frequently used to test for introgression using genome-scale SNP data (e.g. from whole-genome sequencing or RADseq).
13 |
14 | In this practical we will perform an ABBA BABA analysis using a **combination of available software and some code written from scratch in R**. We will analyse genomic data from several populations of *Heliconius* butterflies.
15 |
16 | #### Workflow
17 | Starting with genotype data from multiple individuals, we first **infer allele frequencies** at each SNP. We then **compute the *D* statistic** and use a **block jackknife** method to test for a significant deviation from the null expectation of *D*=0. Finally, we **estimate *f*, the 'admixture proportion'**.
18 |
19 | #### Data
20 |
21 | We will study multiple races from three species: *Heliconius melpomene*, *Heliconius timareta* and *Heliconius cydno*. These species have partially overlapping ranges and they are thought to hybridise where they occur in sympatry. Our sample set includes two pairs of sympatric races of *H. melpomene* and *H. cydno* from Panama and the western slopes of the Andes in Colombia. There are also two pairs of sympatric races of *H. melpomene* and *H. timareta* from the eastern slopes of the Andes in Colombia and Peru. Finally, there are two samples from an outgroup species *Heliconius numata*, which are necessary for performing the ABBA BABA analyses.
22 |
23 | All samples were sequenced using high-depth **whole-genome sequencing**, and genotypes have been called for each individual for each site in the genome using a standard pipeline. The data has been filtered to retain only **bi-allelic** single nucleotide polymorphisms (SNPs), and these have been further **thinned** to reduce the file size for this tutorial.
24 |
25 | #### Hypotheses
26 |
27 | We hypothesize that hybridisation between species in sympatry will lead to sharing of genetic variation between *H. cydno* and the **sympatric** races of *H. melpomene* from the west, and between *H. timareta* and the corresponding sympatric races of *H. melpomene* from the east of the Andes. There is also another race of *H. melpomene* from French Guiana that is **allopatric** from both *H. timareta* and *H. cydno*, which should not have experienced recent genetic exchange with either species and therefore serves as a control.
28 |
29 | In addition to testing for the presence of introgression, we will test the hypothesis that some parts of the genome experience more introgression than others. Specifically, we know that at least one locus on the Z sex chromosome causes sterility in hybrid females between these species, indicating an incompatibility between the autosomes of one species and the Z chromosome of the other. We therefore might expect reduced introgression on the Z chromosome compared to autosomes.
30 |
31 | 
32 |
33 |
34 | #### A genome wide test for introgression
35 |
36 | In its simplest formulation, the *ABBA* *BABA* test relies on counts of sites in the genome that match the *ABBA* and *BABA* genotype patterns. That is, given three ingroup populations and an outgroup with the relationship (((P1,P2),P3),O), and given a single genome sequence representing each population (i.e. H1, H2 and H3), ***ABBA*** sites are those at which H2 and H3 **share a derived allele ('B')**, while **H1 has the ancestral state ('A')**, as defined by the outgroup sample. Likewise, ***BABA*** represents sites at which **H1 and H3 share the derived allele**, while H2 has the ancestral state.
37 |
38 | Ignoring recurrent mutation, the two SNP patterns can only be produced if some parts of the genome have genealogies that do not follow the 'species tree', but instead group H2 with H3 or H1 with H3. If the populations split fairly recently, such 'discordant' genealogies are expected to occur in some parts of the genome due to variation in lineage sorting. In the absence of any deviation from a strict bifurcating topology, **we expect roughly equal proportions of the genome to show the two discordant genealogies** (((H2,H3),H1),O) and (((H1,H3),H2),O). By counting *ABBA* and *BABA* SNPs across the genome (or a large proportion of it), we are therefore **approximating the proportion of the genome represented by the two discordant genealogies**, which means **we expect a 1:1 ratio of *ABBA* and *BABA* SNPs**. A deviation could come about as a result of gene flow between populations P3 and P2 for example, although it could also indicate other phenomena that break our assumptions, such as ancestral population structure, or variable substitution rates.
39 |
40 | To quantify the deviation from the expected ratio, we calculate *D*, which is the difference in the sum of *ABBA* and *BABA* patterns across the genome, divided by their sum:
41 |
42 | *D* = \[sum(*ABBA*) - sum(*BABA*)\] / \[sum(*ABBA*) + sum(*BABA*)\]
43 |
44 | **Therefore, D ranges from -1 to 1, and should equal 0 under the null hypothesis. D > 0 indicates an excess of *ABBA*, and D < 0 indicates an excess of *BABA*.**
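
The single-sequence version of the test can be sketched in a few lines. The example below is written in Python rather than R purely for illustration and is not part of the tutorial's scripts; the function names and the toy sequences are hypothetical. It counts *ABBA* and *BABA* sites across four aligned haploid sequences, skipping sites that are not bi-allelic or at which H3 carries the ancestral allele, and then computes *D* from the two counts.

```python
def count_abba_baba(h1, h2, h3, outgroup):
    """Count ABBA and BABA sites across four aligned haploid sequences.

    The outgroup allele is taken to be the ancestral state ('A').
    """
    abba = baba = 0
    for a1, a2, a3, anc in zip(h1, h2, h3, outgroup):
        # Informative sites are bi-allelic and have the derived allele in H3
        if a3 == anc or len({a1, a2, a3, anc}) != 2:
            continue
        if a1 == anc and a2 == a3:
            abba += 1  # H2 and H3 share the derived allele
        elif a1 == a3 and a2 == anc:
            baba += 1  # H1 and H3 share the derived allele
    return abba, baba


def d_from_counts(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA)."""
    return (abba - baba) / float(abba + baba)


# Toy alignment (hypothetical data): eight sites, one sequence per population
outgroup = "AAAAAAAA"
h1 = "AAGAGATA"
h2 = "AGAGGCTG"
h3 = "AGGGACTC"

abba, baba = count_abba_baba(h1, h2, h3, outgroup)
print("ABBA =", abba, "BABA =", baba, "D =", d_from_counts(abba, baba))
```

With real data the counts would be taken over the whole genome, and a denominator of zero (no informative sites) would need to be handled explicitly.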
45 |
46 | If we have multiple samples from each population, then counting *ABBA* and *BABA* sites is less straightforward. One option is to consider only sites at which all samples from the same population share the same allele, but that will discard a large amount of useful data. A preferable option is to use the allele frequencies at each site to quantify the extent to which the genealogy is skewed toward the *ABBA* or *BABA* pattern. This is effectively equivalent to counting *ABBA* and *BABA* SNPs using all possible sets of four haploid genomes at each site. *ABBA* and *BABA* are therefore no longer binary states, but rather numbers between 0 and 1 that represent the frequency of allele combinations matching each genealogy. They are computed based on the frequency of the derived allele (*p*) and ancestral allele (1-*p*) in each population as follows:
47 |
48 | *ABBA* = (1-*p1*) x *p2* x *p3* x (1-*pO*)
49 |
50 | *BABA* = *p1* x (1-*p2*) x *p3* x (1-*pO*)
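
As a worked illustration of the frequency-based formulas, the following sketch (in Python, with hypothetical function names and made-up frequencies) computes the *ABBA* and *BABA* values at each site and combines them into *D*. The outgroup derived allele frequency defaults to 0, which is an assumption matching the case where the outgroup is fixed for the ancestral allele.

```python
def abba_baba_weights(p1, p2, p3, p_out=0.0):
    """Per-site ABBA and BABA values from derived allele frequencies."""
    abba = (1 - p1) * p2 * p3 * (1 - p_out)
    baba = p1 * (1 - p2) * p3 * (1 - p_out)
    return abba, baba


def d_stat(sites):
    """D over a list of (p1, p2, p3) tuples, one tuple per site."""
    abba_sum = baba_sum = 0.0
    for p1, p2, p3 in sites:
        abba, baba = abba_baba_weights(p1, p2, p3)
        abba_sum += abba
        baba_sum += baba
    return (abba_sum - baba_sum) / (abba_sum + baba_sum)


# Three toy sites (hypothetical frequencies)
toy_sites = [
    (0.0, 1.0, 1.0),  # pure ABBA site: ABBA = 1, BABA = 0
    (0.5, 0.5, 1.0),  # uninformative: ABBA = BABA = 0.25
    (1.0, 0.0, 0.5),  # partial BABA site: ABBA = 0, BABA = 0.5
]
print("D =", d_stat(toy_sites))
```

Note that a site at which every sample in a population carries the same allele reduces to the binary counting case, so the two approaches agree when each population is represented by a single haploid sequence.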
51 |
52 | ## The Practical
53 |
54 | ### Preparation
55 |
56 | * Open a terminal window and navigate to a folder where you will run the exercise and store all the input and output data files.
57 |
58 | * Now create a subdirectory called 'data' and download the data files needed for this tutorial
59 |
60 | ```bash
61 | mkdir data
62 |
63 | cd data
64 |
65 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz
66 |
67 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.pop.txt
68 |
69 | cd ..
70 | ```
71 |
72 | * Next, download the collection of Python scripts required for this tutorial from [GitHub](https://github.com/simonhmartin)
73 |
74 | ```bash
75 | wget https://github.com/simonhmartin/genomics_general/archive/master.zip
76 | unzip master.zip
77 | ```
78 |
79 | ### Genome wide allele frequencies
80 |
81 | To compute these values from population genomic data, we need to first determine the frequency of the derived allele in each population at each polymorphic site in the genome. We will compute these from the *Heliconius* genotype data provided using a Python script. The input file has already been filtered to contain only bi-allelic sites. The frequencies script requires that we define populations. These are defined in the file `hel92.pop.txt`.
82 |
83 | ```bash
84 | python genomics_general-master/freq.py -g data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz \
85 | -p mel_mel -p mel_ros -p mel_vul -p mel_mal -p mel_ama \
86 | -p cyd_chi -p cyd_zel -p tim_flo -p tim_txn -p num \
87 | --popsFile data/hel92.pop.txt --target derived \
88 | -o data/hel92.DP8MP4BIMAC2HET75dist250.derFreq.tsv.gz
89 | ```
90 | By setting `--target derived` we obtain the frequency of the derived allele in each population at each site. This is based on using the final population specified (*H. numata silvana*, labelled `num` in the populations file) as the outgroup. Sites at which this population is not fixed for the ancestral state are discarded.
91 |
92 | ### Genome wide ABBA BABA analysis
93 |
94 | **(NOTE: here we're working in R, or R Studio if you prefer)**
95 |
96 | To learn how the ABBA BABA test works, we will be writing the code from scratch to do the test. **Start a new R script**. This will make it easy to re-run the whole analysis using different populations.
97 |
98 | #### R functions for ABBA BABA analyses
99 |
100 | * First we define a function for computing the ABBA and BABA proportions at each site and use these to compute the D statistic. The input will be the frequency of the derived allele in populations P1, P2 and P3 (i.e. *p1*, *p2* and *p3*). (The frequency of the ancestral allele in the outgroup will be 1 at all sites because we used the outgroup to identify the ancestral allele, so this can be ignored).
101 |
102 | ```R
103 | D.stat <- function(p1, p2, p3) {
104 | ABBA <- (1 - p1) * p2 * p3
105 | BABA <- p1 * (1 - p2) * p3
106 | (sum(ABBA, na.rm=T) - sum(BABA, na.rm=T)) / (sum(ABBA, na.rm=T) + sum(BABA, na.rm=T))
107 | }
108 | ```
109 |
110 | #### The Data
111 |
112 | * Read in our allele frequency data.
113 |
114 | ```R
115 | freq_table = read.table("data/hel92.DP8MP4BIMAC2HET75dist250.derFreq.tsv.gz", header=T, as.is=T)
116 | ```
117 |
118 | This has created an object called `freq_table` that contains the frequencies for the derived allele at each SNP.
119 |
120 | We can check the number of sites in this table, and also look at the first few rows to get a feel for the data.
121 |
122 | ```R
123 | nrow(freq_table)
124 |
125 | head(freq_table)
126 | ```
127 |
128 | Note that the first two columns give the name of the scaffold (i.e. the chromosome) and the position on the chromosome of each site. The remaining columns are the allele frequencies for the different subspecies, as indicated in the figure above.
129 |
130 | #### The *D* statistic
131 |
132 | Now, to compute D, we need to define populations P1, P2 and P3. We will start with an obvious and **previously published test case**:
133 | We will ask whether there is evidence of introgression between ***H. melpomene rosina* (`mel_ros`)** and ***H. cydno chioneus* (`cyd_chi`)**. These will be **P2** and **P3** respectively. **P1** will be our **allopatric** population, ***H. melpomene melpomene* from French Guiana (`mel_mel`)**.
134 |
135 | We set these populations and then compute *D* by extracting the derived allele frequencies for all SNPs for the three populations.
136 |
137 | ```R
138 | P1 <- "mel_mel"
139 | P2 <- "mel_ros"
140 | P3 <- "cyd_chi"
141 |
142 | D <- D.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3])
143 |
144 | print(paste("D =", round(D,4)))
145 | ```
146 |
147 | We get a **strongly positive D statistic** (remember D varies from -1 to 1), indicating an excess of ABBA over BABA. This indicates that ***H. cydno chioneus* from Panama** (`cyd_chi`) **shares more genetic variation with the sympatric *H. melpomene rosina* from Panama** (`mel_ros`) than with the allopatric *H. melpomene melpomene* from French Guiana (`mel_mel`). This is consistent with hybridisation and gene flow between the two species where they occur in sympatry.
148 |
149 | However, we currently don't know whether this result is statistically robust. In particular, we don't know whether the excess of ABBA is evenly distributed across the genome. If it results from odd ancestry at just one part of the genome, we would have less confidence that there has been significant introgression.
150 |
151 | To test for a consistent genome-wide signal we use a block-jackknife procedure.
152 |
153 | #### Block Jackknife
154 |
155 | The jackknife procedure allows us to compute the variance of *D* despite non-independence among sites. A more conventional bootstrapping approach, where we would randomly resample sites and recalculate *D*, is not appropriate because **nearby sites in the genome have similar ancestry, making them non-independent observations**.
156 |
157 | The block jackknife procedure estimates the standard deviation for so-called 'pseudovalues' of the mean genome-wide *D*, where each pseudovalue is computed by excluding a defined block of the genome, taking the difference between the mean genome-wide *D* and *D* computed when the block is omitted.
158 |
159 | To account for non-independence among linked sites, the block size needs to exceed the distance at which autocorrelation occurs. In our case, we will use a block size of 1 Mb, because we know that linkage disequilibrium decays to background levels at a distance well below 1 Mb.
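
To make the procedure concrete, here is a minimal sketch of a delete-one block jackknife, written in Python for illustration. It uses the standard pseudovalue form (n times the full estimate minus n-1 times the estimate with one block omitted), which is an assumption on my part and may differ in detail from the functions in the script used in this tutorial. All function names and the toy data are hypothetical.

```python
import math


def d_stat(sites):
    """D from a list of (p1, p2, p3) derived allele frequencies."""
    abba = sum((1 - p1) * p2 * p3 for p1, p2, p3 in sites)
    baba = sum(p1 * (1 - p2) * p3 for p1, p2, p3 in sites)
    return (abba - baba) / (abba + baba)


def assign_blocks(chromosomes, positions, block_size):
    """Group site indices into non-overlapping blocks within each chromosome."""
    blocks = {}
    for i, (chrom, pos) in enumerate(zip(chromosomes, positions)):
        key = (chrom, (pos - 1) // block_size)
        blocks.setdefault(key, []).append(i)
    return [blocks[key] for key in sorted(blocks)]


def block_jackknife(stat, sites, blocks):
    """Return (mean, standard error) of the jackknife pseudovalues."""
    n = len(blocks)
    overall = stat(sites)
    pseudovalues = []
    for block in blocks:
        omitted = set(block)
        kept = [s for i, s in enumerate(sites) if i not in omitted]
        # Standard delete-one pseudovalue
        pseudovalues.append(n * overall - (n - 1) * stat(kept))
    mean = sum(pseudovalues) / n
    variance = sum((v - mean) ** 2 for v in pseudovalues) / (n - 1)
    return mean, math.sqrt(variance / n)


# Toy data (hypothetical): six sites falling into three 1 Mb blocks
chromosomes = ["chr1"] * 4 + ["chr2"] * 2
positions = [100, 500000, 1200000, 1900000, 50, 999999]
freqs = [(0.0, 1.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.6, 0.8),
         (1.0, 0.0, 0.5), (0.2, 0.9, 1.0), (0.0, 0.3, 0.7)]

blocks = assign_blocks(chromosomes, positions, 1000000)
d_mean, d_se = block_jackknife(d_stat, freqs, blocks)
print("jackknife mean =", d_mean, "standard error =", d_se)
```

Because whole blocks are dropped together, correlation among linked sites within a block does not shrink the estimated standard error, as long as the blocks themselves are approximately independent.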
160 |
161 | The code to run the jackknife procedure is fairly simple, but we are not going to write it here. Instead, the R functions for this purpose are provided in a separate script, which we can import now.
162 |
163 | ```R
164 | source("genomics_general-master/jackknife.R")
165 | ```
166 |
167 | The first step in the process is to define the blocks that will be omitted from the genome in each iteration of the jackknife. The function `get.block.indices` in the jackknife script will do this, and return the 'indices' (i.e. the rows in our frequencies table) corresponding to each block. It requires that we specify the block size along with chromosome and position for each site to be analysed.
168 |
169 | ```R
170 | block_indices <- get.block.indices(block_size=1e6,
171 | positions=freq_table$position,
172 | chromosomes=freq_table$scaffold)
173 |
174 | n_blocks <- length(block_indices)
175 |
176 | print(paste("Genome divided into", n_blocks, "blocks."))
177 | ```
178 |
179 | Now we can run the block jackknifing procedure to compute the mean and standard error of *D*. We provide the *D* statistic function (`D.stat`) we created earlier, which will be applied in each iteration. We also provide the frequencies for each site and the block indices that will be used to exclude all sites from a given block.
180 |
181 | ```R
182 | D_jackknife <- block.jackknife(block_indices=block_indices,
183 | FUN=D.stat,
184 | freq_table[,P1], freq_table[,P2], freq_table[,P3])
185 |
186 | print(paste("D jackknife mean =", round(D_jackknife$mean,4)))
187 | ```
188 |
189 | From the unbiased estimate of the mean and standard error of *D*, we can compute the Z score to test whether *D* deviates significantly from zero.
190 |
191 | ```R
192 | D_Z <- D_jackknife$mean / D_jackknife$standard_error
193 |
194 | print(paste("D Z score = ", round(D_Z,3)))
195 | ```
196 |
197 | Usually a Z score greater than 3 or 4 is taken as significant, so the massive Z score in this case means the deviation from zero is hugely significant.
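
To see what these thresholds mean in terms of probability, a Z score can be converted to a two-tailed p-value under a standard normal distribution. The short Python sketch below does this with the standard library only; the function name is hypothetical, and treating the jackknife Z score as normally distributed is the usual working assumption rather than something established in this tutorial.

```python
import math


def two_tailed_p(z):
    """Two-tailed p-value for a Z score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Compare the conventional 1.96 cutoff with the stricter thresholds of 3 and 4
for z in (1.96, 3.0, 4.0):
    print("Z =", z, "p =", two_tailed_p(z))
```

A threshold of 3 corresponds to p of roughly 0.003, which is considerably stricter than the conventional 0.05 and helps guard against false positives when many population trios are tested.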
198 |
199 | #### Estimating the admixture proportion
200 |
201 | The *D* statistic provides a powerful test for introgression, but it does not ***quantify* the proportion of the genome that has been shared**. A related method has been developed to estimate *f*, the 'admixture proportion'.
202 |
203 | The idea behind this approach is that we **compare the observed excess** of *ABBA* over *BABA* sites, **to that which would be expected under complete admixture**. To approximate the expectation under complete admixture we re-count ABBA and BABA but **substituting a second population of the P3 species in the place of P2**. If you lack a second population, you can simply split your P3 samples into two. In this case, we have two populations to represent each species, so if we're using *H. cydno chioneus* (`cyd_chi`) as P3a, we can use *H. cydno zelinde* (`cyd_zel`) as P3b.
204 |
205 | We need to write our own function to compute *f*. The inputs will be the derived allele frequencies in each population, but now we include both P3a and P3b.
206 |
207 |
208 | ```R
209 | f.stat <- function(p1, p2, p3a, p3b) {
210 | ABBA_numerator <- (1 - p1) * p2 * p3a
211 | BABA_numerator <- p1 * (1 - p2) * p3a
212 |
213 | ABBA_denominator <- (1 - p1) * p3b * p3a
214 | BABA_denominator <- p1 * (1 - p3b) * p3a
215 |
216 | (sum(ABBA_numerator, na.rm=TRUE) - sum(BABA_numerator, na.rm=TRUE)) /
217 | (sum(ABBA_denominator, na.rm=TRUE) - sum(BABA_denominator, na.rm=TRUE))
218 | }
219 | ```
220 |
221 | We can now choose our P3a and P3b, and estimate *f*.
222 |
223 | ```R
224 | P3a <- "cyd_chi"
225 | P3b <- "cyd_zel"
226 |
227 | f <- f.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
228 |
229 | print(paste("Admixture proportion = ", round(f,4)))
230 | ```
231 |
232 | This reveals that over 25% of the genome has been shared between *H. melpomene* and *H. cydno* in sympatry. The admixture proportion can be interpreted as the average proportion of foreign ancestry in any single genome. Alternatively, it can be interpreted as the expected frequency of foreign alleles in this population at any given site in the genome.
233 |
234 | We can again use the block jackknife to estimate the standard deviation of *f*, and obtain a confidence interval. The jackknife block indices are already computed, so we can simply run the jackknife function again, this time specifying the *f* function as that to run each iteration.
235 |
236 | ```R
237 | f_jackknife <- block.jackknife(block_indices=block_indices,
238 | FUN=f.stat,
239 | freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
240 |
241 | ```
242 | The 95% confidence interval is the mean +/- ~1.96 standard errors.
243 |
244 | ```R
245 | f_CI_lower <- f_jackknife$mean - 1.96*f_jackknife$standard_error
246 | f_CI_upper <- f_jackknife$mean + 1.96*f_jackknife$standard_error
247 |
248 | print(paste("95% confidence interval of f =", round(f_CI_lower,4), round(f_CI_upper,4)))
249 |
250 | ```
251 |
252 | ### Chromosomal ABBA BABA analysis
253 |
254 | #### Do all chromosomes show evidence of introgression?
255 |
256 | Above, we investigated the extent of introgression across the whole genome. We can perform a similar analysis at the chromosomal level to assess introgression on individual chromosomes, assuming we have a sufficient number of SNPs from each chromosome.
257 |
258 | The first step to do this is to identify the rows in the frequencies table that correspond to each of the 21 *Heliconius* chromosomes.
259 |
260 | We first identify all chromosome names present in the dataset using the `unique` function. We then need to identify rows in the table that represent each chromosome. For this we use the `lapply` function, which applies a simple function multiple times to create a combined output in the R `list` format. In this case, we will apply the function using the chromosome names, and the function we apply will simply ask which values in the table `scaffold` column correspond to that chromosome, making use of the R `which` function.
261 |
262 |
263 | ```R
264 | chrom_names <- unique(freq_table$scaffold)
265 | chrom_indices <- lapply(chrom_names, function(chrom) which(freq_table$scaffold == chrom))
266 | names(chrom_indices) <- chrom_names
267 | ```
268 |
269 | This creates a list with 21 elements - one for each chromosome. Each element is a vector of all sites in the table that come from that chromosome. We can check how many SNPs we have per chromosome by applying the `length` function over the list we just created.
270 |
271 | ```R
272 | sapply(chrom_indices, length)
273 | ```
274 | (`sapply` is like `lapply` except that it simplifies the output if possible, so here it returns a vector, rather than a list of vectors).
275 |
276 | Now we can use these indices to compute a *D* value for each chromosome. We again use `sapply`, this time applying the `D.stat` function and indexing only the rows in the table from the specific chromosome in each case.
277 |
278 | ```R
279 | D_by_chrom <- sapply(chrom_names,
280 | function(chrom) D.stat(freq_table[chrom_indices[[chrom]], P1],
281 | freq_table[chrom_indices[[chrom]], P2],
282 | freq_table[chrom_indices[[chrom]], P3]))
283 |
284 | ```
285 |
286 | We also need to apply the jackknife to determine whether *D* differs significantly from zero for each chromosome. First we will define the blocks to use for each chromosome.
287 |
288 | ```R
289 | block_indices_by_chrom <- sapply(chrom_names,
290 | function(chrom) get.block.indices(block_size=1e6,
291 | positions=freq_table$position[freq_table$scaffold==chrom]),
292 | simplify=FALSE)
293 | ```
294 |
295 | This command returns a *list of lists*. This is a list with 21 elements - one for each chromosome. Each of these elements is a list giving the indices for each block within that chromosome.
296 |
297 | We can check the number of blocks per chromosome, as well as the number of SNPs per block per chromosome.
298 |
299 | ```R
300 | sapply(block_indices_by_chrom, length)
301 |
302 | lapply(block_indices_by_chrom, sapply, length)
303 | ```
304 |
305 | Now we use the jackknife to compute the Z scores for *D* for each chromosome.
306 |
307 |
308 | ```R
309 | D_jackknife_by_chrom <- sapply(chrom_names,
310 | function(chrom) block.jackknife(block_indices=block_indices_by_chrom[[chrom]],
311 | FUN=D.stat,
312 | freq_table[chrom_indices[[chrom]], P1],
313 | freq_table[chrom_indices[[chrom]], P2],
314 | freq_table[chrom_indices[[chrom]], P3]))
315 |
316 | D_jackknife_by_chrom <- as.data.frame(t(D_jackknife_by_chrom))
317 |
318 | D_jackknife_by_chrom$Z <- as.numeric(D_jackknife_by_chrom$mean) / as.numeric(D_jackknife_by_chrom$standard_error)
319 |
320 | D_jackknife_by_chrom
321 | ```
322 |
323 | We see that chromosomes 1-20 all show significant evidence for introgression (Z > 4), while chromosome 21, the Z sex chromosome, does not. In fact *D* is negative for chr21, indicating that the allopatric *H. melpomene* population shares more variation with *H. cydno* than the sympatric *H. melpomene* shares with *H. cydno*, although the difference is not significant. This indicates a strong reduction in introgression on the sex chromosome compared to the rest of the genome, consistent with strong selection against introgressed alleles on the sex chromosome. This is what we would expect if there are one or more incompatibilities that cause sterility that involve loci on the Z chromosome.
324 |
325 |
326 | ### In your own time
327 |
328 | We have run the analysis for a single set of three populations, but to fully understand the relationships among these species and subspecies, we might want to run multiple different tests. We can do this by changing the identity of P1, P2 and P3.
329 |
330 | For example, instead of using the allopatric *H. melpomene melpomene* as P1, we can use *H. melpomene vulcanus* from Colombia (`mel_vul`), which is physically closer and more closely related to *H. melpomene rosina*. Do we still see a significant *D* value and large admixture proportion? If not, why?
331 |
332 | Another obvious test is whether there is also introgression between *H. timareta* and the local *H. melpomene* populations from the other side of the Andes. What would be appropriate P1, P2 and P3 for that test?
333 |
--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/Hmel2_chrom_lengths.txt:
--------------------------------------------------------------------------------
1 | chr1 17206585
2 | chr2 9045316
3 | chr3 10541528
4 | chr4 9662098
5 | chr5 9908586
6 | chr6 14054175
7 | chr7 14308859
8 | chr8 9320449
9 | chr9 8708747
10 | chr10 17965481
11 | chr11 11759272
12 | chr12 16327298
13 | chr13 18127314
14 | chr14 9174305
15 | chr15 10235750
16 | chr16 10083215
17 | chr17 14773299
18 | chr18 16803890
19 | chr19 16399344
20 | chr20 14871695
21 | chr21 13359691
22 |
--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist1K.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist1K.geno.gz
--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz
--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist500.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist500.geno.gz
--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.pop.txt:
--------------------------------------------------------------------------------
1 | ros.CAM1841 mel_ros
2 | ros.CAM1880 mel_ros
3 | ros.CAM2045 mel_ros
4 | ros.CAM2059 mel_ros
5 | ros.CAM2519 mel_ros
6 | ros.CAM2552 mel_ros
7 | ros.CJ2071 mel_ros
8 | ros.CJ531 mel_ros
9 | ros.CJ533 mel_ros
10 | ros.CJ546 mel_ros
11 | vul.CS10 mel_vul
12 | vul.CS3603 mel_vul
13 | vul.CS3605 mel_vul
14 | vul.CS3606 mel_vul
15 | vul.CS3612 mel_vul
16 | vul.CS3614 mel_vul
17 | vul.CS3615 mel_vul
18 | vul.CS3617 mel_vul
19 | vul.CS3618 mel_vul
20 | vul.CS3621 mel_vul
21 | mal.CS1002 mel_mal
22 | mal.CS1011 mel_mal
23 | mal.CS1815 mel_mal
24 | mal.CS21 mel_mal
25 | mal.CS22 mel_mal
26 | mal.CS24 mel_mal
27 | mal.CS586 mel_mal
28 | mal.CS594 mel_mal
29 | mal.CS604 mel_mal
30 | mal.CS615 mel_mal
31 | ama.JM160 mel_ama
32 | ama.JM216 mel_ama
33 | ama.JM293 mel_ama
34 | ama.JM48 mel_ama
35 | ama.MJ11-3188 mel_ama
36 | ama.MJ11-3189 mel_ama
37 | ama.MJ11-3202 mel_ama
38 | ama.MJ12-3217 mel_ama
39 | ama.MJ12-3258 mel_ama
40 | ama.MJ12-3301 mel_ama
41 | melG.CAM1349 mel_mel
42 | melG.CAM1422 mel_mel
43 | melG.CAM2035 mel_mel
44 | melG.CAM8171 mel_mel
45 | melG.CAM8216 mel_mel
46 | melG.CAM8218 mel_mel
47 | melG.CJ13435 mel_mel
48 | melG.CJ9315 mel_mel
49 | melG.CJ9316 mel_mel
50 | melG.CJ9317 mel_mel
51 | chi.CAM25091 cyd_chi
52 | chi.CAM25137 cyd_chi
53 | chi.CAM580 cyd_chi
54 | chi.CAM582 cyd_chi
55 | chi.CAM585 cyd_chi
56 | chi.CAM586 cyd_chi
57 | chi.CJ553 cyd_chi
58 | chi.CJ560 cyd_chi
59 | chi.CJ564 cyd_chi
60 | chi.CJ565 cyd_chi
61 | zel.CS1 cyd_zel
62 | zel.CS1028 cyd_zel
63 | zel.CS1029 cyd_zel
64 | zel.CS1030 cyd_zel
65 | zel.CS1033 cyd_zel
66 | zel.CS1035 cyd_zel
67 | zel.CS2 cyd_zel
68 | zel.CS2262 cyd_zel
69 | zel.CS273 cyd_zel
70 | zel.CS30 cyd_zel
71 | flo.CS12 tim_flo
72 | flo.CS13 tim_flo
73 | flo.CS14 tim_flo
74 | flo.CS15 tim_flo
75 | flo.CS2337 tim_flo
76 | flo.CS2338 tim_flo
77 | flo.CS2341 tim_flo
78 | flo.CS2350 tim_flo
79 | flo.CS2358 tim_flo
80 | flo.CS2359 tim_flo
81 | thxn.JM313 tim_txn
82 | thxn.JM57 tim_txn
83 | thxn.JM84 tim_txn
84 | thxn.JM86 tim_txn
85 | thxn.MJ12-3221 tim_txn
86 | thxn.MJ12-3233 tim_txn
87 | thxn.MJ12-3308 tim_txn
88 | txn.MJ11-3339 tim_txn
89 | txn.MJ11-3340 tim_txn
90 | txn.MJ11-3460 tim_txn
91 | nu_sil.MJ09-4125 num
92 | nu_sil.MJ09-4184 num
93 |
--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/images/map_and_tree.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/images/map_and_tree.jpg
--------------------------------------------------------------------------------
/ABBA_BABA_windows/README.md:
--------------------------------------------------------------------------------
1 | # Tutorial: *ABBA* *BABA* analysis in sliding windows
2 | ___
3 | ## Requirements
4 | * Python 2.7
5 | * Numpy 1.10+
6 | * R 3.0+
7 |
8 | ___
9 | ## Introduction
10 |
11 | ABBA BABA statistics (also called D statistics) provide a simple and powerful test for a deviation from a strict bifurcating evolutionary history. They are therefore frequently used to test for introgression using genome-scale SNP data.
12 |
13 | Although originally developed to be employed for genome-wide tests for introgression, they can also be applied in smaller windows, which can allow exploration of the **genomic landscape of introgression**.
14 |
15 | In this practical we will perform window-based ABBA BABA analysis using **available software** and then **write code in R for plotting the results**. We will analyse genomic data from several populations of *Heliconius* butterflies.
16 |
17 | Note that the details and theory of ABBA BABA statistics are more fully explored in my other [tutorial on whole genome ABBA BABA analyses](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome).
18 |
19 | #### Workflow Overview
20 |
21 | Starting with genotype data from **whole-genome sequencing** of multiple individuals, we will run a script that computes a measure of the admixture proportion in individual windows across each chromosome. We will then make plots to test hypotheses about adaptive introgression.
22 |
23 | #### Data
24 |
25 | We will study multiple races from three species: *Heliconius melpomene*, *Heliconius timareta* and *Heliconius cydno*. These species have partially overlapping ranges and they are thought to hybridise where they occur in sympatry. Our sample set includes two pairs of sympatric races of *H. melpomene* and *H. cydno* from Panama and the western slopes of the Andes in Colombia. There are also two pairs of sympatric races of *H. melpomene* and *H. timareta* from the eastern slopes of the Andes in Colombia and Peru. Finally, there are two samples from an outgroup species *Heliconius numata*, which are necessary for performing the ABBA BABA analyses.
26 |
27 | All samples were sequenced using high-depth **whole-genome sequencing**, and genotypes have been called for each individual for each site in the genome using a standard pipeline. The data has been filtered to retain only **bi-allelic** single nucleotide polymorphisms (SNPs). This dataset includes SNP data from chromosome 18, which is known to carry a wing patterning locus of particular interest.
28 |
29 | #### Hypotheses
30 |
31 | We hypothesize that hybridisation between species in sympatry will lead to sharing of genetic variation between *H. cydno* and the **sympatric** races of *H. melpomene* from the west, and between *H. timareta* and the corresponding sympatric races of *H. melpomene* from the east of the Andes.
32 |
33 | However, not all parts fo the genome are expected to be equally affected. In particular, we suspect that the wing patterninging gene *optix* on chromosome 18 has been under strong selection. Differential regulation of *optix* can give rise to different distributions of red pigmentation on the wing, as seen in different subspecies of *H. melpomene*, or the absence of red, as seen in *H. cydno*.
34 |
35 | *Heliconius* wing patterns act as warnings to predators that they are toxic. Some species participate in Müllerian mimicry, whereby toxic species have evolved to resemble one-another, which helps to reinfoce predator learning. Mimicry can either be achived through independent convergence on the same wing patterns, or through exchange of wing patterning alleles through **adaptive introgression**. We therefore predict that co-mimetic populations of different species might show an excess signal of introgression in the vicinity of *optix*.
36 |
37 | We have an entirely different expectation for populations with different wing patterns. If predators in a given area recognise the most common local pattern as toxic, it will be costly to have a foreign wing pattern. Likewise, any hybrid individual that has an intermediate wing pattern will also be at risk of higher predation. We therefore predict that between populations with different wing patterns there should be a reduction in the extent of introgression in the vicinity of *optix*.
38 |
39 |
40 | 
41 |
42 |
43 | #### Quantifying admixture across the genome
44 |
45 | A detailed explanation of the *ABBA BABA* test is given in my other [tutorial on whole genome ABBA BABA analyses](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome).
46 |
47 | Briefly, the test uses three populations and an outgroup with the relationship (((P1,P2),P3),O), and investigates whether there is an excess of shared variation between P2 and P3 (compared to that shared between P1 and P3).
48 |
49 | This excess can be expressed in terms of the *D* statistic, which ranges from -1 to 1, and should equal 0 under the null hypothesis of no introgression. *D* > 0 indicates possible introgression between P3 and P2 (or other factors that would result in a deviation from a strict bifurcating species history).
50 |
51 | This test was designed to be used at the whole-genome scale. The *D* statistic is not well suited for comparing admixture levels across the genome, because its absolute value depends on factors such as the effective population size, which can vary across the genome.
52 |
53 | The *f* estimator described in the [other tutorial](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome) is better, because it by definition reflects the admixture proportion, but it is highly sensitive to stochastic error at small scales. A statistic called *fd* was therefore developed for this purpose that is more robust to the error introduced by using small numbers of SNPs ([Martin et al. 2015](https://doi.org/10.1093/molbev/msu269)). While the conventional *f* estimator assumes that P3 is the donor population and P2 the recipient, *fd* infers the donor on a site-by-site basis.
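
The definitions above can be made concrete with a short sketch. It assumes the standard frequency-based forms of *D* and *fd* from Martin et al. (2015), with the outgroup fixed for the ancestral allele at every site, so that each input is simply the derived allele frequency in P1, P2 and P3. The function names and the toy frequencies are hypothetical, and this is an illustration only, not necessarily how `ABBABABAwindows.py` implements the statistics.

```python
def abba_baba(p1, p2, p3):
    """Per-site ABBA and BABA proportions from derived allele frequencies."""
    abba = [(1 - a) * b * c for a, b, c in zip(p1, p2, p3)]
    baba = [a * (1 - b) * c for a, b, c in zip(p1, p2, p3)]
    return abba, baba


def d_statistic(p1, p2, p3):
    """D = (sum ABBA - sum BABA) / (sum ABBA + sum BABA)."""
    abba, baba = abba_baba(p1, p2, p3)
    # Undefined (division by zero) if no site is informative
    return (sum(abba) - sum(baba)) / (sum(abba) + sum(baba))


def fd_statistic(p1, p2, p3):
    """fd: observed excess of ABBA relative to the excess expected
    under complete introgression, with the donor chosen per site."""
    abba, baba = abba_baba(p1, p2, p3)
    # The population with the higher derived frequency is treated as the donor
    pd = [max(b, c) for b, c in zip(p2, p3)]
    abba_d, baba_d = abba_baba(p1, pd, pd)
    return (sum(abba) - sum(baba)) / (sum(abba_d) - sum(baba_d))


# Toy data: two sites, derived allele absent from P1
p1 = [0.0, 0.0]
p2 = [1.0, 0.5]
p3 = [1.0, 1.0]
print(d_statistic(p1, p2, p3))   # 1.0
print(fd_statistic(p1, p2, p3))  # 0.75
```

In this toy case *D* is at its maximum because there are no BABA sites at all, whereas *fd* is 0.75 because the derived allele is at intermediate frequency in P2 at the second site.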
54 |
55 | #### Selecting populations
56 |
57 | The interpretation of these statistics is strongly dependent on the populations selected. Firstly, the test is most sensitive to introgression from **P3 into P2**, rather than the other way around.
58 |
59 | Secondly, *fd* should be interpreted as a **quantification of excess shared variation between P3 and P2** that is **not also shared with P1**. If there is ongoing gene flow between P1 and P2, then any introgression from P3 to P2 will be underestimated.
60 |
61 | Finally, we are only able to quantify introgression that occurred **more recently than the split between P1 and P2**.
62 |
63 | Therefore, if we want to quantify the **maximum amount of detectable introgression** that has occurred across the genome, we should choose a P1 that is **allopatric and not too closely related to P2**.
64 |
65 | However, we can also use this feature of the test to our advantage. If we select a P1 that shares ongoing gene flow with P2, then the test will instead reveal parts of the genome at which **P2 and P3 share variation that is not shared by P1**. This can be useful for identifying wing patterning alleles, as these are often the only genomic regions at which subspecies remain distinct in the face of gene flow.
66 |
67 | ## Practical
68 |
69 | ### Preparation
70 |
71 | * Open a terminal window and navigate to a folder where you will run the exercise and store all the input and output data files.
72 |
73 | * Now create a subdirectory called 'data' and download the data files needed for this tutorial.
74 |
75 | ```bash
76 | mkdir data
77 |
78 | cd data
79 |
80 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz
81 |
82 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/hel92.pop.txt
83 |
84 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/chr18.LDhelmet_MLrho.w100.tsv
85 |
86 | cd ..
87 | ```
88 |
89 | * Next, download the collection of Python scripts required for this tutorial from [GitHub](https://github.com/simonhmartin).
90 |
91 | ```bash
92 | wget https://github.com/simonhmartin/genomics_general/archive/master.zip
93 | unzip master.zip
94 | ```
95 |
96 | ### Sliding window analysis
97 |
98 | * Run the analysis Python script for two separate cases. In both, P1 is the allopatric *H. melpomene melpomene* (`mel_mel`). P2 and P3 are the two populations we expect to be sharing genes. In the first case we are quantifying introgression between *H. melpomene rosina* (`mel_ros`) and *H. cydno chioneus* (`cyd_chi`), both from Panama. In the second we are quantifying introgression between *H. melpomene amaryllis* (`mel_ama`) and *H. timareta thelxinoe* (`tim_txn`), both from Peru.
99 |
100 | ```bash
101 | python genomics_general-master/ABBABABAwindows.py \
102 | -g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
103 | -o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz \
104 | -P1 mel_mel -P2 mel_ros -P3 cyd_chi -O num \
105 | --popsFile data/hel92.pop.txt -w 25000 -m 250 -T 2
106 |
107 | python genomics_general-master/ABBABABAwindows.py \
108 | -g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
109 | -o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz \
110 | -P1 mel_mel -P2 mel_ama -P3 tim_txn -O num \
111 | --popsFile data/hel92.pop.txt -w 25000 -m 250 -T 2
112 | ```
113 |
114 | We provide the script with an input file containing genotype data (`-g`) and the format of the genotypes (`-f`), an output file (`-o`), the ingroup populations and outgroup (`-P1`, `-P2`, `-P3` and `-O`), and a file specifying which population each sample is in (`--popsFile`).
115 |
116 | We also give parameters for the windows. These are "coordinate" windows, which means each window is the same length relative to the reference genome, but the number of SNPs per window can vary. The window size (`-w`) will be 25,000 bp. Windows will be required to contain a minimum (`-m`) of 250 SNPs to be considered valid.
117 |
118 | Finally, we tell the script to use two threads (`-T`). If you have a multi-core machine, you can increase this value and the script will run faster.
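
The idea of coordinate windows with a minimum number of SNPs can be sketched as follows. This is a simplified illustration that assumes non-overlapping windows and 1-based positions; the function name and the toy positions are hypothetical, and the script itself may handle windows differently (for example by allowing them to overlap).

```python
from collections import defaultdict


def coordinate_windows(positions, window_size, min_sites):
    """Group 1-based SNP positions into fixed-length windows.

    Returns (start, end, n_sites) for every window that contains
    at least min_sites SNPs. Windows with fewer SNPs are dropped.
    """
    counts = defaultdict(int)
    for pos in positions:
        # Positions 1..window_size fall in window 0, and so on
        counts[(pos - 1) // window_size] += 1

    windows = []
    for index in sorted(counts):
        n_sites = counts[index]
        if n_sites >= min_sites:
            start = index * window_size + 1
            end = (index + 1) * window_size
            windows.append((start, end, n_sites))
    return windows


snps = [10, 500, 24999, 25000, 25001, 60000]
print(coordinate_windows(snps, window_size=25000, min_sites=2))
# [(1, 25000, 4)]
```

Only the first window survives here, because the second and third windows each contain a single SNP. Raising the minimum makes each window's estimate less noisy at the cost of losing windows in SNP-poor regions.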
119 |
120 | #### Plotting window statistics
121 |
122 | * Open R and, if necessary, set the working directory to the tutorial directory. You can do this with the `setwd()` command, or in RStudio using the menus.
123 |
124 | We need to load each file of window statistics into R. We will make a list containing both datasets.
125 |
126 | * First define the names of the input files and read them in.
127 |
128 | ```R
129 | AB_files <- c("data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz",
130 | "data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz")
131 |
132 | AB_tables = lapply(AB_files, read.csv)
133 |
134 | head(AB_tables[[1]])
135 | ```
136 |
137 | *fd* is meaningless when *D* is negative, as it is designed to quantify the excess of ABBA over BABA only when an excess exists.
138 |
139 | * We therefore convert all *fd* values to 0 for windows where *D* is negative.
140 |
141 | ```R
142 | for (x in 1:length(AB_tables)){
143 | AB_tables[[x]]$fd = ifelse(AB_tables[[x]]$D < 0, 0, AB_tables[[x]]$fd)
144 | }
145 | ```
146 |
147 | * We can then plot *fd* across the chromosome for the two cases we have analysed.
148 |
149 | ```R
150 | par(mfrow=c(length(AB_tables), 1), mar = c(4,4,1,1))
151 |
152 | for (x in 1:length(AB_tables)){
153 | plot(AB_tables[[x]]$mid, AB_tables[[x]]$fd,
154 | type = "l", xlim=c(0,17e6),ylim=c(0,1),ylab="Admixture Proportion",xlab="Position")
155 | rect(1000000,0,1250000,1, col = rgb(0,0,0,0.2), border=NA)
156 | }
157 | ```
158 |
159 | This reveals that there is considerable heterogeneity in the extent of introgression across the chromosome. If we consider the region around *optix*, we see evidence for reduced introgression between *H. melpomene rosina* and *H. cydno chioneus*, as we predicted. By contrast, we see evidence for elevated introgression between *H. melpomene amaryllis* and *H. timareta thelxinoe*, which suggests that their shared wing patterns might result from adaptive introgression. Given this evidence, it would be advisable to build a phylogeny for the region around *optix* to test whether the *H. timareta* allele appears to be 'nested' within the *H. melpomene* clade. Previous studies have confirmed that this is the case ([Pardo-Diaz et al. 2012](https://doi.org/10.1371/journal.pgen.1002752), [Wallbank et al. 2016](https://doi.org/10.1371/journal.pbio.1002353)).
160 |
161 |
162 | #### In your own time
163 | What happens when we change the identity of P1, P2 and P3? What happens if we change the window size?
164 |
165 |
166 | ### Association between introgression and recombination rate
167 |
168 | Theory predicts that if there are many "barrier loci", at which introgression is selected against, we should see a global trend of reduced introgression in low recombination regions, due to stronger linkage between neutral introgressed alleles and deleterious ones.
169 |
170 | We can test this hypothesis by examining the relationship between recombination rate and *fd* across our chromosome.
171 |
172 | We have a previously-generated data file (provided) giving the estimated population recombination rate in 100 kb windows across this chromosome.
173 |
174 | * Open a terminal window and navigate to the tutorial folder.
175 |
176 | * Now we will make a matching dataset with *fd* for 100 kb windows, here just using the species pair showing the highest rate of introgression: *H. melpomene rosina* and *H. cydno chioneus* from Panama.
177 |
178 | ```bash
179 | python genomics_general-master/ABBABABAwindows.py \
180 | -g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
181 | -o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w100m1.csv.gz \
182 | -P1 mel_mel -P2 mel_ros -P3 cyd_chi -O num \
183 | --popsFile data/hel92.pop.txt -w 100000 -m 1000 -T 2
184 | ```
185 |
186 | * Now, **back in R**, read in this new data file.
187 |
188 | ```R
189 | AB_table_w100 <- read.csv("data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w100m1.csv.gz")
190 | ```
191 |
192 | * As before, we convert any *fd* values for windows with negative *D* to 0.
193 |
194 | ```R
195 | AB_table_w100$fd = ifelse(AB_table_w100$D < 0, 0, AB_table_w100$fd)
196 | ```
197 |
198 | Now we read in the table of recombination rates for 100 kb windows. Here the column `ML_rho` gives the maximum likelihood estimate of the population recombination rate for each window.
199 |
200 | ```R
201 | rec_table <- read.table("data/chr18.LDhelmet_MLrho.w100.tsv", header=T)
202 | head(rec_table)
203 | ```
204 |
205 | * Due to the noisy nature of the data, we want to compare *fd* values in bins of different recombination rate. We will use the `cut` function to separate the windows into three equal-width bins with low, medium and high recombination rates.
206 |
207 | ```R
208 | rec_bin <- cut(rec_table$ML_rho, 3)
209 | ```
210 |
211 | * Finally, we can make boxplots to compare the inferred level of admixture (*fd*) between these bins.
212 |
213 | ```R
214 | boxplot(AB_table_w100$fd ~ rec_bin)
215 | ```
216 |
217 | This shows that the level of introgression does increase with increasing recombination rate, consistent with a model in which a large number of barrier loci select against introgression genome-wide.
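
The boxplot comparison relies on the rows of the *fd* table and the recombination table referring to the same windows in the same order, which may not hold if some windows were dropped for having too few SNPs. A minimal sketch of a more defensive version is given below. It assumes both tables have been reduced to dictionaries keyed by window start position; the function names and toy values are hypothetical. As with `cut(x, 3)` in R, the bins are of equal width in recombination rate rather than containing equal numbers of windows.

```python
from statistics import median


def bin_by_equal_width(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        # The maximum value would otherwise fall just outside the last bin
        bins.append(min(idx, n_bins - 1))
    return bins


def median_fd_per_bin(rho_by_start, fd_by_start, n_bins=3):
    """Median fd per recombination bin, using only windows in both tables."""
    shared = sorted(set(rho_by_start) & set(fd_by_start))
    rho = [rho_by_start[s] for s in shared]
    bins = bin_by_equal_width(rho, n_bins)

    grouped = {b: [] for b in range(n_bins)}
    for start, b in zip(shared, bins):
        grouped[b].append(fd_by_start[start])
    # Empty bins are reported as None rather than raising an error
    return {b: (median(v) if v else None) for b, v in grouped.items()}


rho = {1: 0.0, 100001: 0.15, 200001: 0.2, 300001: 0.3}
fd = {1: 0.1, 100001: 0.2, 300001: 0.4}  # window 200001 was filtered out
print(median_fd_per_bin(rho, fd))
```

Matching on window coordinates rather than row order means that a window missing from one table is simply skipped instead of silently shifting every subsequent comparison.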
218 |
219 |
220 |
--------------------------------------------------------------------------------
/ABBA_BABA_windows/data/chr18.LDhelmet_MLrho.w100.tsv:
--------------------------------------------------------------------------------
1 | scaffold start end mid sites ML_rho
2 | chr18 1 100000 46560 3995 0.00600000
3 | chr18 100001 200000 144250 3257 0.00400000
4 | chr18 200001 300000 255763 3156 0.05100000
5 | chr18 300001 400000 358576 3262 0.02900000
6 | chr18 400001 500000 444648 4045 0.00600000
7 | chr18 500001 600000 548369 4852 0.01400000
8 | chr18 600001 700000 653655 6525 0.02900000
9 | chr18 700001 800000 751285 5564 0.01600000
10 | chr18 800001 900000 849423 6993 0.05200000
11 | chr18 900001 1000000 944873 5912 0.06700000
12 | chr18 1000001 1100000 1051066 5959 0.03400000
13 | chr18 1100001 1200000 1143799 6078 0.06200000
14 | chr18 1200001 1300000 1250220 6453 0.04700000
15 | chr18 1300001 1400000 1350658 8323 0.08100000
16 | chr18 1400001 1500000 1449581 8837 0.07500000
17 | chr18 1500001 1600000 1548397 5832 0.06400000
18 | chr18 1600001 1700000 1647425 5546 0.13800000
19 | chr18 1700001 1800000 1748664 6367 0.14100000
20 | chr18 1800001 1900000 1852219 6150 0.15100000
21 | chr18 1900001 2000000 1946230 5628 0.12300000
22 | chr18 2000001 2100000 2048617 5852 0.16200000
23 | chr18 2100001 2200000 2151606 5521 0.16800000
24 | chr18 2200001 2300000 2250247 5126 0.20100000
25 | chr18 2300001 2400000 2347408 5732 0.08100000
26 | chr18 2400001 2500000 2449981 4496 0.16700000
27 | chr18 2500001 2600000 2551645 5357 0.13400000
28 | chr18 2600001 2700000 2652989 4707 0.16000000
29 | chr18 2700001 2800000 2748306 4979 0.07800000
30 | chr18 2800001 2900000 2848101 4506 0.14100000
31 | chr18 2900001 3000000 2951650 5144 0.23900000
32 | chr18 3000001 3100000 3050027 4998 0.14100000
33 | chr18 3100001 3200000 3147735 4583 0.13400000
34 | chr18 3200001 3300000 3251521 5845 0.08000000
35 | chr18 3300001 3400000 3349636 6943 0.12000000
36 | chr18 3400001 3500000 3447132 6376 0.11700000
37 | chr18 3500001 3600000 3548374 5483 0.20100000
38 | chr18 3600001 3700000 3648338 5012 0.20100000
39 | chr18 3700001 3800000 3747793 5107 0.15600000
40 | chr18 3800001 3900000 3849769 4863 0.20100000
41 | chr18 3900001 4000000 3952193 4482 0.18000000
42 | chr18 4000001 4100000 4053021 2451 0.04400000
43 | chr18 4100001 4200000 4154181 4294 0.07600000
44 | chr18 4200001 4300000 4253246 5035 0.06300000
45 | chr18 4300001 4400000 4348341 4838 0.02300000
46 | chr18 4400001 4500000 4450948 5813 0.03000000
47 | chr18 4500001 4600000 4544610 5127 0.08500000
48 | chr18 4600001 4700000 4652599 5005 0.06300000
49 | chr18 4700001 4800000 4750518 6236 0.07600000
50 | chr18 4800001 4900000 4849846 5230 0.20100000
51 | chr18 4900001 5000000 4952740 5224 0.12000000
52 | chr18 5000001 5100000 5050030 5814 0.05100000
53 | chr18 5100001 5200000 5147719 4597 0.01700000
54 | chr18 5200001 5300000 5250663 5551 0.15700000
55 | chr18 5300001 5400000 5349198 6546 0.15100000
56 | chr18 5400001 5500000 5450772 6035 0.04300000
57 | chr18 5500001 5600000 5551035 2662 0.09000000
58 | chr18 5600001 5700000 5651676 4164 0.20200000
59 | chr18 5700001 5800000 5761072 4324 0.18200000
60 | chr18 5800001 5900000 5847715 4374 0.15600000
61 | chr18 5900001 6000000 5953661 4351 0.07700000
62 | chr18 6000001 6100000 6051735 5599 0.06700000
63 | chr18 6100001 6200000 6153737 6015 0.13400000
64 | chr18 6200001 6300000 6247837 6055 0.20100000
65 | chr18 6300001 6400000 6348694 5578 0.31500000
66 | chr18 6400001 6500000 6456776 3171 0.25200000
67 | chr18 6500001 6600000 6547177 4649 0.30100000
68 | chr18 6600001 6700000 6649288 5526 0.30100000
69 | chr18 6700001 6800000 6749924 5575 0.27200000
70 | chr18 6800001 6900000 6851599 5661 0.20100000
71 | chr18 6900001 7000000 6952150 5431 0.23400000
72 | chr18 7000001 7100000 7051904 5203 0.20100000
73 | chr18 7100001 7200000 7145221 5600 0.12800000
74 | chr18 7200001 7300000 7248921 5289 0.05000000
75 | chr18 7300001 7400000 7350713 5226 0.18600000
76 | chr18 7400001 7500000 7454737 5105 0.30100000
77 | chr18 7500001 7600000 7547495 5554 0.36700000
78 | chr18 7600001 7700000 7649159 5050 0.33500000
79 | chr18 7700001 7800000 7748370 5679 0.30100000
80 | chr18 7800001 7900000 7851145 5985 0.23400000
81 | chr18 7900001 8000000 7947310 4476 0.20100000
82 | chr18 8000001 8100000 8052732 4752 0.03600000
83 | chr18 8100001 8200000 8152608 5313 0.12500000
84 | chr18 8200001 8300000 8244641 4568 0.26100000
85 | chr18 8300001 8400000 8349617 4850 0.21300000
86 | chr18 8400001 8500000 8450592 5272 0.11000000
87 | chr18 8500001 8600000 8551057 5630 0.30100000
88 | chr18 8600001 8700000 8651023 5028 0.16100000
89 | chr18 8700001 8800000 8749848 6192 0.16800000
90 | chr18 8800001 8900000 8849263 4830 0.06700000
91 | chr18 8900001 9000000 8946341 5099 0.15700000
92 | chr18 9000001 9100000 9051165 5394 0.10000000
93 | chr18 9100001 9200000 9149324 5653 0.13400000
94 | chr18 9200001 9300000 9249230 6163 0.20100000
95 | chr18 9300001 9400000 9351994 5548 0.26700000
96 | chr18 9400001 9500000 9449981 5779 0.15100000
97 | chr18 9500001 9600000 9550248 6364 0.08100000
98 | chr18 9600001 9700000 9647977 6011 0.15100000
99 | chr18 9700001 9800000 9752049 6196 0.15100000
100 | chr18 9800001 9900000 9850559 6414 0.15100000
101 | chr18 9900001 10000000 9945646 5700 0.15100000
102 | chr18 10000001 10100000 10056361 5837 0.10100000
103 | chr18 10100001 10200000 10144980 5740 0.15100000
104 | chr18 10200001 10300000 10246525 4999 0.16800000
105 | chr18 10300001 10400000 10352072 5543 0.05600000
106 | chr18 10400001 10500000 10450213 6629 0.18000000
107 | chr18 10500001 10600000 10542757 6032 0.12500000
108 | chr18 10600001 10700000 10648011 5838 0.15100000
109 | chr18 10700001 10800000 10751097 5359 0.14400000
110 | chr18 10800001 10900000 10844899 5997 0.13400000
111 | chr18 10900001 11000000 10949680 5541 0.13400000
112 | chr18 11000001 11100000 11053017 5827 0.10100000
113 | chr18 11100001 11200000 11151930 6987 0.06000000
114 | chr18 11200001 11300000 11251407 6889 0.17200000
115 | chr18 11300001 11400000 11352661 6430 0.10100000
116 | chr18 11400001 11500000 11448891 5804 0.14400000
117 | chr18 11500001 11600000 11547984 5215 0.20100000
118 | chr18 11600001 11700000 11645322 5412 0.15100000
119 | chr18 11700001 11800000 11749946 4870 0.16700000
120 | chr18 11800001 11900000 11855630 4995 0.26700000
121 | chr18 11900001 12000000 11950962 6040 0.21700000
122 | chr18 12000001 12100000 12051842 5975 0.22200000
123 | chr18 12100001 12200000 12150093 6522 0.20100000
124 | chr18 12200001 12300000 12253119 5882 0.24300000
125 | chr18 12300001 12400000 12351250 4661 0.16700000
126 | chr18 12400001 12500000 12450574 6752 0.10100000
127 | chr18 12500001 12600000 12552698 5395 0.05000000
128 | chr18 12600001 12700000 12655794 5209 0.26100000
129 | chr18 12700001 12800000 12747713 4871 0.15100000
130 | chr18 12800001 12900000 12850906 5601 0.06000000
131 | chr18 12900001 13000000 12950921 5881 0.30100000
132 | chr18 13000001 13100000 13050722 7243 0.26700000
133 | chr18 13100001 13200000 13146589 6063 0.20100000
134 | chr18 13200001 13300000 13249018 5693 0.17600000
135 | chr18 13300001 13400000 13355040 5507 0.10000000
136 | chr18 13400001 13500000 13446283 6415 0.25000000
137 | chr18 13500001 13600000 13549558 6076 0.22000000
138 | chr18 13600001 13700000 13650677 5519 0.21500000
139 | chr18 13700001 13800000 13752268 5997 0.10000000
140 | chr18 13800001 13900000 13850881 5248 0.20100000
141 | chr18 13900001 14000000 13952623 6192 0.18000000
142 | chr18 14000001 14100000 14049677 6377 0.15100000
143 | chr18 14100001 14200000 14152226 5848 0.20100000
144 | chr18 14200001 14300000 14247959 6053 0.11700000
145 | chr18 14300001 14400000 14349881 5654 0.17200000
146 | chr18 14400001 14500000 14448018 6354 0.20100000
147 | chr18 14500001 14600000 14547166 5049 0.15100000
148 | chr18 14600001 14700000 14648030 4807 0.08500000
149 | chr18 14700001 14800000 14751704 4824 0.12000000
150 | chr18 14800001 14900000 14849132 4999 0.26300000
151 | chr18 14900001 15000000 14950486 4654 0.12500000
152 | chr18 15000001 15100000 15050224 5086 0.30100000
153 | chr18 15100001 15200000 15152602 4936 0.25000000
154 | chr18 15200001 15300000 15251936 5903 0.14100000
155 | chr18 15300001 15400000 15352186 5320 0.10100000
156 | chr18 15400001 15500000 15446177 5030 0.02800000
157 | chr18 15500001 15600000 15555223 5334 0.02300000
158 | chr18 15600001 15700000 15650346 5609 0.07200000
159 | chr18 15700001 15800000 15746355 6637 0.11700000
160 | chr18 15800001 15900000 15851037 6635 0.15100000
161 | chr18 15900001 16000000 15946963 6594 0.11100000
162 | chr18 16000001 16100000 16050347 5424 0.03400000
163 | chr18 16100001 16200000 16151881 6213 0.06000000
164 | chr18 16200001 16300000 16250956 6313 0.03000000
165 | chr18 16300001 16400000 16349997 6997 0.03400000
166 | chr18 16400001 16500000 16443632 5562 0.01700000
167 | chr18 16500001 16600000 16553622 4858 0.03400000
168 | chr18 16600001 16700000 16648639 5961 0.03400000
169 | chr18 16700001 16800000 16749646 3459 0.02500000
170 |
--------------------------------------------------------------------------------
/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz
--------------------------------------------------------------------------------
/ABBA_BABA_windows/data/hel92.pop.txt:
--------------------------------------------------------------------------------
1 | ros.CAM1841 mel_ros
2 | ros.CAM1880 mel_ros
3 | ros.CAM2045 mel_ros
4 | ros.CAM2059 mel_ros
5 | ros.CAM2519 mel_ros
6 | ros.CAM2552 mel_ros
7 | ros.CJ2071 mel_ros
8 | ros.CJ531 mel_ros
9 | ros.CJ533 mel_ros
10 | ros.CJ546 mel_ros
11 | vul.CS10 mel_vul
12 | vul.CS3603 mel_vul
13 | vul.CS3605 mel_vul
14 | vul.CS3606 mel_vul
15 | vul.CS3612 mel_vul
16 | vul.CS3614 mel_vul
17 | vul.CS3615 mel_vul
18 | vul.CS3617 mel_vul
19 | vul.CS3618 mel_vul
20 | vul.CS3621 mel_vul
21 | mal.CS1002 mel_mal
22 | mal.CS1011 mel_mal
23 | mal.CS1815 mel_mal
24 | mal.CS21 mel_mal
25 | mal.CS22 mel_mal
26 | mal.CS24 mel_mal
27 | mal.CS586 mel_mal
28 | mal.CS594 mel_mal
29 | mal.CS604 mel_mal
30 | mal.CS615 mel_mal
31 | ama.JM160 mel_ama
32 | ama.JM216 mel_ama
33 | ama.JM293 mel_ama
34 | ama.JM48 mel_ama
35 | ama.MJ11-3188 mel_ama
36 | ama.MJ11-3189 mel_ama
37 | ama.MJ11-3202 mel_ama
38 | ama.MJ12-3217 mel_ama
39 | ama.MJ12-3258 mel_ama
40 | ama.MJ12-3301 mel_ama
41 | melG.CAM1349 mel_mel
42 | melG.CAM1422 mel_mel
43 | melG.CAM2035 mel_mel
44 | melG.CAM8171 mel_mel
45 | melG.CAM8216 mel_mel
46 | melG.CAM8218 mel_mel
47 | melG.CJ13435 mel_mel
48 | melG.CJ9315 mel_mel
49 | melG.CJ9316 mel_mel
50 | melG.CJ9317 mel_mel
51 | chi.CAM25091 cyd_chi
52 | chi.CAM25137 cyd_chi
53 | chi.CAM580 cyd_chi
54 | chi.CAM582 cyd_chi
55 | chi.CAM585 cyd_chi
56 | chi.CAM586 cyd_chi
57 | chi.CJ553 cyd_chi
58 | chi.CJ560 cyd_chi
59 | chi.CJ564 cyd_chi
60 | chi.CJ565 cyd_chi
61 | zel.CS1 cyd_zel
62 | zel.CS1028 cyd_zel
63 | zel.CS1029 cyd_zel
64 | zel.CS1030 cyd_zel
65 | zel.CS1033 cyd_zel
66 | zel.CS1035 cyd_zel
67 | zel.CS2 cyd_zel
68 | zel.CS2262 cyd_zel
69 | zel.CS273 cyd_zel
70 | zel.CS30 cyd_zel
71 | flo.CS12 tim_flo
72 | flo.CS13 tim_flo
73 | flo.CS14 tim_flo
74 | flo.CS15 tim_flo
75 | flo.CS2337 tim_flo
76 | flo.CS2338 tim_flo
77 | flo.CS2341 tim_flo
78 | flo.CS2350 tim_flo
79 | flo.CS2358 tim_flo
80 | flo.CS2359 tim_flo
81 | thxn.JM313 tim_txn
82 | thxn.JM57 tim_txn
83 | thxn.JM84 tim_txn
84 | thxn.JM86 tim_txn
85 | thxn.MJ12-3221 tim_txn
86 | thxn.MJ12-3233 tim_txn
87 | thxn.MJ12-3308 tim_txn
88 | txn.MJ11-3339 tim_txn
89 | txn.MJ11-3340 tim_txn
90 | txn.MJ11-3460 tim_txn
91 | nu_sil.MJ09-4125 num
92 | nu_sil.MJ09-4184 num
93 |
--------------------------------------------------------------------------------
/ABBA_BABA_windows/images/map_and_tree.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_windows/images/map_and_tree.jpg
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Tutorials on evolutionary genomics
3 |
4 | * [ABBA BABA statistics using genome wide SNP data](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome/README.md)
5 | A tutorial that explores the "*ABBA BABA*" *D* and *f* statistics, and how these can be applied using genome-scale SNP data, including significance testing using the block jackknife.
6 |
7 | * [ABBA BABA analysis in sliding windows](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_windows/README.md)
8 | A tutorial on applying "*ABBA BABA*" statistics (specifically *fd*, [Martin et al. 2015](https://doi.org/10.1093/molbev/msu269)) in sliding windows to quantify variation in the extent of introgression across the genome.
9 |
10 | * [topology_weighting](https://github.com/simonhmartin/tutorials/tree/master/topology_weighting/README.md)
11 | A tutorial on topology weighting using [Twisst](https://github.com/simonhmartin/twisst) to explore evolutionary relationships across the genome, as described in [Martin & Van Belleghem 2017](http://doi.org/10.1534/genetics.116.194720).
12 |
--------------------------------------------------------------------------------
/topology_weighting/README.md:
--------------------------------------------------------------------------------
1 | # Tutorial: Topology Weighting
2 | ___
3 | ## Requirements
4 | * Python 3+
5 | * Numpy 1.10+
6 | * [ete3](http://etetoolkit.org/)
7 | * [msprime](https://msprime.readthedocs.io/en/stable/) (for part 3 only)
8 | * R 3.0+
9 | * ape
10 | * data.table
11 | * [Phyml](http://www.atgc-montpellier.fr/phyml/) (for part 2 only)
12 |
13 | ___
14 | ## Introduction
15 |
16 | Topology weighting is a means to quantify relationships among taxa that are not necessarily monophyletic. It provides a summary of a complex genealogy by considering simpler "taxon topologies" and quantifying the proportion of sub-trees that match each taxon topology. The method we use to compute the weightings is called *Twisst*: Topology weighting by iterative sampling of sub-trees.
17 |
18 | In this practical we will use a simulation to explore how topology weightings provide a summary of the genealogical history. We will then try to infer topology weights across our simulated chromosome using neighbour joining trees inferred for narrow windows.
19 |
20 |
21 | #### Workflow Overview
22 |
23 | In **Part 1** of the practical we will analyse a set of genealogies that represent the history of a part of chromosome that evolved under a fairly complex history including population subdivision, gene flow and selection. We will compute topology weightings across this genomic region with `twisst` and then explore the results in `R`.
24 |
25 | In **Part 2**, we take a step backwards into the real world, in which we ***don't know*** the true genealogical history, but instead we have a set of sequences from which we hope to ***infer*** the genealogy. We will use an unsophisticated approach to do this: making phylogenies for windows across the genome, using a standard phylogenetics tool. By comparing our inferred histories to the truth in `R`, we will gain insights into the **tradeoff between power and resolution** in genealogical inference.
26 |
27 | In **Part 3** we will explore how demographic parameters affect weightings by running coalescent simulations with `msprime` and then computing topology weightings directly from the output, all within Python.
28 |
29 | ___
30 | ## Practical Part 1. Analysis of simulated genealogies
31 |
32 | #### Download code and data
33 |
34 | The scripts and example data for this part of the practical are in the `twisst` package on github.
35 |
36 | ```bash
37 | #download the twisst package archive from github
38 | wget https://github.com/simonhmartin/twisst/archive/v0.2.tar.gz
39 |
40 | #extract the files from the archive
41 | tar -xzf v0.2.tar.gz
42 |
43 | #remove the archive
44 | rm v0.2.tar.gz
45 | ```
46 |
47 | * The example data we will use consists of a text file of genealogies coded as [Newick](https://en.wikipedia.org/wiki/Newick_format) trees. In this case the trees were simulated using the coalescent simulator [`msms`](https://www.mabs.at/ewing/msms/index.shtml). If we had real data, we would not know the trees, and would have to infer them using tools like Relate, tsinfer, or by just running phylogeny inference on narrow windows, which we do in Part 2 below.
48 |
49 |
50 | We can look at the first tree in the file:
51 |
52 | ```bash
53 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.trees.gz | head -n 1
54 | ```
55 |
56 | It's pretty ugly, but don't be afraid. The numbers before each `:` are the sample names. The numbers after the `:` are the branch lengths. We will only be considering the tree shape and not branch lengths in this tutorial.
57 |
58 | * We can also check the total number of distinct genealogies for this region of the chromosome:
59 |
60 | ```bash
61 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.trees.gz | wc -l
62 | ```
63 |
64 | * For plotting, we also need to know where these genealogies occur on the chromosome. This data is provided in a second file with three columns: chromosome, start and end for each genealogy. This file has the same number of lines as the trees file.
65 |
66 | ```bash
67 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz | head
68 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz | wc -l
69 | ```
70 |
71 | As you can see, some genealogies in this simulated data set occupy very narrow regions of the chromosome, as small as 1 bp. Over many generations of recombination, it can be the case that, for a given group of samples, the genealogy varies at a fine scale across the chromosome. In this case we know the *true* genealogy for each region - it has not been inferred. We will get to that topic in Part 2.
72 |
73 | #### Compute topology weightings
74 |
75 | * We run [`twisst`](https://github.com/simonhmartin/twisst) to compute the weightings for each topology.
76 |
77 | The only information *Twisst* requires is
78 | * the name of the trees file (specified with `-t`)
79 | * the name of the output weights file (`-w`)
80 | * The name of each group, and the samples that belong to it (`-g`).
81 |
82 | The grouping may be determined by species, phenotype or geography (or whatever you like). In our case there are four groups of 10 haploid samples each.
83 | Group A consists of samples 1:10, B consists of 11:20 etc.
84 |
85 | ```bash
86 | python twisst-0.2/twisst.py \
87 | -t twisst-0.2/examples/msms_4of10_l50k_r500_sweep.trees.gz \
88 | -w msms_4of10_l50k_r500_sweep.weights.tsv.gz \
89 | -g A 1,2,3,4,5,6,7,8,9,10 \
90 | -g B 11,12,13,14,15,16,17,18,19,20 \
91 | -g C 21,22,23,24,25,26,27,28,29,30 \
92 | -g D 31,32,33,34,35,36,37,38,39,40 \
93 | --outgroup D
94 | ```
95 |
96 | `twisst` will consider all possible combinations of samples in which there is one sample per group. For example the first combination examined will be samples `1`, `11`, `21` and `31`, representing groups `A`, `B`, `C` and `D`, respectively. Ignoring all other branches in the tree, `twisst` records the topology of the 'subtree' containing just the four samples of interest, which could have one of three possible shapes: `(((A,B),C),D)`, `(((A,C),B),D)` or `(((B,C),A),D)` (Note that here the trees are represented as rooted, with D as the outgroup. We can tell `twisst` which is the outgroup (`--outgroup`) so it displays the trees as correctly rooted, but this does not affect the results).
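
The set of sample combinations described above can be enumerated directly, which makes it easy to see how many subtrees are examined for each genealogy. This sketch only illustrates the counting, using the group definitions from the command above; it is not a description of how `twisst` is implemented internally.

```python
from itertools import product

# Four groups of 10 haploid samples, as passed to twisst with -g
groups = {
    "A": list(range(1, 11)),
    "B": list(range(11, 21)),
    "C": list(range(21, 31)),
    "D": list(range(31, 41)),
}

# Every combination that takes exactly one sample from each group
combinations = list(product(*(groups[name] for name in "ABCD")))

print(combinations[0])    # (1, 11, 21, 31)
print(len(combinations))  # 10000
```

Each of these 10,000 combinations contributes one subtree, and each subtree matches exactly one of the three possible topologies.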
97 |
98 | * Check what the results look like
99 |
100 | ```bash
101 | #first 30 lines
102 | zcat msms_4of10_l50k_r500_sweep.weights.tsv.gz | head -n 30
103 | #total number of lines
104 | zcat msms_4of10_l50k_r500_sweep.weights.tsv.gz | wc -l
105 | ```
106 |
107 | The three columns in the weights file represent the three topologies, which are defined in the file too. The numbers are not proportions but instead the total number of combinations representing each topology. Each line should sum to 10^4 = 10,000, as there are four groups of 10 samples each.
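
Converting these counts to proportions is a matter of dividing each row by its total. The sketch below works on rows of counts that have already been read into memory; the function name and example rows are hypothetical, and rows that sum to zero are returned as missing values rather than causing a division error.

```python
def weights_to_proportions(count_rows):
    """Convert per-topology counts to proportions, row by row."""
    proportions = []
    for row in count_rows:
        total = sum(row)
        if total == 0:
            # No subtrees counted for this genealogy: treat as missing
            proportions.append([float("nan")] * len(row))
        else:
            proportions.append([count / total for count in row])
    return proportions


rows = [[10000, 0, 0], [5000, 2500, 2500]]
print(weights_to_proportions(rows))
# [[1.0, 0.0, 0.0], [0.5, 0.25, 0.25]]
```

The first row corresponds to a genealogy in which every subtree has the same topology, while the second shows a mixture of all three.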
108 |
109 | You will see that some adjacent lines have identical weightings. This has to do with the fact that some recombination events change the relationships among the samples, but not in a way that influences the weightings. It's worth understanding why this is the case.
110 |
111 | #### Analyse the results
112 |
113 | * Open `R` or `RStudio` and, if necessary, set the working directory to where you have saved the files. You can use the `setwd()` command or, in `RStudio`, the menus.
114 |
115 | * Start a new R script to record the commands
116 |
117 | * First we will import a set of functions distributed with `twisst` that will help with plotting.
118 |
119 | ```R
120 | source("twisst-0.2/plot_twisst.R")
121 | ```
122 |
123 | * Please note the cool name of the above script.
124 |
125 | * We define the files containing the weights for each genealogy, and the start and end positions for each block along the chromosome.
126 |
127 | ```R
128 | #weights file with a column for each topology
129 | weights_file <- 'msms_4of10_l50k_r500_sweep.weights.tsv.gz'
130 |
131 | #coordinates file for each window
132 | window_data_file <- 'twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz'
133 | ```
134 | * We already know the structure of these two files, but instead of reading them in and working with them directly, we will use the convenient `import.twisst` function.
135 |
136 | ```R
137 | twisst_data <- import.twisst(weights_file, window_data_file)
138 | ```
139 |
140 | * Plot the raw weightings using the provided `plot.twisst` function.
141 |
142 | ```R
143 | pdf("simulated_trees_weights.pdf", width=8, height=6)
144 | plot.twisst(twisst_data)
145 | dev.off()
146 | ```
147 |
148 | * This will write a pdf file to the directory you are working in. Open it!
149 |
150 | The trees at the top of the plot show the 3 different topologies we have weighted. The lower plot shows the weightings. You will see columns of colour of varying width. Each column corresponds to a single block with a unique genealogy. Some blocks are all one colour and reach a value of 1. That indicates that all subtrees in that block have the same topology, indicating a consistent and completely sorted genealogy. Other columns have two or more colours overlaid, indicating that the genealogy has a more complex evolutionary history, with individuals jumping between groups. A completely random genealogy, in which there is no clustering by group, would have equal weightings for all three topologies.
151 |
152 | It is often desirable to smooth the weightings so that we can see more clearly how they vary across the chromosome.
153 |
154 | * Create smoothed weightings using the `smooth.twisst` function and re-plot.
155 |
156 | ```R
157 | twisst_data_smooth <- smooth.twisst(twisst_data, span_bp=5000)
158 |
159 | pdf("simulated_trees_weights_smooth.pdf", width=8, height=6)
160 | plot.twisst(twisst_data_smooth)
161 | dev.off()
162 | ```
163 |
164 | This averaged the weightings over a 5 kb window. You can explore what happens when you change the `span_bp` parameter.
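
To get a feel for what `span_bp` controls, here is a deliberately simple moving average in Python. It is a stand-in for illustration only and is not necessarily how `smooth.twisst` works internally; the function name, block midpoints and weightings are all made up.

```python
def window_average(mids, values, span_bp):
    """Average the values whose block midpoints lie within span_bp/2 of each midpoint."""
    half = span_bp / 2
    smoothed = []
    for centre in mids:
        nearby = [v for m, v in zip(mids, values) if abs(m - centre) <= half]
        smoothed.append(sum(nearby) / len(nearby))
    return smoothed

# Made-up block midpoints (bp) and weightings for a single topology.
mids = [1000, 2000, 3000, 9000]
topo3 = [0.0, 1.0, 0.5, 1.0]

print(window_average(mids, topo3, span_bp=5000))
```

A larger span pulls in more neighbouring blocks, which gives a smoother curve but blurs narrow features such as the spike at the selected locus.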
165 |
166 | Now we see more clearly that the dominant topologies are `topo1` and `topo3`. In this case, the simulations involved populations splitting according to `topo1`, but adaptive introgression was simulated from `C` into `B`. This is why `topo3` is more prevalent than `topo2`, and also why `topo3` has a large spike in the middle of the region, at the location of the selected locus.
167 |
168 | * We can look at the overall distribution of weightings too, or just check the mean values.
169 |
170 | ```R
171 | plot.twisst.summary.boxplot(twisst_data)
172 |
173 | twisst_data$weights_mean
174 | ```
175 |
176 | The simulation history followed `topo1`, which is the most abundant topology, as expected. The introgression created an excess of `topo3`. But why is `topo2` not zero?
177 |
178 | Lineage sorting is often incomplete if the taxa split recently, and even when it is complete, we can find genealogies that are discordant with the 'species tree' due to stochasticity in lineage sorting in the past. If you look carefully at the first plot we made, you might find one narrow window in which `topo2` has a weighting of 1. This indicates a *completely sorted*, but *discordant* genealogy. If the difference between incomplete lineage sorting and discordance is not immediately clear to you, you are not alone. In Part 3 we will do our own simulations to look at the conditions under which incomplete sorting and discordance increase or decrease.
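
One way to keep the two ideas apart is to label each block from its weightings. The sketch below is a hypothetical helper, not part of `twisst`; it assumes the weightings have been normalised to proportions and that the species topology is the first one (`topo1`).

```python
def classify_block(weights, species_topo=0, tol=1e-9):
    """Label one block from its normalised topology weightings.

    weights      -- proportions for each topology, summing to 1
    species_topo -- index of the topology matching the species tree (assumed: topo1)
    """
    top = max(weights)
    if top < 1 - tol:
        # More than one topology is represented among the subtrees.
        return "incompletely sorted"
    if weights.index(top) == species_topo:
        return "sorted, concordant"
    return "sorted, discordant"

print(classify_block([1.0, 0.0, 0.0]))
print(classify_block([0.0, 1.0, 0.0]))
print(classify_block([0.6, 0.15, 0.25]))
```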
179 |
180 | ___
181 |
182 | ## Practical Part 2. Inferring weightings from sequence data
183 |
184 | Above we have used the 'true' genealogies as they were simulated. In most cases, all we have is sequence data, and its evolutionary history has to be inferred. In fact there are two things we do not know:
185 | 1. We do not know the genealogical relationship among all individuals
186 | 2. We do not know the 'breakpoints' at which recombination has changed the relationship as we move along the chromosome
187 |
188 | In this part, we will start from sequence data (the sequences were simulated under the history covered in Part 1, but we pretend that we do not know that at this stage). We will use a fairly straightforward approach in which we infer genealogies in windows along the genome. We will then run `twisst` on these to see whether we can recover something close to the underlying truth.
189 |
190 | Note that one of the lessons in this part is that inferring trees in windows is **crude and potentially flawed**, especially if we get the window size wrong.
191 |
192 | #### Download code and data
193 |
194 | * The scripts for this part are in the genomics_general package on github, which we need to download:
195 |
196 | ```bash
197 | #download package
198 | wget https://github.com/simonhmartin/genomics_general/archive/v0.3.tar.gz
199 | #extract files from zipped archive
200 | tar -xzf v0.3.tar.gz
201 | #delete zipped file
202 | rm v0.3.tar.gz
203 | ```
204 |
205 | * To ensure that the libraries are recognisable by Python, add the `genomics_general-0.3` directory to the Python path
206 |
207 | ```bash
208 | export PYTHONPATH=$PYTHONPATH:genomics_general-0.3
209 | ```
210 |
211 | * The sequence file we will use is provided with the `twisst` package, downloaded in Part 1. The file is in simple `.geno` format, which has columns for chromosome, position and genotype for each individual:
212 |
213 | ```bash
214 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz | head -n 5 | cut -f 1-8
215 | ```
216 |
217 | (In Part 2B we will use a script that generates a `.geno` file from a VCF file.)
218 |
219 | #### Inferring trees for windows
220 |
221 | We have a file of SNPs distributed across a 50 kb genomic region. We will infer trees in windows of a defined number of SNPs, such that each window has a similar amount of information, but might differ in its absolute span across the chromosome, depending on the SNP density.
222 |
223 | There is an underlying **tradeoff** in this approach. We want to select a window size that is ***large enough*** to provide the necessary ***power*** for tree inference, but ***small enough*** to achieve enough ***resolution*** to capture how genealogical histories change across the chromosome.
224 |
225 | We will run the script that reads the SNP file in windows and then infers a tree for each window using [Phyml](http://www.atgc-montpellier.fr/phyml/). Phyml is capable of maximum likelihood inference, but here we will not use optimisation, so the trees output will be Neighbour-Joining trees, inferred using the [BIONJ](http://www.atgc-montpellier.fr/bionj/) algorithm. Using simulations, we have found that the neighbour-joining algorithm performs better than maximum likelihood inference for short sequences.
226 |
227 | * First just check the options of the script
228 |
229 | ```bash
230 | python genomics_general-0.3/phylo/phyml_sliding_windows.py -h
231 | ```
232 |
233 | The main options are `-g` to specify the input `.geno` file and `--prefix` to specify the prefix of the output files
234 |
235 | There are also options for setting the type and size of the window. We will run the script four times, using a range of different window sizes of 20, 50, 100, and 500 SNPs. Note that the input file contains only SNPs, so by setting `--windType sites` each window will be set to contain a fixed number of SNPs. By partitioning the windows this way, their absolute sizes on the chromosome will vary with SNP density. The start and end positions of each window will be recorded in an output file.
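
The sketch below shows, on made-up SNP positions, how windows with a fixed number of SNPs end up spanning very different lengths of chromosome. It illustrates the idea only and is not the code used by `phyml_sliding_windows.py`; in particular, keeping a final window with fewer SNPs is an assumption of this sketch.

```python
def snp_windows(positions, size):
    """Split sorted SNP positions into consecutive windows of `size` SNPs.

    Returns (start, end, n_snps) for each window. The last window is kept
    even if it has fewer SNPs; the real script may treat it differently.
    """
    windows = []
    for i in range(0, len(positions), size):
        chunk = positions[i:i + size]
        windows.append((chunk[0], chunk[-1], len(chunk)))
    return windows

# Made-up SNP positions: dense at the start of the region, sparse at the end.
positions = [100, 150, 180, 220, 300, 2000, 4500, 9000, 15000, 30000]

for start, end, n in snp_windows(positions, size=5):
    print(start, end, n, "span:", end - start + 1)
```

Both windows carry the same number of SNPs, but the second covers a far longer stretch, so it is more likely to straddle several recombination breakpoints.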
236 |
237 | Finally, there are options for how to run Phyml. Here we just need to set `--optimise n` to turn off maximum-likelihood optimisation, and define the substitution model with `--model HKY85`, as that is the model under which the sequences were simulated.
238 |
239 | * Run the script using a loop that sets a different window size each time
240 |
241 | ```bash
242 | for x in 20 50 100 500
243 | do
244 | echo "Inferring trees with window size $x"
245 |
246 | python genomics_general-0.3/phylo/phyml_sliding_windows.py \
247 | -g twisst-0.2/examples/msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz \
248 | --prefix msms_4of10_l50k_r500_sweep.seqgen.SNP.w$x.phyml_bionj \
249 | --windType sites -w $x --model HKY85 --optimise n
250 |
251 | done
252 | ```
253 |
254 | Each time it ran, the script generated two output files: `.trees.gz` files contain the trees for each window in Newick format as we saw in Part 1. `.data.tsv` files contain the coordinates for each window, as well as the likelihood of the tree (the latter is irrelevant for us in this activity).
255 |
256 | We can now compute the weightings across the chromosome using the trees files as input. This is the same as we did in Part 1 for the simulated trees, but now we are doing it for trees we have inferred from SNPs. So we are hoping to replicate the 'true' weightings we computed in Part 1 as closely as possible. Fingers crossed!
257 |
258 | * Run `twisst` on the trees file generated for each window size, specifying the same groups as we did in Part 1
259 |
260 | ```bash
261 | for x in 20 50 100 500
262 | do
263 | echo "Running Twisst for window size $x"
264 |
265 | python twisst-0.2/twisst.py \
266 | -t msms_4of10_l50k_r500_sweep.seqgen.SNP.w$x.phyml_bionj.trees.gz \
267 | -w msms_4of10_l50k_r500_sweep.seqgen.SNP.w$x.phyml_bionj.weights.tsv \
268 | -g A 1,2,3,4,5,6,7,8,9,10 \
269 | -g B 11,12,13,14,15,16,17,18,19,20 \
270 | -g C 21,22,23,24,25,26,27,28,29,30 \
271 | -g D 31,32,33,34,35,36,37,38,39,40 \
272 | --outgroup D
273 |
274 | done
275 | ```
276 |
277 | #### Plotting inferred weights
278 |
279 | As we did in Part 1, we can now plot the weights for these inferred trees across the chromosome. This can be done in the same R script as before.
280 |
281 | * **Open R again** (if you have restarted R, you may need to reload the `plot_twisst.R` script).
282 |
283 | ```R
284 | source("twisst-0.2/plot_twisst.R")
285 | ```
286 |
287 | * As before we read in the weights and window data files. This time we will load the original ***true*** weights from the simulated genealogies, as well as the four files of ***inferred*** weights that we have just computed.
288 |
289 |
290 | ```R
291 | weights_files <- c('msms_4of10_l50k_r500_sweep.weights.tsv.gz', #true weights file from Part 1
292 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w20.phyml_bionj.weights.tsv',
293 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w50.phyml_bionj.weights.tsv',
294 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w100.phyml_bionj.weights.tsv',
295 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w500.phyml_bionj.weights.tsv')
296 |
297 | window_data_files <- c('twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz',
298 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w20.phyml_bionj.data.tsv',
299 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w50.phyml_bionj.data.tsv',
300 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w100.phyml_bionj.data.tsv',
301 | 'msms_4of10_l50k_r500_sweep.seqgen.SNP.w500.phyml_bionj.data.tsv')
302 | ```
303 |
304 | * Load in the weightings and window data. Note that when given multiple input files, the `import.twisst` function will interpret them as separate chromosomes.
305 |
306 | ```R
307 | twisst_data <- import.twisst(weights_files, window_data_files,
308 | names=c("True weights", "20 SNP windows", "50 SNP windows", "100 SNP windows", "500 SNP windows"))
309 |
310 | ```
311 |
312 | * Now we plot again to compare the true and inferred weightings. (You might need to expand your plot window to display the multiple plots correctly.)
313 |
314 | ```R
315 | pdf("simulated_and_inferred_trees_weights_comparison.pdf", width=8, height=8)
316 | plot.twisst(twisst_data, show_topos=FALSE, include_region_names=T)
317 | dev.off()
318 | ```
319 |
320 | How well did the inference capture the truth?
321 | Which window size is best?
322 | Where is there too little power, and where is there too little resolution?
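
If you would like a number to go with the visual comparison, one crude option is to look up the true weightings at the midpoint of each inferred window and average the absolute differences. The sketch below does this on made-up records; the function names and the choice of metric are assumptions for illustration, not something provided by `twisst`.

```python
def true_weight_at(pos, true_blocks):
    """Return the true weightings of the block that contains position pos."""
    for start, end, weights in true_blocks:
        if start <= pos <= end:
            return weights
    raise ValueError("position %d is not covered by any block" % pos)

def mean_abs_error(inferred_windows, true_blocks):
    """Average absolute difference between inferred and true weightings,
    evaluated at the midpoint of each inferred window."""
    diffs = []
    for start, end, weights in inferred_windows:
        mid = (start + end) // 2
        truth = true_weight_at(mid, true_blocks)
        diffs.extend(abs(w - t) for w, t in zip(weights, truth))
    return sum(diffs) / len(diffs)

# Made-up (start, end, [topo1, topo2, topo3]) records, weights as proportions.
true_blocks = [(1, 500, [1.0, 0.0, 0.0]), (501, 1000, [0.0, 0.0, 1.0])]
inferred_windows = [(1, 400, [1.0, 0.0, 0.0]), (401, 1000, [0.5, 0.0, 0.5])]

print(mean_abs_error(inferred_windows, true_blocks))
```

A single summary value like this hides where along the chromosome the errors occur, so it complements the plot rather than replacing it.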
323 |
324 | In the future, these difficulties could be solved by using tools like Relate and tsinfer to infer the complete ancestral recombination graph, but to date they have not been tested on datasets from multiple species.
325 |
326 | ## Practical Part 2B: Inferring weightings from sequence data
327 |
328 | If you're tired of simulated data, here's a chance to work with real data. Note this part of the practical takes a bit longer, so you might want to skip it if you are low on time.
329 |
330 | We will now run the same steps as above on real data. The input data are phased sequences from [this paper on *Heliconius* butterflies](https://doi.org/10.1371/journal.pbio.2006288). Fig 1 of the paper shows the makeup of the data set, with 9 populations of 10 samples each and two outgroup individuals.
331 |
332 | We are interested in the role of gene flow in shaping the relationships between three species. *H. cydno* and *H. timareta* are sister species that occur on opposite sides of the Andes Mountains. *H. melpomene* occurs on both sides of the Andes, so there is sympatry between *cydno* and *melpomene* to the west and between *timareta* and *melpomene* on the eastern slopes. Hybrids are very rare in the wild, but genomic evidence indicates that gene flow occurs in both areas of sympatry.
333 |
334 | The sampling includes two cydno populations (also called races because they have different colour patterns), two timareta populations and five melpomene populations. There are also two outgroup individuals.
335 |
336 | #### Infer trees for windows
337 |
338 | There are many combinations of populations we could use. Regardless of which combination we use, we need to start by computing trees across the genome for the complete set of samples.
339 |
340 | * In this case we have a phased VCF file for a 1 Mb region of chromosome 18, so our command to infer the trees has two steps. First we convert the VCF into the correct format, and then pipe it to the `phyml_sliding_windows` script to infer the trees.
341 |
342 | ```bash
343 | python genomics_general-0.3/VCF_processing/parseVCF.py -i twisst-0.2/examples/heliconius92.chr18.500001-1500000.phased.vcf.gz |
344 | python genomics_general-0.3/phylo/phyml_sliding_windows.py --threads 2 \
345 | --prefix heliconius92.chr18.500001-1500000.phyml_bionj \
346 | --windType sites -w 50 --model GTR --optimise n
347 | ```
348 | We've set the window size to 50 SNPs, because that seemed a reasonable compromise based on the simulations above. The model is set to GTR which is a somewhat arbitrary choice as we don't know what the most suitable model is here. However, given that these are very closely related taxa with few substitutions (short branches) there is no risk of mutation saturation, so the substitution model probably has very little impact.
349 |
350 | This could take a few minutes (~1300 trees to make with 184 tips each), so it's a good time to get a cup of coffee.
351 |
352 | #### Compute topology weights
353 |
354 | For `twisst`, we usually select 4 or 5 taxa. Any more and the number of possible topologies becomes large, so you would need to have a clear hypothesis about which particular topologies you want to focus on. The number of samples per taxon can be any number, but more than 4 is recommended to avoid noisy output.
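
To see how quickly the number of topologies grows, we can use the standard result that `n` taxa have (2n - 5)!! distinct unrooted, bifurcating topologies. The small function below is just a sketch of that formula; the function name is our own.

```python
def n_unrooted_topologies(n_taxa):
    """Number of distinct unrooted, bifurcating topologies for n_taxa tips:
    (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in range(4, 8):
    print(n, "taxa:", n_unrooted_topologies(n), "topologies")
```

With four taxa there are 3 topologies and with five there are 15, but six taxa already give 105.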
355 |
356 | Here we will focus on two sympatric pairs: *H. cydno chioneus* ('chi') and *H. melpomene rosina* ('ros') from Panama, and *H. timareta thelxinoe* ('txn') and *H. melpomene amaryllis* ('ama') from Peru. Hybridisation occurs in both locations, but you will notice that while the Panama pair have divergent colour patterns, the Peru pair are identical. This is hypothesised to have resulted from adaptive introgression.
357 |
358 | The 1 Mb region on chromosome 18 that we are targeting contains the gene *optix*, which is the controller of the red forewing band shared by the pair in Peru.
359 |
360 | * We will run `twisst` for these two pairs of taxa, but instead of specifying all individuals that belong to each group, we just provide a groups file. You can view the format of the populations file.
361 |
362 | ```bash
363 | head -n 25 twisst-0.2/examples/heliconius92.pop.txt
364 | ```
365 |
366 | * Then run `twisst`, just specifying the group *names* and providing a groups file containing all individual names
367 |
368 | ```bash
369 | python twisst-0.2/twisst.py -t heliconius92.chr18.500001-1500000.phyml_bionj.trees.gz \
370 | -w heliconius92.chr18.500001-1500000.phyml_bionj.weights.tsv \
371 | -g chi -g txn -g ros -g ama --groupsFile twisst-0.2/examples/heliconius92.pop.txt
372 | ```
373 |
374 | #### Plotting the result
375 |
376 | * **Open R again** (if you have restarted R, you may need to reload the `plot_twisst.R` script).
377 |
378 | ```R
379 | source("twisst-0.2/plot_twisst.R")
380 | ```
381 |
382 | ```R
383 | weights_file = "heliconius92.chr18.500001-1500000.phyml_bionj.weights.tsv"
384 | data_file = "heliconius92.chr18.500001-1500000.phyml_bionj.data.tsv"
385 |
386 | twisst_data <- import.twisst(weights_file, data_file)
387 | ```
388 |
389 | * We will smooth the weightings because we are plotting across a fairly large (1 Mb) region. And then plot.
390 |
391 | ```R
392 | twisst_data_smooth <- smooth.twisst(twisst_data, span_bp = 20000)
393 |
394 | pdf("Heliconius_optix_region_weights_smooth.pdf", width=8, height=6)
395 | plot.twisst(twisst_data_smooth, tree_type="unrooted")
396 | dev.off()
397 | ```
398 | Topology 1 (blue) groups the two *melpomene* populations (ros and ama) together and *cydno* (chi) with *timareta* (txn), so this is the expected 'species' topology. We see a clear shift from the species topology to the 'geography' topology (topo 2, green), between position 1 and 1.2 Mb. *optix* is found at around 1 Mb, so this looks like the expected signature of introgression causing recent coalescence between ama and txn. Interestingly, it is confined to the intergenic region near *optix*, so it is consistent with introgression of regulatory variants.
399 |
400 | We used unrooted trees because there is no outgroup in this set of four taxa. This means **we cannot determine the direction of introgression**. To clarify which taxa have shared genes, and in which direction, we need to include an outgroup. Fortunately we have an outgroup in the form of two samples from the more distant species *H. numata* ('num').
401 |
402 | #### Repeating the analysis with an outgroup
403 |
404 | * **Return to the regular terminal** and re-run `twisst` specifying the outgroup.
405 |
406 | ```bash
407 | python twisst-0.2/twisst.py -t heliconius92.chr18.500001-1500000.phyml_bionj.trees.gz \
408 | -w heliconius92.chr18.500001-1500000.phyml_bionj.5pops.weights.tsv \
409 | -g chi -g txn -g ros -g ama -g num --groupsFile twisst-0.2/examples/heliconius92.pop.txt --outgroup num
410 | ```
411 |
412 | Now there will be 15 different possible topologies!
413 |
414 | * **Back in R**, we can plot the new results.
415 |
416 | ```R
417 | weights_file = "heliconius92.chr18.500001-1500000.phyml_bionj.5pops.weights.tsv"
418 | data_file = "heliconius92.chr18.500001-1500000.phyml_bionj.data.tsv"
419 |
420 | twisst_data <- import.twisst(weights_file, data_file)
421 | ```
422 |
423 | ```R
424 | twisst_data_smooth <- smooth.twisst(twisst_data, span_bp = 20000)
425 |
426 | pdf("Heliconius_5pops_optix_region_weights_smooth.pdf", width=8, height=6)
427 | plot.twisst(twisst_data_smooth, ncol_topos=5)
428 | dev.off()
429 | ```
430 |
431 | This shows that two topologies dominate. One is topology 3, which you will notice is the 'species' topology, in which *cydno* (chi) groups with *timareta* (txn) and the two *melpomene* populations (ros and ama) group together. The other is topology 11 (pink), in which txn is found grouped with ama (**nested within the *melpomene* pair**). This tells us that the predominant direction of introgression was from *H. melpomene amaryllis* into *H. timareta thelxinoe*. This fits with our current understanding of this group in which *timareta* expanded down the eastern slopes of the Andes, hybridising with *melpomene* and acquiring the favoured warning pattern alleles in each region.
432 |
433 | ___
434 |
435 | ## Practical Part 3: Topology weighting using Tree Sequence format
436 |
437 | So far, we have analysed tree files that have a distinct tree for each chromosome 'block' with a unique genealogy, or for each window in the case of inferred trees. This format is somewhat wasteful, because adjacent trees are often extremely similar, usually only different by a single recombination event, which moves one branch from one point in the tree to another.
438 |
439 | The Tree Sequence format is efficient because it records not only the connections between nodes on the tree, but the length of the chromosome for which each connection exists. This format is used by [`msprime`](https://msprime.readthedocs.io/en/stable/) and the [`tskit`](https://tskit.readthedocs.io/en/latest/index.html) package, which also provides more information.
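
The idea can be illustrated with a hand-written toy table in plain Python. This is only a sketch of the concept: it does not use the `tskit` API, and the node numbers, positions and the helper `parents_at` are all made up.

```python
# Each edge is (left, right, parent, child): the parent-child link exists
# for chromosome positions left <= x < right. Samples are nodes 0 to 3.
edges = [
    (0, 10000, 4, 0),      # unchanged along the whole region
    (0, 10000, 4, 1),      # unchanged along the whole region
    (0, 6000, 5, 2),       # first genealogy only
    (0, 6000, 5, 3),
    (0, 6000, 6, 4),
    (0, 6000, 6, 5),
    (6000, 10000, 7, 2),   # second genealogy: sample 2 now joins node 4 via node 7
    (6000, 10000, 7, 4),
    (6000, 10000, 6, 7),
    (6000, 10000, 6, 3),
]

def parents_at(position, edges):
    """Return {child: parent} for the genealogy covering this position."""
    return {child: parent
            for left, right, parent, child in edges
            if left <= position < right}

tree1 = parents_at(1000, edges)
tree2 = parents_at(7000, edges)
print(tree1)
print(tree2)
```

Writing the two genealogies out separately would take 6 + 6 = 12 parent-child records, whereas the table needs 10 because the links that never change are stored once. The saving grows with the number of samples and trees.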
440 |
441 | #### Simulating our first tree sequence
442 |
443 | We will use `msprime` to simulate a tree sequence. `msprime` is a coalescent simulator, which means it works by computing the probability that any two individuals share a common ancestor at a given time in the past. In a single population, this is determined by the population size. With multiple populations, this is also affected by the rates of migration between populations, and how long ago they descend from a single ancestral population.
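
As a rough illustration of why population size matters, the sketch below computes the chance that two lineages have found a common ancestor within a given number of generations. It assumes a single diploid Wright-Fisher population, in which two lineages coalesce in any one generation with probability 1/(2Ne); the function name is our own and this is not a call into `msprime`.

```python
def prob_coalesced_by(t, Ne):
    """Probability that two lineages share a common ancestor within the
    last t generations, assuming a per-generation coalescence
    probability of 1/(2*Ne) (diploid Wright-Fisher population)."""
    return 1 - (1 - 1 / (2 * Ne)) ** t

for Ne in (1000, 10000):
    print("Ne =", Ne, "P(coalesced within 1000 generations) =",
          round(prob_coalesced_by(1000, Ne), 3))
```

Lineages in the smaller population coalesce much sooner, which is worth keeping in mind when thinking about how quickly lineages sort after a population split.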
444 |
445 | We will use the Python interactive environment for working with `msprime` and the tree sequence, and also to analyse it using a function from `twisst`.
446 |
447 | * To ensure that we can import the `twisst` module from within Python, add it to our Python path. (This assumes you already downloaded the `twisst` package in Part 1.)
448 |
449 | ```bash
450 | export PYTHONPATH=$PYTHONPATH:twisst-0.2
451 | ```
452 |
453 | * Now **open a Python interactive session** (type 'python'), and also open a script in a text editor, because we are going to modify and rerun some of these lines multiple times to test the effects of different simulation parameters
454 |
455 | * Import the required modules.
456 |
457 | ```python
458 | import msprime
459 | import matplotlib.pyplot as plt
460 | import twisst
461 | ```
462 |
463 | * We will start with a simple simulation of 10 samples from a single population. We have to specify the length and recombination rate, so msprime will give us a sequence of more than one genealogy, separated by recombination. Here we also specify the random seed just to ensure that in this case we all get the same simulation.
464 |
465 | ```python
466 | ts = msprime.simulate(sample_size=10,
467 | Ne=1000,
468 | length=10000,
469 | recombination_rate=5e-8,
470 | random_seed = 1)
471 | ```
472 |
473 | * We can check how many distinct genealogies are in the tree sequence.
474 |
475 | ```python
476 | ts.num_trees
477 | ```
478 |
479 | * And view them using a nice visualisation method provided with tree sequence objects.
480 |
481 | ```python
482 | for tree in ts.trees():
483 |     print("interval = ", tree.interval)
484 |     print(tree.draw(format="unicode"))
485 |
486 | ```
487 | Can you tell what is different between the trees? And what has stayed unchanged?
488 |
489 | * If you would like to explore further, the related tool [`tskit`](https://tskit.readthedocs.io/en/latest/index.html) has many inbuilt functions to analyse tree sequences.
490 |
491 | #### A larger simulation with four populations and gene flow
492 |
493 | * We will now set up a larger simulation with multiple populations. This requires a few different components, which we will define separately. First we define the number of samples and population size of each population.
494 |
495 | ```python
496 | pop_n = 10
497 | pop_Ne = 10000
498 |
499 | population_configurations = [msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne),
500 | msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne),
501 | msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne),
502 | msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne)]
503 | ```
504 |
505 | * Next we set the migration rates between populations. These can be defined in the form of a 4x4 matrix, with each entry giving the rate of migration into one population (rows) from the other (columns). Here we set a moderate level of migration in both directions between the second and third populations, and no migration between the others. The value represents *m*, the proportion of the population made up by migrants each generation.
506 |
507 | ```python
508 | migration_matrix = [[0, 0, 0, 0],
509 | [0, 0, 1e-4, 0],
510 | [0, 1e-4, 0, 0],
511 | [0, 0, 0, 0]]
512 | ```
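
If you plan to vary the migration rate in the exercise at the end, it can be convenient to build the matrix from a short list of connected population pairs. The helper below is a hypothetical convenience function written for this tutorial, not part of `msprime`; it reproduces the matrix defined above.

```python
def symmetric_migration_matrix(n_pops, pairs, m):
    """Build an n_pops x n_pops matrix with rate m in both directions
    for each (i, j) pair of population indices, and 0 elsewhere."""
    matrix = [[0] * n_pops for _ in range(n_pops)]
    for i, j in pairs:
        matrix[i][j] = m
        matrix[j][i] = m
    return matrix

# Migration in both directions between the second and third populations
# (indices 1 and 2, because Python counts from 0).
migration_matrix = symmetric_migration_matrix(4, pairs=[(1, 2)], m=1e-4)
for row in migration_matrix:
    print(row)
```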
513 |
514 | * Finally, we set split times, which, in the coalescent world view, are joins going backwards in time. In `msprime` these are called mass migrations. So the split between the first two populations is modeled as a mass migration of all individuals from the second into the first population (called 1 and 0 respectively, because Python numbers from 0). Further back in time, populations 2 and 3 also mass migrate into population 0. After the first event (backwards in time) we also turn off migration between populations 1 and 2 (because population 1 technically no longer exists).
515 |
516 |
517 | ```python
518 | t_01 = 1000
519 | t_02 = 5000
520 | t_03 = 10000
521 |
522 | demographic_events = [msprime.MassMigration(time=t_01, source=1, destination=0, proportion=1.0), # first merge
523 | msprime.MigrationRateChange(time=t_01, rate=0, matrix_index=(2, 1)), # mig stop after merge
524 | msprime.MigrationRateChange(time=t_01, rate=0, matrix_index=(1, 2)),
525 | msprime.MassMigration(time=t_02, source=2, destination=0, proportion=1.0), #next merge
526 | msprime.MassMigration(time=t_03, source=3, destination=0, proportion=1.0)] #final merge
527 | ```
528 |
529 | * Now we are ready to simulate the tree sequence. We set the length to 50 kb and the recombination rate to 5e-8. `msprime` is extremely fast, so don't blink or you will miss it!
530 |
531 | ```python
532 | ts = msprime.simulate(population_configurations = population_configurations,
533 | migration_matrix = migration_matrix,
534 | demographic_events = demographic_events,
535 | length = 50000,
536 | recombination_rate = 5e-8
537 | )
538 | ```
539 |
540 | * Again we can check the number of trees
541 |
542 | ```python
543 | ts.num_trees
544 | ```
545 | * and, if we dare, we can look at the first tree in the tree sequence:
546 |
547 | ```python
548 | print(ts.first().draw(format="unicode"))
549 | ```
550 |
551 | #### Computing weights from the Tree Sequence
552 |
553 | * We then run a function from `twisst` that computes weightings from a tree sequence. We don't need to specify the groups, as this information has been included in the tree sequence object by `msprime`. But we do still need to tell it to use the final population (number 3, because Python counts from 0) as the outgroup.
554 |
555 | ```python
556 | weightsData = twisst.weightTrees(ts, treeFormat="ts", outgroup = "3", verbose=False)
557 | ```
558 |
559 | * We can get a quick summary of the average weights
560 |
561 | ```python
562 | twisst.summary(weightsData)
563 | ```
564 |
565 | * We can also quickly save a plot of the weights directly (which saves us the time of exporting to a file and plotting a fancy plot in R)
566 |
567 | ```python
568 | #extract mid positions on chromosome from tree sequence file
569 | position = [(tree.interval[0] + tree.interval[1])/2 for tree in ts.trees()]
570 |
571 | #normalise weights by dividing by number of combinations
572 | weights = weightsData["weights"]/10000
573 |
574 | #create a plot with all three topology weights
575 | for i in range(3):
576 |     plt.plot(position, weights[:,i], label='topo'+str(i+1))
577 |
578 | plt.legend()
579 |
580 | #save plot
581 | plt.savefig('sim_ts_weights.pdf')
582 | ```
583 |
584 | **Exercise**: what happens to the weights when we:
585 | Increase or decrease population size?
586 | Make the population split times more or less recent?
587 | Increase or decrease migration rates?
588 |
589 | * Once you've made your predictions, test them out!
590 |
--------------------------------------------------------------------------------