├── ABBA_BABA_whole_genome
    ├── README.md
    ├── data
    │   ├── Hmel2_chrom_lengths.txt
    │   ├── hel92.DP8MP4BIMAC2HET75dist1K.geno.gz
    │   ├── hel92.DP8MP4BIMAC2HET75dist250.geno.gz
    │   ├── hel92.DP8MP4BIMAC2HET75dist500.geno.gz
    │   └── hel92.pop.txt
    └── images
    │   └── map_and_tree.jpg
├── ABBA_BABA_windows
    ├── README.md
    ├── data
    │   ├── chr18.LDhelmet_MLrho.w100.tsv
    │   ├── hel92.DP8HET75MP9BIminVar2.chr18.geno.gz
    │   └── hel92.pop.txt
    └── images
    │   └── map_and_tree.jpg
├── README.md
└── topology_weighting
    └── README.md


/ABBA_BABA_whole_genome/README.md:
--------------------------------------------------------------------------------
  1 | # Tutorial: *ABBA* *BABA* statistics using genome wide SNP data
  2 | 
  3 | ___
  4 | ## Requirements
  5 | * Python 2.7
  6 | * Numpy 1.10+
  7 | * R 3.0+
  8 | 
  9 | ___
 10 | ## Introduction
 11 | 
 12 | ABBA BABA statistics (also called 'D statistics') provide a simple and powerful test for a deviation from a strict bifurcating evolutionary history. They are therefore frequently used to test for introgression using genome-scale SNP data (e.g. from whole-genome sequencing or RADseq).
 13 | 
 14 | In this practical we will perform an ABBA BABA analysis using a **combination of available software and some code written from scratch in R**. We will analyse genomic data from several populations of *Heliconius* butterflies.
 15 | 
 16 | #### Workflow
 17 | Starting with genotype data from multiple individuals, we first **infer allele frequencies** at each SNP. We then **compute the *D* statistic** and use a **block jackknife** method to test for a significant deviation from the null expectation of *D*=0. Finally, we **estimate *f*, the 'admixture proportion'**.
 18 | 
 19 | #### Data
 20 | 
 21 | We will study multiple races from three species: *Heliconius melpomene*, *Heliconius timareta* and *Heliconius cydno*. These species have partially overlapping ranges and they are thought to hybridise where they occur in sympatry. Our sample set includes two pairs of sympatric races of *H. melpomene* and *H. cydno* from Panama and the western slopes of the Andes in Colombia. There are also two pairs of sympatric races of *H. melpomene* and *H. timareta* from the eastern slopes of the Andes in Colombia and Peru. Finally, there are two samples from an outgroup species, *Heliconius numata*, which are necessary for performing the ABBA BABA analyses.
 22 | 
 23 | All samples were sequenced using high-depth **whole-genome sequencing**, and genotypes have been called for each individual for each site in the genome using a standard pipeline. The data has been filtered to retain only **bi-allelic** single nucleotide polymorphisms (SNPs), and these have been further **thinned** to reduce the file size for this tutorial.
 24 | 
 25 | #### Hypotheses
 26 | 
 27 | We hypothesize that hybridisation between species in sympatry will lead to sharing of genetic variation between *H. cydno* and the **sympatric** races of *H. melpomene* from the west, and between *H. timareta* and the corresponding sympatric races of *H. melpomene* from the east of the Andes. There is also another race of *H. melpomene* from French Guiana that is **allopatric** from both *H. timareta* and *H. cydno*, which should not have experienced recent genetic exchange with either species and therefore serves as a control.
 28 | 
 29 | In addition to testing for the presence of introgression, we will test the hypothesis that some parts of the genome experience more introgression than others. Specifically, we know that at least one locus on the Z sex chromosome causes sterility in hybrid females between these species, indicating an incompatibility between the autosomes of one species and the Z chromosome of the other. We therefore might expect reduced introgression on the Z chromosome compared to autosomes.
 30 | 
 31 | ![Species Map](images/map_and_tree.jpg)
 32 | 
 33 | 
 34 | #### A genome wide test for introgression
 35 | 
 36 | In its simplest formulation, the *ABBA* *BABA* test relies on counts of sites in the genome that match the *ABBA* and *BABA* genotype patterns. That is, given three ingroup populations and an outgroup with the relationship (((P1,P2),P3),O), and given a single genome sequence representing each population (i.e. H1, H2 and H3), ***ABBA*** sites are those at which H2 and H3 **share a derived allele ('B')**, while **H1 has the ancestral state ('A')**, as defined by the outgroup sample. Likewise, ***BABA*** represents sites at which **H1 and H3 share the derived allele**.
 37 | 
 38 | Ignoring recurrent mutation, the two SNP patterns can only be produced if some parts of the genome have genealogies that do not follow the 'species tree', but instead group H2 with H3 or H1 with H3. If the populations split fairly recently, such 'discordant' genealogies are expected to occur in some parts of the genome due to variation in lineage sorting. In the absence of any deviation from a strict bifurcating topology, **we expect roughly equal proportions of the genome to show the two discordant genealogies** (((H2,H3),H1),O) and (((H1,H3),H2),O). By counting *ABBA* and *BABA* SNPs across the genome (or a large proportion of it), we are therefore **approximating the proportion of the genome represented by the two discordant genealogies**, which means **we expect a 1:1 ratio of *ABBA* and *BABA* SNPs**. A deviation could come about as a result of gene flow between populations P3 and P2, for example, although it could also indicate other phenomena that break our assumptions, such as ancestral population structure or variable substitution rates.
 39 | 
 40 | To quantify the deviation from the expected ratio, we calculate *D*, which is the difference in the sum of *ABBA* and *BABA* patterns across the genome, divided by their sum:
 41 | 
 42 | *D* = \[sum(*ABBA*) - sum(*BABA*)\] / \[sum(*ABBA*) + sum(*BABA*)\]
 43 | 
 44 | **Therefore, D ranges from -1 to 1, and should equal 0 under the null hypothesis. D > 0 indicates an excess of *ABBA*, and D < 0 indicates an excess of *BABA*.**
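
As a quick numerical illustration of the formula above, the following Python sketch computes *D* from site counts. The function name `d_from_counts` and the counts themselves are hypothetical, chosen only to show the sign and range of *D*; they are not part of the tutorial's scripts or data.

```python
def d_from_counts(abba, baba):
    """Return D = (ABBA - BABA) / (ABBA + BABA) for genome-wide site counts."""
    total = abba + baba
    if total == 0:
        raise ValueError("D is undefined when there are no ABBA or BABA sites")
    return (abba - baba) / float(total)

# Hypothetical counts (not real data)
print(d_from_counts(600, 400))   # excess of ABBA gives a positive D
print(d_from_counts(400, 600))   # excess of BABA gives a negative D
print(d_from_counts(500, 500))   # equal counts give D = 0
```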
 45 | 
 46 | If we have multiple samples from each population, then counting *ABBA* and *BABA* sites is less straightforward. One option is to consider only sites at which all samples from the same population share the same allele, but that will discard a large amount of useful data. A preferable option is to use the allele frequencies at each site to quantify the extent to which the genealogy is skewed toward the *ABBA* or *BABA* pattern. This is effectively equivalent to counting *ABBA* and *BABA* SNPs using all possible sets of four haploid genomes at each site. *ABBA* and *BABA* are therefore no longer binary states, but rather numbers between 0 and 1 that represent the frequency of allele combinations matching each genealogy. They are computed based on the frequency of the derived allele (*p*) and ancestral allele (1-*p*) in each population as follows:
 47 | 
 48 | *ABBA* = (1-*p1*) x *p2* x *p3* x (1-*pO*)
 49 | 
 50 | *BABA* = *p1* x (1-*p2*) x *p3* x (1-*pO*)
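
To see why the frequency-based formulas correspond to counting patterns over all possible sets of four haploid genomes, here is a small Python sketch for a single site. The allele vectors are made up, and both function names are hypothetical. With these inputs the two approaches give the same proportions (0.28125 for *ABBA* and 0.09375 for *BABA*).

```python
from itertools import product

def site_abba_baba(p1, p2, p3, pO=0.0):
    """ABBA and BABA proportions at one site from derived allele frequencies."""
    abba = (1 - p1) * p2 * p3 * (1 - pO)
    baba = p1 * (1 - p2) * p3 * (1 - pO)
    return abba, baba

def count_abba_baba(h1, h2, h3, hO):
    """Proportion of ABBA and BABA patterns over all sets of four haplotypes.

    Each argument lists the alleles sampled from one population, coded
    0 (ancestral) or 1 (derived). Each set takes one haplotype per population.
    """
    abba = baba = n_sets = 0
    for pattern in product(h1, h2, h3, hO):
        n_sets += 1
        if pattern == (0, 1, 1, 0):
            abba += 1
        elif pattern == (1, 0, 1, 0):
            baba += 1
    return abba / float(n_sets), baba / float(n_sets)

# Hypothetical alleles at a single site
h1 = [0, 0, 0, 1]  # p1 = 0.25
h2 = [1, 1, 0, 0]  # p2 = 0.5
h3 = [1, 1, 1, 0]  # p3 = 0.75
hO = [0, 0]        # outgroup fixed for the ancestral allele, pO = 0

print(site_abba_baba(0.25, 0.5, 0.75, 0.0))
print(count_abba_baba(h1, h2, h3, hO))
```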
 51 | 
 52 | ## The Practical
 53 | 
 54 | ### Preparation
 55 | 
 56 | * Open a terminal window and navigate to a folder where you will run the exercise and store all the input and output data files.
 57 | 
 58 | * Now create a subdirectory called 'data' and download the data files needed for this tutorial.
 59 | 
 60 | ```bash
 61 | mkdir data
 62 | 
 63 | cd data
 64 | 
 65 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz
 66 | 
 67 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.pop.txt
 68 | 
 69 | cd ..
 70 | ```
 71 | 
 72 | * Next, download the collection of python scripts required for this tutorial from [GitHub](https://github.com/simonhmartin).
 73 | 
 74 | ```bash
 75 | wget https://github.com/simonhmartin/genomics_general/archive/master.zip
 76 | unzip master.zip
 77 | ```
 78 | 
 79 | ### Genome wide allele frequencies
 80 | 
 81 | To compute these values from population genomic data, we first need to determine the frequency of the derived allele in each population at each polymorphic site in the genome. We will compute these from the *Heliconius* genotype data provided using a python script. The input file has already been filtered to contain only bi-allelic sites. The frequencies script requires that we define populations. These are defined in the file `hel92.pop.txt`.
 82 | 
 83 | ```bash
 84 | python genomics_general-master/freq.py -g data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz \
 85 | -p mel_mel -p mel_ros -p mel_vul -p mel_mal -p mel_ama \
 86 | -p cyd_chi -p cyd_zel -p tim_flo -p tim_txn -p num \
 87 | --popsFile data/hel92.pop.txt --target derived \
 88 | -o data/hel92.DP8MP4BIMAC2HET75dist250.derFreq.tsv.gz
 89 | ```
 90 | By setting `--target derived` we obtain the frequency of the derived allele in each population at each site. This is based on using the final population specified (*H. numata*, `num`) as the outgroup. Sites at which this population is not fixed for the ancestral state are discarded.
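
To make the idea of polarising alleles with the outgroup concrete, here is a minimal Python sketch for a single bi-allelic site. It is not the code in `freq.py`. It assumes diploid genotypes written as strings such as `A/T`, with `N` for missing data, and the function name `derived_freq` is hypothetical.

```python
def derived_freq(pop_genotypes, outgroup_genotypes):
    """Derived allele frequency at one bi-allelic site, or None if unusable.

    Genotypes are strings such as "A/T"; "N" marks a missing allele
    (an assumed coding, used here only for illustration).
    """
    def alleles(genotypes):
        return [a for g in genotypes for a in g.split("/") if a != "N"]

    out_alleles = set(alleles(outgroup_genotypes))
    if len(out_alleles) != 1:
        return None  # outgroup missing or polymorphic: ancestral state unknown
    ancestral = out_alleles.pop()

    pop_alleles = alleles(pop_genotypes)
    if not pop_alleles:
        return None  # no called genotypes in this population
    derived = sum(1 for a in pop_alleles if a != ancestral)
    return derived / float(len(pop_alleles))

# Outgroup fixed for A, so T is derived: 3 of 6 called alleles are T
print(derived_freq(["A/T", "T/T", "A/A", "N/N"], ["A/A", "A/A"]))
# Outgroup is polymorphic, so the site is discarded
print(derived_freq(["A/T", "T/T"], ["A/T", "A/A"]))
```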
 91 | 
 92 | ### Genome wide ABBA BABA analysis
 93 | 
 94 | **(NOTE: here we're working in R, or R Studio if you prefer)**
 95 | 
 96 | To learn how the ABBA BABA test works, we will be writing the code from scratch to do the test. **Start a new R script**. This will make it easy to re-run the whole analysis using different populations.
 97 | 
 98 | #### R functions for ABBA BABA analyses
 99 | 
100 | * First we define a function for computing the ABBA and BABA proportions at each site and use these to compute the D statistic. The input will be the frequency of the derived allele in populations P1, P2 and P3 (i.e. *p1*, *p2* and *p3*). (The frequency of the ancestral allele in the outgroup will be 1 at all sites because we used the outgroup to identify the ancestral allele, so this can be ignored).
101 | 
102 | ```R
103 | D.stat <- function(p1, p2, p3) {
104 |     ABBA <- (1 - p1) * p2 * p3
105 |     BABA <- p1 * (1 - p2) * p3
106 |     (sum(ABBA, na.rm=TRUE) - sum(BABA, na.rm=TRUE)) / (sum(ABBA, na.rm=TRUE) + sum(BABA, na.rm=TRUE))
107 |     }
108 | ```
109 | 
110 | #### The Data
111 | 
112 | * Read in our allele frequency data.
113 | 
114 | ```R
115 | freq_table <- read.table("data/hel92.DP8MP4BIMAC2HET75dist250.derFreq.tsv.gz", header=TRUE, as.is=TRUE)
116 | ```
117 | 
118 | This has created an object called `freq_table` that contains the frequencies for the derived allele at each SNP.
119 | 
120 | We can check the number of sites in this table, and also look at the first few rows to get a feel for the data.
121 | 
122 | ```R
123 | nrow(freq_table)
124 | 
125 | head(freq_table)
126 | ```
127 | 
128 | Note that the first two columns give the name of the scaffold (i.e. the chromosome) and the position on the chromosome of each site. The remaining columns are the allele frequencies for the different subspecies, as indicated in the figure above.
129 | 
130 | #### The *D* statistic
131 | 
132 | Now, to compute D, we need to define populations P1, P2 and P3. We will start with an obvious and **previously published test case**:
133 | We will ask whether there is evidence of introgression between ***H. melpomene rosina* (`mel_ros`)** and ***H. cydno chioneus* (`cyd_chi`)**. These will be **P2** and **P3** respectively. **P1** will be our **allopatric** population, ***H. melpomene melpomene* from French Guiana (`mel_mel`)**.
134 | 
135 | We set these populations and then compute *D* by extracting the derived allele frequencies for all SNPs for the three populations.
136 | 
137 | ```R
138 | P1 <- "mel_mel"
139 | P2 <- "mel_ros"
140 | P3 <- "cyd_chi"
141 | 
142 | D <- D.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3])
143 | 
144 | print(paste("D =", round(D,4)))
145 | ```
146 | 
147 | We get a **strongly positive D statistic** (remember D varies from -1 to 1), indicating an excess of ABBA over BABA. This indicates that ***H. cydno chioneus* from Panama** (`cyd_chi`) **shares more genetic variation with the sympatric *H. melpomene rosina* from Panama** (`mel_ros`) than with the allopatric *H. melpomene melpomene* from French Guiana (`mel_mel`). This is consistent with hybridisation and gene flow between the two species where they occur in sympatry.
148 | 
149 | However, we currently don't know whether this result is statistically robust. In particular, we don't know whether the excess of ABBA is evenly distributed across the genome. If it results from odd ancestry at just one part of the genome, we would have less confidence that there has been significant introgression.
150 | 
151 | To test for a consistent genome-wide signal we use a block-jackknife procedure. 
152 | 
153 | #### Block Jackknife
154 | 
155 | The jackknife procedure allows us to compute the variance of *D* despite non-independence among sites. A more conventional bootstrapping approach, where we would randomly resample sites and recalculate *D*, is not appropriate because **nearby sites in the genome have similar ancestry, making them non-independent observations**.
156 | 
157 | The block jackknife procedure estimates the standard deviation for so-called 'pseudovalues' of the mean genome-wide *D*, where each pseudovalue is computed by excluding a defined block of the genome, taking the difference between the mean genome-wide *D* and *D* computed when the block is omitted.
158 | 
159 | To account for non-independence among linked sites, the block size needs to exceed the distance at which autocorrelation occurs. In our case, we will use a block size of 1 Mb, because we know that linkage disequilibrium decays to background levels at a distance well below 1 Mb.
160 | 
161 | The code to run the jackknife procedure is fairly simple, but we are not going to write it here. Instead, the R functions for this purpose are provided in a separate script, which we can import now.
162 | 
163 | ```R
164 | source("genomics_general-master/jackknife.R")
165 | ```
166 | 
167 | The first step in the process is to define the blocks that will be omitted from the genome in each iteration of the jackknife. The function `get.block.indices` in the jackknife script will do this, and return the 'indices' (i.e. the rows in our frequencies table) corresponding to each block. It requires that we specify the block size along with the chromosome and position for each site to be analysed.
168 | 
169 | ```R
170 | block_indices <- get.block.indices(block_size=1e6,
171 |                                    positions=freq_table$position,
172 |                                    chromosomes=freq_table$scaffold)
173 | 
174 | n_blocks <- length(block_indices)
175 | 
176 | print(paste("Genome divided into", n_blocks, "blocks."))
177 | ```
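
For intuition about what defining blocks involves, here is a rough Python sketch that groups row indices into fixed-size physical blocks within each chromosome. It is an assumption about the general approach, not a port of `get.block.indices`, and the function name and toy coordinates are hypothetical. Note that the indices here are 0-based, whereas R indices are 1-based.

```python
from collections import OrderedDict

def get_block_indices_sketch(block_size, positions, chromosomes):
    """Group row indices into blocks of fixed physical size per chromosome."""
    blocks = OrderedDict()
    for row, (chrom, pos) in enumerate(zip(chromosomes, positions)):
        # Block number along this chromosome (positions are 1-based)
        key = (chrom, (pos - 1) // block_size)
        blocks.setdefault(key, []).append(row)
    return list(blocks.values())

# Hypothetical sites: three on chr1 and two on chr2
chromosomes = ["chr1", "chr1", "chr1", "chr2", "chr2"]
positions = [100, 999999, 1000500, 50, 2500000]

print(get_block_indices_sketch(1000000, positions, chromosomes))
# [[0, 1], [2], [3], [4]]
```

Blocks never span two chromosomes, and blocks that contain no sites are simply absent from the result.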
178 | 
179 | Now we can run the block jackknifing procedure to compute the mean and standard error of *D*. We provide the *D* statistic function (`D.stat`) we created earlier, which will be applied in each iteration. We also provide the frequencies for each site and the block indices that will be used to exclude all sites from a given block.
180 | 
181 | ```R
182 | D_jackknife <- block.jackknife(block_indices=block_indices,
183 |                                FUN=D.stat,
184 |                                freq_table[,P1], freq_table[,P2], freq_table[,P3])
185 | 
186 | print(paste("D jackknife mean =", round(D_jackknife$mean,4)))
187 | ```
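
To make the procedure less of a black box, the following Python sketch shows one common form of the delete-one-block jackknife, in which each pseudovalue is n times the overall estimate minus (n - 1) times the estimate with one block omitted. This is a sketch under that assumption and may differ in detail from `jackknife.R`. The function names and the toy frequencies are hypothetical.

```python
import math

def d_stat(p1, p2, p3):
    """D statistic from derived allele frequencies at a set of sites."""
    abba = sum((1 - a) * b * c for a, b, c in zip(p1, p2, p3))
    baba = sum(a * (1 - b) * c for a, b, c in zip(p1, p2, p3))
    return (abba - baba) / (abba + baba)

def block_jackknife_sketch(block_indices, fun, *columns):
    """Delete-one-block jackknife mean and standard error of fun(*columns)."""
    n = len(block_indices)
    overall = fun(*columns)
    pseudovalues = []
    for block in block_indices:
        omit = set(block)
        # Drop every site in this block from each input column
        kept = [[v for i, v in enumerate(col) if i not in omit]
                for col in columns]
        pseudovalues.append(n * overall - (n - 1) * fun(*kept))
    mean = sum(pseudovalues) / n
    variance = sum((x - mean) ** 2 for x in pseudovalues) / (n - 1)
    return {"mean": mean, "standard_error": math.sqrt(variance / n)}

# Hypothetical frequencies at six sites, split into three blocks of two sites
p1 = [0.0, 0.1, 0.2, 0.0, 0.1, 0.3]
p2 = [0.5, 0.6, 0.4, 0.7, 0.5, 0.6]
p3 = [1.0, 0.9, 0.8, 1.0, 0.7, 0.9]
blocks = [[0, 1], [2, 3], [4, 5]]

print(block_jackknife_sketch(blocks, d_stat, p1, p2, p3))
```

A useful sanity check on this form of the jackknife: if the statistic is a simple mean and every block holds one observation, the pseudovalues are the observations themselves, so the result reduces to the ordinary mean and standard error.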
188 | 
189 | From the unbiased estimate of the mean and standard error of *D*, we can compute the Z score to test whether *D* deviates significantly from zero.
190 | 
191 | ```R
192 | D_Z <- D_jackknife$mean / D_jackknife$standard_error
193 | 
194 | print(paste("D Z score = ", round(D_Z,3)))
195 | ```
196 | 
197 | Usually a Z score greater than 3 or 4 is taken as significant, so the massive Z score in this case means the deviation from zero is hugely significant.
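
To relate these thresholds to more familiar p-values, a Z score can be converted to a two-tailed p-value, assuming the jackknife estimate is approximately normally distributed. The short Python sketch below does this with the standard library; the function name `two_tailed_p` is hypothetical. Under this assumption, Z = 3 corresponds to p of about 0.0027 and Z = 4 to p of about 0.00006.

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a Z score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.96, 3.0, 4.0):
    print("Z = %.2f, p = %.2g" % (z, two_tailed_p(z)))
```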
198 | 
199 | #### Estimating the admixture proportion
200 | 
201 | The *D* statistic provides a powerful test for introgression, but it does not ***quantify* the proportion of the genome that has been shared**. A related method has been developed to estimate *f*, the 'admixture proportion'.
202 | 
203 | The idea behind this approach is that we **compare the observed excess** of *ABBA* over *BABA* sites **to that which would be expected under complete admixture**. To approximate the expectation under complete admixture we re-count ABBA and BABA, but **substituting a second population of the P3 species in the place of P2**. If you lack a second population, you can simply split your P3 samples into two. In this case, we have two populations to represent each species, so if we're using *H. cydno chioneus* (`cyd_chi`) as P3a, we can use *H. cydno zelinde* (`cyd_zel`) as P3b.
204 | 
205 | We need to write our own function to compute *f*. The inputs will be the derived allele frequencies in each population, but now we include both P3a and P3b.
206 | 
207 | 
208 | ```R
209 | f.stat <- function(p1, p2, p3a, p3b) {
210 |     ABBA_numerator <- (1 - p1) * p2 * p3a
211 |     BABA_numerator <- p1 * (1 - p2) * p3a
212 | 
213 |     ABBA_denominator <- (1 - p1) * p3b * p3a
214 |     BABA_denominator <- p1 * (1 - p3b) * p3a
215 | 
216 |     (sum(ABBA_numerator, na.rm=TRUE) - sum(BABA_numerator, na.rm=TRUE)) /
217 |     (sum(ABBA_denominator, na.rm=TRUE) - sum(BABA_denominator, na.rm=TRUE))
218 |     }
219 | ```
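
To build intuition for what *f* measures, the Python sketch below applies the same formula to made-up frequencies in which P2 is constructed as a mixture of 70% P1 and 30% P3b. The function `f_stat` mirrors the R function above, and the frequencies and the 0.3 mixing proportion are hypothetical. Because P2 is an exact mixture here, the estimate recovers the mixing proportion of 0.3 (up to floating point rounding).

```python
def f_stat(p1, p2, p3a, p3b):
    """Admixture proportion f from derived allele frequencies."""
    numerator = sum((1 - a) * b * c - a * (1 - b) * c
                    for a, b, c in zip(p1, p2, p3a))
    denominator = sum((1 - a) * d * c - a * (1 - d) * c
                      for a, c, d in zip(p1, p3a, p3b))
    return numerator / denominator

# Hypothetical derived allele frequencies at four sites
p1 = [0.0, 0.1, 0.2, 0.0]
p3a = [1.0, 0.8, 0.9, 0.7]
p3b = [0.9, 0.9, 1.0, 0.6]

# Build P2 as a mixture: 70% P1 ancestry and 30% P3b ancestry
mix = 0.3
p2 = [(1 - mix) * a + mix * d for a, d in zip(p1, p3b)]

print(round(f_stat(p1, p2, p3a, p3b), 4))
```

The two extremes follow the same logic: if P2 has the same frequencies as P1 the estimate is 0, and if P2 has the same frequencies as P3b it is 1.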
220 | 
221 | We can now choose our P3a and P3b, and estimate *f*.
222 | 
223 | ```R
224 | P3a <- "cyd_chi"
225 | P3b <- "cyd_zel"
226 | 
227 | f <- f.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
228 | 
229 | print(paste("Admixture proportion = ", round(f,4)))
230 | ```
231 | 
232 | This reveals that over 25% of the genome has been shared between *H. melpomene* and *H. cydno* in sympatry. The admixture proportion can be interpreted as the average proportion of foreign ancestry in any single genome. Alternatively, it can be interpreted as the expected frequency of foreign alleles in this population at any given site in the genome.
233 | 
234 | We can again use the block jackknife to estimate the standard deviation of *f*, and obtain a confidence interval. The jackknife block indices are already computed, so we can simply run the jackknife function again, this time specifying the *f* function as the one to run in each iteration.
235 | 
236 | ```R
237 | f_jackknife <- block.jackknife(block_indices=block_indices,
238 |                                FUN=f.stat,
239 |                                freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
240 | 
241 | ```
242 | The 95% confidence interval is the mean +/- ~1.96 standard errors.
243 | 
244 | ```R
245 | f_CI_lower <- f_jackknife$mean - 1.96*f_jackknife$standard_error
246 | f_CI_upper <- f_jackknife$mean + 1.96*f_jackknife$standard_error
247 | 
248 | print(paste("95% confidence interval of f =", round(f_CI_lower,4), round(f_CI_upper,4)))
249 | 
250 | ```
251 | 
252 | ### Chromosomal ABBA BABA analysis
253 | 
254 | #### Do all chromosomes show evidence of introgression?
255 | 
256 | Above, we investigated the extent of introgression across the whole genome. We can perform a similar analysis at the chromosomal level to assess introgression on individual chromosomes, assuming we have a sufficient number of SNPs from each chromosome.
257 | 
258 | The first step to do this is to identify the rows in the frequencies table that correspond to each of the 21 *Heliconius* chromosomes.
259 | 
260 | We first identify all chromosome names present in the dataset using the `unique` function. We then need to identify rows in the table that represent each chromosome. For this we use the `lapply` function, which applies a simple function multiple times to create a combined output in the R `list` format. In this case, we will apply the function using the chromosome names, and the function we apply will simply ask which values in the table `scaffold` column correspond to that chromosome, making use of the R `which` function.
261 | 
262 | 
263 | ```R
264 | chrom_names <- unique(freq_table$scaffold)
265 | chrom_indices <- lapply(chrom_names, function(chrom) which(freq_table$scaffold == chrom))
266 | names(chrom_indices) <- chrom_names
267 | ```
268 | 
269 | This creates a list with 21 elements - one for each chromosome. Each element is a vector of all sites in the table that come from that chromosome. We can check how many SNPs we have per chromosome by applying the `length` function over the list we just created.
270 | 
271 | ```R
272 | sapply(chrom_indices, length)
273 | ```
274 | (`sapply` is like `lapply` except that it simplifies the output if possible, so here it returns a vector, rather than a list of vectors).
275 | 
276 | Now we can use these indices to compute a *D* value for each chromosome. We again use `sapply`, this time applying the `D.stat` function and indexing only the rows in the table from the specific chromosome in each case.
277 | 
278 | ```R
279 | D_by_chrom <- sapply(chrom_names,
280 |                      function(chrom) D.stat(freq_table[chrom_indices[[chrom]], P1],
281 |                                             freq_table[chrom_indices[[chrom]], P2],
282 |                                             freq_table[chrom_indices[[chrom]], P3]))
283 | 
284 | ```
285 | 
286 | We also need to apply the jackknife to determine whether *D* differs significantly from zero for each chromosome. First we will define the blocks to use for each chromosome.
287 | 	
288 | ```R
289 | block_indices_by_chrom <- sapply(chrom_names,
290 |                                  function(chrom) get.block.indices(block_size=1e6,
291 |                                                                    positions=freq_table$position[freq_table$scaffold==chrom]),
292 |                                  simplify=FALSE)
293 | ```
294 | 
295 | This command returns a *list of lists*. This is a list with 21 elements - one for each chromosome. Each of these elements is a list giving the indices for each block within that chromosome.
296 | 
297 | We can check the number of blocks per chromosome, as well as the number of SNPs per block per chromosome.
298 | 
299 | ```R
300 | sapply(block_indices_by_chrom, length)
301 | 
302 | lapply(block_indices_by_chrom, sapply, length)
303 | ```
304 | 
305 | Now we use the jackknife to compute the Z scores for *D* for each chromosome.
306 | 
307 | 
308 | ```R
309 | D_jackknife_by_chrom <- sapply(chrom_names,
310 |                                  function(chrom) block.jackknife(block_indices=block_indices_by_chrom[[chrom]],
311 |                                                                  FUN=D.stat,
312 |                                                                  freq_table[chrom_indices[[chrom]], P1],
313 |                                                                  freq_table[chrom_indices[[chrom]], P2],
314 |                                                                  freq_table[chrom_indices[[chrom]], P3]))
315 | 
316 | D_jackknife_by_chrom <- as.data.frame(t(D_jackknife_by_chrom))
317 | 
318 | D_jackknife_by_chrom$Z <- as.numeric(D_jackknife_by_chrom$mean) / as.numeric(D_jackknife_by_chrom$standard_error)
319 | 
320 | D_jackknife_by_chrom
321 | ```
322 | 
323 | We see that chromosomes 1-20 all show significant evidence for introgression (Z > 4), while chromosome 21, the Z sex chromosome, does not. In fact *D* is negative for chr21, indicating that the allopatric *H. melpomene* population shares more variation with *H. cydno* than the sympatric *H. melpomene* shares with *H. cydno*, although the difference is not significant. This indicates a strong reduction in introgression on the sex chromosome compared to the rest of the genome, consistent with strong selection against introgressed alleles on the sex chromosome. This is what we would expect if there are one or more incompatibilities that cause sterility and involve loci on the Z chromosome.
324 | 
325 | 
326 | ### In your own time
327 | 
328 | We have run the analysis for a single set of three populations, but to fully understand the relationships among these species and subspecies, we might want to run multiple different tests. We can do this by changing the identity of P1, P2 and P3.
329 | 
330 | For example, instead of using the allopatric *H. melpomene melpomene* as P1, we can use *H. melpomene vulcanus* from Colombia (`mel_vul`), which is physically closer and more closely related to *H. melpomene rosina*. Do we still see a significant *D* value and large admixture proportion? If not, why?
331 | 
332 | Another obvious test is whether there is also introgression between *H. timareta* and the local *H. melpomene* populations from the other side of the Andes. What would be appropriate P1, P2 and P3 for that test?
333 | 


--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/Hmel2_chrom_lengths.txt:
--------------------------------------------------------------------------------
 1 | chr1	17206585
 2 | chr2	9045316
 3 | chr3	10541528
 4 | chr4	9662098
 5 | chr5	9908586
 6 | chr6	14054175
 7 | chr7	14308859
 8 | chr8	9320449
 9 | chr9	8708747
10 | chr10	17965481
11 | chr11	11759272
12 | chr12	16327298
13 | chr13	18127314
14 | chr14	9174305
15 | chr15	10235750
16 | chr16	10083215
17 | chr17	14773299
18 | chr18	16803890
19 | chr19	16399344
20 | chr20	14871695
21 | chr21	13359691
22 | 


--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist1K.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist1K.geno.gz


--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz


--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist500.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist500.geno.gz


--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/data/hel92.pop.txt:
--------------------------------------------------------------------------------
 1 | ros.CAM1841	mel_ros
 2 | ros.CAM1880	mel_ros
 3 | ros.CAM2045	mel_ros
 4 | ros.CAM2059	mel_ros
 5 | ros.CAM2519	mel_ros
 6 | ros.CAM2552	mel_ros
 7 | ros.CJ2071	mel_ros
 8 | ros.CJ531	mel_ros
 9 | ros.CJ533	mel_ros
10 | ros.CJ546	mel_ros
11 | vul.CS10	mel_vul
12 | vul.CS3603	mel_vul
13 | vul.CS3605	mel_vul
14 | vul.CS3606	mel_vul
15 | vul.CS3612	mel_vul
16 | vul.CS3614	mel_vul
17 | vul.CS3615	mel_vul
18 | vul.CS3617	mel_vul
19 | vul.CS3618	mel_vul
20 | vul.CS3621	mel_vul
21 | mal.CS1002	mel_mal
22 | mal.CS1011	mel_mal
23 | mal.CS1815	mel_mal
24 | mal.CS21	mel_mal
25 | mal.CS22	mel_mal
26 | mal.CS24	mel_mal
27 | mal.CS586	mel_mal
28 | mal.CS594	mel_mal
29 | mal.CS604	mel_mal
30 | mal.CS615	mel_mal
31 | ama.JM160	mel_ama
32 | ama.JM216	mel_ama
33 | ama.JM293	mel_ama
34 | ama.JM48	mel_ama
35 | ama.MJ11-3188	mel_ama
36 | ama.MJ11-3189	mel_ama
37 | ama.MJ11-3202	mel_ama
38 | ama.MJ12-3217	mel_ama
39 | ama.MJ12-3258	mel_ama
40 | ama.MJ12-3301	mel_ama
41 | melG.CAM1349	mel_mel
42 | melG.CAM1422	mel_mel
43 | melG.CAM2035	mel_mel
44 | melG.CAM8171	mel_mel
45 | melG.CAM8216	mel_mel
46 | melG.CAM8218	mel_mel
47 | melG.CJ13435	mel_mel
48 | melG.CJ9315	mel_mel
49 | melG.CJ9316	mel_mel
50 | melG.CJ9317	mel_mel
51 | chi.CAM25091	cyd_chi
52 | chi.CAM25137	cyd_chi
53 | chi.CAM580	cyd_chi
54 | chi.CAM582	cyd_chi
55 | chi.CAM585	cyd_chi
56 | chi.CAM586	cyd_chi
57 | chi.CJ553	cyd_chi
58 | chi.CJ560	cyd_chi
59 | chi.CJ564	cyd_chi
60 | chi.CJ565	cyd_chi
61 | zel.CS1	cyd_zel
62 | zel.CS1028	cyd_zel
63 | zel.CS1029	cyd_zel
64 | zel.CS1030	cyd_zel
65 | zel.CS1033	cyd_zel
66 | zel.CS1035	cyd_zel
67 | zel.CS2	cyd_zel
68 | zel.CS2262	cyd_zel
69 | zel.CS273	cyd_zel
70 | zel.CS30	cyd_zel
71 | flo.CS12	tim_flo
72 | flo.CS13	tim_flo
73 | flo.CS14	tim_flo
74 | flo.CS15	tim_flo
75 | flo.CS2337	tim_flo
76 | flo.CS2338	tim_flo
77 | flo.CS2341	tim_flo
78 | flo.CS2350	tim_flo
79 | flo.CS2358	tim_flo
80 | flo.CS2359	tim_flo
81 | thxn.JM313	tim_txn
82 | thxn.JM57	tim_txn
83 | thxn.JM84	tim_txn
84 | thxn.JM86	tim_txn
85 | thxn.MJ12-3221	tim_txn
86 | thxn.MJ12-3233	tim_txn
87 | thxn.MJ12-3308	tim_txn
88 | txn.MJ11-3339	tim_txn
89 | txn.MJ11-3340	tim_txn
90 | txn.MJ11-3460	tim_txn
91 | nu_sil.MJ09-4125	num
92 | nu_sil.MJ09-4184	num
93 | 


--------------------------------------------------------------------------------
/ABBA_BABA_whole_genome/images/map_and_tree.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_whole_genome/images/map_and_tree.jpg


--------------------------------------------------------------------------------
/ABBA_BABA_windows/README.md:
--------------------------------------------------------------------------------
  1 | # Tutorial: *ABBA* *BABA* analysis in sliding windows
  2 | ___
  3 | ## Requirements
  4 | * Python 2.7
  5 | * Numpy 1.10+
  6 | * R 3.0+
  7 | 
  8 | ___
  9 | ## Introduction
 10 | 
 11 | ABBA BABA statistics (also called D statistics) provide a simple and powerful test for a deviation from a strict bifurcating evolutionary history. They are therefore frequently used to test for introgression using genome-scale SNP data.
 12 | 
 13 | Although originally developed to be employed for genome-wide tests for introgression, they can also be applied in smaller windows, which can allow exploration of the **genomic landscape of introgression**.
 14 | 
 15 | In this practical we will perform window-based ABBA BABA analysis using **available software** and then **write code in R for plotting the results**. We will analyse genomic data from several populations of *Heliconius* butterflies.
 16 | 
 17 | Note that the details and theory of ABBA BABA statistics are more fully explored in my other [tutorial on whole genome ABBA BABA analyses](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome).
 18 | 
 19 | #### Workflow Overview
 20 | 
 21 | Starting with genotype data from **whole-genome sequencing** of multiple individuals, we will run a script that computes a measure of the admixture proportion in individual windows across each chromosome. We will then make plots to test hypotheses about adaptive introgression.
 22 | 
 23 | #### Data
 24 | 
 25 | We will study multiple races from three species: *Heliconius melpomene*, *Heliconius timareta* and *Heliconius cydno*. These species have partially overlapping ranges and they are thought to hybridise where they occur in sympatry. Our sample set includes two pairs of sympatric races of *H. melpomene* and *H. cydno* from Panama and the western slopes of the Andes in Colombia. There are also two pairs of sympatric races of *H. melpomene* and *H. timareta* from the eastern slopes of the Andes in Colombia and Peru. Finally, there are two samples from an outgroup species, *Heliconius numata*, which are necessary for performing the ABBA BABA analyses.
 26 | 
 27 | All samples were sequenced using high-depth **whole-genome sequencing**, and genotypes have been called for each individual for each site in the genome using a standard pipeline. The data has been filtered to retain only **bi-allelic** single nucleotide polymorphisms (SNPs). This dataset includes SNP data from chromosome 18, which is known to carry a wing patterning locus of particular interest.
 28 | 
 29 | #### Hypotheses
 30 | 
 31 | We hypothesize that hybridisation between species in sympatry will lead to sharing of genetic variation between *H. cydno* and the **sympatric** races of *H. melpomene* from the west, and between *H. timareta* and the corresponding sympatric races of *H. melpomene* from the east of the Andes.
 32 | 
 33 | However, not all parts of the genome are expected to be equally affected. In particular, we suspect that the wing patterning gene *optix* on chromosome 18 has been under strong selection. Differential regulation of *optix* can give rise to different distributions of red pigmentation on the wing, as seen in different subspecies of *H. melpomene*, or the absence of red, as seen in *H. cydno*.
 34 | 
 35 | *Heliconius* wing patterns act as warnings to predators that they are toxic. Some species participate in Müllerian mimicry, whereby toxic species have evolved to resemble one another, which helps to reinforce predator learning. Mimicry can either be achieved through independent convergence on the same wing patterns, or through exchange of wing patterning alleles through **adaptive introgression**. We therefore predict that co-mimetic populations of different species might show an excess signal of introgression in the vicinity of *optix*.
 36 | 
 37 | We have an entirely different expectation for populations with different wing patterns. If predators in a given area recognise the most common local pattern as toxic, it will be costly to have a foreign wing pattern. Likewise, any hybrid individual that has an intermediate wing pattern will also be at risk of higher predation. We therefore predict that between populations with different wing patterns there should be a reduction in the extent of introgression in the vicinity of *optix*.
 38 | 
 39 | 
 40 | ![Species Map](images/map_and_tree.jpg)
 41 | 
 42 | 
 43 | #### Quantifying admixture across the genome
 44 | 
 45 | A detailed explanation of the *ABBA BABA* test is given in my other [tutorial on whole genome ABBA BABA analyses](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome).
 46 | 
 47 | Briefly, the test uses three populations and an outgroup with the relationship (((P1,P2),P3),O), and investigates whether there is an excess of shared variation between P2 and P3 (compared to that shared between P1 and P3).
 48 | 
 49 | This excess can be expressed in terms of the *D statistic*, which ranges from -1 to 1, and should equal 0 under the null hypothesis of no introgression. *D* > 0 indicates possible introgression between P3 and P2 (or other factors that would result in a deviation from a strict bifurcating species history).
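
As a rough illustration of how *D* behaves, the sketch below computes it from derived allele frequencies at a handful of SNPs. It assumes the outgroup carries only the ancestral allele, so that the ABBA and BABA pattern frequencies at a site are (1 - p1) * p2 * p3 and p1 * (1 - p2) * p3. The function name `d_statistic` and the example frequencies are made up for illustration and are not part of the tutorial scripts.

```python
def d_statistic(p1, p2, p3):
    """Return D from per-site derived allele frequencies in P1, P2 and P3.

    Assumes the outgroup carries only the ancestral allele at every site.
    """
    abba = sum((1 - a) * b * c for a, b, c in zip(p1, p2, p3))
    baba = sum(a * (1 - b) * c for a, b, c in zip(p1, p2, p3))
    return (abba - baba) / (abba + baba)


# Hypothetical derived allele frequencies at three SNPs (illustration only)
p1 = [0.0, 0.1, 0.0]
p2 = [0.8, 0.6, 0.0]
p3 = [1.0, 0.9, 0.5]

D = d_statistic(p1, p2, p3)
print(round(D, 3))
```

Swapping the roles of P1 and P2 in this sketch flips the sign of *D*, which is why the sign tells us which of the two populations shares the excess variation with P3.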
 50 | 
 51 | This test was designed to be used at the whole-genome scale. The *D* statistic is not well suited for comparing admixture levels across the genome, because its absolute value depends on factors such as the effective population size, which can vary across the genome.
 52 | 
 53 | The *f* estimator described in the [other tutorial](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome) is better, because it by definition reflects the admixture proportion, but it is highly sensitive to stochastic error at the small scale. A statistic called *f<sub>d</sub>* was therefore developed for this purpose that is more robust to the error introduced by using small numbers of SNPs ([Martin et al. 2015](https://doi.org/10.1093/molbev/msu269)). While the conventional *f* estimator assumes that P3 is the donor population and P2 the recipient, *f<sub>d</sub>* infers the donor on a site-by-site basis.
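
The site-by-site choice of donor can be sketched in a few lines. The version below follows the definition in Martin et al. (2015) as we understand it: the numerator is the observed excess of ABBA over BABA, and the denominator is the excess expected if both P2 and P3 were replaced by a donor population (PD), taken at each site to be whichever of P2 and P3 has the higher derived allele frequency. It again assumes an outgroup fixed for the ancestral allele. The function name and the example frequencies are hypothetical, and this is not the code used by `ABBABABAwindows.py`.

```python
def fd_statistic(p1, p2, p3):
    """Return f_d from per-site derived allele frequencies (outgroup ancestral).

    No guard is included for a zero denominator, which can occur if no site
    is informative.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c in zip(p1, p2, p3):
        # Observed excess of ABBA over BABA at this site
        numerator += (1 - a) * b * c - a * (1 - b) * c
        # Donor frequency: the higher of the P2 and P3 frequencies at this site
        d = max(b, c)
        # Excess expected if P2 and P3 both had the donor frequency
        denominator += (1 - a) * d * d - a * (1 - d) * d
    return numerator / denominator


# Hypothetical derived allele frequencies at three SNPs (illustration only)
p1 = [0.0, 0.1, 0.0]
p2 = [0.8, 0.6, 0.0]
p3 = [1.0, 0.9, 0.5]

fd = fd_statistic(p1, p2, p3)
print(round(fd, 3))
```

When P2 and P3 have identical frequencies at every site and the derived allele is absent from P1, this sketch returns 1, matching the intuition that *f<sub>d</sub>* behaves like a proportion.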
 54 | 
 55 | #### Selecting populations
 56 | 
 57 | The interpretation of these statistics is strongly dependent on the populations selected. Firstly, the test is most sensitive to introgression from **P3 into P2**, rather than the other way around.
 58 | 
 59 | Secondly, *f<sub>d</sub>* should be interpreted as a **quantification of excess shared variation between P3 and P2** that is **not also shared with P1**. If there is ongoing gene flow between P1 and P2, then any introgression from P3 to P2 will be underestimated.
 60 | 
 61 | Finally, we are only able to quantify introgression that occurred **more recently than the split between P1 and P2**.
 62 | 
 63 | Therefore, if we want to quantify the **maximum amount of detectable introgression** that has occurred across the genome, we should choose a P1 that is **allopatric and not too closely related to P2**.
 64 | 
 65 | However, we can also use this feature of the test to our advantage. If we select a P1 that shares ongoing gene flow with P2, then the test will instead reveal parts of the genome at which **P2 and P3 share variation that is not shared by P1**. This can be useful for identifying wing patterning alleles, as these are often the only genomic regions at which subspecies remain distinct in the face of gene flow.
 66 | 
 67 | ## Practical
 68 | 
 69 | ### Preparation
 70 | 
 71 | * Open a terminal window and navigate to a folder where you will run the exercise and store all the input and output data files.
 72 | 
 73 | * Now create a subdirectory called 'data' and download the data files needed for this tutorial.
 74 | 
 75 | ```bash
 76 | mkdir data
 77 | 
 78 | cd data
 79 | 
 80 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz
 81 | 
 82 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/hel92.pop.txt
 83 | 
 84 | wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/chr18.LDhelmet_MLrho.w100.tsv
 85 | 
 86 | cd ..
 87 | ```
 88 | 
 89 | * Next, download the collection of Python scripts required for this tutorial from [GitHub](https://github.com/simonhmartin).
 90 | 
 91 | ```bash
 92 | wget https://github.com/simonhmartin/genomics_general/archive/master.zip
 93 | unzip master.zip
 94 | ```
 95 | 
 96 | ### Sliding window analysis
 97 | 
 98 | * Run the analysis Python script for two separate cases. In both, P1 is the allopatric *H. melpomene melpomene* (`mel_mel`). P2 and P3 are the two populations we expect to be sharing genes. In the first case we are quantifying introgression between *H. melpomene rosina* (`mel_ros`) and *H. cydno chioneus* (`cyd_chi`), both from Panama. In the second we are quantifying introgression between *H. melpomene amaryllis* (`mel_ama`) and *H. timareta thelxinoe* (`tim_txn`), both from Peru.
 99 | 
100 | ```bash
101 | python genomics_general-master/ABBABABAwindows.py \
102 | -g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
103 | -o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz \
104 | -P1 mel_mel -P2 mel_ros -P3 cyd_chi -O num \
105 | --popsFile data/hel92.pop.txt -w 25000 -m 250 -T 2
106 | 
107 | python genomics_general-master/ABBABABAwindows.py \
108 | -g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
109 | -o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz \
110 | -P1 mel_mel -P2 mel_ama -P3 tim_txn -O num \
111 | --popsFile data/hel92.pop.txt -w 25000 -m 250 -T 2
112 | ```
113 | 
114 | We provide the script with an input file containing genotype data (`-g`), an output file (`-o`), the ingroup populations and outgroup (`-P1`, `-P2`, `-P3` and `-O`), and a file specifying which population each sample is in (`--popsFile`).
115 | 
116 | We also give parameters for the windows. These are "coordinate" windows, which means each window is the same length relative to the reference genome, but the number of SNPs per window can vary. The window size (`-w`) will be 25,000 bp. Windows will be required to contain a minimum (`-m`) of 250 SNPs to be considered valid.
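
To make the idea of coordinate windows with a minimum SNP count concrete, here is a minimal sketch. It assumes windows start at position 1 and do not overlap (the same convention as the start and end columns in the recombination rate file). The real script may place window boundaries or count sites differently. The function name `valid_windows` and the SNP positions are hypothetical.

```python
from collections import Counter


def valid_windows(positions, window_size, min_sites):
    """Return (start, end, n_snps) for coordinate windows with enough SNPs."""
    # Window index 0 covers positions 1..window_size, index 1 the next block, etc.
    counts = Counter((pos - 1) // window_size for pos in positions)
    windows = []
    for index in sorted(counts):
        # Discard windows that contain too few SNPs
        if counts[index] >= min_sites:
            start = index * window_size + 1
            windows.append((start, start + window_size - 1, counts[index]))
    return windows


# Hypothetical SNP positions (illustration only)
snps = [10, 250, 900, 1200, 1999, 2500]
kept = valid_windows(snps, window_size=1000, min_sites=2)
print(kept)
```

Here the third window (2001 to 3000) holds a single SNP and is dropped, so gaps can appear in the output wherever SNP density is low.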
117 | 
118 | Finally, we tell the script to use two threads (`-T`). If you have a multi-core machine, you can increase this value and the script will run faster.
119 | 
120 | #### Plotting window statistics
121 | 
122 | * Open R and, if necessary, set the working directory to the tutorial directory. You can do this with the `setwd()` command, or in RStudio using the menus.
123 | 
124 | We need to load each file of window statistics into R. We will make a list containing both datasets. 
125 | 
126 | * First input the names of the input files
127 | 
128 | ```R
129 | AB_files <- c("data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz",
130 |                 "data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz")
131 | 
132 | AB_tables = lapply(AB_files, read.csv)
133 | 
134 | head(AB_tables[[1]])
135 | ```
136 | 
137 | *f<sub>d</sub>* is meaningless when *D* is negative, as it is designed to quantify the excess of ABBA over BABA only when an excess exists.
138 | 
139 | * We therefore convert all *f<sub>d</sub>* values to 0 in windows where *D* is negative. 
140 | 
141 | ```R
142 | for (x in 1:length(AB_tables)){
143 |     AB_tables[[x]]$fd = ifelse(AB_tables[[x]]$D < 0, 0, AB_tables[[x]]$fd)
144 | }
145 | ```
146 | 
147 | * We can then plot *f<sub>d</sub>* across the chromosome for the two cases we have analysed.
148 | 
149 | ```R
150 | par(mfrow=c(length(AB_tables), 1), mar = c(4,4,1,1))
151 | 
152 | for (x in 1:length(AB_tables)){
153 |     plot(AB_tables[[x]]$mid, AB_tables[[x]]$fd,
154 |          type = "l", xlim=c(0,17e6), ylim=c(0,1), ylab="Admixture Proportion", xlab="Position")
155 |     rect(1000000,0,1250000,1, col = rgb(0,0,0,0.2), border=NA)
156 | }
157 | ```
158 | 
159 | This reveals considerable heterogeneity in the extent of introgression across the chromosome. If we consider the region around *optix*, we see evidence for reduced introgression between *H. melpomene rosina* and *H. cydno chioneus*, as we predicted. By contrast, we see evidence for elevated introgression between *H. melpomene amaryllis* and *H. timareta thelxinoe*, which suggests that their shared wing patterns might result from adaptive introgression. Given this evidence, a sensible next step would be to make a phylogeny for the region around *optix* to test whether the *H. timareta* allele appears to be 'nested' within the *H. melpomene* clade. Previous papers have confirmed that this is the case ([Pardo-Diaz et al. 2012](https://doi.org/10.1371/journal.pgen.1002752), [Wallbank et al. 2016](https://doi.org/10.1371/journal.pbio.1002353)).
160 | 
161 | 
162 | #### In your own time
163 | What happens when we change the identity of P1, P2 and P3? What happens if we change the window size?
164 | 
165 | 
166 | ### Association between introgression and recombination rate
167 | 
168 | Theory predicts that if there are many "barrier loci", at which introgression is selected against, we should see a global trend of reduced introgression in low recombination regions, due to stronger linkage between neutral introgressed alleles and deleterious ones.
169 | 
170 | We can test this hypothesis by examining the relationship between recombination rate and *f<sub>d</sub>* across our chromosome.
171 | 
172 | We have a previously-generated data file (provided) giving the estimated population recombination rate in 100 kb windows across this chromosome.
173 | 
174 | * Open a terminal window and navigate to the tutorial folder.
175 | 
176 | * Now we will make a matching dataset with *f<sub>d</sub>* for 100 kb windows, here just using the species pair showing the highest rate of introgression: *H. melpomene rosina* and *H. cydno chioneus* from Panama.
177 | 
178 | ```bash
179 | python genomics_general-master/ABBABABAwindows.py \
180 | -g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
181 | -o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w100m1.csv.gz \
182 | -P1 mel_mel -P2 mel_ros -P3 cyd_chi -O num \
183 | --popsFile data/hel92.pop.txt -w 100000 -m 1000 -T 2
184 | ```
185 | 
186 | * Now, **back in R**, read in this new data file.
187 | 
188 | ```R
189 | AB_table_w100 <- read.csv("data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w100m1.csv.gz")
190 | ```
191 | 
192 | * As before, we convert any *f<sub>d</sub>* values for windows with negative *D* to 0.
193 | 
194 | ```R
195 | AB_table_w100$fd = ifelse(AB_table_w100$D < 0, 0, AB_table_w100$fd)
196 | ```
197 | 
198 | Now we read in the table of recombination rates for 100 kb windows. Here the column `ML_rho` gives the maximum likelihood estimate of the population recombination rate for each window.
199 | 
200 | ```R
201 | rec_table <- read.table("data/chr18.LDhelmet_MLrho.w100.tsv", header=T)
202 | head(rec_table)
203 | ```
204 | 
205 | * Due to the noisy nature of the data, we want to compare *f<sub>d</sub>* values among bins of recombination rate. We will use the `cut` function to separate the windows into three bins with low, medium and high recombination rates.
206 | 
207 | ```R
208 | rec_bin <- cut(rec_table$ML_rho, 3)
209 | ```
210 | 
211 | * Finally, we can make boxplots to compare the inferred level of admixture (*f<sub>d</sub>*) between these bins.
212 | 
213 | ```R
214 | boxplot(AB_table_w100$fd ~ rec_bin)
215 | ```
216 | 
217 | This shows that the level of introgression increases with recombination rate, consistent with a model in which a large number of barrier loci select against introgression genome-wide.
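
Note that the boxplot pairs the two tables row by row, so it relies on both tables listing the same windows in the same order. If any window were missing from one table, the rows would be misaligned. The sketch below shows one cautious alternative: matching windows by start coordinate before binning. It splits the range of recombination rates into equal-width bins, in the spirit of `cut(x, 3)` but not identical to it, and reports the median *f<sub>d</sub>* per bin. The function name, the dictionaries and all values are hypothetical.

```python
from statistics import median


def median_fd_by_rho_bin(fd_by_start, rho_by_start, n_bins=3):
    """Median f_d in equal-width recombination-rate bins.

    Both arguments are dicts keyed by window start coordinate.
    Assumes the recombination rates are not all identical.
    """
    # Keep only windows present in both tables
    shared = sorted(set(fd_by_start) & set(rho_by_start))
    rhos = [rho_by_start[s] for s in shared]
    lowest, highest = min(rhos), max(rhos)
    width = (highest - lowest) / n_bins
    binned = [[] for _ in range(n_bins)]
    for s in shared:
        # The window with the highest rate is placed in the last bin
        index = min(int((rho_by_start[s] - lowest) / width), n_bins - 1)
        binned[index].append(fd_by_start[s])
    return [median(values) if values else None for values in binned]


# Hypothetical windows: one is missing from each table (illustration only)
fd_table = {1: 0.05, 100001: 0.10, 200001: 0.20, 300001: 0.30, 400001: 0.15}
rho_table = {1: 0.01, 100001: 0.02, 200001: 0.15, 300001: 0.30, 500001: 0.25}

medians = median_fd_by_rho_bin(fd_table, rho_table)
print(medians)
```

Windows 400001 and 500001 each appear in only one table, so they are left out rather than being paired with the wrong partner.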
218 | 
219 | 
220 | 


--------------------------------------------------------------------------------
/ABBA_BABA_windows/data/chr18.LDhelmet_MLrho.w100.tsv:
--------------------------------------------------------------------------------
  1 | scaffold	start	end	mid	sites	ML_rho
  2 | chr18	1	100000	46560	3995	0.00600000
  3 | chr18	100001	200000	144250	3257	0.00400000
  4 | chr18	200001	300000	255763	3156	0.05100000
  5 | chr18	300001	400000	358576	3262	0.02900000
  6 | chr18	400001	500000	444648	4045	0.00600000
  7 | chr18	500001	600000	548369	4852	0.01400000
  8 | chr18	600001	700000	653655	6525	0.02900000
  9 | chr18	700001	800000	751285	5564	0.01600000
 10 | chr18	800001	900000	849423	6993	0.05200000
 11 | chr18	900001	1000000	944873	5912	0.06700000
 12 | chr18	1000001	1100000	1051066	5959	0.03400000
 13 | chr18	1100001	1200000	1143799	6078	0.06200000
 14 | chr18	1200001	1300000	1250220	6453	0.04700000
 15 | chr18	1300001	1400000	1350658	8323	0.08100000
 16 | chr18	1400001	1500000	1449581	8837	0.07500000
 17 | chr18	1500001	1600000	1548397	5832	0.06400000
 18 | chr18	1600001	1700000	1647425	5546	0.13800000
 19 | chr18	1700001	1800000	1748664	6367	0.14100000
 20 | chr18	1800001	1900000	1852219	6150	0.15100000
 21 | chr18	1900001	2000000	1946230	5628	0.12300000
 22 | chr18	2000001	2100000	2048617	5852	0.16200000
 23 | chr18	2100001	2200000	2151606	5521	0.16800000
 24 | chr18	2200001	2300000	2250247	5126	0.20100000
 25 | chr18	2300001	2400000	2347408	5732	0.08100000
 26 | chr18	2400001	2500000	2449981	4496	0.16700000
 27 | chr18	2500001	2600000	2551645	5357	0.13400000
 28 | chr18	2600001	2700000	2652989	4707	0.16000000
 29 | chr18	2700001	2800000	2748306	4979	0.07800000
 30 | chr18	2800001	2900000	2848101	4506	0.14100000
 31 | chr18	2900001	3000000	2951650	5144	0.23900000
 32 | chr18	3000001	3100000	3050027	4998	0.14100000
 33 | chr18	3100001	3200000	3147735	4583	0.13400000
 34 | chr18	3200001	3300000	3251521	5845	0.08000000
 35 | chr18	3300001	3400000	3349636	6943	0.12000000
 36 | chr18	3400001	3500000	3447132	6376	0.11700000
 37 | chr18	3500001	3600000	3548374	5483	0.20100000
 38 | chr18	3600001	3700000	3648338	5012	0.20100000
 39 | chr18	3700001	3800000	3747793	5107	0.15600000
 40 | chr18	3800001	3900000	3849769	4863	0.20100000
 41 | chr18	3900001	4000000	3952193	4482	0.18000000
 42 | chr18	4000001	4100000	4053021	2451	0.04400000
 43 | chr18	4100001	4200000	4154181	4294	0.07600000
 44 | chr18	4200001	4300000	4253246	5035	0.06300000
 45 | chr18	4300001	4400000	4348341	4838	0.02300000
 46 | chr18	4400001	4500000	4450948	5813	0.03000000
 47 | chr18	4500001	4600000	4544610	5127	0.08500000
 48 | chr18	4600001	4700000	4652599	5005	0.06300000
 49 | chr18	4700001	4800000	4750518	6236	0.07600000
 50 | chr18	4800001	4900000	4849846	5230	0.20100000
 51 | chr18	4900001	5000000	4952740	5224	0.12000000
 52 | chr18	5000001	5100000	5050030	5814	0.05100000
 53 | chr18	5100001	5200000	5147719	4597	0.01700000
 54 | chr18	5200001	5300000	5250663	5551	0.15700000
 55 | chr18	5300001	5400000	5349198	6546	0.15100000
 56 | chr18	5400001	5500000	5450772	6035	0.04300000
 57 | chr18	5500001	5600000	5551035	2662	0.09000000
 58 | chr18	5600001	5700000	5651676	4164	0.20200000
 59 | chr18	5700001	5800000	5761072	4324	0.18200000
 60 | chr18	5800001	5900000	5847715	4374	0.15600000
 61 | chr18	5900001	6000000	5953661	4351	0.07700000
 62 | chr18	6000001	6100000	6051735	5599	0.06700000
 63 | chr18	6100001	6200000	6153737	6015	0.13400000
 64 | chr18	6200001	6300000	6247837	6055	0.20100000
 65 | chr18	6300001	6400000	6348694	5578	0.31500000
 66 | chr18	6400001	6500000	6456776	3171	0.25200000
 67 | chr18	6500001	6600000	6547177	4649	0.30100000
 68 | chr18	6600001	6700000	6649288	5526	0.30100000
 69 | chr18	6700001	6800000	6749924	5575	0.27200000
 70 | chr18	6800001	6900000	6851599	5661	0.20100000
 71 | chr18	6900001	7000000	6952150	5431	0.23400000
 72 | chr18	7000001	7100000	7051904	5203	0.20100000
 73 | chr18	7100001	7200000	7145221	5600	0.12800000
 74 | chr18	7200001	7300000	7248921	5289	0.05000000
 75 | chr18	7300001	7400000	7350713	5226	0.18600000
 76 | chr18	7400001	7500000	7454737	5105	0.30100000
 77 | chr18	7500001	7600000	7547495	5554	0.36700000
 78 | chr18	7600001	7700000	7649159	5050	0.33500000
 79 | chr18	7700001	7800000	7748370	5679	0.30100000
 80 | chr18	7800001	7900000	7851145	5985	0.23400000
 81 | chr18	7900001	8000000	7947310	4476	0.20100000
 82 | chr18	8000001	8100000	8052732	4752	0.03600000
 83 | chr18	8100001	8200000	8152608	5313	0.12500000
 84 | chr18	8200001	8300000	8244641	4568	0.26100000
 85 | chr18	8300001	8400000	8349617	4850	0.21300000
 86 | chr18	8400001	8500000	8450592	5272	0.11000000
 87 | chr18	8500001	8600000	8551057	5630	0.30100000
 88 | chr18	8600001	8700000	8651023	5028	0.16100000
 89 | chr18	8700001	8800000	8749848	6192	0.16800000
 90 | chr18	8800001	8900000	8849263	4830	0.06700000
 91 | chr18	8900001	9000000	8946341	5099	0.15700000
 92 | chr18	9000001	9100000	9051165	5394	0.10000000
 93 | chr18	9100001	9200000	9149324	5653	0.13400000
 94 | chr18	9200001	9300000	9249230	6163	0.20100000
 95 | chr18	9300001	9400000	9351994	5548	0.26700000
 96 | chr18	9400001	9500000	9449981	5779	0.15100000
 97 | chr18	9500001	9600000	9550248	6364	0.08100000
 98 | chr18	9600001	9700000	9647977	6011	0.15100000
 99 | chr18	9700001	9800000	9752049	6196	0.15100000
100 | chr18	9800001	9900000	9850559	6414	0.15100000
101 | chr18	9900001	10000000	9945646	5700	0.15100000
102 | chr18	10000001	10100000	10056361	5837	0.10100000
103 | chr18	10100001	10200000	10144980	5740	0.15100000
104 | chr18	10200001	10300000	10246525	4999	0.16800000
105 | chr18	10300001	10400000	10352072	5543	0.05600000
106 | chr18	10400001	10500000	10450213	6629	0.18000000
107 | chr18	10500001	10600000	10542757	6032	0.12500000
108 | chr18	10600001	10700000	10648011	5838	0.15100000
109 | chr18	10700001	10800000	10751097	5359	0.14400000
110 | chr18	10800001	10900000	10844899	5997	0.13400000
111 | chr18	10900001	11000000	10949680	5541	0.13400000
112 | chr18	11000001	11100000	11053017	5827	0.10100000
113 | chr18	11100001	11200000	11151930	6987	0.06000000
114 | chr18	11200001	11300000	11251407	6889	0.17200000
115 | chr18	11300001	11400000	11352661	6430	0.10100000
116 | chr18	11400001	11500000	11448891	5804	0.14400000
117 | chr18	11500001	11600000	11547984	5215	0.20100000
118 | chr18	11600001	11700000	11645322	5412	0.15100000
119 | chr18	11700001	11800000	11749946	4870	0.16700000
120 | chr18	11800001	11900000	11855630	4995	0.26700000
121 | chr18	11900001	12000000	11950962	6040	0.21700000
122 | chr18	12000001	12100000	12051842	5975	0.22200000
123 | chr18	12100001	12200000	12150093	6522	0.20100000
124 | chr18	12200001	12300000	12253119	5882	0.24300000
125 | chr18	12300001	12400000	12351250	4661	0.16700000
126 | chr18	12400001	12500000	12450574	6752	0.10100000
127 | chr18	12500001	12600000	12552698	5395	0.05000000
128 | chr18	12600001	12700000	12655794	5209	0.26100000
129 | chr18	12700001	12800000	12747713	4871	0.15100000
130 | chr18	12800001	12900000	12850906	5601	0.06000000
131 | chr18	12900001	13000000	12950921	5881	0.30100000
132 | chr18	13000001	13100000	13050722	7243	0.26700000
133 | chr18	13100001	13200000	13146589	6063	0.20100000
134 | chr18	13200001	13300000	13249018	5693	0.17600000
135 | chr18	13300001	13400000	13355040	5507	0.10000000
136 | chr18	13400001	13500000	13446283	6415	0.25000000
137 | chr18	13500001	13600000	13549558	6076	0.22000000
138 | chr18	13600001	13700000	13650677	5519	0.21500000
139 | chr18	13700001	13800000	13752268	5997	0.10000000
140 | chr18	13800001	13900000	13850881	5248	0.20100000
141 | chr18	13900001	14000000	13952623	6192	0.18000000
142 | chr18	14000001	14100000	14049677	6377	0.15100000
143 | chr18	14100001	14200000	14152226	5848	0.20100000
144 | chr18	14200001	14300000	14247959	6053	0.11700000
145 | chr18	14300001	14400000	14349881	5654	0.17200000
146 | chr18	14400001	14500000	14448018	6354	0.20100000
147 | chr18	14500001	14600000	14547166	5049	0.15100000
148 | chr18	14600001	14700000	14648030	4807	0.08500000
149 | chr18	14700001	14800000	14751704	4824	0.12000000
150 | chr18	14800001	14900000	14849132	4999	0.26300000
151 | chr18	14900001	15000000	14950486	4654	0.12500000
152 | chr18	15000001	15100000	15050224	5086	0.30100000
153 | chr18	15100001	15200000	15152602	4936	0.25000000
154 | chr18	15200001	15300000	15251936	5903	0.14100000
155 | chr18	15300001	15400000	15352186	5320	0.10100000
156 | chr18	15400001	15500000	15446177	5030	0.02800000
157 | chr18	15500001	15600000	15555223	5334	0.02300000
158 | chr18	15600001	15700000	15650346	5609	0.07200000
159 | chr18	15700001	15800000	15746355	6637	0.11700000
160 | chr18	15800001	15900000	15851037	6635	0.15100000
161 | chr18	15900001	16000000	15946963	6594	0.11100000
162 | chr18	16000001	16100000	16050347	5424	0.03400000
163 | chr18	16100001	16200000	16151881	6213	0.06000000
164 | chr18	16200001	16300000	16250956	6313	0.03000000
165 | chr18	16300001	16400000	16349997	6997	0.03400000
166 | chr18	16400001	16500000	16443632	5562	0.01700000
167 | chr18	16500001	16600000	16553622	4858	0.03400000
168 | chr18	16600001	16700000	16648639	5961	0.03400000
169 | chr18	16700001	16800000	16749646	3459	0.02500000
170 | 


--------------------------------------------------------------------------------
/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz


--------------------------------------------------------------------------------
/ABBA_BABA_windows/data/hel92.pop.txt:
--------------------------------------------------------------------------------
 1 | ros.CAM1841	mel_ros
 2 | ros.CAM1880	mel_ros
 3 | ros.CAM2045	mel_ros
 4 | ros.CAM2059	mel_ros
 5 | ros.CAM2519	mel_ros
 6 | ros.CAM2552	mel_ros
 7 | ros.CJ2071	mel_ros
 8 | ros.CJ531	mel_ros
 9 | ros.CJ533	mel_ros
10 | ros.CJ546	mel_ros
11 | vul.CS10	mel_vul
12 | vul.CS3603	mel_vul
13 | vul.CS3605	mel_vul
14 | vul.CS3606	mel_vul
15 | vul.CS3612	mel_vul
16 | vul.CS3614	mel_vul
17 | vul.CS3615	mel_vul
18 | vul.CS3617	mel_vul
19 | vul.CS3618	mel_vul
20 | vul.CS3621	mel_vul
21 | mal.CS1002	mel_mal
22 | mal.CS1011	mel_mal
23 | mal.CS1815	mel_mal
24 | mal.CS21	mel_mal
25 | mal.CS22	mel_mal
26 | mal.CS24	mel_mal
27 | mal.CS586	mel_mal
28 | mal.CS594	mel_mal
29 | mal.CS604	mel_mal
30 | mal.CS615	mel_mal
31 | ama.JM160	mel_ama
32 | ama.JM216	mel_ama
33 | ama.JM293	mel_ama
34 | ama.JM48	mel_ama
35 | ama.MJ11-3188	mel_ama
36 | ama.MJ11-3189	mel_ama
37 | ama.MJ11-3202	mel_ama
38 | ama.MJ12-3217	mel_ama
39 | ama.MJ12-3258	mel_ama
40 | ama.MJ12-3301	mel_ama
41 | melG.CAM1349	mel_mel
42 | melG.CAM1422	mel_mel
43 | melG.CAM2035	mel_mel
44 | melG.CAM8171	mel_mel
45 | melG.CAM8216	mel_mel
46 | melG.CAM8218	mel_mel
47 | melG.CJ13435	mel_mel
48 | melG.CJ9315	mel_mel
49 | melG.CJ9316	mel_mel
50 | melG.CJ9317	mel_mel
51 | chi.CAM25091	cyd_chi
52 | chi.CAM25137	cyd_chi
53 | chi.CAM580	cyd_chi
54 | chi.CAM582	cyd_chi
55 | chi.CAM585	cyd_chi
56 | chi.CAM586	cyd_chi
57 | chi.CJ553	cyd_chi
58 | chi.CJ560	cyd_chi
59 | chi.CJ564	cyd_chi
60 | chi.CJ565	cyd_chi
61 | zel.CS1	cyd_zel
62 | zel.CS1028	cyd_zel
63 | zel.CS1029	cyd_zel
64 | zel.CS1030	cyd_zel
65 | zel.CS1033	cyd_zel
66 | zel.CS1035	cyd_zel
67 | zel.CS2	cyd_zel
68 | zel.CS2262	cyd_zel
69 | zel.CS273	cyd_zel
70 | zel.CS30	cyd_zel
71 | flo.CS12	tim_flo
72 | flo.CS13	tim_flo
73 | flo.CS14	tim_flo
74 | flo.CS15	tim_flo
75 | flo.CS2337	tim_flo
76 | flo.CS2338	tim_flo
77 | flo.CS2341	tim_flo
78 | flo.CS2350	tim_flo
79 | flo.CS2358	tim_flo
80 | flo.CS2359	tim_flo
81 | thxn.JM313	tim_txn
82 | thxn.JM57	tim_txn
83 | thxn.JM84	tim_txn
84 | thxn.JM86	tim_txn
85 | thxn.MJ12-3221	tim_txn
86 | thxn.MJ12-3233	tim_txn
87 | thxn.MJ12-3308	tim_txn
88 | txn.MJ11-3339	tim_txn
89 | txn.MJ11-3340	tim_txn
90 | txn.MJ11-3460	tim_txn
91 | nu_sil.MJ09-4125	num
92 | nu_sil.MJ09-4184	num
93 | 


--------------------------------------------------------------------------------
/ABBA_BABA_windows/images/map_and_tree.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonhmartin/tutorials/ce985e50afa701fd1d217a41e66340dadd1325a9/ABBA_BABA_windows/images/map_and_tree.jpg


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | # Tutorials on evolutionary genomics
 3 | 
 4 | * [ABBA BABA statistics using genome wide SNP data](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome/README.md)<br>
 5 | A tutorial that explores the "*ABBA BABA*" *D* and *f* statistics, and how these can be applied using genome-scale SNP data, including significance testing using the block jackknife.
 6 | 
 7 | * [ABBA BABA analysis in sliding windows](https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_windows/README.md)<br>
 8 | A tutorial on applying "*ABBA BABA*" statistics (specifically f<sub>d</sub>; [Martin et al. 2015](https://doi.org/10.1093/molbev/msu269)) in sliding windows to quantify variation in the extent of introgression across the genome.
 9 | 
10 | * [topology_weighting](https://github.com/simonhmartin/tutorials/tree/master/topology_weighting/README.md)<br>
11 | A tutorial on topology weighting using [Twisst](https://github.com/simonhmartin/twisst) to explore evolutionary relationships across the genome, as described in [Martin & Van Belleghem 2017](http://doi.org/10.1534/genetics.116.194720).
12 | 


--------------------------------------------------------------------------------
/topology_weighting/README.md:
--------------------------------------------------------------------------------
  1 | # Tutorial: Topology Weighting
  2 | ___
  3 | ## Requirements
  4 | * Python 3+
  5 |     * Numpy 1.10+
  6 |     * [ete3](http://etetoolkit.org/)
  7 |     * [msprime](https://msprime.readthedocs.io/en/stable/) (for part 3 only)
  8 | * R 3.0+
  9 |     * ape
 10 |     * data.table
 11 | * [Phyml](http://www.atgc-montpellier.fr/phyml/) (for part 2 only)
 12 | 
 13 | ___
 14 | ## Introduction
 15 | 
 16 | Topology weighting is a means to quantify relationships among taxa that are not necessarily monophyletic. It provides a summary of a complex genealogy by considering simpler "taxon topologies" and quantifying the proportion of sub-trees that match each taxon topology. The method we use to compute the weightings is called *Twisst*: Topology weighting by iterative sampling of sub-trees.
 17 | 
 18 | In this practical we will use a simulation to explore how topology weightings provide a summary of the genealogical history. We will then try to infer topology weights across our simulated chromosome using neighbour joining trees inferred for narrow windows.
 19 | 
 20 | 
 21 | #### Workflow Overview
 22 | 
 23 | In **Part 1** of the practical we will analyse a set of genealogies that represent the history of a part of chromosome that evolved under a fairly complex history including population subdivision, gene flow and selection. We will compute topology weightings across this genomic region with `twisst` and then explore the results in `R`.
 24 | 
 25 | In **Part 2**, we take a step backwards into the real world, in which we ***don't know*** the true genealogical history, but instead we have a set of sequences from which we hope to ***infer*** the genealogy. We will use an unsophisticated approach to do this: making phylogenies for windows across the genome, using a standard phylogenetics tool. By comparing our inferred histories to the truth in `R`, we will gain insights into the **tradeoff between power and resolution** in genealogical inference.
 26 | 
 27 | In **Part 3** we will explore how demographic parameters affect weightings by running coalescent simulations with `msprime` and then computing topology weightings directly from the output, all within Python.
 28 | 
 29 | ___
 30 | ## Practical Part 1. Analysis of simulated genealogies
 31 | 
 32 | #### Download code and data
 33 | 
 34 | The scripts and example data for this part of the practical are in the `twisst` package on github.
 35 | 
 36 | ```bash
 37 | #download the twisst package archive (.tar.gz) from github
 38 | wget https://github.com/simonhmartin/twisst/archive/v0.2.tar.gz
 39 | 
 40 | #extract the files from the archive
 41 | tar -xzf v0.2.tar.gz
 42 | 
 43 | #remove the archive
 44 | rm v0.2.tar.gz
 45 | ```
 46 | 
 47 | * The example data we will use consists of a text file of genealogies coded as [Newick](https://en.wikipedia.org/wiki/Newick_format) trees. In this case the trees were simulated using the coalescent simulator [`msms`](https://www.mabs.at/ewing/msms/index.shtml). If we had real data, we would not know the trees, and would have to infer them using tools like Relate, tsinfer, or by just running phylogeny inference on narrow windows, which we do in Part 2 below.
 48 | 
 49 | 
 50 | We can look at the first tree in the file:
 51 | 
 52 | ```bash
 53 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.trees.gz | head -n 1
 54 | ```
 55 | 
 56 | It's pretty ugly, but don't be afraid. The numbers before each `:` are the sample names. The numbers after the `:` are the branch lengths. We will only be considering the tree shape and not branch lengths in this tutorial.
 57 | 
 58 | * We can also check the total number of distinct genealogies for this region of the chromosome:
 59 | 
 60 | ```bash
 61 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.trees.gz | wc -l
 62 | ```
 63 | 
 64 | * For plotting, we also need to know where these genealogies occur on the chromosome. This data is provided in a second file with three columns: chromosome, start and end for each genealogy. This file has the same number of lines as the trees file.
 65 | 
 66 | ```bash
 67 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz | head
 68 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz | wc -l
 69 | ```
 70 | 
 71 | As you can see, some genealogies in this simulated data set occupy very narrow regions of the chromosome, as small as 1 bp. Over many generations of recombination, it can be the case that, for a given group of samples, the genealogy varies at a fine scale across the chromosome. In this case we know the *true* genealogy for each region - it has not been inferred. We will get to that topic in Part 2.
 72 | 
 73 | #### Compute topology weightings
 74 | 
 75 | * We run [`twisst`](https://github.com/simonhmartin/twisst) to compute the weightings for each topology.
 76 | 
  77 | The only information `twisst` requires is:
  78 | * the name of the trees file (specified with `-t`)
  79 | * the name of the output weights file (`-w`)
  80 | * the name of each group, and the samples that belong to it (`-g`).
 81 | 
 82 | The grouping may be determined by species, phenotype or geography (or whatever you like). In our case there are four groups of 10 haploid samples each.
 83 | Group A consists of samples 1:10, B consists of 11:20 etc.
 84 | 
 85 | ```bash
 86 | python twisst-0.2/twisst.py \
 87 | -t twisst-0.2/examples/msms_4of10_l50k_r500_sweep.trees.gz \
 88 | -w msms_4of10_l50k_r500_sweep.weights.tsv.gz \
 89 | -g A 1,2,3,4,5,6,7,8,9,10 \
 90 | -g B 11,12,13,14,15,16,17,18,19,20 \
 91 | -g C 21,22,23,24,25,26,27,28,29,30 \
 92 | -g D 31,32,33,34,35,36,37,38,39,40 \
 93 | --outgroup D
 94 | ```
 95 | 
 96 | `twisst` will consider all possible combinations of samples in which there is one sample per group. For example the first combination examined will be samples `1`, `11`, `21` and `31`, representing groups `A`, `B`, `C` and `D`, respectively. Ignoring all other branches in the tree, `twisst` records the topology of the 'subtree' containing just the four samples of interest, which could have one of three possible shapes: `(((A,B),C),D)`, `(((A,C),B),D)` or `(((B,C),A),D)` (Note that here the trees are represented as rooted, with D as the outgroup. We can tell `twisst` which is the outgroup (`--outgroup`) so it displays the trees as correctly rooted, but this does not affect the results).
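
The idea of "one sample per group" can be sketched in a few lines of Python. This is only an illustration of the combinatorics using `itertools.product`; it is an assumption that the enumeration order matches what `twisst` does, and `twisst` itself may use shortcuts rather than visiting every combination.

```python
from itertools import islice, product

# Groups as in the twisst command above: 10 haploid samples each.
groups = {
    "A": range(1, 11),
    "B": range(11, 21),
    "C": range(21, 31),
    "D": range(31, 41),
}

# Every way of picking exactly one sample from each group.
combos = product(*groups.values())

# The first few combinations; the first is samples 1, 11, 21 and 31.
first_three = list(islice(combos, 3))
print(first_three)

# The total number of combinations is the product of the group sizes.
n_combos = 1
for samples in groups.values():
    n_combos *= len(samples)
print(n_combos)
```

Each of these combinations defines one four-sample subtree, and the weighting of a topology is the number of subtrees that match it.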
 97 | 
 98 | * Check what the results look like
 99 | 
100 | ```bash
 101 | #first 30 lines
102 | zcat msms_4of10_l50k_r500_sweep.weights.tsv.gz | head -n 30
103 | #total number of lines
104 | zcat msms_4of10_l50k_r500_sweep.weights.tsv.gz | wc -l
105 | ```
106 | 
 107 | The three columns in the weights file represent the three topologies, which are defined in the file too. The numbers are not proportions but instead the total number of combinations representing each topology. Each line should sum to 10<sup>4</sup> = 10,000, as there are four groups of 10 samples each.
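
If you want proportions rather than counts, each row can be divided by its total. The sketch below uses invented count rows (they are not values from the output file) and a hypothetical helper called `to_proportions`.

```python
# Hypothetical rows of counts for topo1, topo2 and topo3 (values invented).
count_rows = [
    [10000, 0, 0],
    [6200, 1300, 2500],
]

def to_proportions(counts):
    """Divide each count by the row total so that the weights sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

proportions = [to_proportions(row) for row in count_rows]
print(proportions)
```

Dividing by the row total rather than a hard-coded 10,000 keeps the same code usable when the group sizes change.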
108 | 
 109 | You will see that some adjacent lines have identical weightings. This is because some recombination events change the relationships among the samples, but not in a way that influences the weightings. It's worth understanding why this is the case.
110 | 
111 | #### Analyse the results
112 | 
 113 | * Open `R` or `RStudio` and, if necessary, set the working directory to where you have saved the files. You can do this with the `setwd()` command or, in `RStudio`, through the menus.
114 | 
115 | * Start a new R script to record the commands
116 | 
117 | * First we will import a set of functions distributed with `twisst` that will help with plotting.
118 | 
119 | ```R
120 | source("twisst-0.2/plot_twisst.R")
121 | ```
122 | 
123 | * Please note the cool name of the above script.
124 | 
125 | * We define the files containing the weights for each genealogy, and the start and end positions for each block along the chromosome.
126 | 
127 | ```R
128 | #weights file with a column for each topology
129 | weights_file <- 'msms_4of10_l50k_r500_sweep.weights.tsv.gz'
130 | 
131 | #coordinates file for each window
132 | window_data_file <- 'twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz'
133 | ```
134 | * We already know the structure of these two files, but instead of reading them in and working with them directly, we will use the convenient `import.twisst` function.
135 | 
136 | ```R
137 | twisst_data <- import.twisst(weights_file, window_data_file)
138 | ```
139 | 
 140 | * Plot the raw weightings using the provided `plot.twisst` function.
141 | 
142 | ```R
 143 | pdf("simulated_trees_weights.pdf", width=8, height=6)
144 | plot.twisst(twisst_data)
145 | dev.off()
146 | ```
147 | 
 148 | * This will write a pdf file to the directory you are working in. Open it!
149 | 
150 | The trees at the top of the plot show the 3 different topologies we have weighted. The lower plot shows the weightings. You will see columns of colour of varying width. Each column corresponds to a single block with a unique genealogy. Some blocks are all one colour and reach a value of 1. That indicates that all subtrees in that block have the same topology, indicating a consistent and completely sorted genealogy. Other columns have two or more colours overlaid, indicating that the genealogy has a more complex evolutionary history, with individuals jumping between groups. A completely random genealogy, in which there is no clustering by group, would have equal weightings for all three topologies.
151 | 
152 | It is often desirable to smooth the weightings so that we can see more clearly how they vary across the chromosome.
153 | 
154 | * Create smoothed weightings using the `smooth.twisst` function and re-plot.
155 | 
156 | ```R
157 | twisst_data_smooth <- smooth.twisst(twisst_data, span_bp=5000)
158 | 
 159 | pdf("simulated_trees_weights_smooth.pdf", width=8, height=6)
160 | plot.twisst(twisst_data_smooth)
161 | dev.off()
162 | ```
163 | 
164 | This averaged the weightings over a 5 kb window. You can explore what happens when you change the `span_bp` parameter.
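
To build intuition for what averaging over a span means, here is a minimal Python sketch of a length-weighted moving average over blocks. It is not the `smooth.twisst` implementation, whose internals are not described here; the function name `window_mean` and the block values are invented for illustration.

```python
def window_mean(blocks, centre, span_bp):
    """Length-weighted mean weighting of the blocks overlapping a window.

    blocks: list of (start, end, weight) tuples, one per genealogy block.
    """
    lo, hi = centre - span_bp / 2, centre + span_bp / 2
    total_len = 0.0
    total = 0.0
    for start, end, weight in blocks:
        # Number of bases of this block that fall inside the window.
        overlap = min(end, hi) - max(start, lo)
        if overlap > 0:
            total_len += overlap
            total += overlap * weight
    return total / total_len if total_len else float("nan")

# Invented blocks: (start, end, weighting of one topology as a proportion).
blocks = [(0, 2000, 1.0), (2000, 3000, 0.0), (3000, 10000, 0.5)]

smoothed = window_mean(blocks, centre=2500, span_bp=5000)
print(smoothed)
```

A wider span pulls in more blocks and gives a smoother line, at the cost of blurring narrow features such as the spike at a selected locus.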
165 | 
 166 | Now we see more clearly that the dominant topologies are `topo1` and `topo3`. In this case, the simulations involved populations splitting according to `topo1`, but adaptive introgression was simulated from `C` into `B`, which is why `topo3` is more prevalent than `topo2`, and also why `topo3` has a large spike in the middle of the region. This is the location of the selected locus.
167 | 
168 | * We can look at the overall distribution of weightings too, or just check the mean values.
169 | 
170 | ```R
171 | plot.twisst.summary.boxplot(twisst_data)
172 | 
173 | twisst_data$weights_mean
174 | ```
175 | 
176 | The simulation history followed `topo1`, which is the most abundant topology, as expected. The introgression created an excess of `topo3`. But why is `topo2` not zero?
177 | 
178 | Lineage sorting is often incomplete if the taxa split recently, and even when it is complete, we can find genealogies that are discordant with the 'species tree' due to stochasticity in lineage sorting in the past. If you look carefully at the first plot we made, you might find one narrow window in which `topo2` has a weighting of 1. This indicates a *completely sorted*, but *discordant* genealogy. If the difference between incomplete lineage sorting and discordance is not immediately clear to you, you are not alone. In Part 3 we will do our own simulations to look at the conditions under which incomplete sorting and discordance increase or decrease.
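
One way to keep the two ideas apart is to label each row of the weights file. The sketch below is a hypothetical helper (the name `classify` and the example rows are invented), assuming that `topo1` is the topology matching the species tree, as in this simulation.

```python
def classify(counts, species_topo=0):
    """Label one row of topology counts.

    counts: number of subtrees matching each topology in one block.
    species_topo: index of the topology that matches the species tree.
    """
    total = sum(counts)
    top = max(range(len(counts)), key=lambda i: counts[i])
    if counts[top] < total:
        # More than one topology is represented among the subtrees.
        return "incompletely sorted"
    if top == species_topo:
        return "completely sorted, concordant"
    return "completely sorted, discordant"

# Invented example rows for topo1, topo2 and topo3.
rows = ([10000, 0, 0], [0, 10000, 0], [7000, 1000, 2000])
labels = [classify(row) for row in rows]
print(labels)
```

Sorting is about whether a single topology accounts for all subtrees in a block; discordance is about which topology that is.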
179 | 
180 | ___
181 | 
 182 | ## Practical Part 2: Inferring weightings from sequence data
183 | 
184 | Above we have used the 'true' genealogies as they were simulated. In most cases, all we have is sequence data, and its evolutionary history has to be inferred. In fact there are two things we do not know:
185 | 1. We do not know the genealogical relationship among all individuals
186 | 2. We do not know the 'breakpoints' at which recombination has changed the relationship as we move along the chromosome
187 | 
 188 | In this part, we will start from sequence data (the sequences were simulated under the history covered in Part 1, but we pretend that we do not know that at this stage). We will use a fairly straightforward approach in which we infer genealogies in windows along the genome. We will then run `twisst` on these to see whether we can recover something close to the underlying truth.
189 | 
 190 | Note that one of the lessons in this part is that inferring trees in windows is **crude and potentially flawed**, especially if we get the window size wrong.
191 | 
192 | #### Download code and data
193 | 
 194 | * The scripts for this part are in the `genomics_general` package on GitHub, which we need to download:
195 | 
196 | ```bash
197 | #download package
198 | wget https://github.com/simonhmartin/genomics_general/archive/v0.3.tar.gz
199 | #extract files from zipped archive
200 | tar -xzf v0.3.tar.gz
201 | #delete zipped file
202 | rm v0.3.tar.gz
203 | ```
204 | 
 205 | * To ensure that the libraries are recognisable by Python, add the `genomics_general` directory to the Python path:
206 | 
207 | ```bash
208 | export PYTHONPATH=$PYTHONPATH:genomics_general-0.3
209 | ```
210 | 
211 | * The sequence file we will use is provided with the `twisst` package, downloaded in Part 1. The file is in simple `.geno` format, which has columns for chromosome, position and genotype for each individual:
212 | 
213 | ```bash
214 | zcat twisst-0.2/examples/msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz | head -n 5 | cut -f 1-8
215 | ```
216 | 
 217 | (In Part 2B we will use a script that generates a `.geno` file from a VCF file.)
218 | 
 219 | #### Inferring trees for windows
220 | 
221 | We have a file of SNPs distributed across a 50 kb genomic region. We will infer trees in windows of a defined number of SNPs, such that each window has a similar amount of information, but might differ in its absolute span across the chromosome, depending on the SNP density.
222 | 
223 | There is an underlying **tradeoff** in this approach. We want to select a window size that is ***large enough*** to provide the necessary ***power*** for tree inference, but ***small enough*** to achieve enough ***resolution*** to capture how genealogical histories change across the chromosome.
224 | 
 225 | We will run the script that reads the SNP file in windows and then infers a tree for each window using [Phyml](http://www.atgc-montpellier.fr/phyml/). Phyml is capable of maximum likelihood inference, but here we will not use optimisation, so the output trees will be Neighbour-Joining trees, inferred using the [BIONJ](http://www.atgc-montpellier.fr/bionj/) algorithm. Using simulations we have found that the neighbour-joining algorithm performs better than maximum likelihood inference for short sequences.
226 | 
227 | * First just check the options of the script
228 | 
229 | ```bash
230 | python genomics_general-0.3/phylo/phyml_sliding_windows.py -h
231 | ```
232 | 
233 | The main options are `-g` to specify the input `.geno` file and `--prefix` to specify the prefix of the output files
234 | 
235 | There are also options for setting the type and size of the window. We will run the script four times, using a range of different window sizes of 20, 50, 100, and 500 SNPs. Note that the input file contains only SNPs, so by setting `--windType sites` each window will be set to contain a fixed number of SNPs. By partitioning the windows this way, their absolute sizes on the chromosome will vary with SNP density. The start and end positions of each window will be recorded in an output file.
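
The effect of defining windows by SNP count rather than by length can be sketched in Python. This is an illustration only: the function `snp_windows` and the positions are invented, and how `phyml_sliding_windows.py` treats the final partial window or reports coordinates is not shown here.

```python
def snp_windows(positions, window_size):
    """Split sorted SNP positions into consecutive windows of fixed SNP count.

    Returns (start, end, n_snps) for each window; the last may be shorter.
    """
    windows = []
    for i in range(0, len(positions), window_size):
        chunk = positions[i:i + window_size]
        windows.append((chunk[0], chunk[-1], len(chunk)))
    return windows

# Invented positions: dense SNPs on the left, sparse SNPs on the right.
positions = [10, 20, 30, 40, 50, 60, 1000, 3000, 6000, 9000, 12000, 20000]

for start, end, n in snp_windows(positions, window_size=6):
    print(start, end, n, "span =", end - start + 1)
```

Both windows hold the same number of SNPs, but the second spans a far longer stretch of chromosome, which is where resolution is lost.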
236 | 
 237 | Finally, there are options for how to run Phyml. Here we just need to set `--optimise n` to turn off maximum-likelihood optimisation, and define the substitution model with `--model HKY85`, as that is the model under which the sequences were simulated.
238 | 
239 | * Run the script using a loop that sets a different window size each time
240 | 
 241 | ```bash
242 | for x in 20 50 100 500
243 | do
244 | echo "Inferring trees with window size $x"
245 | 
246 | python genomics_general-0.3/phylo/phyml_sliding_windows.py \
247 | -g twisst-0.2/examples/msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz \
248 | --prefix msms_4of10_l50k_r500_sweep.seqgen.SNP.w$x.phyml_bionj \
249 | --windType sites -w $x  --model HKY85 --optimise n
250 | 
251 | done
252 | ```
253 | 
254 | Each time it ran, the script generated two output files: `.trees.gz` files contain the trees for each window in Newick format as we saw in Part 1. `.data.tsv` files contain the coordinates for each window, as well as the likelihood of the tree (the latter is irrelevant for us in this activity).
255 | 
256 | We can now compute the weightings across the chromosome using the trees files as input. This is the same as we did in Part 1 for the simulated trees, but now we are doing it for trees we have inferred from SNPs. So we are hoping to replicate the 'true' weightings we computed in Part 1 as closely as possible. Fingers crossed!
257 | 
 258 | * Run `twisst` on the trees file generated for each window size, specifying the same groups as we did in Part 1.
259 | 
260 | ```bash
261 | for x in 20 50 100 500
262 | do
263 | echo "Running Twisst for window size $x"
264 | 
265 | python twisst-0.2/twisst.py \
266 | -t msms_4of10_l50k_r500_sweep.seqgen.SNP.w$x.phyml_bionj.trees.gz \
267 | -w msms_4of10_l50k_r500_sweep.seqgen.SNP.w$x.phyml_bionj.weights.tsv \
268 | -g A 1,2,3,4,5,6,7,8,9,10 \
269 | -g B 11,12,13,14,15,16,17,18,19,20 \
270 | -g C 21,22,23,24,25,26,27,28,29,30 \
271 | -g D 31,32,33,34,35,36,37,38,39,40 \
272 | --outgroup D
273 | 
274 | done
275 | ```
276 | 
277 | #### Plotting inferred weights
278 | 
279 | As we did in Part 1, we can now plot the weights for these inferred trees across the chromosome. This can be done in the same R script as before.
280 | 
281 | * **Open R again** (if you have restarted R, you may need to reload the `plot_twisst.R` script).
282 | 
283 | ```R
284 | source("twisst-0.2/plot_twisst.R")
285 | ```
286 | 
287 | * As before we read in the weights and window data files. This time we will load the original ***true*** weights from the simulated genealogies, as well as the four files of ***inferred*** weights that we have just computed.
288 | 
289 | 
290 | ```R
291 | weights_files <- c('msms_4of10_l50k_r500_sweep.weights.tsv.gz', #true weights file from Part 1
292 |                    'msms_4of10_l50k_r500_sweep.seqgen.SNP.w20.phyml_bionj.weights.tsv',
293 |                    'msms_4of10_l50k_r500_sweep.seqgen.SNP.w50.phyml_bionj.weights.tsv',
294 |                    'msms_4of10_l50k_r500_sweep.seqgen.SNP.w100.phyml_bionj.weights.tsv',
295 |                    'msms_4of10_l50k_r500_sweep.seqgen.SNP.w500.phyml_bionj.weights.tsv')
296 | 
297 | window_data_files <- c('twisst-0.2/examples/msms_4of10_l50k_r500_sweep.data.tsv.gz',
298 |                       'msms_4of10_l50k_r500_sweep.seqgen.SNP.w20.phyml_bionj.data.tsv',
299 |                       'msms_4of10_l50k_r500_sweep.seqgen.SNP.w50.phyml_bionj.data.tsv',
300 |                       'msms_4of10_l50k_r500_sweep.seqgen.SNP.w100.phyml_bionj.data.tsv',
301 |                       'msms_4of10_l50k_r500_sweep.seqgen.SNP.w500.phyml_bionj.data.tsv')
302 | ```
303 | 
 304 | * Load in weightings and window data. Note that when given multiple input files, the `import.twisst` function will interpret them as separate chromosomes.
305 | 
306 | ```R
307 | twisst_data <- import.twisst(weights_files, window_data_files,
308 |                              names=c("True weights", "20 SNP windows", "50 SNP windows", "100 SNP windows", "500 SNP windows"))
309 | 
310 | ```
311 | 
 312 | * Now we plot again to compare the true and inferred weightings (you might need to expand your plot window to display the multiple plots correctly).
313 | 
314 | ```R
 315 | pdf("simulated_and_inferred_trees_weights_comparison.pdf", width=8, height=8)
316 | plot.twisst(twisst_data, show_topos=FALSE, include_region_names=T)
317 | dev.off()
318 | ```
319 | 
320 | How well did the inference capture the truth?
321 | Which window size is best?
322 | Where is there too little power, and where is there too little resolution?
323 | 
324 | In the future, these difficulties could be solved by using tools like Relate and tsinfer to infer the complete ancestral recombination graph, but to date they have not been tested on datasets from multiple species.
325 | 
 326 | ## Practical Part 2B: Inferring weightings from sequence data
327 | 
328 | If you're tired of simulated data, here's a chance to work with real data. Note this part of the practical takes a bit longer, so you might want to skip it if you are low on time.
329 | 
330 | We will now run the same steps as above on real data. The input data are phased sequences from [this paper on *Heliconius* butterflies](https://doi.org/10.1371/journal.pbio.2006288). Fig 1 of the paper shows the makeup of the data set, with 9 populations of 10 samples each and two outgroup individuals.
331 | 
332 | We are interested in the role of gene flow in shaping the relationships between three species. *H. cydno* and *H. timareta* are sister species that occur on opposite sides of the Andes Mountains. *H. melpomene* occurs on both sides of the Andes, so there is sympatry between *cydno* and *melpomene* to the west and between *timareta* and *melpomene* on the eastern slopes. Hybrids are very rare in the wild, but genomic evidence indicates that gene flow occurs in both areas of sympatry.
333 | 
 334 | The sampling includes two *cydno* populations (also called races because they have different colour patterns), two *timareta* populations and five *melpomene* populations. There are also two outgroup individuals.
335 | 
336 | #### Infer trees for windows
337 | 
338 | There are many combinations of populations we could use. Regardless of which combination we use, we need to start by computing trees across the genome for the complete set of samples.
339 | 
 340 | * In this case we have a phased VCF file for a 1 Mb region of chromosome 18, so our command to infer the trees has two steps. First we convert the VCF into the correct format, and then pipe it to the `phyml_sliding_windows` script to infer the trees.
341 | 
342 | ```bash
343 | python genomics_general-0.3/VCF_processing/parseVCF.py -i twisst-0.2/examples/heliconius92.chr18.500001-1500000.phased.vcf.gz |
344 | python genomics_general-0.3/phylo/phyml_sliding_windows.py --threads 2 \
345 | --prefix heliconius92.chr18.500001-1500000.phyml_bionj \
346 | --windType sites -w 50 --model GTR --optimise n
347 | ```
348 | We've set the window size to 50 SNPs, because that seemed a reasonable compromise based on the simulations above. The model is set to GTR which is a somewhat arbitrary choice as we don't know what the most suitable model is here. However, given that these are very closely related taxa with few substitutions (short branches) there is no risk of mutation saturation, so the substitution model probably has very little impact.
349 | 
350 | This could take a few minutes (~1300 trees to make with 184 tips each), so it's a good time to get a cup of coffee.
351 | 
352 | #### Compute topology weights
353 | 
354 | For `twisst`, we usually select 4 or 5 taxa. Any more and the number of possible topologies becomes large, so you would need to have a clear hypothesis about which particular topologies you want to focus on. The number of samples per taxon can be any number, but more than 4 is recommended to avoid noisy output.
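
How quickly the number of topologies grows can be made concrete. Assuming that the topologies counted are the distinct unrooted bifurcating trees connecting the chosen taxa (which matches the three topologies seen for four groups in Part 1), the count for n taxa is the double factorial (2n - 5)!!. The function name below is invented for this sketch.

```python
def n_unrooted_topologies(n_taxa):
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in range(4, 8):
    print(n, "taxa:", n_unrooted_topologies(n), "topologies")
```

Going from five to six taxa multiplies the number of topologies by seven, which is why a clear hypothesis is needed before adding more groups.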
355 | 
 356 | Here we will focus on two sympatric pairs: *H. cydno chioneus* ('chi') and *H. melpomene rosina* ('ros') from Panama, and *H. timareta thelxinoe* ('txn') and *H. melpomene amaryllis* ('ama') from Peru. Hybridisation occurs in both locations, but you will notice that while the Panama pair have divergent colour patterns, the Peru pair are identical. This is hypothesised to have resulted from adaptive introgression.
357 | 
 358 | The 1 Mb region on chromosome 18 that we are targeting contains the gene *optix*, which is the controller of the red forewing band shared by the pair in Peru.
359 | 
 360 | * We will run `twisst` for these two pairs of taxa, but instead of specifying all individuals that belong to each group on the command line, we just provide a groups file. You can view the format of this file:
361 | 
362 | ```bash
363 | head -n 25 twisst-0.2/examples/heliconius92.pop.txt
364 | ```
365 | 
 366 | * Then run `twisst`, just specifying the group *names* and providing a groups file containing all individual names.
367 | 
368 | ```bash
369 | python twisst-0.2/twisst.py -t heliconius92.chr18.500001-1500000.phyml_bionj.trees.gz \
370 | -w heliconius92.chr18.500001-1500000.phyml_bionj.weights.tsv \
371 | -g chi -g txn -g ros -g ama --groupsFile twisst-0.2/examples/heliconius92.pop.txt
372 | ```
373 | 
 374 | #### Plotting the result
375 | 
376 | * **Open R again** (if you have restarted R, you may need to reload the `plot_twisst.R` script).
377 | 
378 | ```R
379 | source("twisst-0.2/plot_twisst.R")
380 | ```
381 | 
382 | ```R
383 | weights_file = "heliconius92.chr18.500001-1500000.phyml_bionj.weights.tsv"
384 | data_file = "heliconius92.chr18.500001-1500000.phyml_bionj.data.tsv"
385 | 
386 | twisst_data <- import.twisst(weights_file, data_file)
387 | ```
388 | 
389 | * We will smooth the weightings because we are plotting across a fairly large (1 Mb) region. And then plot.
390 | 
391 | ```R
392 | twisst_data_smooth <- smooth.twisst(twisst_data, span_bp = 20000)
393 | 
394 | pdf("Heliconius_optix_region_weights_smooth.pdf", width=8, height=6)
395 | plot.twisst(twisst_data_smooth, tree_type="unrooted")
396 | dev.off()
397 | ```
398 | Topology 1 (blue) groups the two *melpomene* populations (ros and ama) together and *cydno* (chi) with *timareta* (txn), so this is the expected 'species' topology. We see a clear shift from the species topology to the 'geography' topology (topo 2, green), between position 1 and 1.2 Mb. *optix* is found at around 1 Mb, so this looks like the expected signature of introgression causing recent coalescence between ama and txn. Interestingly, it is confined to the intergenic region near *optix*, so it is consistent with introgression of regulatory variants.
399 | 
 400 | We used unrooted trees because there is no outgroup in this set of four taxa. This means **we cannot determine the direction of introgression**. To clarify which taxa have shared genes, and in which direction, we need to include an outgroup. Fortunately we have an outgroup in the form of two samples from the more distant species *H. numata* ('num').
401 | 
402 | #### Repeating the analysis with an outgroup
403 | 
404 | * **Return to the regular terminal** and re-run `twisst` specifying the outgroup.
405 | 
406 | ```bash
407 | python twisst-0.2/twisst.py -t heliconius92.chr18.500001-1500000.phyml_bionj.trees.gz \
408 | -w heliconius92.chr18.500001-1500000.phyml_bionj.5pops.weights.tsv \
409 | -g chi -g txn -g ros -g ama -g num --groupsFile twisst-0.2/examples/heliconius92.pop.txt --outgroup num
410 | ```
411 | 
412 | Now there will be 15 different possible topologies!
413 | 
 414 | * **Back in R**, we can plot the new results.
415 | 
416 | ```R
417 | weights_file = "heliconius92.chr18.500001-1500000.phyml_bionj.5pops.weights.tsv"
418 | data_file = "heliconius92.chr18.500001-1500000.phyml_bionj.data.tsv"
419 | 
420 | twisst_data <- import.twisst(weights_file, data_file)
421 | ```
422 | 
423 | ```R
424 | twisst_data_smooth <- smooth.twisst(twisst_data, span_bp = 20000)
425 | 
426 | pdf("Heliconius_5pops_optix_region_weights_smooth.pdf", width=8, height=6)
427 | plot.twisst(twisst_data_smooth, ncol_topos=5)
428 | dev.off()
429 | ```
430 | 
431 | This shows that two topologies dominate. One is topology 3, which you will notice is the 'species' topology, in which *cydno* (chi) groups with *timareta* (txn) and the two *melpomene* populations (ros and ama) group together. The other is topology 11 (pink), in which txn is found grouped with ama (**nested within the *melpomene* pair**). This tells us that the predominant direction of introgression was from *H. melpomene amaryllis* into *H. timareta thelxinoe*. This fits with our current understanding of this group in which *timareta* expanded down the eastern slopes of the Andes, hybridising with *melpomene* and acquiring the favoured warning pattern alleles in each region.
432 | 
433 | ___
434 | 
435 | ## Practical Part 3: Topology weighting using Tree Sequence format
436 | 
437 | So far, we have analysed tree files that have a distinct tree for each chromosome 'block' with a unique genealogy, or for each window in the case of inferred trees. This format is somewhat wasteful, because adjacent trees are often extremely similar, usually only different by a single recombination event, which moves one branch from one point in the tree to another.
438 | 
439 | The Tree Sequence format is efficient because it records not only the connections between nodes on the tree, but the length of the chromosome for which each connection exists. This format is used by [`msprime`](https://msprime.readthedocs.io/en/stable/) and the [`tskit`](https://tskit.readthedocs.io/en/latest/index.html) package, which also provides more information.
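
The saving can be seen in a toy example. The sketch below is not `tskit`'s API or file format; it is an invented, minimal data structure in which every parent-child connection is stored once, together with the stretch of chromosome over which it exists. Node IDs and coordinates are made up.

```python
# Each edge says: `child` hangs below `parent` over the half-open interval
# [left, right) of the chromosome. All values are invented for illustration.
edges = [
    # (left, right, parent, child)
    (0, 1000, 4, 0),    # shared by both trees
    (0, 1000, 4, 1),    # shared by both trees
    (0, 600, 5, 2),     # left tree only
    (0, 600, 5, 3),
    (0, 600, 6, 4),
    (0, 600, 6, 5),
    (600, 1000, 7, 2),  # right tree only
    (600, 1000, 7, 4),
    (600, 1000, 8, 3),
    (600, 1000, 8, 7),
]

def tree_at(position):
    """Return the child -> parent map of the tree covering `position`."""
    return {child: parent
            for left, right, parent, child in edges
            if left <= position < right}

left_tree = tree_at(100)
right_tree = tree_at(800)

# Connections present in both trees are stored only once in `edges`.
shared = set(left_tree.items()) & set(right_tree.items())
print(sorted(shared))
```

Two separate trees would need twelve connections here, whereas the edge table needs ten; with thousands of samples and many similar adjacent trees the saving becomes far larger.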
440 | 
441 | #### Simulating our first tree sequence
442 | 
443 | We will use `msprime` to simulate a tree sequence. `msprime` is a coalescent simulator, which means it works by computing the probability that any two individuals share a common ancestor at a given time in the past. In a single population, this is determined by the population size. With multiple populations, this is also affected by the rates of migration between populations, and how long ago they descend from a single ancestral population.
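
For intuition about why population size matters, the sketch below uses a textbook property of the coalescent rather than anything specific to `msprime`'s internals: assuming a diploid Wright-Fisher population of constant size `Ne`, two lineages find a common ancestor in any one generation with probability 1/(2`Ne`). The function name is invented.

```python
import math

def prob_not_coalesced(t_generations, Ne):
    """Chance that two lineages in one diploid population of size Ne
    have not yet found a common ancestor after t generations."""
    return (1 - 1 / (2 * Ne)) ** t_generations

for Ne in (1000, 10000, 100000):
    p = prob_not_coalesced(1000, Ne)
    print(f"Ne={Ne}: P(no coalescence within 1000 generations) = {p:.3f}")

# The discrete formula is closely approximated by exp(-t / (2 * Ne)).
approx = math.exp(-1000 / (2 * 10000))
print(round(approx, 3))
```

The larger the population, the more likely it is that lineages are still separate when we reach a population split going backwards in time.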
444 | 
445 | We will use the Python interactive environment for working with `msprime` and the tree sequence, and also to analyse it using a function from `twisst`.
446 | 
 447 | * To ensure that we can import the `twisst` module from within Python, add it to our Python path. (This assumes you already downloaded the `twisst` package in Part 1.)
448 | 
449 | ```bash
450 | export PYTHONPATH=$PYTHONPATH:twisst-0.2
451 | ```
452 | 
453 | * Now **open a Python interactive session** (type 'python'), and also open a script in a text editor, because we are going to modify and rerun some of these lines multiple times to test the effects of different simulation parameters
454 | 
455 | * Import the required modules.
456 | 
457 | ```python
458 | import msprime
459 | import matplotlib.pyplot as plt
460 | import twisst
461 | ```
462 | 
463 | * We will start with a simple simulation of 10 samples from a single population. We have to specify the length and recombination rate, so msprime will give us a sequence of more than one genealogy, separated by recombination. Here we also specify the random seed just to ensure that in this case we all get the same simulation.
464 | 
465 | ```python
466 | ts = msprime.simulate(sample_size=10,
467 |                       Ne=1000,
468 |                       length=10000,
469 |                       recombination_rate=5e-8,
470 |                       random_seed = 1)
471 | ```
472 | 
473 | * We can check how many distinct genealogies are in the tree sequence.
474 | 
475 | ```python
476 | ts.num_trees
477 | ```
478 | 
479 | * And view them using a nice visualisation method provided with tree sequence objects.
480 | 
481 | ```python
482 | for tree in ts.trees():
483 |     print("interval = ", tree.interval)
484 |     print(tree.draw(format="unicode"))
485 | 
486 | ```
487 | Can you tell what is different between the trees? And what has stayed unchanged?
488 | 
489 | * If you would like to explore further, the related tool [`tskit`](https://tskit.readthedocs.io/en/latest/index.html) has many inbuilt functions to analyse tree sequences.
490 | 
491 | #### A larger simulation with four populations and gene flow
492 | 
493 | * We will now set up a larger simulation with multiple populations. This requires a few different components, which we will define separately. First we define the number of samples and population size of each population.
494 | 
495 | ```python
496 | pop_n = 10
497 | pop_Ne = 10000
498 | 
499 | population_configurations = [msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne),
500 |                              msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne),
501 |                              msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne),
502 |                              msprime.PopulationConfiguration(sample_size=pop_n, initial_size=pop_Ne)]
503 | ```
504 | 
505 | * Next we set the migration rates between populations. These can be defined in the form of a 4x4 matrix, with each entry giving the rate of migration into one population (rows) from the other (columns). Here we set a moderate level of migration in both directions between the second and third populations, and no migration between the others. The value represents *m*, the proportion of the population made up by migrants each generation.
506 | 
507 | ```python
508 | migration_matrix = [[0,    0,    0,    0],
509 |                     [0,    0,    1e-4, 0],
510 |                     [0,    1e-4, 0,    0],
511 |                     [0,    0,    0,    0]]
512 | ```
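
A quick way to check that a matrix says what you intend is to print its non-zero entries in words. The sketch below follows the row and column convention described above (row = receiving population, column = source population); the variable names `flows` and `symmetric` are invented for illustration.

```python
migration_matrix = [[0,    0,    0,    0],
                    [0,    0,    1e-4, 0],
                    [0,    1e-4, 0,    0],
                    [0,    0,    0,    0]]

# Collect every non-zero entry as (recipient, source, rate).
flows = [(recipient, source, m)
         for recipient, row in enumerate(migration_matrix)
         for source, m in enumerate(row)
         if m > 0]

for recipient, source, m in flows:
    print(f"population {recipient} receives migrants from population {source} at m = {m}")

# A symmetric matrix means migration is equal in both directions.
n = len(migration_matrix)
symmetric = all(migration_matrix[i][j] == migration_matrix[j][i]
                for i in range(n) for j in range(n))
print(symmetric)
```

When you later change the rates for the exercise, rerunning this check helps catch an entry placed in the wrong row or column.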
513 | 
 514 | * Finally, we set split times, which, in the coalescent world view, are joins going backwards in time. In `msprime` these are called mass migrations. So the split between the first two populations is modelled as a mass migration of all individuals from the second population into the first (populations 1 and 0, because Python numbers from 0). Further back in time, populations 2 and 3 also mass migrate into population 0. After the first event (backwards in time) we also turn off migration between populations 1 and 2 (because population 1 technically no longer exists).
515 | 
516 | 
517 | ```python
518 | t_01 = 1000
519 | t_02 = 5000
520 | t_03 = 10000
521 | 
522 | demographic_events = [msprime.MassMigration(time=t_01, source=1, destination=0, proportion=1.0), # first merge
523 |                       msprime.MigrationRateChange(time=t_01, rate=0, matrix_index=(2, 1)), # mig stop after merge
524 |                       msprime.MigrationRateChange(time=t_01, rate=0, matrix_index=(1, 2)),
525 |                       msprime.MassMigration(time=t_02, source=2, destination=0, proportion=1.0), #next merge
526 |                       msprime.MassMigration(time=t_03, source=3, destination=0, proportion=1.0)] #final merge
527 | ```
528 | 
529 | * Now we are ready to simulate the tree sequence. We set the length to 50 kb and the recombination rate to 5e-8. `msprime` is extremely fast, so don't blink or you will miss it!
530 | 
531 | ```python
532 | ts = msprime.simulate(population_configurations = population_configurations,
533 |                       migration_matrix = migration_matrix,
534 |                       demographic_events = demographic_events,
535 |                       length = 50000,
536 |                       recombination_rate = 5e-8
537 |                       )
538 | ```
539 | 
540 | * Again we can check the number of trees
541 | 
542 | ```python
543 | ts.num_trees
544 | ```
545 | * and, if we dare, we can look at the first tree in the tree sequence:
546 | 
547 | ```python
548 | print(ts.first().draw(format="unicode"))
549 | ```
550 | 
551 | #### Computing weights from the Tree Sequence
552 | 
 553 | * Now we run a function from `twisst` that computes weightings from a tree sequence. We don't need to specify the groups, as this information has been included in the tree sequence object by `msprime`. But we do still need to tell it to use the final population as the outgroup (number 3, because Python counts from 0).
554 | 
555 | ```python
556 | weightsData = twisst.weightTrees(ts, treeFormat="ts", outgroup = "3", verbose=False)
557 | ```
558 | 
 559 | * We can get a quick summary of the average weights.
560 | 
561 | ```python
562 | twisst.summary(weightsData)
563 | ```
564 | 
565 | * We can also quickly save a plot of the weights directly (which saves us the time of exporting to a file and plotting a fancy plot in R)
566 | 
567 | ```python
568 | #extract mid positions on chromosome from tree sequence file
569 | position = [(tree.interval[0] + tree.interval[1])/2 for tree in ts.trees()]
570 | 
571 | #normalise weights by dividing by number of combinations
572 | weights = weightsData["weights"]/10000
573 | 
574 | #create a plot with all three topology weights
575 | for i in range(3): 
576 |     plt.plot(position, weights[:,i], label='topo'+str(i+1))
577 | 
578 | plt.legend()
579 | 
580 | #save plot
581 | plt.savefig('sim_ts_weights.pdf')
582 | ```
583 | 
 584 | **Exercise**: what happens to the weights when we:
 585 | * Increase or decrease population size?
 586 | * Make the population split times more or less recent?
 587 | * Increase or decrease migration rates?
588 | 
589 | * Once you've made your predictions, test them out!
590 | 


--------------------------------------------------------------------------------