├── Effective_population_size
│   ├── 0_get_data.md
│   ├── 1_clean_data.md
│   ├── 2_do_PCA.md
│   ├── 3_plot_PCA.ipynb
│   ├── 4_calculate_LD.md
│   ├── 5_estimate_Ne.ipynb
│   ├── 6_plot_Ne_Nc.ipynb
│   ├── README.md
│   ├── data
│   │   ├── Nc_estimates.txt
│   │   ├── pink_salmon.bed
│   │   ├── pink_salmon.bim
│   │   ├── pink_salmon.fam
│   │   └── tmp
│   ├── do_everything
│   ├── images
│   │   ├── sampling_locations.png
│   │   ├── sfs1dx.png
│   │   └── tmp
│   ├── plots
│   │   └── tmp
│   ├── scripts
│   │   ├── 0_get_data.sh
│   │   ├── 1_clean_data.sh
│   │   ├── 2_do_PCA.sh
│   │   ├── 3_plot_PCA.r
│   │   ├── 4_calculate_LD.sh
│   │   ├── 5_estimate_Ne.r
│   │   ├── 6_plot_Ne_Nc.r
│   │   ├── R_functions.r
│   │   └── tmp
│   ├── tmp
│   └── work
│       └── tmp
├── Extra_exercise
│   ├── Extra exercises.pdf
│   └── readme.MD
├── GWAS
│   ├── ExerciseGWAS_2020.md
│   └── ExerciseGWAS_2022.md
├── HWexercise
│   ├── .gitkeep
│   ├── HW.md
│   ├── HW1PA.png
│   ├── HW2Sn.png
│   ├── HW3HV.png
│   ├── HW3HV2.png
│   ├── HW4dF.png
│   ├── HW5TP.png
│   ├── HW6QQ.png
│   └── HWanswers.pdf
├── Linkage_disequilibrium_gorillas
│   ├── ExercisesLD20.md
│   ├── LD_ex_0_gorilla.png
│   ├── LD_ex_1_dist_map.png
│   ├── LD_ex_2_link_map.png
│   ├── LD_ex_3_LD_decay.png
│   ├── LDinGorillas.md
│   ├── LDinGorillas_A_24.md
│   └── zzz
├── NaturalSelection
│   ├── ExerciseSelection_2020.md
│   ├── browser.png
│   ├── fig1.png
│   ├── fig2.png
│   ├── fig3.png
│   ├── fig4.png
│   ├── fig5.png
│   ├── sel1sickle.png
│   ├── sel2underdominance.png
│   ├── sel3drosophila.png
│   ├── sel4condor.png
│   ├── selectionAA.md
│   ├── shinySelectionSim.R
│   └── z
├── NucleotideDiversiteyExercise
│   ├── Exercise in estimating nucleotide diversity.md
│   ├── Exercise in estimating nucleotide diversityWoA.md
│   ├── Selection_020.png
│   ├── chimp_distribution.png
│   ├── ex1nucdivfigure1.png
│   └── slidesGeneticDiversityExercise .pdf
├── Population_history
│   ├── population_history_lab_2020_noAnswers.md
│   └── z
├── Population_structure
│   ├── ExerciseFst2022.md
│   ├── ExerciseFst2023.md
│   ├── ExerciseFst2024WA.md
│   ├── ExerciseStructure.md
│   ├── ExerciseStructure_2020.md
│   ├── ExerciseStructure_2022.md
│   ├── ExerciseStructure_2023.md
│   ├── ExerciseStructure_2024.md
│   ├── chimp_distribution.png
│   ├── exer_struct_1_PCA.png
│   ├── exer_struct_2_admixture.png
│   ├── manhattanFst.png
│   └── z
├── QuantitativeGenetics
│   ├── Exercises_Quantitative_genetics2020.md
│   ├── FforInd.png
│   ├── Fgenome.tar.gz
│   ├── IBDexercises.md
│   ├── QG1_5length.png
│   ├── QG1assortative.png
│   ├── QG2Children.png
│   ├── QG3Ne.png
│   ├── QG4heritability_height.png
│   ├── QG5heritability_children.png
│   ├── QG6monogamy.png
│   ├── Rgenome.tar.gz
│   ├── fig1.png
│   ├── fig2.png
│   ├── im.png
│   └── z
├── README.md
└── phylogenetics
    ├── ExercisesPhylo_2021.md
    ├── Introductory_chapter_Yang.pdf
    ├── Kapli_et_al_2020.pdf
    ├── Yang_and_Ranalla_2012.pdf
    ├── marsupials.fasta
    ├── marsupials.png
    └── marsupials.tree
/Effective_population_size/0_get_data.md:
--------------------------------------------------------------------------------
1 |
2 | ## These four commands download the pink salmon genotype data files (in Plink format: bed/bim/fam) and the census population size estimates.
3 |
4 | ```bash
5 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bed --directory-prefix ./data
6 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bim --directory-prefix ./data
7 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.fam --directory-prefix ./data
8 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/Nc_estimates.txt --directory-prefix ./data
9 | ```
10 |
11 |
--------------------------------------------------------------------------------
/Effective_population_size/1_clean_data.md:
--------------------------------------------------------------------------------
1 | # ./scripts/1_clean_data.sh
2 | All commands to be run in the terminal
3 |
4 |
5 |
6 | ### Remove loci not placed on a chromosome.
7 | Create a new data set (in ./work/) that includes only loci placed on one of the 26 chromosomes in pink salmon. Call it "pink_salmon.clean". The --not-chr 0 flag drops loci with chromosome code 0, i.e. loci not placed on a chromosome. The --make-bed flag tells Plink to write a new binary fileset: pink_salmon.clean.bed (binary file with genotype data), pink_salmon.clean.bim (text file with locus information) and pink_salmon.clean.fam (text file with sample information).
8 |
9 | When using Plink to analyze data from non-human species, it is important to tell Plink how many autosomes the species has (here with --autosome-num 26). Otherwise Plink applies its default human configuration and interprets chromosome "23" as the X chromosome and chromosome "24" as the Y chromosome.
10 |
11 |
12 | ```bash
13 | plink --bfile ./data/pink_salmon --autosome-num 26 --not-chr 0 --make-bed --out ./work/pink_salmon.clean
14 | ```
15 |
16 | ### Create a separate set of genotype data files for each population.
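17 | For each of the six populations of pink salmon, select just the individuals in the population (--family treats each family ID as a cluster, and --keep-cluster-names keeps the named one) and then filter loci for HWE (--hwe .001 removes loci with an HWE test p-value below 0.001), genotyping rate per locus (--geno 0.02 removes loci with more than 2% missing genotypes), and minor allele frequency (--maf 0.05 removes loci with a minor allele frequency below 0.05).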
18 |
19 | ```bash
20 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_ODD
21 |
22 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_EVEN
23 |
24 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_ODD
25 |
26 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_EVEN
27 |
28 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_ODD
29 |
30 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_EVEN
31 | ```
32 |
33 |
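The six per-population commands in this file differ only in the population name, so they can be generated in a loop. The following is a minimal sketch, not part of the original exercise: the `filter_cmd` helper and the `DRY_RUN` switch are hypothetical names introduced here, and the flags are copied from `scripts/1_clean_data.sh`. By default it only prints the commands, so it can be inspected without Plink installed.

```bash
# Sketch: build the per-population filtering command in a loop.
# DRY_RUN and filter_cmd are illustrative names, not part of the repository.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the plink command line for one population (family ID) on a single line.
filter_cmd() {
    pop="$1"
    printf 'plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names %s --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/%s\n' "$pop" "$pop"
}

for site in Koppen Nome Puget; do
    for lineage in ODD EVEN; do
        cmd=$(filter_cmd "${site}_${lineage}")
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            sh -c "$cmd"
        fi
    done
done
```

Keeping the flags in one place makes it harder for a single population to end up filtered with different thresholds than the others.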
--------------------------------------------------------------------------------
/Effective_population_size/2_do_PCA.md:
--------------------------------------------------------------------------------
1 | # ./scripts/2_do_PCA.sh
2 | All commands to be run in the terminal.
3 |
4 | For each of the two basic data files (before and after removing loci not placed on a chromosome), use Plink to perform a principal components analysis.
5 |
6 | ```bash
7 |
8 | plink --bfile ./data/pink_salmon --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_data.initial
9 |
10 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_salmon.clean
11 |
12 | ```
13 |
--------------------------------------------------------------------------------
/Effective_population_size/4_calculate_LD.md:
--------------------------------------------------------------------------------
1 | # ./scripts/4_calculate_LD.sh
2 | All commands to be run in the terminal.
3 |
4 | See Plink documentation [here](https://www.cog-genomics.org/plink2/).
5 |
6 | For each of the six populations, use Plink to calculate LD between each pair of loci. The "--r2 square" flag tells Plink to produce an L x L matrix of pairwise r2 values, where L is the number of loci.
7 |
8 |
9 |
10 | ```bash
11 | plink --bfile ./work/Koppen_ODD --autosome-num 26 --r2 square --out ./work/Koppen_ODD
12 |
13 | plink --bfile ./work/Koppen_EVEN --autosome-num 26 --r2 square --out ./work/Koppen_EVEN
14 |
15 | plink --bfile ./work/Nome_ODD --autosome-num 26 --r2 square --out ./work/Nome_ODD
16 |
17 | plink --bfile ./work/Nome_EVEN --autosome-num 26 --r2 square --out ./work/Nome_EVEN
18 |
19 | plink --bfile ./work/Puget_ODD --autosome-num 26 --r2 square --out ./work/Puget_ODD
20 |
21 | plink --bfile ./work/Puget_EVEN --autosome-num 26 --r2 square --out ./work/Puget_EVEN
22 |
23 | ```
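Since the downstream R code indexes the r2 matrix by locus, it can be worth checking that each `.ld` file really is L x L before using it. The sketch below is an illustration only: `is_square_matrix` is a hypothetical helper, and it assumes the matrix is tab-separated, which is how `read_r2_matrix` in `scripts/R_functions.r` reads it.

```bash
# Sketch: check that a tab-separated matrix file has as many rows as columns.
# is_square_matrix is an illustrative helper, not part of the repository.
is_square_matrix() {
    awk -F '\t' '
        NR == 1    { ncol = NF }   # column count of the first row
        NF != ncol { bad = 1 }     # a ragged row means the file is not a matrix
        END        { exit !(NR > 0 && NR == ncol && !bad) }
    ' "$1"
}

# Report on every .ld file in ./work, if any exist.
for f in ./work/*.ld; do
    [ -e "$f" ] || continue
    if is_square_matrix "$f"; then
        echo "$f: square"
    else
        echo "$f: NOT square"
    fi
done
```

The number of rows reported by `wc -l` on a `.ld` file should also match the number of lines in the corresponding `.bim` file.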
--------------------------------------------------------------------------------
/Effective_population_size/5_estimate_Ne.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Start R code here"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "# run the scripts below, used to define functions shared across multiple scripts\n",
19 | "source('./scripts/R_functions.r')"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [],
29 | "source": [
30 | "# Make an empty data frame to store the results\n",
31 | "DF <- data.frame(Pop=rep(NA, 6), Site=rep(NA, 6), Lineage=rep(NA, 6), Ne_est=rep(NA, 6))"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 3,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [
41 | {
42 | "data": {
43 | "text/html": [
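44 | "<table>\n",
45 | "<thead><tr><th scope=col>Pop</th><th scope=col>Site</th><th scope=col>Lineage</th><th scope=col>Ne_est</th></tr></thead>\n",
46 | "<tbody>\n",
47 | "\t<tr><td>Nome_ODD</td><td>Nome</td><td>ODD</td><td>2014.42557567929</td></tr>\n",
48 | "\t<tr><td>Nome_EVEN</td><td>Nome</td><td>EVEN</td><td>3691.29596294112</td></tr>\n",
49 | "\t<tr><td>Koppen_ODD</td><td>Koppen</td><td>ODD</td><td>2079.5810007219</td></tr>\n",
50 | "\t<tr><td>Koppen_EVEN</td><td>Koppen</td><td>EVEN</td><td>9087.46320481003</td></tr>\n",
51 | "\t<tr><td>Puget_ODD</td><td>Puget</td><td>ODD</td><td>2934.27555677135</td></tr>\n",
52 | "\t<tr><td>Puget_EVEN</td><td>Puget</td><td>EVEN</td><td>1805.56226291709</td></tr>\n",
53 | "</tbody>\n",
54 | "</table>\n"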
55 | ],
56 | "text/latex": [
57 | "\\begin{tabular}{r|llll}\n",
58 | " Pop & Site & Lineage & Ne\\_est\\\\\n",
59 | "\\hline\n",
60 | "\t Nome\\_ODD & Nome & ODD & 2014.42557567929\\\\\n",
61 | "\t Nome\\_EVEN & Nome & EVEN & 3691.29596294112\\\\\n",
62 | "\t Koppen\\_ODD & Koppen & ODD & 2079.5810007219 \\\\\n",
63 | "\t Koppen\\_EVEN & Koppen & EVEN & 9087.46320481003\\\\\n",
64 | "\t Puget\\_ODD & Puget & ODD & 2934.27555677135\\\\\n",
65 | "\t Puget\\_EVEN & Puget & EVEN & 1805.56226291709\\\\\n",
66 | "\\end{tabular}\n"
67 | ],
68 | "text/markdown": [
69 | "\n",
70 | "Pop | Site | Lineage | Ne_est | \n",
71 | "|---|---|---|---|---|---|\n",
72 | "| Nome_ODD | Nome | ODD | 2014.42557567929 | \n",
73 | "| Nome_EVEN | Nome | EVEN | 3691.29596294112 | \n",
74 | "| Koppen_ODD | Koppen | ODD | 2079.5810007219 | \n",
75 | "| Koppen_EVEN | Koppen | EVEN | 9087.46320481003 | \n",
76 | "| Puget_ODD | Puget | ODD | 2934.27555677135 | \n",
77 | "| Puget_EVEN | Puget | EVEN | 1805.56226291709 | \n",
78 | "\n",
79 | "\n"
80 | ],
81 | "text/plain": [
82 | " Pop Site Lineage Ne_est \n",
83 | "1 Nome_ODD Nome ODD 2014.42557567929\n",
84 | "2 Nome_EVEN Nome EVEN 3691.29596294112\n",
85 | "3 Koppen_ODD Koppen ODD 2079.5810007219 \n",
86 | "4 Koppen_EVEN Koppen EVEN 9087.46320481003\n",
87 | "5 Puget_ODD Puget ODD 2934.27555677135\n",
88 | "6 Puget_EVEN Puget EVEN 1805.56226291709"
89 | ]
90 | },
91 | "metadata": {},
92 | "output_type": "display_data"
93 | }
94 | ],
95 | "source": [
96 | "# Estimate Ne and store the results in a dataframe\n",
97 | "Pops = c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN')\n",
98 | "\n",
99 | "for (index in 1:6){\n",
100 | " POP = Pops[index]\n",
101 | " site = strsplit(POP, split = '_')[[1]][1]\n",
102 | " lin = strsplit(POP, split = '_')[[1]][2]\n",
103 | " res = get_Ne(base_path = paste(\"./work/\", POP, sep = ''))\n",
104 | " DF[index, ] = c(POP, site, lin, res$Ne_est)\n",
105 | "}\n",
106 | "\n",
107 | "# take a look at the results \n",
108 | "DF"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 4,
114 | "metadata": {
115 | "collapsed": false
116 | },
117 | "outputs": [],
118 | "source": [
119 | "write.table(DF, \"./work/Ne_estimates.txt\", sep = '\\t', quote = FALSE, row.names = FALSE)"
120 | ]
121 | }
122 | ],
123 | "metadata": {
124 | "kernelspec": {
125 | "display_name": "R",
126 | "language": "R",
127 | "name": "ir"
128 | },
129 | "language_info": {
130 | "codemirror_mode": "r",
131 | "file_extension": ".r",
132 | "mimetype": "text/x-r-source",
133 | "name": "R",
134 | "pygments_lexer": "r",
135 | "version": "3.3.2"
136 | }
137 | },
138 | "nbformat": 4,
139 | "nbformat_minor": 2
140 | }
141 |
--------------------------------------------------------------------------------
/Effective_population_size/data/Nc_estimates.txt:
--------------------------------------------------------------------------------
1 | Pop Site Lineage Nc_est
2 | Nome_ODD Nome ODD 300000
3 | Nome_EVEN Nome EVEN 10000
4 | Koppen_ODD Koppen ODD 200000
5 | Koppen_EVEN Koppen EVEN 200000
6 | Puget_ODD Puget ODD 1400000
7 | Puget_EVEN Puget EVEN 4000
8 |
9 |
--------------------------------------------------------------------------------
/Effective_population_size/data/pink_salmon.bed:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Effective_population_size/data/pink_salmon.bed
--------------------------------------------------------------------------------
/Effective_population_size/data/pink_salmon.fam:
--------------------------------------------------------------------------------
1 | Koppen_ODD PKOPE91T_0001 0 0 0 -9
2 | Koppen_ODD PKOPE91T_0002 0 0 0 -9
3 | Koppen_ODD PKOPE91T_0003 0 0 0 -9
4 | Koppen_ODD PKOPE91T_0005 0 0 0 -9
5 | Koppen_ODD PKOPE91T_0006 0 0 0 -9
6 | Koppen_ODD PKOPE91T_0007 0 0 0 -9
7 | Koppen_ODD PKOPE91T_0009 0 0 0 -9
8 | Koppen_ODD PKOPE91T_0010 0 0 0 -9
9 | Koppen_ODD PKOPE91T_0011 0 0 0 -9
10 | Koppen_ODD PKOPE91T_0013 0 0 0 -9
11 | Koppen_ODD PKOPE91T_0014 0 0 0 -9
12 | Koppen_ODD PKOPE91T_0015 0 0 0 -9
13 | Koppen_ODD PKOPE91T_0016 0 0 0 -9
14 | Koppen_ODD PKOPE91T_0017 0 0 0 -9
15 | Koppen_ODD PKOPE91T_0018 0 0 0 -9
16 | Koppen_ODD PKOPE91T_0019 0 0 0 -9
17 | Koppen_ODD PKOPE91T_0020 0 0 0 -9
18 | Koppen_ODD PKOPE91T_0024 0 0 0 -9
19 | Koppen_ODD PKOPE91T_0025 0 0 0 -9
20 | Koppen_ODD PKOPE91T_0026 0 0 0 -9
21 | Koppen_ODD PKOPE91T_0029 0 0 0 -9
22 | Koppen_ODD PKOPE91T_0035 0 0 0 -9
23 | Koppen_ODD PKOPE91T_0036 0 0 0 -9
24 | Koppen_ODD PKOPE91T_0037 0 0 0 -9
25 | Koppen_EVEN PKOPE96T_0001 0 0 0 -9
26 | Koppen_EVEN PKOPE96T_0002 0 0 0 -9
27 | Koppen_EVEN PKOPE96T_0003 0 0 0 -9
28 | Koppen_EVEN PKOPE96T_0004 0 0 0 -9
29 | Koppen_EVEN PKOPE96T_0005 0 0 0 -9
30 | Koppen_EVEN PKOPE96T_0006 0 0 0 -9
31 | Koppen_EVEN PKOPE96T_0007 0 0 0 -9
32 | Koppen_EVEN PKOPE96T_0008 0 0 0 -9
33 | Koppen_EVEN PKOPE96T_0009 0 0 0 -9
34 | Koppen_EVEN PKOPE96T_0010 0 0 0 -9
35 | Koppen_EVEN PKOPE96T_0011 0 0 0 -9
36 | Koppen_EVEN PKOPE96T_0012 0 0 0 -9
37 | Koppen_EVEN PKOPE96T_0013 0 0 0 -9
38 | Koppen_EVEN PKOPE96T_0014 0 0 0 -9
39 | Koppen_EVEN PKOPE96T_0015 0 0 0 -9
40 | Koppen_EVEN PKOPE96T_0016 0 0 0 -9
41 | Koppen_EVEN PKOPE96T_0017 0 0 0 -9
42 | Koppen_EVEN PKOPE96T_0018 0 0 0 -9
43 | Koppen_EVEN PKOPE96T_0019 0 0 0 -9
44 | Koppen_EVEN PKOPE96T_0020 0 0 0 -9
45 | Koppen_EVEN PKOPE96T_0021 0 0 0 -9
46 | Koppen_EVEN PKOPE96T_0022 0 0 0 -9
47 | Koppen_EVEN PKOPE96T_0023 0 0 0 -9
48 | Koppen_EVEN PKOPE96T_0024 0 0 0 -9
49 | Nome_ODD PNOME91_0001 0 0 0 -9
50 | Nome_ODD PNOME91_0002 0 0 0 -9
51 | Nome_ODD PNOME91_0003 0 0 0 -9
52 | Nome_ODD PNOME91_0004 0 0 0 -9
53 | Nome_ODD PNOME91_0005 0 0 0 -9
54 | Nome_ODD PNOME91_0008 0 0 0 -9
55 | Nome_ODD PNOME91_0009 0 0 0 -9
56 | Nome_ODD PNOME91_0010 0 0 0 -9
57 | Nome_ODD PNOME91_0011 0 0 0 -9
58 | Nome_ODD PNOME91_0013 0 0 0 -9
59 | Nome_ODD PNOME91_0014 0 0 0 -9
60 | Nome_ODD PNOME91_0015 0 0 0 -9
61 | Nome_ODD PNOME91_0016 0 0 0 -9
62 | Nome_ODD PNOME91_0017 0 0 0 -9
63 | Nome_ODD PNOME91_0018 0 0 0 -9
64 | Nome_ODD PNOME91_0021 0 0 0 -9
65 | Nome_ODD PNOME91_0025 0 0 0 -9
66 | Nome_ODD PNOME91_0026 0 0 0 -9
67 | Nome_ODD PNOME91_0027 0 0 0 -9
68 | Nome_ODD PNOME91_0028 0 0 0 -9
69 | Nome_ODD PNOME91_0030 0 0 0 -9
70 | Nome_ODD PNOME91_0031 0 0 0 -9
71 | Nome_ODD PNOME91_0032 0 0 0 -9
72 | Nome_ODD PNOME91_0033 0 0 0 -9
73 | Nome_EVEN PNOME94_0001 0 0 0 -9
74 | Nome_EVEN PNOME94_0002 0 0 0 -9
75 | Nome_EVEN PNOME94_0003 0 0 0 -9
76 | Nome_EVEN PNOME94_0004 0 0 0 -9
77 | Nome_EVEN PNOME94_0005 0 0 0 -9
78 | Nome_EVEN PNOME94_0006 0 0 0 -9
79 | Nome_EVEN PNOME94_0007 0 0 0 -9
80 | Nome_EVEN PNOME94_0008 0 0 0 -9
81 | Nome_EVEN PNOME94_0009 0 0 0 -9
82 | Nome_EVEN PNOME94_0010 0 0 0 -9
83 | Nome_EVEN PNOME94_0011 0 0 0 -9
84 | Nome_EVEN PNOME94_0012 0 0 0 -9
85 | Nome_EVEN PNOME94_0013 0 0 0 -9
86 | Nome_EVEN PNOME94_0014 0 0 0 -9
87 | Nome_EVEN PNOME94_0015 0 0 0 -9
88 | Nome_EVEN PNOME94_0016 0 0 0 -9
89 | Nome_EVEN PNOME94_0017 0 0 0 -9
90 | Nome_EVEN PNOME94_0018 0 0 0 -9
91 | Nome_EVEN PNOME94_0019 0 0 0 -9
92 | Nome_EVEN PNOME94_0020 0 0 0 -9
93 | Puget_ODD PSNOH03_0005 0 0 0 -9
94 | Puget_ODD PSNOH03_0016 0 0 0 -9
95 | Puget_ODD PSNOH03_0017 0 0 0 -9
96 | Puget_ODD PSNOH03_0019 0 0 0 -9
97 | Puget_ODD PSNOH03_0021 0 0 0 -9
98 | Puget_ODD PSNOH03_0024 0 0 0 -9
99 | Puget_ODD PSNOH03_0027 0 0 0 -9
100 | Puget_ODD PSNOH03_0029 0 0 0 -9
101 | Puget_ODD PSNOH03_0039 0 0 0 -9
102 | Puget_ODD PSNOH03_0043 0 0 0 -9
103 | Puget_ODD PSNOH03_0046 0 0 0 -9
104 | Puget_ODD PSNOH03_0063 0 0 0 -9
105 | Puget_ODD PSNOH03_0065 0 0 0 -9
106 | Puget_ODD PSNOH03_0067 0 0 0 -9
107 | Puget_ODD PSNOH03_0074 0 0 0 -9
108 | Puget_ODD PSNOH03_0075 0 0 0 -9
109 | Puget_ODD PSNOH03_0076 0 0 0 -9
110 | Puget_ODD PSNOH03_0078 0 0 0 -9
111 | Puget_ODD PSNOH03_0079 0 0 0 -9
112 | Puget_ODD PSNOH03_0081 0 0 0 -9
113 | Puget_ODD PSNOH03_0082 0 0 0 -9
114 | Puget_ODD PSNOH03_0088 0 0 0 -9
115 | Puget_ODD PSNOH03_0090 0 0 0 -9
116 | Puget_ODD PSNOH03_0095 0 0 0 -9
117 | Puget_EVEN PSNOH96_0003 0 0 0 -9
118 | Puget_EVEN PSNOH96_0004 0 0 0 -9
119 | Puget_EVEN PSNOH96_0005 0 0 0 -9
120 | Puget_EVEN PSNOH96_0006 0 0 0 -9
121 | Puget_EVEN PSNOH96_0007 0 0 0 -9
122 | Puget_EVEN PSNOH96_0008 0 0 0 -9
123 | Puget_EVEN PSNOH96_0009 0 0 0 -9
124 | Puget_EVEN PSNOH96_0010 0 0 0 -9
125 | Puget_EVEN PSNOH96_0011 0 0 0 -9
126 | Puget_EVEN PSNOH96_0012 0 0 0 -9
127 | Puget_EVEN PSNOH96_0013 0 0 0 -9
128 | Puget_EVEN PSNOH96_0014 0 0 0 -9
129 | Puget_EVEN PSNOH96_0015 0 0 0 -9
130 | Puget_EVEN PSNOH96_0016 0 0 0 -9
131 | Puget_EVEN PSNOH96_0017 0 0 0 -9
132 | Puget_EVEN PSNOH96_0018 0 0 0 -9
133 | Puget_EVEN PSNOH96_0019 0 0 0 -9
134 | Puget_EVEN PSNOH96_0021 0 0 0 -9
135 | Puget_EVEN PSNOH96_0022 0 0 0 -9
136 | Puget_EVEN PSNOH96_0023 0 0 0 -9
137 | Puget_EVEN PSNOH96_0024 0 0 0 -9
138 | Puget_EVEN PSNOH96_0025 0 0 0 -9
139 | Puget_EVEN PSNOH96_0027 0 0 0 -9
140 | Puget_EVEN PSNOH96_0028 0 0 0 -9
141 |
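The first column of the `.fam` file is the family ID, which this data set uses as the population label, and the number of rows per population is the sample size S that enters the Ne estimate. A minimal sketch for tabulating it from the command line follows; `count_per_pop` is a hypothetical helper name introduced here.

```bash
# Sketch: count samples per family ID (first column of a .fam file).
# count_per_pop is an illustrative helper, not part of the repository.
count_per_pop() {
    awk '{ n[$1]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}

# Only run on the real file if it has been downloaded.
if [ -f ./data/pink_salmon.fam ]; then
    count_per_pop ./data/pink_salmon.fam
fi
```

For the `.fam` listing above, this gives 24 individuals for every population except Nome_EVEN, which has 20.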
--------------------------------------------------------------------------------
/Effective_population_size/data/tmp:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Effective_population_size/do_everything:
--------------------------------------------------------------------------------
1 | #mkdir popgen2018-pink_salmon
2 | #cd popgen2018-pink_salmon
3 | #wget https://api.github.com/repos/rwaples/popgen2018-pink_salmon/tarball/master -O - | tar xz --strip=1
4 |
5 | #bash ./scripts/0_get_data.sh
6 | bash ./scripts/1_clean_data.sh
7 | bash ./scripts/2_do_PCA.sh
8 | Rscript ./scripts/3_plot_PCA.r
9 | bash ./scripts/4_calculate_LD.sh
10 | Rscript ./scripts/5_estimate_Ne.r
11 | Rscript ./scripts/6_plot_Ne_Nc.r
12 |
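The driver above assumes that `plink` and `Rscript` are available on the PATH. A small pre-flight check can make a missing tool fail early with a clear message. This is a sketch under that assumption; `missing_tools` is a hypothetical helper and is not part of `do_everything`.

```bash
# Sketch: report required commands that are not on PATH.
# missing_tools is an illustrative helper, not part of the repository.
# Prints one missing command per line and returns 1 if any is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

if missing=$(missing_tools plink Rscript); then
    echo "plink and Rscript found"
else
    # $missing is left unquoted on purpose so the names print on one line
    echo "missing tools:" $missing >&2
fi
```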
--------------------------------------------------------------------------------
/Effective_population_size/images/sampling_locations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Effective_population_size/images/sampling_locations.png
--------------------------------------------------------------------------------
/Effective_population_size/images/sfs1dx.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Effective_population_size/images/sfs1dx.png
--------------------------------------------------------------------------------
/Effective_population_size/images/tmp:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Effective_population_size/plots/tmp:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/0_get_data.sh:
--------------------------------------------------------------------------------
1 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bed --directory-prefix ./data
2 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bim --directory-prefix ./data
3 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.fam --directory-prefix ./data
4 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/Nc_estimates.txt --directory-prefix ./data
5 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/1_clean_data.sh:
--------------------------------------------------------------------------------
1 | plink --bfile ./data/pink_salmon --autosome-num 26 --not-chr 0 --make-bed --out ./work/pink_salmon.clean
2 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_ODD
3 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_EVEN
4 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_ODD
5 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_EVEN
6 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_ODD
7 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_EVEN
8 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/2_do_PCA.sh:
--------------------------------------------------------------------------------
1 | plink --bfile ./data/pink_salmon --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_data.initial
2 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_salmon.clean
3 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/3_plot_PCA.r:
--------------------------------------------------------------------------------
1 |
2 | # Options for the notebook
3 | options(jupyter.plot_mimetypes = "image/png")
4 | options(repr.plot.width = 6, repr.plot.height = 6)
5 |
6 | read_pca_output <- function(path){
7 | pca_df = read.table(path)
8 | names(pca_df) = c('Population', 'Individual', 'PC1', 'PC2', 'PC3')
9 | # enforce the order of populations
10 | pca_df$Population <- factor(pca_df$Population , levels =c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN'))
11 | pca_df = pca_df[order(pca_df$Population),]
12 | return(pca_df)
13 | }
14 |
15 | plot_pca_basic <- function(pca_df, title){
16 | plot(pca_df$PC1 , pca_df$PC2, col = pca_df$Population, pch = 16, main = title)
17 | legend(x="topleft", legend = levels(pca_df$Population), fill = palette()[1:6])
18 | }
19 |
20 | # unused - example of how to make a scatterplot in ggplot2
21 | plot_pca_ggplot <- function(pca_df){
22 | library(ggplot2)
23 | p = ggplot(data = pca_df, aes(x = PC1, y = PC2))
24 | p = p + geom_point(aes(colour = Population))
25 | p = p + theme_classic()
26 | return(p)
27 | }
28 |
29 | pca_initial = read_pca_output('./work/pink_data.initial.eigenvec')
30 | head(pca_initial)
31 |
32 | pca_clean = read_pca_output('./work/pink_salmon.clean.eigenvec')
33 | head(pca_clean)
34 |
35 | # Make some plots and save them to a file
36 |
37 | png("./plots/PCA.pink_salmon.initial.png")
38 | plot_pca_basic(pca_initial, title = 'pink_salmon.initial')
39 | dev.off()
40 |
41 | png("./plots/PCA.pink_salmon.clean.png")
42 | plot_pca_basic(pca_clean, title = 'pink_salmon.clean')
43 | dev.off()
44 |
45 | print("Done plotting PCAs")
46 |
47 |
48 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/4_calculate_LD.sh:
--------------------------------------------------------------------------------
1 | plink --bfile ./work/Koppen_ODD --autosome-num 26 --r2 square --out ./work/Koppen_ODD
2 | plink --bfile ./work/Koppen_EVEN --autosome-num 26 --r2 square --out ./work/Koppen_EVEN
3 | plink --bfile ./work/Nome_ODD --autosome-num 26 --r2 square --out ./work/Nome_ODD
4 | plink --bfile ./work/Nome_EVEN --autosome-num 26 --r2 square --out ./work/Nome_EVEN
5 | plink --bfile ./work/Puget_ODD --autosome-num 26 --r2 square --out ./work/Puget_ODD
6 | plink --bfile ./work/Puget_EVEN --autosome-num 26 --r2 square --out ./work/Puget_EVEN
7 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/5_estimate_Ne.r:
--------------------------------------------------------------------------------
1 |
2 | source('./scripts/R_functions.r')
3 |
4 | # Make an empty data frame to store the results
5 | DF <- data.frame(Pop=rep(NA, 6), Site=rep(NA, 6), Lineage=rep(NA, 6), Ne_est=rep(NA, 6))
6 |
7 | # Estimate Ne for each population and store the results in the data frame
8 | Pops = c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN')
9 |
10 | for (index in 1:6){
11 | POP = Pops[index]
12 | site = strsplit(POP, split = '_')[[1]][1]
13 | lin = strsplit(POP, split = '_')[[1]][2]
14 | res = get_Ne(base_path = paste("./work/", POP, sep = ''))
15 | DF[index, ] = c(POP, site, lin, res$Ne_est)
16 | }
17 |
18 | # take a look at the results
19 | DF
20 |
21 | write.table(DF, "./work/Ne_estimates.txt", sep = '\t', quote = FALSE, row.names = FALSE)
22 |
23 |
24 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/6_plot_Ne_Nc.r:
--------------------------------------------------------------------------------
1 |
2 | # Load in the Ne and Nc estimates
3 |
4 | Ne = read.table("./work/Ne_estimates.txt",sep = '\t', header = TRUE)
5 | Nc = read.table("./data/Nc_estimates.txt",sep = '\t', header = TRUE)
6 |
7 | Ne
8 |
9 | Nc
10 |
11 | ## Use the merge command to join them
12 |
13 | estimates = merge(Ne, Nc)
14 | estimates
15 |
16 | estimates$ratio = estimates$Ne_est / estimates$Nc_est
17 | # reorder to match input order
18 | estimates$Pop <- factor(estimates$Pop , levels =c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN'))
19 | estimates = estimates[order(estimates$Pop),]
20 | estimates
21 |
22 | for_barplot = data.matrix(t(estimates[,c('Ne_est', 'Nc_est')]))
23 | colnames(for_barplot) = estimates$Pop
24 |
25 | for_ratio_barplot = data.matrix(t(estimates[,'ratio']))
26 | colnames(for_ratio_barplot) = estimates$Pop
27 |
28 |
29 | png('./plots/Ne_estimates.png')
30 | par(mar=c(10,4,4,2))
31 | barplot(for_barplot['Ne_est',], col = "white", beside = TRUE, las=2, #axes = FALSE,
32 | main = "Ne estimates for each population",ylab = 'Ne')
33 | dev.off()
34 |
35 | png('./plots/Ne_and_Nc_estimates.png')
36 | par(mar=c(10,4,4,2))
37 |
38 | barplot(for_barplot, col = c("white","black"), beside = TRUE, las=2, axes = FALSE,
39 | main = "Ne and Nc estimates for each population",ylab = 'Size')
40 | axis(side = 2, at = c(100, 10000, 500000, 1000000, 1500000))
41 | legend("top",
42 | c("Ne_est","Nc_est"),
43 | fill = c("white","black")
44 | )
45 | dev.off()
46 |
47 | # same plot with a log y axis
48 | png('./plots/Ne_and_Nc_estimates_log-scaled.png')
49 | par(mar=c(10,4,4,2))
50 | barplot(for_barplot, col = c("white","black"), beside = TRUE, las=2,
51 | log = 'y', axes = FALSE, ylim = c(100,1400000),
52 | main = "Ne and Nc estimates for each population",ylab = 'Size (log scaled)')
53 | axis(side = 2, at = c(100, 10000, 500000, 1000000, 1500000))
54 | legend("top",
55 | c("Ne_est","Nc_est"),
56 | fill = c("white","black")
57 | )
58 | dev.off()
59 |
60 | png('./plots/Ne-Nc_ratios.png')
61 | par(mar=c(10,4,4,2))
62 | barplot(for_ratio_barplot, col = "gray", beside = TRUE, las=2, #axes = FALSE,
63 | main = "Ne/Nc ratios for each population",ylab = 'Ne/Nc ratio')
64 | dev.off()
65 |
66 | source('./scripts/R_functions.r')
67 |
68 | Ne_Nome_ODD = get_Ne('./work/Nome_ODD')
69 | Ne_Nome_EVEN = get_Ne('./work/Nome_EVEN')
70 | Ne_Koppen_ODD = get_Ne('./work/Koppen_ODD')
71 | Ne_Koppen_EVEN = get_Ne('./work/Koppen_EVEN')
72 | Ne_Puget_ODD = get_Ne('./work/Puget_ODD')
73 | Ne_Puget_EVEN = get_Ne('./work/Puget_EVEN')
74 |
75 | png('./plots/LD_Nome_ODD.png', width=nrow(Ne_Nome_ODD$r2_matrix),height=nrow(Ne_Nome_ODD$r2_matrix))
76 | image(Ne_Nome_ODD$r2_matrix, axes = FALSE, col = rev(heat.colors(256)))
77 | dev.off()
78 |
79 | png('./plots/LD_Nome_EVEN.png', width=nrow(Ne_Nome_EVEN$r2_matrix),height=nrow(Ne_Nome_EVEN$r2_matrix))
80 | image(Ne_Nome_EVEN$r2_matrix, axes = FALSE, col = rev(heat.colors(256)))
81 | dev.off()
82 |
83 | png('./plots/LD_Koppen_ODD.png', width=nrow(Ne_Koppen_ODD$r2_matrix),height=nrow(Ne_Koppen_ODD$r2_matrix))
84 | image(Ne_Koppen_ODD$r2_matrix, axes = FALSE, col = rev(heat.colors(256)))
85 | dev.off()
86 |
87 | png('./plots/LD_Koppen_EVEN.png', width=nrow(Ne_Koppen_EVEN$r2_matrix),height=nrow(Ne_Koppen_EVEN$r2_matrix))
88 | image(Ne_Koppen_EVEN$r2_matrix, axes = FALSE, col = rev(heat.colors(256)))
89 | dev.off()
90 |
91 | png('./plots/LD_Puget_ODD.png', width=nrow(Ne_Puget_ODD$r2_matrix),height=nrow(Ne_Puget_ODD$r2_matrix))
92 | image(Ne_Puget_ODD$r2_matrix, axes = FALSE, col = rev(heat.colors(256)))
93 | dev.off()
94 |
95 | png('./plots/LD_Puget_EVEN.png', width=nrow(Ne_Puget_EVEN$r2_matrix),height=nrow(Ne_Puget_EVEN$r2_matrix))
96 | image(Ne_Puget_EVEN$r2_matrix, axes = FALSE, col = rev(heat.colors(256)))
97 | dev.off()
98 |
99 |
100 |
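The Ne/Nc ratio computed in this script is a join of two tables on the population name followed by a division. As a quick cross-check outside R, the same arithmetic can be sketched with awk. The helper name `ne_nc_ratio` is hypothetical, and the sketch assumes both files are tab-separated with a header row and the estimate in the fourth column, as in `Ne_estimates.txt` and `Nc_estimates.txt`.

```bash
# Sketch: join two tab-separated tables on column 1 (Pop) and print Ne/Nc.
# ne_nc_ratio is an illustrative helper, not part of the repository.
ne_nc_ratio() {
    ne_file="$1"
    nc_file="$2"
    awk -F '\t' '
        FNR == 1  { next }                 # skip the header row of each file
        NR == FNR { ne[$1] = $4; next }    # first file: remember Ne per population
        ($1 in ne) && $4 > 0 { printf "%s\t%.5f\n", $1, ne[$1] / $4 }
    ' "$ne_file" "$nc_file"
}

# Only run on the real files if the pipeline has produced them.
if [ -f ./work/Ne_estimates.txt ] && [ -f ./data/Nc_estimates.txt ]; then
    ne_nc_ratio ./work/Ne_estimates.txt ./data/Nc_estimates.txt
fi
```

With the values listed in this repository, the ratio is about 0.45 for Puget_EVEN (1805.56 / 4000) and about 0.002 for Puget_ODD (2934.28 / 1400000).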
--------------------------------------------------------------------------------
/Effective_population_size/scripts/R_functions.r:
--------------------------------------------------------------------------------
1 |
2 | ## Functions to read in data
3 |
4 | read_r2_matrix <- function(r2_path){
5 | r2_df = read.table(r2_path, sep="\t")
6 | r2_matrix <- data.matrix(r2_df)
7 | return(r2_matrix)
8 | }
9 |
10 | get_bim <- function(bim_path){
11 | bim = read.table(bim_path, sep = '\t')
12 | names(bim) = c('chr', 'SNP', 'cm', 'bp', 'A1', 'A2')
13 | return(bim)
14 | }
15 |
16 | get_fam <- function(fam_path){
17 | fam = read.table(fam_path, sep = ' ')
18 | names(fam) = c('FID', 'IID', 'fatherID', 'motherID', 'sex', 'phenotype')
19 | return(fam)
20 | }
21 |
22 | estimate_Ne <- function(mean_r2, S){
23 | adj1_r2 = mean_r2 * (S/(S-1))**2
24 | adj2_r2 = adj1_r2 - 0.0018 - 0.907/S - 4.44/(S**2)
25 | Ne_est = (0.308 + sqrt(.308**2 - 2.08*adj2_r2))/(2*adj2_r2)
26 | return(Ne_est)
27 | }
28 |
29 | get_Ne <- function(base_path){
30 | # load files
31 | r2_path = paste(base_path, '.ld', sep ='')
32 | bim_path = paste(base_path, '.bim', sep ='')
33 | fam_path = paste(base_path, '.fam', sep ='')
34 | pop_mat = read_r2_matrix(r2_path)
35 | pop_bim = get_bim(bim_path)
36 | pop_fam = get_fam(fam_path)
37 |
38 | # get the sample size of the population from the .fam file
39 | S = nrow(pop_fam)
40 |
41 | # exclude loci on the same chromosome
42 | for (CH in 1:26){
43 | my_idx = which(pop_bim$chr==CH)
44 | pop_mat[my_idx, my_idx] <- NA
45 | }
46 | # get just the upper triangle of the square matrix
47 | r2_vals = pop_mat[upper.tri(x = pop_mat, diag = FALSE)]
48 | # remove NA values
49 | r2_vals = r2_vals[!is.na(r2_vals)]
50 |
51 | mean_r2 = mean(r2_vals)
52 | # Non-bias-corrected Ne estimate
53 | Ne_basic = 1.0/(3*mean_r2 - 3.0/S)
54 |
55 | # Bias corrected for low sample size (S<30)
56 | Ne_est = estimate_Ne(mean_r2=mean_r2, S=S)
57 |
58 | #print(c(Ne_est, Ne_basic))
59 |
60 | # return the bias-corrected estimate and the
61 | # r2 matrix used in the calculation (with within-chromosome r2 values masked)
62 | return (list(Ne_est = Ne_est, r2_matrix = pop_mat))
63 | }
64 |
--------------------------------------------------------------------------------
/Effective_population_size/scripts/tmp:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Effective_population_size/tmp:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Effective_population_size/work/tmp:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Extra_exercise/Extra exercises.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Extra_exercise/Extra exercises.pdf
--------------------------------------------------------------------------------
/Extra_exercise/readme.MD:
--------------------------------------------------------------------------------
1 | # Extra exercises for most classes based on a dataset from Wildebeest (only if there is time)
2 |
3 |
4 |
5 |
6 |
7 |
--------------------------------------------------------------------------------
/GWAS/ExerciseGWAS_2020.md:
--------------------------------------------------------------------------------
1 | # Exercise in association testing and GWAS
2 |
3 | **Ida Moltke**
4 |
5 | The focus of these exercises is to try out two simple tests for association and then to perform a GWAS using one of these tests. The goal is to give you some hands-on experience with GWAS. The exercises are to some extent copy-paste based; this is on purpose, so you can learn as much as possible within the limited time we have. The idea is that the exercises will provide you with some relevant commands you can use if you later want to perform a GWAS, and at the same time they will allow you to spend more time looking at and interpreting the output than you would have had if you had to build up the commands from scratch. However, this also means that to get the most out of the exercises you should not just quickly copy-paste everything, but also along the way try to
6 |
7 | 1) make sure you understand what the command/code does (except the command lines that start with "Rscript" since these just call code in an R script which you do not have to look at unless you are curious)
8 |
9 | 2) spend some time on looking at and trying to interpret the output you get when you run the code
10 |
11 | .
12 |
13 | ## Exercise 1: Basic association testing (single SNP analysis)
14 |
15 | In this exercise we will go through two simple tests for association. We will use the statistical software R and look at (imaginary) data from 1237 cases and 4991 controls at two SNP loci with the following counts:
16 |
17 | | | | | |
18 | |:---:|:----:|:---:|:---:|
19 | | SNP 1 | AA | Aa | aa |
20 | | Cases | 423 | 568 | 246 |
21 | | Controls | 1955 | 2295 | 741 |
22 |
23 | | | | | |
24 | |:---:|:----:|:---:|:---:|
25 | | SNP 2 | AA | Aa | aa |
26 | | Cases | 1003 | 222 | 12 |
27 | | Controls | 4043 | 899 | 49 |
28 |
29 |
30 | #### Exercise 1A: genotype based test
31 |
32 | We want to test if the disease is associated with these two SNPs. One way to do this for the first SNP is by running the following code in R:
33 |
34 | ```R
35 | # NB remember to open R first
36 | # Input the count data into a matrix
37 | countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2)
38 |
39 | # Print the matrix (so you can see it)
40 | print(countsSNP1)
41 |
42 | # Perform the test on the matrix
43 | chisq.test(countsSNP1)
44 | ```
45 |
46 | * Try to run this test and note the p-value.
47 | * Does the test suggest that the SNP is associated with the disease (you can use a p-value threshold of 0.05 when answering this question)?
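
To make explicit what `chisq.test` computes here, the following is a minimal sketch that rebuilds the Pearson chi-square statistic by hand from the expected counts under independence. The variable names (`expected`, `stat`, `df`, `pval`) are only illustrative, and the sketch is not needed to solve the exercise.

```R
# Genotype counts for SNP 1 (rows: cases, controls; columns: AA, Aa, aa)
countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2)

# Expected counts if disease status and genotype were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(countsSNP1), colSums(countsSNP1)) / sum(countsSNP1)

# Pearson chi-square statistic and degrees of freedom: (rows-1)*(columns-1)
stat <- sum((countsSNP1 - expected)^2 / expected)
df <- (nrow(countsSNP1) - 1) * (ncol(countsSNP1) - 1)

# Upper-tail p-value; this should agree with chisq.test(countsSNP1)
pval <- pchisq(stat, df = df, lower.tail = FALSE)
print(c(statistic = stat, df = df, p.value = pval))
```

For a 2x3 table R applies no continuity correction, so the numbers should match the output of `chisq.test(countsSNP1)`.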
48 |
49 |
50 | Now try to plot the proportions of cases in each genotype category in this SNP using the following R code:
51 | ```R
52 | barplot(countsSNP1[1,]/(countsSNP1[1,]+countsSNP1[2,]),xlab="Genotypes",ylab="Proportion of cases",names=c("AA","Aa","aa"),las=1,ylim=c(0,0.25))
53 | ```
54 | * Does it look like the SNP is associated with the disease?
55 | * Is the p-value consistent with the plot?
56 |
57 | Repeat the same analyses (and answer the same questions) for SNP 2 for which the data can be input into R as follows:
58 |
59 | ```R
60 | countsSNP2 <- matrix(c(1003,4043,222,899,12,49),nrow=2)
61 | ```
62 |
63 | #### Exercise 1B: the allelic test
64 |
65 | Another simple way to test for association is by using the allelic test. Try to perform that on the two SNPs. This can be done as follows in R:
66 |
67 | ```R
68 | # For SNP 1:
69 |
70 | # - get the allelic counts
71 | allelecountsSNP1<-cbind(2*countsSNP1[,1]+countsSNP1[,2],countsSNP1[,2]+2*countsSNP1[,3])
72 | print(allelecountsSNP1)
73 |
74 | # - perform allelic test
75 | chisq.test(allelecountsSNP1,correct=F)
76 |
77 | # For SNP 2:
78 |
79 | # ... repeat the above with SNP 2 data
80 | ```
81 |
82 | * Do these tests lead to the same conclusions as you reached in exercise 1A?
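
As a complement to the allelic test, the sketch below shows how the allele counts arise from the genotype counts and how an allelic odds ratio could be computed from them. The row and column labels and the names `freq_a` and `odds_ratio` are added here for illustration only.

```R
# Genotype counts for SNP 1 (rows: cases, controls; columns: AA, Aa, aa)
countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2)

# Each individual carries two alleles: AA gives two A, Aa gives one A and one a, aa gives two a
allelecountsSNP1 <- cbind(A = 2*countsSNP1[,1]+countsSNP1[,2],
                          a = countsSNP1[,2]+2*countsSNP1[,3])
rownames(allelecountsSNP1) <- c("Cases","Controls")

# Frequency of allele a among cases and among controls
freq_a <- allelecountsSNP1[,"a"] / rowSums(allelecountsSNP1)

# Allelic odds ratio for allele a (cases versus controls)
odds_ratio <- (allelecountsSNP1["Cases","a"] * allelecountsSNP1["Controls","A"]) /
              (allelecountsSNP1["Cases","A"] * allelecountsSNP1["Controls","a"])

print(allelecountsSNP1)
print(freq_a)
print(odds_ratio)
```

Note that the row totals are twice the number of individuals, because the allelic test treats each allele as an observation.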
83 |
84 | .
85 |
86 | ## Exercise 2: GWAS
87 |
88 | This exercise is about Genome-Wide Association Studies (GWAS): how to perform one and what pitfalls to look out for. It will be conducted from the command line using the program PLINK (you can read more about it here: https://www.cog-genomics.org/plink2/).
89 |
90 |
91 | #### Exercise 2A: getting data and running your first GWAS
92 |
93 | First close R if you have it open.
94 |
95 | Then make a folder called GWAS for this exercise, copy the relevant data into this folder and unpack the data by typing
96 |
97 | ```bash
98 | cd ~/exercises
99 | mkdir GWAS
100 | cd GWAS
101 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/GWAS/gwasdata.tar.gz .
102 | tar -xf gwasdata.tar.gz
103 | rm gwasdata.tar.gz
104 | ```
105 |
106 | Your folder called GWAS should now contain a subfolder called data (you can check that this is true e.g. by typing ls, which is a command that gives you a list of the content in the folder you are working in). The folder "data" contains all the files you will need in this exercise (both the data and a file called plink.plot.R, which contains R code for plotting your results).
107 |
108 | Briefly, the data consist of SNP genotyping data from 356 individuals, some of whom have a certain disease (cases) while the rest do not (controls). To make sure the GWAS analyses will run fast, the main data file (gwa.bed) is in a binary format, which is not very reader friendly. However, PLINK will print summary statistics about the data (number of SNPs, number of individuals, number of cases, number of controls etc.) to the screen when you run an analysis. Also, there are two additional data files, gwa.bim and gwa.fam, which are not in binary format and which contain information about the SNPs in the data and the individuals in the data, respectively (you can read more about the data format in the PLINK documentation linked to above, but for now this is all you need to know).
109 |
110 | Let's try to perform a GWAS of our data, i.e. test each SNP for association with the disease. And let's try to do it using the simplest association test for case-control data, the allelic test, which you just performed in R in exercise 1B. You can perform this test on all the SNPs in the dataset using PLINK by typing the following command in the terminal:
111 |
112 | ```bash
113 | plink --bfile data/gwa --assoc --adjust
114 | ```
115 |
116 | Specifically "--bfile data/gwa" specifies that the data PLINK should analyse are the files in the folder called "data" with the prefix gwa. "--assoc" specifies that we want to perform GWAS using the allelic test and "--adjust" tells PLINK to output a file that includes p-values that are adjusted for multiple testing using Bonferroni correction as well as other fancier methods. Try to run the command and take a look at the text PLINK prints to your screen. Specifically, note the
117 |
118 | * number of SNPs
119 | * number of individuals
120 | * number of cases and controls
121 |
122 | NB if you for some reason have trouble reading the PLINK output on your screen then PLINK also writes it to the file plink.log and you can look at that by typing "less plink.log". If you do that, you can use arrows to navigate up and down in the file and close the file viewing by typing q.
123 |
124 | Next, try to plot the results of the GWAS by typing:
125 |
126 | ```bash
127 | Rscript data/plink.plot.R plink.assoc
128 | ```
129 |
130 | This should give you several plots. For now just look at the Manhattan plot, which can be found in the file called plink.assoc.png. You can e.g. open it using the png viewer called display, so by typing:
131 |
132 | ```bash
133 | display plink.assoc.png
134 | ```
135 |
136 | A Bonferroni-corrected p-value threshold, based on an initial p-value threshold of 0.05, is shown as a dotted line on the plot.
137 |
138 | * Explain how this threshold was reached.
139 | * Calculate the exact threshold using your knowledge of how many SNPs you have in your dataset (NB if you want to calculate log10 in R you can use the function log10)
140 | * Using this threshold, do any of the SNPs in your dataset seem to be associated with the disease?
141 | * Do your results seem plausible? Why/why not?
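
The arithmetic behind a Bonferroni threshold can be sketched in R as follows. The number of SNPs used here (`n_snps`) is a made-up placeholder and not the number in the exercise data, so replace it with the number PLINK reports.

```R
# Hypothetical number of tested SNPs (placeholder, not the exercise dataset)
n_snps <- 100000
alpha <- 0.05

# Bonferroni: divide the initial threshold by the number of tests
threshold <- alpha / n_snps

# Manhattan plots show -log10(p), so the threshold line is drawn at this height
print(threshold)
print(-log10(threshold))
```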
142 |
143 |
144 | #### Exercise 2B: checking if it went OK using QQ-plot
145 |
146 | Now look at the QQ-plot that you already generated in exercise 2A (the file called plink.assoc.QQ.png) by typing:
147 |
148 | ```bash
149 | display plink.assoc.QQ.png
150 | ```
151 |
152 | Here the red line is the x=y line.
153 |
154 | * What does this plot suggest and why?
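
If you are unsure how a QQ-plot of p-values is constructed, the sketch below builds one from simulated p-values that follow the null hypothesis (uniform between 0 and 1). The data are simulated and have nothing to do with the exercise dataset, and the plotting script used in the exercise may do things differently.

```R
set.seed(1)
# Simulated p-values under the null hypothesis (not the exercise data)
p <- runif(10000)

# Sort the observed p-values and compare them with the quantiles expected
# for uniformly distributed p-values, both on the -log10 scale
observed <- -log10(sort(p))
expected <- -log10(ppoints(length(p)))

plot(expected, observed, xlab="Expected -log10(p)", ylab="Observed -log10(p)")
abline(0, 1, col="red")   # the x=y line

# Genomic inflation factor: median observed chi-square statistic (1 df)
# divided by the median expected under the null
lambda <- median(qchisq(p, df=1, lower.tail=FALSE)) / qchisq(0.5, df=1)
print(lambda)
```

For well-behaved data the points follow the red line and `lambda` is close to 1. Points that rise above the line across most of the range indicate more small p-values than expected by chance.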
155 |
156 |
157 | #### Exercise 2C: doing initial QC of data part 1 (sex check)
158 |
159 | As you can see, a lot can go wrong if you do not check the quality of your data! So if you want meaningful/useful output you always have to run a lot of quality checking (QC) and filtering before running the association tests. One check that is worth running is a check of whether the indicated genders are correct. You can check this using PLINK to calculate the inbreeding coefficient on the X chromosome under the assumption that it is an autosomal chromosome. The reason why this is interesting is that, for technical reasons, PLINK represents haploid chromosomes, such as X for males, as homozygotes. So assuming the X is an autosomal chromosome will make the males look very inbred on the X, whereas the women won't (since they are diploid on the X chromosome). This means that the inbreeding coefficient estimates you get will be close to 1 for men and close to 0 for women. This gender check can be performed in PLINK using the following command:
160 |
161 | ```bash
162 | plink --bfile data/gwa --check-sex
163 | ```
164 |
165 | The results are in the file plink.sexcheck, in which the gender is in column PEDSEX (1 means male and 2 means female) and the inbreeding coefficient is in column F.
166 |
167 | Check the result by typing:
168 | ```bash
169 | less plink.sexcheck
170 | ```
171 |
172 | NB you can use arrows to navigate up and down in the file and close the file viewing by typing q.
173 |
174 | * Do you see any problems?
175 |
176 | * If you observe any problems then fix them by changing the gender in the file gwa.fam (5th column) in the folder data. You can use the editor gedit for this. To open the file in gedit type "gedit data/gwa.fam" in the terminal. (NB usually one would instead get rid of these individuals because a wrong gender could indicate that the phenotypes you have do not belong to the genotyped individual. However, in this case the genders were changed on purpose for the sake of this exercise, so you can safely just change them back.)
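
Instead of scrolling through the file, the comparison of reported and inferred sex could also be done in R. The sketch below uses a small made-up table with the same column names as plink.sexcheck. The cut-offs 0.8 and 0.2 are assumptions chosen for illustration, and the column `inferred` is not part of the PLINK output.

```R
# Toy table with the same columns as plink.sexcheck (made-up values, not the exercise data)
sexcheck <- data.frame(
  FID    = c("f1","f2","f3","f4"),
  IID    = c("i1","i2","i3","i4"),
  PEDSEX = c(1, 2, 1, 2),
  F      = c(0.98, 0.03, 0.05, 0.99)
)

# Sex suggested by the X-chromosome inbreeding coefficient:
# F close to 1 suggests male (1), F close to 0 suggests female (2), otherwise undetermined (0)
sexcheck$inferred <- ifelse(sexcheck$F > 0.8, 1, ifelse(sexcheck$F < 0.2, 2, 0))

# Individuals whose reported sex disagrees with the inferred sex
mismatch <- sexcheck[sexcheck$PEDSEX != sexcheck$inferred, ]
print(mismatch)
```

To try this on the real output you could replace the toy table with `sexcheck <- read.table("plink.sexcheck", header=TRUE)`.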
177 |
178 |
179 | #### Exercise 2D: doing initial QC of data part 2 (relatedness check)
180 |
181 | Another potential problem in association studies is spurious relatedness where some of the individuals in the sample are closely related. Closely related individuals can be inferred using PLINK as follows:
182 |
183 | ```bash
184 | plink --bfile data/gwa --genome
185 | ```
186 |
187 | And you can plot the results by typing:
188 |
189 | ```bash
190 | Rscript data/plink.plot.R plink.genome
191 | ```
192 |
193 | Do that and then take a look at the result by typing
194 |
195 | ```bash
196 | display plink.genomepairwise_relatedness_1.png
197 | ```
198 |
199 | The figure shows estimates of the relatedness for all pairs of individuals. For each pair, k1 is the proportion of the genome where the pair shares one of their alleles identical-by-descent (IBD) and k2 is the proportion of the genome where the pair shares both their alleles IBD. The expected (k1,k2) values for simple relationships are shown in the figure with MZ=monozygotic twins, PO=parent offspring, FS=full sibling, HS=half sibling (or avuncular or grandparent-grandchild), C1=first cousin, C2=cousin once removed.
200 |
201 | * Are any of the individuals in your dataset closely related?
202 | * What assumption in association studies is violated when individuals are related?
203 | * And last but not least: how would you recognize if the same person is included twice (this actually happens!)?
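
For reference, the expected IBD sharing for some of the relationships marked in the figure can be written down as a small table. This is a sketch with standard textbook values for a subset of relationships. The column `k0` (proportion of the genome sharing no alleles IBD) and the row for unrelated pairs are added here for completeness, and `pi_hat` is the overall proportion of alleles shared IBD.

```R
# Expected proportions of the genome where a pair shares 0, 1 and 2 alleles IBD
ibd <- data.frame(
  relationship = c("MZ","PO","FS","HS","C1","unrelated"),
  k0 = c(0, 0, 0.25, 0.5, 0.75, 1),
  k1 = c(0, 1, 0.50, 0.5, 0.25, 0),
  k2 = c(1, 0, 0.25, 0.0, 0.00, 0)
)

# Overall proportion of alleles shared IBD: half of k1 plus all of k2
ibd$pi_hat <- ibd$k1/2 + ibd$k2
print(ibd)
```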
204 |
205 |
206 | #### Exercise 2E: doing initial QC of data part 3 (check for batch bias/non-random genotyping error)
207 |
208 | Check if there is a batch effect/non random genotyping error by using missingness as a proxy (missingness and genotyping error are highly correlated in SNP chip data). In other words, for each SNP test if there is a significantly different amount of missing data in the cases and controls (or equivalently if there is an association between the disease status and the missingness). Do this with PLINK by running the following command:
209 |
210 | ```bash
211 | plink --bfile data/gwa --test-missing
212 | ```
213 |
214 | View the result in the file plink.missing, where the p-value is given in the rightmost column, or generate a histogram of all the p-values using the R script data/plink.plot.R by typing this command
215 |
216 | ```bash
217 | Rscript data/plink.plot.R plink.missing
218 | ```
219 |
220 | The resulting p-value histogram can be found in the file plink.missing.png, which you can open using the png viewer display, so by typing
221 |
222 | ```bash
223 | display plink.missing.png
224 | ```
225 |
226 | If the missingness is random the p-value should be uniformly distributed between 0 and 1.
227 |
228 | * Is this the case?
229 | * Genotyping errors are often highly correlated with missingness. How do you think this will affect your association results?
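
To get a feeling for what a per-SNP test of differential missingness looks like, here is a minimal sketch for a single SNP with made-up counts. Fisher's exact test is used here as a reasonable choice for a 2x2 table, so see the PLINK documentation for the details of what `--test-missing` does.

```R
# Made-up counts for one SNP: missing and called genotypes in cases and controls
miss <- matrix(c(30, 5, 170, 195), nrow=2,
               dimnames=list(c("Cases","Controls"), c("Missing","Called")))
print(miss)

# Test whether the proportion of missing genotypes differs between cases and controls
res <- fisher.test(miss)
print(res$p.value)
```

In this toy example 15% of the cases but only 2.5% of the controls have a missing genotype, which is the kind of pattern that produces an excess of small p-values in the histogram.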
230 |
231 | #### Exercise 2F: doing initial QC of data part 4 (check for batch bias/non-random genotyping error again)
232 |
233 | Principal component analysis (PCA) and a very similar method called multidimensional scaling are also often used to reveal problems in the data. Such analyses can be used to project all the genotype information (e.g. 500,000 marker sites) down to a low number of dimensions, e.g. two.
234 |
235 | Multidimensional scaling can be performed with PLINK as follows (for all individuals except for a few that have more than 20% missingness):
236 |
237 | ```bash
238 | plink --bfile data/gwa --cluster --mds-plot 2 --mind 0.2
239 | ```
240 |
241 | Run the above command and plot the results by typing:
242 |
243 | ```bash
244 | Rscript data/plink.plot.R plink.mds data/gwa.fam
245 | ```
246 |
247 | The resulting plot should now be in the file plink.mds.pdf, which you can open with the pdf viewer evince (so by typing "evince plink.mds.pdf"). It shows the first two dimensions and each individual is represented by a point, which is colored according to the individual's disease status.
248 |
249 | * Clustering of cases and controls is an indication of batch bias. Do you see such clustering?
250 | * What else could explain this clustering?
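
The idea of projecting genotype data down to two dimensions can be illustrated with classical multidimensional scaling in base R. The sketch below uses simulated genotypes for two groups with different allele frequencies. It is only meant to show the principle, and PLINK computes its distances between individuals differently.

```R
set.seed(1)
# Simulated genotypes (0/1/2 copies of an allele) for 20 individuals and 200 SNPs;
# the first 10 individuals are drawn with different allele frequencies than the last 10
n_snps <- 200
freq1 <- runif(n_snps, 0.1, 0.9)
freq2 <- runif(n_snps, 0.1, 0.9)
geno <- rbind(
  matrix(rbinom(10 * n_snps, 2, rep(freq1, each=10)), nrow=10),
  matrix(rbinom(10 * n_snps, 2, rep(freq2, each=10)), nrow=10)
)

# Pairwise distances between individuals, then classical MDS down to two dimensions
d <- dist(geno)
mds <- cmdscale(d, k=2)

# Colour the points by group, just as the exercise plot colours by disease status
group <- rep(c(1, 2), each=10)
plot(mds, col=group, xlab="Dimension 1", ylab="Dimension 2")
```

Any systematic difference between two groups of samples, whether it comes from the biology or from how the samples were handled, can show up as separate clusters in such a plot.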
251 |
252 |
253 | #### Exercise 2G: try to rerun GWAS after quality filtering SNPs
254 |
255 | We can remove many of the error prone SNPs and individuals by removing
256 |
257 | * SNPs that are not in HWE within controls
258 | * the rare SNPs
259 | * the individuals and SNPs with lots of missing data (why?)
260 |
261 | Let us try to rerun an association analysis where this is done:
262 |
263 | ```bash
264 | plink --bfile data/gwa --assoc --adjust --out assoc2 --hwe 0.0001 --maf 0.05 --mind 0.55 --geno 0.05
265 | ```
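
To see what quantities the `--maf` and `--hwe` filters are based on, the sketch below computes the minor allele frequency and a test for deviation from Hardy-Weinberg equilibrium (HWE) for a single SNP, using the control genotype counts of SNP 1 from exercise 1. A chi-square test is used here as a simple stand-in. PLINK uses an exact test, so its p-values will not be identical.

```R
# Genotype counts in controls for SNP 1 from exercise 1
n <- c(AA=1955, Aa=2295, aa=741)
N <- sum(n)

# Frequency of allele a and the minor allele frequency (MAF)
q <- unname((n["Aa"] + 2*n["aa"]) / (2*N))
p <- 1 - q
maf <- min(p, q)

# Expected genotype counts under Hardy-Weinberg equilibrium
expected <- N * c(AA=p^2, Aa=2*p*q, aa=q^2)

# Chi-square test for deviation from HWE (1 degree of freedom)
stat <- sum((n - expected)^2 / expected)
pval <- pchisq(stat, df=1, lower.tail=FALSE)
print(c(maf=maf, statistic=stat, p.value=pval))
```

With the filters in the command above, a SNP would be removed if its MAF is below 0.05 or if its HWE p-value is below 0.0001.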
266 |
267 | Plot the results using
268 |
269 | ```bash
270 | Rscript data/plink.plot.R assoc2.assoc
271 | ```
272 |
273 | * How does the QQ-plot look (in the file assoc2.assoc.QQ.png)?
274 | * Did the analysis go better this time?
275 | * And what does the Manhattan plot suggest (in the file assoc2.assoc.png)? Are any of the SNPs associated?
276 | * Does your answer change if you use other (smarter) methods to correct for multiple testing than Bonferroni (e.g. FDR), which PLINK provides? You can find them by typing:
277 |
278 | ```bash
279 | less assoc2.assoc.adjusted
280 | ```
281 |
282 | Note that PLINK adjusts p-values instead of the threshold (equivalent idea), so you should NOT change the threshold but stick to 0.05.
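
The difference between adjusting p-values with Bonferroni and with a false discovery rate (FDR) method can be sketched with the R function `p.adjust`. The p-values below are made up and are not taken from the exercise data.

```R
# Made-up p-values for five SNPs (not the exercise data)
p <- c(0.00001, 0.004, 0.02, 0.03, 0.5)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p, method="bonferroni")

# Benjamini-Hochberg false discovery rate
p_fdr <- p.adjust(p, method="BH")

print(data.frame(p, p_bonf, p_fdr))

# Number of SNPs below 0.05 after each correction
print(c(bonferroni=sum(p_bonf < 0.05), fdr=sum(p_fdr < 0.05)))
```

Because the adjustment is applied to the p-values, each adjusted value is compared with the original threshold of 0.05.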
283 |
284 | .
285 |
286 | ## Bonus exercise if there is any time left: another example of a GWAS caveat
287 |
288 | For the same individuals as above we also have another phenotype. This phenotype is strongly correlated with gender. The genotyping was done independently of this phenotype so there is no batch bias. To perform an association test on this phenotype, type
289 |
290 | ```bash
291 | plink --bfile data/gwa --assoc --pheno data/pheno3.txt --adjust --out pheno3
292 | Rscript data/plink.plot.R pheno3.assoc
293 | ```
294 |
295 | * View the plots and results. Are any of the SNPs significantly associated?
296 |
297 | Now try to perform the analysis using logistic regression (another association test) adjusted for sex, which is done by adding the option "--sex":
298 |
299 | ```bash
300 | plink --bfile data/gwa --logistic --out pheno3_sexAdjusted --pheno data/pheno3.txt --sex
301 | Rscript data/plink.plot.R pheno3_sexAdjusted.assoc.logistic
302 | ```
303 |
304 | * Are there any associated SNPs according to this analysis?
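
What adjusting for a covariate does in a logistic regression can be sketched with `glm` on simulated data. Everything below is simulated under stated assumptions and is not the exercise data: the phenotype depends only on sex, and the measured genotype of the simulated SNP also differs between the sexes.

```R
set.seed(1)
n <- 2000

# Simulated data: sex coded 0/1, a phenotype that depends only on sex,
# and a SNP whose measured genotype (0/1/2) also depends on sex
sex <- rbinom(n, 1, 0.5)
pheno <- rbinom(n, 1, ifelse(sex == 1, 0.7, 0.3))
geno <- rbinom(n, 2, ifelse(sex == 1, 0.5, 0.3))

# Logistic regression without and with sex as a covariate
fit_unadj <- glm(pheno ~ geno, family=binomial)
fit_adj <- glm(pheno ~ geno + sex, family=binomial)

# p-value for the SNP effect in each model
p_unadj <- summary(fit_unadj)$coefficients["geno", "Pr(>|z|)"]
p_adj <- summary(fit_adj)$coefficients["geno", "Pr(>|z|)"]
print(c(unadjusted=p_unadj, adjusted=p_adj))
```

In this simulation the SNP has no effect of its own, so any association seen in the unadjusted model comes from its correlation with sex.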
305 |
306 | Some of the probes used on the chip will hybridize with multiple loci on the genome. The associated SNPs in the previous analysis all cross-hybridize with the X chromosome.
307 |
308 | * Could cross-hybridization explain the difference in results from the two analyses?
309 |
310 |
311 |
--------------------------------------------------------------------------------
/GWAS/ExerciseGWAS_2022.md:
--------------------------------------------------------------------------------
1 | # Exercise in association testing and GWAS
2 |
3 | **Ida Moltke**
4 |
5 | The focus of these exercises is to try out two simple tests for association and then to perform a GWAS using one of these tests. The goal is to give you some hands-on experience with GWAS. The exercises are to some extent copy-paste based; this is on purpose, so you can learn as much as possible within the limited time we have. The idea is that the exercises will provide you with some relevant commands you can use if you later want to perform a GWAS, and at the same time they will allow you to spend more time looking at and interpreting the output than you would have had if you had to build up the commands from scratch. However, this also means that to get the most out of the exercises you should not just quickly copy-paste everything, but also along the way try to
6 |
7 | 1) make sure you understand what the command/code does (except the command lines that start with "Rscript" since these just call code in an R script which you do not have to look at unless you are curious)
8 |
9 | 2) spend some time on looking at and trying to interpret the output you get when you run the code
10 |
11 | .
12 |
13 | ## Exercise 1: Basic association testing (single SNP analysis)
14 |
15 | In this exercise we will go through two simple tests for association. We will use the statistical software R and look at (imaginary) data from 1237 cases and 4991 controls at two SNP loci with the following counts:
16 |
17 | | | | | |
18 | |:---:|:----:|:---:|:---:|
19 | | SNP 1 | AA | Aa | aa |
20 | | Cases | 423 | 568 | 246 |
21 | | Controls | 1955 | 2295 | 741 |
22 |
23 | | | | | |
24 | |:---:|:----:|:---:|:---:|
25 | | SNP 2 | AA | Aa | aa |
26 | | Cases | 1003 | 222 | 12 |
27 | | Controls | 4043 | 899 | 49 |
28 |
29 |
30 | #### Exercise 1A: genotype based test
31 |
32 | We want to test if the disease is associated with these two SNPs. One way to do this for the first SNP is by running the following code in R:
33 |
34 | ```R
35 | # NB remember to open R first
36 | # Input the count data into a matrix
37 | countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2)
38 |
39 | # Print the matrix (so you can see it)
40 | print(countsSNP1)
41 |
42 | # Perform the test on the matrix
43 | chisq.test(countsSNP1)
44 | ```
45 |
46 | * Try to run this test and note the p-value.
47 | * Does the test suggest that the SNP is associated with the disease (you can use a p-value threshold of 0.05 when answering this question)?
48 |
49 |
50 | Now try to plot the proportions of cases in each genotype category in this SNP using the following R code:
51 | ```R
52 | barplot(countsSNP1[1,]/(countsSNP1[1,]+countsSNP1[2,]),xlab="Genotypes",ylab="Proportion of cases",names=c("AA","Aa","aa"),las=1,ylim=c(0,0.25))
53 | ```
54 | * Does it look like the SNP is associated with the disease?
55 | * Is the p-value consistent with the plot?
56 |
57 | Repeat the same analyses (and answer the same questions) for SNP 2 for which the data can be input into R as follows:
58 |
59 | ```R
60 | countsSNP2 <- matrix(c(1003,4043,222,899,12,49),nrow=2)
61 | ```
62 |
63 | .
64 |
65 | ## Exercise 2: GWAS
66 |
67 | This exercise is about Genome-Wide Association Studies (GWAS): how to perform one and what pitfalls to look out for. It will be conducted from the command line using the program PLINK (you can read more about it here: https://www.cog-genomics.org/plink2/).
68 |
69 |
70 | #### Exercise 2A: getting data and running your first GWAS
71 |
72 | First close R if you have it open.
73 |
74 | Then make a folder called GWAS for this exercise, copy the relevant data into this folder and unpack the data by typing
75 |
76 | ```bash
77 | mkdir GWAS
78 | cd GWAS
79 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/GWAS/gwasdata.tar.gz .
80 | tar -xf gwasdata.tar.gz
81 | rm gwasdata.tar.gz
82 | ```
83 |
84 | Your folder called GWAS should now contain a subfolder called data (you can check that this is true e.g. by typing ls, which is a command that gives you a list of the content in the folder you are working in). The folder "data" contains all the files you will need in this exercise (both the data and a file called plink.plot.R, which contains R code for plotting your results).
85 |
86 | Briefly, the data consist of SNP genotyping data from 356 individuals, some of whom have a certain disease (cases) while the rest do not (controls). To make sure the GWAS analyses will run fast, the main data file (gwa.bed) is in a binary format, which is not very reader friendly. However, PLINK will print summary statistics about the data (number of SNPs, number of individuals, number of cases, number of controls etc.) to the screen when you run an analysis. Also, there are two additional data files, gwa.bim and gwa.fam, which are not in binary format and which contain information about the SNPs in the data and the individuals in the data, respectively (you can read more about the data format in the PLINK documentation linked to above, but for now this is all you need to know).
87 |
88 | Let's try to perform a GWAS of our data, i.e. test each SNP for association with the disease. And let's try to do it using a logistic regression based test assuming an additive effect. You can perform this test on all the SNPs in the dataset using PLINK by typing the following command in the terminal:
89 |
90 | ```bash
91 | plink --bfile data/gwa --logistic --adjust
92 | ```
93 |
94 | Specifically "--bfile data/gwa" specifies that the data PLINK should analyse are the files in the folder called "data" with the prefix gwa. "--logistic" specifies that we want to perform GWAS using logistic regression and "--adjust" tells PLINK to output a file that includes p-values that are adjusted for multiple testing using Bonferroni correction as well as other fancier methods. Try to run the command and take a look at the text PLINK prints to your screen. Specifically, note the
95 |
96 | * number of SNPs
97 | * number of individuals
98 | * number of cases and controls
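
To connect the logistic regression test to exercise 1, the sketch below fits an additive logistic regression to the SNP 1 counts in R. The table is first expanded to one row per individual. The variable names (`geno`, `status`, `fit`) are chosen for illustration, and the sketch shows the kind of model being fitted rather than PLINK's exact implementation.

```R
# Counts for SNP 1 from exercise 1 (rows: cases, controls; columns: AA, Aa, aa)
countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2)

# Expand the table to one entry per individual:
# genotype coded as the number of a alleles (0, 1, 2), status as 1 (case) or 0 (control)
geno <- rep(rep(0:2, 2), times=c(countsSNP1[1,], countsSNP1[2,]))
status <- rep(c(1, 0), times=rowSums(countsSNP1))

# Additive model: each extra copy of a changes the log-odds of disease by the same amount
fit <- glm(status ~ geno, family=binomial)
print(summary(fit)$coefficients)
```

The row labelled `geno` holds the estimated effect per copy of the a allele (on the log-odds scale) and its p-value.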
99 |
100 | NB if you for some reason have trouble reading the PLINK output on your screen then PLINK also writes it to the file plink.log and you can look at that by typing "less plink.log". If you do that, you can use arrows to navigate up and down in the file and close the file viewing by typing q.
101 |
102 | Next, try to plot the results of the GWAS by typing this command and pressing enter (note that this can take a little while so be patient):
103 |
104 | ```bash
105 | Rscript data/plink.plot.R plink.assoc.logistic
106 | ```
107 |
108 | This should give you several plots. For now just look at the Manhattan plot, which can be found in the file called plink.assoc.logistic.png. You can e.g. open it using the png viewer called display, so by typing:
109 |
110 | ```bash
111 | display plink.assoc.logistic.png
112 | ```
113 |
114 | A Bonferroni-corrected p-value threshold, based on an initial p-value threshold of 0.05, is shown as a dotted line on the plot.
115 |
116 | * Explain how this threshold was reached.
117 | * Calculate the exact threshold using your knowledge of how many SNPs you have in your dataset (NB if you want to calculate log10 in R you can use the function log10)
118 | * Using this threshold, do any of the SNPs in your dataset seem to be associated with the disease?
119 | * Do your results seem plausible? Why/why not?
120 |
121 |
122 | #### Exercise 2B: checking if it went OK using QQ-plot
123 |
124 | Now look at the QQ-plot that you already generated in exercise 2A (the file called plink.assoc.logistic.QQ.png) by typing:
125 |
126 | ```bash
127 | display plink.assoc.logistic.QQ.png
128 | ```
129 |
130 | Here the red line is the x=y line.
131 |
132 | * What does this plot suggest and why?
133 |
134 |
135 | #### Exercise 2C: doing initial QC of data part 1 (sex check)
136 |
137 | As you can see, a lot can go wrong if you do not check the quality of your data! So if you want meaningful/useful output you always have to run a lot of quality checking (QC) and filtering before running the association tests. One check that is worth running is a check of whether the indicated genders are correct. You can check this using PLINK to calculate the inbreeding coefficient on the X chromosome under the assumption that it is an autosomal chromosome. The reason why this is interesting is that, for technical reasons, PLINK represents haploid chromosomes, such as X for males, as homozygotes. So assuming the X is an autosomal chromosome will make the males look very inbred on the X, whereas the women won't (since they are diploid on the X chromosome). This means that the inbreeding coefficient estimates you get will be close to 1 for men and close to 0 for women. This gender check can be performed in PLINK using the following command:
138 |
139 | ```bash
140 | plink --bfile data/gwa --check-sex
141 | ```
142 |
143 | The results are in the file plink.sexcheck, in which the gender is in column PEDSEX (1 means male and 2 means female) and the inbreeding coefficient is in column F.
144 |
145 | Check the result by typing:
146 | ```bash
147 | less plink.sexcheck
148 | ```
149 |
150 | NB you can use arrows to navigate up and down in the file and close the file viewing by typing q.
151 |
152 | * Do you see any problems?
153 |
154 | * Usually one would get rid of any such individuals, because a wrong sex could indicate that the phenotypes you have do not belong to the genotyped individual. However, in this case I made the change on purpose, so you could see what such errors look like, so just ignore it for now. Or if you are adventurous you can try to change the sex of the relevant individuals in the file gwa.fam (5th column) in the folder data. You can use the editor gedit for this. To open the file in gedit type "gedit data/gwa.fam" in the terminal.
155 |
156 |
157 | #### Exercise 2D: doing initial QC of data part 2 (relatedness check)
158 |
159 | Another potential problem in association studies is spurious relatedness where some of the individuals in the sample are closely related. Closely related individuals can be inferred using PLINK as follows:
160 |
161 | ```bash
162 | plink --bfile data/gwa --genome
163 | ```
164 |
165 | And you can plot the results by typing:
166 |
167 | ```bash
168 | Rscript data/plink.plot.R plink.genome
169 | ```
170 |
171 | Do that and then take a look at the result by typing
172 |
173 | ```bash
174 | display plink.genomepairwise_relatedness_1.png
175 | ```
176 |
177 | The figure shows estimates of the relatedness for all pairs of individuals. For each pair, k1 is the proportion of the genome where the pair shares one of their alleles identical-by-descent (IBD) and k2 is the proportion of the genome where the pair shares both their alleles IBD. The expected (k1,k2) values for simple relationships are shown in the figure with MZ=monozygotic twins, PO=parent offspring, FS=full sibling, HS=half sibling (or avuncular or grandparent-grandchild), C1=first cousin, C2=cousin once removed.
178 |
179 | * Are any of the individuals in your dataset closely related?
180 | * What assumption in association studies is violated when individuals are related?
181 | * And last but not least: how would you recognize if the same person is included twice (this actually happens!)?
182 |
183 |
184 | #### Exercise 2E: doing initial QC of data part 3 (check for batch bias/non-random genotyping error)
185 |
186 | Check if there is a batch effect/non random genotyping error by using missingness as a proxy (missingness and genotyping error are highly correlated in SNP chip data). In other words, for each SNP test if there is a significantly different amount of missing data in the cases and controls (or equivalently if there is an association between the disease status and the missingness). Do this with PLINK by running the following command:
187 |
188 | ```bash
189 | plink --bfile data/gwa --test-missing
190 | ```
191 |
192 | View the result in the file plink.missing, where the p-value is given in the rightmost column, or generate a histogram of all the p-values using the R script data/plink.plot.R by typing this command
193 |
194 | ```bash
195 | Rscript data/plink.plot.R plink.missing
196 | ```
197 |
198 | The resulting p-value histogram can be found in the file plink.missing.png, which you can open using the png viewer display, so by typing
199 |
200 | ```bash
201 | display plink.missing.png
202 | ```
203 |
204 | If the missingness is random the p-value should be uniformly distributed between 0 and 1.
205 |
206 | * Is this the case?
207 | * Genotyping errors are often highly correlated with missingness. How do you think this will affect your association results?
208 |
209 | #### Exercise 2F: doing initial QC of data part 4 (check for batch bias/non-random genotyping error again)
210 |
211 | Principal component analysis (PCA) and a very similar method called multidimensional scaling are also often used to reveal problems in the data. Such analyses can be used to project all the genotype information (e.g. 500,000 marker sites) down to a low number of dimensions, e.g. two.
212 |
213 | Multidimensional scaling can be performed with PLINK as follows (for all individuals except for a few that have more than 20% missingness):
214 |
215 | ```bash
216 | plink --bfile data/gwa --cluster --mds-plot 2 --mind 0.2
217 | ```
218 |
219 | Run the above command and plot the results by typing:
220 |
221 | ```bash
222 | Rscript data/plink.plot.R plink.mds data/gwa.fam
223 | ```
224 |
225 | The resulting plot should now be in the file plink.mds.pdf, which you can open with the pdf viewer evince (so by typing "evince plink.mds.pdf"). It shows the first two dimensions and each individual is represented by a point, which is colored according to the individual's disease status.
226 |
227 | * Clustering of cases and controls is an indication of batch bias. Do you see such clustering?
228 | * What else could explain this clustering?
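
To make the idea of projecting genotypes down to two dimensions more concrete, here is a minimal R sketch using classical multidimensional scaling (`cmdscale`) on a tiny, made-up genotype matrix. It uses Euclidean distances for simplicity, which is an assumption of this sketch; PLINK works on identity-by-state distances between individuals.

```r
# Hypothetical genotype matrix: 6 individuals x 5 SNPs, coded as 0/1/2 copies
# of the minor allele. The first three and the last three individuals are made
# deliberately different, mimicking two batches or two populations.
geno <- rbind(
  c(0, 0, 1, 0, 0),
  c(0, 1, 0, 0, 0),
  c(0, 0, 0, 1, 0),
  c(2, 2, 1, 2, 2),
  c(2, 1, 2, 2, 2),
  c(2, 2, 2, 1, 2)
)
rownames(geno) <- paste0("ind", 1:6)

# Pairwise Euclidean distances between individuals (a simplification).
d <- dist(geno)

# Classical multidimensional scaling down to two dimensions.
mds <- cmdscale(d, k = 2)
print(round(mds, 2))
```

In this toy example the two groups end up on opposite sides of the first dimension, which is the kind of clustering you should look for in the PLINK plot.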
229 |
230 |
231 | #### Exercise 2G: try to rerun GWAS after quality filtering SNPs
232 |
233 | We can remove many of the error-prone SNPs and individuals by removing
234 |
235 | * SNPs that are not in HWE within controls
236 | * the rare SNPs
237 | * the individuals and SNPs with lots of missing data (why?)
238 |
239 | Let us try to rerun the association analysis with these filters applied:
240 |
241 | ```bash
242 | plink --bfile data/gwa --logistic --adjust --out assoc2 --hwe 0.0001 --maf 0.05 --mind 0.55 --geno 0.05
243 | ```
244 |
245 | Plot the results using
246 |
247 | ```bash
248 | Rscript data/plink.plot.R assoc2.assoc.logistic
249 | ```
250 |
251 | * How does the QQ-plot look (look at the file assoc2.assoc.logistic.QQ.png e.g. with the viewer called display as you did for the previous QQ-plot)?
252 | * Did the analysis go better this time?
253 | * And what does the Manhattan plot suggest (look at the file assoc2.assoc.logistic.png e.g. with the viewer called display)? Are any of the SNPs associated?
254 | * Does your answer change if you use other (smarter) methods to correct for multiple testing than Bonferroni (e.g. FDR), which PLINK provides? You can find them by typing:
255 |
256 | ```bash
257 | less assoc2.assoc.logistic.adjusted
258 | ```
259 |
260 | Note that PLINK adjusts p-values instead of the threshold (equivalent idea), so you should NOT change the threshold but stick to 0.05.
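
The difference between adjusting p-values with Bonferroni and with an FDR method can be sketched in R with the built-in `p.adjust` function. The p-values below are hypothetical and are not taken from assoc2.assoc.logistic; the sketch only illustrates the principle of comparing adjusted p-values with a fixed threshold of 0.05.

```r
# Hypothetical raw p-values for ten SNPs (not from assoc2.assoc.logistic).
p_raw <- c(0.00001, 0.004, 0.006, 0.03, 0.04, 0.2, 0.4, 0.6, 0.8, 0.9)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # p times number of tests, capped at 1
p_fdr  <- p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg FDR

print(data.frame(p_raw, p_bonf, p_fdr))

# Compare the adjusted p-values with the fixed threshold 0.05.
n_bonf <- sum(p_bonf < 0.05)
n_fdr  <- sum(p_fdr < 0.05)
cat("Significant after Bonferroni:", n_bonf, "\n")
cat("Significant after BH (FDR):  ", n_fdr, "\n")
```

In this toy example the FDR adjustment keeps one more SNP below 0.05 than Bonferroni does, because it controls the expected proportion of false discoveries rather than the chance of making any false discovery at all.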
261 |
262 | .
263 |
264 | ## Bonus exercise if there is any time left: another example of a GWAS caveat
265 |
266 | For the same individuals as above we also have another phenotype. This phenotype is strongly correlated with gender. The genotyping was done independently of this phenotype so there is no batch bias. To perform an association analysis on this phenotype, type:
267 |
268 | ```bash
269 | plink --bfile data/gwa --logistic --pheno data/pheno3.txt --adjust --out pheno3
270 | Rscript data/plink.plot.R pheno3.assoc.logistic
271 | ```
272 |
273 | * View the plots and results. Are any of the SNPs significantly associated?
274 |
275 | Now try to perform the analysis using logistic regression adjusted for sex, which is done by adding the option "--sex":
276 |
277 | ```bash
278 | plink --bfile data/gwa --logistic --out pheno3_sexAdjusted --pheno data/pheno3.txt --sex
279 | Rscript data/plink.plot.R pheno3_sexAdjusted.assoc.logistic
280 | ```
281 |
282 | * Are there any associated SNPs according to this analysis?
283 |
284 | Some of the probes used on the chip will hybridize with multiple loci on the genome. The associated SNPs in the previous analysis all cross-hybridize with the X chromosome.
285 |
286 | * Could cross-hybridization explain the difference in results from the two analyses?
287 |
288 |
289 |
--------------------------------------------------------------------------------
/HWexercise/.gitkeep:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/HWexercise/HW1PA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW1PA.png
--------------------------------------------------------------------------------
/HWexercise/HW2Sn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW2Sn.png
--------------------------------------------------------------------------------
/HWexercise/HW3HV.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW3HV.png
--------------------------------------------------------------------------------
/HWexercise/HW3HV2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW3HV2.png
--------------------------------------------------------------------------------
/HWexercise/HW4dF.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW4dF.png
--------------------------------------------------------------------------------
/HWexercise/HW5TP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW5TP.png
--------------------------------------------------------------------------------
/HWexercise/HW6QQ.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW6QQ.png
--------------------------------------------------------------------------------
/HWexercise/HWanswers.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HWanswers.pdf
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/ExercisesLD20.md:
--------------------------------------------------------------------------------
1 | # Linkage disequilibrium
2 |
3 |
4 |
5 | **Hans R Siegismund**
6 |
7 |
8 |
9 | **Exercise 1**
10 |
11 | Take a look at the variation at two SNPs in the human genome:
12 |
13 |
14 | | | |**SNP83**| |
15 | |-----|---|----:|---:|
16 | | | |_B_ | _b_|
17 | |**SNP82**|_A_| 22 |9 |
18 | | |_a_| 0 |20 |
19 |
20 | Use the “LD test” sheet in the “Hardy Weinberg test, LD test.xls” file to
21 | estimate
22 |
23 | 1) Haplotype frequencies
24 |
25 | 2) Allele frequencies
26 |
27 | 3) The linkage disequilibrium parameter *D*
28 |
29 | 4) *χ²* test for independent distribution at the two loci
30 |
31 | 5) *Dmax*
32 |
33 | 6) *Dmin*
34 |
35 | 7) *D´* = *D/Dmax* if *D* is positive
36 | *D´* = *D/Dmin* if *D* is negative
37 |
38 | 8) *r²*
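
The quantities in the list above can also be computed directly in R. The sketch below does so for a hypothetical set of haplotype counts (deliberately not the SNP82/SNP83 table), using the standard definitions; the variable names are chosen for this illustration only, and the spreadsheet may organise the calculation differently.

```r
# Hypothetical haplotype counts (not the SNP82/SNP83 data from the exercise).
n_AB <- 40; n_Ab <- 10; n_aB <- 10; n_ab <- 40
n <- n_AB + n_Ab + n_aB + n_ab

# Haplotype and allele frequencies.
p_AB <- n_AB / n
p_A  <- (n_AB + n_Ab) / n
p_B  <- (n_AB + n_aB) / n

# Linkage disequilibrium parameter D = p_AB - p_A * p_B.
D <- p_AB - p_A * p_B

# Bounds on D given the allele frequencies.
D_max <- min(p_A * (1 - p_B), (1 - p_A) * p_B)     # used when D is positive
D_min <- max(-p_A * p_B, -(1 - p_A) * (1 - p_B))   # used when D is negative

# Normalised measures of LD.
D_prime <- if (D >= 0) D / D_max else D / D_min
r2      <- D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

# Chi-square statistic for independence of the two loci (1 degree of freedom).
chi2    <- n * r2
p_value <- pchisq(chi2, df = 1, lower.tail = FALSE)

print(c(D = D, D_max = D_max, D_min = D_min,
        D_prime = D_prime, r2 = r2, chi2 = chi2, p_value = p_value))
```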
39 |
40 | **Exercise 2**
41 |
42 | **Population admixture**
43 |
44 | Use the sheet “LD in admixed populations” in the “Hardy Weinberg test,
45 | LD test.xls” file to estimate the effects of admixture of two
46 | populations that have the following haplotype frequencies:
47 |
48 | | |AB | Ab |aB |ab |
49 | |----------------|---:|:---:|:---:|---:|
50 | |**Population 1**|1 | 9 | 9 | 81|
51 | |**Population 2**| 81 | 9 | 9 | 1|
52 |
53 | 1) Is there any sign of LD in the two populations?
54 |
55 | 2) Assume that the admixed population consists of 50% from each of the
56 | source populations. What is the LD in the admixed population? (Let
57 | the sample size be 100.)
58 |
59 | 3) Assume that the two loci are unlinked. What will *D* be in the next
60 | generation (i.e. in the offspring of the admixed population)?
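
The breakdown of LD by recombination follows the textbook relation D_t = D_0 (1 - c)^t, where c is the recombination fraction between the two loci. A minimal R sketch with a hypothetical starting value (not the answer to the exercise) shows how much faster LD disappears for unlinked loci (c = 0.5) than for tightly linked loci; the function and argument names are made up for this illustration.

```r
# Expected LD after t generations of random mating: D_t = D_0 * (1 - rec)^t,
# where rec is the recombination fraction between the two loci.
ld_decay <- function(D0, rec, t) {
  D0 * (1 - rec)^t
}

D0 <- 0.2            # hypothetical starting value, not the exercise answer
generations <- 0:5

unlinked <- ld_decay(D0, rec = 0.5,  t = generations)  # unlinked loci
linked   <- ld_decay(D0, rec = 0.01, t = generations)  # tightly linked loci

print(data.frame(generation = generations, unlinked, linked))
```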
61 |
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/LD_ex_0_gorilla.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_0_gorilla.png
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/LD_ex_1_dist_map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_1_dist_map.png
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/LD_ex_2_link_map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_2_link_map.png
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/LD_ex_3_LD_decay.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_3_LD_decay.png
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/LDinGorillas.md:
--------------------------------------------------------------------------------
1 | # Linkage Disequilibrium in Small and Large Populations of Gorillas
2 |
3 | **Peter Frandsen, Shyam Gopalakrishnan & Hans R. Siegismund**
4 |
5 | ## Program
6 |
7 |
8 |
9 |
10 |
11 | - Apply filters to prepare data for analysis
12 | - Explore patterns of LD blocks in two populations of gorilla
13 | - Estimate mean pairwise LD between SNPs along a chromosome
14 | - Plot and compare curves of LD decay in two populations of gorilla
15 |
16 | ## Learning outcome
17 |
18 | - Get familiar with common filtering procedures in genome analyses
19 | using PLINK
20 | - Get familiar with running analyses in R and the code involved
21 | - Understand the processes that build and break down LD
22 | - Understand the link between patterns of LD and population history
23 |
24 | ## Essential background reading
25 |
26 | Nielsen and Slatkin: Chapter 6 - Linkage Disequilibrium and Gene Mapping
27 | p. 107-126
28 |
29 | ## Key concepts that you should familiarize yourself with:
30 |
31 | - Definition of linkage disequilibrium (LD)
32 |
33 | - Measures of LD
34 |
35 | - Build-up and breakdown of LD
36 |
37 | - Applications of LD in population genetics
38 |
39 | ### Suggested reading
40 | Prado-Martinez et al. (2013) [Great ape genetic diversity and population
41 | history](http://www.nature.com/nature/journal/v499/n7459/full/nature12228.html)
42 |
43 | # Introduction
44 |
45 | Gorillas are among the most critically endangered species of great apes.
46 | The *Gorilla* genus consists of two species, with two subspecies each.
47 | In this exercise, you will work with chromosome 4 from the western
48 | lowland subspecies and the mountain gorilla subspecies. The former is
49 | distributed over a large forest area across the Congo basin in Central
50 | Africa (yellow in fig. 1), while the mountain gorilla is confined to two
51 | small populations in the Virunga mountains and the impenetrable forest
52 | of Bwindi (encircled with red in fig. 1). The mountain gorilla, in
53 | particular, has been subject to active conservation efforts for decades,
54 | yet, very little is known about their genomic diversity and evolutionary
55 | history. The data you will be working with today is an extract of a
56 | large study on great ape genomes (Prado-Martinez et al.
57 | 2013).
58 |
59 |
60 |
61 |
62 |
63 | **Figure 1 Distribution range of the *Gorilla* genus.** The large yellow patch indicates the distribution of western lowland gorilla (*Gorilla gorilla gorilla*). The distribution of mountain gorilla (*Gorilla beringei beringei*) is encircled with red. Reprinted from [http://maps.iucnredlist.org](http://maps.iucnredlist.org/), edited by Peter Frandsen.
64 |
65 | ## Getting started
66 |
67 | Start by copying the compressed folder ‘gorillas.tar.gz’ to a new folder
68 | ‘LD’ inside your exercise folder and unpacking it there with the
69 | following commands. (We assume that the folder `exercises` exists in your home folder; otherwise create it with `mkdir ~/exercises`.)
70 |
71 | ```bash
72 | cd ~/exercises
73 | mkdir LD
74 | cd LD/
75 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/LD/gorillas.tar.gz .
76 | tar -zxvf gorillas.tar.gz
77 | rm gorillas.tar.gz
78 | ```
79 |
80 | All code needed in the exercise is given below. It will be indicated
81 | with an **\>R** when you need to run a command in R, otherwise stay in
82 | your main terminal.
83 |
84 | ## LD blocks
85 |
86 | In this first part of the exercise we will explore a specific region of chromosome 4 in terms of blocks of LD. We have not chosen this region for any specific reason and the emerging patterns should serve as a general comparison of LD in the two populations. The first step will be to extract a pre-defined region using PLINK and run an LD analysis with snpMatrix in R. The results will be written to an .eps file that you can open with a viewer such as evince.
87 |
88 | ### Mountain Gorilla
89 |
90 | ```bash
91 | plink --file mountain --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out mountainBlock
92 | ```
93 |
94 | #### \>R
95 |
96 | ```R
97 | library(snpMatrix)
98 | data <- read.plink("mountainBlock")
99 | ld <- ld.snp(data, dep=930)
100 | plot.snp.dprime(ld, filename="Mountain.eps", res=100)
101 | q()
102 | ```
103 |
104 | ### Western Lowland Gorilla
105 |
106 | ```bash
107 | plink --file lowland --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out lowlandBlock
108 | ```
109 |
110 | #### \>R
111 |
112 | ```R
113 | library(snpMatrix)
114 | data<- read.plink("lowlandBlock")
115 | ld<- ld.snp(data, dep=2268)
116 | plot.snp.dprime(ld, filename="WesternLowland.eps", res=100)
117 | q()
118 | ```
119 |
120 | **NB**: the .eps files are quite heavy and loading the output could take a long time. You can open the .eps files with the commands:
121 |
122 | ```bash
123 | evince Mountain.eps & # the ‘&’ will make the command run in the background
124 | evince WesternLowland.eps &
125 | ```
126 |
127 | **NB**: this will take approximately five minutes to load.
128 |
129 | **Q1:** How would you characterize the structure you observe in the
130 | different populations?
131 |
132 | **Q2:** How many (if any) recombination hotspots can you identify in the two populations (crude estimate)?
133 |
134 | Mountain gorilla:
135 |
136 | Western lowland gorilla:
137 |
138 | ## LD decay
139 | In this part of today’s exercise we will explore the decay of LD as a function of distance along the chromosome. This will enable us to compare the two populations in terms of mean distances on the chromosome with sites in LD. To begin with, we will do some essential filtering procedures common to this type of analysis and then, again, read the data into R and run the analyses with snpMatrix.
140 |
141 | ### Filtering
142 |
143 | ```bash
144 | plink --file mountain --maf 0.15 --geno 0 --thin 0.15 --make-bed --out mountainLD
145 | plink --file lowland --maf 0.15 --geno 0 --thin 0.15 --make-bed --out lowlandLD
146 | ```
147 |
148 | **Q3:** Why do we exclude SNPs with a low minor allele frequency (i.e. SNPs where the less common allele is rare in the population)?
149 |
150 | **Q4:** What else are we filtering away? See the list of options in PLINK: [http://pngu.mgh.harvard.edu/\~purcell/plink/reference.shtml#options](http://pngu.mgh.harvard.edu/%7Epurcell/plink/reference.shtml#options) (the link seems to be broken).
151 |
152 | **Q5:** How many SNPs did we have before and after filtering?
153 |
154 | ### Run the commands
155 |
156 | This script will load the package snpMatrix, read in the PLINK files, and estimate the mean LD between SNP pairs in a pre-set range (dep). At the end of the script, the result will be plotted to a .pdf file and the underlying sums and counts written to two .txt files (check your directory).
157 |
158 | ### Mountain Gorilla
159 |
160 | **\>R** *(paste in the whole script)*
161 |
162 | ```R
163 | library("snpMatrix")
164 | data<- read.plink("mountainLD")
165 | ld<- ld.snp(data, dep=500)
166 | do.rsq2 <- ("rsq2" %in% names(ld))
167 | snp.names <- attr(ld, "snp.names")
168 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod")
169 | r.maybe <- ld$rsq2
170 | max.depth <- dim(ld$dprime)[2]
171 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth)
172 | count<-1
173 | for (i.snp in c(1:(length(snp.names) - 1))) {
174 | for (j.snp in c((i.snp + 1):length(snp.names))) {
175 | step <- j.snp - i.snp
176 | if (step > max.depth) {
177 | break
178 | }
179 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step])
180 | count<-count+1;
181 | }
182 | }
183 | resNum<-as.numeric(res)#converts to numeric values
184 | dim(resNum)<-dim(res)
185 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites
186 | disbin<- cut(dis, breaks=seq(0,2000000,by=10000))
187 | #cuts distances into bins of size 10K from distance "0" to "2000000"
188 | res<- tapply(resNum[,3], disbin, mean, na.rm=TRUE)
189 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2M fall outside the bins and are dropped
190 | pdf("MountainLDdecay.pdf") #saves output as a .pdf
191 | plot(res, type="l", ylim=c(0, 1))
192 | dev.off()
193 | ### The following code is just a way to store the LD calculations ###
194 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=TRUE)
195 | func<- function(x) #makes a function "func" with argument "x"
196 | length(x[!is.na(x)])
197 | #gives the number of values in "x" that are not (denoted by "!") "NA"
198 | reslength<- tapply(resNum[,3], disbin, func)
199 | #gives the number of r2 values in each bin of 10K
200 | write.table(reslength, file="MountainLength.txt")
201 | write.table(ressum, file="MountainSum.txt")
202 | q()
203 | n
204 | ```
205 |
206 |
207 | Look through the script to get an idea about what is going on.
208 |
209 | **Q6:** How many pair-wise comparisons have we done (*dep* in
210 | snpMatrix)?
211 |
213 |
214 | ### Western Lowland Gorilla
215 |
216 | **\>R** *(paste in the whole script)*
217 |
218 | ```R
219 | library("snpMatrix")
220 | data<- read.plink("lowlandLD")
221 | ld<- ld.snp(data, dep=500)
222 | do.rsq2 <- ("rsq2" %in% names(ld))
223 | snp.names <- attr(ld, "snp.names")
224 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod")
225 | r.maybe <- ld$rsq2
226 | max.depth <- dim(ld$dprime)[2]
227 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth)
228 | count<-1
229 | for (i.snp in c(1:(length(snp.names) - 1))) {
230 | for (j.snp in c((i.snp + 1):length(snp.names))) {
231 | step <- j.snp - i.snp
232 | if (step > max.depth) { break
233 | }
234 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step])
235 | count<-count+1;
236 | }
237 | }
238 | resNum<-as.numeric(res)#converts to numeric values
239 | dim(resNum)<-dim(res)
240 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites
241 | disbin<- cut(dis, breaks=seq(0,2000000,by=10000))
242 | #cuts distances into bins of size 10K from distance "0" to "2000000"
243 | res<- tapply(resNum[,3], disbin, mean, na.rm=TRUE)
244 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2M fall outside the bins and are dropped
245 | pdf("WesternLowlandLDdecay.pdf")
246 | #saves output as a .pdf file
247 | plot(res, type="l", ylim=c(0, 1))
248 | dev.off()
249 | ### The following code is just a way to store the LD calculations ###
250 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=TRUE)
251 | func<- function(x) #makes a function "func" with argument "x"
252 | length(x[!is.na(x)])
253 | #gives the number of values in "x" that are not (denoted by "!") "NA"
254 | reslength<- tapply(resNum[,3], disbin, func)
255 | #gives the number of r2 values in each bin of 10K
256 | write.table(reslength, file="LowlandLength.txt")
257 | write.table(ressum, file="LowlandSum.txt")
258 | q()
259 | n
260 | ```
261 |
262 | Extra task (optional): when done with both populations, plot the curves
263 | together with different colors and a legend (*hint: the input you will
264 | need was written to four text files at the end of the scripts above*).
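
One possible way to approach the extra task is sketched below. It assumes that the four text files were written by `write.table()` exactly as in the scripts above (one header line followed by one row per distance bin); the helper name `mean_r2` and the output file name `LDdecayBoth.pdf` are made up for this sketch.

```r
# Read one pair of output files and return the mean r2 per distance bin.
# Assumes one header line followed by rows of "bin label, value".
mean_r2 <- function(sum_file, length_file) {
  r2_sum <- read.table(sum_file, header = TRUE)[[1]]
  n_pair <- read.table(length_file, header = TRUE)[[1]]
  r2_sum / n_pair   # empty bins give NaN and show up as gaps in the plot
}

# Plot both curves in one figure, but only if all four files are present.
files <- c("MountainSum.txt", "MountainLength.txt",
           "LowlandSum.txt", "LowlandLength.txt")
if (all(file.exists(files))) {
  mountain <- mean_r2(files[1], files[2])
  lowland  <- mean_r2(files[3], files[4])
  pdf("LDdecayBoth.pdf")   # hypothetical output file name
  plot(mountain, type = "l", col = "red", ylim = c(0, 1),
       xlab = "Distance (bins of 10 kb)", ylab = "Mean r2")
  lines(lowland, col = "blue")
  legend("topright", legend = c("Mountain", "Western lowland"),
         col = c("red", "blue"), lty = 1)
  dev.off()
}
```

Dividing the summed r2 by the number of pairs in each bin recovers the same per-bin means that were plotted separately for each population.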
265 |
266 | **Q7:** Look at the two plots and explain why LD decays with distance.
267 |
268 | **Q8:** What is the mean *r²* at a distance of 1 Mb (100 on the x-axis) in the
269 | two populations?
270 |
271 | Mountain gorilla:
272 |
273 | Western lowland gorilla:
274 |
275 | **Q9:** What could explain any observed difference in the decay of LD in
276 | the two populations?
277 |
278 | **Q10:** These estimates are done on an autosomal chromosome, would you
279 | expect different LD patterns in other parts of the genome?
280 |
281 | ## Perspectives
282 |
283 | **Q11:** In comparison to the different populations of gorilla, what do
284 | you think the trajectory of the LD decay and the LD block patterns would
285 | look like in humans?
286 |
287 | **Q12:** Would you also expect different patterns of LD in human
288 | populations (*e.g.* Chinese, Europeans, and Africans)?
289 |
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/LDinGorillas_A_24.md:
--------------------------------------------------------------------------------
1 | # Linkage Disequilibrium in Small and Large Populations of Gorillas
2 |
3 | **Peter Frandsen, Shyam Gopalakrishnan & Hans R. Siegismund**
4 |
5 | ## Program
6 |
7 |
8 |
9 |
10 |
11 | - Apply filters to prepare data for analysis
12 | - Explore patterns of LD blocks in two populations of gorilla
13 | - Estimate mean pairwise LD between SNPs along a chromosome
14 | - Plot and compare curves of LD decay in two populations of gorilla
15 |
16 | ## Learning outcome
17 |
18 | - Get familiar with common filtering procedures in genome analyses
19 | using PLINK
20 | - Get familiar with running analyses in R and the code involved
21 | - Understand the processes that build and break down LD
22 | - Understand the link between patterns of LD and population history
23 |
24 | ## Essential background reading
25 |
26 | Nielsen and Slatkin: Chapter 6 - Linkage Disequilibrium and Gene Mapping
27 | p. 107-126
28 |
29 | ## Key concepts that you should familiarize yourself with:
30 |
31 | - Definition of linkage disequilibrium (LD)
32 |
33 | - Measures of LD
34 |
35 | - Build-up and breakdown of LD
36 |
37 | - Applications of LD in population genetics
38 |
39 | ### Suggested reading
40 | Prado-Martinez et al. (2013) [Great ape genetic diversity and population
41 | history](http://www.nature.com/nature/journal/v499/n7459/full/nature12228.html)
42 |
43 | # Introduction
44 |
45 | Gorillas are among the most critically endangered species of great apes.
46 | The *Gorilla* genus consists of two species, with two subspecies each.
47 | In this exercise, you will work with chromosome 4 from the western
48 | lowland subspecies and the mountain gorilla subspecies. The former is
49 | distributed over a large forest area across the Congo basin in Central
50 | Africa (yellow in fig. 1), while the mountain gorilla is confined to two
51 | small populations in the Virunga mountains and the impenetrable forest
52 | of Bwindi (encircled with red in fig. 1). The mountain gorilla, in
53 | particular, has been subject to active conservation efforts for decades,
54 | yet, very little is known about their genomic diversity and evolutionary
55 | history. The data you will be working with today is an extract of a
56 | large study on great ape genomes (Prado-Martinez et al.
57 | 2013).
58 |
59 |
60 |
61 |
62 |
63 | **Figure 1 Distribution range of the *Gorilla* genus.** The large yellow patch indicates the distribution of western lowland gorilla (*Gorilla gorilla gorilla*). The distribution of mountain gorilla (*Gorilla beringei beringei*) is encircled with red. Reprinted from [http://maps.iucnredlist.org](http://maps.iucnredlist.org/), edited by Peter Frandsen.
64 |
65 | ## Getting started
66 |
67 | Start by copying the compressed folder ‘gorillas.tar.gz’ to a new folder
68 | ‘LD’ inside your exercise folder and unpacking it there with the
69 | following commands. (We assume that the folder `exercises` exists in your home folder; otherwise create it with `mkdir ~/exercises`.)
70 |
71 | ```bash
72 | cd ~/exercises
73 | mkdir LD
74 | cd LD/
75 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/LD/gorillas.tar.gz .
76 | tar -zxvf gorillas.tar.gz
77 | rm gorillas.tar.gz
78 | ```
79 |
80 | All code needed in the exercise is given below. It will be indicated
81 | with an **\>R** when you need to run a command in R, otherwise stay in
82 | your main terminal.
83 |
84 | ## LD blocks
85 |
86 | In this first part of the exercise we will explore a specific region of chromosome 4 in terms of blocks of LD. We have not chosen this region for any specific reason and the emerging patterns should serve as a general comparison of LD in the two populations. The first step will be to extract a pre-defined region using PLINK and run an LD analysis with snpMatrix in R. The results will be written to an .eps file that you can open with a viewer such as evince.
87 |
88 | ### Mountain Gorilla
89 |
90 | ```bash
91 | plink --file mountain --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out mountainBlock
92 | ```
93 |
94 | #### \>R
95 |
96 | ```R
97 | library(snpMatrix)
98 | data <- read.plink("mountainBlock")
99 | ld <- ld.snp(data, dep=930)
100 | plot.snp.dprime(ld, filename="Mountain.eps", res=100)
101 | q()
102 | ```
103 |
104 | ### Western Lowland Gorilla
105 |
106 | ```bash
107 | plink --file lowland --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out lowlandBlock
108 | ```
109 |
110 | #### \>R
111 |
112 | ```R
113 | library(snpMatrix)
114 | data<- read.plink("lowlandBlock")
115 | ld<- ld.snp(data, dep=2268)
116 | plot.snp.dprime(ld, filename="WesternLowland.eps", res=100)
117 | q()
118 | ```
119 |
120 | **NB**: the .eps files are quite heavy and loading the output could take a long time. You can open the .eps files with the commands:
121 |
122 | ```bash
123 | evince Mountain.eps & # the ‘&’ will make the command run in the background
124 | evince WesternLowland.eps &
125 | ```
126 |
127 | **NB**: this will take approximately five minutes to load.
128 |
129 | **Q1:** How would you characterize the structure you observe in the
130 | different populations?
131 |
132 | Answer:
133 |
134 |
135 |
136 |
137 |
138 | The mountain gorilla genome consists of large regions with blocks of
139 | high LD, while recombination has broken down such patterns in the
140 | western lowland gorilla.
141 |
142 |
143 | **Q2:** How many (if any) recombination hotspots can you identify in the two populations (crude estimate)?
144 |
145 | Mountain gorilla:
146 |
147 | Western lowland gorilla:
148 |
149 |
150 |
151 | Mountain gorilla: *at least two*. Western lowland gorilla: no blocks of LD in this region.
152 |
153 |
154 | ## LD decay
155 | In this part of today’s exercise we will explore the decay of LD as a function of distance along the chromosome. This will enable us to compare the two populations in terms of mean distances on the chromosome with sites in LD. To begin with, we will do some essential filtering procedures common to this type of analysis and then, again, read the data into R and run the analyses with snpMatrix.
156 |
157 | ### Filtering
158 |
159 | ```bash
160 | plink --file mountain --maf 0.15 --geno 0 --thin 0.15 --make-bed --out mountainLD
161 | plink --file lowland --maf 0.15 --geno 0 --thin 0.15 --make-bed --out lowlandLD
162 | ```
163 |
164 | **Q3:** Why do we exclude SNPs with a low minor allele frequency (i.e. SNPs where the less common allele is rare in the population)?
165 |
166 |
167 |
168 | Newly arisen and rare mutations will
169 | distort the LD pattern (elevate the mean LD). Remember, every new
170 | mutation on a chromosome is in complete LD with the rest of the
171 | chromosome. For most LD analyses, we are also more interested in the
172 | common variation. On the plus side, we also get rid of sequencing errors
173 | (which are also rare, hopefully).
174 |
175 |
176 | **Q4:** What else are we filtering away? See the list of options in PLINK: [http://pngu.mgh.harvard.edu/\~purcell/plink/reference.shtml#options](http://pngu.mgh.harvard.edu/%7Epurcell/plink/reference.shtml#options) (the link seems to be broken).
177 |
178 |
179 |
180 | `--geno 0` : removes all SNPs with any missing genotypes
181 |
182 | `--thin 0.15` : removes a random 85 percent of the SNPs (keeps 15
183 | percent), otherwise we would not get through the exercise in just two
184 | hours. Plus, it is not always necessary to keep all your data.
185 | Sometimes, only a fraction of your data will tell you the same story as
186 | a complete and computationally heavy dataset.
187 |
188 | **Q5:** How many SNPs did we have before and after filtering?
189 |
190 |
191 |
192 | mountain gorilla: 2,521,457 SNPs before, 23,689 SNPs after filtering
193 |
194 | lowland gorilla: 2,521,457 SNPs before, 48,555 SNPs after filtering
195 |
196 | In other words, you threw away a lot of data (but you can still reach
197 | the same conclusion as if we had looked at the whole genome)
198 |
199 |
200 | ### Run the commands
201 |
202 | This script will load the package snpMatrix, read in the PLINK files, and estimate the mean LD between SNP pairs in a pre-set range (dep). At the end of the script, the result will be plotted to a .pdf file and the underlying sums and counts written to two .txt files (check your directory).
203 |
204 | ### Mountain Gorilla
205 |
206 | **\>R** *(paste in the whole script)*
207 |
208 | ```R
209 | library("snpMatrix")
210 | data<- read.plink("mountainLD")
211 | ld<- ld.snp(data, dep=500)
212 | do.rsq2 <- ("rsq2" %in% names(ld))
213 | snp.names <- attr(ld, "snp.names")
214 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod")
215 | r.maybe <- ld$rsq2
216 | max.depth <- dim(ld$dprime)[2]
217 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth)
218 | count<-1
219 | for (i.snp in c(1:(length(snp.names) - 1))) {
220 | for (j.snp in c((i.snp + 1):length(snp.names))) {
221 | step <- j.snp - i.snp
222 | if (step > max.depth) {
223 | break
224 | }
225 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step])
226 | count<-count+1;
227 | }
228 | }
229 | resNum<-as.numeric(res)#converts to numeric values
230 | dim(resNum)<-dim(res)
231 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites
232 | disbin<- cut(dis, breaks=seq(0,2000000,by=10000))
233 | #cuts distances into bins of size 10K from distance "0" to "2000000"
234 | res<- tapply(resNum[,3], disbin, mean, na.rm=TRUE)
235 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2M fall outside the bins and are dropped
236 | pdf("MountainLDdecay.pdf") #saves output as a .pdf
237 | plot(res, type="l", ylim=c(0, 1))
238 | dev.off()
239 | ### The following code is just a way to store the LD calculations ###
240 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=TRUE)
241 | func<- function(x) #makes a function "func" with argument "x"
242 | length(x[!is.na(x)])
243 | #gives the number of values in "x" that are not (denoted by "!") "NA"
244 | reslength<- tapply(resNum[,3], disbin, func)
245 | #gives the number of r2 values in each bin of 10K
246 | write.table(reslength, file="MountainLength.txt")
247 | write.table(ressum, file="MountainSum.txt")
248 | q()
249 | n
250 | ```
251 |
252 |
253 | Look through the script to get an idea about what is going on.
254 |
255 | **Q6:** How many pair-wise comparisons have we done (*dep* in
256 | snpMatrix)?
257 |
258 |
259 |
260 | 500 (ld<- ld.snp(data, dep=500))
261 |
262 |
264 |
265 | ### Western Lowland Gorilla
266 |
267 | **\>R** *(paste in the whole script)*
268 |
269 | ```R
270 | library("snpMatrix")
271 | data<- read.plink("lowlandLD")
272 | ld<- ld.snp(data, dep=500)
273 | do.rsq2 <- ("rsq2" %in% names(ld))
274 | snp.names <- attr(ld, "snp.names")
275 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod")
276 | r.maybe <- ld$rsq2
277 | max.depth <- dim(ld$dprime)[2]
278 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth)
279 | count<-1
280 | for (i.snp in c(1:(length(snp.names) - 1))) {
281 | for (j.snp in c((i.snp + 1):length(snp.names))) {
282 | step <- j.snp - i.snp
283 | if (step > max.depth) { break
284 | }
285 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step])
286 | count<-count+1;
287 | }
288 | }
289 | resNum<-as.numeric(res)#converts to numeric values
290 | dim(resNum)<-dim(res)
291 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites
292 | disbin<- cut(dis, br=seq(0,2000000,by=10000))
293 | #cuts distances into bins of size 10 kb from distance 0 to 2,000,000 bp (2 Mb)
294 | res<- tapply(resNum[,3], disbin, mean, na.rm=T)
295 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2 Mb fall outside the bins and are dropped
296 | pdf("WesternLowlandLDdecay.pdf")
297 | #saves output as a .pdf file
298 | plot(res, type="l", ylim=c(0, 1))
299 | dev.off()
300 | ### The following code is just a way to store the LD calculations ###
301 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=T)
302 | func<- function(x)#makes a function that takes a vector "x"
303 | length(x[!is.na(x)])
304 | #and returns the number of values in "x" that are not (denoted by "!") "NA"
305 | reslength<- tapply(resNum[,3], disbin, func)
306 | #gives the number of non-missing r2 values in each 10 kb bin
307 | write.table(reslength, file="LowlandLength.txt")
308 | write.table(ressum, file="LowlandSum.txt")
309 | q()
310 | n
311 | ```
312 |
313 | Extra task (optional): when done with both populations, plot the curves
314 | together with different colors and legend (*hint: the input you will
315 | need was printed to four text files in the end of the scripts above*).
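For the optional task, the mean r2 in each bin can be recovered from the stored files as the sum of r2 divided by the number of pairs in that bin. The following is a minimal sketch, assuming the four text files written by the scripts above are in the working directory; `read_ld_curve()`, `plot_ld_curves()`, the output file name, and the colors are illustrative choices, not part of the exercise scripts.

```r
# Sketch: mean r2 per 10 kb bin = (sum of r2 in the bin) / (number of pairs in the bin).
# read_ld_curve() and plot_ld_curves() are hypothetical helpers.
read_ld_curve <- function(sum_file, length_file) {
  r2_sum  <- read.table(sum_file, header = TRUE)[[1]]
  n_pairs <- read.table(length_file, header = TRUE)[[1]]
  r2_sum / n_pairs  # bins without data stay NA
}

plot_ld_curves <- function(mountain, lowland, file = "BothLDdecay.pdf") {
  pdf(file)
  plot(mountain, type = "l", col = "darkgreen", ylim = c(0, 1),
       xlab = "Distance (10 kb bins)", ylab = "Mean r2")
  lines(lowland, col = "orange")
  legend("topright",
         legend = c("Mountain gorilla", "Western lowland gorilla"),
         col = c("darkgreen", "orange"), lty = 1)
  dev.off()
}

# Usage (run after both scripts above have finished):
# mountain <- read_ld_curve("MountainSum.txt", "MountainLength.txt")
# lowland  <- read_ld_curve("LowlandSum.txt", "LowlandLength.txt")
# plot_ld_curves(mountain, lowland)
```

Storing sums and counts rather than means is what makes this possible: the two numbers can be recombined per bin without rerunning the LD calculation.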
316 |
317 | **Q7:** Look at the two plots and explain why LD decays with distance.
318 |
319 |
320 |
321 |
322 |
323 |
324 |
325 | LD is broken down by recombination. With increasing distance between two
326 | sites, the likelihood of a recombination event between them increases. See plots
327 | above.
328 |
329 |
330 | **Q8:** What is the mean $r^2$ at a distance of 1 Mb (bin 100 on the x-axis) in the
331 | two populations?
332 |
333 | Mountain gorilla:
334 |
335 | Western lowland gorilla:
336 |
337 |
338 |
339 | Mountain gorilla: ~0.45
340 |
341 | Western lowland gorilla: ~0.15
342 |
343 |
344 | **Q9:** What could explain any observed difference in the decay of LD in
345 | the two populations?
346 |
347 |
348 |
349 | Selection (genetic hitchhiking), admixture, and
350 | drift in the course of the population history
351 |
352 |
353 | **Q10:** These estimates are done on an autosomal chromosome, would you
354 | expect different LD patterns in other parts of the genome?
355 |
356 |
357 |
358 | Yes, there is no (or very little) recombination on the Y chromosome and
359 | in the mitochondrial genome; hence, you would not observe a decay with increasing
360 | distance.
361 |
362 |
363 | ## Perspectives
364 |
365 | **Q11:** In comparison to the different populations of gorilla, what do
366 | you think the trajectory of the LD decay and the LD block patterns would
367 | look like in humans?
368 |
369 |
370 |
371 | LD decays at a rate similar to that in the western lowland gorilla, and large
372 | parts of the genome will show comparable patterns in terms of LD blocks.
373 | Only in a very small human population that has been isolated for a long
374 | time would we observe anything as extreme as in the mountain gorilla.
375 |
376 |
377 | **Q12:** Would you also expect different patterns of LD in human
378 | populations (*e.g.* Chinese, Europeans, and Africans)?
379 |
380 |
381 |
382 | Populations that have gone through historic bottlenecks, been subjected
383 | to genetic drift, admixture or selection will on average have a higher
384 | degree of LD. As an example, populations that went through a bottleneck
385 | when they migrated out of Africa (*e.g.* Europeans) have on average a
386 | higher degree of LD compared to most African populations.
387 |
388 |
--------------------------------------------------------------------------------
/Linkage_disequilibrium_gorillas/zzz:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/NaturalSelection/ExerciseSelection_2020.md:
--------------------------------------------------------------------------------
1 | # Natural selection (and other complications)
2 |
3 | **Hans R. Siegismund**
4 |
5 |
6 | **Exercise 1**
7 |
8 |
9 |
10 |
11 |
12 | In Africa and southern Europe, many human
13 | populations are polymorphic at the locus coding for the beta-hemoglobin
14 | chain. Two alleles are found, HbS and HbA. HbS differs from HbA in that
15 | at position 6 the amino acid glutamic acid has been replaced by
16 | valine. A study in Tanzania found the following genotypic distribution:
17 |
18 | | | HbAHbA | HbAHbS | HbSHbS | Sum |
19 | |----------|:----------------------------:|:----------------------------:|:----------------------------:|---|
20 | | Adults | 400 | 249 | 5 | 654 |
21 | | Children | 189 | 89 | 9 | 287 |
22 |
23 | 1) Estimate the allele frequencies in both groups.
24 |
25 | 2) Do the observed genotype distributions differ from Hardy-Weinberg
26 | proportions?
27 |
28 | Answer:
29 |
30 |
31 | |Adults | HbAHbA | HbAHbS | HbSHbS | Sum |
32 | |----------|:----------------------------|:----------------------------|:----------------------------|---|
33 | |Observed | 400 | 249 | 5 | 654 |
34 | |Expected | 420.64 | 207.71 | 25.64 | 654 |
35 |
36 | where
37 |
38 | *p*(A) = 0.802
39 |
40 | *p*(S) = 0.198
41 |
42 | The test for Hardy-Weinberg proportions gives $\chi^2 = 25.84$ (1 degree of freedom), which is
43 | highly significant. We see that this is caused by a large excess of
44 | heterozygotes.
45 |
46 |
47 | |Children | HbAHbA | HbAHbS | HbSHbS | Sum |
48 | |----------|:----------------------------|:----------------------------|:----------------------------|---|
49 | |Observed | 189 |89 |9 | 287|
50 | |Expected | 189.97 | 87.05 | 9.97 | 287 |
51 |
52 | where
53 |
54 | *p*(A) = 0.814
55 |
56 | *p*(S) = 0.186
57 |
58 | In this case the test for Hardy-Weinberg proportions gives $\chi^2 = 0.14$,
59 | which is non-significant. We also see that the allele frequencies
60 | among children and adults are similar. We do not bother to make a
61 | formal test for it.
62 |
63 | As might be known, the low survival of the HbSHbS genotype is due to
64 | sickle cell anemia; these homozygotes have a high probability of dying
65 | from it. Compared to the HbAHbA homozygote, the HbAHbS heterozygote is
66 | more resistant to malaria. We therefore have a system of
67 | overdominance.
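The Hardy-Weinberg calculations in the answer above can be reproduced with a few lines of R. This is a minimal sketch: `hw_test()` is a hypothetical helper written for this illustration, and it uses 1 degree of freedom (three genotype classes, minus one, minus one estimated allele frequency).

```r
# hw_test() is a hypothetical helper, not part of the course material.
hw_test <- function(n_AA, n_AS, n_SS) {
  n <- n_AA + n_AS + n_SS
  p <- (2 * n_AA + n_AS) / (2 * n)   # frequency of HbA
  q <- 1 - p                         # frequency of HbS
  observed <- c(AA = n_AA, AS = n_AS, SS = n_SS)
  expected <- n * c(AA = p^2, AS = 2 * p * q, SS = q^2)
  chi2 <- sum((observed - expected)^2 / expected)
  # 3 genotype classes - 1 - 1 estimated allele frequency = 1 degree of freedom
  p_value <- pchisq(chi2, df = 1, lower.tail = FALSE)
  list(p = p, q = q, expected = expected, chi2 = chi2, p_value = p_value)
}

adults   <- hw_test(400, 249, 5)
children <- hw_test(189, 89, 9)

round(adults$expected, 2)    # expected genotype counts among adults
round(adults$chi2, 2)        # test statistic for adults
round(children$chi2, 2)      # test statistic for children
```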
68 |
69 |
70 |
71 | 3) Under the assumption that the polymorphism has reached a stable
72 | equilibrium, estimate the fitness of the three genotypes. Hint:
73 | Assume that, before selection, the adults had a genotypic distribution equal to their
74 | expected Hardy-Weinberg distribution and estimate their relative
75 | fitness.
76 |
77 |
78 |
79 |
80 | |Adults | HbAHbA | HbAHbS | HbSHbS | Sum |
81 | |----------|:----------------------------|:----------------------------|:----------------------------|---|
82 | |Observed | 400 | 249 | 5 | 654 |
83 | |Expected | 420.64 | 207.71 | 25.64 | 654 |
84 | |Fitness (O/E)|0.951 |1.199 |0.195 | |
85 | |Relative fitness | 0.793 | 1.000 |0.163 | |
86 | |Selection coefficient| 0.207 | 0 |0.837 | |
87 |
88 | We can see that the selection against the HbSHbS genotype is very
89 | severe. With *s* = 0.207 and *t* = 0.837 denoting the selection coefficients
90 | against HbAHbA and HbSHbS, the equilibrium frequency of HbA is
91 |
92 | *p* = *t*/(*s* + *t*) = 0.837/(0.207 + 0.837) = 0.802
93 |
94 | This value is identical to the estimated allele frequency among the
95 | adults. The reason is that we have estimated it under the assumption of
96 | equilibrium.
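The fitness table and the equilibrium frequency can be checked numerically. The sketch below starts from the rounded expected counts in the table; the variable names (`s_coef`, `t_coef`, `p_eq`) are illustrative.

```r
observed <- c(AA = 400, AS = 249, SS = 5)
expected <- c(AA = 420.64, AS = 207.71, SS = 25.64)  # Hardy-Weinberg expectations from the table

fitness     <- observed / expected        # absolute fitness (O/E)
rel_fitness <- fitness / fitness["AS"]    # scaled so that the heterozygote has fitness 1
sel_coef    <- 1 - rel_fitness            # selection coefficients

s_coef <- sel_coef[["AA"]]                # selection against HbAHbA
t_coef <- sel_coef[["SS"]]                # selection against HbSHbS
p_eq   <- t_coef / (s_coef + t_coef)      # equilibrium frequency of HbA under overdominance

round(rel_fitness, 3)
round(p_eq, 3)
```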
97 |
98 |
99 |
100 | In African Americans, one out of 400 suffers from sickle cell anemia.
101 |
102 | 4) Estimate the allele frequencies among them
103 |
104 |
105 |
106 |
107 | *q* = √(1/400) = 0.05
108 |
109 |
110 | 5) Why is it lower than the frequency observed among Africans?
111 |
112 |
113 |
114 |
115 | There is no longer malaria in the USA. It has been eliminated.
116 | Therefore, there is no longer overdominant selection that keeps a high
117 | allele frequency of the deleterious allele. There has been directional
118 | selection against the deleterious allele, which has reduced its
119 | frequency.
120 |
121 |
122 | **Exercise 2**
123 |
124 |
125 |
126 |
127 |
128 | The figure to the right shows the result of 13 repeated experiments with
129 | a chromosomal polymorphism in *Drosophila melanogaster*. It shows the
130 | frequency of one of the two chromosomal forms through the four generations
131 | the experiment lasted. In six of the experiments this form started at a
132 | frequency slightly higher than 0.9, and in seven it started at a frequency
133 | below 0.9. The population size in each experiment was 100.
134 |
135 |
136 | 1) Can the evolution of this system be explained as a result of genetic drift?
137 |
138 |
139 |
140 |
141 | No. It is highly unlikely that genetic drift in a population with a
142 | size of 100 can result in fixation after 4 generations. In addition, all six
143 | experiments starting with a frequency of about 0.9 end up being fixed
144 | for one chromosomal form, while the seven experiments that start with a
145 | frequency below 0.9 all end up being fixed for the other form. Genetic
146 | drift would have a more “random” nature.
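How unlikely fixation by drift alone is can be illustrated with a small simulation. This is a sketch under stated assumptions: a diploid Wright-Fisher population, so that 100 individuals carry 200 copies, a starting frequency of exactly 0.9, and no selection. `simulate_fixation()` is a hypothetical helper.

```r
set.seed(1)

# simulate_fixation() is a hypothetical helper. It returns the fraction of
# replicate populations that are fixed (frequency 0 or 1) after the given
# number of generations of pure drift.
simulate_fixation <- function(p0, N = 100, generations = 4, reps = 10000) {
  copies <- 2 * N                 # diploid: 2N gene copies
  p <- rep(p0, reps)              # every replicate starts at the same frequency
  for (g in seq_len(generations)) {
    p <- rbinom(reps, copies, p) / copies   # binomial sampling of the next generation
  }
  mean(p == 0 | p == 1)
}

frac_fixed <- simulate_fixation(0.9)
frac_fixed
```

Under these assumptions only a very small fraction of replicates is fixed after four generations, which is why the consistent fixation seen in all 13 experiments points to selection rather than drift.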
147 |
148 |
149 | 2) Can the evolution be explained as a result of natural selection?
150 | How does it work? Which genotype has the lowest fitness?
151 |
152 |
153 |
154 |
155 | Yes; there must be underdominance where the heterozygote has a lower
156 | fitness than both homozygotes.
157 |
158 |
159 | **Exercise 3**
160 |
161 |
162 |
163 |
164 |
165 | A geneticist starts an experiment with *Drosophila melanogaster*. He
166 | uses 10 populations, each kept at a constant size of 8 males and 8
167 | females in each generation. In generation 0 all individuals are
168 | heterozygous for the two alleles $A_1$ and $A_2$ at
169 | an autosomal locus. After 19 generations, the following distribution of
170 | the allele frequency of $A_1$ is observed in the 10
171 | populations:
172 |
173 | |0.18 |0.00 |0.18| 0.25| 0.30| 0.19| 0.16| 0.00| 0.15| 0.00|
174 | |-----|-----|----|-----|-----|-----|-----|-----|-----|-----|
175 |
176 | 1) Can this distribution of the allele frequency be explained by
177 | genetic drift?
178 |
179 |
180 |
181 | No. With genetic drift, the allele frequencies would be distributed
182 | randomly over the entire range from 0 to 1. Here, all 10 populations
183 | have an allele frequency of less than 0.5, which is very unlikely
184 | ($0.5^{10} = 0.000977$).
185 |
186 |
187 | 2) Which other evolutionary force has also worked during this
188 | experiment?
189 |
190 |
191 |
192 | Natural selection.
193 |
194 |
195 | After 100 generations, all ten populations were fixed for allele
196 | $A_2$.
197 |
198 | 3) Use this information to explain how the fitness of the three
199 | genotypes, $w_{11}$, $w_{12}$ and $w_{22}$,
200 | are related to each other.
201 |
202 |
203 |
204 | Natural selection has also been involved, in this case in the form of
205 | directional selection, where
206 |
207 | $w_{11} < w_{12} < w_{22}$.
208 |
209 |
210 | **Exercise 4**
211 |
212 |
213 |
214 |
215 |
216 | After the arrival of the Europeans in America,
217 | the California condor (*Gymnogyps californianus*) was severely hunted.
218 | This resulted in a drastic decline in population size, which culminated
219 | in 1987 when the last wild condors were placed in captivity (fourteen
220 | individuals). [Later on, the condor was released again. In 2014, 425 were living in
221 | the wild or in captivity.] Among the progeny of these fourteen individuals
222 | the genetic disease chondrodystrophy (a form of dwarfism) was observed.
223 | In the condor, this disease is inherited at an autosomal locus where
224 | chondrodystrophy is due to a recessive lethal allele.
225 |
226 | 1) What must the frequency of the allele for chondrodystrophy at least
227 | have been among the fourteen individuals who were used to found the
228 | population in captivity?
229 |
230 |
231 |
232 | $q \geq 2/(2 \times 14) = 0.071$
233 |
234 | There must at least have been two heterozygotes among the fourteen
235 | individuals, because an affected offspring inherits the allele from both parents.
236 |
237 |
238 |
239 | The population has since grown in number and reached around a few
240 | hundred. An estimation of the allele frequency for chondrodystrophy
241 | showed a value of 0.09.
242 |
243 | 2) Can the frequency of this lethal allele be caused by one of the
244 | following three forces separately? (In question 3 you will be asked
245 | whether a combination of these forces is needed to explain the
246 | frequency.)
247 |
248 | - mutation
249 | - genetic drift
250 | - natural selection
251 |
252 |
253 |
254 |
255 | - mutation No
256 | - genetic drift Yes
257 | - natural selection No
258 |
259 |
260 |
261 | 3) Is it necessary to consider that two or three of these forces act
262 | together to explain the frequency of this allele?
263 |
264 |
265 |
266 | The allele for chondrodystrophy arose through mutation, and genetic
267 | drift has resulted in the high frequency. (Natural selection
268 | eliminates this allele, and thus would not be able to explain the high
269 | frequency of the allele.)
270 |
271 |
272 | 4) What is the expected frequency of the allele at equilibrium
273 | between mutation and selection? The mutation rate can be assumed to be
274 | $\mu = 10^{-6}$.
275 |
276 |
277 |
278 | The equilibrium between mutation and natural selection is given by
279 |
280 | $q = \sqrt{\mu/s} = \sqrt{10^{-6}/1} = 0.001$.
281 |
282 |
283 | **Exercise 5**
284 |
285 | Cystic fibrosis is caused by a recessive allele at a single autosomal
286 | locus, CFTR (cystic fibrosis transmembrane conductance regulator). In
287 | European populations 1 out of 2500 newborn children is homozygous for
288 | the recessive allele.
289 |
290 | 1) What is the frequency of the recessive allele in these populations?
291 |
292 |
293 |
294 | *q* = √ (1/2500) = 1/50 = 0.02
295 |
296 |
297 | 2) What fraction of all possible parental combinations has a
298 | probability of ¼ for having a child which is homozygous for the
299 | recessive allele?
300 |
301 |
302 |
303 | It must be the combination heterozygote × heterozygote:
304 |
305 | $2pq \times 2pq = (2 \times 0.98 \times 0.02)^2 = 0.0392^2 = 0.0015$
306 |
307 |
308 |
309 | The disease used to be fatal during childhood if it was not treated.
310 | Therefore, it must be assumed that the fitness of the recessive
311 | homozygote has been 0 during the main part of human
312 | evolutionary history.
313 |
314 | 3) Estimate the mutation rate, assuming equilibrium between the
315 | mutation and selection.
316 |
317 |
318 |
319 | $q = \sqrt{\mu/s}$,
320 |
321 | where $\mu$ is the mutation rate and $s$ is the
322 | selection coefficient, which must be 1 since the fitness is 0.
323 |
324 | Therefore,
325 |
326 | $\mu = q^2 s = 0.02^2 \times 1 = 0.0004$
327 |
328 |
329 |
330 | A direct estimate of the mutation rate was $6.7 \times 10^{-7}$, which
331 | is considerably lower than the estimate found in question 3.
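The numbers used in this exercise can be recomputed in R to see how large the gap between the two mutation-rate estimates is. This is a short sketch; the variable names are illustrative.

```r
q <- sqrt(1 / 2500)                 # frequency of the recessive allele
p <- 1 - q
carrier_matings <- (2 * p * q)^2    # fraction of heterozygote x heterozygote matings

s <- 1                              # recessive homozygote has fitness 0
mu_implied <- q^2 * s               # mutation rate implied by mutation-selection balance
mu_direct  <- 6.7e-7                # direct estimate quoted in the text

fold_difference <- mu_implied / mu_direct
c(q = q, carrier_matings = carrier_matings, mu_implied = mu_implied)
fold_difference
```

The implied rate exceeds the direct estimate by a factor of several hundred, so mutation-selection balance alone cannot account for the observed allele frequency.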
332 |
333 | 4) Which mechanism(s) may explain the high frequency of the recessive
334 | deleterious allele?
335 |
336 |
337 |
338 | It could be overdominant selection where the fitness of heterozygous
339 | carriers is higher than that of normal homozygotes. There have been several
340 | hypotheses for this: increased resistance against tuberculosis or
341 | cholera has been suggested, but there are no hard data to support it.
342 |
343 |
344 |
345 |
--------------------------------------------------------------------------------
/NaturalSelection/browser.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/browser.png
--------------------------------------------------------------------------------
/NaturalSelection/fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig1.png
--------------------------------------------------------------------------------
/NaturalSelection/fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig2.png
--------------------------------------------------------------------------------
/NaturalSelection/fig3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig3.png
--------------------------------------------------------------------------------
/NaturalSelection/fig4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig4.png
--------------------------------------------------------------------------------
/NaturalSelection/fig5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig5.png
--------------------------------------------------------------------------------
/NaturalSelection/sel1sickle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel1sickle.png
--------------------------------------------------------------------------------
/NaturalSelection/sel2underdominance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel2underdominance.png
--------------------------------------------------------------------------------
/NaturalSelection/sel3drosophila.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel3drosophila.png
--------------------------------------------------------------------------------
/NaturalSelection/sel4condor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel4condor.png
--------------------------------------------------------------------------------
/NaturalSelection/selectionAA.md:
--------------------------------------------------------------------------------
1 |
2 | # Selection scan exercises Part 1
3 | **Anders Albrechtsen**
4 |
5 |
6 | Go to https://genome.ucsc.edu/s/aalbrechtsen/hg18Selection
7 |
8 |
9 |
10 |
11 | This is a browser that can be used to view genomic data. With the above link you will view both genes and Tajima's D for 3 populations.
12 | - Individuals of African descent are named AD
13 | - Individuals of European descent are named ED
14 | - Individuals of Chinese descent are named XD
15 |
16 | You are looking at a random 11 Mb region of the genome. Try to get a sense of the values that Tajima's D takes along the genome for the 3 populations.
17 | - You can move to another part of the chromosome by clicking once on the chromosome arm
18 | - You can also change chromosome in the search field
19 | - You can zoom in by dragging the mouse
20 |
21 |
22 |
23 |
24 |
25 |
26 | Take note of the highest and lowest values of Tajima's D that you observed.
27 |
28 |
29 |
30 | Try to find the SLC45A2 gene (use the search field, and choose the top option). This is one of the most strongly selected genes in Europeans.
31 | - Is this an extreme value of Tajima's D?
32 |
33 |
34 | Try to go to the LCT locus.
35 | - Does this have an extreme value of Tajima's D?
36 | - What can you conclude about the performance of Tajima's D?
37 |
38 | # Part 2
39 |
40 |
41 |
42 | Open R and load data
43 | ## in R
44 | ```R
45 | ## paste the following code in R ( you do not need to understand it)
46 | .libPaths( c( .libPaths(), "~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/Rlib/") )
47 | setwd("~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/selection/selectionScan")
48 | source("server.R")
49 |
50 | ```
51 |
52 | **Exercise**
53 |
54 | Let us see if we can do better than Tajima's D by using the PBS statistic. First select 3 populations from
55 | - NAT - Native Americans (PERU+Mexico)
56 | - CHB – East Asian - Han Chinese
57 | - CEU – Central Europeans
58 | - YRI – African - Nigerians
59 |
60 | The first population is the one whose branch you are investigating. The two others are the ones you are comparing it to. Choose CEU as the first and choose CHB and YRI as the two others.
61 |
62 |
63 | ```R
64 | #### choose populations
65 | #pops 1=NAT, 2=CHB, 3=CEU, 4=YRI
66 | myPops <- c(3,2,4)
67 | ```
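To make the role of the three populations concrete, the sketch below computes PBS from pairwise Fst values using the commonly cited definition (Yi et al. 2010): each Fst is transformed to a divergence time T = -log(1 - Fst), and PBS is the length of the branch leading to the first population. `pbs()` is a hypothetical helper and the Fst values are made up for illustration; the course functions (`PBSmanPlot`, `PBSmanRegion`) may compute the statistic differently.

```r
# pbs() is a hypothetical helper, not the code behind PBSmanPlot.
# fst_12, fst_13: Fst between the focal population (1) and the two others (2, 3)
# fst_23: Fst between the two comparison populations
pbs <- function(fst_12, fst_13, fst_23) {
  t12 <- -log(1 - fst_12)   # divergence between populations 1 and 2
  t13 <- -log(1 - fst_13)   # divergence between populations 1 and 3
  t23 <- -log(1 - fst_23)   # divergence between populations 2 and 3
  (t12 + t13 - t23) / 2     # branch length specific to population 1
}

# Made-up example: the focal population is strongly differentiated from both others
pbs_selected <- pbs(fst_12 = 0.5, fst_13 = 0.5, fst_23 = 0.1)
# Made-up example: all three populations are equally differentiated
pbs_neutral  <- pbs(fst_12 = 0.1, fst_13 = 0.1, fst_23 = 0.1)
```

A locus where only the focal population has diverged gives a large PBS, while a locus that is equally differentiated among all three populations does not.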
68 |
69 |
70 | First, let's get an overview of the whole genome by making a Manhattan plot
71 |
72 |
73 |
74 |
75 | ```R
76 | #### genome-wide Manhattan plot of PBS for the chosen populations
77 | PBSmanPlot(myPops)
78 | ```
79 |
80 | Note which chromosomes have extreme values. A high value of PBS means a long branch length.
81 | To view a single chromosome, use the PBS region function (`PBSmanRegion`).
82 |
83 | Choose the chromosome with the highest PBS value and set the starting position to -1 to get the whole chromosome
84 |
85 | e.g.
86 |
87 |
88 | ```R
89 | # see entire chromosome 1
90 | PBSmanRegion(myPops,chrom=1,start=-1)
91 | ```
92 |
93 | Zoom in to the peak by changing start and end position.
94 |
95 | ```R
96 | # see region between 20Mb and 21Mb on chromosome 1
97 | PBSmanRegion(myPops,chrom=1,start=20,end=21)
98 | ```
99 |
100 | - Locate the most extreme regions of the genome and zoom in
101 | - Identify the gene with the highest PBS value.
102 | - What does the gene do?
103 | - Try the LCT gene (the mutations are located in the adjacent MCM6 gene). See below on how to get the position
104 | - How does this compare with Tajima's D?
105 |
106 | If you have time you can try other genes. Here are the top ones for humans. You can find the location of the genes using, for example, the UCSC browser https://genome-euro.ucsc.edu/cgi-bin/hgGateway (choose the human GRCh37/hg19 genome). Note that there are some populations that you cannot test because they are not represented in the data, e.g. Tibetans, Ethiopians, Inuit, Siberians.
107 |
108 |
109 |
110 |
111 |
112 |
113 |
114 |
--------------------------------------------------------------------------------
/NaturalSelection/shinySelectionSim.R:
--------------------------------------------------------------------------------
1 | library(shiny)
2 | library(ggplot2)
3 |
4 | # Define UI for application
5 | ui <- fluidPage(
6 | titlePanel("Allele Frequency Trajectory Simulation"),
7 | sidebarLayout(
8 | sidebarPanel(
9 | numericInput("init_freq", "Initial Allele Frequency:", 0.5, min = 0, max = 1),
10 | numericInput("selection_coeff", "Selection Coefficient (s):", 0.1, min = 0, max = 1),
11 | numericInput("pop_size", "Population Size (N):", 1000, min = 100),
12 | numericInput("generations", "Number of Generations:", 50, min = 10),
13 | numericInput("num_simulations", "Number of Simulations:", 5, min = 1),
14 | numericInput("y_min", "Y-axis Minimum:", 0, min = 0, max = 1),
15 | numericInput("y_max", "Y-axis Maximum:", 1, min = 0, max = 1),
16 | actionButton("simulate", "Simulate")
17 | ),
18 | mainPanel(
19 | plotOutput("freqPlot")
20 | )
21 | )
22 | )
23 |
24 | # Define server logic
25 | server <- function(input, output) {
26 | simulate_data <- eventReactive(input$simulate, {
27 | init_freq <- input$init_freq
28 | selection_coeff <- input$selection_coeff
29 | pop_size <- input$pop_size
30 | generations <- input$generations
31 | num_simulations <- input$num_simulations
32 |
33 | simulations <- lapply(1:num_simulations, function(i) {
34 | allele_freq <- numeric(generations)
35 | allele_freq[1] <- init_freq
36 |
37 | for (j in 2:generations) {
38 | allele_count <- rbinom(1, pop_size, allele_freq[j-1]) # drift: binomial sampling of pop_size allele copies
39 | allele_freq[j] <- allele_count / pop_size
40 | allele_freq[j] <- allele_freq[j] + selection_coeff * allele_freq[j] * (1 - allele_freq[j]) # selection: deterministic change s*p*(1-p)
41 | }
42 |
43 | data.frame(Generation = 1:generations, Frequency = allele_freq, Simulation = i)
44 | })
45 |
46 | do.call(rbind, simulations)
47 | })
48 |
49 | output$freqPlot <- renderPlot({
50 | req(simulate_data())
51 | ggplot(simulate_data(), aes(x = Generation, y = Frequency, group = Simulation, color = as.factor(Simulation))) +
52 | geom_line(alpha = 0.7) +
53 | labs(title = "Allele Frequency Trajectory", x = "Generation", y = "Allele Frequency", color = "Simulation") +
54 | scale_y_continuous(limits = c(input$y_min, input$y_max)) +
55 | theme_minimal()
56 | })
57 | }
58 |
59 | # Run the application
60 | shinyApp(ui = ui, server = server)
61 |
--------------------------------------------------------------------------------
/NaturalSelection/z:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/NucleotideDiversiteyExercise/Selection_020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/Selection_020.png
--------------------------------------------------------------------------------
/NucleotideDiversiteyExercise/chimp_distribution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/chimp_distribution.png
--------------------------------------------------------------------------------
/NucleotideDiversiteyExercise/ex1nucdivfigure1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/ex1nucdivfigure1.png
--------------------------------------------------------------------------------
/NucleotideDiversiteyExercise/slidesGeneticDiversityExercise .pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/slidesGeneticDiversityExercise .pdf
--------------------------------------------------------------------------------
/Population_history/population_history_lab_2020_noAnswers.md:
--------------------------------------------------------------------------------
1 | Simulations and population history lab
2 | ======================================
3 |
4 | Rasmus Heller, Feb 2020
5 |
6 | **Program**
7 |
8 | Part I: Quick demonstration of fastsimcoal, a coalescent simulator software.
9 |
10 | Part II: Looking at sfs’s from three different population histories.
11 |
12 | Part III: Looking at coalescence trees from the same histories.
13 |
14 | Part IV: Making a skyline plot from HIV virus data.
15 |
16 | Part V: Evaluating two hypotheses of Bornean elephant population history.
17 |
18 | **Aim**
19 |
20 | The aim of this practical lab is to use coalescent simulations to understand the
21 | connection between population history, coalescent trees and genetic data.
22 | Specifically you will learn:
23 |
24 | 1) how to set up a coalescent simulation using fastsimcoal.
25 |
26 | 2) what sfs’s and coalescent trees from different population histories look
27 | like.
28 |
29 | 3) how to interpret a skyline plot of population history.
30 |
31 | 4) how to use simulations and real data to choose between two hypotheses about population history.
32 |
33 | We are going to use the coalescent simulator software fastsimcoal to examine how
34 | different population histories affect genetic data. **The necessary files are in
35 | the ‘exercises/populationHistory’ folder** (/home/
36 | /groupdirs/SCIENCE-BIO-Popgen_Course/exercises/populationHistory/). Please copy
37 | this entire folder to your home directory by writing this:
38 |
39 | ```bash
40 | cp -r groupdirs/SCIENCE-BIO-Popgen_Course/exercises/populationHistory ./exercises/
41 | ```
42 |
43 | Now move to the new folder location:
44 |
45 | ```bash
46 | cd exercises/populationHistory
47 | ```
48 |
49 | **Part I: perform simulations**
50 |
51 | There are three .par files among the files you just copied. We will take a look
52 | at them together.
53 |
54 | Now, navigate to the ‘populationHistory’ folder in your home directory and run
55 | the following command:
56 |
57 | ```bash
58 | fsc26 -i constant1.par -n10 -X -s0 -I -T -d
59 | ```
60 |
61 | This performs the simulations from the constant1.par file. Run the same command
62 | for the two other .par files, decline1 and expand1.
63 |
64 | **Part II: plot sfs**
65 |
66 | Among the data files output from the previous commands are three sfs files
67 | output from fastsimcoal, one from each of three population histories: constant,
68 | declining and expanding population sizes. We will plot the sfs files and compare
69 | among the histories. Navigate to the output folder (/constant1/ first, after
70 | that /decline1/ and /expand1/). Go to the constant1 folder and open R:
71 |
72 | ```bash
73 | cd constant1
74 |
75 | R
76 | ```
77 |
78 | Now paste this code:
79 |
80 | ```R
81 | constant<-read.table("constant1_DAFpop0.obs", skip=2) #read in the constant.sfs.
82 | constant1.sfs<-as.matrix(constant[,2:10]) #extract the variable sites from the ten sfs.
83 | norm1.constant<-apply(constant1.sfs,1,function(x) x/sum(x)) #getting proportions instead of counts.
84 | barplot(norm1.constant, beside=TRUE, main="10 SFS's from a constant population, simulating 100 x 10,000bp", ylim=c(0, 0.5), ylab="Proportion of sites") #plot the sfs.
85 | ```
86 |
87 | Question: Is there a lot of variation across the ten simulations?
88 |
89 | Question: What do you think determines how different the SFS's are?
90 |
91 | Now we look at the decline1 scenario. Without closing your R session do this:
92 |
93 | ```R
94 | setwd("../decline1/") #navigates to the decline1 simulation folder.
95 | dev.new() #lets you open a new graphic device while keeping the other open.
96 | decline<-read.table("decline1_DAFpop0.obs", skip=2) #next three lines as above.
97 | decline1.sfs<-as.matrix(decline[,2:10])
98 | norm1.decline<-apply(decline1.sfs,1,function(x) x/sum(x)) #getting proportions instead of counts.
99 | barplot(norm1.decline, beside=TRUE, main="10 SFS's from a declining population, simulating 100 x 10,000bp", ylim=c(0, 0.5), ylab="Proportion of sites") #plot the sfs
100 | ```
101 |
102 | And the expand1 scenario:
103 |
104 | ```R
105 | setwd("../expand1/") #navigates to the appropriate simulation folder.
106 | dev.new()
107 | expand<-read.table("expand1_DAFpop0.obs", skip=2) #next three lines as above.
108 | expand1.sfs<-as.matrix(expand[,2:10])
109 | norm1.expand <- apply(expand1.sfs,1,function(x) x/sum(x)) #getting proportions instead of counts.
110 | barplot(norm1.expand, beside=TRUE, main="10 SFS's from an expanding population, simulating 100 x 10,000bp", ylim=c(0, 0.5), ylab="Proportion of sites") #plot the sfs
111 | ```
112 |
113 | Question: Are there obvious differences in the SFS among the three different scenarios?
114 |
115 | Question: Do the SFS look as expected (see for example Fig. 3.9 in Nielsen & Slatkin)?
116 |
117 | **Part III: examine coalescence trees**
118 |
119 | From our simulations we also created 10*100 coalescent trees from each of the
120 | three scenarios. The files containing the trees (in a text format) are called
121 | [scenario name]_1_true_trees.trees. We will browse through them and compare
122 | trees among histories.
123 |
124 | Run the following R code to load the 1000 trees from the constant1 scenario:
125 |
126 | ```R
127 | setwd("../constant1/") #navigates to the appropriate simulation folder.
128 | library(ape)
129 | conTrees<-read.nexus("constant1_1_true_trees.trees")
130 | plot(conTrees[seq(1,1000,by=100)], show.tip.label=F) #this plots the first of 100 trees from each simulation one by one and allows you to switch to the next one by hitting ENTER.
131 | ```
132 |
133 | Question: Looking at the way the simulations are done, what does each tree represent?
134 | Is it the tree of one locus in each simulation, or the combined tree of all loci
135 | in each simulation? Are the trees identical? Similar?
136 |
137 | Next, we will plot just the first of the 1000 trees:
138 |
139 | ```R
140 | plot(conTrees[[1]], show.tip.label=F)
141 | add.scale.bar() #this adds a scale bar with units in numbers of mutations.
142 | ```
143 |
144 | Now, run the following to view the first tree from the decline1 scenario:
145 |
146 | ```R
147 | setwd("../decline1/") #navigates to the appropriate simulation folder.
148 | dev.new()
149 | decTrees<-read.nexus("decline1_1_true_trees.trees")
150 | plot(decTrees[[1]] , show.tip.label=F)
151 | add.scale.bar()
152 | ```
153 |
154 | Question: Are there differences between the decline and the constant trees? Are they as
155 | you expected?
156 |
157 | Lastly, here is the code to plot the first tree from the expand1 simulation:
158 |
159 | ```R
160 | setwd("../expand1/") #navigates to the appropriate simulation folder.
161 | dev.new()
162 | expTrees<-read.nexus("expand1_1_true_trees.trees")
163 | plot(expTrees[[1]] , show.tip.label=F)
164 | add.scale.bar()
165 | ```
166 |
167 | Question: Are there obvious differences between the expansion and the decline trees?
168 | Are they as you expected?
169 |
170 | Next we will construct skyline plots of population size from the simulated
171 | trees. Let’s start by looking at one of the expansion trees:
172 |
173 | ```R
174 | dev.new()
175 | sky1<-skyline(expTrees[[1]], 200) #this takes only the 1st of 1000 expansion trees and makes the skyline plot calculations. The second parameter controls the smoothing and was chosen by us for this specific situation.
176 | plot(sky1, subst.rate=0.00001, main="Skyline plot, expansion tree 1") #plots the skyline object. The second parameter should equal the simulated mutation rate.
177 | ```
178 |
179 | Question: Do you see an expansion? Where do you think the signal is coming from?
180 |
181 | Question: Compare the 1st expansion tree with its skyline plot. Can you explain
182 | why the signal in the skyline plot becomes noisy at some point back in time?
183 |
184 | **Part IV: inferring HIV virus population history**
185 |
186 | We are going to look at a data set of HIV isolates from central
187 | Africa. We are interested in finding out whether the HIV population size has
188 | changed over time, and over which time scale. The following is R code to load
189 | and analyze a coalescent tree from HIV data using APE. It will give us
190 | various population trees and a skyline plot:
191 |
192 | ```R
193 | library(ape) #loads the package 'ape' ('install.packages("ape")' will install it if it is not already installed).
194 | data("hivtree.newick") #get example data, Newick tree file.
195 | tree1<-read.tree(text = hivtree.newick) #reads in the data as a 'tree' object in R.
196 | plot(tree1, show.tip.label=F) #plots the tree as a phylogram.
197 | ```
198 |
199 | Question: Pause here to look at the tree. What would you say about the population
200 | history of HIV from this tree?
201 |
202 | ```R
203 | sky2<-skyline(tree1, 0.0119) #constructs a 'skyline' object (generalized skyline plot) with estimated popsize for collapsed coalescent intervals.
204 | dev.new()
205 | plot(sky2, show.years=TRUE, subst.rate=0.0023, present.year = 1997) #plot generalized skyline plot.
206 | ```
207 |
208 | Question: Describe the overall population size history of HIV virus going back in time
209 | from 1997. How does this relate to what you know about the history of HIV/AIDS?
210 |
211 | **Part V: when did the elephant arrive on Borneo?**
212 |
213 | We will evaluate which of two hypotheses regarding the founding time of the
214 | population of elephants on Borneo is more likely. See our paper here:
215 | .
216 |
217 | Hypothesis 1: a small number of elephants were introduced recently (in the 17th
218 | century) by humans.
219 |
220 | Hypothesis 2: a small number of elephants migrated naturally to Borneo during
221 | the Last Glacial Maximum (LGM, about 20,000 years ago) when Borneo was connected
222 | by land bridge to mainland Asia.
223 |
224 | This is strongly debated in the scientific and the public communities, with
225 | hypothesis 1 being the consensus view. The question has implications for how the
226 | Bornean elephant is managed as well as for understanding the biogeography of the
227 | Indonesian islands. We will use data obtained from microsatellites and
228 | simulations to evaluate which of the hypotheses is more likely.
229 |
230 | I have used coalescent simulations (see Box 5.1 in N&S) to simulate 1000 data
231 | sets of 18 microsatellites under each of the two hypotheses. The two simulation
232 | output files are among the files you downloaded in the beginning of the
233 | exercise.
234 |
235 | Navigate to the populationHistory folder in your /home/ directory. Open R. We
236 | will plot the distribution of a certain summary statistic calculated from the
237 | simulated data and see how it agrees with the value we have obtained from the
238 | actual data from the elephant population. This summary statistic is allelic
239 | richness (K), or the mean number of alleles in the population across 18 microsat
240 | loci. We plot the distribution of K from both simulated scenarios:
241 |
242 | ```R
243 | sulu<-read.table("sulu_outSumStats.txt", head=T)
244 | lgm<-read.table("lgm_outSumStats.txt", head=T)
245 | plot(density(sulu$K_1), xlim=c(2,4.5), main="Distribution of K in the introduction scenario")
246 | abline(v= 4.38889, col="red") #this plots the actual value of K from real microsat data in the same plot as the simulated values.
247 | dev.new()
248 | plot(density(lgm$K_1), main="Distribution of K in the LGM scenario")
249 | abline(v= 4.38889, col="red")
250 | ```
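
For reference, the summary statistic described above (the mean number of alleles across loci) could be computed along the following lines. This is only a sketch on invented data: `toy.loci`, `n.alleles` and `K.toy` are hypothetical names, the allele sizes are made up, and three loci are used instead of 18.

```R
# Hypothetical microsatellite data: one vector of allele sizes per locus.
toy.loci <- list(
  locus1 = c(120, 120, 124, 128, 124, 120),
  locus2 = c(200, 204, 204, 204, 200, 200),
  locus3 = c(88, 90, 92, 94, 90, 88)
)

# Number of distinct alleles observed at each locus
n.alleles <- sapply(toy.loci, function(a) length(unique(a)))

# K: mean number of alleles across loci
K.toy <- mean(n.alleles)
K.toy
```

The simulation output files already contain this statistic in the `K_1` column, so nothing like this needs to be run for the exercise itself.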
251 |
252 | Question: Look at the distribution of allelic richness (K_1) from two different
253 | scenarios. Compare it with the observed value from the real data (red line in
254 | each plot). Which scenario appears more in agreement with the data? What would
255 | you conclude from this observation?
256 |
--------------------------------------------------------------------------------
/Population_history/z:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Population_structure/ExerciseFst2024WA.md:
--------------------------------------------------------------------------------
1 | # Exercises - measuring population differentiation and detecting signatures of selection using FST
2 |
3 | ## Genis Garcia Erill & Renzo Balboa
4 |
5 | ## Program
6 |
7 | - Read genotype data into R and apply R functions to estimate FST.
8 | - Use FST in windows across the genome and Manhattan plots to detect local signatures of natural selection.
9 | - Interpret and discuss the results from both analyses in a biological context.
10 |
11 | ## Learning Objectives
12 |
13 | - Learn to estimate FST between population pairs using genotype data.
14 | - Learn to interpret FST estimates in relation to processes of population divergence.
15 | - Learn to visualise local FST across the genome and identify candidate genes under selection.
16 |
17 | ## Recommended Reading
18 |
19 | Chapter 4 - Population Subdivision (pp. 59-76).
20 |
21 | In: Nielsen, R. and Slatkin, M. (2013). _An Introduction to Population Genetics: Theory and Applications_. Sinauer.
22 |
23 | ## Data and setup
24 |
25 | For this exercise, we will use the same dataset that was used on Monday for [inferring population structure](https://github.com/populationgenetics/exercises/blob/master/Population_structure/ExerciseStructure_2023.md).
26 | The following commands will create a new folder and copy the dataset to that new folder,
27 | but you are free to work in the `structure` folder that you created in the last exercise (in which
28 | case, you can skip the following commands).
29 |
30 | ```bash
31 | # make a directory called structure_fst inside ~/exercises/ and change to that directory
32 | mkdir -p ~/exercises/structure_fst
33 | cd ~/exercises/structure_fst
34 |
35 | # Download/copy data (remember the . at the end)
36 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/structure/pa/* .
37 |
38 | # List the downloaded files
39 | ls -l
40 | ```
41 |
42 | Today, we are going to calculate the fixation index (*FST*), a widely used statistic in population
43 | genetics, between subspecies of chimpanzees. This is a measure of population differentiation and thus, we
44 | can use it to distinguish populations in a quantitative way.
45 | It is worth noticing that what *FST* measures is the
46 | reduction in heterozygosity within observed populations when compared to a pooled population.
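
To make this definition concrete, here is a minimal sketch for a single biallelic locus in two equally sized populations. The allele frequencies are invented, the names `p`, `H_S`, `H_T` and `F_ST` are chosen for illustration only, and this simple calculation is not the Weir and Cockerham estimator used later in the exercise.

```R
# Hypothetical frequencies of one allele in two equally sized populations
p <- c(0.2, 0.6)

# Expected heterozygosity within each population (2pq), averaged over populations
H_S <- mean(2 * p * (1 - p))

# Expected heterozygosity in the pooled population, based on the mean frequency
p_bar <- mean(p)
H_T <- 2 * p_bar * (1 - p_bar)

# FST as the relative reduction in heterozygosity
F_ST <- (H_T - H_S) / H_T
F_ST
```

The more the allele frequencies differ between the populations, the larger the gap between `H_T` and `H_S`, and the larger the resulting value.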
47 |
48 | There are several methods for estimating *FST* from genotype data. We will not
49 | cover them in the course but if you are interested in getting an overview of some of these
50 | estimators and how they differ, you can have a look at [this article](https://genome.cshlp.org/content/23/9/1514.full.pdf).
51 |
52 | Here, we use the [Weir and Cockerham *FST* estimator from 1984](https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1558-5646.1984.tb05657.x) to
53 | calculate *FST*. Again, the theory behind it and the estimator itself are
54 | not directly part of the course but if you are interested, you can find the formula that is implemented in the following R function in either
55 | the Weir and Cockerham (1984) article and/or in equation 6 from the Bhatia et al. (2013) article linked above.
56 |
57 | Open R and copy/paste the following function. You do not need to understand what the code does (but are welcome to try if you are interested, and ask if you have questions):
58 |
59 | #### \>R
60 | ```R
61 | WC84<-function(x,pop){
62 | # function to estimate Fst using Weir and Cockerham estimator.
63 | # x is NxM genotype matrix, pop is N length vector with population assignment for each sample
64 | # returns list with fst between population per M snps (theta) and other variables
65 |
66 | ###number ind in each population
67 | n<-table(pop)
68 | ###number of populations
69 | npop<-nrow(n)
70 | ###average sample size of each population
71 | n_avg<-mean(n)
72 | ###total number of samples
73 | N<-length(pop)
74 | ###frequency in samples
75 | p<-apply(x,2,function(x,pop){tapply(x,pop,mean)/2},pop=pop)
76 | ###average frequency in all samples (apply(x,2,mean)/2)
77 | p_avg<-as.vector(n%*%p/N )
78 | ###the sample variance of allele 1 over populations
79 | s2<-1/(npop-1)*(apply(p,1,function(x){((x-p_avg)^2)})%*%n)/n_avg
80 | ###average heterozygotes
81 | # h<-apply(x==1,2,function(x,pop)tapply(x,pop,mean),pop=pop)
82 | #average heterozygote frequency for allele 1
83 | # h_avg<-as.vector(n%*%h/N)
84 | ###faster version than above:
85 | h_avg<-apply(x==1,2,sum)/N
86 | ###nc (see page 1360 in Weir and Cockerham, 1984)
87 | n_c<-1/(npop-1)*(N-sum(n^2)/N)
88 | ###variance between populations
89 | a <-n_avg/n_c*(s2-(p_avg*(1-p_avg)-(npop-1)*s2/npop-h_avg/4)/(n_avg-1))
90 | ###variance between individuals within populations
91 | b <- n_avg/(n_avg-1)*(p_avg*(1-p_avg)-(npop-1)*s2/npop-(2*n_avg-1)*h_avg/(4*n_avg))
92 | ###variance within individuals
93 | c <- h_avg/2
94 | ###inbreeding (F_it)
95 | F <- 1-c/(a+b+c)
96 | ###(F_st)
97 | theta <- a/(a+b+c)
98 | ###(F_is)
99 | f <- 1-c/(b+c)
100 | ###weighted average of theta
101 | theta_w<-sum(a)/sum(a+b+c)
102 | list(F=F,theta=theta,f=f,theta_w=theta_w,a=a,b=b,c=c,total=c+b+a)
103 | }
104 | ```
105 |
106 | Do not close R.
107 |
108 | ## Measuring population differentiation with *FST*
109 |
110 | Now we will read in our data and apply the function above to each of the three pairs of subspecies to estimate their
111 | *FST*. We want to make three comparisons.
112 |
113 |
114 | ```R
115 | library(snpMatrix)
116 |
117 | # read genotype data using read.plink function from snpMatrix package
118 | data <- read.plink("pruneddata")
119 |
120 | # extract genotype matrix, convert to normal R integer matrix
121 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data))
122 |
123 | # original format is 0: missing, 1: hom minor, 2: het, 3: hom major
124 | # convert to NA: missing, 0: hom minor, 1: het, 2: hom major
125 | geno[geno==0] <- NA
126 | geno <- geno - 1
127 |
128 | # keep only SNPs without missing data
129 | g <- geno[,complete.cases(t(geno))]
130 | ```
131 |
132 | Let's stop for a moment and look at the dimensions of the genotype matrix before and after filtering SNPs with missing data:
133 |
134 | ```R
135 | # dimensions before filtering
136 | dim(geno)
137 |
138 | # dimensions after filtering
139 | dim(g)
140 | ```
141 |
142 | **Q1:** How many SNPs and individuals are there before and after filtering? How many SNPs did we initially have with missing data?
143 |
144 |
145 |
146 | Let's continue in the same R session:
147 |
148 | ``` R
149 | # load population information
150 | popinfo <- read.table("pop.info", stringsAsFactors=F, col.names=c("pop", "ind"))
151 |
152 | # check which subspecies we have
153 | unique(popinfo$pop)
154 |
155 | # save names of the three subspecies
156 | subspecies <- unique(popinfo$pop)
157 |
158 | # check which individuals belong to each subspecies
159 | sapply(subspecies, function(x) popinfo$ind[popinfo$pop == x])
160 | ```
161 |
162 | **Q2:** How many samples do we have from each subspecies? (Of course, this is easy to do by physically counting the number of elements in the vectors we just printed, but can you edit the code above to print the number of individuals for each subspecies?)
163 |
164 |
165 | Let's continue in R and estimate the FST values for each pair of subspecies:
166 |
167 |
168 | ``` r
169 | # get all pairs of subspecies
170 | subsppairs <- t(combn(subspecies, 2))
171 |
172 | # apply fst function to each of the three subspecies pairs
173 | fsts <- apply(subsppairs, 1, function(x) WC84(g[popinfo$pop %in% x,], popinfo$pop[popinfo$pop %in% x]))
174 |
175 | # name each fst
176 | names(fsts) <- apply(subsppairs, 1, paste, collapse="_")
177 |
178 | # print global fsts for each pair
179 | lapply(fsts, function(x) x$theta_w)
180 |
181 | ```
182 |
183 | Do not close R.
184 |
185 |
186 |
187 |
188 | **Q3:** Does population differentiation correlate with the geographical
189 | distance between subspecies? (You can find the geographical distribution of each subspecies using [Figure 1 from Monday](https://github.com/populationgenetics/exercises/blob/master/Population_structure/ExerciseStructure_2023.md#inferring-chimpanzee-population-structure-and-admixture-using-exome-data)).
190 |
191 |
192 |
193 | **Q4:** The troglodytes and schweinfurthii populations have the same
194 | divergence time with verus, but based on *FST*, schweinfurthii has slightly increased differentiation from verus. Based on what we learned in today's lecture, what factors do you think could explain the difference?
195 |
196 |
197 |
198 |
199 | ## Scanning for loci under selection using an *FST* outlier approach
200 |
201 | In the previous section, we estimated *FST* across all SNPs for which we have data and then estimated
202 | a global *FST* as the average across all SNPs. Now we will visualise local *FST* in sliding windows across
203 | the genome with the aim of finding regions with unusually large *FST*. This is a common approach that can reveal
204 | candidate SNPs and genes that may have been under positive selection in different populations.
205 |
206 | First, we will define a function to generate a Manhattan plot of local *FST* values across the genome in sliding windows.
207 | Copy the following function; you do not need to understand it (but are welcome to try if you are interested, and ask if you have questions):
208 |
209 |
210 | ``` R
211 | manhattanFstWindowPlot <- function(mainv, xlabv, ylabv, ylimv=NULL, window.size, step.size,chrom, fst, colpal = c("lightblue", "darkblue")){
212 |
213 | chroms <- unique(chrom)
214 | step.positions <- c()
215 | win.chroms <- c()
216 |
217 | for(c in chroms){
218 | whichpos <- which(chrom==c)
219 | chrom.steps <- seq(whichpos[1] + window.size/2, whichpos[length(whichpos)] - window.size/2, by=step.size)
220 | step.positions <- c(step.positions, chrom.steps)
221 | win.chroms <- c(win.chroms, rep(c, length(chrom.steps)))
222 | }
223 |
224 | n <- length(step.positions)
225 | fsts <- numeric(n)
226 | # estimate per window weighted fst
227 | for (i in 1:n) {
228 | chunk_a <- fst$a[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)]
229 | chunk_b <- fst$b[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)]
230 | chunk_c <- fst$c[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)]
231 | fsts[i] <- sum(chunk_a) / sum(chunk_a + chunk_b + chunk_c)
232 | }
233 |
234 |
235 | plot(x=1:length(fsts),y=fsts,main=mainv,xlab=xlabv,ylab=ylabv,cex=1,
236 | pch=20, cex.main=1.25, col=colpal[win.chroms %% 2 + 1], xaxt="n")
237 |
238 | yrange <- range(fsts)
239 |
240 | text(y=yrange[1] - c(0.05, 0.07) * diff(yrange) * 2, x=tapply(1:length(win.chroms), win.chroms, mean), labels=unique(win.chroms), xpd=NA)
241 |
242 | zz <- fsts[!is.na(fsts)]
243 | abline(h=quantile(zz,0.999,na.rm=TRUE),col="red", lty=2, lwd=2)
244 | abline(h=mean(zz), lty=2, lwd=2)
245 | }
246 | ```
247 |
248 | Do not close R.
249 |
250 | Using this function, we will now produce a Manhattan plot for each of the three subspecies pairs:
251 |
252 |
253 | ``` R
254 | # read bim file to get info on snp location
255 | bim <- read.table("pruneddata.bim", h=F, stringsAsFactors=F)
256 |
257 | # keep only sites without missing data (to get same sites we used for fst)
258 | bim <- bim[complete.cases(t(geno)),]
259 |
260 | # keep chromosome and bp coordinate of each snp and define pairnames
261 | snpinfo <- data.frame(chr=bim$V1, pos=bim$V4)
262 | pairnames <- apply(subsppairs, 1, paste, collapse=" ")
263 |
264 | # group snps in windows of 10, sliding the window one snp at a time
265 | windowsize <- 10
266 | steps <- 1
267 |
268 | # make the plots
269 | par(mfrow=c(3,1))
270 | for(pair in 1:3){
271 | mainvv = paste("Sliding window Fst:", pairnames[pair], "SNPs =", length(fsts[[pair]]$theta), "Win: ", windowsize, "Step: ", steps)
272 | manhattanFstWindowPlot(mainvv, "Chromosome", "Fst", window.size=windowsize, step.size=steps, fst =fsts[[pair]], chrom=snpinfo$chr)
273 | }
274 |
275 | ```
276 |
277 | If you want to save the plot on the server as a png (so you can then download it to your own computer, or visualise it on the server), you can use the following code:
278 |
279 | ``` R
280 | bitmap("manhattan.png", w=8, h=8, res=300)
281 | par(mfrow=c(3,1))
282 | for(pair in 1:3){
283 | mainvv = paste("Sliding window Fst:", pairnames[pair], "SNPs =", length(fsts[[pair]]$theta), "Win: ", windowsize, "Step: ", steps)
284 | manhattanFstWindowPlot(mainvv, "Chromosome", "Fst", window.size=windowsize, step.size=steps, fst =fsts[[pair]], chrom=snpinfo$chr)
285 | }
286 | dev.off()
287 | ```
288 |
289 | Do not close R.
290 |
291 |
292 |
293 | In the plot we have just generated, the black dotted line indicates the mean FST value across all windows and the red dotted line the 99.9th percentile (i.e. only 0.1% of the windows have an FST above that value). One way to define outlying windows is to only consider windows that have an FST above the 99.9th percentile (this cutoff is necessarily arbitrary).
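
The percentile cutoff described above can be sketched on simulated values. Everything here is an assumption for illustration: `toy.fst` holds random numbers rather than real window estimates, and `cutoff` and `outliers` are hypothetical names.

```R
set.seed(1)                    # makes the made-up values reproducible
toy.fst <- rbeta(5000, 2, 20)  # hypothetical per-window FST values

# Value below which 99.9% of the windows fall
cutoff <- quantile(toy.fst, 0.999, na.rm = TRUE)

# Indices of the windows above the cutoff (roughly 0.1% of 5000 windows)
outliers <- which(toy.fst > cutoff)
length(outliers)
```

Note that a fixed percentile always flags some windows, even when the values are pure noise as in this sketch, which is one reason such outliers are only candidates for selection.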
294 |
295 |
296 | **Q5:** Compare the peaks of high *FST* in the three subspecies pairs. Do they tend to be found in the same position? Would you expect this to be the case? Why/why not?
297 |
298 |
299 |
300 |
301 | **Q6:** Most of the top *FST* windows are found in groups of nearby windows with high *FST*. Can you explain or guess why that happens?
302 |
303 |
304 |
305 |
306 |
307 | ### EXTRA - Exploring genes in candidate regions for selection
308 |
309 | We have now identified several SNPs that are candidates for having been positively selected in some
310 | populations. Now we can try to see what genes these SNPs are located in (the genotype data we have been working with
311 | comes from exome sequencing, meaning that SNPs will always be located within genes).
312 |
313 | To do so, we need to know the genomic coordinates of the outlier windows in the Manhattan plot.
314 |
315 | Copy and paste the following function into R; this will return the top n (default n=20) windows with maximum *FST* for a given
316 | pairwise comparison. Again, you do not need to understand what the code does (but are welcome to try if you are interested, and ask if you have questions):
317 |
318 | ```r
319 | topWindowFst <- function(window.size, step.size, chrom, pos, fst, n_tops = 20){
320 |
321 | chroms <- unique(chrom)
322 | step.positions <- c()
323 | win.chroms <- c()
324 |
325 | for(c in chroms){
326 | whichpos <- which(chrom==c)
327 | chrom.steps <- seq(whichpos[1] + window.size/2, whichpos[length(whichpos)] - window.size/2, by=step.size)
328 | step.positions <- c(step.positions, chrom.steps)
329 | win.chroms <- c(win.chroms, rep(c, length(chrom.steps)))
330 | }
331 |
332 | n <- length(step.positions)
333 | fsts <- numeric(n)
334 | win.coord <- character(n)
335 |
336 | # estimate per window weighted fst
337 | for (i in 1:n) {
338 | chunk_a <- fst$a[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)]
339 | chunk_b <- fst$b[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)]
340 | chunk_c <- fst$c[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)]
341 | fsts[i] <- sum(chunk_a) / sum(chunk_a + chunk_b + chunk_c)
342 |
343 | c <- win.chroms[i]
344 | win.coord[i] <- paste0(c, ":", pos[step.positions[i]-window.size/2], "-", pos[step.positions[i]+window.size/2])
345 | }
346 |
347 | ord <- order(fsts, decreasing=T)
348 |
349 | return(data.frame(position=win.coord[ord][1:n_tops], fst=fsts[ord][1:n_tops]))
350 | }
351 | ```
352 |
353 |
354 | Now we can use the function to identify where the top hits from the previous plot are located.
355 | Let's have a look at the top 10 *FST* windows between troglodytes and schweinfurthii:
356 |
357 | ``` r
358 | windowsize <- 10
359 | steps <- 1
360 | topWindowFst(window.size=windowsize, step.size=steps, chrom=snpinfo$chr, pos=snpinfo$pos, fst=fsts[[1]], n_tops=10)
361 |
362 | ```
363 |
364 |
365 |
366 | **Q7:** What are the genomic coordinates of the window with the highest *FST*?
367 |
368 |
369 |
370 |
371 | Now let's look at some gene annotations at this location. Open the [chimpanzee genome assembly in the UCSC genome browser](https://genome.ucsc.edu/cgi-bin/hgTracks?db=panTro5&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A78555444%2D78565444&hgsid=1293765481_hOBCvmiwGLVKt1SRo9yIaRFa0wYc) and copy/paste the chromosome and coordinates of the top window in the search bar (paste the coordinates in the same format as what was printed in R (i.e. chr:start-end)).
372 |
373 | **Q8:** Is this window within a gene? If so, can you figure out the name of the gene and its possible function? (Hint: click on the drawing of the gene on the 'Non-Chimp RefSeq Genes' track. This track describes genes identified in other organisms that have high sequence similarity to the observed region of the chimp genome. This suggests that the gene may also be present in that region of the chimp genome).
374 |
375 |
376 |
377 |
378 | **Q9:** Can we conclude that selection on this gene has driven biological differentiation between the troglodytes and schweinfurthii chimpanzee subspecies?
379 |
380 |
--------------------------------------------------------------------------------
/Population_structure/ExerciseStructure.md:
--------------------------------------------------------------------------------
1 | # Exercise in structured populations
2 |
3 | **Peter Frandsen and Casper-Emil Tingskov Pedersen**
4 |
5 | ## Program
6 |
7 | - Understanding genetic structure in populations
8 |
9 | - Construction of a Principal Component Analysis (PCA) and ADMIXTURE
10 | plot using SNP data
11 |
12 | - Understanding the effect of population structure in genetic analyses
13 |
14 | ## Aim
15 |
16 | - Introduce command-line based approach to structure analyses
17 |
18 | - Introduction to ADMIXTURE
19 |
20 | - Learn to interpret results from structure analyses and put these in
21 | a biological context
22 |
23 | ## References
24 |
25 | - See Chapter 4 in "An Introduction to Population Genetics" and
26 | chapter 5 page 99-103
27 |
28 | - See Hvilsom et al. 2013. Heredity
29 |
30 | http://www.nature.com/hdy/journal/v110/n6/pdf/hdy20139a.pdf
31 |
32 | - See Prado-Martinez et al. 2013. Nature
33 |
34 | http://www.nature.com/nature/journal/v499/n7459/pdf/nature12228.pdf
35 |
36 | ## Clarifying chimpanzee population structure and admixture using exome data
37 |
38 | Disentangling chimpanzee taxonomy has received much
39 | attention, and with new populations of chimpanzees still being
40 | discovered, controversies remain about the true number of
41 | subspecies. The unresolved taxonomic labelling of chimpanzee populations
42 | has negative implications for future conservation planning for this
43 | endangered species. In this exercise, we will use 110,000 SNPs from the
44 | chimpanzee exome to infer population structure and admixture of
45 | chimpanzees in Africa, in order to acquire thorough knowledge of the
46 | population structure and thus help guide future conservation management
47 | programs. We make use of 29 wild born chimpanzees from across their
48 | natural distributional area (Figure 1).
49 |
50 |
51 |
52 |
53 |
54 |
55 | **Figure 1** Geographical distribution of the common chimpanzee *Pan
56 | troglodytes* (from Frandsen & Fontsere *et al.* 2020, https://www.nature.com/articles/s41437-020-0313-0).
57 |
58 | ### Warm-up exercise
59 |
60 | Consider two non-identical and separated populations of chimpanzees
61 | without genetic drift, natural selection, mutation or migration. Mating
62 | is random within each population. We analyze variation at a biallelic locus.
63 | In population 1 the frequency of
64 | allele A is 0.1 and in population 2 the frequency of allele A is 0.9. A
65 | young researcher has collected a sample comprising 50% from each
66 | population (each population is assumed to be of equal size). Calculate
67 | the proportion of heterozygous individuals in the pooled population.
68 | Then, calculate the expected heterozygosity in the pooled population.
69 |
70 | **Q1:** Is there a difference between the expected and observed heterozygosity?
71 |
72 | ## PCA
73 |
74 | We start by downloading the exome data to the folder
75 | `~/exercises/structure`. To do this you can use the following command.
76 | Open a terminal and type:
77 |
78 | ```bash
79 | cd ~/exercises # if you do not have a directory called exercises, make one: mkdir ~/exercises
80 | mkdir structure
81 | cd structure
82 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/structure/pa.zip .
83 | unzip pa.zip
84 | rm pa.zip
85 | ```
86 |
87 | First, we want to look at the data (like you always should *before*
88 | doing any analyses) by opening R in the exercise directory (don’t close
89 | it before this manual states that you should) and typing:
90 |
91 | #### \>R
92 |
93 | ```R
94 | popinfo = read.table("pop.info")
95 | table(popinfo[,1])
96 | ```
97 |
98 | **Q2**: Which subspecies are represented in the data and how many
99 | individuals are there from each?
100 |
101 | Now we want to import our genotyped data into R.
102 | #### \>R
103 |
104 | ```R
105 | library(snpMatrix)
106 | data <- read.plink("pruneddata")
107 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data))
108 | geno <- t(geno)
109 | geno[geno==0]<- NA
110 | geno<-geno-1
111 | ###Let’s take a look at the first SNP (across populations)
112 | table(geno[1,])
113 | ###Let’s take a look at the first individual (across all SNPs)
114 | table(geno[,1])
115 | ###Let’s take a look at the last individual (across all SNPs)
116 | table(geno[,29])
117 | ```
118 |
119 | **Q3**: The genotypes are coded as 0, 1 and 2. Explain what 0, 1
120 | and 2 mean. At the first SNP: how many homozygotes are there? And how
121 | many heterozygotes? For the first and last individual: how many
122 | homozygotes are there? And how many heterozygotes?
123 |
124 | Now we want to look at the principal components. Type the following into
125 | R:
126 |
127 | #### \>R
128 | ```R
129 | summary(prcomp(na.omit(geno)))
130 | ```
131 | **Q4**: Look at column PC1 and PC2, how much of the variation is
132 | explained if you were to use these two principal components?
133 |
134 | Now we want to plot our genotype data. We do that by first pasting
135 | the following code into R (it first defines a function that does PCA and then calls that function to run PCA on your data):
136 |
137 | #### \>R
138 | ```R
139 | eigenstrat<-function(geno){
140 |
141 | # Get rid of sites with missing data
142 | nMis<-rowSums(is.na(geno))
143 | geno<-geno[nMis==0,]
144 |
145 | # Get rid of non-polymorphic sites
146 | avg<-rowSums(geno)/ncol(geno)
147 | keep<-avg!=0&avg!=2
148 | avg<-avg[keep]
149 | geno<-geno[keep,]
150 |
151 | # Get number of remaining SNPs and individuals
152 | snp<-nrow(geno)
153 | ind<-ncol(geno)
154 |
155 | # Make normalized genotype matrix
156 | freq<-avg/2
157 | M <- (geno-avg)/sqrt(freq*(1-freq))
158 |
159 | # Get sample covariance matrix
160 | X<-t(M)%*%M
161 | X<-X/(sum(diag(X))/(snp-1))
162 |
163 | # Do eigenvalue decomposition
164 | E<-eigen(X)
165 |
166 | # Calculate stuff relevant for number of components to look at
167 | mu<-(sqrt(snp-1)+sqrt(ind))^2/snp
168 | sigma<-(sqrt(snp-1)+sqrt(ind))/snp*(1/sqrt(snp-1)+1/sqrt(ind))^(1/3)
169 | E$TW<-(E$values[1]*ind/sum(E$values)-mu)/sigma
170 | E$mu<-mu
171 | E$sigma<-sigma
172 | class(E)<-"eigenstrat"
173 | E
174 | }
175 | plot.eigenstrat<-function(x,col=1,...)
176 | plot(x$vectors[,1:2],col=col,...)
177 | print.eigenstrat<-function(x)
178 | cat("statistic",x$TW,"\n")
179 | e<-eigenstrat(geno)
180 |
181 | ```
182 | And then use the next lines of code to make a plot in R:
183 |
184 | #### \>R
185 | ```R
186 | plot(e,col=rep(c("lightblue","Dark red","lightgreen"),c(11,12,6)),xlab="PC1 21% of variance",ylab="PC2 12% of variance",pch=16,main="PCA plot")
187 | text(0, 0.18, "troglodytes")
188 | text(0.05, -0.2, "schweinfurthii")
189 | text(-0.32,-0.07,"verus")
190 | ```
191 |
192 |
193 | **Q5**: When looking at the plot, does the number of clusters fit with
194 | what you saw in the pop.info file? And does it make sense when looking
195 | at Figure 1?
196 | Now close R by typing `quit()` and hit `Enter` (it is up to you if you
197 | wish to save the workspace).
198 |
199 |
200 | ## Admixture
201 |
202 | Now we know that the populations look like they are separated into three
203 | distinct clusters (in accordance with the three subspecies), but we also
204 | want to know whether there has been any admixture between the three
205 | subspecies given that at least two of the subspecies have neighboring
206 | ranges (Figure 1). For this admixture run, we will vary the input
207 | command for the number of ancestral populations (*K*) that you want
208 | ADMIXTURE to try to separate the data in. To learn more about admixture
209 | the manual can be found here:
210 |
211 | https://www.genetics.ucla.edu/software/admixture/admixture-manual.pdf
212 |
213 | First we want to know whether the separation into three distinct
214 | populations is the clustering best supported by our data. We do this by
215 | running a cross-validation test, which gives us an error value for
216 | each K. We want to find the K with the lowest cross-validation (CV) error.
217 |
218 | To do this, run the following lines of code in the terminal (this may
219 | take some time \~3 mins):
220 |
221 | ```bash
222 | for i in 2 3 4 5; do admixture --cv pruneddata.bed $i; done > cvoutput
223 | grep -i 'CV error' cvoutput
224 | ```
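One way to read the lowest value off the output is sketched below. It assumes that ADMIXTURE prints lines of the form `CV error (K=3): 0.48191`; the file name `cv_example` is hypothetical and the error values are invented for illustration, so they say nothing about the chimpanzee data.

```bash
# Toy stand-in for cvoutput; the CV error values are made up for illustration.
cat > cv_example <<'EOF'
CV error (K=2): 0.52305
CV error (K=3): 0.48191
CV error (K=4): 0.50102
CV error (K=5): 0.55730
EOF

# Sort numerically on the error value (4th field) and keep the smallest one.
# LC_ALL=C makes sure "." is treated as the decimal separator.
best=$(grep -i 'CV error' cv_example | LC_ALL=C sort -k4,4n | head -n 1)
echo "$best"
```

Replacing `cv_example` with `cvoutput` applies the same pipeline to the real output.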
225 |
226 | **Q6**: Which K value has the lowest CV error?
227 |
228 | Try running ADMIXTURE using this K value by typing the following in the terminal
229 | (remember to replace `K-VALUE` with the value that has the lowest CV
230 | error):
231 |
232 | ```bash
233 | admixture pruneddata.bed K-VALUE
234 | ```
235 |
236 | Before we plot, we want a look at the results generated:
237 |
238 | ```bash
239 | less -S pruneddata.3.Q
240 | ```
241 |
242 | **Q7**: The number of columns equals the *K* used, and each row
243 | gives one individual's ancestry proportion from each ancestral population.
244 | Look at individual no. 10. Do you consider this individual to be
245 | recently (within the last two generations) admixed?
246 |
247 | Now we want to plot our ADMIXTURE results. To do this, open R and paste
248 | in the following code:
249 |
250 | #### \>R
251 | ```R
252 | snpk2=read.table("pruneddata.2.Q")
253 | snpk3=read.table("pruneddata.3.Q")
254 | snpk4=read.table("pruneddata.4.Q")
255 | snpk5=read.table("pruneddata.5.Q")
256 | names=c("A872_17","A872_24","A872_25","A872_28","A872_35","A872_41",
257 | "A872_53","Cindy","Sunday","EXOTA_11785","PAULA_11784","SUSI_11043",
258 | "CINDY_11525","ABOUME","AMELIE","AYRTON","BAKOUMBA","BENEFICE",
259 | "CHIQUITA","LALALA","MAKOKOU","MASUKU","NOEMIE","SITA_11262",
260 | "SEPP_TONI_11300","A872_71","A872_72","AGNETA_11758","FRITS_11052")
261 | par(mfrow=c(4,1))
262 | barplot(t(as.matrix(snpk2)),
263 | col= c("lightblue","Dark red"),
264 | border=NA, main="K=2",
265 | names.arg=(names), cex.names=0.8, las=2, ylab="ancestry")
266 | barplot(t(as.matrix(snpk3)),
267 | col= c("lightgreen","Dark red","lightblue"),
268 | border=NA, main="K=3",
269 | names.arg=(names), cex.names=0.8, las=2, ylab="ancestry")
270 | barplot(t(as.matrix(snpk4)),
271 | col= c("lightgreen","Dark red","lightblue","yellow"),
272 | border=NA, main="K=4", names.arg=(names), cex.names=0.8, las=2,
273 | ylab="ancestry")
274 | barplot(t(as.matrix(snpk5)),
275 | col= c("lightgreen","Dark red","lightblue","yellow","pink"),
276 | border=NA, main="K=5", names.arg=(names), cex.names=0.8, las=2,
277 | ylab="ancestry")
278 | ```
279 |
280 | **Q8:** Looking at the plot, does it look like there has been any
281 | admixture when using a K value of 3? Does this mean that there has not
282 | been any admixture between any of the subspecies? Why / why not ?
283 |
284 | **Q9:** In the K=4 plot, *P.t.troglodytes* (central chimpanzee) is
285 | divided into two populations. Have we overlooked a chimpanzee
286 | subspecies?
287 |
288 | **Q10:** Assuming you had no prior information about your data (e.g.
289 | imagine you have a lot of data sequences sampled from random chimpanzee
290 | individuals in a zoo) while using an ADMIXTURE analysis, would you be
291 | able to reveal whether there had been any admixture between any of the
292 | subspecies in nature? Why / why not?
293 |
294 | Next, we are going to calculate the fixation index between subspecies
295 | (*Fst*), which is a widely used statistic in population
296 | genetics. This is a measure of population differentiation and thus, we
297 | can use it to distinguish populations in a quantitative way. To get you
298 | started, we calculate *Fst* by hand and then later using a
299 | script. It is worth noticing that what *Fst* measures is the
300 | reduction in heterozygosity compared to a pooled population.
301 |
302 | Here we will calculate the population differentiation in the gene
303 | (***SLC24A5***) which contributes to skin pigmentation (among other
304 | things) in humans. An allele (A) in this gene is associated with light
305 | skin. The SNP varies in frequency in populations in the Americas with
306 | mixed African and Native American ancestry. A sample from Mexico had 38%
307 | A and 62% G; in Puerto Rico the frequencies were 59% A and 41% G, and a
308 | sample of Africans had 2% A with 98% G.
309 |
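310 | Calculate *Fst* in this example. Start by calculating the heterozygosity of each
311 | population, then Hs (the mean heterozygosity within populations) and then HT (the heterozygosity of the pooled population).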
312 |
313 | | | African | Mexican | Puerto Rican |
314 | | -------------- | -------------- | --------------- | --------------- |
315 | | | A (2%) G (98%) | A (38%) G (62%) | A (59%) G (41%) |
316 | | Heterozygosity | | | |
317 |
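For reference, the quantities asked for can be written as below. This is a sketch of the standard heterozygosity-based definition, assuming a biallelic SNP with frequency $p_i$ of allele A in population $i$ and equal weighting of the $n$ populations (unequal sample sizes would call for a weighted mean); the symbols are notation chosen here.

$$
H_i = 2p_i(1-p_i), \qquad
H_S = \frac{1}{n}\sum_{i=1}^{n} H_i, \qquad
\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i, \qquad
H_T = 2\bar{p}(1-\bar{p}), \qquad
F_{ST} = \frac{H_T - H_S}{H_T}
$$
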
318 | **Q11**: What is *Fst* in this case?
319 |
320 | **Q12**: Why is this allele not lost in Africans? What happened?
321 |
322 | Here we use the Weir and Cockerham (1984) *Fst* estimator to
323 | calculate *Fst* for the chimpanzees. Open R and copy/paste the
324 | following:
325 |
326 | #### \>R
327 | ```R
328 | WC84<-function(x,pop){
329 | #number ind each population
330 | n<-table(pop)
331 | ###number of populations
332 | npop<-nrow(n)
333 | ###average sample size of each population
334 | n_avg<-mean(n)
335 | ###total number of samples
336 | N<-length(pop)
337 | ###frequency in samples
338 | p<-apply(x,2,function(x,pop){tapply(x,pop,mean)/2},pop=pop)
339 | ###average frequency in all samples (apply(x,2,mean)/2)
340 | p_avg<-as.vector(n%*%p/N )
341 | ###the sample variance of allele 1 over populations
342 | s2<-1/(npop-1)*(apply(p,1,function(x){((x-p_avg)^2)})%*%n)/n_avg
343 | ###average heterozygotes
344 | # h<-apply(x==1,2,function(x,pop)tapply(x,pop,mean),pop=pop)
345 | #average heterozygote frequency for allele 1
346 | # h_avg<-as.vector(n%*%h/N)
347 | #faster version than above:
348 | h_avg<-apply(x==1,2,sum)/N
349 | ###nc (see page 1360 in Weir and Cockerham, 1984)
350 | n_c<-1/(npop-1)*(N-sum(n^2)/N)
351 | ###variance between populations
352 | a <-n_avg/n_c*(s2-(p_avg*(1-p_avg)-(npop-1)*s2/npop-h_avg/4)/(n_avg-1))
353 | ###variance between individuals within populations
354 | b <- n_avg/(n_avg-1)*(p_avg*(1-p_avg)-(npop-1)*s2/npop-(2*n_avg-1)*h_avg/(4*n_avg))
355 | ###variance within individuals
356 | c <- h_avg/2
357 | ###inbreeding (F_it)
358 | F <- 1-c/(a+b+c)
359 | ###(F_st)
360 | theta <- a/(a+b+c)
361 | ###(F_is)
362 | f <- 1-c/(b+c)
363 | ###weighted average of theta
364 | theta_w<-sum(a)/sum(a+b+c)
365 | list(F=F,theta=theta,f=f,theta_w=theta_w,a=a,b=b,c=c,total=c+b+a)
366 | }
367 | ```
368 |
369 | Now read in our data. We want to make three comparisons.
370 |
371 | #### \>R
372 | ```R
373 | library(snpMatrix)
374 | data <- read.plink("pruneddata")
375 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data))
376 | geno <- t(geno)
377 | geno[geno==0]<- NA
378 | geno<-geno-1
379 | g<-geno[complete.cases(geno),]
380 | pop<-c(rep(1,11),rep(2,12),rep(3,6))
381 | ### HERE WE HAVE OUR THREE COMPARISONS
382 | pop12<-pop[ifelse(pop==1,TRUE,ifelse(pop==2,TRUE,FALSE))]
383 | pop13<-pop[ifelse(pop==1,TRUE,ifelse(pop==3,TRUE,FALSE))]
384 | pop23<-pop[ifelse(pop==2,TRUE,ifelse(pop==3,TRUE,FALSE))]
385 | g12<-g[,ifelse(pop==1,TRUE,ifelse(pop==2,TRUE,FALSE))]
386 | g13<-g[,ifelse(pop==1,TRUE,ifelse(pop==3,TRUE,FALSE))]
387 | g23<-g[,ifelse(pop==2,TRUE,ifelse(pop==3,TRUE,FALSE))]
388 | result12<-WC84(t(g12),pop12)
389 | result13<-WC84(t(g13),pop13)
390 | result23<-WC84(t(g23),pop23)
391 | mean(result12$theta,na.rm=T)
392 | mean(result13$theta,na.rm=T)
393 | mean(result23$theta,na.rm=T)
394 | ```
395 |
396 | **Q13:** Does population differentiation fit with the geographical
397 | distance between subspecies and their evolutionary history?
398 |
--------------------------------------------------------------------------------
/Population_structure/ExerciseStructure_2022.md:
--------------------------------------------------------------------------------
1 | # Exercise in inference of population structure and admixture
2 |
3 | ## Program
4 |
5 | - Constructing and interpreting a Principal Component Analysis (PCA) plot using SNP data
6 | - Running and interpreting ADMIXTURE analyses using SNP data
7 |
8 | ## Learning objectives
9 |
10 | - Learn how to perform PCA and ADMIXTURE analyses of SNP data
11 |
12 | - Learn how to interpret results from such analyses and to put these in
13 | a biological context
14 |
15 | ## Recommended reading
16 |
17 | - “An introduction to Population Genetics”, pages 99-103
18 |
19 | ## Inferring chimpanzee population structure and admixture using exome data
20 |
21 | Chimpanzee taxonomy has received much attention, and as new populations
22 | of chimpanzees continue to be discovered, controversy remains about the
23 | true number of subspecies. The unresolved taxonomic labelling of
24 | chimpanzee populations has negative implications for future conservation
25 | planning for this endangered species. In this exercise, we will use
26 | 110,000 SNPs from the chimpanzee exome to infer population structure and
27 | admixture of chimpanzees in Africa, in order to acquire thorough
28 | knowledge of the population structure and thus help guide future
29 | conservation management programs. We make use of 29 wild-born
30 | chimpanzees from across their natural distributional area
31 | (Figure 1).
32 |
33 |
34 |
35 |
36 |
37 |
38 | **Figure 1** Geographical distribution of the common chimpanzee *Pan
39 | troglodytes* (from [Frandsen & Fontsere *et al.* 2020](https://www.nature.com/articles/s41437-020-0313-0)).
40 |
41 | *If you want to read more about chimpanzee population structure*
42 | - [Hvilsom et al. 2013. Heredity](http://www.nature.com/hdy/journal/v110/n6/pdf/hdy20139a.pdf)
43 | - [Prado-Martinez et al. 2013. Nature](http://www.nature.com/nature/journal/v499/n7459/pdf/nature12228.pdf)
44 |
45 | ## PCA
46 |
47 | We start by creating a directory for this exercise and downloading the exome data to the folder.
48 |
49 | Open a terminal and type:
50 |
51 | ```bash
52 | # Make a directory for this exercise in your exercises folder
53 | cd ~/exercises
54 | mkdir structure
55 | cd structure
56 |
57 | # Download data (remember the . at the end)
58 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/structure/pa/* .
59 |
60 | # Show the downloaded files
61 | ls -l
62 | ```
63 | You have now downloaded a PLINK file-set consisting of
64 |
65 | - pruneddata.**fam**: Information about each individual (one line per individual)
66 | - pruneddata.**bim**: Information about each SNP/variant (one line per SNP/variant)
67 | - pruneddata.**bed**: A non-human-readable *binary* file containing the genotypes of all variants for all individuals.
68 |
69 | and a separate file containing assumed population info for each sample.
70 |
71 | - pop.info
72 |
73 | First, we want to look at the data (like you always should *before* doing any analyses). The command `wc -l [FILENAME]` counts the number of lines in a file.
74 |
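As a small illustration of how a line count maps onto a sample count, here is a sketch on a toy file. `toy.fam` is a hypothetical file with invented contents in the six-column PLINK .fam layout; it is not part of the downloaded data.

```bash
# Toy .fam-style file with three individuals (contents invented for illustration).
printf 'fam1 ind1 0 0 0 -9\nfam2 ind2 0 0 0 -9\nfam3 ind3 0 0 0 -9\n' > toy.fam

# A .fam file has one line per individual, so the line count is the number of samples.
n_samples=$(wc -l < toy.fam)
echo "$n_samples"
```

The same reasoning applies to a .bim file, where each line is one SNP/variant.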
75 |
76 |
77 | **Q1: How many samples and variants does the downloaded PLINK file-set consist of?**
78 |
79 |
80 |
81 | Now open R in the exercises directory (don’t close it before this manual states that you should) and type:
82 | ```R
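83 | popinfo <- read.table("pop.info", stringsAsFactors=TRUE)  # keep the subspecies column as a factor (needed for the plot colors and legend later)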
84 | table(popinfo[,1])
85 | ```
86 |
87 |
88 |
89 | **Q2**
90 | - **Q2.1: Which subspecies are represented in the data?**
91 | - **Q2.2: How many samples are there from each subspecies?**
92 | - **Q2.3: Does the total number of samples match what you found in Q1?**
93 |
94 |
95 |
96 | Now we want to import our genotype data into R.
97 |
98 | ```R
99 | # Load data
100 | library(snpMatrix)
101 | data <- read.plink("pruneddata")
102 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data))
103 | geno[geno==0] <- NA
104 | geno <- geno-1
105 |
106 | # Show the number of rows and columns
107 | dim(geno)
108 |
109 | # Show counts of genotypes for SNP/variant 17
110 | table(geno[,17], useNA='a')
111 |
112 | # Show counts of genotypes for sample 1
113 | table(geno[1,], useNA='a')
114 | ```
115 |
116 |
117 |
118 | **Q3**
119 | - **Q3.1: How many SNPs and samples have you loaded into *geno*? Does it match what you found in Q1 and Q2?**
120 | - **Q3.2: How many samples are heterozygous for SNP 17 (*what do 0, 1, and 2 mean*)?**
121 | - **Q3.3: How many SNPs have missing data (*NA*) for sample 8 (*you need to change the code to find the information for sample 8*)?**
122 |
123 |
124 |
125 | Specialized software (e.g. [PCAngsd](https://doi.org/10.1534/genetics.118.301336)) can handle missing information in a clever way, but for now we will simply remove all sites that have missing information and then perform PCA with the standard R function `prcomp`.
126 |
127 | ```R
128 | # Number of missing samples per site
129 | nMis <- colSums(is.na(geno))
130 |
131 | # Only keep sites with 0 missing samples.
132 | geno <- geno[,nMis==0]
133 |
134 | # Perform PCA
135 | pca <- prcomp(geno, scale=T, center=T)
136 |
137 | # Show summary
138 | summary(pca)
139 |
140 | # Extract importance of PCs.
141 | pca_importance <- summary(pca)$importance
142 | plot(pca_importance[2,], type='b', xlab='PC', ylab='Proportion of variance', las=1,
143 | pch=19, col='darkred', bty='L', main='Proportion of variance explained per PC')
144 | ```
145 |
146 |
147 |
148 | **Q4: Which principal components (PCs) are most important in terms of variance explained and how much variance do they explain together (cumulative)?**
149 |
150 |
151 |
152 |
153 | Now let's plot the first two principal components.
154 |
155 | ```R
156 | # Extract percentage of the variance that is explained by PC1 and PC2
157 | PC1_explained <- round(pca_importance[2,1]*100, 1)
158 | PC2_explained <- round(pca_importance[2,2]*100, 1)
159 |
160 | # Extract the PCs
161 | pcs <- as.data.frame(pca$x)
162 |
163 | # Custom colors matching the original colors on the map.
164 | palette(c('#E69F00', '#D55E00', '#56B4E9'))
165 | plot(pcs$PC1, pcs$PC2, col=popinfo$V1, pch=19, las=1, bty='L',
166 | main='PCA on 29 wild-born chimpanzees',
167 | xlab=paste0('PC1 (', PC1_explained, '% of variance)'),
168 | ylab=paste0('PC2 (', PC2_explained, '% of variance)'))
169 | legend('topleft', legend=levels(popinfo$V1), col=1:length(levels(popinfo$V1)), pch=19)
170 | ```
171 |
172 |
173 | **Q5**
174 |
175 | - **Q5.1 How many separate PCA-clusters are found in the first two PCs?**
176 | - **Q5.2 Which populations are separated by PC1? How does this match the geography from Figure 1 (top of document)?**
177 | - **Q5.3 Do the PCs calculated only from genetic data match the information from the pop.info file (i.e. do any of the samples cluster with a different population than specified by the sample-info/color)?**
178 |
179 |
180 |
181 | Save/screenshot the plot for later. Now close R by typing `q()` and hit `Enter` (no need to save the workspace).
182 |
183 | **BONUS (if you finish early)**: Very often we cannot load all the data into R, so we need to calculate the PCA using software such as PLINK. Using Google and/or `plink --help`, try to perform PCA on the data with PLINK (remember to remove sites with missing data) and plot the results using R.
184 |
185 | ## Admixture
186 |
187 | Now we know that the populations look like they are separated into three
188 | distinct clusters (in accordance with the three subspecies), but we also
189 | want to know whether there has been any admixture between the three
190 | subspecies given that at least two of the subspecies have neighboring
191 | ranges (Figure 1). The ADMIXTURE manual can be found
192 | [here](http://dalexander.github.io/admixture/admixture-manual.pdf).
193 |
194 | When running ADMIXTURE, we input a certain number of ancestral populations, *K*.
195 | Then ADMIXTURE picks a random starting point (the *seed*) and tries to find the
196 | parameters (admixture proportions and ancestral frequencies) resulting in the highest likelihood.
197 |
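For reference, the likelihood being maximised can be sketched as below. This is the binomial model described in the ADMIXTURE paper (Alexander et al. 2009); the symbols are notation chosen here rather than taken from this exercise: $g_{ij}\in\{0,1,2\}$ is the genotype of individual $i$ at SNP $j$, $q_{ik}$ is the proportion of individual $i$'s ancestry from ancestral population $k$, and $f_{kj}$ is the allele frequency at SNP $j$ in ancestral population $k$.

$$
\ell(Q,F)=\sum_{i}\sum_{j}\left[\,g_{ij}\ln\Big(\sum_{k=1}^{K} q_{ik}f_{kj}\Big)+(2-g_{ij})\ln\Big(1-\sum_{k=1}^{K} q_{ik}f_{kj}\Big)\right]
$$

Different seeds give different starting values for $Q$ and $F$, so separate runs can end in different local maxima of $\ell$.
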
198 | First, let's try to run ADMIXTURE once assuming *K=3* ancestral populations.
199 | ```bash
200 | # Make sure you are in the ~/exercises/structure/ directory
201 | cd ~/exercises/structure/
202 |
203 | # Run admixture once
204 | admixture -s 1 pruneddata.bed 3 > pruneddata_K3_run1.log
205 |
206 | # Look at output files
207 | ls -l
208 | ```
209 |
210 |
211 |
212 | **Q6**
213 | - **Q6.1 What is in the pruneddata.3.Q, pruneddata.3.P, and pruneddata_K3_run1.log files? (*Hints: Use `less -S` to look in the files. Use `wc -l [FILE]` to count the number of lines in the files. Look in the manual*)**
214 | - **Q6.2 What are the ancestry proportions (from the three populations) of sample 4?**
215 | - **Q6.3 Is it admixed, and can we say which subspecies it is from?**
216 | - **Q6.4 What is the final Loglikelihood of the fit?**
217 | - **Q6.5 Can we be sure that this is the best overall fit for this data - why/why not? (*hint: see next part of exercise, but remember to explain why*)**
218 |
219 |
220 |
221 |
222 | Now, let's run the model fit 10 times with different seeds (different starting points)
223 | ```bash
224 | # Assumed number of ancestral populations
225 | K=3
226 |
227 | # Repeat the run 10 times, each time with a different seed
228 | for i in {1..10}
229 | do
230 | # Run admixture with seed i
231 | admixture -s ${i} pruneddata.bed ${K} > pruneddata_K${K}_run${i}.log
232 |
233 | # Copy the output files to run-specific names (the next run overwrites pruneddata.${K}.Q and pruneddata.${K}.P)
234 | cp pruneddata.${K}.Q pruneddata_K${K}_run${i}.Q
235 | cp pruneddata.${K}.P pruneddata_K${K}_run${i}.P
236 | done
237 |
238 | # Show the likelihood of all the 10 runs (in a sorted manner):
239 | grep ^Loglikelihood: *K${K}*log | sort -k2
240 | ```
241 |
242 |
243 |
244 | **Q7**
245 | - **Q7.1 Which run-numbers have the 3 highest Loglikelihoods?**
246 | - **Q7.2 Has the model converged? (*rule of thumb: it has converged if the 3 best runs are within 1 log-likelihood unit of each other*)**
247 | - **Q7.3 Which run would you use as the result to plot?**
248 |
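The rule of thumb can also be checked mechanically. The sketch below reads it as "the three highest log-likelihoods lie within 1 unit of each other", which is one interpretation. The directory `toy_logs`, its file names and the log-likelihood values are all invented for illustration and are unrelated to the real runs.

```bash
# Toy log files standing in for pruneddata_K3_run*.log (values are invented).
mkdir -p toy_logs
printf 'Loglikelihood: -183642.51\n' > toy_logs/toy_K3_run1.log
printf 'Loglikelihood: -183642.12\n' > toy_logs/toy_K3_run2.log
printf 'Loglikelihood: -183710.88\n' > toy_logs/toy_K3_run3.log
printf 'Loglikelihood: -183642.30\n' > toy_logs/toy_K3_run4.log

# Keep the three highest log-likelihoods (numeric sort, largest first),
# then compare the best value with the third best.
verdict=$(grep -h '^Loglikelihood:' toy_logs/*K3*log |
  LC_ALL=C sort -k2,2nr | head -n 3 |
  LC_ALL=C awk 'NR==1 {best=$2} {third=$2}
       END {d=best-third; print ((d<=1) ? "converged" : "not converged")}')
echo "$verdict"
```

Pointing the `grep` at `*K${K}*log` in the exercise directory applies the same check to the real log files.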
249 |
250 |
251 | Now let's plot the ancestry proportions of each sample from the best fit. **Open R**.
252 |
253 | ```R
254 | # Margins and colors
255 | par(mar=c(7,3,2,1), mgp=c(2,0.6,0))
256 | palette(c("#E69F00", "#56B4E9", "#D55E00", "#999999"))
257 |
258 | # Load sample names
259 | popinfo <- read.table("pop.info")
260 | sample_names <- popinfo$V2
261 |
262 | # Read sample ancestral proportions
263 | snp_k3_run5 <- as.matrix(read.table("pruneddata_K3_run5.Q"))
264 |
265 | barplot(t(snp_k3_run5), col=c(3,2,1), names.arg=sample_names, cex.names=0.8,
266 | border=NA, main="K=3 - Run 5", las=2, ylab="Ancestry proportion")
267 | ```
268 |
269 | Save/screenshot the plot for later. Close R by typing `q()` and hit `Enter` (no need to save the workspace).
270 |
271 |
272 |
273 | **Q8**
274 | - **Q8.1 Explain the plot - what is shown for each sample?**
275 | - **Q8.2 Do the assigned ancestry proportions match the subspecies classification for each sample?**
276 | - **Q8.3 Can you find any admixed samples (how does/would an admixed sample look)?**
277 | - **Q8.4 Does assuming 3 ancestral populations (K=3) seem to be a good fit to the data?**
278 | - **Q8.5 Could the model have assumed only 2 ancestral populations (*K*) in these runs?**
279 |
280 |
281 |
282 | Now, run admixture 10 times assuming only two ancestral populations (**K=2**). You can reuse the code from K=3, changing only K.
283 |
284 |
285 |
286 |
287 | **Q9**
288 | - **Q9.1 Did the model(s) converge?**
289 | - **Q9.2 Which run had the highest and which had the lowest loglikelihood?**
290 |
291 |
292 |
293 | Open R again and let's plot the best and the worst fit.
294 |
295 | First we plot the best:
296 |
297 | ```R
298 | # Margins and colors
299 | par(mar=c(7,3,2,1), mgp=c(2,0.6,0))
300 | palette(c("#E69F00", "#56B4E9", "#D55E00", "#999999"))
301 |
302 | # Load sample names
303 | popinfo <- read.table("pop.info")
304 | sample_names <- popinfo$V2
305 |
306 | # Read sample ancestral proportions
307 | snp_k2_run1 <- as.matrix(read.table("pruneddata_K2_run1.Q"))
308 |
309 | barplot(t(snp_k2_run1), col=c(2,1), names.arg=sample_names, cex.names=0.8,
310 | border=NA, main="K=2 - Run 1 (best fit)", las=2, ylab="Ancestry proportion")
311 | ```
312 |
313 | Save/screenshot the plot for later.
314 |
315 | Then we plot the worst:
316 |
317 | ```R
318 | # Read sample ancestral proportions
319 | snp_k2_run4 <- as.matrix(read.table("pruneddata_K2_run4.Q"))
320 |
321 | barplot(t(snp_k2_run4), col=c(2,1), names.arg=sample_names, cex.names=0.8,
322 | border=NA, main="K=2 - Run 4 (worst fit)", las=2, ylab="Ancestry proportion")
323 | ```
324 |
325 | Save/screenshot the plot for later. Close R by typing `q()` and hit `Enter` (no need to save the workspace).
326 |
327 |
328 |
329 | **Q10**
330 | - **Q10.1 Would the conclusion be different if you used the worst compared to the best?**
331 | - **Q10.2 How would an F1 sample (child of parents from two different subspecies) look?**
332 |
333 |
334 |
335 | Run admixture 10 times assuming 4 ancestral populations (K=4).
336 |
337 |
338 |
339 | **Q11**
340 | - **Q11.1 Did the model(s) converge?**
341 | - **Q11.2 Which run had the highest and which had the lowest loglikelihood?**
342 |
343 |
344 |
345 | It turns out that the model(s) eventually converges with the best fit at seed i=52. Run that single seed:
346 |
347 | ```bash
348 | # Assumed number of ancestral populations
349 | K=4
350 |
351 | # Specific seed
352 | i=52
353 |
354 | # Run admixture and rename output files
355 | admixture -s ${i} pruneddata.bed ${K} > pruneddata_K${K}_run${i}.log
356 | cp pruneddata.${K}.Q pruneddata_K${K}_run${i}.Q
357 | cp pruneddata.${K}.P pruneddata_K${K}_run${i}.P
358 |
359 | grep ^Loglikelihood: *K${K}*log | sort -k2
360 | ```
361 |
362 | Now plot this converged best fit:
363 | ```R
364 | # Margins and colors
365 | par(mar=c(7,3,2,1), mgp=c(2,0.6,0))
366 | palette(c("#E69F00", "#56B4E9", "#D55E00", "#999999"))
367 |
368 | # Load sample names
369 | popinfo <- read.table("pop.info")
370 | sample_names <- popinfo$V2
371 |
372 | # Read sample ancestral proportions
373 | snp_k4_run52 <- as.matrix(read.table("pruneddata_K4_run52.Q"))
374 |
375 | barplot(t(snp_k4_run52), col=c(3,4,1,2), names.arg=sample_names, cex.names=0.8,
376 | border=NA, main="K=4 - Run 52 (Best fit)", las=2, ylab="Ancestry proportion")
377 | ```
378 |
379 | Save/screenshot the plot for later. Plot the worst fit:
380 |
381 | ```R
382 | # Read sample ancestral proportions
383 | snp_k4_run4 <- as.matrix(read.table("pruneddata_K4_run4.Q"))
384 |
385 | barplot(t(snp_k4_run4), col=c(3,2,1,4), names.arg=sample_names, cex.names=0.8,
386 | border=NA, main="K=4 - Run 4 (Worst fit, not converged)", las=2, ylab="Ancestry proportion")
387 | ```
388 |
389 | Save/screenshot the plot for later. Close R by typing `q()` and hit `Enter` (no need to save the workspace).
390 |
391 | **Q12 Looking at all the results (PCA and admixture K=2, K=3, and K=4)**
392 | - **Q12.1 Do the admixture and PCA analysis correspond with the known geography?**
393 | - **Q12.2 Which number of ancestral populations do you find the most likely?**
394 | - **Q12.3 What are the possible explanations for the K=4 admixture results?**
395 | - **Q12.4 Does it look like we have admixed samples?**
396 | - **Q12.5 Can we conclude anything about admixture between the subspecies?**
397 |
--------------------------------------------------------------------------------
/Population_structure/chimp_distribution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/chimp_distribution.png
--------------------------------------------------------------------------------
/Population_structure/exer_struct_1_PCA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/exer_struct_1_PCA.png
--------------------------------------------------------------------------------
/Population_structure/exer_struct_2_admixture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/exer_struct_2_admixture.png
--------------------------------------------------------------------------------
/Population_structure/manhattanFst.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/manhattanFst.png
--------------------------------------------------------------------------------
/Population_structure/z:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/QuantitativeGenetics/FforInd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/FforInd.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/Fgenome.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/Fgenome.tar.gz
--------------------------------------------------------------------------------
/QuantitativeGenetics/IBDexercises.md:
--------------------------------------------------------------------------------
1 | # Inbreeding
2 | You have gained access to DNA from an individual, and using genetic markers across the genome you estimate the inbreeding coefficient to be F=0.062.
3 | - Is this high compared to what you would expect for a human?
4 | - What is the simplest pedigree that could give rise to such an estimate?
5 | - - Hint 1: Look at the “Expectations of inbreeding coefficient from pedigrees” slides from earlier
6 |
7 |
8 |
9 | - Go to https://popgen.dk/shiny/anders/popgenCourse/Fgenome/ to simulate the 22 autosomes for a human.
10 | - - NB! The link does not work if you are using the KU VPN.
11 | - - If it is slow to respond, you can use a local version (see below).
12 | - - Try to simulate an individual from this simple pedigree (use the expected F from such a pedigree, don’t use “a”)
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 | - Note the simulated inbreeding coefficient for this individual. Why is it not the same as the F you entered?
21 | - - Note the length of the inbreeding tracts. What determines how long they are?
22 | - - Note the number of chromosomes that do not have inbreeding tracts. Try to draw how this might happen
23 | - - Try to get an idea of the range of possible inbreeding coefficients by trying multiple simulations (still using the same F)
24 | - - Look in table https://www.researchgate.net/profile/Alan-Bittles-2/publication/38114212/figure/tbl1/AS:601713388052509@1520471059919/Human-genetic-relationships.png of simple consanguineous pedigrees. Does your range overlap with the expected inbreeding coefficients?
25 |
26 | - Try a few simulations of some of the other simple pedigrees and try to see which pedigrees could explain your estimated inbreeding value of 0.062?
27 | - If you infer the inbreeding tracts of your individual, the results will look like this: https://github.com/populationgenetics/exercises/blob/master/QuantitativeGenetics/FforInd.png. Is this consistent with your suggested pedigrees? If not, which other explanations could there be for the estimated F?
28 |
29 |
30 | ## Run locally
31 | Open R on your desktop (not the server). Then run
32 | ```R
33 | # Install shiny if it is not already installed
34 | if (!require("shiny")) install.packages("shiny")
35 | shiny::runUrl("https://popgen.dk/shiny/anders/popgenCourse/Fgenome.tar.gz")
36 | ```
37 |
38 |
39 |
40 |
41 |
42 | # Relatedness
43 |
44 | The IBD sharing between two individuals on chromosome 1 is shown here.
45 |
46 |
47 |
48 |
49 |
50 |
51 |
52 | - Assuming no previous inbreeding in the population, what is the only relationship that can produce such IBD patterns?
53 | - Assuming both individuals have a rare recessive disorder, where on the chromosome might the disease gene be located?
54 |
55 |
56 |
57 | - Here is a figure of two distantly related individuals both with the same rare disorder. They share 0.3% of their genome IBD=1
58 |
59 |
60 |
61 |
62 |
63 |
64 |
65 | - These individuals share two regions IBD. What assumption must we make in order to conclude that the disease-causing locus is in one of these regions?
66 | - For relatedness mapping, do you think it is best to have closely or distantly related individuals?
67 | - Try to guess the number of generations that separate these two individuals.
68 | - They are actually separated by 14 generations. Try to see if you can get a similar pattern using simulations https://popgen.dk/shiny/anders/popgen2016/Rgenome/
69 | - - If slow then run locally (see below)
70 | - What explains the difference between your simulations and above plot?
71 |
72 |
73 |
74 | ## Run locally
75 | Open R or RStudio and then run
76 | ```R
77 | # Install shiny if it is not already installed
78 | if (!require("shiny")) install.packages("shiny")
79 | shiny::runUrl("https://popgen.dk/shiny/anders/popgenCourse/Rgenome.tar.gz")
80 | ```
81 |
82 |
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG1_5length.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG1_5length.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG1assortative.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG1assortative.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG2Children.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG2Children.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG3Ne.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG3Ne.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG4heritability_height.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG4heritability_height.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG5heritability_children.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG5heritability_children.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/QG6monogamy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG6monogamy.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/Rgenome.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/Rgenome.tar.gz
--------------------------------------------------------------------------------
/QuantitativeGenetics/fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/fig1.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/fig2.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/im.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/im.png
--------------------------------------------------------------------------------
/QuantitativeGenetics/z:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Introduction to exercises
2 |
3 | This repository contains the exercises for the University of Copenhagen Population Genetics master course.
4 |
5 |
--------------------------------------------------------------------------------
/phylogenetics/ExercisesPhylo_2021.md:
--------------------------------------------------------------------------------
1 | # Phylogenomics
2 |
3 | **David A. Duchene**
4 |
5 |
6 | Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology.
7 |
8 | These exercises focus on the most critical steps of a phylogenomic analysis, with the aim of resolving longstanding questions of the evolution of Australasian marsupials.
9 |
10 |
11 |
12 |
13 |
14 | ### Exercise 1: Sequence alignment
15 |
16 | a) Open MEGA and open the file marsupials.fasta via File > Open A File, and selecting Align. Have a look at the sequences and notice that they have slightly different lengths.
17 |
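The length differences can also be checked outside MEGA. The following is a minimal sketch, assuming plain FASTA text; the helper `read_fasta` and the toy sequences are illustrative and not part of the course material.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping names to sequences."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence lines are appended to the current record.
            records[name] += line
    return records


# Toy input standing in for marsupials.fasta (these sequences are made up).
example = """>taxon_A
acgtacgtac
>taxon_B
acgtacgt
"""

lengths = {name: len(seq) for name, seq in read_fasta(example).items()}
for name, n in lengths.items():
    print(name, n)
```

To apply it to the real data, pass the contents of marsupials.fasta to `read_fasta` instead of the toy string.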
18 |
19 | **Question 1.** Before we conduct any phylogenetic analyses, we need to align these sequences. What is the purpose of sequence alignment?
20 |
21 |
22 | Answer:
23 |
24 | *The purpose of sequence alignment is to maximise the number of sites that we can infer as including homologous characters (i.e., those that have been inherited from a common ancestor). Sequences are aligned by inserting "gaps" where insertions and deletions are likely to have occurred.*
25 |
26 |
27 |
28 |
29 | b) Try for a few minutes to align these sequences by eye. To do this, use your mouse to select one of the nucleotides. Then you can add "gaps" with your spacebar or use the left and right arrow symbols at the top of the window to move that block of columns left and right. You can also use the keyboard, by using alt + the arrow keys.
30 |
31 | To get you started, try simply adding a single gap to the start of sequences 1, 3, 5, 6, 7, and 10. Notice that several of the sequences now appear to be better aligned, but this is only the case for a few of the successive nucleotides. Also try adding gaps to the start of sequences 4 and 9 until they align with the rest of the data set.
32 |
33 | c) Now create an automated alignment by clicking on the Alignment menu and choosing Align by Muscle, selecting all the sequences. Muscle and ClustalW are the two alignment algorithms available in MEGA, and both are widely used. We will use Muscle for this practical.
34 |
35 | Change the “Gap Open” penalty to -1000. This increases the penalty for inserting gaps in the alignment, meaning that Muscle will be less willing to insert gaps to make the sequences better aligned to each other. Click “OK”.
36 |
37 |
38 | **Question 2.** How long is the sequence alignment?
39 |
40 |
41 | Answer:
42 |
43 | *This sequence alignment is 1318 sites long.*
44 |
45 |
46 |
47 |
48 | d) Now run Muscle again, but with a “Gap Open” penalty of -400.
49 |
50 |
51 | **Question 3.** Has the length of the sequence alignment changed, and which of the alignments seems more acceptable?
52 |
53 |
54 | Answer:
55 |
56 | *The length of the alignment has slightly increased to 1321. The reduced penalty for inserting gaps has led to the addition of a few sites overall.*
57 |
58 |
59 |
60 |
61 | In this practical we will accept the results of the automated alignment, but standard practice is to inspect the alignment to see if the automated method has done a good job.
62 |
63 | In some cases, visual inspection can reveal sections of the alignment that can be improved. Nowadays, many sequence data sets are far too large for visual inspection to be feasible. As a consequence, there is an increasing reliance on automated methods of identifying misleading sequences and alignment regions.
64 |
65 |
66 | **Question 4.** If one of your sequences had accidentally been shifted to the right by 1 nucleotide (so that the sequence was misaligned by 1 nucleotide compared with the remaining sequences), what would be the consequences for phylogenetic analysis?
67 |
68 |
69 | Answer:
70 |
71 | *This would lead to a very large estimated genetic divergence between this sequence and the rest of the data set. Overall, we would estimate the data set to have a very large substitution rate. The branch leading to that taxon would be estimated to be very long, and the taxon would probably be placed incorrectly as a distant relative of all other taxa.*
72 |
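The effect of a one-nucleotide shift can be illustrated numerically. This is a rough sketch, not part of the original exercise: the function `p_distance` is a hypothetical helper, and the sequence is the first 30 bases of the Brushtail possum sequence in marsupials.fasta.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, ignoring gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    differences = sum(1 for x, y in compared if x != y)
    return differences / len(compared)


# First 30 bases of the Brushtail_possum sequence.
seq = "ttgccatatcccttccaccaggtcctcctg"

correct = seq + "-"   # sequence in its proper position
shifted = "-" + seq   # the same sequence shifted right by one site

same = p_distance(correct, correct)
misaligned = p_distance(correct, shifted)
print(same, misaligned)
```

Although the two sequences are identical, the shifted copy differs from the original at more than half of the compared sites, which a tree-building method would interpret as a very large divergence.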
73 |
74 |
75 |
76 | ### Exercise 2: Tree building
77 |
78 | Now that we have a sequence alignment, we are ready to perform phylogenetic analyses. We will perform phylogenetic analysis using maximum likelihood. Our aim will be to make estimates of the tree and branch lengths. This method uses an explicit model of molecular evolution.
79 |
80 | a) From your alignment window, click on the Data tab in the top bar and on Phylogenetic Analysis. Select "Yes" when prompted whether your data are protein-coding. Now go to the main MEGA window. From the Phylogeny menu, select "Construct Maximum Likelihood Tree" and use the active data set. This will bring up a box that contains a range of options for the analysis.
81 |
82 | Check that the options are the following:
83 | Test of Phylogeny: Bootstrap method
84 | No. of Bootstrap Replications: 100
85 | Substitutions Type: Nucleotide
86 | Model/Method: General Time Reversible model
87 | Rates among Sites: Gamma distributed (G)
88 | No of Discrete Gamma Categories: 5
89 | Gaps/Missing Data Treatment: Use all sites
90 |
91 | Click "OK" to start the maximum likelihood phylogenetic analysis, which will take about 5 minutes to complete.
92 |
93 | b) Confirm that the tree is rooted with the Opossum as the sister of all other marsupials. If this is not the case, select the branch of the Opossum and select "Place Root on Branch".
94 |
95 |
96 | **Question 5.** There is a scale bar underneath the tree. What units does it measure?
97 |
98 |
99 | Answer:
100 |
101 | *The scale bar measures expected molecular substitutions per site.*
102 |
103 |
104 |
105 |
106 | **Question 6.** What is the purpose of including an external species, or outgroup (in this case the Opossum), in the analysis?
107 |
108 |
109 | Answer:
110 |
111 | *The inclusion of an external species allows us to infer the position of the root and therefore the order of divergence events in time.*
112 |
113 |
114 |
115 |
116 | **Question 7.** Does it look like this tree has strong statistical support?
117 |
118 |
119 | Answer:
120 |
121 | *Some branches have high support, but most ancient or "deep" branches have very low support. This means that we cannot draw strong conclusions about the early divergence of marsupials from the genomic region analysed.*
122 |
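The support values come from the bootstrap selected in the analysis options. As a minimal sketch of the resampling step only (MEGA's internal implementation is not shown in this exercise, and `bootstrap_alignment` and the toy alignment are assumptions made for illustration), one bootstrap replicate can be built by drawing alignment columns with replacement:

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the alignment shape."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices; some columns repeat and others are left out.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Toy alignment with made-up sequences.
toy = {"A": "acgtac", "B": "acgtaa", "C": "aagtcc"}
rng = random.Random(1)
replicate = bootstrap_alignment(toy, rng)
print(replicate)
```

A tree is then estimated from each replicate, and the support for a branch is the percentage of replicate trees that contain it.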
123 |
124 |
125 |
126 | ### Exercise 3: Substitution models
127 |
128 | In order to conduct a reliable phylogenetic analysis, we need to identify the best-fitting model of nucleotide substitution for the data set. A key purpose of substitution models is to account for multiple substitutions at the same site.
129 |
130 | Perhaps the most widely used substitution model is the General Time Reversible (GTR) model, which allows different rates for different substitution types. For example, it allows A<->G substitutions to occur at a different rate from C<->G substitutions. The model also allows the four nucleotides to have unequal frequencies.
131 |
132 |
133 | **Question 8.** Which is the simplest substitution model, and why?
134 |
135 |
136 | Answer:
137 |
138 | *Jukes-Cantor is the simplest model because it assumes that all substitution types occur at the same rate. Base frequencies are also assumed to be equal.*
139 |
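How a model corrects for multiple substitutions can be seen with the Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below is illustrative only; the function name `jc_distance` is an assumption, not something used in MEGA.

```python
import math


def jc_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.3, 0.6):
    print(p, round(jc_distance(p), 4))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as p grows, because more of the substitutions are hidden by later changes at the same site.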
140 |
141 |
142 |
143 | Variation in evolutionary rate across sites can be modelled using a gamma distribution, which can take a broad range of shapes. The distribution can be described using a single parameter, alpha. When alpha is small (<1), most sites are assumed to be slow-evolving, while a small portion of sites are assumed to be faster evolving (e.g., because they are not under selective constraint). When alpha is large (>1), most sites are taken to evolve at approximately the same rate.
144 |
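The effect of alpha can be explored by simulation. The following sketch assumes the common convention that site rates follow a gamma distribution with shape alpha and mean 1 (so the scale is 1/alpha); the function name, the threshold of 0.1 and the number of simulated sites are arbitrary choices made for illustration.

```python
import random


def slow_site_fraction(alpha, n_sites=20000, threshold=0.1, seed=0):
    """Fraction of simulated sites whose relative rate falls below `threshold`."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): a scale of 1/alpha gives a mean rate of 1.
    rates = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
    return sum(r < threshold for r in rates) / n_sites


small_alpha = slow_site_fraction(0.5)
large_alpha = slow_site_fraction(5.0)
print(small_alpha, large_alpha)
```

With a small alpha a sizeable share of sites evolve at less than a tenth of the average rate, whereas with a large alpha almost none do.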
145 | We can use MEGA to compare different models and select the one that best suits our data.
146 |
147 | a) Return to the main MEGA window. From the Models menu, select "Find Best DNA/Protein Models (ML)" and use the active data. This will bring up a box that contains a set of options. Click "Compute" without making any changes.
148 |
149 |
150 | **Question 9.** Which is the best model as identified by having the lowest AICc and BIC?
151 |
152 |
153 | Answer:
154 |
155 | *The model with the lowest AICc and BIC is the K2+G model. This model allows for different rates for transitions versus transversions, while assuming equal base frequencies. It also takes evolutionary rates across sites to follow a gamma distribution.*
156 |
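The criteria used for model selection can be written out explicitly. This is a sketch using the standard definitions of AIC, AICc and BIC; the log-likelihood, parameter count and sample size below are hypothetical numbers chosen for illustration and are not results from the marsupial data.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC with the small-sample correction; n is the sample size."""
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical values: log-likelihood -1000, 10 parameters, 1000 sites.
example_aicc = aicc(-1000.0, 10, 1000)
example_bic = bic(-1000.0, 10, 1000)
print(round(example_aicc, 2), round(example_bic, 2))
```

Both criteria reward a higher likelihood and penalise additional parameters, so a simpler model such as K2+G can score better than GTR+G when the extra parameters do not improve the fit enough.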
157 |
158 |
159 |
160 | **Question 10.** For the K2+G model, what is the value of the alpha parameter of the gamma distribution? And what does that indicate about the variation in rates across sites in these data?
161 |
162 |
163 | Answer:
164 |
165 | *The alpha parameter is estimated to be 0.56. This is well below 1 and suggests that there is a substantial amount of rate variation across sites. This means that many sites are evolving slowly and a few sites are evolving rapidly.*
166 |
167 |
168 |
169 |
170 | **Question 11.** Are there differences in the results when using the JC versus the K2+G models? What does this say about the data?
171 |
172 |
173 | Answer:
174 |
175 | *While some branches have different lengths, the relationships are identical regardless of the model used for inference. This suggests that the tree signal is robust to the substitution model used.*
176 |
177 |
178 |
179 |
180 | ### Exercise 4: Gene trees
181 |
182 | Population-level processes can cause gene trees to differ from one another. In some cases, this is a dominant form of variation and can be modelled using the multi-species coalescent.
183 |
184 | a) Open FigTree and the marsupials.tree file via File > Open and navigating to the file. This file contains a large number of trees from genes across the genomes of marsupials. The first gene tree will be shown in your window. Scroll across gene trees using the Prev/Next arrows at the top-right of the window.
185 |
186 |
187 | **Question 12.** Some branches seem to be missing or collapsed in some of the trees. What does this indicate about the phylogenetic information in the gene?
188 |
189 |
190 | Answer:
191 |
192 | *These branches are not missing but are extremely short (they effectively have length 0). This suggests that the genes contain very little information about these parts of the tree. One explanation is that these events happened in quick succession, such that no molecular changes occurred.*
193 |
194 |
195 |
196 |
197 | b) In the Layout panel on the left, toggle the tree layout to Radial, and once again explore a few trees.
198 |
199 |
200 | **Question 13.** Which are the most consistent sets of relationships across gene trees?
201 |
202 |
203 | Answer:
204 |
205 | *The koala and wombat are nearly always identified as sister clades, and so are the Tasmanian devil and numbat. The grouping of the two possums and the kangaroo is also very common. Gene trees seem to have substantial incongruence regarding other relationships.*
206 |
207 |
208 |
209 |
210 | **Question 14.** Among which groupings is there apparent incomplete lineage sorting in this tree, and what does that indicate about the ancestral populations of different groups of marsupials?
211 |
212 |
213 | Answer:
214 |
215 | *The relationships among the possums and the kangaroos seem to be affected by incomplete lineage sorting, as are the relationships at the root of the marsupial mole branch. This suggests that there was substantial exchange in the early stages of the diversification of these groupings of marsupials. These divergences were likely very fast events and population sizes relatively large, leading to incomplete lineage sorting and therefore widespread incongruence among gene trees.*
216 |
217 |
--------------------------------------------------------------------------------
/phylogenetics/Introductory_chapter_Yang.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/Introductory_chapter_Yang.pdf
--------------------------------------------------------------------------------
/phylogenetics/Kapli_et_al_2020.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/Kapli_et_al_2020.pdf
--------------------------------------------------------------------------------
/phylogenetics/Yang_and_Ranalla_2012.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/Yang_and_Ranalla_2012.pdf
--------------------------------------------------------------------------------
/phylogenetics/marsupials.fasta:
--------------------------------------------------------------------------------
1 | >Brushtail_possum
2 | ttgccatatcccttccaccaggtcctcctgaggaaccattgcctgagacacctcaacaacccccagagccaccagctgtctcccaagccccgcctcccctacctcttgcccctcgtcctgaagagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgtcggaaaacagtctccttctctgcatcagaggaaaccccaaccaccccaacccctgaggttcccccacctgttccacctccagctaagcctcctggtccccttccccggaaaatctcccggggtggggatcggaccattagaaatttgcccctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagcggaggtcggggtcgtctgccagaagaagaagaacctggaactgaagtagacttagcagtgttggcagatttggctctgacgccagctcggcgagggctagtcgctctacctgtaggggatgattctgaggccacagaaacctcagatgaggctgagcgtttgggtccagcaggtccctcaatcattcatgttcttctggagcacaactatgctctagctgtccggcctgctccccctgccccagtttccagacccctggatccacttccttctcctgccactgtcttcagttcacctgccgatgaggtcctagaagctcctgaggtggtggttgctgaggtggaggaggaagaagaggaggaggaggaagaggaggaagagtcggaatcatcagagagtagtagtagcagcagtgatggggagggcgccttgaggcgcaggagccttcgatcccatgctaggcgccgtcgacagcccttgccacctgcccctccaccgccacccagctatgagccccgaagtgagtttgagcagatgaccatcttgtatgacatctggaattccgggcttgatgctgaggacatgggctacttacgactcacctatgagaggctgcttcagcaggatggtggagccgactggcttaatgatacccactgggtacatcacaccaatatcctagacctctgtgggtcaggggctagcagatgtgggagctattgtacaaaaagcctcaagtctcacctctccctcttgcgcagtcattttagtaatgatatttctgtgacttcattgtcgagtactccttttttgtctgctacttttcaccttct
3 | >Numbat
4 | tttatcatatccctttcaccaggtcctcctgaggagctattacctgagacacctcatcatcccccagagccaccagctgtttcccaaggctctcctcccttacctctggtccctcatcctgaggagtgtccatcttctcctattcctcttctcccaccaccgaaaaagcgtcggaaaacagtctccttctctgcatcagaggaaatgccaactacccccayccctgaagttcccccacctgttgcacctccagttaagcctcctggtcctcttccacggaaaatctcccgaggtggtgataggaccattcgaaatttgcctctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcgggctcgtctgccagaagaagaagaatctgggactgaagtagatcttgcagtgttggcagatctggccctgactccagctccagctcggcgaggactagtcactctgcctgtaggggatgattctgaggccacagaaacctcagatgaggctgagcgtggggggccagcagttccctcaatcattcatgttcttctggagcacaactatgctctggctgtccggcctgctccccctgccccagtgtccagacccttggagccacttccttcccctgccactgttttcagctcacctgccgatgaggtcttagaagctcctgaggtggtggttgctgaggtagaggaggaagaggaggaggaggaagaagaagaggagtcagagtcatcagagagtagtagcarcagcagtgatggagagggcgccttgaggcggaggagccttcgatcccatgccaggcgccgtcgacagcccttgccacctgcccctccacccccacctagctatgagcctcgaagtgagtttgagcagatgaccatcctgtatgatatctggaattctggacttgatgctgaggatatgggttatttacgactcacttatgagagattgctgcaacaggacggtggtgcagactggcttaatgatacccactgggtgcatcacactaatatcctaaactgctgtgggtcagctagcagattgggaaactcctgtaccagaaacttgaagtctaaccttttctccctcttgagtggttcttttagtagttattgtcaagtactcttctttttgtctgctatttttcaccttcgttc
5 | >Tasmanian_devil
6 | atattgtatctctttcaccaggtcctcctgaggagccattacctgagacacctcatcatcctccagagccaccacctgtctcccaagcctctcctcccctacctctagttcctcatcctgaggagtgtccatcttctcctatccctcttctcccaccaccgaagaagcgccggaaaacagtctctttctctgcatcagaggaaatgccaactacccccacccctgaagttcccccacctgttgcacctccagttaagcctcctggtccccttccacggaaaatctcccggggtggtgatcggaccattcgaaatttgcctctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcgggctcgtctgccagaagaagaagaatctgggactgaagtagacctcgcagtgttggcagacctggccctgactccagctccagctcggcgagggctagtcactctgcctgtaggggatgattctgaggccacagaaacctcagatgaagctgagcgtggggggccagctgttccctcgatcattcatgttcttctggagcacaactatgctctggctgtccggcctgctccccctgccccagtgtccagacccttggagccacttccttcccctgccactgtcttcagctcacctgccgatgaggtcttagaagctcctgaggtggtggttgctgaggtagaggaggaagaggatgaagaagaggaggaagaggagtcagagtcatcagagagtagtagcagcaccagtgatggagaaggcgccttgaggcgaaggagccttcgatcccatgccaggcgccgtcaacaacccttgccacctgcccctccacccccacctagctatgagccccgaagtgagtttgagcagatgaccatcctttatgacatctggaattctggacttgatgctgaggacatgggttatttacgactcacttatgagagattgctacagcaggacagtggtgctgattggcttaatgatacccactgggtacatcacactaatatcctagactgttgtgggtcaggagctagcagattggggaactactgtaccagaagcttcaagtctaaccttttctcccttttgagtggttcttttagtagttatgcttctatgactccattgtcaagtactcttctttttatctgc
7 | >Marsupial_mole
8 | agaggaaccattgcctgagacacctcaacatcttccagaaccaccggctgtctcccaagtctctcctcccctacctcttgcccctcatcctgaggagtgcccatcttctcctatcccacttctcccaccaccgaagaagcgccggaaaacagtctccttctctgcatcagaggaaaccccagctaccccaaccctagactggtcactgaggctctcccacctattccacctccacctaagcctcctggtccccttccccggaaactttctcgttctggggatcggaccattcgaaatttgcccctggaccatgcttctctggtcaagagctggccagaggaaagttgccgaacaggacgaaaccggagtggaggtcggagtcgtttgccagaagaagaagaacatgggactgaagtagacctagcagtgttagcagatctggccctgactccagctcggcgagggctagtcactctacctgttggggatgattctgaggccacagagacttctgatgaggctgagcgtgtagggccagctgttccctccatcattcatgttcttctggaacacaactatgctctggctatccggcctactccccctgccccagtttccagacccctggaaccacttccttccacagccactgtcttcagctcacctgctgatgaggtcctcgaagctcctgaggttgtggttgctgaggtagaggaggaagaggaggaagaggaagaggaatcagagtcatcagagagtagtagcagcagtagtgatggagaaggcactttgaggcggaggagccttcgatcccatgccaggcgccgtcgacagcccrtgccacctttcttaggactccaccaccttccacccagctatgagcctygaagtgagtttgagcagatgaccatcctayatgatatctggaattcygggcttgatgctgaggacrtggrctacttaagacttacctatgagaggstgctgcaacaggatgrtggagctgactggcttaatgatasccactgggtacatcacaccatcaccaacttgaccatcccttcctgccttgttaaggtg
9 | >Monito_del_monte
10 | ttaccacatccctctcaccaggtcctcctgaggaaccattgcctgagacccctcagcatcctccagagccagcagctgtctcccaagcttctcctcccttacctcttgcccctcgtcctgaggagtgtccatcttctcctatcccacttctccccccaccgaagaagcgacggaaaacagtctctttctctgcatcagaggaaaccccaaccaccccaacccctgaggttccccaasctgttccacctccagctaagcctcctggtcctcttccccggaaaatctcccggggtggggatcggaccattcgaaatttgcccctggatcatgcttctctggtcaagagctggccggaggatggatcccgggtggggcggaaccggagtggaattcgggggcgtcttccagaagaagaagaacctgggactgaagtagacctagcagtgttggcagatttggccctgactccagctccagctcggcgagggctggtcagtctacccgtaggggatgattctgaggccacagagacctcagatgaggctgagcgtgtggggccagcagttccctcaatcrttcatgttcttctggagcacaactatgctctggctgtccggcccgctcccccagccccagtttccagacccctggagccacttccttcccctgccactgtcttcagctcacctgccgatgaggtcytagaagctcctgaggtggtggtcgctgaggtggaggaggaagaagaggaggaggaggaagaggaagaggagtcagagtcatcagagagtagtagcagcagcagtgatggggagggtgccctgaggcgaaggagccttcgatcccatgccaggcgccgtcgacagcccttgccgcctgcccctccacccccacccagctatgaaccccgaagtgagtttgagcagatgaccatcctgtatgacatctggaattcygggcttgatgctgaggacatgagctacttacgactcacctacgagaggctcctgcagcaggacagtggagctgactggcttaatgatacccactgggtacatcataccaatatcctagacctctgtgggtcaggggctagtagatgtgggagctgctgtgccagatgccttgagtctcaccttctctcctccttgaggtttct
11 | >Opossum
12 | ttaccacttcccttttgccaggtcctcctgaggagccactgcccgagacccttcagcatcccccagagccaccagctgtcccccaaacctctccttcccttcctcttgcacctcgtcctgaggagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgccggaaaacagtctccttctctgcatcagaggaaactccaaccaccccaacccctgaggttcccccacctgttccacctccagctaagcctcctggtcccctttcccgaaaaatctcccggggtggggatcggaccattcgaaatctgcccctggaccatgcttctttggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcgaggtcgtctgccagaagaagaagaactagggactgaagtagacctagcagtattggcagatttggccctcactccagctccggctcggcgagggctgatcgctctacctgtaggggatgattctgaggccacagaaacctcagatgaggctgagcgttcgggaccaacagttccctcagtcattcacgttcttctagagcataactatgctttggctgttcgacctgctcccccaactccagtttccagacccttggaatcacttccttctcctgccactgtcttcagctcacctgctgatgaggttctagaagcccctgaggtggtggtcgctgaggtggaggaagaggaagaaggggaggaagaagaagaagaggaggaggagtctgagtcatctgaaagtagcagcagtagcagtgatggggagggagccttgaggcgaaggagccttcgatctcatgccaggcaccgtcgacagcccctgccacctgcccctccacccccacccagctatgagccccgaagtgagtttgagcagatgaccatcctatatgacatctggaattccggacttgatgccgaggatatgggctacttaagactcacatatgagaggctgctgcagcaggacagtggagctgactggcttaatgatacccactgggtccatcacaccaatatcctagatctctatagggcaggggctagcagatgtgggagcaactgtacccaaagccccaaggttcacctttgctccttcttgagagatattgctccattgttagagtctccctcttgtctgctacttttcaccgtcatctttgttcatttcattctttgccta
13 | >Bandicoot
14 | ttaccacatcatttttcaacagaaccttctgaggaaccactgcctgagacacctcaacatcccccagagctatcagttatttcccaagcctcttctcctctacctcctgcccctcgtcctgaggagtgtccatcttctcctatcccattgctcccaccaccgaagaagcgccggaaaacagtctctttctctgcatcagaggaaaccccaaccaccccaacccctgaggttgccccacctgttccacctccagctaagccttctggtccccttccccggaaaatctcccggggtggggatcgaaccattcgaaatttgcccctggaccatgcttctctggtcaagagctggccagaggatggatgccggacagggcgaaaccgaagtggaggtcggggtcgcctgccagaagaagaagaacctgggactgaagtagacctaacagtgttggcagatttggccttgactccagctaggcgagggctagtcactctacctgtaggggatgattctgaagccacagagacttctgatgaggctgagcgtgtgggatcagcagttccctcaatcgttcatgttctccaggaacacaactatgctctggctgtccgacctgctccccctgctccagtttccagacccctggaaccacttccttctcctgccactgtcttcagctcacctgcagatgaggtcctagaagctcctgaggtggtagttgctgaggtggaggaagaggaggaggaggaggaggaggaggaggaggaggaggaagaggaagaggagtcagagtcatcagaaagtagtagtagcagcagtgatggagagggtgtcttgagacgtaggagccttcgatcccataccaggcgtcgtcgacaacccatgccacctgcccctccgcctccacccagctatgagcccagaagtgaatttgagcagatgaccatcctgtatgacatctggaattctgggcttgatgctgaggacatgagctacttacgacttacctatgagaggttgctgcagcaggatggtggagccgattggcttaatgatacccactgggtgcatcacaccaatatcctagacctctgtgggtcagggactagcagttgtgggagctactgcatcagaagcttcaagtctcaccttctc
15 | >Koala
16 | tttaccacatccttttcaccaggtcctcctgaggaaccattgcctgagacacctcaacatcctccagagccaccagctatctcccaagcctctcctcccctaccttttgcccctcgtcctgaggagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgtcggaaaacagtctccttctctgcatcagaggaaaccccaaccaccccaaccccagaggttcccccacctgttccacctccagttaagcctcctggtccccttccccgaaaaatctcccgaggtggggatcggaccattcgaaatttgcctctggatcatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggagggcggggtcgtctgccagaagaagaagaacctgggaccgaagttgacctcacagtgttggcagatttggccctgactccagctccagctcggcgaggcctagtcactctacctgtaggggatgattctgaagccactgagacctcggatgaggctgagcgtttgggtctagcaggtccctcaatcattcatgttcttttggagcacaactatgctctggctgtccggcctgctccccctgccccagtttccagacccctggaaccacttctttctcctgccactgtcttcagctcacctgccgatgaggtcctagaagctcctgaggtggtggttgctgaggtggaggaggaggaaggggaggaggaggaagaggaggaagagtcggagtcatcagagagtagtagcagcagcagtgatggggaaggtaccttgaggcgtaggagcctccgatcccatgccaggcgccgtcgacagcccttgccacctgcccctccacctccacccagctatgagccccgaagtgagtttgagcagatgaccatcctgtatgacatctggaattctgggcttgatgctgaggacatgggctacttacgactcacctacgagaggctgctgcagcaggacggtggagccgactggcttaatgatacccactgggtacatcacaccaatatcctagacctctgtgggttgggctaccagatgtgggagctgctgtaccagaagcctcaagtctcacttctctctctctctctctctctccctttctctttctagcagtcattttactaataatatttctgtgactcagttgttgagtactctttttttctctgctacttttcaccttcatt
17 | >Wombat
18 | cctttcaccaggtcctcctgaagaaccattgcctgagacacctcaacatcctccagagccaccagctgtctcccaagcctctcctcccctacctcttgcccctcgtcctgaggagtgtccatcttctcctgtcccacttctcccaccaccgaaaaagcgtcggaaaacagtctccttctctgcatcagaggaatccccaacccccccaacccccgaggttcccccacctgtgccacctccacctaagcctcctggtccccttccccgaaaaatctcccgaggtggggatcggaccattcgaaatttgcccctggatcatgcttctctggtcaagagttggccagaggatggatcccgggtggggcggaaccggagtggaggtcggggtcgtctgccggaagaagaagaacccgggactgaagtagacctagcagtgttggcagatttggccctgactccagctccggctcggcgagggctagtcactctacctgtaggggatgattctgaggccacagagacctcggatgaggctgagcgtttgggtccagcaggtccctcaatcgttcatgttcttctggagcacaactatgctctggctgtccggcctgttccccctgctccagtttccagacccgtggaacaacttccttcccctgccactgtcttcagctcacctgccgatgaggtcctagaagctcctgaggtggtggttgctgaggtggaggaggaagaggaggaggaggaggaggaagagtcagagtcatcagagagtagtagcagcagcagtgatggggagggcaccttgaggcgtaggagcctccgatcccatgcccggcgccgtcgacagccattgccacctgcccctccgcctccacccagctatgagccccgaagtgagtttgagcagatgaccatcctgtatgacatctggaattctgggcttgatgctgaggacatgggctacttacgactcacctacgagaggctgttgcagcaggacagtggagccgactggcttaatgatacccactgggtacatcacaccaatatcctggacctctctgggttgggctagcagatgtgggagctactgtaccagaagcctcaagtcccccccccacacactctccctctctctctttttcacagtcatttcactaatgatttttctgacttcattgttaagtattcttttttttgtctgctacttttcaccttcgttcc
19 | >Kangaroo
20 | ttactacatccctttcaccaggtcctcctgaggaactattgcctgagacacctcaacatcccccagagccaccagctgtctcccaaacctctcctcccctacctcttgcccctcgtcctgaagagtgtccatcttctcctatcccacttctcccaccaccgaagaagcggcggaaaacagtctccttctctgcatcagaggaaaccccaaccaccccaactcctgaggttccaccacctgttccacctccagctaagcctattggtccccttccccggaaaatctcccggggtggggatcggaccattcgaaatttgcccctggaccatgcttctctggtcaagagctggccagaggatggatcccgggcagggcggaaccggagtggaggtcggagtcgtctgccagaagaagaagaacctgggactgaagtagacctggcagtcttggcagatttggccctgactccagctcggcgagggctggtcgctctacccataggagatgattctgaggccacagagacctcagacgaggccgagcgtttgggtccagcaggtccctcaattgttcatgttcttctggagcacaactatgctctggctgtccggcccgctccccctgctccagtttccagatccctggacccacttccttcccctgccactgtcttcagctcacctgccgatgaggtcctagaagcccctgaggtggtggttgctgaggtggaggaggaagaggaggaggaggaagaggaagaagagtctgaatcatcagagagtagtagtagcagcagtgatggggagggtgccttgaggcggaggagccttcgatcccatgccaagcgccgtcgacagcccttgccccctgcccctccacccccacccagttatgagcctcgaagtgagtttgagcagatgaccatcttgtatgacatctggaattctggacttgatgctgaggacatgggctacttacgactcacctatgagaggctgctacagcaggatggtggagctgactggctcaatgatacccactgggtccatcacaccaatatcctagacctctgtgcgcatgtgggagctattgtaccagaagcgtcaagtctcacctctccttcttaatcagtcattttagtaatgatatttctgtgagtccattgttgagtactttttttttgtctgccacttttcaccttcattcttgt
21 | >Ringtail_possum
22 | tttaccacatccttttcactaggtcctcctgaggaaccattgcctgagacacctcagcatcccccagagccaccagctgtctctcaagcctctcctcccctacctcttgccccccgtcctgaagagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgtcggaaaacagtctcctcctctgtatcagaggaaatcccaatcaccccaacccctgaggttcccccacctgttccacctccagctaagccttctggtccccttccccggaaaatctcccggggtggggatcggaccattcgaaatctgcccctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcggggtcgtctgccagaagaagaagaacctgggactgaagtagacctagcagtgttggcagatttggctctgactccagctcgaagagggctgatcactgtacccgtaggggatgattctgaggccacggagacctcggacgaggctgagcggtcggctccggcaggtccctcaatcattcatgttcttctggagcacaactatgctctgtctgtccgacctgcaccccctgccccagtttccagacccctggacctacttccttcccctgccactgtcttcagctcacctgccgatgaggtcctagaagctcctgaggtggtggtcgctgaggtggaggaagaagaagaggaggaggaggaagaggaagaagagtcrgaatcctcagagagtagtagcagcagcagcagcgacggggagggcgccttgaggaggagaagccttcgatcccatgctaggcgccgtcgacagcccttgccacctgcccctccacccccacccagctatgagccccggagtgagtttgagcagatgaccatcctgtacgacatctggaattctgggcttgatgctgaggacatgggctacttgcgactcacctacgagaggctgctgcagcaggatggtggagccgactggctgaatgatacccactgggtgcatcacaccaatatcctagagctttgtgggtcaggggctggcagatgtgggagctgttgtaccagaagcttcaagtctcacctcttcctcttgagcattcattttaataatgatatttctgtgacctcattgtcaagtacactttttttgtctgctatttgtcaccttcgttcttc
--------------------------------------------------------------------------------
/phylogenetics/marsupials.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/marsupials.png
--------------------------------------------------------------------------------