├── Effective_population_size ├── 0_get_data.md ├── 1_clean_data.md ├── 2_do_PCA.md ├── 3_plot_PCA.ipynb ├── 4_calculate_LD.md ├── 5_estimate_Ne.ipynb ├── 6_plot_Ne_Nc.ipynb ├── README.md ├── data │ ├── Nc_estimates.txt │ ├── pink_salmon.bed │ ├── pink_salmon.bim │ ├── pink_salmon.fam │ └── tmp ├── do_everything ├── images │ ├── sampling_locations.png │ ├── sfs1dx.png │ └── tmp ├── plots │ └── tmp ├── scripts │ ├── 0_get_data.sh │ ├── 1_clean_data.sh │ ├── 2_do_PCA.sh │ ├── 3_plot_PCA.r │ ├── 4_calculate_LD.sh │ ├── 5_estimate_Ne.r │ ├── 6_plot_Ne_Nc.r │ ├── R_functions.r │ └── tmp ├── tmp └── work │ └── tmp ├── Extra_exercise ├── Extra exercises.pdf └── readme.MD ├── GWAS ├── ExerciseGWAS_2020.md └── ExerciseGWAS_2022.md ├── HWexercise ├── .gitkeep ├── HW.md ├── HW1PA.png ├── HW2Sn.png ├── HW3HV.png ├── HW3HV2.png ├── HW4dF.png ├── HW5TP.png ├── HW6QQ.png └── HWanswers.pdf ├── Linkage_disequilibrium_gorillas ├── ExercisesLD20.md ├── LD_ex_0_gorilla.png ├── LD_ex_1_dist_map.png ├── LD_ex_2_link_map.png ├── LD_ex_3_LD_decay.png ├── LDinGorillas.md ├── LDinGorillas_A_24.md └── zzz ├── NaturalSelection ├── ExerciseSelection_2020.md ├── browser.png ├── fig1.png ├── fig2.png ├── fig3.png ├── fig4.png ├── fig5.png ├── sel1sickle.png ├── sel2underdominance.png ├── sel3drosophila.png ├── sel4condor.png ├── selectionAA.md ├── shinySelectionSim.R └── z ├── NucleotideDiversiteyExercise ├── Exercise in estimating nucleotide diversity.md ├── Exercise in estimating nucleotide diversityWoA.md ├── Selection_020.png ├── chimp_distribution.png ├── ex1nucdivfigure1.png └── slidesGeneticDiversityExercise .pdf ├── Population_history ├── population_history_lab_2020_noAnswers.md └── z ├── Population_structure ├── ExerciseFst2022.md ├── ExerciseFst2023.md ├── ExerciseFst2024WA.md ├── ExerciseStructure.md ├── ExerciseStructure_2020.md ├── ExerciseStructure_2022.md ├── ExerciseStructure_2023.md ├── ExerciseStructure_2024.md ├── chimp_distribution.png ├── exer_struct_1_PCA.png ├── 
exer_struct_2_admixture.png ├── manhattanFst.png └── z ├── QuantitativeGenetics ├── Exercises_Quantitative_genetics2020.md ├── FforInd.png ├── Fgenome.tar.gz ├── IBDexercises.md ├── QG1_5length.png ├── QG1assortative.png ├── QG2Children.png ├── QG3Ne.png ├── QG4heritability_height.png ├── QG5heritability_children.png ├── QG6monogamy.png ├── Rgenome.tar.gz ├── fig1.png ├── fig2.png ├── im.png └── z ├── README.md └── phylogenetics ├── ExercisesPhylo_2021.md ├── Introductory_chapter_Yang.pdf ├── Kapli_et_al_2020.pdf ├── Yang_and_Ranalla_2012.pdf ├── marsupials.fasta ├── marsupials.png └── marsupials.tree /Effective_population_size/0_get_data.md: -------------------------------------------------------------------------------- 1 | 2 | ## These four commands download the pink salmon genotype data files (in Plink format: bed/bim/fam) and the census population size estimates. 3 | 4 | ```bash 5 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bed --directory-prefix ./data 6 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bim --directory-prefix ./data 7 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.fam --directory-prefix ./data 8 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/Nc_estimates.txt --directory-prefix ./data 9 | ``` 10 | 11 | -------------------------------------------------------------------------------- /Effective_population_size/1_clean_data.md: -------------------------------------------------------------------------------- 1 | # ./scripts/1_clean_data.sh 2 | All commands to be run in the terminal. 3 | 4 | 5 | 6 | ### Remove loci not placed on a chromosome. 7 | Create a new data set (in ./work/) that includes only loci placed on one of the 26 chromosomes in pink salmon. Call it "pink_salmon.clean". 
The --make-bed command tells Plink to create the pink_salmon.clean.bed (binary file with genotype data), pink_salmon.clean.bim (text file with locus information) and pink_salmon.clean.fam (text file with sample information) files. 8 | 9 | When using Plink to analyze data from non-human species, it is important to tell Plink not to interpret chromosome "23" as the X chromosome and chromosome "24" as the Y chromosome (this mapping holds for humans and is the default configuration in Plink). The --autosome-num 26 flag does this, and --not-chr 0 drops the loci that are not placed on a chromosome. 10 | 11 | 12 | ```bash 13 | plink --bfile ./data/pink_salmon --autosome-num 26 --not-chr 0 --make-bed --out ./work/pink_salmon.clean 14 | ``` 15 | 16 | ### Create a separate set of genotype data files for each population. 17 | For each of the six populations of pink salmon, select just the individuals in the population and then filter for HWE, genotyping rate per locus, and minor allele frequency. The --autosome-num 26 flag is repeated in each command, as in ./scripts/1_clean_data.sh. 18 | 19 | ```bash 20 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_ODD 21 | 22 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_EVEN 23 | 24 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_ODD 25 | 26 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_EVEN 27 | 28 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_ODD 29 | 30 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_EVEN 31 | ``` 32 | 33 | -------------------------------------------------------------------------------- /Effective_population_size/2_do_PCA.md: 
-------------------------------------------------------------------------------- 1 | # ./scripts/2_do_PCA.sh 2 | All commands to be run in the terminal. 3 | 4 | For each of the two basic data files (before and after removing loci not placed on a chromosome), use Plink to perform a principal components analysis. 5 | 6 | ```bash 7 | 8 | plink --bfile ./data/pink_salmon --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_data.initial 9 | 10 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_salmon.clean 11 | 12 | ``` 13 | -------------------------------------------------------------------------------- /Effective_population_size/4_calculate_LD.md: -------------------------------------------------------------------------------- 1 | # ./scripts/4_calculate_LD.sh 2 | All commands to be run in the terminal. 3 | 4 | See Plink documentation [here](https://www.cog-genomics.org/plink2/). 5 | 6 | For each of the six populations, use Plink to calculate LD between each pair of loci. The "--r2 square" flag to Plink will produce an L x L matrix of pairwise r2 values, where L is the number of loci. 
7 | 8 | 9 | 10 | ```bash 11 | plink --bfile ./work/Koppen_ODD --autosome-num 26 --r2 square --out ./work/Koppen_ODD 12 | 13 | plink --bfile ./work/Koppen_EVEN --autosome-num 26 --r2 square --out ./work/Koppen_EVEN 14 | 15 | plink --bfile ./work/Nome_ODD --autosome-num 26 --r2 square --out ./work/Nome_ODD 16 | 17 | plink --bfile ./work/Nome_EVEN --autosome-num 26 --r2 square --out ./work/Nome_EVEN 18 | 19 | plink --bfile ./work/Puget_ODD --autosome-num 26 --r2 square --out ./work/Puget_ODD 20 | 21 | plink --bfile ./work/Puget_EVEN --autosome-num 26 --r2 square --out ./work/Puget_EVEN 22 | 23 | ``` -------------------------------------------------------------------------------- /Effective_population_size/5_estimate_Ne.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Start R code here" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "# run the script below, which defines functions shared across multiple scripts\n", 19 | "source('./scripts/R_functions.r')" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "# Make an empty data frame to store the results\n", 31 | "DF <- data.frame(Pop=rep(NA, 6), Site=rep(NA, 6), Lineage=rep(NA, 6), Ne_est=rep(NA, 6))" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 3, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/html": [ 44 | "\n", 45 | "\n", 46 | "\n", 47 | "\t\n", 48 | "\t\n", 49 | "\t\n", 50 | "\t\n", 51 | "\t\n", 52 | "\t\n", 53 | "\n", 54 | "
\n" 55 | ], 56 | "text/latex": [ 57 | "\\begin{tabular}{r|llll}\n", 58 | " Pop & Site & Lineage & Ne\\_est\\\\\n", 59 | "\\hline\n", 60 | "\t Nome\\_ODD & Nome & ODD & 2014.42557567929\\\\\n", 61 | "\t Nome\\_EVEN & Nome & EVEN & 3691.29596294112\\\\\n", 62 | "\t Koppen\\_ODD & Koppen & ODD & 2079.5810007219 \\\\\n", 63 | "\t Koppen\\_EVEN & Koppen & EVEN & 9087.46320481003\\\\\n", 64 | "\t Puget\\_ODD & Puget & ODD & 2934.27555677135\\\\\n", 65 | "\t Puget\\_EVEN & Puget & EVEN & 1805.56226291709\\\\\n", 66 | "\\end{tabular}\n" 67 | ], 68 | "text/markdown": [ 69 | "\n", 70 | "Pop | Site | Lineage | Ne_est | \n", 71 | "|---|---|---|---|---|---|\n", 72 | "| Nome_ODD | Nome | ODD | 2014.42557567929 | \n", 73 | "| Nome_EVEN | Nome | EVEN | 3691.29596294112 | \n", 74 | "| Koppen_ODD | Koppen | ODD | 2079.5810007219 | \n", 75 | "| Koppen_EVEN | Koppen | EVEN | 9087.46320481003 | \n", 76 | "| Puget_ODD | Puget | ODD | 2934.27555677135 | \n", 77 | "| Puget_EVEN | Puget | EVEN | 1805.56226291709 | \n", 78 | "\n", 79 | "\n" 80 | ], 81 | "text/plain": [ 82 | " Pop Site Lineage Ne_est \n", 83 | "1 Nome_ODD Nome ODD 2014.42557567929\n", 84 | "2 Nome_EVEN Nome EVEN 3691.29596294112\n", 85 | "3 Koppen_ODD Koppen ODD 2079.5810007219 \n", 86 | "4 Koppen_EVEN Koppen EVEN 9087.46320481003\n", 87 | "5 Puget_ODD Puget ODD 2934.27555677135\n", 88 | "6 Puget_EVEN Puget EVEN 1805.56226291709" 89 | ] 90 | }, 91 | "metadata": {}, 92 | "output_type": "display_data" 93 | } 94 | ], 95 | "source": [ 96 | "# Estimate Ne and store the results in a data frame\n", 97 | "Pops = c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN')\n", 98 | "\n", 99 | "for (index in 1:6){\n", 100 | " POP = Pops[index]\n", 101 | " site = strsplit(POP, split = '_')[[1]][1]\n", 102 | " lin = strsplit(POP, split = '_')[[1]][2]\n", 103 | " res = get_Ne(base_path = paste(\"./work/\", POP, sep = ''))\n", 104 | " DF[index, ] = c(POP, site, lin, res$Ne_est)\n", 105 | "}\n", 106 | "\n", 107 | "# 
take a look at the results \n", 108 | "DF" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 4, 114 | "metadata": { 115 | "collapsed": false 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "write.table(DF, \"./work/Ne_estimates.txt\", sep = '\\t', quote = FALSE, row.names = FALSE)" 120 | ] 121 | } 122 | ], 123 | "metadata": { 124 | "kernelspec": { 125 | "display_name": "R", 126 | "language": "R", 127 | "name": "ir" 128 | }, 129 | "language_info": { 130 | "codemirror_mode": "r", 131 | "file_extension": ".r", 132 | "mimetype": "text/x-r-source", 133 | "name": "R", 134 | "pygments_lexer": "r", 135 | "version": "3.3.2" 136 | } 137 | }, 138 | "nbformat": 4, 139 | "nbformat_minor": 2 140 | } 141 | -------------------------------------------------------------------------------- /Effective_population_size/data/Nc_estimates.txt: -------------------------------------------------------------------------------- 1 | Pop Site Lineage Nc_est 2 | Nome_ODD Nome ODD 300000 3 | Nome_EVEN Nome EVEN 10000 4 | Koppen_ODD Koppen ODD 200000 5 | Koppen_EVEN Koppen EVEN 200000 6 | Puget_ODD Puget ODD 1400000 7 | Puget_EVEN Puget EVEN 4000 8 | 9 | -------------------------------------------------------------------------------- /Effective_population_size/data/pink_salmon.bed: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Effective_population_size/data/pink_salmon.bed -------------------------------------------------------------------------------- /Effective_population_size/data/pink_salmon.fam: -------------------------------------------------------------------------------- 1 | Koppen_ODD PKOPE91T_0001 0 0 0 -9 2 | Koppen_ODD PKOPE91T_0002 0 0 0 -9 3 | Koppen_ODD PKOPE91T_0003 0 0 0 -9 4 | Koppen_ODD PKOPE91T_0005 0 0 0 -9 5 | Koppen_ODD PKOPE91T_0006 0 0 0 -9 6 | Koppen_ODD PKOPE91T_0007 0 0 0 -9 7 | Koppen_ODD 
PKOPE91T_0009 0 0 0 -9 8 | Koppen_ODD PKOPE91T_0010 0 0 0 -9 9 | Koppen_ODD PKOPE91T_0011 0 0 0 -9 10 | Koppen_ODD PKOPE91T_0013 0 0 0 -9 11 | Koppen_ODD PKOPE91T_0014 0 0 0 -9 12 | Koppen_ODD PKOPE91T_0015 0 0 0 -9 13 | Koppen_ODD PKOPE91T_0016 0 0 0 -9 14 | Koppen_ODD PKOPE91T_0017 0 0 0 -9 15 | Koppen_ODD PKOPE91T_0018 0 0 0 -9 16 | Koppen_ODD PKOPE91T_0019 0 0 0 -9 17 | Koppen_ODD PKOPE91T_0020 0 0 0 -9 18 | Koppen_ODD PKOPE91T_0024 0 0 0 -9 19 | Koppen_ODD PKOPE91T_0025 0 0 0 -9 20 | Koppen_ODD PKOPE91T_0026 0 0 0 -9 21 | Koppen_ODD PKOPE91T_0029 0 0 0 -9 22 | Koppen_ODD PKOPE91T_0035 0 0 0 -9 23 | Koppen_ODD PKOPE91T_0036 0 0 0 -9 24 | Koppen_ODD PKOPE91T_0037 0 0 0 -9 25 | Koppen_EVEN PKOPE96T_0001 0 0 0 -9 26 | Koppen_EVEN PKOPE96T_0002 0 0 0 -9 27 | Koppen_EVEN PKOPE96T_0003 0 0 0 -9 28 | Koppen_EVEN PKOPE96T_0004 0 0 0 -9 29 | Koppen_EVEN PKOPE96T_0005 0 0 0 -9 30 | Koppen_EVEN PKOPE96T_0006 0 0 0 -9 31 | Koppen_EVEN PKOPE96T_0007 0 0 0 -9 32 | Koppen_EVEN PKOPE96T_0008 0 0 0 -9 33 | Koppen_EVEN PKOPE96T_0009 0 0 0 -9 34 | Koppen_EVEN PKOPE96T_0010 0 0 0 -9 35 | Koppen_EVEN PKOPE96T_0011 0 0 0 -9 36 | Koppen_EVEN PKOPE96T_0012 0 0 0 -9 37 | Koppen_EVEN PKOPE96T_0013 0 0 0 -9 38 | Koppen_EVEN PKOPE96T_0014 0 0 0 -9 39 | Koppen_EVEN PKOPE96T_0015 0 0 0 -9 40 | Koppen_EVEN PKOPE96T_0016 0 0 0 -9 41 | Koppen_EVEN PKOPE96T_0017 0 0 0 -9 42 | Koppen_EVEN PKOPE96T_0018 0 0 0 -9 43 | Koppen_EVEN PKOPE96T_0019 0 0 0 -9 44 | Koppen_EVEN PKOPE96T_0020 0 0 0 -9 45 | Koppen_EVEN PKOPE96T_0021 0 0 0 -9 46 | Koppen_EVEN PKOPE96T_0022 0 0 0 -9 47 | Koppen_EVEN PKOPE96T_0023 0 0 0 -9 48 | Koppen_EVEN PKOPE96T_0024 0 0 0 -9 49 | Nome_ODD PNOME91_0001 0 0 0 -9 50 | Nome_ODD PNOME91_0002 0 0 0 -9 51 | Nome_ODD PNOME91_0003 0 0 0 -9 52 | Nome_ODD PNOME91_0004 0 0 0 -9 53 | Nome_ODD PNOME91_0005 0 0 0 -9 54 | Nome_ODD PNOME91_0008 0 0 0 -9 55 | Nome_ODD PNOME91_0009 0 0 0 -9 56 | Nome_ODD PNOME91_0010 0 0 0 -9 57 | Nome_ODD PNOME91_0011 0 0 0 -9 58 | Nome_ODD PNOME91_0013 0 0 
0 -9 59 | Nome_ODD PNOME91_0014 0 0 0 -9 60 | Nome_ODD PNOME91_0015 0 0 0 -9 61 | Nome_ODD PNOME91_0016 0 0 0 -9 62 | Nome_ODD PNOME91_0017 0 0 0 -9 63 | Nome_ODD PNOME91_0018 0 0 0 -9 64 | Nome_ODD PNOME91_0021 0 0 0 -9 65 | Nome_ODD PNOME91_0025 0 0 0 -9 66 | Nome_ODD PNOME91_0026 0 0 0 -9 67 | Nome_ODD PNOME91_0027 0 0 0 -9 68 | Nome_ODD PNOME91_0028 0 0 0 -9 69 | Nome_ODD PNOME91_0030 0 0 0 -9 70 | Nome_ODD PNOME91_0031 0 0 0 -9 71 | Nome_ODD PNOME91_0032 0 0 0 -9 72 | Nome_ODD PNOME91_0033 0 0 0 -9 73 | Nome_EVEN PNOME94_0001 0 0 0 -9 74 | Nome_EVEN PNOME94_0002 0 0 0 -9 75 | Nome_EVEN PNOME94_0003 0 0 0 -9 76 | Nome_EVEN PNOME94_0004 0 0 0 -9 77 | Nome_EVEN PNOME94_0005 0 0 0 -9 78 | Nome_EVEN PNOME94_0006 0 0 0 -9 79 | Nome_EVEN PNOME94_0007 0 0 0 -9 80 | Nome_EVEN PNOME94_0008 0 0 0 -9 81 | Nome_EVEN PNOME94_0009 0 0 0 -9 82 | Nome_EVEN PNOME94_0010 0 0 0 -9 83 | Nome_EVEN PNOME94_0011 0 0 0 -9 84 | Nome_EVEN PNOME94_0012 0 0 0 -9 85 | Nome_EVEN PNOME94_0013 0 0 0 -9 86 | Nome_EVEN PNOME94_0014 0 0 0 -9 87 | Nome_EVEN PNOME94_0015 0 0 0 -9 88 | Nome_EVEN PNOME94_0016 0 0 0 -9 89 | Nome_EVEN PNOME94_0017 0 0 0 -9 90 | Nome_EVEN PNOME94_0018 0 0 0 -9 91 | Nome_EVEN PNOME94_0019 0 0 0 -9 92 | Nome_EVEN PNOME94_0020 0 0 0 -9 93 | Puget_ODD PSNOH03_0005 0 0 0 -9 94 | Puget_ODD PSNOH03_0016 0 0 0 -9 95 | Puget_ODD PSNOH03_0017 0 0 0 -9 96 | Puget_ODD PSNOH03_0019 0 0 0 -9 97 | Puget_ODD PSNOH03_0021 0 0 0 -9 98 | Puget_ODD PSNOH03_0024 0 0 0 -9 99 | Puget_ODD PSNOH03_0027 0 0 0 -9 100 | Puget_ODD PSNOH03_0029 0 0 0 -9 101 | Puget_ODD PSNOH03_0039 0 0 0 -9 102 | Puget_ODD PSNOH03_0043 0 0 0 -9 103 | Puget_ODD PSNOH03_0046 0 0 0 -9 104 | Puget_ODD PSNOH03_0063 0 0 0 -9 105 | Puget_ODD PSNOH03_0065 0 0 0 -9 106 | Puget_ODD PSNOH03_0067 0 0 0 -9 107 | Puget_ODD PSNOH03_0074 0 0 0 -9 108 | Puget_ODD PSNOH03_0075 0 0 0 -9 109 | Puget_ODD PSNOH03_0076 0 0 0 -9 110 | Puget_ODD PSNOH03_0078 0 0 0 -9 111 | Puget_ODD PSNOH03_0079 0 0 0 -9 112 | Puget_ODD PSNOH03_0081 0 0 0 
-9 113 | Puget_ODD PSNOH03_0082 0 0 0 -9 114 | Puget_ODD PSNOH03_0088 0 0 0 -9 115 | Puget_ODD PSNOH03_0090 0 0 0 -9 116 | Puget_ODD PSNOH03_0095 0 0 0 -9 117 | Puget_EVEN PSNOH96_0003 0 0 0 -9 118 | Puget_EVEN PSNOH96_0004 0 0 0 -9 119 | Puget_EVEN PSNOH96_0005 0 0 0 -9 120 | Puget_EVEN PSNOH96_0006 0 0 0 -9 121 | Puget_EVEN PSNOH96_0007 0 0 0 -9 122 | Puget_EVEN PSNOH96_0008 0 0 0 -9 123 | Puget_EVEN PSNOH96_0009 0 0 0 -9 124 | Puget_EVEN PSNOH96_0010 0 0 0 -9 125 | Puget_EVEN PSNOH96_0011 0 0 0 -9 126 | Puget_EVEN PSNOH96_0012 0 0 0 -9 127 | Puget_EVEN PSNOH96_0013 0 0 0 -9 128 | Puget_EVEN PSNOH96_0014 0 0 0 -9 129 | Puget_EVEN PSNOH96_0015 0 0 0 -9 130 | Puget_EVEN PSNOH96_0016 0 0 0 -9 131 | Puget_EVEN PSNOH96_0017 0 0 0 -9 132 | Puget_EVEN PSNOH96_0018 0 0 0 -9 133 | Puget_EVEN PSNOH96_0019 0 0 0 -9 134 | Puget_EVEN PSNOH96_0021 0 0 0 -9 135 | Puget_EVEN PSNOH96_0022 0 0 0 -9 136 | Puget_EVEN PSNOH96_0023 0 0 0 -9 137 | Puget_EVEN PSNOH96_0024 0 0 0 -9 138 | Puget_EVEN PSNOH96_0025 0 0 0 -9 139 | Puget_EVEN PSNOH96_0027 0 0 0 -9 140 | Puget_EVEN PSNOH96_0028 0 0 0 -9 141 | -------------------------------------------------------------------------------- /Effective_population_size/data/tmp: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Effective_population_size/do_everything: -------------------------------------------------------------------------------- 1 | #mkdir popgen2018-pink_salmon 2 | #cd popgen2018-pink_salmon 3 | #wget https://api.github.com/repos/rwaples/popgen2018-pink_salmon/tarball/master -O - | tar xz --strip=1 4 | 5 | #bash ./scripts/0_get_data.sh 6 | bash ./scripts/1_clean_data.sh 7 | bash ./scripts/2_do_PCA.sh 8 | Rscript ./scripts/3_plot_PCA.r 9 | bash ./scripts/4_calculate_LD.sh 10 | Rscript ./scripts/5_estimate_Ne.r 11 | Rscript ./scripts/6_plot_Ne_Nc.r 12 | 
-------------------------------------------------------------------------------- /Effective_population_size/images/sampling_locations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Effective_population_size/images/sampling_locations.png -------------------------------------------------------------------------------- /Effective_population_size/images/sfs1dx.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Effective_population_size/images/sfs1dx.png -------------------------------------------------------------------------------- /Effective_population_size/images/tmp: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Effective_population_size/plots/tmp: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/0_get_data.sh: -------------------------------------------------------------------------------- 1 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bed --directory-prefix ./data 2 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.bim --directory-prefix ./data 3 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/pink_salmon.fam --directory-prefix ./data 4 | wget http://people.binf.ku.dk/cetp/popgen/pink_salmon/Nc_estimates.txt --directory-prefix ./data 5 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/1_clean_data.sh: -------------------------------------------------------------------------------- 1 | plink 
--bfile ./data/pink_salmon --autosome-num 26 --not-chr 0 --make-bed --out ./work/pink_salmon.clean 2 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_ODD 3 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Koppen_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Koppen_EVEN 4 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_ODD 5 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Nome_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Nome_EVEN 6 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_ODD --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_ODD 7 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --family --keep-cluster-names Puget_EVEN --hwe .001 --geno 0.02 --maf 0.05 --make-bed --out ./work/Puget_EVEN 8 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/2_do_PCA.sh: -------------------------------------------------------------------------------- 1 | plink --bfile ./data/pink_salmon --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_data.initial 2 | plink --bfile ./work/pink_salmon.clean --autosome-num 26 --maf 0.1 --pca 3 --out ./work/pink_salmon.clean 3 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/3_plot_PCA.r: -------------------------------------------------------------------------------- 1 | 2 | # Options for the notebook 3 | options(jupyter.plot_mimetypes = "image/png") 4 | options(repr.plot.width = 6, repr.plot.height = 6) 5 | 6 | read_pca_output <- function(path){ 7 | pca_df = read.table(path) 8 | names(pca_df) = 
c('Population', 'Individual', 'PC1', 'PC2', 'PC3') 9 | # enforce the order of populations 10 | pca_df$Population <- factor(pca_df$Population , levels =c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN')) 11 | pca_df = pca_df[order(pca_df$Population),] 12 | return(pca_df) 13 | } 14 | 15 | plot_pca_basic <- function(pca_df, title){ 16 | plot(pca_df$PC1 , pca_df$PC2, col = pca_df$Population, pch = 16, main = title) 17 | legend(x="topleft", legend = levels(pca_df$Population), fill = palette()[1:6]) 18 | } 19 | 20 | # unused - example of how to make a scatterplot in ggplot2 21 | plot_pca_ggplot <- function(pca_df){ 22 | library(ggplot2) 23 | p = ggplot(data = pca_df, aes(x = PC1, y = PC2)) 24 | p = p + geom_point(aes(colour = Population)) 25 | p = p + theme_classic() 26 | return(p) 27 | } 28 | 29 | pca_initial = read_pca_output('./work/pink_data.initial.eigenvec') 30 | head(pca_initial) 31 | 32 | pca_clean = read_pca_output('./work/pink_salmon.clean.eigenvec') 33 | head(pca_clean) 34 | 35 | # Make some plots and save them to a file 36 | 37 | png("./plots/PCA.pink_salmon.initial.png") 38 | plot_pca_basic(pca_initial, title = 'pink_salmon.initial') 39 | dev.off() 40 | 41 | png("./plots/PCA.pink_salmon.clean.png") 42 | plot_pca_basic(pca_clean, title = 'pink_salmon.clean') 43 | dev.off() 44 | 45 | print("Done plotting PCAs") 46 | 47 | 48 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/4_calculate_LD.sh: -------------------------------------------------------------------------------- 1 | plink --bfile ./work/Koppen_ODD --autosome-num 26 --r2 square --out ./work/Koppen_ODD 2 | plink --bfile ./work/Koppen_EVEN --autosome-num 26 --r2 square --out ./work/Koppen_EVEN 3 | plink --bfile ./work/Nome_ODD --autosome-num 26 --r2 square --out ./work/Nome_ODD 4 | plink --bfile ./work/Nome_EVEN --autosome-num 26 --r2 square --out ./work/Nome_EVEN 5 | plink --bfile ./work/Puget_ODD 
--autosome-num 26 --r2 square --out ./work/Puget_ODD 6 | plink --bfile ./work/Puget_EVEN --autosome-num 26 --r2 square --out ./work/Puget_EVEN 7 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/5_estimate_Ne.r: -------------------------------------------------------------------------------- 1 | 2 | source('./scripts/R_functions.r') 3 | 4 | # Make an empty data frame to store the results 5 | DF <- data.frame(Pop=rep(NA, 6), Site=rep(NA, 6), Lineage=rep(NA, 6), Ne_est=rep(NA, 6)) 6 | 7 | # Calculate 8 | Pops = c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN') 9 | 10 | for (index in 1:6){ 11 | POP = Pops[index] 12 | site = strsplit(POP, split = '_')[[1]][1] 13 | lin = strsplit(POP, split = '_')[[1]][2] 14 | res = get_Ne(base_path = paste("./work/", POP, sep = '')) 15 | DF[index, ] = c(POP, site, lin, res$Ne_est) 16 | } 17 | 18 | # take a look at the results 19 | DF 20 | 21 | write.table(DF, "./work/Ne_estimates.txt", sep = '\t', quote = FALSE, row.names = FALSE) 22 | 23 | 24 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/6_plot_Ne_Nc.r: -------------------------------------------------------------------------------- 1 | 2 | # Load in the Ne and Nc estimates 3 | 4 | Ne = read.table("./work/Ne_estimates.txt",sep = '\t', header = TRUE) 5 | Nc = read.table("./data/Nc_estimates.txt",sep = '\t', header = TRUE) 6 | 7 | Ne 8 | 9 | Nc 10 | 11 | ## Use the merge command to join them 12 | 13 | estimates = merge(Ne, Nc) 14 | estimates 15 | 16 | estimates$ratio = estimates$Ne_est / estimates$Nc_est 17 | # reorder to match input order 18 | estimates$Pop <- factor(estimates$Pop , levels =c('Nome_ODD','Nome_EVEN', 'Koppen_ODD', 'Koppen_EVEN', 'Puget_ODD', 'Puget_EVEN')) 19 | estimates = estimates[order(estimates$Pop),] 20 | estimates 21 | 22 | for_barplot = data.matrix(t(estimates[,c('Ne_est', 'Nc_est')])) 23 | 
colnames(for_barplot) = estimates$Pop 24 | 25 | for_ratio_barplot = data.matrix(t(estimates[,'ratio'])) 26 | colnames(for_ratio_barplot) = estimates$Pop 27 | 28 | 29 | png('./plots/Ne_estimates.png') 30 | par(mar=c(10,4,4,2)) 31 | barplot(for_barplot['Ne_est',], col = "white", beside = TRUE, las=2, #axes = FALSE, 32 | main = "Ne estimates for each population",ylab = 'Ne') 33 | dev.off() 34 | 35 | png('./plots/Ne_and_Nc_estimates.png') 36 | par(mar=c(10,4,4,2)) 37 | 38 | barplot(for_barplot, col = c("white","black"), beside = TRUE, las=2, axes = FALSE, 39 | main = "Ne and Nc estimates for each population",ylab = 'Size') 40 | axis(side = 2, at = c(100, 10000, 500000, 1000000, 1500000)) 41 | legend("top", 42 | c("Ne_est","Nc_est"), 43 | fill = c("white","black") 44 | ) 45 | dev.off() 46 | 47 | # same plot with a log y axis 48 | png('./plots/Ne_and_Nc_estimates_log-scaled.png') 49 | par(mar=c(10,4,4,2)) 50 | barplot(for_barplot, col = c("white","black"), beside = TRUE, las=2, 51 | log = 'y', axes = FALSE, ylim = c(100,1400000), 52 | main = "Ne and Nc estimates for each population",ylab = 'Size (log scaled)') 53 | axis(side = 2, at = c(100, 10000, 500000, 1000000, 1500000)) 54 | legend("top", 55 | c("Ne_est","Nc_est"), 56 | fill = c("white","black") 57 | ) 58 | dev.off() 59 | 60 | png('./plots/Ne-Nc_ratios.png') 61 | par(mar=c(10,4,4,2)) 62 | barplot(for_ratio_barplot, col = "gray", beside = TRUE, las=2, #axes = FALSE, 63 | main = "Ne/Nc ratios for each population",ylab = 'Ne/Nc ratio') 64 | dev.off() 65 | 66 | source('./scripts/R_functions.r') 67 | 68 | Ne_Nome_ODD = get_Ne('./work/Nome_ODD') 69 | Ne_Nome_EVEN = get_Ne('./work/Nome_EVEN') 70 | Ne_Koppen_ODD = get_Ne('./work/Koppen_ODD') 71 | Ne_Koppen_EVEN = get_Ne('./work/Koppen_EVEN') 72 | Ne_Puget_ODD = get_Ne('./work/Puget_ODD') 73 | Ne_Puget_EVEN = get_Ne('./work/Puget_EVEN') 74 | 75 | png('./plots/LD_Nome_ODD.png', width=nrow(Ne_Nome_ODD$r2_matrix),height=nrow(Ne_Nome_ODD$r2_matrix)) 76 | 
image(Ne_Nome_ODD$r2_matrix, axes = FALSE, col = rev(heat.colors(256))) 77 | dev.off() 78 | 79 | png('./plots/LD_Nome_EVEN.png', width=nrow(Ne_Nome_EVEN$r2_matrix),height=nrow(Ne_Nome_EVEN$r2_matrix)) 80 | image(Ne_Nome_EVEN$r2_matrix, axes = FALSE, col = rev(heat.colors(256))) 81 | dev.off() 82 | 83 | png('./plots/LD_Koppen_ODD.png', width=nrow(Ne_Koppen_ODD$r2_matrix),height=nrow(Ne_Koppen_ODD$r2_matrix)) 84 | image(Ne_Koppen_ODD$r2_matrix, axes = FALSE, col = rev(heat.colors(256))) 85 | dev.off() 86 | 87 | png('./plots/LD_Koppen_EVEN.png', width=nrow(Ne_Koppen_EVEN$r2_matrix),height=nrow(Ne_Koppen_EVEN$r2_matrix)) 88 | image(Ne_Koppen_EVEN$r2_matrix, axes = FALSE, col = rev(heat.colors(256))) 89 | dev.off() 90 | 91 | png('./plots/LD_Puget_ODD.png', width=nrow(Ne_Puget_ODD$r2_matrix),height=nrow(Ne_Puget_ODD$r2_matrix)) 92 | image(Ne_Puget_ODD$r2_matrix, axes = FALSE, col = rev(heat.colors(256))) 93 | dev.off() 94 | 95 | png('./plots/LD_Puget_EVEN.png', width=nrow(Ne_Puget_EVEN$r2_matrix),height=nrow(Ne_Puget_EVEN$r2_matrix)) 96 | image(Ne_Puget_EVEN$r2_matrix, axes = FALSE, col = rev(heat.colors(256))) 97 | dev.off() 98 | 99 | 100 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/R_functions.r: -------------------------------------------------------------------------------- 1 | 2 | ## Functions to read in data 3 | 4 | read_r2_matrix <- function(r2_path){ 5 | r2_df = read.table(r2_path, sep="\t", ) 6 | r2_matrix <- data.matrix(r2_df) 7 | return(r2_matrix) 8 | } 9 | 10 | get_bim <- function(bim_path){ 11 | bim = read.table(bim_path, sep = '\t') 12 | names(bim) = c('chr', 'SNP', 'cm', 'bp', 'A1', 'A2') 13 | return(bim) 14 | } 15 | 16 | get_fam <- function(fam_path){ 17 | fam = read.table(fam_path, sep = ' ') 18 | names(fam) = c('FID', 'IID', 'fatherID', 'motherID', 'sex', 'phenotype') 19 | return(fam) 20 | } 21 | 22 | estimate_Ne <- function(mean_r2, S){ 23 | adj1_r2 = mean_r2 * (S/(S-1))**2 24 | 
adj2_r2 = adj1_r2 - 0.0018 - 0.907/S - 4.44/(S**2) 25 | Ne_est = (0.308 + sqrt(.308**2 - 2.08*adj2_r2))/(2*adj2_r2) 26 | return(Ne_est) 27 | } 28 | 29 | get_Ne <- function(base_path){ 30 | # load files 31 | r2_path = paste(base_path, '.ld', sep ='') 32 | bim_path = paste(base_path, '.bim', sep ='') 33 | fam_path = paste(base_path, '.fam', sep ='') 34 | pop_mat = read_r2_matrix(r2_path) 35 | pop_bim = get_bim(bim_path) 36 | pop_fam = get_fam(fam_path) 37 | 38 | # get the sample size of the population from the .fam file 39 | S = nrow(pop_fam) 40 | 41 | # mask r2 values for pairs of loci on the same chromosome 42 | for (CH in 1:26){ 43 | my_idx = which(pop_bim$chr==CH) 44 | pop_mat[my_idx, my_idx] <- NA 45 | } 46 | # get just the upper triangle of the square matrix 47 | r2_vals = pop_mat[upper.tri(x = pop_mat, diag = FALSE)] 48 | # remove NA values 49 | r2_vals = r2_vals[!is.na(r2_vals)] 50 | 51 | mean_r2 = mean(r2_vals) 52 | # Ne estimate without bias correction 53 | Ne_basic = 1.0/(3*mean_r2 - 3.0/S) 54 | 55 | # Bias corrected for low sample size (S<30) 56 | Ne_est = estimate_Ne(mean_r2=mean_r2, S=S) 57 | 58 | #print(c(Ne_est, Ne_basic)) 59 | 60 | # return the bias-corrected estimate and the 61 | # r2 matrix used in the calculation (with within-chromosome r2 values masked) 62 | return (list(Ne_est = Ne_est, r2_matrix = pop_mat)) 63 | } 64 | -------------------------------------------------------------------------------- /Effective_population_size/scripts/tmp: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Effective_population_size/tmp: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Effective_population_size/work/tmp: -------------------------------------------------------------------------------- 1 | 2 | 
-------------------------------------------------------------------------------- /Extra_exercise/Extra exercises.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Extra_exercise/Extra exercises.pdf -------------------------------------------------------------------------------- /Extra_exercise/readme.MD: -------------------------------------------------------------------------------- 1 | # Extra exercises for most classes based on a dataset from Wildebeest (only if there is time) 2 | 3 | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------- /GWAS/ExerciseGWAS_2020.md: -------------------------------------------------------------------------------- 1 | # Exercise in association testing and GWAS 2 | 3 | **Ida Moltke** 4 | 5 | The focus of these exercises is to try out two simple tests for association and then to perform a GWAS using one of these tests. The goal is to give you some hands-on experience with GWAS. The exercises are to some extent copy-paste based; this is on purpose so you can learn as much as possible within the limited time we have. The idea is that the exercises will provide you with some relevant commands you can use if you later on want to perform a GWAS and at the same time they will allow you to spend more time on looking at and interpreting the output than you would have had if you had to try to build up the commands from scratch.
However, this also means that to get the most out of the exercises you should not just quickly copy-paste everything, but also along the way try to 6 | 7 | 1) make sure you understand what the command/code does (except the command lines that start with "Rscript" since these just call code in an R script which you do not have to look at unless you are curious) 8 | 9 | 2) spend some time on looking at and trying to interpret the output you get when you run the code 10 | 11 | . 12 | 13 | ## Exercise 1: Basic association testing (single SNP analysis) 14 | 15 | In this exercise we will go through two simple tests for association. We will use the statistical software R and look at (imaginary) data from 1237 cases and 4991 controls at two SNP loci with the following counts: 16 | 17 | | | | | | 18 | |:---:|:----:|:---:|:---:| 19 | | SNP 1 | AA | Aa | aa | 20 | | Cases | 423 | 568 | 246 | 21 | | Controls | 1955 | 2295 | 741 | 22 | 23 | | | | | | 24 | |:---:|:----:|:---:|:---:| 25 | | SNP 2 | AA | Aa | aa | 26 | | Cases | 1003 | 222 | 12 | 27 | | Controls | 4043 | 899 | 49 | 28 | 29 | 30 | #### Exercise 1A: genotype based test 31 | 32 | We want to test if the disease is associated with these two SNPs. One way to do this for the first SNP is by running the following code in R: 33 | 34 | ```R 35 | # NB remember to open R first 36 | # Input the count data into a matrix 37 | countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2) 38 | 39 | # Print the matrix (so you can see it) 40 | print(countsSNP1) 41 | 42 | # Perform the test on the matrix 43 | chisq.test(countsSNP1) 44 | ``` 45 | 46 | * Try to run this test and note the p-value. 47 | * Does the test suggest that the SNP is associated with the disease (you can use a p-value threshold of 0.05 when answering this question)? 
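To make it clearer what `chisq.test` does with a table like this, the test statistic can be reproduced by hand. The following is only a minimal sketch using the SNP 1 counts from above; the variable names `expected`, `stat`, `df` and `pval` are illustrative and not part of the original exercise.

```R
# Sketch: reproduce the Pearson chi-square test for SNP 1 by hand
countsSNP1 <- matrix(c(423,1955,568,2295,246,741), nrow=2)

# Expected counts if disease status and genotype were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(countsSNP1), colSums(countsSNP1)) / sum(countsSNP1)

# Test statistic: sum over all cells of (observed - expected)^2 / expected
stat <- sum((countsSNP1 - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1), i.e. 2 for a 2 x 3 table
df <- (nrow(countsSNP1) - 1) * (ncol(countsSNP1) - 1)

# p-value: probability of a statistic at least this large under the null hypothesis
pval <- pchisq(stat, df = df, lower.tail = FALSE)
print(c(statistic = stat, df = df, p.value = pval))
```

The values printed here should agree with the output of `chisq.test(countsSNP1)`, since no continuity correction is applied to tables larger than 2 x 2.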
48 | 49 | 50 | Now try to plot the proportions of cases in each genotype category for this SNP using the following R code: 51 | ```R 52 | barplot(countsSNP1[1,]/(countsSNP1[1,]+countsSNP1[2,]),xlab="Genotypes",ylab="Proportion of cases",names=c("AA","Aa","aa"),las=1,ylim=c(0,0.25)) 53 | ``` 54 | * Does it look like the SNP is associated with the disease? 55 | * Is the p-value consistent with the plot? 56 | 57 | Repeat the same analyses (and answer the same questions) for SNP 2 for which the data can be input into R as follows: 58 | 59 | ```R 60 | countsSNP2 <- matrix(c(1003,4043,222,899,12,49),nrow=2) 61 | ``` 62 | 63 | #### Exercise 1B: the allelic test 64 | 65 | Another simple way to test for association is by using the allelic test. Try to perform that on the two SNPs. This can be done as follows in R: 66 | 67 | ```R 68 | # For SNP 1: 69 | 70 | # - get the allelic counts 71 | allelecountsSNP1<-cbind(2*countsSNP1[,1]+countsSNP1[,2],countsSNP1[,2]+2*countsSNP1[,3]) 72 | print(allelecountsSNP1) 73 | 74 | # - perform allelic test 75 | chisq.test(allelecountsSNP1,correct=FALSE) 76 | 77 | # For SNP 2: 78 | 79 | # ... repeat the above with SNP 2 data 80 | ``` 81 | 82 | * Do these tests lead to the same conclusions as you reached in exercise 1A? 83 | 84 | . 85 | 86 | ## Exercise 2: GWAS 87 | 88 | This exercise is about Genome-Wide Association Studies (GWAS): how to perform one and what pitfalls to look out for. It will be conducted from the command line using the program PLINK (you can read more about it here: https://www.cog-genomics.org/plink2/). 89 | 90 | 91 | #### Exercise 2A: getting data and running your first GWAS 92 | 93 | First close R if you have it open. 94 | 95 | Then make a folder called GWAS for this exercise, copy the relevant data into this folder and unpack the data by typing 96 | 97 | ```bash 98 | cd ~/exercises 99 | mkdir GWAS 100 | cd GWAS 101 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/GWAS/gwasdata.tar.gz . 
102 | tar -xf gwasdata.tar.gz 103 | rm gwasdata.tar.gz 104 | ``` 105 | 106 | Your folder called GWAS should now contain a subfolder called data (you can check that this is true e.g. by typing ls, which is a command that gives you a list of the content in the folder you are working in). The folder "data" will contain all the files you will need in this exercise (both the data and a file called plink.plot.R, which contains R code for plotting your results). 107 | 108 | Briefly, the data consist of SNP genotyping data from 356 individuals, some of whom have a certain disease (cases) while the rest do not (controls). To make sure the GWAS analyses will run fast the main data file (gwa.bed) is in a binary format, which is not very reader friendly. However, PLINK will print summary statistics about the data (number of SNPs, number of individuals, number of cases, number of controls etc) to the screen when you run an analysis. Also, there are two additional data files, gwa.bim and gwa.fam, which are not in binary format and which contain information about the SNPs in the data and the individuals in the data, respectively (you can read more about the data format in the manuals linked to above - but for now this is all you need to know). 109 | 110 | Let's try to perform a GWAS of our data, i.e. test each SNP for association with the disease. And let's try to do it using the simplest association test for case-control data, the allelic test, which you just performed in R in exercise 1B. You can perform this test on all the SNPs in the dataset using PLINK by typing the following command in the terminal: 111 | 112 | ```bash 113 | plink --bfile data/gwa --assoc --adjust 114 | ``` 115 | 116 | Specifically "--bfile data/gwa" specifies that the data PLINK should analyse are the files in the folder called "data" with the prefix gwa. 
"--assoc" specifies that we want to perform GWAS using the allelic test and "--adjust" tells PLINK to output a file that includes p-values that are adjusted for multiple testing using Bonferroni correction as well as other fancier methods. Try to run the command and take a look at the text PLINK prints to your screen. Specifically, note the 117 | 118 | * number of SNPs 119 | * number of individuals 120 | * number of cases and controls 121 | 122 | NB if you for some reason have trouble reading the PLINK output on your screen then PLINK also writes it to the file plink.log and you can look at that by typing "less plink.log". If you do that, you can use arrows to navigate up and down in the file and close the file viewing by typing q. 123 | 124 | Next, try to plot the results of the GWAS by typing: 125 | 126 | ```bash 127 | Rscript data/plink.plot.R plink.assoc 128 | ``` 129 | 130 | This should give you several plots. For now just look at the Manhattan plot, which can be found in the file called plink.assoc.png. You can e.g. open it using the png viewer called display, so by typing: 131 | 132 | ```bash 133 | display plink.assoc.png 134 | ``` 135 | 136 | A bonferroni corrected p-value threshold based on an initial p-value threshold of 0.05, is shown as a dotted line on the plot. 137 | 138 | * Explain how this threshold was reached. 139 | * Calculate the exact threshold using you knowledge of how many SNPs you have in your dataset (NB if you want to calculate log10 in R you can use the function log10) 140 | * Using this threshold, does any of the SNPs in your dataset seem to be associated with the disease? 141 | * Do your results seem plausible? Why/why not? 142 | 143 | 144 | #### Exercise 2B: checking if it went OK using QQ-plot 145 | 146 | Now look at the QQ-plot that you already generated in exercise 2A (the file called plink.assoc.QQ.png) by typing: 147 | 148 | ```bash 149 | display plink.assoc.QQ.png 150 | ``` 151 | 152 | Here the red line is the x=y line. 
153 | 154 | * What does this plot suggest and why? 155 | 156 | 157 | #### Exercise 2C: doing initial QC of data part 1 (sex check) 158 | 159 | As you can see a lot can go wrong if you do not check the quality of your data! So if you want meaningful/useful output you always have to run a lot of quality checking (QC) and filtering before running the association tests. One check that is worth running is whether the indicated genders are correct. You can check this using PLINK to calculate the inbreeding coefficient on the X chromosome under the assumption that it is an autosomal chromosome. The reason why this is interesting is that, for technical reasons, PLINK represents haploid chromosomes, such as X for males, as homozygotes. So assuming the X is an autosomal chromosome will make the males look very inbred on the X whereas the women won't (since they are diploid on the X chromosome). This means that the inbreeding coefficient estimates you get will be close to 1 for men and close to 0 for women. This gender check can be performed in PLINK using the following command: 160 | 161 | ```bash 162 | plink --bfile data/gwa --check-sex 163 | ``` 164 | 165 | The results are in the file plink.sexcheck in which the gender is in column PEDSEX (1 means male and 2 means female) and the inbreeding coefficient is in column F. 166 | 167 | Check the result by typing: 168 | ```bash 169 | less plink.sexcheck 170 | ``` 171 | 172 | NB you can use arrows to navigate up and down in the file and close the file viewer by typing q. 173 | 174 | * Do you see any problems? 175 | 176 | * If you observe any problems then fix them by changing the gender in the file gwa.fam (5th column) in the folder data. You can use the editor gedit for this. To open the file in gedit type "gedit data/gwa.fam" in the terminal. (NB usually one would instead get rid of these individuals because wrong gender could indicate that the phenotypes you have do not belong to the genotyped individual. 
However, in this case the genders were changed on purpose for the sake of this exercise so you can safely just change them back). 177 | 178 | 179 | #### Exercise 2D: doing initial QC of data part 2 (relatedness check) 180 | 181 | Another potential problem in association studies is spurious relatedness where some of the individuals in the sample are closely related. Closely related individuals can be inferred using PLINK as follows: 182 | 183 | ```bash 184 | plink --bfile data/gwa --genome 185 | ``` 186 | 187 | And you can plot the results by typing: 188 | 189 | ```bash 190 | Rscript data/plink.plot.R plink.genome 191 | ``` 192 | 193 | Do that and then take a look at the result by typing 194 | 195 | ```bash 196 | display plink.genomepairwise_relatedness_1.png 197 | ``` 198 | 199 | The figure shows estimates of the relatedness for all pairs of individuals. For each pair k1 is the proportion of the genome where the pair shares one of their alleles identical-by-descent (IBD) and k2 is the proportion of the genome where the pair shares both their alleles IBD. The expected (k1,k2) values for simple relationships are shown in the figure with MZ=monozygotic twins, PO=parent offspring, FS=full sibling, HS=half sibling (or avuncular or grandparent-grandchild), C1=first cousin, C2=cousin once removed. 200 | 201 | * Are any of the individuals in your dataset closely related? 202 | * What assumption in association studies is violated when individuals are related? 203 | * And last but not least: how would you recognize if the same person is included twice (this actually happens!)? 204 | 205 | 206 | #### Exercise 2E: doing initial QC of data part 3 (check for batch bias/non-random genotyping error) 207 | 208 | Check if there is a batch effect/non-random genotyping error by using missingness as a proxy (missingness and genotyping error are highly correlated in SNP chip data). 
In other words, for each SNP test if there is a significantly different amount of missing data in the cases and controls (or equivalently if there is an association between the disease status and the missingness). Do this with PLINK by running the following command: 209 | 210 | ```bash 211 | plink --bfile data/gwa --test-missing 212 | ``` 213 | 214 | View the result in the file plink.missing where the p-value is given in the rightmost column or generate a histogram of all the p-values using the R script data/plink.plot.R by typing this command 215 | 216 | ```bash 217 | Rscript data/plink.plot.R plink.missing 218 | ``` 219 | 220 | The resulting p-value histogram can be found in the file plink.missing.png, which you can open using the png viewer display, so by typing 221 | 222 | ```bash 223 | display plink.missing.png 224 | ``` 225 | 226 | If the missingness is random the p-values should be uniformly distributed between 0 and 1. 227 | 228 | * Is this the case? 229 | * Genotyping errors are often highly correlated with missingness. How do you think this will affect your association results? 230 | 231 | #### Exercise 2F: doing initial QC of data part 4 (check for batch bias/non-random genotyping error again) 232 | 233 | Principal component analysis (PCA) and a very similar method called multidimensional scaling (MDS) are also often used to reveal problems in the data. Such analyses can be used to project all the genotype information (e.g. 500,000 marker sites) down to a low number of dimensions e.g. two. 
234 | 235 | Multidimensional scaling based on this can be performed with PLINK as follows (for all individuals except for a few that have more than 20% missingness): 236 | 237 | ```bash 238 | plink --bfile data/gwa --cluster --mds-plot 2 --mind 0.2 239 | ``` 240 | 241 | Run the above command and plot the results by typing: 242 | 243 | ```bash 244 | Rscript data/plink.plot.R plink.mds data/gwa.fam 245 | ``` 246 | 247 | The resulting plot should now be in the file plink.mds.pdf, which you can open with the pdf viewer evince (so by typing "evince plink.mds.pdf"). It shows the first two dimensions and each individual is represented by a point, which is colored according to the individual's disease status. 248 | 249 | * Clustering of cases and controls is an indication of batch bias. Do you see such clustering? 250 | * What else could explain this clustering? 251 | 252 | 253 | #### Exercise 2G: try to rerun GWAS after quality filtering SNPs 254 | 255 | We can remove many of the error-prone SNPs and individuals by removing 256 | 257 | * SNPs that are not in HWE within controls 258 | * the rare SNPs 259 | * the individuals and SNPs with lots of missing data (why?) 260 | 261 | Let us try to rerun an association analysis where this is done: 262 | 263 | ```bash 264 | plink --bfile data/gwa --assoc --adjust --out assoc2 --hwe 0.0001 --maf 0.05 --mind 0.55 --geno 0.05 265 | ``` 266 | 267 | Plot the results using 268 | 269 | ```bash 270 | Rscript data/plink.plot.R assoc2.assoc 271 | ``` 272 | 273 | * How does the QQ-plot look (in the file assoc2.assoc.QQ.png)? 274 | * Did the analysis go better this time? 275 | * And what does the Manhattan plot suggest (in the file assoc2.assoc.png)? Are any of the SNPs associated? 276 | * Does your answer change if you use other (smarter) methods to correct for multiple testing than Bonferroni (e.g. FDR), which PLINK provides? 
You can find them by typing: 277 | 278 | ```bash 279 | less assoc2.assoc.adjusted 280 | ``` 281 | 282 | Note that PLINK adjusts p-values instead of the threshold (equivalent idea), so you should NOT change the threshold but stick to 0.05. 283 | 284 | . 285 | 286 | ## Bonus exercise if there is any time left: another example of a GWAS caveat 287 | 288 | For the same individuals as above we also have another phenotype. This phenotype is strongly correlated with gender. The genotyping was done independently of this phenotype so there is no batch bias. To perform an association analysis on this phenotype, type 289 | 290 | ```bash 291 | plink --bfile data/gwa --assoc --pheno data/pheno3.txt --adjust --out pheno3 292 | Rscript data/plink.plot.R pheno3.assoc 293 | ``` 294 | 295 | * View the plots and results. Are any of the SNPs significantly associated? 296 | 297 | Now try to perform the analysis using logistic regression (another association test) adjusted for sex, which is done by adding the option "--sex": 298 | 299 | ```bash 300 | plink --bfile data/gwa --logistic --out pheno3_sexAdjusted --pheno data/pheno3.txt --sex 301 | Rscript data/plink.plot.R pheno3_sexAdjusted.assoc.logistic 302 | ``` 303 | 304 | * Are there any associated SNPs according to this analysis? 305 | 306 | Some of the probes used on the chip will hybridize with multiple loci on the genome. The associated SNPs in the previous analysis all cross-hybridize with the X chromosome. 307 | 308 | * Could cross-hybridization explain the difference in results from the two analyses? 309 | 310 | 311 | -------------------------------------------------------------------------------- /GWAS/ExerciseGWAS_2022.md: -------------------------------------------------------------------------------- 1 | # Exercise in association testing and GWAS 2 | 3 | **Ida Moltke** 4 | 5 | The focus of these exercises is to try out two simple tests for association and then to perform a GWAS using one of these tests. 
The goal is to give you some hands-on experience with GWAS. The exercises are to some extent copy-paste based; this is on purpose so you can learn as much as possible within the limited time we have. The idea is that the exercises will provide you with some relevant commands you can use if you later on want to perform a GWAS and at the same time they will allow you to spend more time on looking at and interpreting the output than you would have had if you had to try to build up the commands from scratch. However, this also means that to get the most out of the exercises you should not just quickly copy-paste everything, but also along the way try to 6 | 7 | 1) make sure you understand what the command/code does (except the command lines that start with "Rscript" since these just call code in an R script which you do not have to look at unless you are curious) 8 | 9 | 2) spend some time on looking at and trying to interpret the output you get when you run the code 10 | 11 | . 12 | 13 | ## Exercise 1: Basic association testing (single SNP analysis) 14 | 15 | In this exercise we will go through two simple tests for association. We will use the statistical software R and look at (imaginary) data from 1237 cases and 4991 controls at two SNP loci with the following counts: 16 | 17 | | | | | | 18 | |:---:|:----:|:---:|:---:| 19 | | SNP 1 | AA | Aa | aa | 20 | | Cases | 423 | 568 | 246 | 21 | | Controls | 1955 | 2295 | 741 | 22 | 23 | | | | | | 24 | |:---:|:----:|:---:|:---:| 25 | | SNP 2 | AA | Aa | aa | 26 | | Cases | 1003 | 222 | 12 | 27 | | Controls | 4043 | 899 | 49 | 28 | 29 | 30 | #### Exercise 1A: genotype based test 31 | 32 | We want to test if the disease is associated with these two SNPs. 
One way to do this for the first SNP is by running the following code in R: 33 | 34 | ```R 35 | # NB remember to open R first 36 | # Input the count data into a matrix 37 | countsSNP1 <- matrix(c(423,1955,568,2295,246,741),nrow=2) 38 | 39 | # Print the matrix (so you can see it) 40 | print(countsSNP1) 41 | 42 | # Perform the test on the matrix 43 | chisq.test(countsSNP1) 44 | ``` 45 | 46 | * Try to run this test and note the p-value. 47 | * Does the test suggest that the SNP is associated with the disease (you can use a p-value threshold of 0.05 when answering this question)? 48 | 49 | 50 | Now try to plot the proportions of cases in each genotype category for this SNP using the following R code: 51 | ```R 52 | barplot(countsSNP1[1,]/(countsSNP1[1,]+countsSNP1[2,]),xlab="Genotypes",ylab="Proportion of cases",names=c("AA","Aa","aa"),las=1,ylim=c(0,0.25)) 53 | ``` 54 | * Does it look like the SNP is associated with the disease? 55 | * Is the p-value consistent with the plot? 56 | 57 | Repeat the same analyses (and answer the same questions) for SNP 2 for which the data can be input into R as follows: 58 | 59 | ```R 60 | countsSNP2 <- matrix(c(1003,4043,222,899,12,49),nrow=2) 61 | ``` 62 | 63 | . 64 | 65 | ## Exercise 2: GWAS 66 | 67 | This exercise is about Genome-Wide Association Studies (GWAS): how to perform one and what pitfalls to look out for. It will be conducted from the command line using the program PLINK (you can read more about it here: https://www.cog-genomics.org/plink2/). 68 | 69 | 70 | #### Exercise 2A: getting data and running your first GWAS 71 | 72 | First close R if you have it open. 73 | 74 | Then make a folder called GWAS for this exercise, copy the relevant data into this folder and unpack the data by typing 75 | 76 | ```bash 77 | mkdir GWAS 78 | cd GWAS 79 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/GWAS/gwasdata.tar.gz . 
80 | tar -xf gwasdata.tar.gz 81 | rm gwasdata.tar.gz 82 | ``` 83 | 84 | Your folder called GWAS should now contain a subfolder called data (you can check that this is true e.g. by typing ls, which is a command that gives you a list of the content in the folder you are working in). The folder "data" will contain all the files you will need in this exercise (both the data and a file called plink.plot.R, which contains R code for plotting your results). 85 | 86 | Briefly, the data consist of SNP genotyping data from 356 individuals, some of whom have a certain disease (cases) while the rest do not (controls). To make sure the GWAS analyses will run fast the main data file (gwa.bed) is in a binary format, which is not very reader friendly. However, PLINK will print summary statistics about the data (number of SNPs, number of individuals, number of cases, number of controls etc) to the screen when you run an analysis. Also, there are two additional data files, gwa.bim and gwa.fam, which are not in binary format and which contain information about the SNPs in the data and the individuals in the data, respectively (you can read more about the data format in the manuals linked to above - but for now this is all you need to know). 87 | 88 | Let's try to perform a GWAS of our data, i.e. test each SNP for association with the disease. And let's try to do it using a logistic regression based test assuming an additive effect. You can perform this test on all the SNPs in the dataset using PLINK by typing the following command in the terminal: 89 | 90 | ```bash 91 | plink --bfile data/gwa --logistic --adjust 92 | ``` 93 | 94 | Specifically "--bfile data/gwa" specifies that the data PLINK should analyse are the files in the folder called "data" with the prefix gwa. 
"--logistic" specifies that we want to perform GWAS using logistic regression and "--adjust" tells PLINK to output a file that includes p-values that are adjusted for multiple testing using Bonferroni correction as well as other fancier methods. Try to run the command and take a look at the text PLINK prints to your screen. Specifically, note the 95 | 96 | * number of SNPs 97 | * number of individuals 98 | * number of cases and controls 99 | 100 | NB if you for some reason have trouble reading the PLINK output on your screen then PLINK also writes it to the file plink.log and you can look at that by typing "less plink.log". If you do that, you can use arrows to navigate up and down in the file and close the file viewing by typing q. 101 | 102 | Next, try to plot the results of the GWAS by typing this command and pressing enter (note that this can take a little while so be patient): 103 | 104 | ```bash 105 | Rscript data/plink.plot.R plink.assoc.logistic 106 | ``` 107 | 108 | This should give you several plots. For now just look at the Manhattan plot, which can be found in the file called plink.assoc.logistic.png. You can e.g. open it using the png viewer called display, so by typing: 109 | 110 | ```bash 111 | display plink.assoc.logistic.png 112 | ``` 113 | 114 | A bonferroni corrected p-value threshold based on an initial p-value threshold of 0.05, is shown as a dotted line on the plot. 115 | 116 | * Explain how this threshold was reached. 117 | * Calculate the exact threshold using you knowledge of how many SNPs you have in your dataset (NB if you want to calculate log10 in R you can use the function log10) 118 | * Using this threshold, does any of the SNPs in your dataset seem to be associated with the disease? 119 | * Do your results seem plausible? Why/why not? 
120 | 121 | 122 | #### Exercise 2B: checking if it went OK using QQ-plot 123 | 124 | Now look at the QQ-plot that you already generated in exercise 2A (the file called plink.assoc.logistic.QQ.png) by typing: 125 | 126 | ```bash 127 | display plink.assoc.logistic.QQ.png 128 | ``` 129 | 130 | Here the red line is the x=y line. 131 | 132 | * What does this plot suggest and why? 133 | 134 | 135 | #### Exercise 2C: doing initial QC of data part 1 (sex check) 136 | 137 | As you can see a lot can go wrong if you do not check the quality of your data! So if you want meaningful/useful output you always have to run a lot of quality checking (QC) and filtering before running the association tests. One check that is worth running is whether the indicated genders are correct. You can check this using PLINK to calculate the inbreeding coefficient on the X chromosome under the assumption that it is an autosomal chromosome. The reason why this is interesting is that, for technical reasons, PLINK represents haploid chromosomes, such as X for males, as homozygotes. So assuming the X is an autosomal chromosome will make the males look very inbred on the X whereas the women won't (since they are diploid on the X chromosome). This means that the inbreeding coefficient estimates you get will be close to 1 for men and close to 0 for women. This gender check can be performed in PLINK using the following command: 138 | 139 | ```bash 140 | plink --bfile data/gwa --check-sex 141 | ``` 142 | 143 | The results are in the file plink.sexcheck in which the gender is in column PEDSEX (1 means male and 2 means female) and the inbreeding coefficient is in column F. 144 | 145 | Check the result by typing: 146 | ```bash 147 | less plink.sexcheck 148 | ``` 149 | 150 | NB you can use arrows to navigate up and down in the file and close the file viewer by typing q. 151 | 152 | * Do you see any problems? 
153 | 154 | * Usually one would get rid of any such individuals because wrong sex could indicate that the phenotypes you have do not belong to the genotyped individual. However, in this case I made the change on purpose, so you could see what such errors look like, so just ignore it for now. Or if you are adventurous you can try to change the sex of the relevant individuals in the file gwa.fam (5th column) in the folder data. You can use the editor gedit for this. To open the file in gedit type "gedit data/gwa.fam" in the terminal. 155 | 156 | 157 | #### Exercise 2D: doing initial QC of data part 2 (relatedness check) 158 | 159 | Another potential problem in association studies is spurious relatedness where some of the individuals in the sample are closely related. Closely related individuals can be inferred using PLINK as follows: 160 | 161 | ```bash 162 | plink --bfile data/gwa --genome 163 | ``` 164 | 165 | And you can plot the results by typing: 166 | 167 | ```bash 168 | Rscript data/plink.plot.R plink.genome 169 | ``` 170 | 171 | Do that and then take a look at the result by typing 172 | 173 | ```bash 174 | display plink.genomepairwise_relatedness_1.png 175 | ``` 176 | 177 | The figure shows estimates of the relatedness for all pairs of individuals. For each pair k1 is the proportion of the genome where the pair shares one of their alleles identical-by-descent (IBD) and k2 is the proportion of the genome where the pair shares both their alleles IBD. The expected (k1,k2) values for simple relationships are shown in the figure with MZ=monozygotic twins, PO=parent offspring, FS=full sibling, HS=half sibling (or avuncular or grandparent-grandchild), C1=first cousin, C2=cousin once removed. 178 | 179 | * Are any of the individuals in your dataset closely related? 180 | * What assumption in association studies is violated when individuals are related? 
181 | * And last but not least: how would you recognize if the same person is included twice (this actually happens!)? 182 | 183 | 184 | #### Exercise 2E: doing initial QC of data part 3 (check for batch bias/non-random genotyping error) 185 | 186 | Check if there is a batch effect/non-random genotyping error by using missingness as a proxy (missingness and genotyping error are highly correlated in SNP chip data). In other words, for each SNP test if there is a significantly different amount of missing data in the cases and controls (or equivalently if there is an association between the disease status and the missingness). Do this with PLINK by running the following command: 187 | 188 | ```bash 189 | plink --bfile data/gwa --test-missing 190 | ``` 191 | 192 | View the result in the file plink.missing where the p-value is given in the rightmost column or generate a histogram of all the p-values using the R script data/plink.plot.R by typing this command 193 | 194 | ```bash 195 | Rscript data/plink.plot.R plink.missing 196 | ``` 197 | 198 | The resulting p-value histogram can be found in the file plink.missing.png, which you can open using the png viewer display, so by typing 199 | 200 | ```bash 201 | display plink.missing.png 202 | ``` 203 | 204 | If the missingness is random the p-values should be uniformly distributed between 0 and 1. 205 | 206 | * Is this the case? 207 | * Genotyping errors are often highly correlated with missingness. How do you think this will affect your association results? 208 | 209 | #### Exercise 2F: doing initial QC of data part 4 (check for batch bias/non-random genotyping error again) 210 | 211 | Principal component analysis (PCA) and a very similar method called multidimensional scaling (MDS) are also often used to reveal problems in the data. Such analyses can be used to project all the genotype information (e.g. 500,000 marker sites) down to a low number of dimensions e.g. two. 
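To illustrate what projecting genotypes down to two dimensions means, here is a minimal sketch using classical multidimensional scaling in R on simulated genotypes. The matrix `geno` and its dimensions are hypothetical, and the Manhattan distance is used only as a simple stand-in; it is not claimed to be the distance PLINK uses.

```R
# Sketch: classical MDS on simulated genotypes (not the course data)
set.seed(1)
n_ind <- 50     # hypothetical number of individuals
n_snp <- 200    # hypothetical number of SNPs

# Genotype matrix: rows are individuals, columns are SNPs,
# entries are the number of copies (0, 1 or 2) of one allele
geno <- matrix(rbinom(n_ind * n_snp, size = 2, prob = 0.3), nrow = n_ind)

# Pairwise distances between individuals based on their genotypes
d <- dist(geno, method = "manhattan")

# Project the individuals onto two dimensions
mds <- cmdscale(d, k = 2)

# One point per individual
plot(mds[,1], mds[,2], pch = 20, xlab = "Dimension 1", ylab = "Dimension 2")
```

Because these genotypes are simulated from a single homogeneous population, no clear clusters are expected in the resulting plot.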
212 | 213 | Multidimensional scaling based on this can be performed with PLINK as follows (for all individuals except for a few that have more than 20% missingness): 214 | 215 | ```bash 216 | plink --bfile data/gwa --cluster --mds-plot 2 --mind 0.2 217 | ``` 218 | 219 | Run the above command and plot the results by typing: 220 | 221 | ```bash 222 | Rscript data/plink.plot.R plink.mds data/gwa.fam 223 | ``` 224 | 225 | The resulting plot should now be in the file plink.mds.pdf, which you can open with the pdf viewer evince (so by typing "evince plink.mds.pdf"). It shows the first two dimensions and each individual is represented by a point, which is colored according to the individual's disease status. 226 | 227 | * Clustering of cases and controls is an indication of batch bias. Do you see such clustering? 228 | * What else could explain this clustering? 229 | 230 | 231 | #### Exercise 2G: try to rerun GWAS after quality filtering SNPs 232 | 233 | We can remove many of the error-prone SNPs and individuals by removing 234 | 235 | * SNPs that are not in HWE within controls 236 | * the rare SNPs 237 | * the individuals and SNPs with lots of missing data (why?) 238 | 239 | Let us try to rerun an association analysis where this is done: 240 | 241 | ```bash 242 | plink --bfile data/gwa --logistic --adjust --out assoc2 --hwe 0.0001 --maf 0.05 --mind 0.55 --geno 0.05 243 | ``` 244 | 245 | Plot the results using 246 | 247 | ```bash 248 | Rscript data/plink.plot.R assoc2.assoc.logistic 249 | ``` 250 | 251 | * How does the QQ-plot look (look at the file assoc2.assoc.logistic.QQ.png e.g. with the viewer called display as you did for the previous QQ-plot)? 252 | * Did the analysis go better this time? 253 | * And what does the Manhattan plot suggest (look at the file assoc2.assoc.logistic.png e.g. with the viewer called display)? Are any of the SNPs associated? 
254 | * Does your answer change if you use other (smarter) methods than Bonferroni to correct for multiple testing (e.g. FDR), which PLINK also provides? You can find them by typing: 255 | 256 | ```bash 257 | less assoc2.assoc.logistic.adjusted 258 | ``` 259 | 260 | Note that PLINK adjusts p-values instead of the threshold (equivalent idea), so you should NOT change the threshold but stick to 0.05. 261 | 262 | . 263 | 264 | ## Bonus exercise if there is any time left: another example of a GWAS caveat 265 | 266 | For the same individuals as above we also have another phenotype. This phenotype is strongly correlated with gender. The genotyping was done independently of this phenotype, so there is no batch bias. To perform an association analysis on this phenotype, type: 267 | 268 | ```bash 269 | plink --bfile data/gwa --logistic --pheno data/pheno3.txt --adjust --out pheno3 270 | Rscript data/plink.plot.R pheno3.assoc.logistic 271 | ``` 272 | 273 | * View the plots and results. Are any of the SNPs significantly associated? 274 | 275 | Now try to perform the analysis using logistic regression adjusted for sex, which is done by adding the option "--sex": 276 | 277 | ```bash 278 | plink --bfile data/gwa --logistic --out pheno3_sexAdjusted --pheno data/pheno3.txt --sex 279 | Rscript data/plink.plot.R pheno3_sexAdjusted.assoc.logistic 280 | ``` 281 | 282 | * Are there any associated SNPs according to this analysis? 283 | 284 | Some of the probes used on the chip will hybridize with multiple loci on the genome. The associated SNPs in the previous analysis all cross-hybridize with the X chromosome. 285 | 286 | * Could cross-hybridization explain the difference in results from the two analyses?
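
The note in Exercise 2G about adjusted p-values (keep the 0.05 threshold and adjust the p-values instead) can be illustrated with a small sketch in R. This is not part of the PLINK workflow: the five p-values are made up for illustration, and base R's `p.adjust` is used here only as a stand-in for the Bonferroni and Benjamini-Hochberg FDR corrections that PLINK reports in the `.adjusted` file.

```R
# Hypothetical unadjusted p-values for five SNPs (invented for illustration only)
p <- c(0.0001, 0.004, 0.02, 0.03, 0.5)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
bonf <- p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg FDR: a less conservative adjustment
fdr <- p.adjust(p, method = "BH")

# Compare every adjusted p-value against the same fixed threshold of 0.05
print(data.frame(unadjusted = p, bonferroni = bonf, fdr_bh = fdr))
cat("Significant after Bonferroni:", sum(bonf < 0.05), "\n")
cat("Significant after FDR (BH):  ", sum(fdr < 0.05), "\n")
```

With these invented values the FDR adjustment keeps more SNPs below 0.05 than Bonferroni does, which is the kind of difference the question in Exercise 2G asks you to look for in the real output.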
287 | 288 | 289 | -------------------------------------------------------------------------------- /HWexercise/.gitkeep: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /HWexercise/HW1PA.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW1PA.png -------------------------------------------------------------------------------- /HWexercise/HW2Sn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW2Sn.png -------------------------------------------------------------------------------- /HWexercise/HW3HV.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW3HV.png -------------------------------------------------------------------------------- /HWexercise/HW3HV2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW3HV2.png -------------------------------------------------------------------------------- /HWexercise/HW4dF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW4dF.png -------------------------------------------------------------------------------- /HWexercise/HW5TP.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW5TP.png -------------------------------------------------------------------------------- /HWexercise/HW6QQ.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HW6QQ.png -------------------------------------------------------------------------------- /HWexercise/HWanswers.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/HWexercise/HWanswers.pdf -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/ExercisesLD20.md: -------------------------------------------------------------------------------- 1 | # Linkage disequilibrium 2 | 3 | 4 | 5 | **Hans R Siegismund** 6 | 7 | 8 | 9 | **Exercise 1** 10 | 11 | Take a look at the variation at two SNPs in the human genome: 12 | 13 | 14 | | | |**SNP83**| | 15 | |-----|---|----:|---:| 16 | | | |_B_ | _b_| 17 | |**SNP82**|_A_| 22 |9 | 18 | | |_a_| 0 |20 | 19 | 20 | Use the “LD test” sheet in “Hardy Weinberg test, LD test.xls” file to 21 | estimate 22 | 23 | 1) Haplotype frequencies 24 | 25 | 2) Allele frequencies 26 | 27 | 3) The linkage disequilibrium parameter *D* 28 | 29 | 4) *χ2* test for independent distribution at the two loci 30 | 31 | 5) *Dmax * 32 | 33 | 6) *Dmin* 34 | 35 | 7) *D´* = *D/Dmax* if *D* is positive 36 | *D´* = *D/Dmin* if *D* is negative 37 | 38 | 8) *r*2 39 | 40 | **Exercise 2** 41 | 42 | **Population admixture** 43 | 44 | Use the sheet “LD in admixed populations” in the “Hardy Weinberg test, 45 | LD test.xls” file to estimate the effects of admixture of two 46 | populations that have the following haplotype frequencies: 47 | 48 | | |AB | Ab |aB |ab 
| 49 | |----------------|---:|:---:|:---:|---:| 50 | |**Population 1**|1 | 9 | 9 | 81| 51 | |**Population 2**| 81 | 9 | 9 | 1| 52 | 53 | 1) Is there any sign of LD in the two populations? 54 | 55 | 2) Assume that the admixed population consists of 50% from each of the 56 | source populations. What is the LD in the admixed population? (Let 57 | the sample size be 100.) 58 | 59 | 3) Assume that the two loci are unlinked. What will *D* be in the next 60 | generation (i.e. offspring of the admixed population.)? 61 | -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/LD_ex_0_gorilla.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_0_gorilla.png -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/LD_ex_1_dist_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_1_dist_map.png -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/LD_ex_2_link_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_2_link_map.png -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/LD_ex_3_LD_decay.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Linkage_disequilibrium_gorillas/LD_ex_3_LD_decay.png -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/LDinGorillas.md: -------------------------------------------------------------------------------- 1 | # Linkage Disequilibrium in Small and Large Populations of Gorillas 2 | 3 | **Peter Frandsen, Shyam Gopalakrishnan & Hans R. Siegismund** 4 | 5 | ## Program 6 | 7 |
8 | 9 |
10 | 11 | - Apply filters to prepare data for analysis 12 | - Explore patterns of LD blocks in two populations of gorilla 13 | - Estimate mean pairwise LD between SNPs along a chromosome 14 | - Plot and compare curves of LD decay in two populations of gorilla 15 | 16 | ## Learning outcome 17 | 18 | - Get familiar with common filtering procedures in genome analyses 19 | using PLINK 20 | - Get familiar with running analyses in R and the code involved 21 | - Understand the processes that build up and break down LD 22 | - Understand the link between patterns of LD and population history 23 | 24 | ## Essential background reading 25 | 26 | Nielsen and Slatkin: Chapter 6 - Linkage Disequilibrium and Gene Mapping 27 | p. 107-126 28 | 29 | ## Key concepts that you should familiarize yourself with: 30 | 31 | - Definition of linkage disequilibrium (LD) 32 | 33 | - Measures of LD 34 | 35 | - Build-up and breakdown of LD 36 | 37 | - Applications of LD in population genetics 38 | 39 | ### Suggested reading 40 | Prado-Martinez et al. (2013) [Great ape genetic diversity and population 41 | history](http://www.nature.com/nature/journal/v499/n7459/full/nature12228.html) 42 | 43 | # Introduction 44 | 45 | Gorillas are among the most critically endangered species of great apes. 46 | The *Gorilla* genus consists of two species, with two subspecies each. 47 | In this exercise, you will work with chromosome 4 from the western 48 | lowland subspecies and the mountain gorilla subspecies. The former is 49 | distributed over a large forest area across the Congo basin in Central 50 | Africa (yellow in fig. 1), while the mountain gorilla is confined to two 51 | small populations in the Virunga mountains and the impenetrable forest 52 | of Bwindi (encircled with red in fig. 1). The mountain gorilla, in 53 | particular, has been subject to active conservation efforts for decades, 54 | yet very little is known about its genomic diversity and evolutionary 55 | history.
The data you will be working with today is an extract of a 56 | large study on great ape genomes (Prado-Martinez et al. 57 | 2013). 58 | 59 |
60 | 61 |
62 | 63 | **Figure 1 Distribution range of the *Gorilla* genus.** The large yellow patch indicates the distribution of western lowland gorilla (*Gorilla gorilla gorilla*). The distribution of mountain gorilla (*Gorilla beringei beringei*) is encircled with red. Reprinted from [http://maps.iucnredlist.org](http://maps.iucnredlist.org/), edited by Peter Frandsen. 64 | 65 | ## Getting started 66 | 67 | Start by copying the compressed folder ‘gorillas.tar.gz’ to a new folder ‘LD’ in your 68 | exercise folder and unpacking it there with the 69 | following commands (we assume that the folder ~/exercises exists in your home folder; otherwise create it with `mkdir ~/exercises`): 70 | 71 | ```bash 72 | cd ~/exercises 73 | mkdir LD 74 | cd LD/ 75 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/LD/gorillas.tar.gz . 76 | tar -zxvf gorillas.tar.gz 77 | rm gorillas.tar.gz 78 | ``` 79 | 80 | All code needed in the exercise is given below. It will be indicated 81 | with an **\>R** when you need to run a command in R; otherwise stay in 82 | your main terminal. 83 | 84 | ## LD blocks 85 | 86 | In this first part of the exercise we will explore a specific region of chromosome 4 in terms of blocks of LD. We have not chosen this region for any specific reason, and the emerging patterns should serve as a general comparison of LD in the two populations. The first step will be to extract a pre-defined region using PLINK and run an LD analysis with snpMatrix in R. The results will be written to an .eps file that you can open with Adobe.
87 | 88 | ### Mountain Gorilla 89 | 90 | ```bash 91 | plink --file mountain --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out mountainBlock 92 | ``` 93 | 94 | #### \>R 95 | 96 | ```R 97 | library(snpMatrix) 98 | data <- read.plink("mountainBlock") 99 | ld <- ld.snp(data, dep=930) 100 | plot.snp.dprime(ld, filename="Mountain.eps", res=100) 101 | q() 102 | ``` 103 | 104 | ### Western Lowland Gorilla 105 | 106 | ```bash 107 | plink --file lowland --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out lowlandBlock 108 | ``` 109 | 110 | #### \>R 111 | 112 | ```R 113 | library(snpMatrix) 114 | data<- read.plink("lowlandBlock") 115 | ld<- ld.snp(data, dep=2268) 116 | plot.snp.dprime(ld, filename="WesternLowland.eps", res=100) 117 | q() 118 | ``` 119 | 120 | **NB**: the .eps files are quite heavy and loading the output could take a long time. You can start to open the .eps file with the commands 121 | 122 | ```bash 123 | evince Mountain.eps & # the ‘&’ will make the command run in the background 124 | evince WesternLowland.eps & 125 | ``` 126 | 127 | **NB**: this will take approximately five minutes to load.* 128 | 129 | **Q1:** How would you characterize the structure you observe in the 130 | different populations? 131 | 132 | **Q2:** How many (if any) recombination hotspots can you identify in the two populations (crude estimate)? 133 | 134 | Mountain gorilla: 135 | 136 | Western lowland gorilla: 137 | 138 | ## LD decay 139 | In this part of today’s exercise we will explore the decay of LD as a function of distance along the chromosome. This will enable us to compare the two populations in terms of mean distances on the chromosome with sites in LD. To begin with, we will do some essential filtering procedures common to this type of analysis and then, again, read the data into R and run the analyses with snpMatrix. 
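
As background for this part, here is a rough sketch (not part of the original exercise) of the textbook expectation that LD between two loci shrinks by a factor of (1 - c) per generation of random mating, where c is the recombination fraction between them: D_t = D_0 (1 - c)^t. The function name `ld_after` and all numbers below are made up for illustration; they are not estimates from the gorilla data.

```R
# Expected LD after t generations of random mating: D_t = D_0 * (1 - c)^t
# D0, rec_frac and t are illustrative values, not estimates from the gorilla data
ld_after <- function(D0, rec_frac, t) {
  D0 * (1 - rec_frac)^t
}

# Recombination fractions from tightly linked (0.001) to unlinked (0.5) loci
rec_fractions <- c(0.001, 0.01, 0.1, 0.5)

# LD remaining after 50 generations, starting from D0 = 0.25 in every case
decay <- sapply(rec_fractions, function(r) ld_after(D0 = 0.25, rec_frac = r, t = 50))
names(decay) <- rec_fractions
print(round(decay, 4))
```

Sites that are further apart on the chromosome tend to have a larger recombination fraction, so under this simple model their LD is expected to be lower after the same number of generations.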
140 | 141 | ### Filtering 142 | 143 | ```bash 144 | plink --file mountain --maf 0.15 --geno 0 --thin 0.15 --make-bed --out mountainLD 145 | plink --file lowland --maf 0.15 --geno 0 --thin 0.15 --make-bed --out lowlandLD 146 | ``` 147 | 148 | **Q3:** Why do we exclude SNPs with a low minor allele frequency (i.e. SNPs whose less common allele is rare in the population)? 149 | 150 | **Q4:** What else are we filtering away? See the list of options in PLINK: [http://pngu.mgh.harvard.edu/\~purcell/plink/reference.shtml#options](http://pngu.mgh.harvard.edu/%7Epurcell/plink/reference.shtml#options) (this link seems to be broken). 151 | 152 | **Q5:** How many SNPs did we have before and after filtering? 153 | 154 | ### Run the commands 155 | 156 | This script will load the package snpMatrix, read in the plink files, and estimate the mean LD between SNP pairs in a pre-set range (dep). At the end of the script, the result will be plotted to a .pdf file and written to two .txt files (check your directory). 157 | 158 | ### Mountain Gorilla 159 | 160 | **\>R** *(paste in the whole script)* 161 | 162 | ```R 163 | library("snpMatrix") 164 | data<- read.plink("mountainLD") 165 | ld<- ld.snp(data, dep=500) 166 | do.rsq2 <- ("rsq2" %in% names(ld)) 167 | snp.names <- attr(ld, "snp.names") 168 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod") 169 | r.maybe <- ld$rsq2 170 | max.depth <- dim(ld$dprime)[2] 171 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth) 172 | count<-1 173 | for (i.snp in c(1:(length(snp.names) - 1))) { 174 | for (j.snp in c((i.snp + 1):length(snp.names))) { 175 | step <- j.snp - i.snp 176 | if (step > max.depth) { 177 | break 178 | } 179 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step]) 180 | count<-count+1; 181 | } 182 | } 183 | resNum<-as.numeric(res)#converts to numeric values 184 | dim(resNum)<-dim(res) 185 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites 186 | disbin<- cut(dis, br=seq(0,2000000,by=10000))
187 | #cuts distances into bins of size 10 kb from distance 0 to 2,000,000 bp 188 | res<- tapply(resNum[,3], disbin, mean, na.rm=T) 189 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2 Mb fall outside the bins 190 | pdf("MountainLDdecay.pdf") #saves output as a .pdf 191 | plot(res, type="l", ylim=c(0, 1)) 192 | dev.off() 193 | ### The following code is just a way to store the LD calculations ### 194 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=T) 195 | func<- function(x)#defines a function of "x" 196 | length(x[!is.na(x)]) 197 | #gives the number of values in "x" that are not (denoted by "!") "NA" 198 | reslength<- tapply(resNum[,3], disbin, func) 199 | #gives the number of r2 values in each 10 kb bin 200 | write.table(reslength, file="MountainLength.txt") 201 | write.table(ressum, file="MountainSum.txt") 202 | q() 203 | n 204 | ``` 205 | 206 | 207 | Look through the script to get an idea about what is going on. 208 | 209 | **Q6:** How many pair-wise comparisons have we done (*dep* in 210 | snpMatrix)? 211 | 212 | 213 | 214 | ### Western Lowland Gorilla 215 | 216 | **\>R** *(paste in the whole script)* 217 | 218 | ```R 219 | library("snpMatrix") 220 | data<- read.plink("lowlandLD") 221 | ld<- ld.snp(data, dep=500) 222 | do.rsq2 <- ("rsq2" %in% names(ld)) 223 | snp.names <- attr(ld, "snp.names") 224 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod") 225 | r.maybe <- ld$rsq2 226 | max.depth <- dim(ld$dprime)[2] 227 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth) 228 | count<-1 229 | for (i.snp in c(1:(length(snp.names) - 1))) { 230 | for (j.snp in c((i.snp + 1):length(snp.names))) { 231 | step <- j.snp - i.snp 232 | if (step > max.depth) { break 233 | } 234 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step]) 235 | count<-count+1; 236 | } 237 | } 238 | resNum<-as.numeric(res)#converts to numeric values 239 | dim(resNum)<-dim(res) 240 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites 241 | disbin<- cut(dis, br=seq(0,2000000,by=10000))
242 | #cuts distances into bins of size 10 kb from distance 0 to 2,000,000 bp 243 | res<- tapply(resNum[,3], disbin, mean, na.rm=T) 244 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2 Mb fall outside the bins 245 | pdf("WesternLowlandLDdecay.pdf") 246 | #saves output as a .pdf file 247 | plot(res, type="l", ylim=c(0, 1)) 248 | dev.off() 249 | ### The following code is just a way to store the LD calculations ### 250 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=T) 251 | func<- function(x)#defines a function of "x" 252 | length(x[!is.na(x)]) 253 | #gives the number of values in "x" that are not (denoted by "!") "NA" 254 | reslength<- tapply(resNum[,3], disbin, func) 255 | #gives the number of r2 values in each 10 kb bin 256 | write.table(reslength, file="LowlandLength.txt") 257 | write.table(ressum, file="LowlandSum.txt") 258 | q() 259 | n 260 | ``` 261 | 262 | Extra task (optional): when done with both populations, plot the curves 263 | together with different colors and a legend (*hint: the input you will 264 | need was printed to four text files at the end of the scripts above*). 265 | 266 | **Q7:** Look at the two plots and explain why LD decays with distance. 267 | 268 | **Q8:** What is the mean *r2* at a distance of 1 Mb (100 on the x-axis) in the 269 | two populations? 270 | 271 | Mountain gorilla: 272 | 273 | Western lowland gorilla: 274 | 275 | **Q9:** What could explain any observed difference in the decay of LD in 276 | the two populations? 277 | 278 | **Q10:** These estimates are done on an autosome; would you 279 | expect different LD patterns in other parts of the genome? 280 | 281 | ## Perspectives 282 | 283 | **Q11:** In comparison to the different populations of gorilla, what do 284 | you think the trajectory of the LD decay and the LD block patterns would 285 | look like in humans? 286 | 287 | **Q12:** Would you also expect different patterns of LD in human 288 | populations (*e.g.* Chinese, Europeans, and Africans)?
289 | -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/LDinGorillas_A_24.md: -------------------------------------------------------------------------------- 1 | # Linkage Disequilibrium in Small and Large Populations of Gorillas 2 | 3 | **Peter Frandsen, Shyam Gopalakrishnan & Hans R. Siegismund** 4 | 5 | ## Program 6 | 7 |
8 | 9 |
10 | 11 | - Apply filters to prepare data for analysis 12 | - Explore patterns of LD blocks in two populations of gorilla 13 | - Estimate mean pairwise LD between SNPs along a chromosome 14 | - Plot and compare curves of LD decay in two populations of gorilla 15 | 16 | ## Learning outcome 17 | 18 | - Get familiar with common filtering procedures in genome analyses 19 | using PLINK 20 | - Get familiar with running analyses in R and the code involved 21 | - Understand the processes that build up and break down LD 22 | - Understand the link between patterns of LD and population history 23 | 24 | ## Essential background reading 25 | 26 | Nielsen and Slatkin: Chapter 6 - Linkage Disequilibrium and Gene Mapping 27 | p. 107-126 28 | 29 | ## Key concepts that you should familiarize yourself with: 30 | 31 | - Definition of linkage disequilibrium (LD) 32 | 33 | - Measures of LD 34 | 35 | - Build-up and breakdown of LD 36 | 37 | - Applications of LD in population genetics 38 | 39 | ### Suggested reading 40 | Prado-Martinez et al. (2013) [Great ape genetic diversity and population 41 | history](http://www.nature.com/nature/journal/v499/n7459/full/nature12228.html) 42 | 43 | # Introduction 44 | 45 | Gorillas are among the most critically endangered species of great apes. 46 | The *Gorilla* genus consists of two species, with two subspecies each. 47 | In this exercise, you will work with chromosome 4 from the western 48 | lowland subspecies and the mountain gorilla subspecies. The former is 49 | distributed over a large forest area across the Congo basin in Central 50 | Africa (yellow in fig. 1), while the mountain gorilla is confined to two 51 | small populations in the Virunga mountains and the impenetrable forest 52 | of Bwindi (encircled with red in fig. 1). The mountain gorilla, in 53 | particular, has been subject to active conservation efforts for decades, 54 | yet very little is known about its genomic diversity and evolutionary 55 | history.
The data you will be working with today is an extract of a 56 | large study on great ape genomes (Prado-Martinez et al. 57 | 2013). 58 | 59 |
60 | 61 |
62 | 63 | **Figure 1 Distribution range of the *Gorilla* genus.** The large yellow patch indicates the distribution of western lowland gorilla (*Gorilla gorilla gorilla*). The distribution of mountain gorilla (*Gorilla beringei beringei*) is encircled with red. Reprinted from [http://maps.iucnredlist.org](http://maps.iucnredlist.org/), edited by Peter Frandsen. 64 | 65 | ## Getting started 66 | 67 | Start by copying the compressed folder ‘gorillas.tar.gz’ to a new folder ‘LD’ in your 68 | exercise folder and unpacking it there with the 69 | following commands (we assume that the folder ~/exercises exists in your home folder; otherwise create it with `mkdir ~/exercises`): 70 | 71 | ```bash 72 | cd ~/exercises 73 | mkdir LD 74 | cd LD/ 75 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/LD/gorillas.tar.gz . 76 | tar -zxvf gorillas.tar.gz 77 | rm gorillas.tar.gz 78 | ``` 79 | 80 | All code needed in the exercise is given below. It will be indicated 81 | with an **\>R** when you need to run a command in R; otherwise stay in 82 | your main terminal. 83 | 84 | ## LD blocks 85 | 86 | In this first part of the exercise we will explore a specific region of chromosome 4 in terms of blocks of LD. We have not chosen this region for any specific reason, and the emerging patterns should serve as a general comparison of LD in the two populations. The first step will be to extract a pre-defined region using PLINK and run an LD analysis with snpMatrix in R. The results will be written to an .eps file that you can open with Adobe.
87 | 88 | ### Mountain Gorilla 89 | 90 | ```bash 91 | plink --file mountain --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out mountainBlock 92 | ``` 93 | 94 | #### \>R 95 | 96 | ```R 97 | library(snpMatrix) 98 | data <- read.plink("mountainBlock") 99 | ld <- ld.snp(data, dep=930) 100 | plot.snp.dprime(ld, filename="Mountain.eps", res=100) 101 | q() 102 | ``` 103 | 104 | ### Western Lowland Gorilla 105 | 106 | ```bash 107 | plink --file lowland --maf 0.15 --geno 0 --thin 0.25 --from 9029832 --to 14148683 --make-bed --out lowlandBlock 108 | ``` 109 | 110 | #### \>R 111 | 112 | ```R 113 | library(snpMatrix) 114 | data<- read.plink("lowlandBlock") 115 | ld<- ld.snp(data, dep=2268) 116 | plot.snp.dprime(ld, filename="WesternLowland.eps", res=100) 117 | q() 118 | ``` 119 | 120 | **NB**: the .eps files are quite heavy and loading the output could take a long time. You can start to open the .eps file with the commands 121 | 122 | ```bash 123 | evince Mountain.eps & # the ‘&’ will make the command run in the background 124 | evince WesternLowland.eps & 125 | ``` 126 | 127 | **NB**: this will take approximately five minutes to load.* 128 | 129 | **Q1:** How would you characterize the structure you observe in the 130 | different populations? 131 |
132 | Answer: 133 | 134 |
135 | 136 |
137 | 138 | The mountain gorilla genome consists of large regions with blocks of 139 | high LD, while recombination has broken down such patterns in the 140 | western lowland gorilla. 141 |
142 | 143 | **Q2:** How many (if any) recombination hotspots can you identify in the two populations (crude estimate)? 144 | 145 | Mountain gorilla: 146 | 147 | Western lowland gorilla: 148 |
149 | 150 | 151 | Mountain gorilla: *at least two*. Western lowland gorilla: no blocks of LD in this region. 152 |
153 | 154 | ## LD decay 155 | In this part of today’s exercise we will explore the decay of LD as a function of distance along the chromosome. This will enable us to compare the two populations in terms of mean distances on the chromosome with sites in LD. To begin with, we will do some essential filtering procedures common to this type of analysis and then, again, read the data into R and run the analyses with snpMatrix. 156 | 157 | ### Filtering 158 | 159 | ```bash 160 | plink --file mountain --maf 0.15 --geno 0 --thin 0.15 --make-bed --out mountainLD 161 | plink --file lowland --maf 0.15 --geno 0 --thin 0.15 --make-bed --out lowlandLD 162 | ``` 163 | 164 | **Q3:** Why do we exclude SNPs with a low minor allele frequency (i.e. SNPs whose less common allele is rare in the population)? 165 |
166 | 167 | 168 | Newly arisen and rare mutations will 169 | distort the LD pattern (elevate the mean LD). Remember, every new 170 | mutation on a chromosome is in complete LD with the rest of the 171 | chromosome. For most LD analyses, we are also more interested in the 172 | common variation. On the plus side, we also get rid of sequencing errors 173 | (which are also rare, hopefully). 174 |
175 | 176 | **Q4:** What else are we filtering away? See the list of options in PLINK: [http://pngu.mgh.harvard.edu/\~purcell/plink/reference.shtml#options](http://pngu.mgh.harvard.edu/%7Epurcell/plink/reference.shtml#options) (this link seems to be broken). 177 |
178 | 179 | 180 | `--geno 0` : removes all SNPs with any missing genotypes 181 | 182 | `--thin 0.15` : removes a random 85 percent of the SNPs (keeps 15 183 | percent); otherwise we would not get through the exercise in just two 184 | hours. Plus, it is not always necessary to keep all your data. 185 | Sometimes, only a fraction of your data will tell you the same story as 186 | a complete and computationally heavy dataset.
187 | 188 | **Q5:** How many SNPs did we have before and after filtering? 189 |
190 | 191 | 192 | mountain gorilla: 2,521,457 SNPs before, 23,689 SNPs after filtering 193 | 194 | lowland gorilla: 2,521,457 SNPs before, 48,555 SNPs after filtering 195 | 196 | In other words, you threw away a lot of data (but you can still reach 197 | the same conclusion as if you had looked at the whole genome) 198 |
199 | 200 | ### Run the commands 201 | 202 | This script will load the package snpMatrix, read in the plink files, and estimate the mean LD between SNP pairs in a pre-set range (dep). At the end of the script, the result will be plotted to a .pdf file and written to two .txt files (check your directory). 203 | 204 | ### Mountain Gorilla 205 | 206 | **\>R** *(paste in the whole script)* 207 | 208 | ```R 209 | library("snpMatrix") 210 | data<- read.plink("mountainLD") 211 | ld<- ld.snp(data, dep=500) 212 | do.rsq2 <- ("rsq2" %in% names(ld)) 213 | snp.names <- attr(ld, "snp.names") 214 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod") 215 | r.maybe <- ld$rsq2 216 | max.depth <- dim(ld$dprime)[2] 217 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth) 218 | count<-1 219 | for (i.snp in c(1:(length(snp.names) - 1))) { 220 | for (j.snp in c((i.snp + 1):length(snp.names))) { 221 | step <- j.snp - i.snp 222 | if (step > max.depth) { 223 | break 224 | } 225 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step]) 226 | count<-count+1; 227 | } 228 | } 229 | resNum<-as.numeric(res)#converts to numeric values 230 | dim(resNum)<-dim(res) 231 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites 232 | disbin<- cut(dis, br=seq(0,2000000,by=10000)) 233 | #cuts distances into bins of size 10 kb from distance 0 to 2,000,000 bp 234 | res<- tapply(resNum[,3], disbin, mean, na.rm=T) 235 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2 Mb fall outside the bins 236 | pdf("MountainLDdecay.pdf") #saves output as a .pdf 237 | plot(res, type="l", ylim=c(0, 1)) 238 | dev.off() 239 | ### The following code is just a way to store the LD calculations ### 240 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=T) 241 | func<- function(x)#defines a function of "x" 242 | length(x[!is.na(x)]) 243 | #gives the number of values in "x" that are not (denoted by "!") "NA" 244 | reslength<- tapply(resNum[,3], disbin, func) 245 | #gives the number of r2 values in each 10 kb bin
246 | write.table(reslength, file="MountainLength.txt") 247 | write.table(ressum, file="MountainSum.txt") 248 | q() 249 | n 250 | ``` 251 | 252 | 253 | Look through the script to get an idea about what is going on. 254 | 255 | **Q6:** How many pair-wise comparisons have we done (*dep* in 256 | snpMatrix)? 257 |
258 | 259 | 260 | 500 (ld<- ld.snp(data, dep=500)) 261 |
262 | 263 | 264 | 265 | ### Western Lowland Gorilla 266 | 267 | **\>R** *(paste in the whole script)* 268 | 269 | ```R 270 | library("snpMatrix") 271 | data<- read.plink("lowlandLD") 272 | ld<- ld.snp(data, dep=500) 273 | do.rsq2 <- ("rsq2" %in% names(ld)) 274 | snp.names <- attr(ld, "snp.names") 275 | name<- c("#M1", "#M2", "rsq2", "Dprime", "lod") 276 | r.maybe <- ld$rsq2 277 | max.depth <- dim(ld$dprime)[2] 278 | res<-matrix(NA,ncol=3,nrow=length(snp.names)*max.depth) 279 | count<-1 280 | for (i.snp in c(1:(length(snp.names) - 1))) { 281 | for (j.snp in c((i.snp + 1):length(snp.names))) { 282 | step <- j.snp - i.snp 283 | if (step > max.depth) { break 284 | } 285 | res[count,]<-c(snp.names[i.snp], snp.names[j.snp],r.maybe[i.snp, step]) 286 | count<-count+1; 287 | } 288 | } 289 | resNum<-as.numeric(res)#converts to numeric values 290 | dim(resNum)<-dim(res) 291 | dis<- resNum[,2]-resNum[,1] #calculates distances between sites 292 | disbin<- cut(dis, br=seq(0,2000000,by=10000)) 293 | #cuts distances into bins of size 10 kb from distance 0 to 2,000,000 bp 294 | res<- tapply(resNum[,3], disbin, mean, na.rm=T) 295 | #takes the mean of r2 (3rd column in resNum) in each bin, ignoring "NA"; distances >2 Mb fall outside the bins 296 | pdf("WesternLowlandLDdecay.pdf") 297 | #saves output as a .pdf file 298 | plot(res, type="l", ylim=c(0, 1)) 299 | dev.off() 300 | ### The following code is just a way to store the LD calculations ### 301 | ressum<- tapply(resNum[,3], disbin, sum, na.rm=T) 302 | func<- function(x)#defines a function of "x" 303 | length(x[!is.na(x)]) 304 | #gives the number of values in "x" that are not (denoted by "!") "NA" 305 | reslength<- tapply(resNum[,3], disbin, func) 306 | #gives the number of r2 values in each 10 kb bin 307 | write.table(reslength, file="LowlandLength.txt") 308 | write.table(ressum, file="LowlandSum.txt") 309 | q() 310 | n 311 | ``` 312 | 313 | Extra task (optional): when done with both populations, plot the curves 314 | together with different colors and
legend (*hint: the input you will 315 | need was printed to four text files in the end of the scripts above*). 316 | 317 | **Q7:** Look at the two plots and explain why LD decays with distance. 318 |
319 | 320 | 321 |
322 | 323 |
324 | 325 | LD is broken down by recombination. As the distance between two 326 | sites increases, so does the likelihood of a recombination event between them. See plots 327 | above. 328 |
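The answer to Q7 can be illustrated with the standard textbook recursion for the LD coefficient under recombination alone, $D_t = D_0 (1 - c)^t$, where $c$ is the recombination fraction between the two sites. The sketch below is only an illustration: the function name `ld_decay`, the starting value `d0` and the three recombination fractions are assumptions chosen for the example, not values estimated from the gorilla data.

```R
# Expected LD coefficient after t generations of recombination alone:
# D_t = D_0 * (1 - rec)^t, where rec is the recombination fraction.
ld_decay <- function(d0, rec, t) {
  d0 * (1 - rec)^t
}

generations <- 0:200
rec_fractions <- c(0.001, 0.01, 0.05)  # illustrative values (assumed)

# One column per recombination fraction, one row per generation.
decay <- sapply(rec_fractions,
                function(rec) ld_decay(d0 = 0.25, rec = rec, t = generations))
colnames(decay) <- paste0("rec=", rec_fractions)

# Sites that are further apart (larger rec) lose LD faster.
matplot(generations, decay, type = "l", lty = 1,
        xlab = "Generation", ylab = "D")
legend("topright", legend = colnames(decay), col = 1:3, lty = 1)
```

Because physical distance and recombination fraction increase together, the curves for larger `rec` correspond to pairs of sites that are further apart on the chromosome.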
329 | 330 | **Q8:** What is the mean *r2* at distance 1M (100 on the x-axis) in the 331 | two populations? 332 | 333 | Mountain gorilla: 334 | 335 | Western lowland gorilla: 336 |
337 | 338 | 339 | Mountain gorilla: ~0.45 340 | 341 | Western lowland gorilla: ~0.15 342 |
343 | 344 | **Q9:** What could explain any observed difference in the decay of LD in 345 | the two populations? 346 |
347 | 348 | 349 | Selection (genetic hitchhiking), admixture, and 350 | drift in the course of the population history. 351 |
352 | 353 | **Q10:** These estimates are done on an autosomal chromosome, would you 354 | expect different LD patterns in other parts of the genome? 355 |
356 | 357 | 358 | Yes, there is no (or very little) recombination on the Y chromosome and 359 | the mitochondria; hence, you would not observe a decay with increasing 360 | distance. 361 |
362 | 363 | ## Perspectives 364 | 365 | **Q11:** In comparison to the different populations of gorilla, how do 366 | you think the trajectory of the LD decay and the LD block patterns would 367 | look like in humans? 368 |
369 | 370 | 371 | LD decays at a rate similar to that in the western lowland gorilla, and large 372 | parts of the genome will show comparable patterns in terms of LD blocks. 373 | Only in a very small human population that has been isolated for a long 374 | time would we observe anything as extreme as in the mountain gorilla. 375 |
376 | 377 | **Q12:** Would you also expect different patterns of LD in human 378 | populations (*e.g.* Chinese, Europeans, and Africans)? 379 |
380 | 381 | 382 | Populations that have gone through historic bottlenecks, been subjected 383 | to genetic drift, admixture or selection will on average have a higher 384 | degree of LD. As an example, populations that went through a bottleneck 385 | when they migrated out of Africa (*e.g.* Europeans) have on average a 386 | higher degree of LD compared to most African populations. 387 |
388 | -------------------------------------------------------------------------------- /Linkage_disequilibrium_gorillas/zzz: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /NaturalSelection/ExerciseSelection_2020.md: -------------------------------------------------------------------------------- 1 | # Natural selection (and other complications) 2 | 3 | **Hans R. Siegismund** 4 | 5 | 6 | **Exercise 1** 7 | 8 | 9 |
10 | 11 |
12 | In Africa and southern Europe, many human 13 | populations are polymorphic at the locus coding for the beta-hemoglobin 14 | chain. Two alleles are found, HbS and HbA. HbS differs from HbA in that 15 | the amino acid glutamic acid at position 6 has been replaced with 16 | valine. A study in Tanzania found the following genotypic distribution: 17 | 18 | | | HbAHbA | HbAHbS | HbSHbS | Sum | 19 | |----------|:----------------------------:|:----------------------------:|:----------------------------:|---| 20 | | Adults | 400 | 249 | 5 | 654 | 21 | | Children | 189 | 89 | 9 | 287 | 22 | 23 | 1) Estimate the allele frequencies in both groups. 24 | 25 | 2) Do the observed genotype distributions differ from Hardy-Weinberg 26 | proportions? 27 |
28 | Answer: 29 | 30 | 31 | |Adults | HbAHbA | HbAHbS | HbSHbS | Sum | 32 | |----------|:----------------------------|:----------------------------|:----------------------------|---| 33 | |Observed | 400 | 249 | 5 | 654 | 34 | |Expected | 420.64 | 207.71 | 25.64 | 654 | 35 | 36 | where 37 | 38 | *p*(A) = 0.802 39 | 40 | *p*(S) = 0.198 41 | 42 | The test for Hardy-Weinberg proportions gives $\chi^2 = 25.84$, which is 43 | highly significant. We see that this is caused by a large excess of 44 | heterozygotes. 45 | 46 | 47 | |Children | HbAHbA | HbAHbS | HbSHbS | Sum | 48 | |----------|:----------------------------|:----------------------------|:----------------------------|---| 49 | |Observed | 189 |89 |9 | 287| 50 | |Expected | 189.97 | 87.05 | 9.97 | 287 | 51 | 52 | where 53 | 54 | *p*(A) = 0.814 55 | 56 | *p*(S) = 0.186 57 | 58 | In this case the test for Hardy-Weinberg proportions gives $\chi^2 = 0.14$, 59 | which is non-significant. We also see that the allele frequencies 60 | among children and adults are similar. We do not bother to make a 61 | formal test for it. 62 | 63 | As might be known, the low survival of the HbSHbS genotype is due to 64 | sickle cell anemia: individuals with this genotype have a high probability of dying from 65 | the disease. The HbAHbS heterozygote is more resistant to malaria than the HbAHbA 66 | homozygote. We therefore have a system of 67 | overdominance. 68 |
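The Hardy-Weinberg calculations above can be reproduced with a few lines of R. This is a minimal sketch: the helper `hw_test` is a name introduced here for illustration, and it assumes the genotype counts are given in the order homozygote, heterozygote, homozygote.

```R
# Chi-square test for Hardy-Weinberg proportions at a biallelic locus.
# obs: genotype counts in the order (AA, AS, SS).
hw_test <- function(obs) {
  n <- sum(obs)
  p <- (2 * obs[[1]] + obs[[2]]) / (2 * n)   # frequency of the first allele
  q <- 1 - p
  expected <- n * c(p^2, 2 * p * q, q^2)     # Hardy-Weinberg expectations
  chisq <- sum((obs - expected)^2 / expected)
  # 3 genotype classes - 1 - 1 estimated allele frequency = 1 degree of freedom
  p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
  list(p = p, q = q, expected = expected, chisq = chisq, p_value = p_value)
}

adults   <- c(AA = 400, AS = 249, SS = 5)
children <- c(AA = 189, AS = 89,  SS = 9)

hw_adults   <- hw_test(adults)
hw_children <- hw_test(children)

round(hw_adults$expected, 2)
round(c(adults = hw_adults$chisq, children = hw_children$chisq), 2)
```

The test statistic is compared with a chi-square distribution with one degree of freedom because one allele frequency is estimated from the data.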
69 | 70 | 71 | 3) Under the assumption that the polymorphism has reached a stable 72 | equilibrium, estimate the fitness of the three genotypes. Hints: 73 | Assume that the adults had a genotypic distribution equal to their 74 | expected Hardy-Weinberg distribution and estimate their relative 75 | fitness. 76 | 77 |
78 | 79 | 80 | |Adults | HbAHbA | HbAHbS | HbSHbS | Sum | 81 | |----------|:----------------------------|:----------------------------|:----------------------------|---| 82 | |Observed | 400 | 249 | 5 | 654 | 83 | |Expected | 420.64 | 207.71 | 25.64 | 654 | 84 | |Fitness (O/E)|0.951 |1.199 |0.195 | | 85 | |Relative fitness | 0.793 | 1.000 |0.163 | | 86 | |Selection coefficient| 0.207 | 0 |0.837 | | 87 | 88 | We can see that the selection against the HbSHbS genotype is very 89 | severe. We can use the selection coefficient to estimate the equilibrium 90 | allele frequency 91 | 92 | *p* = *t*/(*s* + *t*) = 0.837/(0.207 + 0.837) = 0.802 93 | 94 | This value is identical to the estimated allele frequency among the 95 | adults. The reason is that we have estimated it under the assumption of 96 | equilibrium. 97 | 98 |
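The fitness table and the equilibrium frequency can be recomputed directly from the observed adult counts. The sketch below follows the same steps as the table (observed/expected, scaling to the heterozygote, selection coefficients); the variable names are chosen here for illustration.

```R
# Adult genotype counts from the exercise.
obs <- c(AA = 400, AS = 249, SS = 5)
n <- sum(obs)

# Allele frequency of HbA and Hardy-Weinberg expectations.
p <- (2 * obs[["AA"]] + obs[["AS"]]) / (2 * n)
q <- 1 - p
expected <- n * c(AA = p^2, AS = 2 * p * q, SS = q^2)

# Fitness as observed/expected, then scaled so the heterozygote has fitness 1.
abs_fitness <- obs / expected
rel_fitness <- abs_fitness / abs_fitness[["AS"]]

# Selection coefficients against the two homozygotes.
s_AA <- 1 - rel_fitness[["AA"]]
t_SS <- 1 - rel_fitness[["SS"]]

# Equilibrium frequency of HbA under overdominance: p = t / (s + t).
p_eq <- t_SS / (s_AA + t_SS)

round(rel_fitness, 3)
round(c(s = s_AA, t = t_SS, p_eq = p_eq), 3)
```

Because the fitness values were estimated under the assumption of equilibrium, `p_eq` returns the same allele frequency that was used to compute the expected counts.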
99 | 100 | In African Americans, one out of 400 suffers from sickle cell anemia. 101 | 102 | 4) Estimate the allele frequencies among them. 103 | 104 |
105 | 106 | 107 | *q* = √(1/400) = 0.05 108 |
109 | 110 | 5) Why is it lower than the frequency observed among Africans? 111 | 112 |
113 | 114 | 115 | There is no longer malaria in the USA. It has been eliminated. 116 | Therefore, there is no longer overdominant selection that keeps a high 117 | allele frequency of the deleterious allele. There has been directional 118 | selection against the deleterious allele, which has reduced its 119 | frequency. 120 |
121 | 122 | **Exercise 2** 123 | 124 |
125 | 126 |
127 | 128 | The figure shows the result of 13 repeated experiments with 129 | a chromosomal polymorphism in *Drosophila melanogaster*. It shows the 130 | frequency of one of the two chromosomal forms through the four generations 131 | that the experiment lasted. In six of the experiments one type started at 132 | a frequency slightly higher than 0.9, and in seven of the experiments it started 133 | below 0.9. The population size in each experiment was 100. 134 | 135 | 136 | 1) Can the evolution of this system be explained as a result of genetic drift? 137 | 138 |
139 | 140 | 141 | No. It is highly unlikely that genetic drift in a population with a 142 | size of 100 can result in fixation after 4 generations. In addition, all six 143 | experiments starting with a frequency of about 0.9 end up being fixed 144 | for one allele, while the seven experiments that start with a frequency 145 | below 0.9 all end up being fixed for the other allele. Genetic 146 | drift would have a more “random” nature. 147 |
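How unlikely fixation by drift alone is can be checked with a small Wright-Fisher simulation. This is a sketch under stated assumptions: 100 diploid individuals (200 chromosome copies), a starting frequency of 0.9, four generations of binomial sampling and no selection. The function name `simulate_drift` and the number of replicates are chosen here for illustration.

```R
set.seed(1)

# Pure drift: each generation, 2N chromosome copies are sampled binomially
# from the previous generation's frequency.
simulate_drift <- function(p0, n_diploid, generations) {
  p <- p0
  for (g in seq_len(generations)) {
    p <- rbinom(1, size = 2 * n_diploid, prob = p) / (2 * n_diploid)
  }
  p
}

final <- replicate(10000,
                   simulate_drift(p0 = 0.9, n_diploid = 100, generations = 4))

frac_fixed <- mean(final == 1)  # replicates fixed for the common form
frac_lost  <- mean(final == 0)  # replicates that lost it
c(fixed = frac_fixed, lost = frac_lost)
```

Under these assumptions only a very small fraction of replicates reach fixation within four generations, so thirteen out of thirteen experiments ending fixed is not what drift alone would be expected to produce.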
148 | 149 | 2) Can the evolution be explained as a result of natural selection? 150 | How does it work? Which genotype has the lowest fitness? 151 | 152 |
153 | 154 | 155 | Yes; there must be underdominance where the heterozygote has a lower 156 | fitness than both homozygotes. 157 |
158 | 159 | **Exercise 3** 160 | 161 |
162 | 163 |
164 | 165 | A geneticist starts an experiment with *Drosophila melanogaster*. He 166 | uses 10 populations, each kept at a constant size of 8 males and 8 167 | females in each generation. In generation 0 all individuals are 168 | heterozygous for the two alleles *A*1 and *A*2 at 169 | an autosomal locus. After 19 generations, the following distribution of 170 | the allele frequency of *A*1 is observed in the 10 171 | populations: 172 | 173 | |0.18 |0.00 |0.18| 0.25| 0.30| 0.19| 0.16| 0.00| 0.15| 0.00| 174 | |-----|-----|----|-----|-----|-----|-----|-----|-----|-----| 175 | 176 | 1) Can this distribution of the allele frequency be explained by 177 | genetic drift? 178 |
179 | 180 | 181 | No. With genetic drift, the allele frequencies would be distributed 182 | randomly over the entire range from 0 to 1. Here, all 10 populations 183 | have an allele frequency of less than 0.5, which is very unlikely 184 | ($0.5^{10} = 0.000977$). 185 |
186 | 187 | 2)  Which other evolutionary force has also worked during this 188 | experiment? 189 |
190 | 191 | 192 | Natural selection. 193 |
194 | 195 | After 100 generations, all ten populations were fixed for allele 196 | *A*2*.* 197 | 198 | 3) Use this information to explain how the fitness of the three 199 | genotypes, *w*11, *w*12 and *w*22, 200 | are related to each other. 201 |
202 | 203 | 204 | Natural selection has also been involved, in this case in the form of 205 | directional selection, where 206 | 207 | *w*11 < *w*12 < *w*22. 208 |
209 | 210 | **Exercise 4** 211 | 212 |
213 | 214 |
215 | 216 | After the arrival of the Europeans in America, 217 | the California condor (*Gymnogyps californianus*) was severely hunted. 218 | This resulted in a drastic decline in population size, which culminated 219 | in 1987 when the last wild condors were placed in captivity (fourteen 220 | individuals). [Later on, the condor was released again. In 2014, 425 were living in 221 | the wild or in captivity.] Among the progeny of these fourteen individuals 222 | the genetic disease chondrodystrophy (a form of dwarfism) was observed. 223 | In the condor, this disease is inherited at an autosomal locus where 224 | chondrodystrophy is due to a recessive lethal allele. 225 | 226 | 1) What must the frequency of the allele for chondrodystrophy at least 227 | have been among the fourteen individuals that were used to found the 228 | population in captivity? 229 |
230 | 231 | 232 | *q* ≥ 2/(2 x 14) = 0.071 233 | 234 | There must at least have been two heterozygotes among the fourteen 235 | individuals. 236 | 237 |
238 | 239 | The population has since grown to a few 240 | hundred individuals. An estimate of the allele frequency for chondrodystrophy 241 | gave a value of 0.09. 242 | 243 | 2) Can the frequency of this lethal allele be caused by one of the 244 | following three forces separately? (In question 3 you will be asked 245 | whether a combination of these forces is needed to explain the 246 | frequency.) 247 | 248 | - mutation 249 | - genetic drift 250 | - natural selection 251 | 252 |
253 | 254 | 255 | - mutation No 256 | - genetic drift Yes 257 | - natural selection No 258 | 259 |
260 | 261 | 3) Is it necessary to consider that two or three of these forces act 262 | together to explain the frequency of this allele? 263 |
264 | 265 | 266 | The allele for chondrodystrophy arose through mutation, and genetic 267 | drift has resulted in the high frequency. (Natural selection 268 | eliminates this allele, and thus would not be able to explain the high 269 | frequency of the allele.) 270 |
271 | 272 | 4) What is the expected frequency of the allele at the balance 273 | between mutation and selection? The mutation rate can be assumed to be 274 | $\mu = 10^{-6}$. 275 |
276 | 277 | 278 | The equilibrium between mutation and natural selection is given by 279 | 280 | $p = \sqrt{\mu/s} = \sqrt{10^{-6}/1} = 0.001$. 281 |
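The equilibrium value can be compared numerically with the frequency of 0.09 estimated in the captive population. The sketch below uses the mutation-selection balance for a recessive allele from the answer above; the function name `mutation_selection_eq` is introduced here for illustration.

```R
# Equilibrium frequency of a recessive deleterious allele under
# mutation-selection balance: sqrt(mu / s).
mutation_selection_eq <- function(mu, s) {
  sqrt(mu / s)
}

freq_eq <- mutation_selection_eq(mu = 1e-6, s = 1)  # lethal allele, s = 1
freq_observed <- 0.09                               # estimate from the exercise

freq_eq
freq_observed / freq_eq  # how far the observed value is above equilibrium
```

The observed frequency is far above the equilibrium expectation, which is consistent with drift during the founding of the captive population rather than a mutation-selection balance.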
282 | 283 | **Exercise 5** 284 | 285 | Cystic fibrosis is caused by a recessive allele at a single autosomal 286 | locus, CFTR (cystic fibrosis transmembrane conductance regulator). In 287 | European populations 1 out of 2500 newborn children is homozygous for 288 | the recessive allele. 289 | 290 | 1) What is the frequency of the recessive allele in these populations? 291 |
292 | 293 | 294 | *q* = √ (1/2500) = 1/50 = 0.02 295 |
296 | 297 | 2) What fraction of all possible parental combinations has a 298 | probability of ¼ for having a child which is homozygous for the 299 | recessive allele? 300 |
301 | 302 | 303 | It must be the combination heterozygote × heterozygote: 304 | 305 | $2pq \times 2pq = (2 \times 0.98 \times 0.02)^2 = 0.0392^2 = 0.0015$ 306 | 307 |
308 | 309 | The disease is fatal during childhood if it is not treated. 310 | Therefore, it must be assumed that the fitness of the recessive 311 | homozygote was 0 during the main part of human 312 | evolutionary history. 313 | 314 | 3) Estimate the mutation rate, assuming equilibrium between 315 | mutation and selection. 316 |
317 | 318 | 319 | $q = \sqrt{\mu/s}$, 320 | 321 | where $\mu$ is the mutation rate and $s$ is the 322 | selection coefficient, which must be 1 since the fitness is 0. 323 | 324 | Therefore, 325 | 326 | $\mu = q^2 s = 0.02^2 \times 1 = 0.0004$ 327 | 328 |
329 | 330 | A direct estimate of the mutation rate was $6.7 \times 10^{-7}$, which 331 | is considerably lower than the estimate found in question 3. 332 | 333 | 4) Which mechanism(s) may explain the high frequency of the recessive 334 | deleterious allele? 335 |
336 | 337 | 338 | It could be overdominant selection, where the fitness of heterozygous 339 | carriers is higher than that of normal homozygotes. Several 340 | hypotheses have been proposed for this, such as increased resistance against tuberculosis or 341 | cholera, but there are no hard data to support them. 342 | 343 |
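The numbers used throughout this exercise can be recomputed in a few lines, which also shows how large the gap is between the mutation rate implied by mutation-selection balance and the direct estimate. This is a sketch that only uses values given in the exercise; the variable names are chosen here for illustration.

```R
incidence <- 1 / 2500          # frequency of affected newborns (q^2)
q <- sqrt(incidence)           # frequency of the recessive allele
p <- 1 - q

carrier_freq    <- 2 * p * q       # frequency of heterozygous carriers
carrier_matings <- carrier_freq^2  # fraction of heterozygote x heterozygote pairs

s <- 1                         # recessive homozygote assumed to have fitness 0
mu_implied <- q^2 * s          # mutation rate implied by q = sqrt(mu / s)
mu_direct  <- 6.7e-7           # direct estimate quoted in the exercise

fold_gap <- mu_implied / mu_direct

round(c(q = q, carrier_matings = carrier_matings), 4)
c(mu_implied = mu_implied, fold_gap = fold_gap)
```

The implied mutation rate is several hundred times the direct estimate, which is why mutation-selection balance alone is an unsatisfactory explanation for the observed allele frequency.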
344 | 345 | -------------------------------------------------------------------------------- /NaturalSelection/browser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/browser.png -------------------------------------------------------------------------------- /NaturalSelection/fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig1.png -------------------------------------------------------------------------------- /NaturalSelection/fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig2.png -------------------------------------------------------------------------------- /NaturalSelection/fig3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig3.png -------------------------------------------------------------------------------- /NaturalSelection/fig4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig4.png -------------------------------------------------------------------------------- /NaturalSelection/fig5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/fig5.png 
-------------------------------------------------------------------------------- /NaturalSelection/sel1sickle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel1sickle.png -------------------------------------------------------------------------------- /NaturalSelection/sel2underdominance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel2underdominance.png -------------------------------------------------------------------------------- /NaturalSelection/sel3drosophila.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel3drosophila.png -------------------------------------------------------------------------------- /NaturalSelection/sel4condor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NaturalSelection/sel4condor.png -------------------------------------------------------------------------------- /NaturalSelection/selectionAA.md: -------------------------------------------------------------------------------- 1 | 2 | # Selection scan exercises Part 1 3 | **Anders Albrechtsen** 4 | 5 | 6 | Go to https://genome.ucsc.edu/s/aalbrechtsen/hg18Selection 7 | 8 | 9 | 10 | 11 | This is a browser that can be used to view genomic data. With the above link you will view both genes and tajimas D for 3 populations. 
12 | - Individuals with African descent are named AD 13 | - Individuals with European descent are named ED 14 | - Individuals with Chinese descent are named XD 15 | -

16 | You are looking at a random 11Mb region of the genome. Try to get a sense of the values that Tajima's D takes along the genome for the 3 populations. 17 | - You can move to another part of the chromosome by clicking once on the chromosome arm 18 | - You can also change chromosome in the search field 19 | - You can zoom in by dragging the mouse 20 |
21 | 22 |
23 | 24 | 25 | 26 | Take note of the highest and lowest values of Tajima's D that you observed. 27 |

28 | 29 | 30 | Try to find the SLC45A2 gene (use the search field and choose the top option). This is one of the most strongly selected genes in Europeans. 31 | - Does it show an extreme value of Tajima's D? 32 | 33 |

34 | Try to go to the LCT locus. 35 | - Does this have an extreme value of Tajima’s D? 36 | - What can you conclude about the performance of Tajima’s D? 37 | 38 | # Part 2 39 | 40 | 41 | 42 | Open R and load data 43 | ## in R 44 | ```R 45 | ## paste the following code in R (you do not need to understand it) 46 | .libPaths( c( .libPaths(), "~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/Rlib/") ) 47 | setwd("~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/selection/selectionScan") 48 | source("server.R") 49 | 50 | ``` 51 | 52 | **Exercise** 53 | 54 | Let us see if we can do better than Tajima’s D by using the PBS statistic. First select 3 populations from 55 | - NAT - Native Americans (Peru + Mexico) 56 | - CHB – East Asian - Han Chinese 57 | - CEU – Central Europeans 58 | - YRI – African - Nigerians 59 | 60 | The first population is the one whose branch you are investigating. The two others are the ones you are comparing it to. Choose CEU as the first and choose CHB and YRI as the two others. 61 | 62 | 63 | ```R 64 | #### choose populations 65 | #pops 1=NAT, 2=CHB, 3=CEU, 4=YRI 66 | myPops <- c(3,2,4) 67 | ``` 68 | 69 | 70 | First let's get an overview of the whole genome by making a Manhattan plot 71 | 72 | 73 | 74 | 75 | ```R 76 | #### Manhattan plot of PBS for the chosen populations 77 | PBSmanPlot(myPops) 78 | ``` 79 | 80 | Note which chromosomes have extreme values. A high value of PBS means a long branch length. 81 | To view a single chromosome, use the PBS region plot (`PBSmanRegion`). 82 | 83 | Choose the chromosome with the highest PBS value and set the starting position to -1 to get the whole chromosome 84 | 85 | e.g. 86 | 87 | 88 | ```R 89 | # see entire chromosome 1 90 | PBSmanRegion(myPops,chrom=1,start=-1) 91 | ``` 92 | 93 | Zoom in to the peak by changing start and end position.
94 | 95 | ```R 96 | # see region between 20Mb and 21Mb on chromosome 1 97 | PBSmanRegion(myPops,chrom=1,start=20,end=21) 98 | ``` 99 | 100 | - Locate the most extreme regions of the genome and zoom in 101 | - Identify the gene with the highest PBS value. 102 | - What does the gene do? 103 | - Try the LCT gene (the mutations are located in the adjacent MCM6 gene). See below on how to get the position 104 | - How does this compare with Tajima’s D? 105 | 106 | If you have time you can try other genes. Here are the top ones for humans. You can find the location of the genes using for example the UCSC browser https://genome-euro.ucsc.edu/cgi-bin/hgGateway (choose human GRCh37/hg19 genome). Note that there are some populations that you cannot test because they are not represented in the data, e.g. Tibetans, Ethiopians, Inuit, Siberians. 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | -------------------------------------------------------------------------------- /NaturalSelection/shinySelectionSim.R: -------------------------------------------------------------------------------- 1 | library(shiny) 2 | library(ggplot2) 3 | 4 | # Define UI for application 5 | ui <- fluidPage( 6 | titlePanel("Allele Frequency Trajectory Simulation"), 7 | sidebarLayout( 8 | sidebarPanel( 9 | numericInput("init_freq", "Initial Allele Frequency:", 0.5, min = 0, max = 1), 10 | numericInput("selection_coeff", "Selection Coefficient (s):", 0.1, min = 0, max = 1), 11 | numericInput("pop_size", "Population Size (N):", 1000, min = 100), 12 | numericInput("generations", "Number of Generations:", 50, min = 10), 13 | numericInput("num_simulations", "Number of Simulations:", 5, min = 1), 14 | numericInput("y_min", "Y-axis Minimum:", 0, min = 0, max = 1), 15 | numericInput("y_max", "Y-axis Maximum:", 1, min = 0, max = 1), 16 | actionButton("simulate", "Simulate") 17 | ), 18 | mainPanel( 19 | plotOutput("freqPlot") 20 | ) 21 | ) 22 | ) 23 | 24 | # Define server logic 25 | server <- 
function(input, output) { 26 | simulate_data <- eventReactive(input$simulate, { 27 | init_freq <- input$init_freq 28 | selection_coeff <- input$selection_coeff 29 | pop_size <- input$pop_size 30 | generations <- input$generations 31 | num_simulations <- input$num_simulations 32 | 33 | simulations <- lapply(1:num_simulations, function(i) { 34 | allele_freq <- numeric(generations) 35 | allele_freq[1] <- init_freq 36 | 37 | for (j in 2:generations) { 38 | allele_count <- rbinom(1, pop_size, allele_freq[j-1]) 39 | allele_freq[j] <- allele_count / pop_size 40 | allele_freq[j] <- allele_freq[j] + selection_coeff * allele_freq[j] * (1 - allele_freq[j]) 41 | } 42 | 43 | data.frame(Generation = 1:generations, Frequency = allele_freq, Simulation = i) 44 | }) 45 | 46 | do.call(rbind, simulations) 47 | }) 48 | 49 | output$freqPlot <- renderPlot({ 50 | req(simulate_data()) 51 | ggplot(simulate_data(), aes(x = Generation, y = Frequency, group = Simulation, color = as.factor(Simulation))) + 52 | geom_line(alpha = 0.7) + 53 | labs(title = "Allele Frequency Trajectory", x = "Generation", y = "Allele Frequency", color = "Simulation") + 54 | scale_y_continuous(limits = c(input$y_min, input$y_max)) + 55 | theme_minimal() 56 | }) 57 | } 58 | 59 | # Run the application 60 | shinyApp(ui = ui, server = server) 61 | -------------------------------------------------------------------------------- /NaturalSelection/z: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /NucleotideDiversiteyExercise/Selection_020.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/Selection_020.png -------------------------------------------------------------------------------- 
/NucleotideDiversiteyExercise/chimp_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/chimp_distribution.png -------------------------------------------------------------------------------- /NucleotideDiversiteyExercise/ex1nucdivfigure1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/ex1nucdivfigure1.png -------------------------------------------------------------------------------- /NucleotideDiversiteyExercise/slidesGeneticDiversityExercise .pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/NucleotideDiversiteyExercise/slidesGeneticDiversityExercise .pdf -------------------------------------------------------------------------------- /Population_history/population_history_lab_2020_noAnswers.md: -------------------------------------------------------------------------------- 1 | Simulations and population history lab 2 | ====================================== 3 | 4 | Rasmus Heller, Feb 2020 5 | 6 | **Program** 7 | 8 | Part I: Quick demonstration of fastsimcoal, a coalescent simulator software. 9 | 10 | Part II: Looking at sfs’s from three different population histories. 11 | 12 | Part III: Looking at coalescence trees from the same histories. 13 | 14 | Part IV: Making a skyline plot from HIV virus data. 15 | 16 | Part V: Evaluating two hypothesis of Bornean elephant population history. 17 | 18 | **Aim** 19 | 20 | The aim of this practical lab is to use coalescent simulations to understand the 21 | connection between population history, coalescent trees and genetic data. 
22 | Specifically you will learn: 23 | 24 | 1) how to set up a coalescent simulation using fastsimcoal. 25 | 26 | 2) what sfs’s and coalescent trees from different population histories look 27 | like. 28 | 29 | 3) how to interpret a skyline plot of population history. 30 | 31 | 4) how to use simulations and real data to choose between two hypotheses about population history. 32 | 33 | We are going to use the coalescent simulator software fastsimcoal to examine how 34 | different population histories affect genetic data. **The necessary files are in 35 | the ‘exercises/populationHistory’ folder** (/home/ 36 | /groupdirs/SCIENCE-BIO-Popgen_Course/exercises/populationHistory/). Please copy 37 | this entire folder to your home directory by writing this: 38 | 39 | ```bash 40 | cp -r groupdirs/SCIENCE-BIO-Popgen_Course/exercises/populationHistory ./exercises/ 41 | ``` 42 | 43 | Now move to the new folder location: 44 | 45 | ```bash 46 | cd exercises/populationHistory 47 | ``` 48 | 49 | **Part I: perform simulations** 50 | 51 | There are three .par files among the files you just copied. We will take a look 52 | at them together. 53 | 54 | Now, navigate to the ‘populationHistory’ folder in your home directory and run 55 | the following command: 56 | 57 | ```bash 58 | fsc26 -i constant1.par -n10 -X -s0 -I -T -d 59 | ``` 60 | 61 | This performs the simulations from the constant1.par file. Run the same command 62 | for the two other .par files, decline1.par and expand1.par. 63 | 64 | **Part II: plot sfs** 65 | 66 | Among the data files output from the previous commands are three sfs files 67 | output from fastsimcoal, one from each of three population histories: constant, 68 | declining and expanding population sizes. We will plot the sfs files and compare 69 | among the histories. Navigate to the output folder (/constant1/ first, after 70 | that /decline1/ and /expand1/). 
Go to the constant1 folder and Open R: 71 | 72 | ```bash 73 | cd constant1 74 | 75 | R 76 | ``` 77 | 78 | Now paste this code: 79 | 80 | ```R 81 | constant<-read.table("constant1_DAFpop0.obs", skip=2) #read in the constant.sfs. 82 | constant1.sfs<-as.matrix(constant[,2:10]) #extract the variable sites from the ten sfs. 83 | norm1.constant<-apply(constant1.sfs,1,function(x) x/sum(x)) #getting proportions instead of counts. 84 | barplot(norm1.constant, be=T, main="10 SFS's from a constant population, simulating 100 x 10,000bp", ylim=c(0, 0.5), ylab="Proportion of sites") #plot the sfs. 85 | ``` 86 | 87 | Question: Is there a lot of variation across the ten simulations? 88 | 89 | Question: What do you think determines how different the SFS's are? 90 | 91 | Now we look at the decline1 scenario. Without closing your R session do this: 92 | 93 | ```R 94 | setwd("../decline1/") #navigates to the decline1 simulation folder. 95 | dev.new() #lets you open a new graphic device while keeping the other open. 96 | decline<-read.table("decline1_DAFpop0.obs", skip=2) #next three lines as above. 97 | decline1.sfs<-as.matrix(decline[,2:10]) 98 | norm1.decline<-apply(decline1.sfs,1,function(x) x/sum(x)) #getting proportions instead of counts. 99 | barplot(norm1.decline, be=T, main="10 SFS's from a declining population, simulating 100 x 10,000bp", ylim=c(0, 0.5), ylab="Proportion of sites") #plot the sfs 100 | ``` 101 | 102 | And the expand1 scenario: 103 | 104 | ```R 105 | setwd("../expand1/") #navigates to the appropriate simulation folder. 106 | dev.new() 107 | expand<-read.table("expand1_DAFpop0.obs", skip=2) #next three lines as above. 108 | expand1.sfs<-as.matrix(expand[,2:10]) 109 | norm1.expand <- apply(expand1.sfs,1,function(x) x/sum(x)) #getting proportions instead of counts. 
110 | barplot(norm1.expand, be=T, main="10 SFS's from an expanding population, simulating 100 x 10,000bp", ylim=c(0, 0.5), ylab="Proportion of sites") #plot the sfs 111 | ``` 112 | 113 | Question: Are there obvious differences in the SFS among the three different scenarios? 114 | 115 | Question: Do the SFS look as expected (see for example Fig. 3.9 in Nielsen & Slatkin)? 116 | 117 | **Part III: examine coalescence trees** 118 | 119 | From our simulations we also created 10*100 coalescent trees from each of the 120 | three scenarios. The files containing the trees (in a text format) are called 121 | [scenario name] _1_true_trees.trees. We will browse through them and compare 122 | trees among histories. 123 | 124 | Run the following R code to load the 1000 trees from the constant1 scenario: 125 | 126 | ```R 127 | setwd("../constant1/") #navigates to the appropriate simulation folder. 128 | library(ape) 129 | conTrees<-read.nexus("constant1_1_true_trees.trees") 130 | plot(conTrees[seq(1,1000,by=100)], show.tip.label=F) #this plots the first of 100 trees from each simulation one by one and allows you to switch to the next one by hitting ENTER. 131 | ``` 132 | 133 | Question: Looking at the way the simulations are done, what does each tree represent? 134 | Is it the tree of one locus in each simulation, or the combined tree of all loci 135 | in each simulation? Are the trees identical? Similar? 136 | 137 | Next, we will plot just the first of the 1000 trees: 138 | 139 | ```R 140 | plot(conTrees[[1]], show.tip.label=F) 141 | add.scale.bar() #this adds a scale bar with units in numbers of mutations. 142 | ``` 143 | 144 | Now, run the following to view the first tree from the decline1 scenario: 145 | 146 | ```R 147 | setwd("../decline1/") #navigates to the appropriate simulation folder. 
148 | dev.new() 149 | decTrees<-read.nexus("decline1_1_true_trees.trees") 150 | plot(decTrees[[1]] , show.tip.label=F) 151 | add.scale.bar() 152 | ``` 153 | 154 | Question: Are there differences between the decline and the constant trees? Are they as 155 | you expected? 156 | 157 | Lastly, here is the code to plot the first tree from the expand1 simulation: 158 | 159 | ```R 160 | setwd("../expand1/") #navigates to the appropriate simulation folder. 161 | dev.new() 162 | expTrees<-read.nexus("expand1_1_true_trees.trees") 163 | plot(expTrees[[1]] , show.tip.label=F) 164 | add.scale.bar() 165 | ``` 166 | 167 | Question: Are there obvious differences between the expansion and the decline trees? 168 | Are they as you expected? 169 | 170 | Next we will construct skyline plots of population size from the simulated 171 | trees. Let’s start by looking at one of the expansion trees: 172 | 173 | ```R 174 | dev.new() 175 | sky1<-skyline(expTrees[[1]], 200) #this takes only the 1st of 1000 expansion trees and makes the skyline plot calculations. The second parameter controls the smoothing and was chosen by us for this specific situation. 176 | plot(sky1, subst.rate=0.00001, main="Skyline plot, expansion tree 1") #plots the skyline object. The second parameter should equal the simulated mutation rate. 177 | ``` 178 | 179 | Question: Do you see an expansion? Where do you think the signal is coming from? 180 | 181 | Question: Compare the 1st expansion tree with its skyline plot. Can you explain 182 | why the signal in the skyline plot becomes noisy at some point back in time? 183 | 184 | **Part IV: inferring HIV virus population history** 185 | 186 | We are going to look at a data set comprised of HIV virus isolates from central 187 | Africa. We are interested in finding out whether the HIV population size has 188 | changed over time, and over which time scale. The following is R code to load 189 | and analyze a coalescent tree from HIV virus data using APE. 
It will give us 190 | various population trees and a skyline plot: 191 | 192 | ```R 193 | library(ape) #loads the package 'ape' ('install.packages("ape")' will install it if not present in local R packages. 194 | data("hivtree.newick") #get example data, Newick tree file. 195 | tree1<-read.tree(text = hivtree.newick) #reads in the data as a 'tree' object in R. 196 | plot(tree1, show.tip.label=F) #plots the tree as a phylogram. 197 | ``` 198 | 199 | Question: Pause here to look at the tree. What would you say about the population 200 | history of HIV from this tree? 201 | 202 | ```R 203 | sky2<-skyline(tree1, 0.0119) #constructs a 'skyline' object (generalized skyline plot) with estimated popsize for collapsed coalescent intervals. 204 | dev.new() 205 | plot(sky2, show.years=TRUE, subst.rate=0.0023, present.year = 1997) #plot generalized skyline plot. 206 | ``` 207 | 208 | Question: Describe the overall population size history of HIV virus going back in time 209 | from 1997. How does this relate to what you know about the history of HIV/AIDS? 210 | 211 | **Part V: when did the elephant arrive on Borneo?** 212 | 213 | We will evaluate which of two hypotheses regarding the founding time of the 214 | population of elephants on Borneo is most likely. See our paper here: 215 | . 216 | 217 | Hypothesis 1: a small number of elephants were introduced recently (in the 17th 218 | century) by humans. 219 | 220 | Hypothesis 2: a small number of elephants migrated naturally to Borneo during 221 | the Last Glacial Maximum (LGM, about 20,000 years ago) when Borneo was connected 222 | by land bridge to mainland Asia. 223 | 224 | This is strongly debated in the scientific and the public communities, with 225 | hypothesis 1 being the consensus view. The question has implications for how the 226 | Bornean elephant is managed as well as for understanding the biogeography of the 227 | Indonesian islands. 
We will use data obtained from microsatellites and 228 | simulations to evaluate which of the hypothesis is more likely. 229 | 230 | I have used coalescent simulations (see Box 5.1 in N&S) to simulate 1000 data 231 | sets of 18 microsatellites under each of the two hypotheses. The two simulation 232 | output files are among the files you downloaded in the beginning of the 233 | exercise. 234 | 235 | Navigate to the populationHistory folder in your /home/ directory. Open R. We 236 | will plot the distribution of a certain summary statistic calculated from the 237 | simulated data and see how it agrees with the value we have obtained from the 238 | actual data from the elephant population. This summary statistic is allelic 239 | richness (K), or the mean number of alleles in the population across 18 microsat 240 | loci. We plot the distribution of K from both simulated scenarios: 241 | 242 | ```R 243 | sulu<-read.table("sulu_outSumStats.txt", head=T) 244 | lgm<-read.table("lgm_outSumStats.txt", head=T) 245 | plot(density(sulu$K_1), xlim=c(2,4.5), main="Distribution of K in the introduction scenario") 246 | abline(v= 4.38889, col="red") #this plots the actual value of K from real microsat data in the same plot as the simulated values. 247 | dev.new() 248 | plot(density(lgm$K_1), main="Distribution of K in the LGM scenario") 249 | abline(v= 4.38889, col="red") 250 | ``` 251 | 252 | Question: Look at the distribution of allelic richness (K_1) from two different 253 | scenarios. Compare it with the observed value from the real data (red line in 254 | each plot). Which scenario appears more in agreement with the data? What would 255 | you conclude from this observation? 
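A visual comparison of the density plots can be complemented by a simple number: the proportion of simulated data sets whose K is at least as large as the observed value. The sketch below is only an illustration under stated assumptions: `tail_proportion` is a hypothetical helper that is not part of the exercise files, and `toy_sim` holds made-up values rather than the course simulations. With the real files you would pass `sulu$K_1` or `lgm$K_1` instead of `toy_sim`.

```R
# Hypothetical helper (not part of the exercise files): proportion of
# simulated values that are at least as large as the observed value.
tail_proportion <- function(sim, obs) {
  mean(sim >= obs)
}

# Made-up stand-in values, for illustration only.
toy_sim <- c(2.1, 2.8, 3.0, 3.3, 3.9, 4.4, 4.6, 5.0)

# Observed allelic richness quoted in the exercise.
obs_K <- 4.38889

# Three of the eight toy values are >= obs_K.
tail_proportion(toy_sim, obs_K)
```

A proportion close to zero would suggest that the scenario rarely produces data as diverse as the observed data; how small is "too small" remains a judgment call.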
256 | -------------------------------------------------------------------------------- /Population_history/z: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Population_structure/ExerciseFst2024WA.md: -------------------------------------------------------------------------------- 1 | # Exercises - measuring population differentiation and detecting signatures of selection using FST 2 | 3 | ## Genis Garcia Erill & Renzo Balboa 4 | 5 | ## Program 6 | 7 | - Read genotype data into R and apply R functions to estimate FST. 8 | - Use FST in windows across the genome and Manhattan plots to detect local signatures of natural selection. 9 | - Interpret and discuss the results from both analyses in a biological context. 10 | 11 | ## Learning Objectives 12 | 13 | - Learn to estimate FST between population pairs using genotype data. 14 | - Learn to interpret FST estimates in relation to processes of population divergence. 15 | - Learn to visualise local FST across the genome and identify candidate genes under selection. 16 | 17 | ## Recommended Reading 18 | 19 | Chapter 4 - Population Subdivision (pp. 59-76). 20 | 21 | In: Nielsen, R. and Slatkin, M. (2013). _An Introduction to Population Genetics: Theory and Applications_. Sinauer. 22 | 23 | ## Data and setup 24 | 25 | For this exercise, we will use the same dataset that was used on Monday for [inferring population structure](https://github.com/populationgenetics/exercises/blob/master/Population_structure/ExerciseStructure_2023.md). 26 | The following commands will create a new folder and copy the dataset to that new folder, 27 | but you are free to work in the `structure` folder that you created in the last exercise (in which 28 | case, you can skip the following commands). 
29 | 30 | ```bash 31 | # make a directory called structure_fst inside ~/exercises/ and change to that directory 32 | mkdir -p ~/exercises/structure_fst 33 | cd ~/exercises/structure_fst 34 | 35 | # Download/copy data (remember the . at the end) 36 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/structure/pa/* . 37 | 38 | # List the downloaded files 39 | ls -l 40 | ``` 41 | 42 | Today, we are going to calculate the fixation index (*FST*), a widely used statistic in population 43 | genetics, between subspecies of chimpanzees. This is a measure of population differentiation and thus, we 44 | can use it to distinguish populations in a quantitative way. 45 | It is worth noticing that what *FST* measures is the 46 | reduction in heterozygosity within observed populations when compared to a pooled population. 47 | 48 | There are several methods for estimating *FST* from genotype data. We will not 49 | cover them in the course but if you are interested in getting an overview of some of these 50 | estimators and how they differ, you can have a look at [this article](https://genome.cshlp.org/content/23/9/1514.full.pdf). 51 | 52 | Here, we use the [Weir and Cockerham *FST* estimator from 1984](https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1558-5646.1984.tb05657.x) to 53 | calculate *FST*. Again, the theory behind it and the estimator itself are 54 | not directly part of the course but if you are interested, you can find the formula that is implemented in the following R function in either 55 | the Weir and Cockerham (1984) article and/or in equation 6 from the Bhatia et al. (2011) article linked above. 56 | 57 | Open R and copy/paste the following function. You do not need to understand what the code does (but are welcome to try if you are interested, and ask if you have questions): 58 | 59 | #### \>R 60 | ```R 61 | WC84<-function(x,pop){ 62 | # function to estimate Fst using Weir and Cockerham estimator. 
63 | # x is NxM genotype matrix, pop is N length vector with population assignment for each sample 64 | # returns list with fst between population per M snps (theta) and other variables 65 | 66 | ###number ind in each population 67 | n<-table(pop) 68 | ###number of populations 69 | npop<-nrow(n) 70 | ###average sample size of each population 71 | n_avg<-mean(n) 72 | ###total number of samples 73 | N<-length(pop) 74 | ###frequency in samples 75 | p<-apply(x,2,function(x,pop){tapply(x,pop,mean)/2},pop=pop) 76 | ###average frequency in all samples (apply(x,2,mean)/2) 77 | p_avg<-as.vector(n%*%p/N ) 78 | ###the sample variance of allele 1 over populations 79 | s2<-1/(npop-1)*(apply(p,1,function(x){((x-p_avg)^2)})%*%n)/n_avg 80 | ###average heterozygotes 81 | # h<-apply(x==1,2,function(x,pop)tapply(x,pop,mean),pop=pop) 82 | #average heterozygote frequency for allele 1 83 | # h_avg<-as.vector(n%*%h/N) 84 | ###faster version than above: 85 | h_avg<-apply(x==1,2,sum)/N 86 | ###nc (see page 1360 in Weir and Cockerham, 1984) 87 | n_c<-1/(npop-1)*(N-sum(n^2)/N) 88 | ###variance between populations 89 | a <-n_avg/n_c*(s2-(p_avg*(1-p_avg)-(npop-1)*s2/npop-h_avg/4)/(n_avg-1)) 90 | ###variance between individuals within populations 91 | b <- n_avg/(n_avg-1)*(p_avg*(1-p_avg)-(npop-1)*s2/npop-(2*n_avg-1)*h_avg/(4*n_avg)) 92 | ###variance within individuals 93 | c <- h_avg/2 94 | ###inbreeding (F_it) 95 | F <- 1-c/(a+b+c) 96 | ###(F_st) 97 | theta <- a/(a+b+c) 98 | ###(F_is) 99 | f <- 1-c/(b+c) 100 | ###weighted average of theta 101 | theta_w<-sum(a)/sum(a+b+c) 102 | list(F=F,theta=theta,f=f,theta_w=theta_w,a=a,b=b,c=c,total=c+b+a) 103 | } 104 | ``` 105 | 106 | Do not close R. 107 | 108 | ## Measuring population differentiation with *FST* 109 | 110 | Now we will read in our data and apply the function above to each of the three pairs of subspecies to estimate their 111 | *FST*. We want to make three comparisons.
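Before turning to the real data, the idea stated earlier, that *FST* measures the reduction in heterozygosity within populations relative to a pooled population, can be illustrated with a toy calculation. This is a minimal sketch with made-up allele frequencies for two equally sized populations. It uses the simple definition (H_T - H_S) / H_T and is not the Weir and Cockerham estimator implemented in `WC84`, which additionally corrects for sample size.

```R
# Toy allele frequencies in two equally sized populations (made-up numbers).
p <- c(pop1 = 0.2, pop2 = 0.8)

# Expected heterozygosity within each population, then their average (H_S).
H_within <- 2 * p * (1 - p)
H_S <- mean(H_within)

# Expected heterozygosity in the pooled population (H_T).
p_pooled <- mean(p)
H_T <- 2 * p_pooled * (1 - p_pooled)

# FST as the relative reduction in heterozygosity.
fst_toy <- (H_T - H_S) / H_T
fst_toy
```

The more the allele frequencies differ between the two populations, the larger the gap between H_T and H_S and the higher the resulting value. The exercise continues with the real genotype data.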
112 | 113 | 114 | ```R 115 | library(snpMatrix) 116 | 117 | # read genotype data using read.plink function form snpMatrix package 118 | data <- read.plink("pruneddata") 119 | 120 | # extract genotype matrix, convert to normal R integer matrix 121 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data)) 122 | 123 | # original format is 0: missing, 1: hom minor, 2: het, 3: hom major 124 | # convert to NA: missing, 0: hom minor, 1: het, 2: hom major 125 | geno[geno==0] <- NA 126 | geno <- geno - 1 127 | 128 | # keep only SNPs without missing data 129 | g <- geno[,complete.cases(t(geno))] 130 | ``` 131 | 132 | Let's stop for a moment and look at the dimensions of the genotype matrix before and after filtering SNPs with missing data: 133 | 134 | ```R 135 | # dimensions before filtering 136 | dim(geno) 137 | 138 | # dimensions after filtering 139 | dim(g) 140 | ``` 141 | 142 | **Q1:** How many SNPs and individuals are there before and after filtering? How many SNPs did we initially have with missing data? 143 | 144 | 145 | 146 | Let's continue in the same R session: 147 | 148 | ``` R 149 | # load population information 150 | popinfo <- read.table("pop.info", stringsAsFactors=F, col.names=c("pop", "ind")) 151 | 152 | # check which subspecies we have 153 | unique(popinfo$pop) 154 | 155 | # save names of the three subspecies 156 | subspecies <- unique(popinfo$pop) 157 | 158 | # check which individuals belong to each subspecies 159 | sapply(subspecies, function(x) popinfo$ind[popinfo$pop == x]) 160 | ``` 161 | 162 | **Q2:** How many samples do we have from each subspecies? (Of course, this is easy to do by physically counting the number of elements in the vectors we just printed, but can you edit the code above to print the number of individuals for each subspecies?) 
163 | 164 | 165 | Let's continue in R and estimate the FST values for each pair of subspecies: 166 | 167 | 168 | ``` r 169 | # get all pairs of subspecies 170 | subsppairs <- t(combn(subspecies, 2)) 171 | 172 | # apply fst function to each of the three subspecies pairs 173 | fsts <- apply(subsppairs, 1, function(x) WC84(g[popinfo$pop %in% x,], popinfo$pop[popinfo$pop %in% x])) 174 | 175 | # name each fst 176 | names(fsts) <- apply(subsppairs, 1, paste, collapse="_") 177 | 178 | # print global fsts for each pair 179 | lapply(fsts, function(x) x$theta_w) 180 | 181 | ``` 182 | 183 | Do not close R. 184 | 185 | 186 | 187 | 188 | **Q3:** Does population differentiation correlate with the geographical 189 | distance between subspecies? (You can find the geographical distribution of each subspecies using [Figure 1 from Monday](https://github.com/populationgenetics/exercises/blob/master/Population_structure/ExerciseStructure_2023.md#inferring-chimpanzee-population-structure-and-admixture-using-exome-data)). 190 | 191 | 192 | 193 | **Q4:** The troglodytes and schweinfurthii populations have the same 194 | divergence time with verus, but based on *FST*, schweinfurthii has slightly increased differentiation from verus. Based on what we learned in today's lecture, what factors do you think could explain the difference? 195 | 196 | 197 | 198 | 199 | ## Scanning for loci under selection using an *FST* outlier approach 200 | 201 | In the previous section, we estimated *FST* across all SNPs for which we have data and then estimated 202 | a global *FST* as the average across all SNPs. Now we will visualise local *FST* in sliding windows across 203 | the genome with the aim of finding regions with outlying large *FST*. This is a common approach that can reveal 204 | candidate SNPs and genes that may have been under positive selection in different populations. 
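To make the idea of a windowed *FST* concrete, the sketch below computes a weighted *FST* for overlapping windows as a ratio of sums of variance components, which is the same kind of ratio that `theta_w` uses genome-wide. Everything here is an assumption made for illustration: the component values are made up, `window_fst` is a hypothetical helper, and the windows are indexed by their first SNP, which is simpler than the centre-based indexing used in the course plotting function.

```R
# Made-up per-SNP variance components for six SNPs (illustration only).
a  <- c(0.02, 0.00, 0.10, 0.01, 0.00, 0.03)  # between populations
bc <- c(0.18, 0.10, 0.10, 0.19, 0.05, 0.17)  # within populations (b + c)

# Hypothetical helper: weighted FST for the SNPs with indices idx,
# computed as a ratio of sums rather than a mean of per-SNP ratios.
window_fst <- function(a, bc, idx) {
  sum(a[idx]) / sum(a[idx] + bc[idx])
}

# Windows of 3 SNPs, moved along one SNP at a time.
starts <- 1:4
win <- sapply(starts, function(s) window_fst(a, bc, s:(s + 2)))
win
```

Because neighbouring windows share most of their SNPs, their values are not independent, which is worth keeping in mind when looking at the plots.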
205 | 206 | First, we will define a function to generate a Manhattan plot of local *FST* values across the genome in sliding windows. 207 | Copy the following function; you do not need to understand it (but are welcome to try if you are interested, and ask if you have questions): 208 | 209 | 210 | ``` R 211 | manhattanFstWindowPlot <- function(mainv, xlabv, ylabv, ylimv=NULL, window.size, step.size,chrom, fst, colpal = c("lightblue", "darkblue")){ 212 | 213 | chroms <- unique(chrom) 214 | step.positions <- c() 215 | win.chroms <- c() 216 | 217 | for(c in chroms){ 218 | whichpos <- which(chrom==c) 219 | chrom.steps <- seq(whichpos[1] + window.size/2, whichpos[length(whichpos)] - window.size/2, by=step.size) 220 | step.positions <- c(step.positions, chrom.steps) 221 | win.chroms <- c(win.chroms, rep(c, length(chrom.steps))) 222 | } 223 | 224 | n <- length(step.positions) 225 | fsts <- numeric(n) 226 | # estimate per window weighted fst 227 | for (i in 1:n) { 228 | chunk_a <- fst$a[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)] 229 | chunk_b <- fst$b[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)] 230 | chunk_c <- fst$c[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)] 231 | fsts[i] <- sum(chunk_a) / sum(chunk_a + chunk_b + chunk_c) 232 | } 233 | 234 | 235 | plot(x=1:length(fsts),y=fsts,main=mainv,xlab=xlabv,ylab=ylabv,cex=1, 236 | pch=20, cex.main=1.25, col=colpal[win.chroms %% 2 + 1], xaxt="n") 237 | 238 | yrange <- range(fsts) 239 | 240 | text(y=yrange[1] - c(0.05, 0.07) * diff(yrange) * 2, x=tapply(1:length(win.chroms), win.chroms, mean), labels=unique(win.chroms), xpd=NA) 241 | 242 | zz <- fsts[!is.na(fsts)] 243 | abline(h=quantile(zz,0.999,na.rm=TRUE),col="red", lty=2, lwd=2) 244 | abline(h=mean(fsts), lty=2, lwd=2) 245 | } 246 | ``` 247 | 248 | Do not close R.
249 | 250 | Using this function, we will now produce a Manhattan plot for each of the three subspecies pairs: 251 | 252 | 253 | ``` R 254 | # read bim file to get info on snp location 255 | bim <- read.table("pruneddata.bim", h=F, stringsAsFactors=F) 256 | 257 | # keep only sites without missing data (to get same sites we used for fst) 258 | bim <- bim[complete.cases(t(geno)),] 259 | 260 | # keep chromosome and bp coordinate of each snp and define pairnames 261 | snpinfo <- data.frame(chr=bim$V1, pos=bim$V4) 262 | pairnames <- apply(subsppairs, 1, paste, collapse=" ") 263 | 264 | # group snps in windows of 10, with sliding window of 1 265 | windowsize <- 10 266 | steps <- 1 267 | 268 | # make the plots 269 | par(mfrow=c(3,1)) 270 | for(pair in 1:3){ 271 | mainvv = paste("Sliding window Fst:", pairnames[pair], "SNPs =", length(fsts[[pair]]$theta), "Win: ", windowsize, "Step: ", steps) 272 | manhattanFstWindowPlot(mainvv, "Chromosome", "Fst", window.size=windowsize, step.size=steps, fst =fsts[[pair]], chrom=snpinfo$chr) 273 | } 274 | 275 | ``` 276 | 277 | If you want to save the plot on the server as a png (so you can then download it to your own computer, or visualise it on the server), you can use the following code: 278 | 279 | ``` R 280 | bitmap("manhattan.png", w=8, h=8, res=300) 281 | par(mfrow=c(3,1)) 282 | for(pair in 1:3){ 283 | mainvv = paste("Sliding window Fst:", pairnames[pair], "SNPs =", length(fsts[[pair]]$theta), "Win: ", windowsize, "Step: ", steps) 284 | manhattanFstWindowPlot(mainvv, "Chromosome", "Fst", window.size=windowsize, step.size=steps, fst =fsts[[pair]], chrom=snpinfo$chr) 285 | } 286 | dev.off() 287 | ``` 288 | 289 | Do not close R. 290 | 291 | 292 | 293 | In the plot we have just generated, the black dotted line indicates the mean FST value across all windows and the red dotted line the 99.9% percentile (i.e. where only 0.1% of the windows have an FST above that value). 
One way to define outlying windows is to only consider windows that have an FST above the 99.9% percentile (this value is necessarily arbitrary). 294 | 295 | 296 | **Q5:** Compare the peaks of high *FST* in the three subspecies pairs. Do they tend to be found in the same position? Would you expect this to be the case? Why/why not? 297 | 298 | 299 | 300 | 301 | **Q6:** Most of the top *FST* windows are found in groups of nearby windows with high *FST*. Can you explain or guess why that happens? 302 | 303 | 304 | 305 | 306 | 307 | ### EXTRA - Exploring genes in candidate regions for selection 308 | 309 | We have now identified several SNPs that are candidates for having been positively selected in some 310 | populations. Now we can try to see what genes these SNPs are located in (the genotype data we have been working with 311 | comes from exome sequencing, meaning that SNPs will always be located within genes). 312 | 313 | To do so, we need to know the genomic coordinates of the outlier windows in the Manhattan plot. 314 | 315 | Copy and paste the following function into R; this will return the top n (default n=20) windows with maximum *FST* for a given 316 | pairwise comparison. 
Again, you do not need to understand what the code does (but are welcome to try if you are interested, and ask if you have questions): 317 | 318 | ```r 319 | topWindowFst <- function(window.size, step.size, chrom, pos, fst, n_tops = 20){ 320 | 321 | chroms <- unique(chrom) 322 | step.positions <- c() 323 | win.chroms <- c() 324 | 325 | for(c in chroms){ 326 | whichpos <- which(chrom==c) 327 | chrom.steps <- seq(whichpos[1] + window.size/2, whichpos[length(whichpos)] - window.size/2, by=step.size) 328 | step.positions <- c(step.positions, chrom.steps) 329 | win.chroms <- c(win.chroms, rep(c, length(chrom.steps))) 330 | } 331 | 332 | n <- length(step.positions) 333 | fsts <- numeric(n) 334 | win.coord <- character(n) 335 | 336 | # estimate per window weighted fst 337 | for (i in 1:n) { 338 | chunk_a <- fst$a[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)] 339 | chunk_b <- fst$b[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)] 340 | chunk_c <- fst$c[(step.positions[i]-window.size/2):(step.positions[i]+window.size/2)] 341 | fsts[i] <- sum(chunk_a) / sum(chunk_a + chunk_b + chunk_c) 342 | 343 | c <- win.chroms[i] 344 | win.coord[i] <- paste0(c, ":", pos[step.positions[i]-window.size/2], "-", pos[step.positions[i]+window.size/2]) 345 | } 346 | 347 | ord <- order(fsts, decreasing=T) 348 | 349 | return(data.frame(position=win.coord[ord][1:n_tops], fst=fsts[ord][1:n_tops])) 350 | } 351 | ``` 352 | 353 | 354 | Now we can use the function to identify where the top hits from the previous plot are located. 355 | Let's have a look the top 10 *FST* windows between troglodytes and schweinfurthii: 356 | 357 | ``` r 358 | windowsize <- 10 359 | steps <- 1 360 | topWindowFst(window.size=windowsize, step.size=steps, chrom=snpinfo$chr, pos=snpinfo$pos, fst=fsts[[1]], n=10) 361 | 362 | ``` 363 | 364 | 365 | 366 | **Q7:** What are the genomic coordinates of the window with the highest *FST*? 
367 | 368 | 369 | 370 | 371 | Now let's look at some gene annotations at this location. Open the [chimpanzee genome assembly in the UCSC genome browser](https://genome.ucsc.edu/cgi-bin/hgTracks?db=panTro5&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A78555444%2D78565444&hgsid=1293765481_hOBCvmiwGLVKt1SRo9yIaRFa0wYc) and copy/paste the chromosome and coordinates of the top window in the search bar (paste the coordinates in the same format as what was printed in R (i.e. chr:start-end)). 372 | 373 | **Q8:** Is this window within a gene? If so, can you figure out the name of the gene and its possible function? (Hint: click on the drawing of the gene on the 'Non-Chimp RefSeq Genes' track. This track describes genes identified in other organisms that have high sequence similarity to the observed region of the chimp genome. This suggests that the gene may also be present in that region of the chimp genome). 374 | 375 | 376 | 377 | 378 | **Q9:** Can we conclude that selection on this gene has driven biological differentiation between the troglodytes and schweinfurthii chimpanzee subspecies? 
379 | 380 | -------------------------------------------------------------------------------- /Population_structure/ExerciseStructure.md: -------------------------------------------------------------------------------- 1 | # Exercise in structured populations 2 | 3 | **Peter Frandsen and Casper-Emil Tingskov Pedersen** 4 | 5 | ## Program 6 | 7 | - Understanding genetic structure in populations 8 | 9 | - Construction of a Principal Component Analysis (PCA) and ADMIXTURE 10 | plot using SNP data 11 | 12 | - Understanding the effect of population structure in genetic analyses 13 | 14 | ## Aim 15 | 16 | - Introduce command-line based approach to structure analyses 17 | 18 | - Introduction to ADMIXTURE 19 | 20 | - Learn to interpret results from structure analyses and put these in 21 | a biological context 22 | 23 | ## References 24 | 25 | - See Chapter 4 in ”An introduction to Population Genetics” and 26 | chapter 5 page 99-103 27 | 28 | - See Hvilsom et al. 2013. Heredity 29 | 30 | http://www.nature.com/hdy/journal/v110/n6/pdf/hdy20139a.pdf 31 | 32 | - See Prado-Martinez et al. 2013. Nature 33 | 34 | http://www.nature.com/nature/journal/v499/n7459/pdf/nature12228.pdf 35 | 36 | ## Clarifying chimpanzee population structure and admixture using exome data 37 | 38 | Disentangling the chimpanzee taxonomy has been surrounded with much 39 | attention, and with continuously newly discovered populations of 40 | chimpanzees, controversies still exist about the true number of 41 | subspecies. The unresolved taxonomic labelling of chimpanzee populations 42 | has negative implications for future conservation planning for this 43 | endangered species. In this exercise, we will use 110.000 SNPs from the 44 | chimpanzee exome to infer population structure and admixture of 45 | chimpanzees in Africa, in order to acquire thorough knowledge of the 46 | population structure and thus help guide future conservation management 47 | programs. 
We make use of 29 wild born chimpanzees from across their 48 | natural distributional area (Figure 1). 49 | 50 | 51 |
52 | 53 |
54 | 55 | **Figure 1** Geographical distribution of the common chimpanzee *Pan 56 | troglodytes* (from Frandsen & Fontsere *et al.* 2020, https://www.nature.com/articles/s41437-020-0313-0). 57 | 58 | ### Warm-up exercise 59 | 60 | Consider two non-identical and separated populations of chimpanzees 61 | without genetic drift, natural selection, mutation or migration. Mating 62 | is random within each population. We analyze variation at a biallelic locus. 63 | In population 1 the frequency of 64 | allele A is 0.1 and in population 2 the frequency of allele A is 0.9. A 65 | young researcher has collected a sample comprising 50% from each 66 | population (each population is assumed to be of equal size). Calculate 67 | the proportion of heterozygous individuals in the pooled population. 68 | Then, calculate the expected heterozygosity in the pooled population. 69 | 70 | **Q1:** Is there a difference between expected and observed heterozygosity? 71 | 72 | ## PCA 73 | 74 | We start by downloading the exome data to the folder 75 | `~/exercises/structure`. To do this you can use the following commands. 76 | Open a terminal and type: 77 | 78 | ```bash 79 | cd ~/exercises # if you do not have a directory called exercises, make one: mkdir ~/exercises 80 | mkdir structure 81 | cd structure 82 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/structure/pa.zip . 83 | unzip pa.zip 84 | rm pa.zip 85 | ``` 86 | 87 | First, we want to look at the data (as you always should *before* 88 | doing any analyses) by opening R in the exercise directory (don’t close 89 | it before this manual states that you should) and typing: 90 | 91 | #### \>R 92 | 93 | ```R 94 | popinfo = read.table("pop.info") 95 | table(popinfo[,1]) 96 | ``` 97 | 98 | **Q2**: Which subspecies are represented in the data and how many 99 | individuals are there from each? 100 | 101 | Now we want to import our genotype data into R.
102 | #### \>R 103 | 104 | ```R 105 | library(snpMatrix) 106 | data <- read.plink("pruneddata") 107 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data)) 108 | geno <- t(geno) 109 | geno[geno==0]<- NA 110 | geno<-geno-1 111 | ###Let’s take a look at the first SNP (across populations) 112 | table(geno[1,]) 113 | ###Let’s take a look at the first individual (across all SNPs) 114 | table(geno[,1]) 115 | ###Let’s take a look at the last individual (across all SNPs) 116 | table(geno[,29]) 117 | ``` 118 | 119 | **Q3**: The genotypes are coded as 0, 1 and 2. Explain what 0, 1 120 | and 2 mean. At the first SNP: How many homozygotes are there? And how 121 | many heterozygotes? For the first and last individual: How many 122 | homozygotes are there? And how many heterozygotes? 123 | 124 | Now we want to look at the principal components. Type the following into 125 | R: 126 | 127 | #### \>R 128 | ```R 129 | summary(prcomp(na.omit(geno))) 130 | ``` 131 | **Q4**: Look at columns PC1 and PC2. How much of the variation is 132 | explained if you were to use these two principal components?
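The "Proportion of Variance" row printed by `summary(prcomp(...))` is derived from the standard deviations of the components. The sketch below shows that relationship on a small random matrix, which is toy data and not the chimpanzee genotypes, so the numbers say nothing about the answer for the real data.

```R
set.seed(1)  # toy data only, unrelated to the exercise data
toy <- matrix(rnorm(50), nrow = 10, ncol = 5)

pc <- prcomp(toy)

# Variance of each component divided by the total variance.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)

# Proportion explained by the first two components together.
sum(var_explained[1:2])

# The same proportions (rounded) as reported by summary().
summary(pc)$importance["Proportion of Variance", ]
```

The proportions always sum to one, and the components are ordered from the largest to the smallest share of the variance.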
133 | 134 | Now we want to plot our genotype data. First, paste 135 | the following code into R (it first defines a function that does PCA and then calls that function to run PCA on your data): 136 | 137 | #### \>R 138 | ```R 139 | eigenstrat<-function(geno){ 140 | 141 | # Get rid of sites with missing data 142 | nMis<-rowSums(is.na(geno)) 143 | geno<-geno[nMis==0,] 144 | 145 | # Get rid of non-polymorphic sites 146 | avg<-rowSums(geno)/ncol(geno) 147 | keep<-avg!=0&avg!=2 148 | avg<-avg[keep] 149 | geno<-geno[keep,] 150 | 151 | # Get number of remaining SNPs and individuals 152 | snp<-nrow(geno) 153 | ind<-ncol(geno) 154 | 155 | # Make normalized genotype matrix 156 | freq<-avg/2 157 | M <- (geno-avg)/sqrt(freq*(1-freq)) 158 | 159 | # Get sample covariance matrix 160 | X<-t(M)%*%M 161 | X<-X/(sum(diag(X))/(snp-1)) 162 | 163 | # Do eigenvalue decomposition 164 | E<-eigen(X) 165 | 166 | # Calculate stuff relevant for number of components to look at 167 | mu<-(sqrt(snp-1)+sqrt(ind))^2/snp 168 | sigma<-(sqrt(snp-1)+sqrt(ind))/snp*(1/sqrt(snp-1)+1/sqrt(ind))^(1/3) 169 | E$TW<-(E$values[1]*ind/sum(E$values)-mu)/sigma 170 | E$mu<-mu 171 | E$sigma<-sigma 172 | class(E)<-"eigenstrat" 173 | E 174 | } 175 | plot.eigenstrat<-function(x,col=1,...) 176 | plot(x$vectors[,1:2],col=col,...) 177 | print.eigenstrat<-function(x) 178 | cat("statistic",x$TW,"\n") 179 | e<-eigenstrat(geno) 180 | 181 | ``` 182 | Then use the next lines of code to make a plot in R: 183 | 184 | #### \>R 185 | ```R 186 | plot(e,col=rep(c("lightblue","Dark red","lightgreen"),c(11,12,6)),xlab="PC1 21% of variance",ylab="PC2 12% of variance",pch=16,main="PCA plot") 187 | text(0, 0.18, "troglodytes") 188 | text(0.05, -0.2, "schweinfurthii") 189 | text(-0.32,-0.07,"verus") 190 | ``` 191 | 192 | 193 | **Q5**: When looking at the plot, does the number of clusters fit with 194 | what you saw in the pop.info file? And does it make sense when looking 195 | at Figure 1?
196 | Now close R by typing `quit()` and hitting `Enter` (it is up to you if you 197 | wish to save the workspace). 198 | 199 | 200 | ## Admixture 201 | 202 | Now we know that the populations look like they are separated into three 203 | distinct clusters (in accordance with the three subspecies), but we also 204 | want to know whether there has been any admixture between the three 205 | subspecies given that at least two of the subspecies have neighboring 206 | ranges (Figure 1). For this admixture run, we will vary the input 207 | value for the number of ancestral populations (*K*) that you want 208 | ADMIXTURE to try to separate the data into. To learn more about ADMIXTURE, 209 | the manual can be found here: 210 | 211 | https://www.genetics.ucla.edu/software/admixture/admixture-manual.pdf 212 | 213 | First we want to know whether the separation into three distinct 214 | populations is the best-supported clustering given our data. We do this by 215 | running a cross-validation test, which will give us an error value for 216 | each K. We want to find the K with the lowest cross-validation error. 217 | 218 | To do this, run the following lines of code in the terminal (this may 219 | take some time \~3 mins): 220 | 221 | ```bash 222 | for i in 2 3 4 5; do admixture --cv pruneddata.bed $i; done > cvoutput 223 | grep -i 'CV error' cvoutput 224 | ``` 225 | 226 | **Q6**: Which K value has the lowest CV error? 227 | 228 | Try running ADMIXTURE using this K value, by typing this in the terminal 229 | (remember to change the K-value to the value with the lowest 230 | CV error): 231 | 232 | ```bash 233 | admixture pruneddata.bed K-VALUE 234 | ``` 235 | 236 | Before we plot, we want to look at the results generated: 237 | 238 | ```bash 239 | less -S pruneddata.3.Q 240 | ``` 241 | 242 | **Q7**: The number of columns indicates the value of K used and the rows 243 | indicate individuals and their ancestry proportion in each population. 244 | Look at individual no.
10, do you consider this individual to be 245 | recently (within the last two generations) admixed? 246 | 247 | Now we want to plot our ADMIXTURE results, to do this open R and pasting 248 | the following code in: 249 | 250 | #### \>R 251 | ```R 252 | snpk2=read.table("pruneddata.2.Q") 253 | snpk3=read.table("pruneddata.3.Q") 254 | snpk4=read.table("pruneddata.4.Q") 255 | snpk5=read.table("pruneddata.5.Q") 256 | names=c("A872_17","A872_24","A872_25","A872_28","A872_35","A872_41", 257 | "A872_53","Cindy","Sunday","EXOTA_11785","PAULA_11784","SUSI_11043", 258 | "CINDY_11525","ABOUME","AMELIE","AYRTON","BAKOUMBA","BENEFICE", 259 | "CHIQUITA","LALALA","MAKOKOU","MASUKU","NOEMIE","SITA_11262", 260 | "SEPP_TONI_11300","A872_71","A872_72","AGNETA_11758","FRITS_11052") 261 | par(mfrow=c(4,1)) 262 | barplot(t(as.matrix(snpk2)), 263 | col= c("lightblue","Dark red"), 264 | border=NA, main="K=2", 265 | names.arg=(names), cex.names=0.8, las=2, ylab="ancestry") 266 | barplot(t(as.matrix(snpk3)), 267 | col= c("lightgreen","Dark red","lightblue"), 268 | border=NA, main="K=3", 269 | names.arg=(names), cex.names=0.8, las=2, ylab="ancestry") 270 | barplot(t(as.matrix(snpk4)), 271 | col= c("lightgreen","Dark red","lightblue","yellow"), 272 | border=NA, main="K=4", names.arg=(names), cex.names=0.8, las=2, 273 | ylab="ancestry") 274 | barplot(t(as.matrix(snpk5)), 275 | col= c("lightgreen","Dark red","lightblue","yellow","pink"), 276 | border=NA, main="K=5", names.arg=(names), cex.names=0.8, las=2, 277 | ylab="ancestry") 278 | ``` 279 | 280 | **Q8:** Looking at the plot, does it look like there has been any 281 | admixture when using a K value of 3? Does this mean that there has not 282 | been any admixture between any of the subspecies? Why / why not ? 283 | 284 | **Q9:** In the K=4 plot, *P.t.troglodytes* (central chimpanzee) is 285 | divided into two populations, have we overlooked a chimpanzee 286 | subspecies? 
287 | 288 | **Q10:** Assuming you had no prior information about your data (e.g. 289 | imagine you have a lot of sequence data sampled from random chimpanzee 290 | individuals in a zoo), would you be able to use an ADMIXTURE analysis 291 | to reveal whether there had been any admixture between any of the 292 | subspecies in nature? Why / why not? 293 | 294 | Next, we are going to calculate the fixation index between subspecies 295 | (*Fst*), which is a widely used statistic in population 296 | genetics. This is a measure of population differentiation, and thus we 297 | can use it to distinguish populations in a quantitative way. To get you 298 | started, we calculate *Fst* by hand and then later using a 299 | script. It is worth noting that what *Fst* measures is the 300 | reduction in heterozygosity compared to a pooled population. 301 | 302 | Here we will calculate the population differentiation in the gene 303 | (***SLC24A5***) which contributes to skin pigmentation (among other 304 | things) in humans. An allele (A) in this gene is associated with light 305 | skin. The SNP varies in frequency in populations in the Americas with 306 | mixed African and Native American ancestry. A sample from Mexico had 38% 307 | A and 62% G; in Puerto Rico the frequencies were 59% A and 41% G, and a 308 | sample of Africans had 2% A with 98% G. 309 | 310 | Calculate *Fst* in this example. Start by calculating the 311 | heterozygosity in each population, then Hs (the mean within-population heterozygosity) and then HT (the expected heterozygosity of the pooled population). 312 | 313 | | | African | Mexican | Puerto Rican | 314 | | -------------- | -------------- | --------------- | --------------- | 315 | | | A (2%) G (98%) | A (38%) G (62%) | A (59%) G (41%) | 316 | | Heterozygosity | | | | 317 | 318 | **Q11**: What is *Fst* in this case? 319 | 320 | **Q12**: Why is this allele not lost in Africans? What happened? 321 | 322 | Here we use the Weir and Cockerham (1984) *Fst* estimator to 323 | calculate *Fst* on the chimpanzees. 
Open R and copy/paste the 324 | following: 325 | 326 | #### \>R 327 | ```R 328 | WC84<-function(x,pop){ 329 | ###number of individuals in each population 330 | n<-table(pop) 331 | ###number of populations 332 | npop<-nrow(n) 333 | ###average sample size of each population 334 | n_avg<-mean(n) 335 | ###total number of samples 336 | N<-length(pop) 337 | ###frequency in samples 338 | p<-apply(x,2,function(x,pop){tapply(x,pop,mean)/2},pop=pop) 339 | ###average frequency in all samples (apply(x,2,mean)/2) 340 | p_avg<-as.vector(n%*%p/N ) 341 | ###the sample variance of allele 1 over populations 342 | s2<-1/(npop-1)*(apply(p,1,function(x){((x-p_avg)^2)})%*%n)/n_avg 343 | ###average heterozygotes 344 | # h<-apply(x==1,2,function(x,pop)tapply(x,pop,mean),pop=pop) 345 | #average heterozygote frequency for allele 1 346 | # h_avg<-as.vector(n%*%h/N) 347 | #faster version than above: 348 | h_avg<-apply(x==1,2,sum)/N 349 | ###nc (see page 1360 in Weir and Cockerham, 1984) 350 | n_c<-1/(npop-1)*(N-sum(n^2)/N) 351 | ###variance between populations 352 | a <-n_avg/n_c*(s2-(p_avg*(1-p_avg)-(npop-1)*s2/npop-h_avg/4)/(n_avg-1)) 353 | ###variance between individuals within populations 354 | b <- n_avg/(n_avg-1)*(p_avg*(1-p_avg)-(npop-1)*s2/npop-(2*n_avg-1)*h_avg/(4*n_avg)) 355 | ###variance within individuals 356 | c <- h_avg/2 357 | ###inbreeding (F_it) 358 | F <- 1-c/(a+b+c) 359 | ###(F_st) 360 | theta <- a/(a+b+c) 361 | ###(F_is) 362 | f <- 1-c/(b+c) 363 | ###weighted average of theta 364 | theta_w<-sum(a)/sum(a+b+c) 365 | list(F=F,theta=theta,f=f,theta_w=theta_w,a=a,b=b,c=c,total=c+b+a) 366 | } 367 | ``` 368 | 369 | Now read in our data. We want to make three comparisons. 
370 | 371 | #### \>R 372 | ```R 373 | library(snpMatrix) 374 | data <- read.plink("pruneddata") 375 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data)) 376 | geno <- t(geno) 377 | geno[geno==0]<- NA 378 | geno<-geno-1 379 | g<-geno[complete.cases(geno),] 380 | pop<-c(rep(1,11),rep(2,12),rep(3,6)) 381 | ### HERE WE HAVE OUR THREE COMPARISONS 382 | pop12<-pop[ifelse(pop==1,TRUE,ifelse(pop==2,TRUE,FALSE))] 383 | pop13<-pop[ifelse(pop==1,TRUE,ifelse(pop==3,TRUE,FALSE))] 384 | pop23<-pop[ifelse(pop==2,TRUE,ifelse(pop==3,TRUE,FALSE))] 385 | g12<-g[,ifelse(pop==1,TRUE,ifelse(pop==2,TRUE,FALSE))] 386 | g13<-g[,ifelse(pop==1,TRUE,ifelse(pop==3,TRUE,FALSE))] 387 | g23<-g[,ifelse(pop==2,TRUE,ifelse(pop==3,TRUE,FALSE))] 388 | result12<-WC84(t(g12),pop12) 389 | result13<-WC84(t(g13),pop13) 390 | result23<-WC84(t(g23),pop23) 391 | mean(result12$theta,na.rm=T) 392 | mean(result13$theta,na.rm=T) 393 | mean(result23$theta,na.rm=T) 394 | ``` 395 | 396 | **Q13:** Does population differentiation fit with the geographical 397 | distance between subspecies and their evolutionary history? 
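The hand calculation of *Fst* described earlier in this exercise (heterozygosity per population, then Hs, then HT) can also be sketched in R. This is a minimal sketch and not part of the original exercise: the function name `fst_by_hand` and the two allele frequencies are hypothetical, and it assumes equally weighted populations, Hardy-Weinberg proportions and the definition *Fst* = (HT - Hs) / HT.

```R
# Hypothetical helper (not from the exercise): Fst for one biallelic SNP,
# assuming equal population weights and Hardy-Weinberg proportions.
fst_by_hand <- function(p) {
  h     <- 2 * p * (1 - p)          # expected heterozygosity within each population
  hs    <- mean(h)                  # Hs: mean within-population heterozygosity
  p_bar <- mean(p)                  # allele frequency in the pooled population
  ht    <- 2 * p_bar * (1 - p_bar)  # HT: expected heterozygosity of the pooled population
  (ht - hs) / ht                    # reduction in heterozygosity relative to the pool
}

# Made-up allele frequencies for two populations (not the SLC24A5 values)
fst_by_hand(c(0.2, 0.8))
```

With these made-up frequencies, each population has heterozygosity 0.32 and the pooled population has 0.5, so the function returns 0.36. Populations with identical allele frequencies give 0.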
398 | -------------------------------------------------------------------------------- /Population_structure/ExerciseStructure_2022.md: -------------------------------------------------------------------------------- 1 | # Exercise in inference of population structure and admixture 2 | 3 | ## Program 4 | 5 | - Constructing and interpreting a Principal Component Analysis (PCA) plot using SNP data 6 | - Running and interpreting ADMIXTURE analyses using SNP data 7 | 8 | ## Learning objectives 9 | 10 | - Learn how to perform PCA and ADMIXTURE analyses of SNP data 11 | 12 | - Learn how to interpret results from such analyses and to put these in 13 | a biological context 14 | 15 | ## Recommended reading 16 | 17 | - ”An introduction to Population Genetics” pages 99-103 18 | 19 | ## Inferring chimpanzee population structure and admixture using exome data 20 | 21 | Chimpanzee taxonomy has received much 22 | attention, and as new populations of 23 | chimpanzees continue to be discovered, controversies still exist about the true number of 24 | subspecies. The unresolved taxonomic labelling of chimpanzee populations 25 | has negative implications for future conservation planning for this 26 | endangered species. In this exercise, we will use 110,000 SNPs from the 27 | chimpanzee exome to infer population structure and admixture of 28 | chimpanzees in Africa, in order to acquire thorough knowledge of the 29 | population structure and thus help guide future conservation management 30 | programs. We make use of 29 wild-born chimpanzees from across their 31 | natural distribution range (Figure 1). 32 | 33 | 34 |
35 | 36 |
37 | 38 | **Figure 1** Geographical distribution of the common chimpanzee *Pan 39 | troglodytes* (from [Frandsen & Fontsere *et al.* 2020](https://www.nature.com/articles/s41437-020-0313-0)). 40 | 41 | *If you want to read more about chimpanzee population structure* 42 | - [Hvilsom et al. 2013. Heredity](http://www.nature.com/hdy/journal/v110/n6/pdf/hdy20139a.pdf) 43 | - [Prado-Martinez et al. 2013. Nature](http://www.nature.com/nature/journal/v499/n7459/pdf/nature12228.pdf) 44 | 45 | ## PCA 46 | 47 | We start by creating a directory for this exercise and downloading the exome data to the folder. 48 | 49 | Open a terminal and type: 50 | 51 | ```bash 52 | # Make a directory for this exercise in your exercises folder 53 | cd ~/exercises 54 | mkdir structure 55 | cd structure 56 | 57 | # Download data (remember the . at the end) 58 | cp ~/groupdirs/SCIENCE-BIO-Popgen_Course/exercises/structure/pa/* . 59 | 60 | # Show the downloaded files 61 | ls -l 62 | ``` 63 | You have now downloaded a PLINK file-set consisting of 64 | 65 | - pruneddata.**fam**: Information about each individual (one line per individual) 66 | - pruneddata.**bim**: Information about each SNP/variant (one line per SNP/variant) 67 | - pruneddata.**bed**: A non-human-readable *binary* file containing all variants for all individuals. 68 | 69 | and a separate file containing assumed population info for each sample. 70 | 71 | - pop.info 72 | 73 | First, we want to look at the data (as you always should *before* doing any analyses). The command `wc -l [FILENAME]` counts the number of lines in a file. 74 | 75 |
76 | 77 | **Q1: How many samples and variants does the downloaded PLINK file-set consist of?** 78 | 79 |
80 | 81 | Now open R in the exercises directory (don’t close it before this manual states that you should) and type: 82 | ```R 83 | popinfo <- read.table("pop.info") 84 | table(popinfo[,1]) 85 | ``` 86 | 87 |
88 | 89 | **Q2** 90 | - **Q2.1: Which subspecies are represented in the data?** 91 | - **Q2.2: How many samples are there from each subspecies?** 92 | - **Q2.3: Does the total number of samples match what you found in Q1?** 93 | 94 |
95 | 96 | Now we want to import our genotype data into R. 97 | 98 | ```R 99 | # Load data 100 | library(snpMatrix) 101 | data <- read.plink("pruneddata") 102 | geno <- matrix(as.integer(data@.Data),nrow=nrow(data@.Data)) 103 | geno[geno==0] <- NA 104 | geno <- geno-1 105 | 106 | # Show the number of rows and columns 107 | dim(geno) 108 | 109 | # Show counts of genotypes for SNP/variant 17 110 | table(geno[,17], useNA='a') 111 | 112 | # Show counts of genotypes for sample 1 113 | table(geno[1,], useNA='a') 114 | ``` 115 | 116 |
117 | 118 | **Q3** 119 | - **Q3.1: How many SNPs and samples have you loaded into *geno*? Does it match what you found in Q1 and Q2?** 120 | - **Q3.2: How many samples are heterozygous for SNP 17 (*what do 0, 1, and 2 mean*)?** 121 | - **Q3.3: How many SNPs have missing data (*NA*) for sample 8? (*you need to change the code to find the information for sample 8*)** 122 | 123 |
124 | 125 | Specialized software (e.g. [PCAngsd](https://doi.org/10.1534/genetics.118.301336)) can handle missing information in a clever way, but for now we will simply remove all sites that have missing information and then perform PCA with the standard R function `prcomp`. 126 | 127 | ```R 128 | # Number of missing samples per site 129 | nMis <- colSums(is.na(geno)) 130 | 131 | # Only keep sites with 0 missing samples. 132 | geno <- geno[,nMis==0] 133 | 134 | # Perform PCA 135 | pca <- prcomp(geno, scale=T, center=T) 136 | 137 | # Show summary 138 | summary(pca) 139 | 140 | # Extract importance of PCs. 141 | pca_importance <- summary(pca)$importance 142 | plot(pca_importance[2,], type='b', xlab='PC', ylab='Proportion of variance', las=1, 143 | pch=19, col='darkred', bty='L', main='Proportion of variance explained per PC') 144 | ``` 145 | 146 |
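To see what the `importance` table reports, the proportion of variance per PC can be recomputed from the standard deviations that `prcomp` returns. The sketch below uses a small simulated genotype matrix; the object `toy_geno` and its dimensions are made up for illustration and are not the chimpanzee data. It drops sites with missing values in the same way as above, and additionally drops sites without variation, because scaling to unit variance is not possible for a constant column.

```R
set.seed(1)
# Simulated 0/1/2 genotypes: 10 samples (rows) x 50 sites (columns); illustration only
toy_geno <- matrix(rbinom(10 * 50, size = 2, prob = 0.3), nrow = 10)
toy_geno[sample(length(toy_geno), 15)] <- NA   # sprinkle in some missing genotypes

# Keep only sites without missing data ...
toy_geno <- toy_geno[, colSums(is.na(toy_geno)) == 0]
# ... and with variation among samples
toy_geno <- toy_geno[, apply(toy_geno, 2, var) > 0]

toy_pca <- prcomp(toy_geno, scale. = TRUE, center = TRUE)

# Proportion of variance per PC, computed directly from the standard deviations
prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

# Row 2 of the importance table holds the same numbers, rounded to 5 decimals
all.equal(unname(summary(toy_pca)$importance[2, ]), round(prop_var, 5))
```

The cumulative proportion asked about in Q4 is then simply `cumsum(prop_var)`.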
147 | 148 | **Q4: Which principal components (PCs) are most important in terms of variance explained and how much variance do they explain together (cumulative)?** 149 | 150 |
151 | 152 | 153 | Now let's plot the first two principal components. 154 | 155 | ```R 156 | # Extract percentage of the variance that is explained by PC1 and PC2 157 | PC1_explained <- round(pca_importance[2,1]*100, 1) 158 | PC2_explained <- round(pca_importance[2,2]*100, 1) 159 | 160 | # Extract the PCs 161 | pcs <- as.data.frame(pca$x) 162 | 163 | # Custom colors matching the original colors on the map. 164 | palette(c('#E69F00', '#D55E00', '#56B4E9')) 165 | plot(pcs$PC1, pcs$PC2, col=as.factor(popinfo$V1), pch=19, las=1, bty='L', 166 | main='PCA on 29 wild-born chimpanzees', 167 | xlab=paste0('PC1 (', PC1_explained, '% of variance)'), 168 | ylab=paste0('PC2 (', PC2_explained, '% of variance)')) 169 | legend('topleft', legend=levels(as.factor(popinfo$V1)), col=1:nlevels(as.factor(popinfo$V1)), pch=19) 170 | ``` 171 |
172 | 173 | **Q5** 174 | 175 | - **Q5.1 How many separate PCA-clusters are found in the first two PCs?** 176 | - **Q5.2 Which populations are separated by PC1? How does this match the geography from Figure 1 (top of document)** 177 | - **Q5.3 Do the PCs calculated only from genetic data match the information from the pop.info file (i.e. do any of the samples cluster with a different population than specified by the sample-info/color)?** 178 | 179 |
180 | 181 | Save/screenshot the plot for later. Now close R by typing `q()` and hitting `Enter` (no need to save the workspace). 182 | 183 | **BONUS (if you finish early)**: Very often we cannot load all the data into R, so we need to calculate PCA using software such as PLINK. Using Google and/or `plink --help`, try to perform PCA on the data with PLINK (remember to remove missingness) and plot the results using R. 184 | 185 | ## Admixture 186 | 187 | We now know that the populations appear to separate into three 188 | distinct clusters (in accordance with the three subspecies), but we also 189 | want to know whether there has been any admixture between the three 190 | subspecies, given that at least two of the subspecies have neighboring 191 | ranges (Figure 1). The ADMIXTURE manual can be found 192 | [here](http://dalexander.github.io/admixture/admixture-manual.pdf). 193 | 194 | When running ADMIXTURE, we input a certain number of ancestral populations, *K*. 195 | Then ADMIXTURE picks a random starting point (determined by the *seed*) and tries to find the 196 | parameters (admixture proportions and ancestral frequencies) resulting in the highest likelihood. 197 | 198 | First, let's try to run ADMIXTURE once assuming *K=3* ancestral populations. 199 | ```bash 200 | # Make sure you are in the ~/exercises/structure/ directory 201 | cd ~/exercises/structure/ 202 | 203 | # Run admixture once 204 | admixture -s 1 pruneddata.bed 3 > pruneddata_K3_run1.log 205 | 206 | # Look at output files 207 | ls -l 208 | ``` 209 | 210 |
211 | 212 | **Q6** 213 | - **Q6.1 What is in the pruneddata.3.Q, pruneddata.3.P, and pruneddata_K3_run1.log files? (*Hints: Use `less -S` to look in the files. Use `wc -l [FILE]` to count the number of lines in the files. Look in the manual*)** 214 | - **Q6.2 What are the ancestry proportions (of the three populations) of sample 4?** 215 | - **Q6.3 Is it admixed and can we say which subspecies it is from?** 216 | - **Q6.4 What is the final Loglikelihood of the fit?** 217 | - **Q6.5 Can we be sure that this is the best overall fit for this data - why/why not? (*hint: see next part of exercise, but remember to explain why*)** 218 | 219 |
220 | 221 | 222 | Now, let's run the model fit 10 times with different seeds (different starting points) 223 | ```bash 224 | # Assumed number of ancestral populations 225 | K=3 226 | 227 | # Repeat the run for seeds 1 to 10 228 | for i in {1..10} 229 | do 230 | # Run admixture with seed i 231 | admixture -s ${i} pruneddata.bed ${K} > pruneddata_K${K}_run${i}.log 232 | 233 | # Copy the output files to run-specific names (the next run overwrites pruneddata.K.Q and pruneddata.K.P) 234 | cp pruneddata.${K}.Q pruneddata_K${K}_run${i}.Q 235 | cp pruneddata.${K}.P pruneddata_K${K}_run${i}.P 236 | done 237 | 238 | # Show the likelihood of all the 10 runs (in a sorted manner): 239 | grep ^Loglikelihood: *K${K}*log | sort -k2 240 | ``` 241 | 242 |
243 | 244 | **Q7** 245 | - **Q7.1 Which run-numbers have the 3 highest Loglikelihoods?** 246 | - **Q7.2 Has the model(s) converged? (*rule of thumb: it's converged if we have 3 runs within ± 1 loglikelihood unit*)** 247 | - **Q7.3 Which run would you use as the result to plot?** 248 | 249 |
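The rule of thumb in Q7.2 can be written down as a small check in R. This is a sketch under one reading of the rule (the three best runs lie within 1 log-likelihood unit of the best run); the function name `has_converged` and the log-likelihood values are hypothetical and are not output from ADMIXTURE.

```R
# Hypothetical helper: TRUE if the n_top best runs lie within `tol`
# log-likelihood units of the best run.
has_converged <- function(loglik, n_top = 3, tol = 1) {
  top <- sort(loglik, decreasing = TRUE)[seq_len(n_top)]
  length(loglik) >= n_top && (top[1] - top[n_top]) <= tol
}

# Made-up log-likelihoods for 10 runs (one value per seed)
loglik <- c(-5000.2, -5000.3, -5120.8, -5000.9, -5064.1,
            -5210.4, -5000.2, -5333.0, -5001.7, -5090.5)

which.max(loglik)       # index of the run with the highest log-likelihood
has_converged(loglik)   # are the three best runs within 1 unit of each other?
```

In practice the values would be read off the `Loglikelihood:` lines that the `grep` command above prints, one per log file.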
250 | 251 | Now let's plot the ancestry proportions of each sample from the best-fitting run. **Open R**. 252 | 253 | ```R 254 | # Margins and colors 255 | par(mar=c(7,3,2,1), mgp=c(2,0.6,0)) 256 | palette(c("#E69F00", "#56B4E9", "#D55E00", "#999999")) 257 | 258 | # Load sample names 259 | popinfo <- read.table("pop.info") 260 | sample_names <- popinfo$V2 261 | 262 | # Read sample ancestral proportions 263 | snp_k3_run5 <- as.matrix(read.table("pruneddata_K3_run5.Q")) 264 | 265 | barplot(t(snp_k3_run5), col=c(3,2,1), names.arg=sample_names, cex.names=0.8, 266 | border=NA, main="K=3 - Run 5", las=2, ylab="Ancestry proportion") 267 | ``` 268 | 269 | Save/screenshot the plot for later. Close R by typing `q()` and hitting `Enter` (no need to save the workspace). 270 | 271 |
272 | 273 | **Q8** 274 | - **Q8.1 Explain the plot - what is shown for each sample?** 275 | - **Q8.2 Do the assigned ancestry proportions match the subspecies classification for each sample?** 276 | - **Q8.3 Can you find any admixed samples (how does/would an admixed sample look)?** 277 | - **Q8.4 Does assuming 3 ancestral populations (K=3) seem to be a good fit to the data?** 278 | - **Q8.5 Could the model have assumed only 2 ancestral populations (*K*) in these runs?** 279 | 280 |
281 | 282 | Now, run admixture 10 times assuming only two ancestral populations (**K=2**). You can reuse the code from K=3, changing only the value of K. 283 | 284 | 285 |
286 | 287 | **Q9** 288 | - **Q9.1 Did the model(s) converge?** 289 | - **Q9.2 Which run had the highest and which had the lowest loglikelihood?** 290 | 291 |
292 | 293 | Open R again and let's plot the best and the worst fit. 294 | 295 | First we plot the best: 296 | 297 | ```R 298 | # Margins and colors 299 | par(mar=c(7,3,2,1), mgp=c(2,0.6,0)) 300 | palette(c("#E69F00", "#56B4E9", "#D55E00", "#999999")) 301 | 302 | # Load sample names 303 | popinfo <- read.table("pop.info") 304 | sample_names <- popinfo$V2 305 | 306 | # Read sample ancestral proportions 307 | snp_k2_run1 <- as.matrix(read.table("pruneddata_K2_run1.Q")) 308 | 309 | barplot(t(snp_k2_run1), col=c(2,1), names.arg=sample_names, cex.names=0.8, 310 | border=NA, main="K=2 - Run 1 (best fit)", las=2, ylab="Ancestry proportion") 311 | ``` 312 | 313 | Save/screenshot the plot for later. 314 | 315 | Then we plot the worst: 316 | 317 | ```R 318 | # Read sample ancestral proportions 319 | snp_k2_run4 <- as.matrix(read.table("pruneddata_K2_run4.Q")) 320 | 321 | barplot(t(snp_k2_run4), col=c(2,1), names.arg=sample_names, cex.names=0.8, 322 | border=NA, main="K=2 - Run 4 (worst fit)", las=2, ylab="Ancestry proportion") 323 | ``` 324 | 325 | Save/screenshot the plot for later. Close R by typing `q()` and hit `Enter` (no need to save the workspace). 326 | 327 |
328 | 329 | **Q10** 330 | - **Q10.1 Would the conclusion be different if you used the worst fit instead of the best?** 331 | - **Q10.2 How would an F1 sample (child of parents from two different subspecies) look?** 332 | 333 |
334 | 335 | Run admixture 10 times assuming 4 ancestral populations (K=4). 336 | 337 |
338 | 339 | **Q11** 340 | - **Q11.1 Did the model(s) converge?** 341 | - **Q11.2 Which run had the highest and which had the lowest loglikelihood?** 342 | 343 |
344 | 345 | It turns out that the model(s) eventually converges with the best fit at seed i=52. Run that single seed: 346 | 347 | ```bash 348 | # Assumed number of ancestral populations 349 | K=4 350 | 351 | # Specific seed 352 | i=52 353 | 354 | # Run admixture and rename output files 355 | admixture -s ${i} pruneddata.bed ${K} > pruneddata_K${K}_run${i}.log 356 | cp pruneddata.${K}.Q pruneddata_K${K}_run${i}.Q 357 | cp pruneddata.${K}.P pruneddata_K${K}_run${i}.P 358 | 359 | grep ^Loglikelihood: *K${K}*log | sort -k2 360 | ``` 361 | 362 | Now plot this converged best fit: 363 | ```R 364 | # Margins and colors 365 | par(mar=c(7,3,2,1), mgp=c(2,0.6,0)) 366 | palette(c("#E69F00", "#56B4E9", "#D55E00", "#999999")) 367 | 368 | # Load sample names 369 | popinfo <- read.table("pop.info") 370 | sample_names <- popinfo$V2 371 | 372 | # Read sample ancestral proportions 373 | snp_k4_run52 <- as.matrix(read.table("pruneddata_K4_run52.Q")) 374 | 375 | barplot(t(snp_k4_run52), col=c(3,4,1,2), names.arg=sample_names, cex.names=0.8, 376 | border=NA, main="K=4 - Run 52 (Best fit)", las=2, ylab="Ancestry proportion") 377 | ``` 378 | 379 | Save/screenshot the plot for later. Plot the worst fit: 380 | 381 | ```R 382 | # Read sample ancestral proportions 383 | snp_k4_run4 <- as.matrix(read.table("pruneddata_K4_run4.Q")) 384 | 385 | barplot(t(snp_k4_run4), col=c(3,2,1,4), names.arg=sample_names, cex.names=0.8, 386 | border=NA, main="K=4 - Run 4 (Worst fit, not converged)", las=2, ylab="Ancestry proportion") 387 | ``` 388 | 389 | Save/screenshot the plot for later. Close R by typing `q()` and hit `Enter` (no need to save the workspace). 
390 | 391 | **Q12 Looking at all the results (PCA and admixture K=2, K=3, and K=4)** 392 | - **Q12.1 Do the admixture and PCA analyses correspond with the known geography?** 393 | - **Q12.2 Which number of ancestral populations do you find the most likely?** 394 | - **Q12.3 What are the possible explanations for the K=4 admixture results?** 395 | - **Q12.4 Does it look like we have admixed samples?** 396 | - **Q12.5 Can we conclude anything about admixture between the subspecies?** 397 | -------------------------------------------------------------------------------- /Population_structure/chimp_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/chimp_distribution.png -------------------------------------------------------------------------------- /Population_structure/exer_struct_1_PCA.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/exer_struct_1_PCA.png -------------------------------------------------------------------------------- /Population_structure/exer_struct_2_admixture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/exer_struct_2_admixture.png -------------------------------------------------------------------------------- /Population_structure/manhattanFst.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/Population_structure/manhattanFst.png -------------------------------------------------------------------------------- 
/Population_structure/z: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /QuantitativeGenetics/FforInd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/FforInd.png -------------------------------------------------------------------------------- /QuantitativeGenetics/Fgenome.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/Fgenome.tar.gz -------------------------------------------------------------------------------- /QuantitativeGenetics/IBDexercises.md: -------------------------------------------------------------------------------- 1 | # Inbreeding 2 | You have gotten access to DNA from an individual, and using genetic markers across the genome you estimate the inbreeding coefficient to be F=0.062 3 | - Is this high compared to what you would expect for a human? 4 | - What is the simplest pedigree that could give rise to such an estimate? 5 | - - Hint 1: Look at the “Expectations of inbreeding coefficient from pedigrees” slides from earlier 6 | 7 | 8 | 9 | - Go to https://popgen.dk/shiny/anders/popgenCourse/Fgenome/ to simulate the 22 autosomes for a human. 10 | - - NB! The link does not work if you are using the KU VPN. 11 | - - If it is slow to respond you can use a local version (see below) 12 | - - Try to simulate an individual from this simple pedigree (use the expected F from such a pedigree, don’t use “a”) 13 | 14 | 15 |

16 | 17 |
18 |

19 | 20 | - Note the simulated inbreeding coefficient for this individual. Why is it not the same as the F you entered? 21 | - - Note the length of the inbreeding tracts. What determines how long they are? 22 | - - Note the number of chromosomes that do not have inbreeding tracts. Try to draw how this might happen 23 | - - Try to get an idea of the range of possible inbreeding coefficients by trying multiple simulations (still using the same F) 24 | - - Look at the table of simple consanguineous pedigrees at https://www.researchgate.net/profile/Alan-Bittles-2/publication/38114212/figure/tbl1/AS:601713388052509@1520471059919/Human-genetic-relationships.png. Does your range overlap with the expected inbreeding coefficients? 25 | 26 | - Try a few simulations of some of the other simple pedigrees and try to see which pedigrees could explain your estimated inbreeding value of 0.062. 27 | - If you infer the inbreeding tracts of your individual, the results will look like this: https://github.com/populationgenetics/exercises/blob/master/QuantitativeGenetics/FforInd.png. Is this consistent with your suggested pedigrees? If not, which other explanations could there be for the estimated F? 28 | 29 | 30 | ## run locally 31 | Open R on your desktop (not the server). Then run 32 | ```R 33 | # install shiny if it is not already installed, then load it 34 | if (!require("shiny")) { install.packages("shiny"); library(shiny) } 35 | runUrl("https://popgen.dk/shiny/anders/popgenCourse/Fgenome.tar.gz") 36 | ``` 37 | 38 | 39 | 40 | 41 | 42 | # Relatedness 43 | 44 | The IBD sharing between two individuals on chromosome 1 is shown here. 45 | 46 |

47 | 48 |
49 |

50 | 51 | 52 | - Assuming no previous inbreeding in the population, what is the only relationship that can produce such IBD patterns? 53 | - Assuming both individuals have a rare recessive disorder, where on the chromosome might the disease gene be located? 54 | 55 | 56 | 57 | - Here is a figure of two distantly related individuals, both with the same rare disorder. They share 0.3% of their genome IBD=1 58 | 59 |

60 | 61 |
62 |

63 | 64 | 65 | - These individuals share two regions IBD. What assumption do we have to make in order to conclude that the disease-causing locus is in one of these regions? 66 | - For relatedness mapping, do you think it is best to have closely or distantly related individuals? 67 | - Try to guess the number of generations that separate these two individuals. 68 | - They are actually separated by 14 generations. Try to see if you can get a similar pattern using simulations https://popgen.dk/shiny/anders/popgen2016/Rgenome/ 69 | - - If slow then run locally (see below) 70 | - What explains the difference between your simulations and the plot above? 71 | 72 | 73 | 74 | ## run locally 75 | Open R or RStudio and then run 76 | ```R 77 | # install shiny if it is not already installed, then load it 78 | if (!require("shiny")) { install.packages("shiny"); library(shiny) } 79 | runUrl("https://popgen.dk/shiny/anders/popgenCourse/Rgenome.tar.gz") 80 | ``` 81 | 82 | -------------------------------------------------------------------------------- /QuantitativeGenetics/QG1_5length.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG1_5length.png -------------------------------------------------------------------------------- /QuantitativeGenetics/QG1assortative.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG1assortative.png -------------------------------------------------------------------------------- /QuantitativeGenetics/QG2Children.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG2Children.png 
-------------------------------------------------------------------------------- /QuantitativeGenetics/QG3Ne.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG3Ne.png -------------------------------------------------------------------------------- /QuantitativeGenetics/QG4heritability_height.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG4heritability_height.png -------------------------------------------------------------------------------- /QuantitativeGenetics/QG5heritability_children.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG5heritability_children.png -------------------------------------------------------------------------------- /QuantitativeGenetics/QG6monogamy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/QG6monogamy.png -------------------------------------------------------------------------------- /QuantitativeGenetics/Rgenome.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/Rgenome.tar.gz -------------------------------------------------------------------------------- /QuantitativeGenetics/fig1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/fig1.png -------------------------------------------------------------------------------- /QuantitativeGenetics/fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/fig2.png -------------------------------------------------------------------------------- /QuantitativeGenetics/im.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/QuantitativeGenetics/im.png -------------------------------------------------------------------------------- /QuantitativeGenetics/z: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to exercises 2 | 3 | This repository contains the exercises for the University of Copenhagen Population Genetics master course. 4 | 5 | -------------------------------------------------------------------------------- /phylogenetics/ExercisesPhylo_2021.md: -------------------------------------------------------------------------------- 1 | # Phylogenomics 2 | 3 | **David A. Duchene** 4 | 5 | 6 | Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology. 
7 | 8 | These exercises focus on the most critical steps of a phylogenomic analysis, with the aim of resolving longstanding questions of the evolution of Australasian marsupials. 9 | 10 |
11 | 12 |
13 | 14 | ### Exercise 1: Sequence alignment 15 | 16 | a) Open MEGA and open the file marsupials.fasta via File > Open A File, and selecting Align. Have a look at the sequences and notice that they have slightly different lengths. 17 | 18 | 19 | **Question 1.** Before we conduct any phylogenetic analyses, we need to align these sequences. What is the purpose of sequence alignment? 20 | 21 |
22 | Answer: 23 | 24 | *The purpose of sequence alignment is to maximise the number of sites that we can infer as including homologous characters (i.e., those that have been inherited from a common ancestor). Sequences are aligned by inserting "gaps" where insertions and deletions are likely to have occurred.* 25 | 26 |
27 | 28 | 29 | b) Try for a few minutes to align these sequences by eye. To do this, use your mouse to select one of the nucleotides. Then you can add "gaps" with your spacebar or use the left and right arrow symbols at the top of the window to move that block of columns left and right. You can also use the keyboard, by using alt + the arrow keys. 30 | 31 | To get you started, try simply adding a single gap to the start of sequences 1, 3, 5, 6, 7, and 10. Notice that several of the sequences now appear to be better aligned, but this is only the case for a few of the successive nucleotides. Also try adding gaps to the start of sequences 4 and 9 until they align with the rest of the data set. 32 | 33 | c) Now create an automated alignment by clicking on the Alignment menu and choosing Align by Muscle, selecting all the sequences. Muscle is one of the two alignment algorithms in MEGA. The other is ClustalW. These two are widely used alignment algorithms. We will use Muscle for this practical. 34 | 35 | Change the “Gap Open” penalty to -1000. This increases the penalty for inserting gaps in the alignment, meaning that Muscle will be less willing to insert gaps to make the sequences better aligned to each other. Click “OK”. 36 | 37 | 38 | **Question 2.** How long is the sequence alignment? 39 | 40 |
41 | Answer: 42 | 43 | *This sequence alignment is 1318 sites long.* 44 | 45 |
46 | 47 | 48 | d) Now run Muscle again, but with a “Gap Open” penalty of -400. 49 | 50 | 51 | **Question 3.** Has the length of the sequence alignment changed, and which of the alignments seems more acceptable? 52 | 53 |
54 | Answer: 55 | 56 | *The length of the alignment has slightly increased to 1321. The reduced penalty for inserting gaps has led to the addition of a few sites overall.* 57 | 58 |
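The effect of the gap penalty can be illustrated outside MEGA with a minimal global alignment sketch. This is a simple Needleman-Wunsch alignment with a linear gap cost, not the algorithm Muscle uses, and the two short sequences and the scoring values are hypothetical choices for illustration only.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty (illustrative only)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical toy sequences, not taken from marsupials.fasta
seq1, seq2 = "acgtt", "aacgt"
for penalty in (-10, -1):
    top, bottom = global_align(seq1, seq2, gap=penalty)
    print(f"gap penalty {penalty}: alignment length {len(top)}")
    print(" ", top)
    print(" ", bottom)
```

With the severe penalty the two sequences are left ungapped, while the milder penalty lets the method insert gaps and the alignment becomes longer. This mirrors the change in alignment length seen between the two Muscle runs.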
59 | 60 | 61 | In this practical we will accept the results of the automated alignment, but standard practice is to inspect the alignment to see if the automated method has done a good job. 62 | 63 | In some cases, visual inspection can reveal sections of the alignment that can be improved. Nowadays, many sequence data sets are far too large for visual inspection to be feasible. As a consequence, there is an increasing reliance on automated methods of identifying misleading sequences and alignment regions. 64 | 65 | 66 | **Question 4.** If one of your sequences had accidentally been shifted to the right by 1 nucleotide (so that the sequence was misaligned by 1 nucleotide compared with the remaining sequences), what would be the consequences for phylogenetic analysis? 67 | 68 |
69 | Answer: 70 | 71 | *This would lead to a very large estimated genetic divergence between this sequence and the rest of the data set. Overall, we would estimate the data set to have a very large substitution rate. The branch leading to that taxon would be estimated to be very long, and the taxon would probably be placed incorrectly as a distant relative of all other taxa.* 72 | 73 |
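A small sketch can show how strongly a one-column shift inflates apparent divergence. The sequence below is a hypothetical example, and the distance used is the simple proportion of differing sites (p-distance), which is an assumption made for illustration rather than the measure MEGA reports.

```python
def p_distance(a, b):
    """Proportion of compared columns at which two aligned sequences differ.

    Columns where either sequence has a gap ("-") are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical sequence for illustration only
reference = "atggccattgtaatgggccgctgaaagggtgcccgatag"
# The same sequence moved one column to the right
shifted = "-" + reference[:-1]

print("identical copy:", p_distance(reference, reference))
print("shifted copy:  ", round(p_distance(reference, shifted), 3))
```

Although the shifted copy contains exactly the same nucleotides, most columns now compare non-homologous positions, so it looks like a highly divergent sequence.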
74 | 75 | 76 | ### Exercise 2: Tree building 77 | 78 | Now that we have a sequence alignment, we are ready to perform phylogenetic analyses. We will perform phylogenetic analysis using maximum likelihood. Our aim will be to make estimates of the tree and branch lengths. This method uses an explicit model of molecular evolution. 79 | 80 | a) From your alignment window, click on the Data tab in the top bar and on Phylogenetic Analysis. Select "Yes" when prompted whether your data are protein-coding. Now go to the main MEGA window. From the Phylogeny menu, select "Construct Maximum Likelihood Tree" and use the active data set. This will bring up a box that contains a range of options for the analysis. 81 | 82 | Check that the options are the following: 83 | Test of Phylogeny: Bootstrap method 84 | No. of Bootstrap Replications: 100 85 | Substitutions Type: Nucleotide 86 | Model/Method: General Time Reversible model 87 | Rates among Sites: Gamma distributed (G) 88 | No of Discrete Gamma Categories: 5 89 | Gaps/Missing Data Treatment: Use all sites 90 | 91 | Click "OK" to start the maximum likelihood phylogenetic analysis, which will take about 5 minutes to complete. 92 | 93 | b) Confirm that the tree is rooted with the Opossum as the sister of all other marsupials. If this is not the case, select the branch of the Opossum and select "Place Root on Branch". 94 | 95 | 96 | **Question 5.** There is a scale bar underneath the tree. What units does it measure? 97 | 98 |
99 | Answer: 100 | 101 | *The scale bar measures expected molecular substitutions per site.* 102 | 103 |
104 | 105 | 106 | **Question 6.** What is the purpose of including an external species, or outgroup (in this case the Opossum), in the analysis? 107 | 108 |
109 | Answer: 110 | 111 | *The inclusion of an external species allows us to infer the position of the root and therefore the order of divergence events in time.* 112 | 113 |
114 | 115 | 116 | **Question 7.** Does it look like this tree has strong statistical support? 117 | 118 |
119 | Answer: 120 | 121 | *Some branches have high support, but most ancient or "deep" branches have very low support. This means that we cannot draw strong conclusions about the early divergence of marsupials from the genomic region analysed.* 122 | 123 |
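The support values come from bootstrap replicates, each of which is built by resampling alignment columns with replacement. The sketch below shows only that resampling step, under the assumption that the alignment is held as a dictionary of equal-length strings; the toy alignment and the function name are hypothetical, and tree inference on each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement.

    `alignment` maps taxon names to aligned sequences of equal length.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Hypothetical toy alignment, not the marsupial data
toy = {
    "taxon_1": "acgtacgtac",
    "taxon_2": "acgaacgtac",
    "taxon_3": "acgaacctac",
}
rng = random.Random(42)  # fixed seed so the example is repeatable
for replicate in range(3):
    print(bootstrap_alignment(toy, rng))
```

A tree would be estimated from each replicate, and the support for a branch is the percentage of replicate trees that contain it. With 100 replicates, as in the settings above, a value of 70 means the branch was recovered in 70 of them.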
124 | 125 | 126 | ### Exercise 3: Substitution models 127 | 128 | To conduct reliable phylogenetic analyses, we need to identify the best-fitting model of nucleotide substitution for the data set. A key purpose of substitution models is to account for multiple substitutions at the same site. 129 | 130 | Perhaps the most widely used substitution model is the General Time Reversible (GTR) model, which allows different rates for different substitution types. For example, it allows A<->G substitutions to occur at a different rate from C<->G substitutions. The model also allows the four nucleotides to have unequal frequencies. 131 | 132 | 133 | **Question 8.** Which is the simplest substitution model, and why? 134 | 135 |
136 | Answer: 137 | 138 | *Jukes-Cantor is the simplest model because it assumes that all substitutions between nucleotides occur at the same rate. Base frequencies are also assumed to be equal.* 139 | 140 |
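Under the Jukes-Cantor model, the correction for multiple substitutions has a closed form: if p is the observed proportion of differing sites, the estimated number of substitutions per site is d = -(3/4) ln(1 - 4p/3). The sketch below evaluates this formula; the p values are arbitrary illustrations, not results from the marsupial alignment.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the observed proportion of differing sites.

    The formula is undefined for p >= 0.75, the level of difference
    expected between unrelated random sequences under this model.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Arbitrary illustrative values of p
for p in (0.01, 0.10, 0.30, 0.60):
    print(f"observed p = {p:.2f} -> JC69 distance = {jc69_distance(p):.3f}")
```

For small p the corrected distance is close to p, but the gap widens as p grows because more and more substitutions are hidden by later changes at the same site.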
141 | 142 | 143 | Variation in evolutionary rate across sites can be modelled using a gamma distribution, which can take a broad range of shapes. The distribution can be described using a single parameter, alpha. When alpha is small (<1), most sites are assumed to be slow-evolving, while a small portion of sites are assumed to be faster evolving (e.g., because they are not under selective constraint). When alpha is large (>1), most sites are taken to evolve at approximately the same rate. 144 | 145 | We can use MEGA to compare different models and select the one that best suits our data. 146 | 147 | a) Return to the main MEGA window. From the Models menu, select "Find Best DNA/Protein Models (ML)" and use the active data. This will bring up a box that contains a set of options. Click "Compute" without making any changes. 148 | 149 | 150 | **Question 9.** Which is the best model as identified by having the lowest AICc and BIC? 151 | 152 |
153 | Answer: 154 | 155 | *The model with the lowest AICc and BIC is the K2+G model. This model allows for different rates for transitions versus transversions, while assuming equal base frequencies. It also takes evolutionary rates across sites to follow a gamma distribution.* 156 | 157 |
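The criteria used to rank models all trade goodness of fit against the number of parameters. A minimal sketch of the standard formulas is given below; the log-likelihoods and parameter counts are hypothetical numbers for illustration, not MEGA output, and using the alignment length as the sample size n is an assumption.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction (requires n > k + 1)."""
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical (log-likelihood, parameter count) pairs for illustration only
candidates = {"simpler_model": (-5230.4, 21), "richer_model": (-5221.9, 29)}
n_sites = 1321  # assumed sample size: the alignment length

for name, (lnl, k) in candidates.items():
    print(f"{name}: AICc = {aicc(lnl, k, n_sites):.1f}, "
          f"BIC = {bic(lnl, k, n_sites):.1f}")
```

Lower values are better for all three criteria. BIC penalises extra parameters more heavily than AICc when n is large, so the two can disagree on the preferred model.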
158 | 159 | 160 | **Question 10.** For the K2+G model, what is the value of the alpha parameter of the gamma distribution? And what does that indicate about the variation in rates across sites in these data? 161 | 162 |
163 | Answer: 164 | 165 | *The alpha parameter is estimated to be 0.56. This is well below 1, and suggests that there is substantial rate variation across sites. This means that many sites are evolving slowly and a few sites are evolving rapidly.* 166 | 167 |
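To get a feel for what alpha means, one can draw relative site rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha) and compare a small and a large alpha. This is a simulation sketch using continuous gamma draws, not the discrete-category approximation MEGA applies; the threshold of 0.1 for a "slow" site and the comparison value of 5 are arbitrary assumptions.

```python
import random


def sample_site_rates(alpha, n_sites, rng):
    """Draw relative rates from a gamma distribution with mean 1."""
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_slow(rates, threshold=0.1):
    """Fraction of sites evolving at less than `threshold` times the mean rate."""
    return sum(r < threshold for r in rates) / len(rates)


rng = random.Random(1)  # fixed seed so the example is repeatable
for alpha in (0.56, 5.0):
    rates = sample_site_rates(alpha, 10_000, rng)
    print(f"alpha = {alpha}: mean rate = {sum(rates) / len(rates):.2f}, "
          f"fraction of slow sites = {fraction_slow(rates):.3f}, "
          f"fastest site = {max(rates):.1f}")
```

With alpha = 0.56 a large share of sites is nearly invariant and a few are many times faster than average, whereas with alpha = 5 almost all sites sit close to the mean rate.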
168 | 169 | 170 | **Question 11.** Are there differences in the results when using the JC versus the K2+G models? What does this say about the data? 171 | 172 |
173 | Answer: 174 | 175 | *While some branches have different lengths, the relationships are identical regardless of the model used for inference. This suggests that the tree signal is robust to the substitution model used.* 176 | 177 |
178 | 179 | 180 | ### Exercise 4: Gene trees 181 | 182 | Population-level processes can cause gene trees to differ from one another. In some cases, this is a dominant form of variation and can be modelled using the multi-species coalescent. 183 | 184 | a) Open FigTree and the marsupials.tree file via File > Open and navigating to the file. This file contains a large number of trees from genes across the genomes of marsupials. The first gene tree will be shown in your window. Scroll across gene trees using the Prev/Next arrows at the top-right of the window. 185 | 186 | 187 | **Question 12.** Some branches seem to be missing or collapsed in some of the trees. What does this indicate about the phylogenetic information in the gene? 188 | 189 |
190 | Answer: 191 | 192 | *These branches are not missing but are extremely short (they effectively have length 0). This suggests that the genes contain very little information about these parts of the tree. One explanation is that these events happened in quick succession, such that no molecular changes occurred.* 193 | 194 |
195 | 196 | 197 | b) In the Layout panel on the left, toggle the tree layout to Radial, and once again explore a few trees. 198 | 199 | 200 | **Question 13.** Which are the most consistent sets of relationships across gene trees? 201 | 202 |
203 | Answer: 204 | 205 | *The koala and wombat are nearly always identified as sister clades, and so are the Tasmanian devil and numbat. The grouping of the two possums and the kangaroo is also very common. Gene trees seem to have substantial incongruence regarding other relationships.* 206 | 207 |
208 | 209 | 210 | **Question 14.** Among which groupings is there apparent incomplete lineage sorting in this tree, and what does that indicate about the ancestral populations of different groups of marsupials? 211 | 212 |
213 | Answer: 214 | 215 | *The relationships among the possums and the kangaroos seem to be affected by incomplete lineage sorting, as are the relationships at the root of the marsupial mole branch. This suggests that there was substantial exchange in the early stages of the diversification of these groupings of marsupials. These divergences were likely very fast events and population sizes relatively large, leading to incomplete lineage sorting and therefore widespread incongruence among gene trees.* 216 | 217 |
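The link between fast divergences, large populations and gene tree incongruence can be made quantitative for the simplest case. Under the multi-species coalescent with three species, the probability that a gene tree disagrees with the species tree is (2/3)e^(-T), where T is the length of the internal branch in coalescent units (generations divided by 2Ne for a diploid population). The sketch below evaluates this; the branch durations and population sizes are hypothetical and are not estimates for marsupials.

```python
import math


def discordance_probability(generations, effective_size):
    """Probability that a three-taxon gene tree conflicts with the species tree.

    The internal branch lasts `generations` generations in a diploid
    population of effective size `effective_size`.
    """
    t = generations / (2 * effective_size)  # branch length in coalescent units
    return (2 / 3) * math.exp(-t)


# Hypothetical scenarios for illustration only
scenarios = [
    ("short branch, large population", 10_000, 100_000),
    ("short branch, small population", 10_000, 5_000),
    ("long branch, large population", 1_000_000, 100_000),
]
for label, generations, ne in scenarios:
    print(f"{label}: P(discordant gene tree) = "
          f"{discordance_probability(generations, ne):.3f}")
```

Short internal branches combined with large effective population sizes push the probability towards its maximum of 2/3, which is the situation in which widespread incongruence among gene trees is expected.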
-------------------------------------------------------------------------------- /phylogenetics/Introductory_chapter_Yang.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/Introductory_chapter_Yang.pdf -------------------------------------------------------------------------------- /phylogenetics/Kapli_et_al_2020.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/Kapli_et_al_2020.pdf -------------------------------------------------------------------------------- /phylogenetics/Yang_and_Ranalla_2012.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/Yang_and_Ranalla_2012.pdf -------------------------------------------------------------------------------- /phylogenetics/marsupials.fasta: -------------------------------------------------------------------------------- 1 | >Brushtail_possum 2 | 
ttgccatatcccttccaccaggtcctcctgaggaaccattgcctgagacacctcaacaacccccagagccaccagctgtctcccaagccccgcctcccctacctcttgcccctcgtcctgaagagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgtcggaaaacagtctccttctctgcatcagaggaaaccccaaccaccccaacccctgaggttcccccacctgttccacctccagctaagcctcctggtccccttccccggaaaatctcccggggtggggatcggaccattagaaatttgcccctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagcggaggtcggggtcgtctgccagaagaagaagaacctggaactgaagtagacttagcagtgttggcagatttggctctgacgccagctcggcgagggctagtcgctctacctgtaggggatgattctgaggccacagaaacctcagatgaggctgagcgtttgggtccagcaggtccctcaatcattcatgttcttctggagcacaactatgctctagctgtccggcctgctccccctgccccagtttccagacccctggatccacttccttctcctgccactgtcttcagttcacctgccgatgaggtcctagaagctcctgaggtggtggttgctgaggtggaggaggaagaagaggaggaggaggaagaggaggaagagtcggaatcatcagagagtagtagtagcagcagtgatggggagggcgccttgaggcgcaggagccttcgatcccatgctaggcgccgtcgacagcccttgccacctgcccctccaccgccacccagctatgagccccgaagtgagtttgagcagatgaccatcttgtatgacatctggaattccgggcttgatgctgaggacatgggctacttacgactcacctatgagaggctgcttcagcaggatggtggagccgactggcttaatgatacccactgggtacatcacaccaatatcctagacctctgtgggtcaggggctagcagatgtgggagctattgtacaaaaagcctcaagtctcacctctccctcttgcgcagtcattttagtaatgatatttctgtgacttcattgtcgagtactccttttttgtctgctacttttcaccttct 3 | >Numbat 4 | 
tttatcatatccctttcaccaggtcctcctgaggagctattacctgagacacctcatcatcccccagagccaccagctgtttcccaaggctctcctcccttacctctggtccctcatcctgaggagtgtccatcttctcctattcctcttctcccaccaccgaaaaagcgtcggaaaacagtctccttctctgcatcagaggaaatgccaactacccccayccctgaagttcccccacctgttgcacctccagttaagcctcctggtcctcttccacggaaaatctcccgaggtggtgataggaccattcgaaatttgcctctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcgggctcgtctgccagaagaagaagaatctgggactgaagtagatcttgcagtgttggcagatctggccctgactccagctccagctcggcgaggactagtcactctgcctgtaggggatgattctgaggccacagaaacctcagatgaggctgagcgtggggggccagcagttccctcaatcattcatgttcttctggagcacaactatgctctggctgtccggcctgctccccctgccccagtgtccagacccttggagccacttccttcccctgccactgttttcagctcacctgccgatgaggtcttagaagctcctgaggtggtggttgctgaggtagaggaggaagaggaggaggaggaagaagaagaggagtcagagtcatcagagagtagtagcarcagcagtgatggagagggcgccttgaggcggaggagccttcgatcccatgccaggcgccgtcgacagcccttgccacctgcccctccacccccacctagctatgagcctcgaagtgagtttgagcagatgaccatcctgtatgatatctggaattctggacttgatgctgaggatatgggttatttacgactcacttatgagagattgctgcaacaggacggtggtgcagactggcttaatgatacccactgggtgcatcacactaatatcctaaactgctgtgggtcagctagcagattgggaaactcctgtaccagaaacttgaagtctaaccttttctccctcttgagtggttcttttagtagttattgtcaagtactcttctttttgtctgctatttttcaccttcgttc 5 | >Tasmanian_devil 6 | 
atattgtatctctttcaccaggtcctcctgaggagccattacctgagacacctcatcatcctccagagccaccacctgtctcccaagcctctcctcccctacctctagttcctcatcctgaggagtgtccatcttctcctatccctcttctcccaccaccgaagaagcgccggaaaacagtctctttctctgcatcagaggaaatgccaactacccccacccctgaagttcccccacctgttgcacctccagttaagcctcctggtccccttccacggaaaatctcccggggtggtgatcggaccattcgaaatttgcctctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcgggctcgtctgccagaagaagaagaatctgggactgaagtagacctcgcagtgttggcagacctggccctgactccagctccagctcggcgagggctagtcactctgcctgtaggggatgattctgaggccacagaaacctcagatgaagctgagcgtggggggccagctgttccctcgatcattcatgttcttctggagcacaactatgctctggctgtccggcctgctccccctgccccagtgtccagacccttggagccacttccttcccctgccactgtcttcagctcacctgccgatgaggtcttagaagctcctgaggtggtggttgctgaggtagaggaggaagaggatgaagaagaggaggaagaggagtcagagtcatcagagagtagtagcagcaccagtgatggagaaggcgccttgaggcgaaggagccttcgatcccatgccaggcgccgtcaacaacccttgccacctgcccctccacccccacctagctatgagccccgaagtgagtttgagcagatgaccatcctttatgacatctggaattctggacttgatgctgaggacatgggttatttacgactcacttatgagagattgctacagcaggacagtggtgctgattggcttaatgatacccactgggtacatcacactaatatcctagactgttgtgggtcaggagctagcagattggggaactactgtaccagaagcttcaagtctaaccttttctcccttttgagtggttcttttagtagttatgcttctatgactccattgtcaagtactcttctttttatctgc 7 | >Marsupial_mole 8 | 
agaggaaccattgcctgagacacctcaacatcttccagaaccaccggctgtctcccaagtctctcctcccctacctcttgcccctcatcctgaggagtgcccatcttctcctatcccacttctcccaccaccgaagaagcgccggaaaacagtctccttctctgcatcagaggaaaccccagctaccccaaccctagactggtcactgaggctctcccacctattccacctccacctaagcctcctggtccccttccccggaaactttctcgttctggggatcggaccattcgaaatttgcccctggaccatgcttctctggtcaagagctggccagaggaaagttgccgaacaggacgaaaccggagtggaggtcggagtcgtttgccagaagaagaagaacatgggactgaagtagacctagcagtgttagcagatctggccctgactccagctcggcgagggctagtcactctacctgttggggatgattctgaggccacagagacttctgatgaggctgagcgtgtagggccagctgttccctccatcattcatgttcttctggaacacaactatgctctggctatccggcctactccccctgccccagtttccagacccctggaaccacttccttccacagccactgtcttcagctcacctgctgatgaggtcctcgaagctcctgaggttgtggttgctgaggtagaggaggaagaggaggaagaggaagaggaatcagagtcatcagagagtagtagcagcagtagtgatggagaaggcactttgaggcggaggagccttcgatcccatgccaggcgccgtcgacagcccrtgccacctttcttaggactccaccaccttccacccagctatgagcctygaagtgagtttgagcagatgaccatcctayatgatatctggaattcygggcttgatgctgaggacrtggrctacttaagacttacctatgagaggstgctgcaacaggatgrtggagctgactggcttaatgatasccactgggtacatcacaccatcaccaacttgaccatcccttcctgccttgttaaggtg 9 | >Monito_del_monte 10 | 
ttaccacatccctctcaccaggtcctcctgaggaaccattgcctgagacccctcagcatcctccagagccagcagctgtctcccaagcttctcctcccttacctcttgcccctcgtcctgaggagtgtccatcttctcctatcccacttctccccccaccgaagaagcgacggaaaacagtctctttctctgcatcagaggaaaccccaaccaccccaacccctgaggttccccaasctgttccacctccagctaagcctcctggtcctcttccccggaaaatctcccggggtggggatcggaccattcgaaatttgcccctggatcatgcttctctggtcaagagctggccggaggatggatcccgggtggggcggaaccggagtggaattcgggggcgtcttccagaagaagaagaacctgggactgaagtagacctagcagtgttggcagatttggccctgactccagctccagctcggcgagggctggtcagtctacccgtaggggatgattctgaggccacagagacctcagatgaggctgagcgtgtggggccagcagttccctcaatcrttcatgttcttctggagcacaactatgctctggctgtccggcccgctcccccagccccagtttccagacccctggagccacttccttcccctgccactgtcttcagctcacctgccgatgaggtcytagaagctcctgaggtggtggtcgctgaggtggaggaggaagaagaggaggaggaggaagaggaagaggagtcagagtcatcagagagtagtagcagcagcagtgatggggagggtgccctgaggcgaaggagccttcgatcccatgccaggcgccgtcgacagcccttgccgcctgcccctccacccccacccagctatgaaccccgaagtgagtttgagcagatgaccatcctgtatgacatctggaattcygggcttgatgctgaggacatgagctacttacgactcacctacgagaggctcctgcagcaggacagtggagctgactggcttaatgatacccactgggtacatcataccaatatcctagacctctgtgggtcaggggctagtagatgtgggagctgctgtgccagatgccttgagtctcaccttctctcctccttgaggtttct 11 | >Opossum 12 | 
ttaccacttcccttttgccaggtcctcctgaggagccactgcccgagacccttcagcatcccccagagccaccagctgtcccccaaacctctccttcccttcctcttgcacctcgtcctgaggagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgccggaaaacagtctccttctctgcatcagaggaaactccaaccaccccaacccctgaggttcccccacctgttccacctccagctaagcctcctggtcccctttcccgaaaaatctcccggggtggggatcggaccattcgaaatctgcccctggaccatgcttctttggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcgaggtcgtctgccagaagaagaagaactagggactgaagtagacctagcagtattggcagatttggccctcactccagctccggctcggcgagggctgatcgctctacctgtaggggatgattctgaggccacagaaacctcagatgaggctgagcgttcgggaccaacagttccctcagtcattcacgttcttctagagcataactatgctttggctgttcgacctgctcccccaactccagtttccagacccttggaatcacttccttctcctgccactgtcttcagctcacctgctgatgaggttctagaagcccctgaggtggtggtcgctgaggtggaggaagaggaagaaggggaggaagaagaagaagaggaggaggagtctgagtcatctgaaagtagcagcagtagcagtgatggggagggagccttgaggcgaaggagccttcgatctcatgccaggcaccgtcgacagcccctgccacctgcccctccacccccacccagctatgagccccgaagtgagtttgagcagatgaccatcctatatgacatctggaattccggacttgatgccgaggatatgggctacttaagactcacatatgagaggctgctgcagcaggacagtggagctgactggcttaatgatacccactgggtccatcacaccaatatcctagatctctatagggcaggggctagcagatgtgggagcaactgtacccaaagccccaaggttcacctttgctccttcttgagagatattgctccattgttagagtctccctcttgtctgctacttttcaccgtcatctttgttcatttcattctttgccta 13 | >Bandicoot 14 | 
ttaccacatcatttttcaacagaaccttctgaggaaccactgcctgagacacctcaacatcccccagagctatcagttatttcccaagcctcttctcctctacctcctgcccctcgtcctgaggagtgtccatcttctcctatcccattgctcccaccaccgaagaagcgccggaaaacagtctctttctctgcatcagaggaaaccccaaccaccccaacccctgaggttgccccacctgttccacctccagctaagccttctggtccccttccccggaaaatctcccggggtggggatcgaaccattcgaaatttgcccctggaccatgcttctctggtcaagagctggccagaggatggatgccggacagggcgaaaccgaagtggaggtcggggtcgcctgccagaagaagaagaacctgggactgaagtagacctaacagtgttggcagatttggccttgactccagctaggcgagggctagtcactctacctgtaggggatgattctgaagccacagagacttctgatgaggctgagcgtgtgggatcagcagttccctcaatcgttcatgttctccaggaacacaactatgctctggctgtccgacctgctccccctgctccagtttccagacccctggaaccacttccttctcctgccactgtcttcagctcacctgcagatgaggtcctagaagctcctgaggtggtagttgctgaggtggaggaagaggaggaggaggaggaggaggaggaggaggaggaggaagaggaagaggagtcagagtcatcagaaagtagtagtagcagcagtgatggagagggtgtcttgagacgtaggagccttcgatcccataccaggcgtcgtcgacaacccatgccacctgcccctccgcctccacccagctatgagcccagaagtgaatttgagcagatgaccatcctgtatgacatctggaattctgggcttgatgctgaggacatgagctacttacgacttacctatgagaggttgctgcagcaggatggtggagccgattggcttaatgatacccactgggtgcatcacaccaatatcctagacctctgtgggtcagggactagcagttgtgggagctactgcatcagaagcttcaagtctcaccttctc 15 | >Koala 16 | 
tttaccacatccttttcaccaggtcctcctgaggaaccattgcctgagacacctcaacatcctccagagccaccagctatctcccaagcctctcctcccctaccttttgcccctcgtcctgaggagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgtcggaaaacagtctccttctctgcatcagaggaaaccccaaccaccccaaccccagaggttcccccacctgttccacctccagttaagcctcctggtccccttccccgaaaaatctcccgaggtggggatcggaccattcgaaatttgcctctggatcatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggagggcggggtcgtctgccagaagaagaagaacctgggaccgaagttgacctcacagtgttggcagatttggccctgactccagctccagctcggcgaggcctagtcactctacctgtaggggatgattctgaagccactgagacctcggatgaggctgagcgtttgggtctagcaggtccctcaatcattcatgttcttttggagcacaactatgctctggctgtccggcctgctccccctgccccagtttccagacccctggaaccacttctttctcctgccactgtcttcagctcacctgccgatgaggtcctagaagctcctgaggtggtggttgctgaggtggaggaggaggaaggggaggaggaggaagaggaggaagagtcggagtcatcagagagtagtagcagcagcagtgatggggaaggtaccttgaggcgtaggagcctccgatcccatgccaggcgccgtcgacagcccttgccacctgcccctccacctccacccagctatgagccccgaagtgagtttgagcagatgaccatcctgtatgacatctggaattctgggcttgatgctgaggacatgggctacttacgactcacctacgagaggctgctgcagcaggacggtggagccgactggcttaatgatacccactgggtacatcacaccaatatcctagacctctgtgggttgggctaccagatgtgggagctgctgtaccagaagcctcaagtctcacttctctctctctctctctctctccctttctctttctagcagtcattttactaataatatttctgtgactcagttgttgagtactctttttttctctgctacttttcaccttcatt 17 | >Wombat 18 | 
cctttcaccaggtcctcctgaagaaccattgcctgagacacctcaacatcctccagagccaccagctgtctcccaagcctctcctcccctacctcttgcccctcgtcctgaggagtgtccatcttctcctgtcccacttctcccaccaccgaaaaagcgtcggaaaacagtctccttctctgcatcagaggaatccccaacccccccaacccccgaggttcccccacctgtgccacctccacctaagcctcctggtccccttccccgaaaaatctcccgaggtggggatcggaccattcgaaatttgcccctggatcatgcttctctggtcaagagttggccagaggatggatcccgggtggggcggaaccggagtggaggtcggggtcgtctgccggaagaagaagaacccgggactgaagtagacctagcagtgttggcagatttggccctgactccagctccggctcggcgagggctagtcactctacctgtaggggatgattctgaggccacagagacctcggatgaggctgagcgtttgggtccagcaggtccctcaatcgttcatgttcttctggagcacaactatgctctggctgtccggcctgttccccctgctccagtttccagacccgtggaacaacttccttcccctgccactgtcttcagctcacctgccgatgaggtcctagaagctcctgaggtggtggttgctgaggtggaggaggaagaggaggaggaggaggaggaagagtcagagtcatcagagagtagtagcagcagcagtgatggggagggcaccttgaggcgtaggagcctccgatcccatgcccggcgccgtcgacagccattgccacctgcccctccgcctccacccagctatgagccccgaagtgagtttgagcagatgaccatcctgtatgacatctggaattctgggcttgatgctgaggacatgggctacttacgactcacctacgagaggctgttgcagcaggacagtggagccgactggcttaatgatacccactgggtacatcacaccaatatcctggacctctctgggttgggctagcagatgtgggagctactgtaccagaagcctcaagtcccccccccacacactctccctctctctctttttcacagtcatttcactaatgatttttctgacttcattgttaagtattcttttttttgtctgctacttttcaccttcgttcc 19 | >Kangaroo 20 | 
ttactacatccctttcaccaggtcctcctgaggaactattgcctgagacacctcaacatcccccagagccaccagctgtctcccaaacctctcctcccctacctcttgcccctcgtcctgaagagtgtccatcttctcctatcccacttctcccaccaccgaagaagcggcggaaaacagtctccttctctgcatcagaggaaaccccaaccaccccaactcctgaggttccaccacctgttccacctccagctaagcctattggtccccttccccggaaaatctcccggggtggggatcggaccattcgaaatttgcccctggaccatgcttctctggtcaagagctggccagaggatggatcccgggcagggcggaaccggagtggaggtcggagtcgtctgccagaagaagaagaacctgggactgaagtagacctggcagtcttggcagatttggccctgactccagctcggcgagggctggtcgctctacccataggagatgattctgaggccacagagacctcagacgaggccgagcgtttgggtccagcaggtccctcaattgttcatgttcttctggagcacaactatgctctggctgtccggcccgctccccctgctccagtttccagatccctggacccacttccttcccctgccactgtcttcagctcacctgccgatgaggtcctagaagcccctgaggtggtggttgctgaggtggaggaggaagaggaggaggaggaagaggaagaagagtctgaatcatcagagagtagtagtagcagcagtgatggggagggtgccttgaggcggaggagccttcgatcccatgccaagcgccgtcgacagcccttgccccctgcccctccacccccacccagttatgagcctcgaagtgagtttgagcagatgaccatcttgtatgacatctggaattctggacttgatgctgaggacatgggctacttacgactcacctatgagaggctgctacagcaggatggtggagctgactggctcaatgatacccactgggtccatcacaccaatatcctagacctctgtgcgcatgtgggagctattgtaccagaagcgtcaagtctcacctctccttcttaatcagtcattttagtaatgatatttctgtgagtccattgttgagtactttttttttgtctgccacttttcaccttcattcttgt 21 | >Ringtail_possum 22 | 
tttaccacatccttttcactaggtcctcctgaggaaccattgcctgagacacctcagcatcccccagagccaccagctgtctctcaagcctctcctcccctacctcttgccccccgtcctgaagagtgtccatcttctcctatcccacttctcccaccaccgaagaagcgtcggaaaacagtctcctcctctgtatcagaggaaatcccaatcaccccaacccctgaggttcccccacctgttccacctccagctaagccttctggtccccttccccggaaaatctcccggggtggggatcggaccattcgaaatctgcccctggaccatgcttctctggtcaagagctggccagaggatggatcccgggtggggcggaaccggagtggaggtcggggtcgtctgccagaagaagaagaacctgggactgaagtagacctagcagtgttggcagatttggctctgactccagctcgaagagggctgatcactgtacccgtaggggatgattctgaggccacggagacctcggacgaggctgagcggtcggctccggcaggtccctcaatcattcatgttcttctggagcacaactatgctctgtctgtccgacctgcaccccctgccccagtttccagacccctggacctacttccttcccctgccactgtcttcagctcacctgccgatgaggtcctagaagctcctgaggtggtggtcgctgaggtggaggaagaagaagaggaggaggaggaagaggaagaagagtcrgaatcctcagagagtagtagcagcagcagcagcgacggggagggcgccttgaggaggagaagccttcgatcccatgctaggcgccgtcgacagcccttgccacctgcccctccacccccacccagctatgagccccggagtgagtttgagcagatgaccatcctgtacgacatctggaattctgggcttgatgctgaggacatgggctacttgcgactcacctacgagaggctgctgcagcaggatggtggagccgactggctgaatgatacccactgggtgcatcacaccaatatcctagagctttgtgggtcaggggctggcagatgtgggagctgttgtaccagaagcttcaagtctcacctcttcctcttgagcattcattttaataatgatatttctgtgacctcattgtcaagtacactttttttgtctgctatttgtcaccttcgttcttc -------------------------------------------------------------------------------- /phylogenetics/marsupials.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/populationgenetics/exercises/755d3645a55bf744cf87dc2406991fd1d437d45d/phylogenetics/marsupials.png --------------------------------------------------------------------------------