├── LICENSE
├── README.md
├── setup.py
└── src
    ├── bcf_vcf.py
    ├── helper_functions.py
    ├── hmm_functions.py
    ├── main.py
    ├── make_mutationrate.py
    └── make_test_data.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 LauritsSkov
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Introgression detection (hmmix)
2 |
3 | ## Contact
4 |
5 |
6 |
7 | ---
8 |
9 | ## Helpful files
10 |
11 | The outgroup files, mutation rate files, reference genomes, ancestral allele files and callability files are now premade!
12 |
13 | (hg19 and hg38)
14 |
15 | Using these files I have already called archaic segments in the 1000 genomes and HGDP datasets (hg38 reference coordinate system)
16 |
17 | (archaic introgression callsets for HGDP and 1000 genomes in hg38)
18 |
19 | VCF file containing 4 high-coverage archaic genomes (Altai, Vindija and Chagyrskaya Neanderthals and Denisovan) here:
20 |
21 | (hg19)
22 |
23 | (hg38)
24 |
25 | ---
26 |
27 | If you are working with archaic introgression into present-day humans of non-African ancestry you can use these files and skip the following steps:
28 | Find derived variants in outgroup and Estimate local mutation rate.
29 |
30 | These are the scripts needed to infer archaic introgression in modern populations using an unadmixed outgroup.
31 |
32 | 1. [Installation](#installation)
33 | 2. [Usage](#usage)
34 | 3. [Quick tutorial](#quick-tutorial)
35 | 4. [1000 genomes tutorial](#example-with-1000-genomes-data)
36 |    - [Get data](#getting-data)
37 |    - [Find derived variants in outgroup](#finding-snps-which-are-derived-in-the-outgroup)
38 |    - [Estimate local mutation rate](#estimating-mutation-rate-across-genome)
39 |    - [Find variants in ingroup](#find-a-set-of-variants-which-are-not-derived-in-the-outgroup)
40 |    - [Train the HMM](#training)
41 |    - [Decoding](#decoding)
42 |    - [Phased data](#training-and-decoding-with-phased-data)
43 |    - [Annotate](#annotate-with-known-admixing-population)
44 | 5.
[Run in python](#annotate-with-known-admixing-population)
45 |
46 | ---
47 |
48 | ## Installation
49 |
50 | Run the following command to install:
51 |
52 | ```bash
53 | pip install hmmix
54 | ```
55 |
56 | If you want to work with bcf/vcf files you should also install vcftools and bcftools. You can either use conda or visit their websites.
57 |
58 | ```bash
59 | conda install -c bioconda vcftools bcftools
60 | ```
61 |
62 | ![Overview of model](https://user-images.githubusercontent.com/30321818/43464826-4d11d46c-94dc-11e8-8f1a-6851aa5d9125.jpg)
63 |
64 | The model works by removing variation found in an outgroup population and then using the remaining variants to group the genome into regions of different variant density. If the model works well we would expect that introgressed regions have a higher variant density than non-introgressed regions - because they have spent more time accumulating variation that is not found in the outgroup.
65 |
66 | An example on simulated data is provided below:
67 |
68 | ![het_vs_archaic](https://user-images.githubusercontent.com/30321818/46877046-217eff80-ce40-11e8-9010-edb544e3e1ee.png)
69 |
70 | In this example we zoom in on 1 Mb of simulated data for a haploid genome. The top panel shows the coalescence times with the outgroup across the region, and the green segment is an archaic introgressed segment. Notice how much deeper the coalescence time with the outgroup is. The second panel shows the probability of being in the archaic state. We can see that the probability is much higher in the archaic segment, demonstrating that in this toy example the model is working like we would hope. The next panel is the SNP density if you don't remove the SNPs found in the outgroup. By looking at this one can't tell where the archaic segment begins and ends, or even if there is one. The bottom panel is the SNP density when all variation in the outgroup is removed. Notice that now it is much clearer where the archaic segment begins and ends!
71 |
72 | The method is published in PLOS Genetics and can be found here: [Detecting archaic introgression using an unadmixed outgroup](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007641). The paper describes and evaluates the method.
73 |
74 | ---
75 |
76 | ## Usage
77 |
78 | ```note
79 | Script for identifying introgressed archaic segments
80 |
81 | > Tutorial:
82 | hmmix make_test_data
83 | hmmix train -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json
84 | hmmix decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json
85 |
86 |
87 | Different modes (you can also see the options for each by writing hmmix make_test_data -h):
88 | > make_test_data
89 | -windows Number of Kb windows to create (defaults to 50,000 per chromosome)
90 | -chromosomes Number of chromosomes to simulate (defaults to 2)
91 | -nooutfiles Don't create obs.txt, mutrates.bed, weights.bed, Initialguesses.json, simulated_segments.txt (defaults to yes)
92 | -param markov parameters file (default is human/neanderthal like parameters)
93 |
94 | > mutation_rate
95 | -outgroup [required] path to variants found in outgroup
96 | -out outputfile (defaults to mutationrate.bed)
97 | -weights file with callability (defaults to all positions being called)
98 | -window_size size of bins (defaults to 1 Mb)
99 |
100 | > create_outgroup
101 | -ind [required] ingroup/outgroup list (json file) or comma-separated list e.g.
ind1,ind2
102 | -vcf [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
103 | -weights file with callability (defaults to all positions being called)
104 | -out outputfile (defaults to stdout)
105 | -ancestral fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
106 | -refgenome fasta file with reference genome - comma-separated list or wildcards like vcf argument (default none)
107 |
108 | > create_ingroup
109 | -ind [required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2
110 | -vcf [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
111 | -outgroup [required] path to variants found in outgroup
112 | -weights file with callability (defaults to all positions being called)
113 | -out outputfile prefix (default is a file named obs.<ind>.txt where <ind> is the name of the individual in the ingroup/outgroup list)
114 | -ancestral fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
115 |
116 | > train
117 | -obs [required] file with observation data
118 | -chrom Subset to chromosome or comma separated list of chromosomes e.g. chr1 or chr1,chr2,chr3 (default is to use all chromosomes)
119 | -weights file with callability (defaults to all positions being called)
120 | -mutrates file with mutation rates (default is mutation rate is uniform)
121 | -param markov parameters file (default is human/neanderthal like parameters)
122 | -out outputfile (default is a file named trained.json)
123 | -window_size size of bins (default is 1000 bp)
124 | -haploid Change from using diploid data to haploid data (default is diploid)
125 |
126 | > decode
127 | -obs [required] file with observation data
128 | -chrom Subset to chromosome or comma separated list of chromosomes e.g. chr1 or chr1,chr2,chr3 (default is to use all chromosomes)
129 | -weights file with callability (defaults to all positions being called)
130 | -mutrates file with mutation rates (default is mutation rate is uniform)
131 | -param markov parameters file (default is human/neanderthal like parameters)
132 | -out outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)
133 | -window_size size of bins (default is 1000 bp)
134 | -haploid Change from using diploid data to haploid data (default is diploid)
135 | -admixpop Annotate using vcffile with admixing population (default is none)
136 | -extrainfo Add variant position for each SNP (default is off)
137 | -viterbi decode using the viterbi algorithm (default is posterior decoding)
138 |
139 | > inhomogeneous
140 | -obs [required] file with observation data
141 | -chrom Subset to chromosome or comma separated list of chromosomes e.g. chr1 or chr1,chr2,chr3 (default is to use all chromosomes)
142 | -weights file with callability (defaults to all positions being called)
143 | -mutrates file with mutation rates (default is mutation rate is uniform)
144 | -param markov parameters file (default is human/neanderthal like parameters)
145 | -out outputfile prefix .hap1_sim(0-n).txt and .hap2_sim(0-n).txt if -haploid option is used or .diploid_(0-n).txt (default is stdout)
146 | -window_size size of bins (default is 1000 bp)
147 | -haploid Change from using diploid data to haploid data (default is diploid)
148 | -samples Number of simulated paths for the inhomogeneous markov chain (default is 100)
149 | -admixpop Annotate using vcffile with admixing population
(default is none)
150 | -extrainfo Add variant position for each SNP (default is off)
151 | ```
152 |
153 | ---
154 |
155 | ## Quick tutorial
156 |
157 | Here is how we can simulate test data using hmmix. Let's make some test data and start using the program.
158 |
159 | ```note
160 | > hmmix make_test_data
161 | > creating 2 chromosomes each with 50000 kb of test data with the following parameters..
162 | > hmm parameters file: None
163 | > state_names = ['Human', 'Archaic']
164 | > starting_probabilities = [0.98, 0.02]
165 | > transitions = [[1.0, 0.0], [0.02, 0.98]]
166 | > emissions = [0.04, 0.4]
167 | > Seed is 42
168 | ```
169 |
170 | This will generate 5 files: obs.txt, weights.bed, mutrates.bed, simulated_segments.txt and Initialguesses.json. obs.txt contains the mutations that are left after removing variants which are found in the outgroup.
171 |
172 | ```note
173 | chrom pos ancestral_base genotype
174 | chr1 17102 C CT
175 | chr1 34435 C CT
176 | chr1 69860 T TA
177 | chr1 122270 C CA
178 | chr1 181106 G GC
179 | chr1 218071 A AC
180 | chr1 220700 T TG
181 | chr1 231020 A AG
182 | chr1 235614 T TG
183 | ```
184 |
185 | weights.bed. These are the parts of the genome that we can accurately map to - in this case we have simulated the data and can accurately access the entire genome.
186 |
187 | ```note
188 | chr1 0 50000000
189 | chr2 0 50000000
190 | ```
191 |
192 | mutrates.bed. This is the normalized mutation rate across the genome.
193 |
194 | ```note
195 | chr1 0 50000000 1
196 | chr2 0 50000000 1
197 | ```
198 |
199 | Initialguesses.json. These are our initial guesses when training the model - note they are different from the values we simulated from.
200 |
201 | ```json
202 | {
203 |   "state_names": ["Human","Archaic"],
204 |   "starting_probabilities": [0.5,0.5],
205 |   "transitions": [[0.99,0.01],[0.02,0.98]],
206 |   "emissions": [0.03,0.3]
207 | }
208 | ```
209 |
210 | The simulated_segments.txt file contains the simulated states which generated the data (you can compare this to the decoded results later and see that it matches).
211 |
212 | ```note
213 | chrom start end length state
214 | chr1 0 22980000 22980000 Human
215 | chr1 22980000 23071000 91000 Archaic
216 | chr1 23071000 43905000 20834000 Human
217 | chr1 43905000 43911000 6000 Archaic
218 | chr1 43911000 47419000 3508000 Human
219 | chr1 47419000 47443000 24000 Archaic
220 | chr1 47443000 50000000 2557000 Human
221 | chr2 0 16378000 16378000 Human
222 | chr2 16378000 16492000 114000 Archaic
223 | chr2 16492000 19478000 2986000 Human
224 | chr2 19478000 19512000 34000 Archaic
225 | chr2 19512000 37728000 18216000 Human
226 | chr2 37728000 37751000 23000 Archaic
227 | chr2 37751000 46777000 9026000 Human
228 | chr2 46777000 46791000 14000 Archaic
229 | chr2 46791000 50000000 3209000 Human
230 | ```
231 |
232 | We can find the best fitting parameters using Baum-Welch training. Here is how you use it - note you can try to omit the weights and mutrates arguments, since this is simulated data the mutation rate is constant across the genome and we can access the entire genome. Also notice how the parameters approach the parameters the data was generated from (hooray!).
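As a side note, here is a minimal sketch of the quantity that Baum-Welch training optimizes: the likelihood of the windowed SNP counts under a two-state hidden Markov model. This is not the hmmix source code (the real model also scales the emissions by callability and local mutation rate); it is just a scaled forward pass with Poisson emissions, using the parameter values from Initialguesses.json:

```python
import numpy as np
from scipy.stats import poisson

# initial guesses copied from Initialguesses.json
starting_probabilities = np.array([0.5, 0.5])          # Human, Archaic
transitions = np.array([[0.99, 0.01], [0.02, 0.98]])   # state switching probabilities
emissions = np.array([0.03, 0.3])                      # mean number of snps per 1 kb window

def forward_loglikelihood(snp_counts):
    # probability of each window's snp count under each state (one row per window)
    probs = poisson.pmf(snp_counts[:, None], emissions)
    alpha = starting_probabilities * probs[0]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()                        # rescale to avoid underflow
    for p in probs[1:]:
        alpha = (alpha @ transitions) * p
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

# toy example: ten 1 kb windows with one snp-dense stretch in the middle
print(forward_loglikelihood(np.array([0, 1, 0, 0, 4, 5, 3, 0, 0, 1])))
```

Each Baum-Welch iteration runs a forward pass like this (plus a matching backward pass) and updates the parameters, so the loglikelihood column in the output below increases until it converges. The actual training run looks like this: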
233 |
234 | ```note
235 | > hmmix train -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json
236 | ----------------------------------------
237 | > state_names = ['Human', 'Archaic']
238 | > starting_probabilities = [0.5, 0.5]
239 | > transitions = [[0.99, 0.01], [0.02, 0.98]]
240 | > emissions = [0.03, 0.3]
241 | > chromosomes to use: All
242 | > number of windows: 100000. Number of snps = 4116
243 | > total callability: 100000000 bp (100.0 %)
244 | > average mutation rate per bin: 1.0
245 | > Output is trained.json
246 | > Window size is 1000 bp
247 | > Haploid False
248 | ----------------------------------------
249 | iteration loglikelihood start1 start2 emis1 emis2 trans1_1 trans2_2
250 | 0 -17905.0945 0.5 0.5 0.03 0.3 0.99 0.98
251 | 1 -17259.7101 0.96 0.04 0.0346 0.2009 0.9968 0.9217
252 | 2 -17244.4109 0.969 0.031 0.0365 0.1861 0.9971 0.9105
253 | ...
254 | 29 -17196.1361 0.997 0.003 0.04 0.4477 0.9999 0.9802
255 | 30 -17196.1324 0.997 0.003 0.04 0.4482 0.9999 0.9806
256 | 31 -17196.1316 0.997 0.003 0.04 0.4485 0.9999 0.9808
257 |
258 |
259 | # run without mutrate and weights (only do this for simulated data)
260 | > hmmix train -obs=obs.txt -param=Initialguesses.json -out=trained.json
261 | ```
262 |
263 | We can now decode the data with the best parameters that maximize the likelihood and find the archaic segments. Please note it is the weights file that determines the ends of the chromosomes. If you do not provide a weights file then the last window will be the last window with a SNP, because hmmix then uses the position of the last SNP to determine the length of the chromosome. So using the test data above the decoded output would end at window 49,985,000 for chromosome 1 and 49,997,000 for chromosome 2, since the last SNP on chromosome 1 is at 49,984,119 and the last SNP on chromosome 2 is at 49,996,253.
264 |
265 | ```note
266 | > hmmix decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json
267 | ----------------------------------------
268 | > state_names = ['Human', 'Archaic']
269 | > starting_probabilities = [0.997, 0.003]
270 | > transitions = [[1.0, 0.0], [0.019, 0.981]]
271 | > emissions = [0.04, 0.449]
272 | > chromosomes to use: All
273 | > number of windows: 100000. Number of snps = 4116
274 | > total callability: 100000000 bp (100.0 %)
275 | > average mutation rate per bin: 1.0
276 | > Output prefix is /dev/stdout
277 | > Window size is 1000 bp
278 | > Haploid False
279 | > Decode with posterior decoding
280 | ----------------------------------------
281 | chrom start end length state mean_prob snps
282 | chr1 0 22979000 22979000 Human 0.99989 903
283 | chr1 22979000 23066000 87000 Archaic 0.96368 32
284 | chr1 23066000 47418000 24352000 Human 0.99975 935
285 | chr1 47418000 47443000 25000 Archaic 0.88235 10
286 | chr1 47443000 50000000 2557000 Human 0.99934 96
287 | chr2 0 16381000 16381000 Human 0.99981 653
288 | chr2 16381000 16492000 111000 Archaic 0.99166 60
289 | chr2 16492000 19478000 2986000 Human 0.99883 134
290 | chr2 19478000 19512000 34000 Archaic 0.96454 18
291 | chr2 19512000 50000000 30488000 Human 0.99981 1275
292 |
293 | ```
294 |
295 | ---
296 |
297 | ## Example with 1000 genomes data
298 |
299 | ---
300 |
301 | The whole pipeline we will run looks like this.
In the following section we will go through all the steps along the way.
302 |
303 | NOTE: The outgroup files, mutation rate files, reference genomes, ancestral allele files and callability files are now premade!
304 | They can be downloaded in hg38 and hg19 here:
305 |
306 | But keep reading along if you want to know HOW the files were generated! Another important thing to note is that hmmix relies on VCFtools, which only supports VCF files up to format version 4.2 - so if you have VCF files in version 4.3 you will need to change this in your header!
307 |
308 | ```note
309 | hmmix create_outgroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa -refgenome=hg19_refgenome/*fa
310 | hmmix mutation_rate -outgroup=outgroup.txt -weights=strickmask.bed -window_size=1000000 -out mutationrate.bed
311 | hmmix create_ingroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa
312 | hmmix train -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.json
313 | hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json
314 | ```
315 |
316 | ### Getting data
317 |
318 | I thought it would be nice to have an entirely reproducible example of how to use this model - from a common starting point such as a VCF file (well, a BCF file in this case) to the final output. The reason for using BCF files is that it is MUCH faster to extract data for each individual. You can convert a vcf file to a bcf file like this:
319 |
320 | ```note
321 | bcftools view file.vcf -l 1 -O b > file.bcf
322 | bcftools index file.bcf
323 | ```
324 |
325 | In this example I will analyse an individual (HG00096) from the 1000 genomes project phase 3. All analyses were run on my Lenovo ThinkPad (8th gen) laptop, so it should run on yours too!
326 |
327 | First we will need to know 1) which bases can be called in the genome and 2) which variants are found in the outgroup. So let's start out by downloading the files from the following directories.
328 | To download callability regions, ancestral allele information and ingroup/outgroup information use these commands:
329 |
330 | ```bash
331 | # bcffiles (hg19)
332 | ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/
333 |
334 | # callability (remember to remove chr in the beginning of each line to make it compatible with hg19 e.g. chr1 > 1)
335 | ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed
336 | sed 's/^chr\|%$//g' 20141020.strict_mask.whole_genome.bed | awk '{print $1"\t"$2"\t"$3}' | grep -v Y > strickmask.bed
337 |
338 | # outgroup information
339 | ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
340 |
341 | # Ancestral information
342 | ftp://ftp.ensembl.org/pub/release-74/fasta/ancestral_alleles/hg19_ancestral.tar.bz2
343 |
344 | # Reference genome
345 | wget 'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz' -O chromFa.tar.gz
346 |
347 | # Archaic variants (Altai, Vindija, Chagyrskaya and Denisova in hg19)
348 | https://zenodo.org/records/7246376
349 |
350 | ```
351 |
352 | For this example we will use all individuals from 'YRI', 'MSL' and 'ESN' as outgroup individuals - one way of building the individuals file from the panel file is sketched below.
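This small helper is not part of hmmix; it assumes the panel file downloaded above is whitespace-separated with the sample name and population in the first two columns:

```python
import json

# populations used as outgroup in this example; the samples we want to decode go in the ingroup
outgroup_pops = {'YRI', 'MSL', 'ESN'}
ingroup_samples = {'HG00096', 'HG00097'}

ingroup, outgroup = [], []
with open('integrated_call_samples_v3.20130502.ALL.panel') as data:
    for line in data:
        if line.startswith('sample'):   # skip the header line if present
            continue
        sample, population = line.split()[0:2]
        if population in outgroup_pops:
            outgroup.append(sample)
        elif sample in ingroup_samples:
            ingroup.append(sample)

with open('individuals.json', 'w') as out:
    json.dump({'ingroup': sorted(ingroup), 'outgroup': sorted(outgroup)}, out, indent=2)
```

The resulting file has the structure shown below.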
While we will only be decoding HG00096 in this example, you can add as many individuals as you want to the ingroup.
353 |
354 | ```json
355 | {
356 |   "ingroup": [
357 |     "HG00096",
358 |     "HG00097"
359 |   ],
360 |   "outgroup": [
361 |     "HG02922",
362 |     "HG02923",
363 | ...
364 |     "HG02944",
365 |     "HG02946"]
366 | }
367 | ```
368 |
369 | ---
370 |
371 | ### Finding snps which are derived in the outgroup
372 |
373 | First we need to find a set of variants found in the outgroup. We can use the wildcard character to loop through all bcf files. It is best if you have files with the ancestral alleles (in FASTA format) and the reference genome (in FASTA format), but the program will run without them.
374 |
375 | Something to note is that if you use an outgroup vcffile (like 1000 genomes) and an ingroup vcf file from a different dataset (like SGDP) there is an edge case which could occur. There could be recurrent mutations where every individual in 1000 genomes has the derived variant while in one individual in SGDP the derived variant has mutated back to the ancestral allele. This means that this position will not be present in the outgroup file. However if a recurrent mutation occurs it will look like multiple individuals in the ingroup file have the mutation. This does not happen often, but it is why I recommend having files with the ancestral allele and reference genome information.
376 |
377 | ```note
378 | # Recommended usage (if you want to remove sites which are fixed derived in your outgroup/ingroup). This is the file from zenodo.
379 | (took two hours) > hmmix create_outgroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa -refgenome=hg19_refgenome/*fa
380 | ----------------------------------------
381 | > Outgroup individuals: 292
382 | > Using vcf and ancestral files
383 | vcffile: chr1.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_1.fa reffile: hg19_refgenome/chr1.fa
384 | vcffile: chr2.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_2.fa reffile: hg19_refgenome/chr2.fa
385 | vcffile: chr3.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_3.fa reffile: hg19_refgenome/chr3.fa
386 | vcffile: chr4.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_4.fa reffile: hg19_refgenome/chr4.fa
387 | vcffile: chr5.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_5.fa reffile: hg19_refgenome/chr5.fa
388 | vcffile: chr6.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_6.fa reffile: hg19_refgenome/chr6.fa
389 | vcffile: chr7.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_7.fa reffile: hg19_refgenome/chr7.fa
390 | vcffile: chr8.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_8.fa reffile: hg19_refgenome/chr8.fa
391 | vcffile: chr9.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_9.fa reffile: hg19_refgenome/chr9.fa
392 | vcffile: chr10.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_10.fa reffile: hg19_refgenome/chr10.fa
393 | vcffile: chr11.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_11.fa reffile: hg19_refgenome/chr11.fa
394 | vcffile: chr12.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_12.fa reffile: hg19_refgenome/chr12.fa
395 | vcffile: chr13.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_13.fa reffile: hg19_refgenome/chr13.fa
396 | vcffile: chr14.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_14.fa reffile: hg19_refgenome/chr14.fa
397 | vcffile: chr15.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_15.fa reffile: hg19_refgenome/chr15.fa
398 | vcffile: chr16.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_16.fa reffile: hg19_refgenome/chr16.fa
399 | vcffile: chr17.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_17.fa reffile: hg19_refgenome/chr17.fa
400 | vcffile: chr18.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_18.fa reffile: hg19_refgenome/chr18.fa
401 | vcffile: chr19.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_19.fa reffile: hg19_refgenome/chr19.fa
402 | vcffile: chr20.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_20.fa reffile: hg19_refgenome/chr20.fa
403 | vcffile: chr21.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_21.fa reffile: hg19_refgenome/chr21.fa
404 | vcffile: chr22.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_22.fa reffile: hg19_refgenome/chr22.fa
405 | vcffile: chrX.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_X.fa reffile: hg19_refgenome/chrX.fa
406 |
407 | > Callability file: strickmask.bed
408 | > Writing output to: outgroup.txt
409 | ----------------------------------------
410 | ```
411 |
412 | Here it is important to check that hmmix matches up the reference, ancestral and vcf files correctly, e.g. chr1.bcf should pair with hg19_ancestral/homo_sapiens_ancestor_1.fa and hg19_refgenome/chr1.fa. If you see an issue here it's better to give the files as comma-separated values.
413 |
414 | ```note
415 | hmmix create_outgroup -ind=individuals.json \
416 | -vcf=chr1.bcf,chr2.bcf,chr3.bcf,chr4.bcf,chr5.bcf,chr6.bcf,chr7.bcf,chr8.bcf,chr9.bcf,chr10.bcf,chr11.bcf,chr12.bcf,chr13.bcf,chr14.bcf,chr15.bcf,chr16.bcf,chr17.bcf,chr18.bcf,chr19.bcf,chr20.bcf,chr21.bcf,chr22.bcf,chrX.bcf \
417 | -ancestral=hg19_ancestral/homo_sapiens_ancestor_1.fa,hg19_ancestral/homo_sapiens_ancestor_2.fa,hg19_ancestral/homo_sapiens_ancestor_3.fa,hg19_ancestral/homo_sapiens_ancestor_4.fa,hg19_ancestral/homo_sapiens_ancestor_5.fa,hg19_ancestral/homo_sapiens_ancestor_6.fa,hg19_ancestral/homo_sapiens_ancestor_7.fa,hg19_ancestral/homo_sapiens_ancestor_8.fa,hg19_ancestral/homo_sapiens_ancestor_9.fa,hg19_ancestral/homo_sapiens_ancestor_10.fa,hg19_ancestral/homo_sapiens_ancestor_11.fa,hg19_ancestral/homo_sapiens_ancestor_12.fa,hg19_ancestral/homo_sapiens_ancestor_13.fa,hg19_ancestral/homo_sapiens_ancestor_14.fa,hg19_ancestral/homo_sapiens_ancestor_15.fa,hg19_ancestral/homo_sapiens_ancestor_16.fa,hg19_ancestral/homo_sapiens_ancestor_17.fa,hg19_ancestral/homo_sapiens_ancestor_18.fa,hg19_ancestral/homo_sapiens_ancestor_19.fa,hg19_ancestral/homo_sapiens_ancestor_20.fa,hg19_ancestral/homo_sapiens_ancestor_21.fa,hg19_ancestral/homo_sapiens_ancestor_22.fa,hg19_ancestral/homo_sapiens_ancestor_X.fa \
418 | -refgenome=hg19_refgenome/chr1.fa,hg19_refgenome/chr2.fa,hg19_refgenome/chr3.fa,hg19_refgenome/chr4.fa,hg19_refgenome/chr5.fa,hg19_refgenome/chr6.fa,hg19_refgenome/chr7.fa,hg19_refgenome/chr8.fa,hg19_refgenome/chr9.fa,hg19_refgenome/chr10.fa,hg19_refgenome/chr11.fa,hg19_refgenome/chr12.fa,hg19_refgenome/chr13.fa,hg19_refgenome/chr14.fa,hg19_refgenome/chr15.fa,hg19_refgenome/chr16.fa,hg19_refgenome/chr17.fa,hg19_refgenome/chr18.fa,hg19_refgenome/chr19.fa,hg19_refgenome/chr20.fa,hg19_refgenome/chr21.fa,hg19_refgenome/chr22.fa,hg19_refgenome/chrX.fa \
419 | -weights=strickmask.bed \
420 | -out=outgroup.txt
421 | ```
422 |
423 | ---
424 |
425 | ### Estimating mutation rate across genome
426 |
427 | We can use the number of variants in the outgroup to estimate the substitution rate as a proxy for mutation rate.
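A minimal sketch of the idea (not the hmmix implementation, which also corrects for callability using the weights file): count outgroup variants in 1 Mb windows and normalize by the genome-wide average, so a window with an average substitution rate gets the value 1.

```python
from collections import defaultdict

window_size = 1000000
counts = defaultdict(lambda: defaultdict(int))

# outgroup.txt as produced by create_outgroup: a header line, then chrom and pos in the first two columns
with open('outgroup.txt') as data:
    next(data)  # skip header
    for line in data:
        chrom, pos = line.split()[0:2]
        counts[chrom][(int(pos) - 1) // window_size] += 1

all_windows = [count for chrom in counts for count in counts[chrom].values()]
genome_average = sum(all_windows) / len(all_windows)

with open('mutationrate.sketch.bed', 'w') as out:
    for chrom in counts:
        for window, count in sorted(counts[chrom].items()):
            start = window * window_size
            print(chrom, start, start + window_size, round(count / genome_average, 2), sep='\t', file=out)
```

The real command to run is: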
428 |
429 | ```note
430 | (took 30 sec) > hmmix mutation_rate -outgroup=outgroup.txt -weights=strickmask.bed -window_size=1000000 -out mutationrate.bed
431 | ----------------------------------------
432 | > Outgroupfile: outgroup.txt
433 | > Outputfile is: mutationrate.bed
434 | > Callability file is: strickmask.bed
435 | > Window size: 1000000
436 | ----------------------------------------
437 | ```
438 |
439 | ---
440 |
441 | ### Find a set of variants which are not derived in the outgroup
442 |
443 | Keep the variants that are not found to be derived in the outgroup for each individual in the ingroup. You can also specify a single individual or a comma-separated list of individuals.
444 |
445 | ```note
446 | (took 20 min) > hmmix create_ingroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa
447 | ----------------------------------------
448 | > Ingroup individuals: 2
449 | > Using vcf and ancestral files
450 | vcffile: chr1.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_1.fa
451 | vcffile: chr2.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_2.fa
452 | vcffile: chr3.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_3.fa
453 | vcffile: chr4.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_4.fa
454 | vcffile: chr5.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_5.fa
455 | vcffile: chr6.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_6.fa
456 | vcffile: chr7.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_7.fa
457 | vcffile: chr8.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_8.fa
458 | vcffile: chr9.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_9.fa
459 | vcffile: chr10.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_10.fa
460 | vcffile: chr11.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_11.fa
461 | vcffile: chr12.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_12.fa
462 | vcffile: chr13.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_13.fa
463 | vcffile: chr14.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_14.fa
464 | vcffile: chr15.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_15.fa
465 | vcffile: chr16.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_16.fa
466 | vcffile: chr17.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_17.fa
467 | vcffile: chr18.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_18.fa
468 | vcffile: chr19.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_19.fa
469 | vcffile: chr20.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_20.fa
470 | vcffile: chr21.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_21.fa
471 | vcffile: chr22.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_22.fa
472 | vcffile: chrX.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_X.fa
473 |
474 | > Using outgroup variants from: outgroup.txt
475 | > Callability file: strickmask.bed
476 | > Writing output to file with prefix: obs.<ind>.txt
477 |
478 | ----------------------------------------
479 | Running command:
480 | bcftools view -m2 -M2 -v snps -s HG00096 -T strickmask.bed chr1.bcf | vcftools --vcf - --exclude-positions outgroup.txt --recode --stdout
481 | ...
482 | bcftools view -m2 -M2 -v snps -s HG00097 -T strickmask.bed chr22.bcf | vcftools --vcf - --exclude-positions outgroup.txt --recode --stdout
483 |
484 |
485 | # Different way to define which individuals are in the ingroup
486 | (took 20 min) > hmmix create_ingroup -ind=HG00096,HG00097 -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa
487 | ```
488 |
489 | ---
490 |
491 | ### Training
492 |
493 | Now we train the HMM parameters and decode.
494 |
495 | ```note
496 | (took 2 min) > hmmix train -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.json
497 | ----------------------------------------
498 | > state_names = ['Human', 'Archaic']
499 | > starting_probabilities = [0.98, 0.02]
500 | > transitions = [[1.0, 0.0], [0.02, 0.98]]
501 | > emissions = [0.04, 0.4]
502 | > chromosomes to use: All
503 | > number of windows: 3034097. Number of snps = 129803
504 | > total callability: 2178532324 bp (71.8 %)
505 | > average mutation rate per bin: 1.0
506 | > Output is trained.HG00096.json
507 | > Window size is 1000 bp
508 | > Haploid False
509 | ----------------------------------------
510 | iteration loglikelihood start1 start2 emis1 emis2 trans1_1 trans2_2
511 | 0 -495723.941 0.98 0.02 0.04 0.4 0.9999 0.98
512 | 1 -493161.0783 0.964 0.036 0.0459 0.3894 0.9995 0.9859
513 | 2 -492985.5422 0.959 0.041 0.0454 0.3847 0.9993 0.9834
514 | ...
515 | 20 -492843.1842 0.954 0.046 0.0441 0.3724 0.9989 0.9768
516 | 21 -492843.1828 0.954 0.046 0.0441 0.3724 0.9989 0.9768
517 | 22 -492843.182 0.954 0.046 0.0441 0.3724 0.9989 0.9768
518 | ```
519 |
520 | ---
521 |
522 | ### Decoding
523 |
524 | ```note
525 | (took 30 sec) > hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json
526 | ----------------------------------------
527 | > state_names = ['Human', 'Archaic']
528 | > starting_probabilities = [0.954, 0.046]
529 | > transitions = [[0.999, 0.001], [0.023, 0.977]]
530 | > emissions = [0.044, 0.372]
531 | > chromosomes to use: All
532 | > number of windows: 3034097. Number of snps = 129803
533 | > total callability: 2178532324 bp (71.8 %)
534 | > average mutation rate per bin: 1.0
535 | > Output prefix is /dev/stdout
536 | > Window size is 1000 bp
537 | > Haploid False
538 | ----------------------------------------
539 | chrom start end length state mean_prob snps
540 | 1 0 2988000 2988000 Human 0.9843 91
541 | 1 2988000 2997000 9000 Archaic 0.76267 6
542 | 1 2997000 3425000 428000 Human 0.98774 30
543 | 1 3425000 3452000 27000 Archaic 0.95818 22
544 | 1 3452000 4302000 850000 Human 0.97914 36
545 | 1 4302000 4361000 59000 Archaic 0.86728 20
546 | 1 4361000 4500000 139000 Human 0.9685 4
547 | 1 4500000 4510000 10000 Archaic 0.85533 7
548 | ```
549 |
550 | You can also save to an output file with the command:
551 |
552 | ```note
553 | hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json -out=HG00096.decoded
554 | ```
555 |
556 | This will create a file named HG00096.decoded.diploid.txt because the default is to treat the data as diploid (more on haploid decoding in the next section).
557 |
558 | ---
559 |
560 | ### Training and decoding with phased data
561 |
562 | It is also possible to tell the model that the data is phased with the -haploid parameter. For that we first need to train the parameters for haploid data and then decode.
Training the model on phased data is done like this - and we also remember to change the name of the parameter file to include "phased" so future versions of ourselves don't forget. Another thing to note is that the number of SNPs is larger than before (135,483 vs 129,803). This is because the program counts SNPs on both haplotypes, so homozygous derived sites are counted twice! Also the number of windows is now doubled, because we are looking at each chromosome in the pair separately.
563 |
564 | ```note
565 | (took 4 min) > hmmix train -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.phased.json -haploid
566 | ----------------------------------------
567 | > state_names = ['Human', 'Archaic']
568 | > starting_probabilities = [0.98, 0.02]
569 | > transitions = [[1.0, 0.0], [0.02, 0.98]]
570 | > emissions = [0.04, 0.4]
571 | > chromosomes to use: All
572 | > number of windows: 6068194. Number of snps = 135483
573 | > total callability: 4357064649 bp (71.8 %)
574 | > average mutation rate per bin: 1.0
575 | > Output is trained.HG00096.phased.json
576 | > Window size is 1000 bp
577 | > Haploid True
578 | ----------------------------------------
579 | iteration loglikelihood start1 start2 emis1 emis2 trans1_1 trans2_2
580 | 0 -605546.7352 0.98 0.02 0.04 0.4 0.9999 0.98
581 | 1 -589566.629 0.985 0.015 0.0248 0.3999 0.9998 0.9851
582 | 2 -588897.0833 0.98 0.02 0.0238 0.3671 0.9996 0.9825
583 | ...
584 | 20 -588529.8136 0.973 0.027 0.0227 0.3266 0.9993 0.9755
585 | 21 -588529.8124 0.973 0.027 0.0227 0.3265 0.9993 0.9755
586 | 22 -588529.8117 0.973 0.027 0.0227 0.3265 0.9993 0.9755
587 | ```
588 |
589 | Below I am only showing the first archaic segments on chromosome 1 for each haplotype (note you have to scroll down past chromosome X before the second haplotype begins). They seem to fall more or less in the same places as when we used diploid data.
590 |
591 | ```note
592 | (took 30 sec) > hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.phased.json -haploid
593 | ----------------------------------------
594 | > state_names = ['Human', 'Archaic']
595 | > starting_probabilities = [0.973, 0.027]
596 | > transitions = [[0.999, 0.001], [0.024, 0.976]]
597 | > emissions = [0.023, 0.327]
598 | > chromosomes to use: All
599 | > number of windows: 6068194. Number of snps = 135483
600 | > total callability: 4357064649 bp (71.8 %)
601 | > average mutation rate per bin: 1.0
602 | > Output prefix is /dev/stdout
603 | > Window size is 1000 bp
604 | > Haploid True
605 | ----------------------------------------
606 | hap1
607 | chrom start end length state mean_prob snps
608 | 1 2156000 2185000 29000 Archaic 0.64814 6
609 | 1 3425000 3452000 27000 Archaic 0.96702 22
610 |
611 | ...
612 | hap2
613 | 1 2780000 2803000 23000 Archaic 0.68384 7
614 | 1 4302000 4337000 35000 Archaic 0.94248 13
615 | 1 4500000 4511000 11000 Archaic 0.87943 7
616 | 1 4989000 5001000 12000 Archaic 0.6195 5
617 | ```
618 |
619 | You can also save to an output file with the command:
620 |
621 | ```note
622 | hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.phased.json -haploid -out=HG00096.decoded
623 | ```
624 |
625 | This will create two files named HG00096.decoded.hap1.txt and HG00096.decoded.hap2.txt.
626 |
627 | ---
628 |
629 | ### Annotate with known admixing population
630 |
631 | Even though this method does not use archaic reference genomes for finding segments, you can still use them to annotate your segments.
632 | I have uploaded a VCF file containing 4 high-coverage archaic genomes (3 Neanderthals and 1 Denisovan) here:
633 |
634 | (hg19 - the one I use in this example)
635 |
636 | (hg38)
637 |
638 | If you have a VCF/BCF file from the population that admixed, you can write this:
639 |
640 | ```note
641 | > hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json -admixpop=archaicvar/*bcf
642 | ----------------------------------------
643 | > state_names = ['Human', 'Archaic']
644 | > starting_probabilities = [0.954, 0.046]
645 | > transitions = [[0.999, 0.001], [0.023, 0.977]]
646 | > emissions = [0.044, 0.372]
647 | > chromosomes to use: All
648 | > number of windows: 3034097. Number of snps = 129803
649 | > total callability: 2178532324 bp (71.8 %)
650 | > average mutation rate per bin: 1.0
651 | > Output prefix is /dev/stdout
652 | > Window size is 1000 bp
653 | > Haploid False
654 | ----------------------------------------
655 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_9.bcf
656 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_19.bcf
657 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_7.bcf
658 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_21.bcf
659 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_20.bcf
660 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_15.bcf
661 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_10.bcf
662 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_3.bcf
663 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_17.bcf
664 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_6.bcf
665 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_X.bcf
666 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_16.bcf
667 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_1.bcf
668 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_18.bcf
669 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_14.bcf
670 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_4.bcf
671 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_2.bcf
672 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_22.bcf
673 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_5.bcf
674 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_8.bcf
675 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_11.bcf
676 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_12.bcf
677 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_13.bcf
678 | chrom start end length state mean_prob snps admixpopvariants AltaiNeandertal Vindija33.19 Denisova Chagyrskaya-Phalanx
679 | 1 2988000 2997000 9000 Archaic 0.76267 6 4 4 4 1 4
680 | 1 3425000 3452000 27000 Archaic 0.95818 22 17 17 15 3 17
681 | 1 4302000 4361000 59000 Archaic 0.86728 20 12 11 12 11 11
682 | 1 4500000 4510000 10000 Archaic 0.85533 7 5 4 5 4 5
683 | 1 5306000 5319000 13000 Archaic 0.55713 4 1 1 1 0 1
684 | 1 5338000 5348000 10000 Archaic 0.65123 5 3 2 3 0 3
685 | 1 9321000 9355000 34000 Archaic 0.86446 9 0 0 0 0 0
686 | 1 12599000 12655000 56000 Archaic 0.91166 18 11 4 11 0 10
687 | ```
688 |
689 | For the first segment there are 6 derived SNPs. Of these, 4 are shared with the Altai, Vindija and Chagyrskaya Neanderthals while only 1 is shared with Denisova, so this segment likely introgressed from Neanderthals.
690 |
691 | ---
692 |
693 | And that is it! Now you have run the model, gotten a set of parameters that you can interpret biologically (see my paper) and obtained a list of segments that belong to the Human and Archaic states.
694 |
695 | If you have any questions about the use of the scripts, if you find errors or if you have feedback you can contact me here (make an issue) or write to:
696 | lauritsskov2@gmail.com
697 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | with open('README.md', 'r') as fh:
4 |     long_description = fh.read()
5 |
6 | setup(
7 |     name='hmmix',
8 |     python_requires='>3.5, <3.10',
9 |     version = '0.8.2',
10 |     description='Find introgressed segments',
11 |     py_modules=['bcf_vcf', 'helper_functions', 'hmm_functions', 'main', 'make_mutationrate', 'make_test_data', 'artemis'],
12 |     package_dir={'': 'src'},
13 |     classifiers=[
14 |         'Programming Language :: Python :: 3',
15 |         'Programming Language :: Python :: 3.5',
16 |         'Programming Language :: Python :: 3.6',
17 |         'Programming Language :: Python :: 3.7',
18 |         'Programming Language :: Python :: 3.8',
19 |         'Programming Language :: Python :: 3.9',
20 |         'License :: OSI Approved :: MIT License',
21 |     ],
22 |     long_description=long_description,
23 |     long_description_content_type='text/markdown',
24 |     url = 'https://github.com/LauritsSkov/Introgression-detection',
25 |     author = 'Laurits Skov and Moises Coll Macia',
26 |     author_email='lauritsskov2@gmail.com',
27 |     entry_points = {
28 |         'console_scripts': [
29 |             'hmmix = main:main'
30 |         ]},
31 |     install_requires=[
32 |         'numpy>=1.15',
33 |         'scipy>=1.5',
34 |         'matplotlib>=3.3',
35 |         'numba'
36 |     ],
37 | )
38 |
39 |
--------------------------------------------------------------------------------
/src/bcf_vcf.py:
--------------------------------------------------------------------------------
1 | import os
2 | from collections import defaultdict
3 |
4 | from helper_functions import sortby, Make_folder_if_not_exists, load_fasta, convert_to_bases, clean_files
5 |
6 |
7 |
8 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
9 | # Make Outgroup
10 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
11 | def make_out_group(individuals_input, bedfile, vcffiles, outputfile, ancestralfiles, refgenomefiles):
12 |
13 |     Make_folder_if_not_exists(outputfile)
14 |     outgroup_individuals = ','.join(individuals_input)
15 |
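    # First pass: write a header plus all candidate outgroup variants (polymorphic
    # sites and, when ancestral/reference information is available, fixed derived
    # sites) to a temporary '.unsorted' file; it is sorted by chromosome and
    # position at the end of this function and the temporary file is then removed.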
16 |     with open(outputfile + '.unsorted', 'w') as out:
17 |
18 |         print('chrom', 'pos', 'ref_allele_info', 'alt_allele_info', 'ancestral_base', sep = '\t', file = out)
19 |
20 |         for vcffile, ancestralfile, reffile in zip(vcffiles, ancestralfiles, refgenomefiles):
21 |
22 |             if ancestralfile is not None:
23 |                 ancestral_allele = load_fasta(ancestralfile)
24 |
25 |             if bedfile is not None:
26 |                 command = f'bcftools view -s {outgroup_individuals} -T {bedfile} {vcffile} | bcftools norm -m -any | bcftools view -v snps | vcftools --vcf - --counts --stdout'
27 |             else:
28 |                 command = f'bcftools view -s {outgroup_individuals} {vcffile} | bcftools norm -m -any | bcftools view -v snps | vcftools --vcf - --counts --stdout'
29 |
30 |             print(f'Processing {vcffile}...')
31 |             print('Running command:')
32 |             print(command, '\n\n')
33 |
34 |             variants_seen = defaultdict(int)
35 |             for index, line in enumerate(os.popen(command)):
36 |                 if not line.startswith('CHROM'):
37 |
38 |                     chrom, pos, _, _, ref_allele_info, alt_allele_info = line.strip().split()
39 |
40 |                     ref_allele, ref_count = ref_allele_info.split(':')
41 |                     alt_allele, alt_count = alt_allele_info.split(':')
42 |                     pos, ref_count, alt_count = int(pos), int(ref_count), int(alt_count)
43 |
44 |                     # Always include polymorphic sites
45 |                     if alt_count * ref_count > 0:
46 |                         ancestral_base = ref_allele if ref_count > alt_count else alt_allele
47 |
48 |                         # Use ancestral base info if available
49 |                         if ancestralfile is not None:
50 |                             ancestral_base_temp = ancestral_allele[pos-1]
51 |                             if ancestral_base_temp in [ref_allele, alt_allele]:
52 |                                 ancestral_base = ancestral_base_temp
53 |
54 |                         print(chrom, pos, ref_allele_info, alt_allele_info, ancestral_base, sep = '\t', file = out)
55 |                         variants_seen[pos-1] = 1
56 |
57 |                     # Fixed sites
58 |                     elif alt_count * ref_count == 0:
59 |                         ancestral_base = ref_allele if ref_count > alt_count else alt_allele
60 |
61 |                         # Use ancestral base info if available
62 |                         if ancestralfile is not None:
63 |                             ancestral_base_temp = ancestral_allele[pos-1]
64 |                             if ancestral_base_temp in [ref_allele, alt_allele]:
65 |                                 ancestral_base = ancestral_base_temp
66 |
67 |                         if ancestral_base == alt_allele:
68 |                             derived_count = ref_count
69 |                         else:
70 |                             derived_count = alt_count
71 |
72 |                         if derived_count > 0:
73 |                             print(chrom, pos, ref_allele_info, alt_allele_info, ancestral_base, sep = '\t', file = out)
74 |                             variants_seen[pos-1] = 1
75 |
76 |
77 |                 if index % 100000 == 0:
78 |                     print(f'at line {index} at chrom {chrom} and position {pos}')
79 |
80 |             # If a reference genome and ancestral sequence are provided then also add positions where the reference and ancestral allele differ AND which are not seen in the outgroup vcf - these sites are fixed derived in the outgroup
81 |             if reffile is not None and ancestralfile is not None:
82 |                 print('Find fixed derived sites')
83 |                 refgenome_allele = load_fasta(reffile)
84 |
85 |                 for index, (refbase, ancbase) in enumerate(zip(refgenome_allele, ancestral_allele)):
86 |                     if ancbase in ['A','C','G','T'] and refbase in ['A','C','G','T']:
87 |                         if refbase != ancbase and variants_seen[index] == 0:
88 |                             print(chrom, index + 1, f'{refbase}:100', f'{ancbase}:0', ancbase, sep = '\t', file = out)
89 |
90 |     # Sort outgroup file
91 |     print('Sorting outgroup file')
92 |     positions_to_sort = defaultdict(lambda: defaultdict(str))
93 |     with open(outputfile + '.unsorted') as data, open(outputfile, 'w') as out:
94 |         for line in data:
95 |             if line.startswith('chrom'):
96 |                 out.write(line)
97 |             else:
98 |                 chrom, pos = line.strip().split()[0:2]
99 |                 positions_to_sort[chrom][int(pos)] = line
100 |
101 |         for chrom in sorted(positions_to_sort,
key=sortby):
102 |             for pos in sorted(positions_to_sort[chrom]):
103 |                 line = positions_to_sort[chrom][pos]
104 |                 out.write(line)
105 |
106 |     # Clean log files generated by vcf and bcf tools
107 |     clean_files(outputfile + '.unsorted')
108 |     clean_files('out.log')
109 |
110 |
111 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
112 | # Make ingroup
113 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
114 | def make_ingroup_obs(ingroup_individuals, bedfile, vcffiles, outprefix, outgroupfile, ancestralfiles):
115 |
116 |     # handle output files
117 |     Make_folder_if_not_exists(outprefix)
118 |     outfile_handler = defaultdict(str)
119 |     for individual in ingroup_individuals:
120 |         outfile_handler[individual] = open(f'{outprefix}.{individual}.txt','w')
121 |         print('chrom', 'pos', 'ancestral_base', 'genotype', sep = '\t', file = outfile_handler[individual])
122 |
123 |     individuals_for_bcf = ','.join(ingroup_individuals)
124 |
125 |     for vcffile, ancestralfile in zip(vcffiles, ancestralfiles):
126 |
127 |         if ancestralfile is not None:
128 |             ancestral_allele = load_fasta(ancestralfile)
129 |
130 |         if bedfile is not None:
131 |             command = f'bcftools view -v snps -s {individuals_for_bcf} -T {bedfile} {vcffile} | bcftools norm -m +any | vcftools --vcf - --exclude-positions {outgroupfile} --recode --stdout'
132 |         else:
133 |             command = f'bcftools view -v snps -s {individuals_for_bcf} {vcffile} | bcftools norm -m +any | vcftools --vcf - --exclude-positions {outgroupfile} --recode --stdout'
134 |
135 |         print('Running command:')
136 |         print(command, '\n\n')
137 |
138 |         for index, line in enumerate(os.popen(command)):
139 |
140 |             if line.startswith('#CHROM'):
141 |                 individuals_in_vcffile = line.strip().split()[9:]
142 |
143 |             if not line.startswith('#'):
144 |
145 |                 chrom, pos, _, ref_allele, alt_allele = line.strip().split()[0:5]
146 |                 pos = int(pos)
147 |                 genotypes = [x.split(':')[0] for x in line.strip().split()[9:]]
148 |                 all_bases = [ref_allele] + alt_allele.split(',')
149 |
150 |                 if ref_allele in ['A','C','G','T']:
151 |
152 |                     for original_genotype, individual in zip(genotypes, individuals_in_vcffile):
153 |                         genotype = convert_to_bases(original_genotype, all_bases)
154 |
155 |                         if ancestralfile is not None:
156 |                             # With ancestral information look for derived alleles
157 |                             ancestral_base = ancestral_allele[pos-1]
158 |                             if ancestral_base in all_bases and genotype.count(ancestral_base) != 2 and genotype != 'NN':
159 |                                 print(chrom, pos, ancestral_base, genotype, sep = '\t', file = outfile_handler[individual])
160 |
161 |                         else:
162 |                             # If no ancestral information is provided only include heterozygous variants
163 |                             if genotype[0] != genotype[1]:
164 |                                 print(chrom, pos, ref_allele, genotype, sep = '\t', file = outfile_handler[individual])
165 |
166 |
167 |             if index % 100000 == 0:
168 |                 print(f'at line {index} at chrom {chrom} and position {pos}')
169 |
170 |     # Clean log files generated by vcf and bcf tools
171 |     clean_files('out.log')
172 |
173 |     for individual in ingroup_individuals:
174 |         outfile_handler[individual].close()
175 |
176 |
177 |
178 |
--------------------------------------------------------------------------------
/src/helper_functions.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import json
3 | from collections import defaultdict
4 | import os, sys
5 | from glob import glob
6 |
7 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
8 | # Functions for handling observations/bed files
9 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
10 | def make_callability_from_bed(bedfile, window_size):
11 |     callability = defaultdict(lambda: defaultdict(float))
12 |     with open(bedfile) as data:
13 |         for line in data:
14 |
15 |             if line.startswith('chrom'):
16 |                 continue
17 |
18 |             if len(line.strip().split('\t')) == 3:
19 |                 chrom, start, end = line.strip().split('\t')
20 |                 value = 1
21 |             elif len(line.strip().split('\t')) > 3:
22 |                 chrom, start, end, value = line.strip().split('\t')[0:4]
23 |                 value = float(value)
24 |
25 |             start, end = int(start), int(end)
26 |
27 |             firstwindow = start - start % window_size
28 |             lastwindow = end - end % window_size
29 |
30 |             # not spanning multiple windows (all is added to same window)
31 |             if firstwindow == lastwindow:
32 |                 callability[chrom][firstwindow] += (end-start+1) * value
33 |
34 |             # spanning multiple windows
35 |             else:
36 |                 # add to end windows
37 |                 firstwindow_fill = window_size - start % window_size
38 |                 lastwindow_fill = end % window_size
39 |
40 |                 callability[chrom][firstwindow] += firstwindow_fill * value
41 |                 callability[chrom][lastwindow] += (lastwindow_fill+1) * value
42 |
43 |                 # fill in windows in the middle
44 |                 for window_tofil in range(firstwindow + window_size, lastwindow, window_size):
45 |                     callability[chrom][window_tofil] += window_size * value
46 |
47 |     return callability
48 |
49 |
50 |
51 | def Load_observations_weights_mutrates(obs_file, weights_file, mutrates_file, window_size = 1000, haploid = False, chrom_to_look_for = 'All'):
52 |
53 |     # get span of data
54 |     chromosome_spans = defaultdict(int)
55 |     with open(obs_file) as data:
56 |         for line in data:
57 |             if line.startswith('chrom'):
58 |                 continue
59 |             chrom, pos, ancestral_base, genotype = line.strip().split()
60 |
61 |             # convert 1-indexed position to 0-indexed position
62 |             zero_based_pos = int(pos) - 1
63 |             rounded_pos = zero_based_pos - zero_based_pos % window_size
64 |             chromosome_spans[chrom] = rounded_pos
65 |
66 |
67 |     # get span of callability
68 |     if weights_file:
69 |         callability = make_callability_from_bed(weights_file, window_size)
70 |         for chrom in callability:
71 |             chromosome_spans[chrom] = max(callability[chrom])
72 |
73 |     # which chromosomes are we interested in
74 |     if chrom_to_look_for != 'All':
75 |         chromosome_list = sorted(chrom_to_look_for.split(','), key=sortby)
76 |     else:
77 |         chromosome_list = sorted(list(chromosome_spans.keys()), key=sortby)
78 |
79 |     obs_counter = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
80 |     haplotypes = defaultdict(int)
81 |     chroms, starts, variants, obs = [], [], [], []
82 |
83 |     # read observation data
84 |     with open(obs_file) as data:
85 |         for line in data:
86 |             if line.startswith('chrom'):
87 |                 continue
88 |
89 |             chrom, pos, ancestral_base, genotype = line.strip().split()
90 |             if not chrom in chromosome_list:
91 |                 continue
92 |
93 |             # convert 1-indexed position to 0-indexed position
94 |             zero_based_pos = int(pos) - 1
95 |             rounded_pos = zero_based_pos - zero_based_pos % window_size
96 |
97 |             if haploid:
98 |                 for i, base in enumerate(genotype):
99 |                     if base != ancestral_base:
100 |                         obs_counter[chrom][rounded_pos][f'_hap{i+1}'].append(pos)
101 |                         haplotypes[f'_hap{i+1}'] += 1
102 |             else:
103 |                 obs_counter[chrom][rounded_pos][''].append(pos)
104 |                 haplotypes[''] += 1
105 |
106 |
107 |     # take care of cases where there are 0 observations
108 |     if len(haplotypes) == 0:
109 |         if haploid:
110 |             sys.exit('Could not determine haploidity because there is no data')
111 |         else:
112 |             haplotypes[''] += 1
113 |
114 |
115 |     for haplotype in sorted(haplotypes, key=sortby_haplotype):
116 |
117 |         for chrom in chromosome_list:
118 |             lastwindow = chromosome_spans[chrom]
119 |
120 |             for window in range(0, lastwindow, window_size):
121 |                 chroms.append(f'{chrom}{haplotype}')
122 |                 starts.append(window)
123 |                 variants.append(','.join(obs_counter[chrom][window][haplotype]))
124 |                 obs.append(len(obs_counter[chrom][window][haplotype]))
125 |
126 |
127 |     # Read weights file if it exists - else set all weights to 1
128 |     if weights_file is None:
129 |         weights = np.ones(len(obs))
130 |     else:
131 |         callability = make_callability_from_bed(weights_file, window_size)
132 |         weights = []
133 |         for haplotype in sorted(haplotypes, key=sortby_haplotype):
134 |
135 |             for chrom in chromosome_list:
136 |                 lastwindow = chromosome_spans[chrom]
137 |
138 |                 for window in range(0, lastwindow, window_size):
139 |                     weights.append(callability[chrom][window] / float(window_size))
140 |
141 |
142 |     # Read mutation rate file if it exists - else set all mutation rates to 1
143 |     if mutrates_file is None:
144 |         mutrates = np.ones(len(obs))
145 |     else:
146 |         callability = make_callability_from_bed(mutrates_file, window_size)
147 |         mutrates = []
148 |         for haplotype in sorted(haplotypes, key=sortby_haplotype):
149 |
150 |             for chrom in chromosome_list:
151 |                 lastwindow = chromosome_spans[chrom]
152 |
153 |                 for window in range(0, lastwindow, window_size):
154 |                     mutrates.append(callability[chrom][window] / float(window_size))
155 |
156 |
157 |
158 |     # Make sure there are no places with obs > 0 and 0 in mutation rate or weight
159 |     for index, (observation, w, m) in enumerate(zip(obs, weights, mutrates)):
160 |         if w*m == 0 and observation != 0:
161 |             print(f'warning, you had {observation} observations but no called bases/no mutation rate at index:{index}. weights:{w}, mutrates:{m}')
162 |             obs[index] = 0
163 |
164 |     return np.array(obs).astype(int), chroms, starts, variants, np.array(mutrates).astype(float), np.array(weights).astype(float)
165 |
166 |
167 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
168 | # For decoding/training
169 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
170 |
171 | def find_runs(inarray):
172 |     """ run length encoding. Partial credit to R rle function.
167 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
168 | # For decoding/training
169 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
170 | 
171 | def find_runs(inarray):
172 |     """ Run length encoding. Partial credit to R rle function.
173 |         Handles multiple data types, including non-NumPy arrays.
174 |         Yields tuples of (value, start_position, run_length) """
175 |     ia = np.asarray(inarray)                  # force numpy
176 |     n = len(ia)
177 |     if n == 0:
178 |         return (None, None, None)
179 |     else:
180 |         y = ia[1:] != ia[:-1]                 # pairwise unequal (string safe)
181 |         i = np.append(np.where(y), n - 1)     # must include last element position
182 |         z = np.diff(np.append(-1, i))         # run lengths
183 |         p = np.cumsum(np.append(0, z))[:-1]   # positions
184 | 
185 |         for (a, b, c) in zip(ia[i], p, z):
186 |             yield (a, b, c)
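# Example (editorial): find_runs yields one (value, start_position, run_length)
# tuple per run, e.g.
#
#   list(find_runs(['1', '1', '1', '2', '2']))  ->  [('1', 0, 3), ('2', 3, 2)]
#
# (the exact element types are numpy scalars, since the input is coerced with np.asarray)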
187 | 
188 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
189 | # Various helper functions
190 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
191 | def load_fasta(fasta_file):
192 |     '''
193 |     Read a fasta file with a single chromosome in and return the sequence as a string
194 |     '''
195 |     fasta_sequence = ''
196 |     with open(fasta_file) as data:
197 |         for line in data:
198 |             if not line.startswith('>'):
199 |                 fasta_sequence += line.strip().upper()
200 | 
201 |     return fasta_sequence
202 | 
203 | def sortby(x):
204 |     '''
205 |     This function is used as a key in the sorted() function. It will sort first by numeric values, then strings, then other symbols
206 | 
207 |     Usage:
208 |     mylist = ['1', '12', '2', '3', 'MT', 'Y']
209 |     sortedlist = sorted(mylist, key=sortby)
210 |     returns ['1', '2', '3', '12', 'MT', 'Y']
211 |     '''
212 | 
213 |     lower_case_letters = 'abcdefghijklmnopqrstuvwxyz'
214 |     if x.isnumeric():
215 |         return int(x)
216 |     elif type(x) == str and len(x) > 0:
217 |         if x[0].lower() in lower_case_letters:
218 |             return 1e6 + lower_case_letters.index(x[0].lower())
219 |         else:
220 |             return 2e6
221 |     else:
222 |         return 3e6
223 | 
224 | def sortby_haplotype(x):
225 |     '''
226 |     This function will sort haplotypes by number
227 |     '''
228 | 
229 |     if '_hap' in x:
230 |         return int(x.replace('_hap', ''))
231 |     else:
232 |         return x
233 | 
234 | 
235 | 
236 | 
237 | def Make_folder_if_not_exists(path):
238 |     '''
239 |     Check if the folder of path exists - otherwise create it
240 |     '''
241 |     path = os.path.dirname(path)
242 |     if path != '':
243 |         if not os.path.exists(path):
244 |             os.makedirs(path)
245 | 
246 | 
247 | 
248 | def Annotate_with_ref_genome(vcffiles, obsfile):
249 |     obs = defaultdict(list)
250 |     shared_with = defaultdict(str)
251 | 
252 |     tempobsfile = obsfile + 'temp'
253 |     with open(obsfile) as data, open(tempobsfile,'w') as out:
254 |         for line in data:
255 |             if not line.startswith('chrom'):
256 |                 out.write(line)
257 |                 chrom, pos, ancestral_base, genotype = line.strip().split()
258 |                 derived_variant = genotype.replace(ancestral_base, '')[0]
259 |                 ID = f'{chrom}_{pos}'
260 |                 obs[ID] = [ancestral_base, derived_variant]
261 | 
262 |     # handle case with 0 SNPs
263 |     if len(obs) == 0:
264 |         for vcffile in handle_infiles(vcffiles):
265 |             command = f'bcftools view -h {vcffile}'
266 |             print('You have no observations!')
267 | 
268 |             for line in os.popen(command):
269 |                 if line.startswith('#CHROM'):
270 |                     individuals_in_vcffile = line.strip().split()[9:]
271 | 
272 |         # Clean log files generated by vcf and bcf tools
273 |         clean_files('out.log')
274 |         clean_files(tempobsfile)
275 | 
276 |         return shared_with, individuals_in_vcffile
277 | 
278 | 
279 |     print('Loading in admixpop snp information')
280 |     for vcffile in handle_infiles(vcffiles):
281 |         command = f'bcftools view -a -R {tempobsfile} {vcffile}'
282 |         print(command)
283 | 
284 |         for line in os.popen(command):
285 |             if line.startswith('#CHROM'):
286 |                 individuals_in_vcffile = line.strip().split()[9:]
287 | 
288 |             if not line.startswith('#'):
289 | 
290 |                 chrom, pos, _, ref_allele, alt_allele = line.strip().split()[0:5]
291 |                 ID = f'{chrom}_{pos}'
292 |                 genotypes = [x.split(':')[0] for x in line.strip().split()[9:]]
293 |                 all_bases = [ref_allele] + alt_allele.split(',')
294 | 
295 |                 ancestral_base, derived_base = obs[ID]
296 |                 found_in = []
297 | 
298 |                 for original_genotype, individual in zip(genotypes, individuals_in_vcffile):
299 | 
300 |                     if '.' not in original_genotype:
301 |                         genotype = convert_to_bases(original_genotype, all_bases)
302 | 
303 |                         if genotype.count(derived_base) > 0:
304 |                             found_in.append(individual)
305 | 
306 |                 if len(found_in) > 0:
307 |                     shared_with[ID] = '|'.join(found_in)
308 | 
309 | 
310 |     # Clean log files generated by vcf and bcf tools
311 |     clean_files('out.log')
312 |     clean_files(tempobsfile)
313 | 
314 |     return shared_with, individuals_in_vcffile
315 | 
316 | def handle_individuals_input(argument, group_to_choose):
317 |     if os.path.exists(argument):
318 |         with open(argument) as json_file:
319 |             data = json.load(json_file)
320 |         return data[group_to_choose]
321 |     else:
322 |         return argument.split(',')
323 | 
324 | 
325 | 
326 | 
327 | # Clean up
328 | def clean_files(filename):
329 |     if os.path.exists(filename):
330 |         os.remove(filename)
331 | 
332 | 
333 | # Join variant positions across windows into one comma-separated string
334 | def flatten_list(variants_list):
335 |     return ','.join([x for x in variants_list if x != ''])
336 | 
337 | 
338 | 
339 | def convert_to_bases(genotype, both_bases):
340 | 
341 |     return_genotype = 'NN'
342 |     separator = None
343 | 
344 |     if '/' in genotype or '|' in genotype:
345 |         separator = '|' if '|' in genotype else '/'
346 | 
347 |         base1, base2 = [x for x in genotype.split(separator)]
348 |         if base1.isnumeric() and base2.isnumeric():
349 |             base1, base2 = int(base1), int(base2)
350 | 
351 |             if both_bases[base1] in ['A','C','G','T'] and both_bases[base2] in ['A','C','G','T']:
352 |                 return_genotype = both_bases[base1] + both_bases[base2]
353 | 
354 |     return return_genotype
355 | 
356 | 
357 | 
358 | 
359 | 
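# Example (editorial): with all_bases = ['A', 'G'] the genotype '0/1' converts to
# 'AG', whereas a missing genotype such as './.' falls through and returns 'NN':
#
#   convert_to_bases('0/1', ['A', 'G'])  ->  'AG'
#   convert_to_bases('./.', ['A', 'G'])  ->  'NN'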
360 | def get_consensus(infiles):
361 |     '''
362 |     Find consensus prefix, suffix and value that changes in set of files:
363 | 
364 |     myfiles = ['chr1.vcf', 'chr2.vcf', 'chr3.vcf']
365 |     prefix, suffix, values = get_consensus(myfiles)
366 | 
367 |     prefix = chr
368 |     suffix = .vcf
369 |     values = [1,2,3]
370 |     '''
371 |     infiles = [str(x) for x in infiles]
372 | 
373 |     if len(infiles) <= 1:
374 |         return None, None, None
375 | 
376 |     # find longest common prefix
377 |     prefix = infiles[0]
378 |     for s in infiles[1:]:
379 |         while not s.startswith(prefix):
380 |             prefix = prefix[:-1]
381 | 
382 |     # find longest common suffix
383 |     suffix = infiles[0]
384 |     for s in infiles[1:]:
385 |         while not s.endswith(suffix):
386 |             suffix = suffix[1:]
387 | 
388 |     values = [x[len(prefix):-len(suffix)] if suffix else x[len(prefix):] for x in infiles]
389 | 
390 |     return prefix, suffix, sorted(values, key=sortby)
391 | 
392 | 
393 | # Check which type of input we are dealing with
394 | def handle_infiles(input):
395 | 
396 |     file_list = glob(input)
397 |     if len(file_list) > 0:
398 |         return file_list
399 |     else:
400 |         return input.split(',')
401 | 
402 | 
403 | 
404 | 
405 | 
406 | # Match ancestral/reference files to vcf files
407 | def combined_files(ancestralfiles, vcffiles):
408 | 
409 |     if len(ancestralfiles) == len(vcffiles) and len(vcffiles) == 1 and ancestralfiles != ['']:
410 |         return ancestralfiles, vcffiles
411 | 
412 |     # Get ancestral and vcf consensus
413 |     prefix1, postfix1, values1 = get_consensus(vcffiles)
414 |     prefix2, postfix2, values2 = get_consensus(ancestralfiles)
415 | 
416 | 
417 |     # No ancestral files (guard against a single vcf file, where get_consensus returns None)
418 |     if ancestralfiles == ['']:
419 | 
420 |         ancestralfiles = [None for _ in vcffiles]
421 |         vcffiles = [f'{prefix1}{x}{postfix1}' for x in values1] if values1 is not None else vcffiles
422 |         return ancestralfiles, vcffiles
423 | 
424 |     # Same length
425 |     elif len(ancestralfiles) == len(vcffiles):
426 | 
427 |         vcffiles = [f'{prefix1}{x}{postfix1}' for x in values1]
428 |         ancestralfiles = [f'{prefix2}{x}{postfix2}' for x in values2]
429 |         return ancestralfiles, vcffiles
430 | 
431 |     # different lengths (both longer than 1)
432 |     elif len(ancestralfiles) > 1 and len(vcffiles) > 1:
433 | 
434 |         vcffiles = []
435 |         ancestralfiles = []
436 | 
437 |         for joined in sorted(set(values1).intersection(set(values2)), key=sortby):
438 |             vcffiles.append(''.join([prefix1, joined, postfix1]))
439 |             ancestralfiles.append(''.join([prefix2, joined, postfix2]))
440 |         return ancestralfiles, vcffiles
441 | 
442 |     # Many ancestral files only one vcf
443 |     elif len(ancestralfiles) > 1 and len(vcffiles) == 1:
444 |         ancestralfiles = []
445 | 
446 |         for key in values2:
447 |             if key in vcffiles[0]:
448 |                 ancestralfiles.append(''.join([prefix2, key, postfix2]))
449 | 
450 |         if len(vcffiles) != len(ancestralfiles):
451 |             sys.exit('Could not resolve ancestral files and vcffiles (try comma separated values)')
452 | 
453 |         return ancestralfiles, vcffiles
454 | 
455 |     # only one ancestral file and many vcf files
456 |     elif len(ancestralfiles) == 1 and len(vcffiles) > 1:
457 |         vcffiles = []
458 | 
459 |         for key in values1:
460 |             if key in ancestralfiles[0]:
461 |                 vcffiles.append(''.join([prefix1, key, postfix1]))
462 | 
463 |         if len(vcffiles) != len(ancestralfiles):
464 |             sys.exit('Could not resolve ancestral files and vcffiles (try comma separated values)')
465 | 
466 | 
467 |         return ancestralfiles, vcffiles
468 |     else:
469 |         sys.exit('Could not resolve ancestral files and vcffiles (try comma separated values)')
--------------------------------------------------------------------------------
/src/hmm_functions.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 | import numpy as np
3 | from numba import njit
4 | import json
5 | 
6 | from helper_functions import find_runs, Annotate_with_ref_genome, Make_folder_if_not_exists, flatten_list
7 | 
8 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
9 | # HMM Parameter Class
10 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
11 | class HMMParam:
12 |     def __init__(self, state_names, starting_probabilities, transitions, emissions): 
13 |         self.state_names = np.array(state_names)
14 |         self.starting_probabilities = np.array(starting_probabilities)
15 |         self.transitions = np.array(transitions)
16 |         self.emissions = np.array(emissions)
17 | 
18 |     def __str__(self):
19 |         out = f'> state_names = {self.state_names.tolist()}\n'
20 |         out += f'> starting_probabilities = {np.matrix.round(self.starting_probabilities, 3).tolist()}\n'
21 |         out += f'> transitions = {np.matrix.round(self.transitions, 3).tolist()}\n'
22 |         out += f'> emissions = 
{np.matrix.round(self.emissions, 3).tolist()}' 23 | return out 24 | 25 | def __repr__(self): 26 | return f'{self.__class__.__name__}({self.state_names}, {self.starting_probabilities}, {self.transitions}, {self.emissions})' 27 | 28 | # Read HMM parameters from a json file 29 | def read_HMM_parameters_from_file(filename): 30 | 31 | if filename is None: 32 | return get_default_HMM_parameters() 33 | 34 | with open(filename) as json_file: 35 | data = json.load(json_file) 36 | 37 | return HMMParam(state_names = data['state_names'], 38 | starting_probabilities = data['starting_probabilities'], 39 | transitions = data['transitions'], 40 | emissions = data['emissions']) 41 | 42 | # Set default parameters 43 | def get_default_HMM_parameters(): 44 | return HMMParam(state_names = ['Human', 'Archaic'], 45 | starting_probabilities = [0.98, 0.02], 46 | transitions = [[0.9999,0.0001],[0.02,0.98]], 47 | emissions = [0.04, 0.4]) 48 | 49 | # Save HMMParam to a json file 50 | def write_HMM_to_file(hmmparam, outfile): 51 | data = {key: value.tolist() for key, value in vars(hmmparam).items()} 52 | json_string = json.dumps(data, indent = 2) 53 | with open(outfile, 'w') as out: 54 | out.write(json_string) 55 | 56 | 57 | 58 | 59 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 60 | # HMM functions 61 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 62 | 63 | @njit 64 | def poisson_probability_underflow_safe(n, lam): 65 | # naive: np.exp(-lam) * lam**n / factorial(n) 66 | 67 | # iterative, to keep the components from getting too large or small: 68 | p = np.exp(-lam) 69 | for i in range(n): 70 | p *= lam 71 | p /= i+1 72 | return p 73 | 74 | @njit 75 | def Emission_probs_poisson(emissions, observations, weights, mutrates): 76 | n = len(observations) 77 | n_states = len(emissions) 78 | 79 | probabilities = np.zeros( (n, n_states) ) 80 | for state in range(n_states): 81 | for index in range(n): 82 | lam = emissions[state] * weights[index] * mutrates[index] 83 | probabilities[index,state] = poisson_probability_underflow_safe(observations[index], lam) 84 | 85 | return probabilities 86 | 87 | @njit 88 | def fwd_step(alpha_prev, E, trans_mat): 89 | alpha_new = (alpha_prev @ trans_mat) * E 90 | n = np.sum(alpha_new) 91 | return alpha_new / n, n 92 | 93 | @njit 94 | def forward(probabilities, transitions, init_start): 95 | n = len(probabilities) 96 | forwards_in = np.zeros( (n, len(init_start)) ) 97 | scale_param = np.ones(n) 98 | 99 | for t in range(n): 100 | if t == 0: 101 | forwards_in[t,:] = init_start * probabilities[t,:] 102 | scale_param[t] = np.sum( forwards_in[t,:]) 103 | forwards_in[t,:] = forwards_in[t,:] / scale_param[t] 104 | else: 105 | forwards_in[t,:], scale_param[t] = fwd_step(forwards_in[t-1,:], probabilities[t,:], transitions) 106 | 107 | return forwards_in, scale_param 108 | 109 | @njit 110 | def bwd_step(beta_next, E, trans_mat, n): 111 | beta = (trans_mat * E) @ beta_next 112 | return beta / n 113 | 114 | @njit 115 | def backward(emissions, transitions, scales): 116 | n, n_states = emissions.shape 117 | beta = np.ones((n, n_states)) 118 | for i in range(n - 1, 0, -1): 119 | beta[i - 1,:] = bwd_step(beta[i,:], emissions[i,:], transitions, scales[i]) 120 | return beta 121 | 122 | 123 | def GetProbability(hmm_parameters, weights, obs, mutrates): 124 | 125 | emissions = 
Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates)
126 |     _, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
127 |     forward_probability_of_obs = np.sum(np.log(scales))
128 | 
129 |     return forward_probability_of_obs
130 | 
131 | 
132 | @njit
133 | def fwd_step_keep_track(alpha_prev, E, trans_mat):
134 | 
135 |     # scaling factor
136 |     n = np.sum((alpha_prev @ trans_mat) * E)
137 | 
138 |     results = np.zeros(len(E))
139 |     back_track_states = np.zeros(len(E))
140 | 
141 |     for current_s in range(len(E)):
142 |         for prev_s in range(len(E)):
143 |             new_prob = alpha_prev[prev_s] * trans_mat[prev_s, current_s] * E[current_s] / n
144 | 
145 |             if new_prob > results[current_s]:
146 |                 results[current_s] = new_prob
147 |                 back_track_states[current_s] = prev_s
148 | 
149 |     return results, back_track_states
150 | 
151 | 
152 | @njit
153 | def viterbi(probabilities, transitions, init_start):
154 |     n = len(probabilities)
155 |     forwards_in = np.zeros( (n, len(init_start)) )
156 |     backtracks = np.zeros( (n, len(init_start)), dtype=np.int32)
157 | 
158 |     for t in range(n):
159 |         if t == 0:
160 |             forwards_in[t,:] = init_start * probabilities[t,:]
161 |             scale_param = np.sum( forwards_in[t,:])
162 |             forwards_in[t,:] = forwards_in[t,:] / scale_param
163 |         else:
164 |             forwards_in[t,:], backtracks[t,:] = fwd_step_keep_track(forwards_in[t-1,:], probabilities[t,:], transitions)
165 | 
166 |     return forwards_in, backtracks
167 | 
168 | 
169 | @njit
170 | def calculate_log(x):
171 |     return np.log(x)
172 | 
173 | 
174 | @njit
175 | def hybrid_step(prev, alpha, em, trans):
176 |     value = prev + alpha * calculate_log(em * trans)
177 |     best_state = np.argmax(value)
178 |     max_prob = value[best_state]
179 | 
180 |     return best_state, max_prob
181 | 
182 | 
183 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
184 | # Train
185 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
186 | 
187 | def logoutput(hmm_parameters, loglikelihood, iteration):
188 | 
189 |     n_states = len(hmm_parameters.emissions)
190 | 
191 |     # Make header
192 |     if iteration == 0:
193 |         print_emissions = '\t'.join(['emis{0}'.format(x + 1) for x in range(n_states)])
194 |         print_starting_probabilities = '\t'.join(['start{0}'.format(x + 1) for x in range(n_states)])
195 |         print_transitions = '\t'.join(['trans{0}_{0}'.format(x + 1) for x in range(n_states)])
196 |         print('iteration', 'loglikelihood', print_starting_probabilities, print_emissions, print_transitions, sep = '\t')
197 | 
198 |     # Print parameters
199 |     print_emissions = '\t'.join([str(x) for x in np.matrix.round(hmm_parameters.emissions, 4)])
200 |     print_starting_probabilities = '\t'.join([str(x) for x in np.matrix.round(hmm_parameters.starting_probabilities, 3)])
201 |     print_transitions = '\t'.join([str(x) for x in np.matrix.round(hmm_parameters.transitions, 4).diagonal()])
202 |     print(iteration, round(loglikelihood, 4), print_starting_probabilities, print_emissions, print_transitions, sep = '\t')
203 | 
204 | 
205 | 
206 | def TrainBaumWelch(hmm_parameters, weights, obs, mutrates):
207 |     """
208 |     Performs a single Baum-Welch update of the parameters, using the forward-backward algorithm.
209 |     """
210 | 
211 |     n_states = len(hmm_parameters.starting_probabilities)
212 | 
213 |     emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates)
214 |     forward_probs, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
215 |     backward_probs = backward(emissions, hmm_parameters.transitions, scales)
216 | 
217 |     # Update starting probs
218 |     posterior_probs = forward_probs * backward_probs
219 |     normalize = np.sum(posterior_probs)
220 |     new_starting_probabilities = np.sum(posterior_probs, axis=0)/normalize
221 | 
222 |     # Update emission
223 |     new_emissions_matrix = np.zeros((n_states))
224 |     for state in range(n_states):
225 |         top = np.sum(posterior_probs[:,state] * obs)
226 |         bottom = np.sum(posterior_probs[:,state] * (weights * mutrates) )
227 |         new_emissions_matrix[state] = top/bottom
228 | 
229 |     # Update Transition probs
230 |     new_transitions_matrix = np.zeros((n_states, n_states))
231 |     for state1 in range(n_states):
232 |         for state2 in range(n_states):
233 |             new_transitions_matrix[state1,state2] = np.sum( forward_probs[:-1,state1] * backward_probs[1:,state2] * hmm_parameters.transitions[state1, state2] * emissions[1:,state2]/ scales[1:] )
234 |     new_transitions_matrix /= new_transitions_matrix.sum(axis=1)[:,np.newaxis]
235 | 
236 |     return HMMParam(hmm_parameters.state_names, new_starting_probabilities, new_transitions_matrix, new_emissions_matrix)
237 | 
238 | 
239 | def TrainModel(obs, mutrates, weights, hmm_parameters, epsilon = 1e-3, maxiterations = 1000):
240 | 
241 |     # Get probability of data with initial parameters
242 |     previous_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates)
243 |     logoutput(hmm_parameters, previous_loglikelihood, 0)
244 | 
245 |     # Train parameters using Baum Welch algorithm
246 |     for i in range(1,maxiterations):
247 |         hmm_parameters = TrainBaumWelch(hmm_parameters, weights, obs, mutrates)
248 |         new_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates)
249 |         logoutput(hmm_parameters, new_loglikelihood, i)
250 | 
251 |         if new_loglikelihood - previous_loglikelihood < epsilon:
252 |             break
253 | 
254 |         previous_loglikelihood = new_loglikelihood
255 | 
256 |     # Write the optimal parameters
257 |     return hmm_parameters
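# Training sketch (editorial; assumes the files created by `hmmix make_test_data` are present):
#
#   from helper_functions import Load_observations_weights_mutrates
#   obs, _, _, _, mutrates, weights = Load_observations_weights_mutrates(
#       'obs.txt', 'weights.bed', 'mutrates.bed', 1000, False, 'All')
#   hmm_parameters = read_HMM_parameters_from_file(None)   # default human/archaic parameters
#   trained = TrainModel(obs, mutrates, weights, hmm_parameters)
#   write_HMM_to_file(trained, 'trained.json')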
209 | """ 210 | 211 | n_states = len(hmm_parameters.starting_probabilities) 212 | 213 | emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates) 214 | forward_probs, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities) 215 | backward_probs = backward(emissions, hmm_parameters.transitions, scales) 216 | 217 | # Update starting probs 218 | posterior_probs = forward_probs * backward_probs 219 | normalize = np.sum(posterior_probs) 220 | new_starting_probabilities = np.sum(posterior_probs, axis=0)/normalize 221 | 222 | # Update emission 223 | new_emissions_matrix = np.zeros((n_states)) 224 | for state in range(n_states): 225 | top = np.sum(posterior_probs[:,state] * obs) 226 | bottom = np.sum(posterior_probs[:,state] * (weights * mutrates) ) 227 | new_emissions_matrix[state] = top/bottom 228 | 229 | # Update Transition probs 230 | new_transitions_matrix = np.zeros((n_states, n_states)) 231 | for state1 in range(n_states): 232 | for state2 in range(n_states): 233 | new_transitions_matrix[state1,state2] = np.sum( forward_probs[:-1,state1] * backward_probs[1:,state2] * hmm_parameters.transitions[state1, state2] * emissions[1:,state2]/ scales[1:] ) 234 | new_transitions_matrix /= new_transitions_matrix.sum(axis=1)[:,np.newaxis] 235 | 236 | return HMMParam(hmm_parameters.state_names,new_starting_probabilities, new_transitions_matrix, new_emissions_matrix) 237 | 238 | 239 | def TrainModel(obs, mutrates, weights, hmm_parameters, epsilon = 1e-3, maxiterations = 1000): 240 | 241 | # Get probability of data with initial parameters 242 | previous_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates) 243 | logoutput(hmm_parameters, previous_loglikelihood, 0) 244 | 245 | # Train parameters using Baum Welch algorithm 246 | for i in range(1,maxiterations): 247 | hmm_parameters = TrainBaumWelsch(hmm_parameters, weights, obs, mutrates) 248 | new_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates) 249 | logoutput(hmm_parameters, new_loglikelihood, i) 250 | 251 | if new_loglikelihood - previous_loglikelihood < epsilon: 252 | break 253 | 254 | previous_loglikelihood = new_loglikelihood 255 | 256 | # Write the optimal parameters 257 | return hmm_parameters 258 | 259 | 260 | 261 | 262 | 263 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 264 | # Decode (posterior decoding) 265 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 266 | 267 | def Calculate_Posterior_probabillities(emissions, hmm_parameters): 268 | """Get posterior probability of being in state s at time t""" 269 | 270 | forward_probs, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities) 271 | backward_probs = backward(emissions, hmm_parameters.transitions, scales) 272 | posterior_probabilities = (forward_probs * backward_probs).T 273 | 274 | return posterior_probabilities 275 | 276 | 277 | def PMAP_path(posterior_probabilities): 278 | """Get maximum posterior decoding path""" 279 | path = np.argmax(posterior_probabilities, axis = 0) 280 | return path 281 | 282 | 283 | def Viterbi_path(emissions, hmm_parameters): 284 | """Get Viterbi path - aka most likeli path""" 285 | n_obs, _ = emissions.shape 286 | 287 | viterbi_probs, backtracks = viterbi(emissions, hmm_parameters.transitions, 
283 | def Viterbi_path(emissions, hmm_parameters):
284 |     """Get Viterbi path - aka the most likely path"""
285 |     n_obs, _ = emissions.shape
286 | 
287 |     viterbi_probs, backtracks = viterbi(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
288 | 
289 |     # backtracking
290 |     viterbi_path = np.zeros(n_obs, dtype = int)
291 |     viterbi_path[-1] = np.argmax(viterbi_probs[-1,:])
292 |     for t in range(n_obs - 2, -1, -1):
293 |         viterbi_path[t] = backtracks[t + 1, viterbi_path[t + 1]]
294 | 
295 |     return viterbi_path
296 | 
297 | @njit
298 | def Hybrid_path(emissions, starting_probs, trans_matrix, logged_posterior_probabilities, ALPHA):
299 |     """
300 |     Decodes the model using the hybrid method. When the alpha parameter is 0 this is the same as posterior decoding and when it is 1 it is the same as Viterbi
301 |     """
302 |     n_obs, n_states = emissions.shape
303 |     BETA = 1 - ALPHA
304 | 
305 |     # for backtracking the hybrid method
306 |     delta = np.zeros((n_obs, n_states))
307 |     psi = np.zeros((n_obs - 1, n_states), dtype = np.int32)
308 | 
309 |     # initialize
310 |     delta[0,:] = ALPHA * calculate_log(starting_probs * emissions[0,:]) + BETA * logged_posterior_probabilities[0,:]
311 | 
312 |     for t in range(1, n_obs):
313 |         for state in range(n_states):
314 |             best_state, max_prob = hybrid_step(delta[t-1, :], ALPHA, emissions[t,state], trans_matrix[:, state])
315 |             delta[t,state] = max_prob + BETA * logged_posterior_probabilities[t,state]
316 |             psi[t-1,state] = best_state
317 | 
318 | 
319 |     # backtracking
320 |     path = np.zeros(n_obs, dtype = np.int32)
321 |     path[-1] = np.argmax(delta[-1,:])
322 |     for i in range(n_obs - 2, -1, -1):
323 |         path[i] = psi[i, path[i + 1]]
324 | 
325 |     return path
326 | 
327 | 
328 | 
329 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
330 | # inhomogeneous markov chain
331 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
332 | 
333 | @njit
334 | def Simulate_values(p):
335 |     return np.random.binomial(1, p)
336 | 
337 | def Simulate_transition(n_states, matrix, current_state):
338 | 
339 |     # prob of staying (can be done with numba so it's quick)
340 |     next_state = Simulate_values(matrix[current_state])
341 |     if next_state == 1:
342 |         return current_state
343 |     else:
344 |         if n_states == 2:
345 |             return abs(current_state - 1)
346 |         else:
347 |             new_matrix = [matrix[x] for x in range(n_states) if current_state != x]
348 |             new_matrix /= np.sum(new_matrix)
349 |             new_states = [x for x in range(n_states) if current_state != x]
350 | 
351 |             return np.random.choice(new_states, p=new_matrix)
352 | 
353 | 
354 | 
355 | def Make_inhomogeneous_transition_matrix(emissions, hmm_parameters):
356 |     """
357 |     Calculate transition matrix for each position in the sequence (given the data)
358 |     """
359 | 
360 |     n_obs, n_states = emissions.shape
361 |     _, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
362 |     backward_probs = backward(emissions, hmm_parameters.transitions, scales)
363 | 
364 |     # Make and initialise new transition matrix
365 |     new_transition_matrix = np.zeros((n_obs, n_states, n_states))
366 | 
367 |     # starting probabilities
368 |     sim_starting_probabilities = np.zeros(n_states)
369 |     for state in range(n_states):
370 |         sim_starting_probabilities[state] = hmm_parameters.starting_probabilities[state] * backward_probs[0, state] * emissions[0, state] / scales[0]
371 | 
372 |     for state in range(n_states):
373 |         for otherstate in range(n_states):
374 |             new_transition_matrix[1:, otherstate, state] = (backward_probs[1:, state] ) / (backward_probs[:-1, 
otherstate] * scales[1:]) * hmm_parameters.transitions[otherstate, state] * emissions[1:, state] 375 | 376 | return sim_starting_probabilities, new_transition_matrix 377 | 378 | 379 | 380 | def Simulate_from_transition_matrix(sim_starting_probabilities, new_transition_matrix): 381 | 382 | number_observations, n_states, _, = new_transition_matrix.shape 383 | sim_path = np.zeros(number_observations, dtype=int) 384 | 385 | # set start state 386 | current_state = np.random.choice(n_states, p=sim_starting_probabilities) 387 | sim_path[0] = current_state 388 | 389 | for t in range(1, number_observations): 390 | next_state = Simulate_transition(n_states, new_transition_matrix[t,current_state,:], current_state) 391 | sim_path[t] = next_state 392 | current_state = next_state 393 | 394 | return sim_path 395 | 396 | 397 | 398 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 399 | # Write segments to output file 400 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 401 | 402 | def Write_inhomogeneous_transition_matrix(chroms, starts, weights, mutrates, variants, hmm_parameters, new_transition_matrix, filename): 403 | n_states = len(hmm_parameters.state_names) 404 | 405 | with open(filename, 'w') as out: 406 | state_combs = [] 407 | for state1 in hmm_parameters.state_names: 408 | for state2 in hmm_parameters.state_names: 409 | state_combs.append(f'{state1}_{state2}') 410 | 411 | state_combs = '\t'.join(state_combs) 412 | print('chrom', 'start', 'called_sequence', 'mutationrate', state_combs,'variants', sep = '\t', file = out) 413 | 414 | for (chrom, start, w, m, transvalues, var) in zip(chroms, starts, weights, mutrates, new_transition_matrix, variants): 415 | posterior_to_print = [] 416 | for state1 in range(n_states): 417 | for state2 in range(n_states): 418 | posterior_to_print.append(str(round(transvalues[state1, state2], 4))) 419 | posterior_to_print = '\t'.join(posterior_to_print) 420 | 421 | print(chrom, start, w, m, posterior_to_print, var, sep = '\t', file = out) 422 | 423 | 424 | def Write_posterior_probs(chroms, starts, weights, mutrates, post_seq, path, variants, hmm_parameters, filename): 425 | post_seq = post_seq.T 426 | 427 | with open(filename, 'w') as out: 428 | state_names = '\t'.join(hmm_parameters.state_names) 429 | print('chrom', 'start', 'called_sequence', 'mutationrate', state_names,'state','variants', sep = '\t', file = out) 430 | 431 | for (chrom, start, w, m, posterior, state, var) in zip(chroms, starts, weights, mutrates, post_seq, path, variants): 432 | posterior_to_print = '\t'.join([str(round(x, 4)) for x in posterior]) 433 | print(chrom, start, w, m, posterior_to_print, hmm_parameters.state_names[state], var, sep = '\t', file = out) 434 | 435 | 436 | 437 | def Convert_genome_coordinates(window_size, CHROMOSOME_BREAKPOINTS, starts, variants, post_seq, path, hmm_parameters, weights, mutrates, obs): 438 | 439 | segments = [] 440 | for (chrom, chrom_start_index, chrom_length_index) in CHROMOSOME_BREAKPOINTS: 441 | 442 | # Diploid or haploid 443 | if '_hap' in chrom: 444 | newchrom, ploidity = chrom.split('_') 445 | else: 446 | ploidity = 'diploid' 447 | newchrom = chrom 448 | 449 | state_with_highest_prob = path[chrom_start_index:chrom_start_index + chrom_length_index] 450 | 451 | for (state, start_index, length_index) in 
find_runs(state_with_highest_prob):
452 | 
453 |             start_index = start_index + chrom_start_index
454 |             end_index = start_index + length_index
455 | 
456 |             genome_start = starts[start_index]
457 |             genome_length = length_index * window_size
458 |             genome_end = genome_start + genome_length
459 | 
460 |             called_sequence = int(np.sum(weights[start_index:end_index]) * window_size)
461 |             average_mutation_rate = round(np.mean(mutrates[start_index:end_index]), 3)
462 | 
463 |             snp_counter = np.sum(obs[start_index:end_index])
464 |             mean_prob = round(np.mean(post_seq[state, start_index:end_index]), 5)
465 |             variants_segment = flatten_list(variants[start_index:end_index])
466 | 
467 |             segments.append([newchrom, genome_start, genome_end, genome_length, hmm_parameters.state_names[state], mean_prob, snp_counter, ploidity, called_sequence, average_mutation_rate, variants_segment])
468 | 
469 |     return segments
470 | 
471 | 
472 | 
473 | def Write_Decoded_output(outputprefix, segments, obs_file = None, admixpop_file = None, extrainfo = False):
474 | 
475 |     # Load archaic data
476 |     if admixpop_file is not None:
477 |         admix_pop_variants, admixpop_names = Annotate_with_ref_genome(admixpop_file, obs_file)
478 | 
479 |     # Are we doing haploid/diploid?
480 |     outfile_mapper = {}
481 |     for _, _, _, _, _, _, _, ploidity, _, _, _ in segments:
482 |         if outputprefix == '/dev/stdout':
483 |             outfile_mapper[ploidity] = '/dev/stdout'
484 |         else:
485 |             outfile_mapper[ploidity] = f'{outputprefix}.{ploidity}.txt'
486 | 
487 | 
488 |     # Make output files and write headers
489 |     outputfiles_handlers = defaultdict(str)
490 |     for ploidity, output in outfile_mapper.items():
491 | 
492 |         Make_folder_if_not_exists(output)
493 |         outputfiles_handlers[ploidity] = open(output, 'w')
494 |         out = outputfiles_handlers[ploidity]
495 | 
496 |         if admixpop_file is not None:
497 |             if extrainfo:
498 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\tadmixpopvariants\t{}\tcalled_sequence\tmutationrate\tvariants\n'.format('\t'.join(admixpop_names)))
499 |             else:
500 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\tadmixpopvariants\t{}\n'.format('\t'.join(admixpop_names)))
501 |         else:
502 |             if extrainfo:
503 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\tcalled_sequence\tmutationrate\tvariants\n')
504 |             else:
505 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\n')
506 | 
507 |     # Go through segments and write to output
508 |     for chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, ploidity, called_sequence, average_mutation_rate, variants in segments:
509 | 
510 |         out = outputfiles_handlers[ploidity]
511 | 
512 |         if admixpop_file is not None:
513 |             archaic_variants_dict = defaultdict(int)
514 |             for snp_position in variants.split(','):
515 |                 carriers = admix_pop_variants[f'{chrom}_{snp_position}']
516 |                 if carriers != '':
517 |                     if '|' in carriers:
518 |                         for ind in carriers.split('|'):
519 |                             archaic_variants_dict[ind] += 1
520 |                     else:
521 |                         archaic_variants_dict[carriers] += 1
522 | 
523 |                     archaic_variants_dict['total'] += 1
524 | 
525 |             archaic_variants = '\t'.join([str(archaic_variants_dict[x]) for x in ['total'] + admixpop_names])
526 | 
527 |             if extrainfo:
528 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, archaic_variants, called_sequence, average_mutation_rate, variants, sep = '\t', file = out)
529 |             else:
530 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, archaic_variants, sep = '\t', file = out)
531 | 
532 |         else:
533 | 
534 |             if extrainfo:
535 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, called_sequence, average_mutation_rate, variants, sep = '\t', file = out)
536 |             else:
537 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, sep = '\t', file = out)
538 | 
539 | 
540 |     # Close output files
541 |     for ploidity, out in outputfiles_handlers.items():
542 |         out.close()
543 | 
544 | 
545 | 
546 | 
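# End-to-end decoding sketch (editorial; mirrors the `decode` mode in main.py,
# file names illustrative):
#
#   from helper_functions import Load_observations_weights_mutrates, find_runs
#   obs, chroms, starts, variants, mutrates, weights = Load_observations_weights_mutrates(
#       'obs.txt', 'weights.bed', 'mutrates.bed', 1000, False, 'All')
#   hmm_parameters = read_HMM_parameters_from_file('trained.json')
#   emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates)
#   posterior_probs = Calculate_Posterior_probabillities(emissions, hmm_parameters)
#   path = PMAP_path(posterior_probs)
#   breakpoints = [x for x in find_runs(chroms)]
#   segments = Convert_genome_coordinates(1000, breakpoints, starts, variants,
#                                         posterior_probs, path, hmm_parameters, weights, mutrates, obs)
#   Write_Decoded_output('/dev/stdout', segments)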
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import numpy as np
3 | import sys
4 | from hmm_functions import TrainModel, write_HMM_to_file, read_HMM_parameters_from_file, Write_Decoded_output, Calculate_Posterior_probabillities, PMAP_path, Viterbi_path, Hybrid_path, Convert_genome_coordinates, Write_posterior_probs, Make_inhomogeneous_transition_matrix, Simulate_from_transition_matrix, Write_inhomogeneous_transition_matrix, Emission_probs_poisson
5 | from bcf_vcf import make_out_group, make_ingroup_obs
6 | from make_test_data import simulate_path, write_data
7 | from make_mutationrate import make_mutation_rate
8 | from helper_functions import Load_observations_weights_mutrates, handle_individuals_input, handle_infiles, combined_files, find_runs
9 | from artemis import Find_best_alpha
10 | 
11 | VERSION = '0.8.2'
12 | 
13 | def print_script_usage():
14 |     toprint = f'''
15 |     Script for identifying introgressed archaic segments (version: {VERSION})
16 | 
17 |     > Tutorial:
18 |     hmmix make_test_data
19 |     hmmix train  -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json
20 |     hmmix decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json
21 | 
22 | 
23 |     Different modes (you can also see the options for each by writing hmmix make_test_data -h):
24 |     > make_test_data
25 |         -windows            Number of Kb windows to create (defaults to 50,000 per chromosome)
26 |         -chromosomes        Number of chromosomes to simulate (defaults to 2)
27 |         -no_out_files       Don't create obs.txt, mutrates.bed, weights.bed, Initialguesses.json (default is to create them)
28 |         -param              markov parameters file (default is human/neanderthal like parameters)
29 |         -seed               Set seed (default is 42)
30 | 
31 |     > mutation_rate
32 |         -outgroup           [required] path to variants found in outgroup
33 |         -out                outputfile (defaults to mutationrate.bed)
34 |         -weights            file with callability (defaults to all positions being called)
35 |         -window_size        size of bins (defaults to 1 Mb)
36 | 
37 |     > create_outgroup
38 |         -ind                [required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2
39 |         -vcf                [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
40 |         -weights            file with callability (defaults to all positions being called)
41 |         -out                outputfile (defaults to stdout)
42 |         -ancestral          fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
43 |         -refgenome          fasta file with reference genome - comma-separated list or wildcards like vcf argument (default none)
44 | 
45 |     > create_ingroup
46 |         -ind                [required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2
47 |         -vcf                [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
48 |         -outgroup           [required] path to variants found in outgroup
49 |         -weights            file with callability (defaults to all positions being called)
50 |         -out                outputfile prefix (default is a file named obs.<ind>.txt where <ind> is the name of individual in ingroup/outgroup list)
51 |         -ancestral          fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
52 | 
53 |     > train
54 |         -obs                [required] file with observation data
55 |         -chrom              Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3 (default is use all chromosomes)
56 |         -weights            file with callability (defaults to all positions being called)
57 |         -mutrates           file with mutation rates (default is mutation rate is uniform)
58 |         -param              markov parameters file (default is human/neanderthal like parameters)
59 |         -out                outputfile (default is a file named trained.json)
60 |         -window_size        size of bins (default is 1000 bp)
61 |         -haploid            Change from using diploid data to haploid data (default is diploid)
62 | 
63 |     > decode
64 |         -obs                [required] file with observation data
65 |         -chrom              Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3 (default is use all chromosomes)
66 |         -weights            file with callability (defaults to all positions being called)
67 |         -mutrates           file with mutation rates (default is mutation rate is uniform)
68 |         -param              markov parameters file (default is human/neanderthal like parameters)
69 |         -out                outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)
70 |         -window_size        size of bins (default is 1000 bp)
71 |         -haploid            Change from using diploid data to haploid data (default is diploid)
72 |         -admixpop           Annotate using vcffile with admixing population (default is none)
73 |         -extrainfo          Add variant position for each SNP (default is off)
74 |         -viterbi            decode using the viterbi algorithm (default is posterior decoding)
75 |         -hybrid             decode using the hybrid algorithm. Set value between 0 and 1 where 0=posterior and 1=viterbi
76 |         -posterior_probs    File location for posterior probs
77 | 
78 |     > inhomogeneous
79 |         -obs                [required] file with observation data
80 |         -chrom              Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3 (default is use all chromosomes)
81 |         -weights            file with callability (defaults to all positions being called)
82 |         -mutrates           file with mutation rates (default is mutation rate is uniform)
83 |         -param              markov parameters file (default is human/neanderthal like parameters)
84 |         -out                outputfile prefix .hap1_sim(0-n).txt and .hap2_sim(0-n).txt if -haploid option is used or .diploid_(0-n).txt (default is stdout)
85 |         -window_size        size of bins (default is 1000 bp)
86 |         -haploid            Change from using diploid data to haploid data (default is diploid)
87 |         -samples            Number of simulated paths for the inhomogeneous markov chain (default is 100)
88 |         -admixpop           Annotate using vcffile with admixing population (default is none)
89 |         -extrainfo          Add variant position for each SNP (default is off)
90 |         -inhomogen_matrix   File location for inhomogeneous transition matrix
91 | 
92 |     > artemis
93 |         -param              [required] markov parameters file (default is human/neanderthal like parameters)
94 |         -out_plot           File path for artemis plot - can be pdf or jpg (default is Artemis_plot.pdf)
95 |         -out                Save alphas, likelihoods and pointwise accuracy to file (default is stdout)
96 |         -windows            Number of Kb windows to create (defaults to 500,000)
97 |         -iterations         Number of iterations (defaults to 10)
98 |         -start              First alpha value to simulate (default is 0)
99 |         -end                Last alpha value to simulate (default is 1)
100 |         -steps              Number of alpha values to simulate between start and end (defaults to 101)
101 |         -seed               Set seed (default is 42)
102 |     '''
103 | 
104 |     return toprint
105 | 
106 | 
107 | 
108 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
109 | # Main
110 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
111 | def main():
112 | 
113 |     parser = argparse.ArgumentParser(description=print_script_usage(), formatter_class=argparse.RawTextHelpFormatter)
114 | 
115 |     subparser = parser.add_subparsers(dest = 'mode')
116 | 
117 |     # Make test data
118 |     test_subparser = subparser.add_parser('make_test_data', help='Create test data')
119 |     test_subparser.add_argument("-windows", metavar='',help="Number of Kb windows to create (defaults to 50,000 per chromosome)", type=int, default = 50000)
120 |     test_subparser.add_argument("-chromosomes", metavar='',help="Number of chromosomes to simulate (defaults to 2)", type=int, default = 2)
121 |     test_subparser.add_argument("-no_out_files",help="Don't create obs.txt, mutrates.bed, weights.bed, Initialguesses.json (default is to create them)", action='store_false', default = True)
122 |     test_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str)
123 |     test_subparser.add_argument("-seed", metavar='',help="set seed", type=int, default=42)
124 | 
125 |     # Make outgroup
126 |     outgroup_subparser = subparser.add_parser('create_outgroup', help='Create outgroup information')
127 |     outgroup_subparser.add_argument("-ind",help="[required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2", type=str, required = True)
ind1,ind2", type=str, required = True) 128 | outgroup_subparser.add_argument("-vcf",help="[required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf", type=str, required = True) 129 | outgroup_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 130 | outgroup_subparser.add_argument("-out", metavar='',help="outputfile (defaults to stdout)", default = '/dev/stdout') 131 | outgroup_subparser.add_argument("-ancestral", metavar='',help="fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)", default='') 132 | outgroup_subparser.add_argument("-refgenome", metavar='',help="fasta file with reference genome - comma-separated list or wildcards like vcf argument (default none)", default='') 133 | 134 | # Make mutation rates 135 | mutation_rate = subparser.add_parser('mutation_rate', help='Estimate mutation rate') 136 | mutation_rate.add_argument("-outgroup", help="[required] path to variants found in outgroup", type=str, required = True) 137 | mutation_rate.add_argument("-out", metavar='',help="outputfile (defaults to mutationrate.bed)", default = 'mutationrate.bed') 138 | mutation_rate.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 139 | mutation_rate.add_argument("-window_size", metavar='',help="size of bins (defaults to 1 Mb)", type=int, default = 1000000) 140 | 141 | # Make ingroup observations 142 | create_obs_subparser = subparser.add_parser('create_ingroup', help='Create ingroup data') 143 | create_obs_subparser.add_argument("-ind", help="[required] ingroup/outgrop list (json file) or comma-separated list e.g. ind1,ind2", type=str, required = True) 144 | create_obs_subparser.add_argument("-vcf", help="[required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. 
chr*.bcf", type=str, required = True) 145 | create_obs_subparser.add_argument("-outgroup", help="[required] path to variant found in outgroup", type=str, required = True) 146 | create_obs_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 147 | create_obs_subparser.add_argument("-out", metavar='',help="outputfile prefix (default is a file named obs..txt where ind is the name of individual in ingroup/outgrop list)", default = 'obs') 148 | create_obs_subparser.add_argument("-ancestral", metavar='',help="fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)", default='') 149 | 150 | # Train model 151 | train_subparser = subparser.add_parser('train', help='Train HMM') 152 | train_subparser.add_argument("-obs",help="[required] file with observation data", type=str, required = True) 153 | train_subparser.add_argument("-chrom",help="Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3", type=str, default='All') 154 | train_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 155 | train_subparser.add_argument("-mutrates", metavar='',help="file with mutation rates (default is mutation rate is uniform)") 156 | train_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str) 157 | train_subparser.add_argument("-out", metavar='',help="outputfile (default is a file named trained.json)", default = 'trained.json') 158 | train_subparser.add_argument("-window_size", metavar='',help="size of bins (default is 1000 bp)", type=int, default = 1000) 159 | train_subparser.add_argument("-haploid",help="Change from using diploid data to haploid data (default is diploid)", action='store_true', default = False) 160 | 161 | # Decode model 162 | decode_subparser = subparser.add_parser('decode', help='Decode HMM') 163 | decode_subparser.add_argument("-obs",help="[required] file with observation data", type=str, required = True) 164 | decode_subparser.add_argument("-chrom",help="Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3", type=str, default='All') 165 | decode_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 166 | decode_subparser.add_argument("-mutrates", metavar='',help="file with mutation rates (default is mutation rate is uniform)") 167 | decode_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str) 168 | decode_subparser.add_argument("-out", metavar='',help="outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)", default = '/dev/stdout') 169 | decode_subparser.add_argument("-window_size", metavar='',help="size of bins (default is 1000 bp)", type=int, default = 1000) 170 | decode_subparser.add_argument("-haploid",help="Change from using diploid data to haploid data (default is diploid)", action='store_true', default = False) 171 | decode_subparser.add_argument("-admixpop",help="Annotate using vcffile with admixing population (default is none)") 172 | decode_subparser.add_argument("-extrainfo",help="Add archaic information on each SNP", action='store_true', default = False) 173 | decode_subparser.add_argument("-viterbi",help="Decode using the Viterbi algorithm", action='store_true', default = 
174 |     decode_subparser.add_argument("-hybrid",help="Decode using the hybrid algorithm. Set value between 0 and 1 where 0=posterior and 1=viterbi", type=float, default = -1)
175 |     decode_subparser.add_argument("-posterior_probs",help="File location for posterior probs", default = None)
176 | 
177 |     # inhomogeneous markov chain
178 |     inhomogen_subparser = subparser.add_parser('inhomogeneous', help='Make inhomogeneous markov chain')
179 |     inhomogen_subparser.add_argument("-obs",help="[required] file with observation data", type=str, required = True)
180 |     inhomogen_subparser.add_argument("-chrom",help="Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3", type=str, default='All')
181 |     inhomogen_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)")
182 |     inhomogen_subparser.add_argument("-mutrates", metavar='',help="file with mutation rates (default is mutation rate is uniform)")
183 |     inhomogen_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str)
184 |     inhomogen_subparser.add_argument("-out", metavar='',help="outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)", default = '/dev/stdout')
185 |     inhomogen_subparser.add_argument("-window_size", metavar='',help="size of bins (default is 1000 bp)", type=int, default = 1000)
186 |     inhomogen_subparser.add_argument("-haploid",help="Change from using diploid data to haploid data (default is diploid)", action='store_true', default = False)
187 |     inhomogen_subparser.add_argument("-samples",help="Number of paths to sample (default is 100)", type=int, default = 100)
188 |     inhomogen_subparser.add_argument("-admixpop",help="Annotate using vcffile with admixing population (default is none)")
189 |     inhomogen_subparser.add_argument("-extrainfo",help="Add archaic information on each SNP", action='store_true', default = False)
190 |     inhomogen_subparser.add_argument("-inhomogen_matrix",help="File location for inhomogeneous transition matrix", default = None)
191 | 
192 |     # Find best alpha (artemis plots)
193 |     artemis_subparser = subparser.add_parser('artemis', help='Find best alphas and make artemis plots')
194 |     artemis_subparser.add_argument("-param", metavar='',help="[required] markov parameters file (default is human/neanderthal like parameters)", type=str, required = True)
195 |     artemis_subparser.add_argument("-out", metavar='',help="Save alphas, likelihoods and pointwise accuracy to file (default is stdout)", default = '/dev/stdout')
196 |     artemis_subparser.add_argument("-out_plot", metavar='',help="File path for artemis plot - can be pdf or jpg (default is Artemis_plot.pdf)", default = 'Artemis_plot.pdf')
197 |     artemis_subparser.add_argument("-windows", metavar='',help="Number of Kb windows to create (defaults to 500,000)", type=int, default = 500000)
198 |     artemis_subparser.add_argument("-iterations",help="Number of iterations", type = int, default = 10)
199 |     artemis_subparser.add_argument("-start",help="First alpha value to simulate (default is 0)", type=float, default = 0.0)
200 |     artemis_subparser.add_argument("-end",help="Last alpha value to simulate (default is 1)", type=float, default = 1.0)
201 |     artemis_subparser.add_argument("-steps",help="Number of steps (values to simulate between start and end)", type=int, default = 101)
202 |     artemis_subparser.add_argument("-seed", metavar='',help="set seed", type=int, 
default=42) 203 | 204 | args = parser.parse_args() 205 | 206 | # Make test data 207 | # ------------------------------------------------------------------------------------------------------------ 208 | if args.mode == 'make_test_data': 209 | 210 | print('-' * 40) 211 | print(f'> creating {args.chromosomes} chromosomes each with {args.windows} kb of test data with the following parameters..') 212 | hmm_parameters = read_HMM_parameters_from_file(args.param) 213 | print(f'> hmm parameters file: {args.param}') 214 | print(hmm_parameters) 215 | print(f'> Seed is {args.seed}') 216 | print('-' * 40) 217 | 218 | obs, mutrates, weights, path = simulate_path(args.windows, args.chromosomes, hmm_parameters, args.seed) 219 | 220 | if args.no_out_files: 221 | write_data(path, obs, args.windows, args.chromosomes, hmm_parameters, args.seed) 222 | 223 | 224 | # Train parameters 225 | # ------------------------------------------------------------------------------------------------------------ 226 | elif args.mode == 'train': 227 | 228 | hmm_parameters = read_HMM_parameters_from_file(args.param) 229 | obs, _, _, _, mutrates, weights = Load_observations_weights_mutrates(args.obs, args.weights, args.mutrates, args.window_size, args.haploid, args.chrom) 230 | 231 | print('-' * 40) 232 | print(hmm_parameters) 233 | print(f'> chromosomes to use: {args.chrom}') 234 | print(f'> number of windows: {len(obs)}. Number of snps = {sum(obs)}') 235 | print(f'> total callability: {int(np.sum(weights) * args.window_size)} bp ({round(np.sum(weights) / len(obs) * 100,2)} %)') 236 | print('> average mutation rate per bin:', round(np.sum(mutrates * weights) / np.sum(weights), 2) ) 237 | print('> Output is',args.out) 238 | print('> Window size is',args.window_size, 'bp') 239 | print('> Haploid',args.haploid) 240 | print('-' * 40) 241 | 242 | hmm_parameters = TrainModel(obs, mutrates, weights, hmm_parameters) 243 | write_HMM_to_file(hmm_parameters, args.out) 244 | 245 | 246 | # Decode observations using parameters 247 | # ------------------------------------------------------------------------------------------------------------ 248 | elif args.mode == 'decode': 249 | 250 | obs, chroms, starts, variants, mutrates, weights = Load_observations_weights_mutrates(args.obs, args.weights, args.mutrates, args.window_size, args.haploid, args.chrom) 251 | hmm_parameters = read_HMM_parameters_from_file(args.param) 252 | CHROMOSOME_BREAKPOINTS = [x for x in find_runs(chroms)] 253 | 254 | print('-' * 40) 255 | print(hmm_parameters) 256 | print(f'> chromosomes to use: {args.chrom}') 257 | print(f'> number of windows: {len(obs)}. 
Number of snps = {sum(obs)}') 258 | print(f'> total callability: {int(np.sum(weights) * args.window_size)} bp ({round(np.sum(weights) / len(obs) * 100,2)} %)') 259 | print('> average mutation rate per bin:', round(np.sum(mutrates * weights) / np.sum(weights), 2) ) 260 | print('> Output prefix is',args.out) 261 | print('> Window size is',args.window_size, 'bp') 262 | print('> Haploid',args.haploid) 263 | 264 | emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates) 265 | posterior_probs = Calculate_Posterior_probabillities(emissions, hmm_parameters) 266 | 267 | if args.hybrid != -1: 268 | if 0 <= args.hybrid <= 1: 269 | print(f'> Decode using hybrid algorithm with parameter: {args.hybrid}') 270 | print('-' * 40) 271 | logged_posterior_probs = np.log(posterior_probs.T) 272 | path = Hybrid_path(emissions, hmm_parameters.starting_probabilities, hmm_parameters.transitions, logged_posterior_probs, args.hybrid) 273 | else: 274 | sys.exit('\n\nERROR! Hybrid parameter must be between 0 and 1\n\n') 275 | else: 276 | if args.viterbi: 277 | print('> Decode using viterbi algorithm') 278 | print('-' * 40) 279 | path = Viterbi_path(emissions, hmm_parameters) 280 | else: 281 | print('> Decode with posterior decoding') 282 | print('-' * 40) 283 | path = PMAP_path(posterior_probs) 284 | 285 | 286 | if args.posterior_probs is not None: 287 | Write_posterior_probs(chroms, starts, weights, mutrates, posterior_probs, path, variants, hmm_parameters, args.posterior_probs) 288 | 289 | segments = Convert_genome_coordinates(args.window_size, CHROMOSOME_BREAKPOINTS, starts, variants, posterior_probs, path, hmm_parameters, weights, mutrates, obs) 290 | Write_Decoded_output(args.out, segments, args.obs, args.admixpop, args.extrainfo) 291 | 292 | 293 | # inhomogeneous markov chain 294 | # ------------------------------------------------------------------------------------------------------------ 295 | elif args.mode == 'inhomogeneous': 296 | 297 | obs, chroms, starts, variants, mutrates, weights = Load_observations_weights_mutrates(args.obs, args.weights, args.mutrates, args.window_size, args.haploid, args.chrom) 298 | hmm_parameters = read_HMM_parameters_from_file(args.param) 299 | CHROMOSOME_BREAKPOINTS = [x for x in find_runs(chroms)] 300 | 301 | print('-' * 40) 302 | print(hmm_parameters) 303 | print(f'> chromosomes to use: {args.chrom}') 304 | print(f'> number of windows: {len(obs)}. 
Number of snps = {sum(obs)}') 305 | print(f'> total callability: {int(np.sum(weights) * args.window_size)} bp ({round(np.sum(weights) / len(obs) * 100,2)} %)') 306 | print('> average mutation rate per bin:', round(np.sum(mutrates * weights) / np.sum(weights), 2) ) 307 | print('> Output prefix is',args.out) 308 | print('> Window size is',args.window_size, 'bp') 309 | print('> Haploid',args.haploid) 310 | print('-' * 40) 311 | 312 | # Find segments and write output 313 | emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates) 314 | posterior_probs = Calculate_Posterior_probabillities(emissions, hmm_parameters) 315 | starting_probabilities, inhom_transition_matrix = Make_inhomogeneous_transition_matrix(emissions, hmm_parameters) 316 | 317 | if args.inhomogen_matrix is not None: 318 | Write_inhomogeneous_transition_matrix(chroms, starts, weights, mutrates, variants, hmm_parameters, inhom_transition_matrix, args.inhomogen_matrix) 319 | 320 | for sim_number in range(args.samples): 321 | print(f'Running inhomogen markov chain simulation {sim_number + 1}/{args.samples}') 322 | path = Simulate_from_transition_matrix(starting_probabilities, inhom_transition_matrix) 323 | segments = Convert_genome_coordinates(args.window_size, CHROMOSOME_BREAKPOINTS, starts, variants, posterior_probs, path, hmm_parameters, weights, mutrates, obs) 324 | 325 | if args.out == '/dev/stdout': 326 | output = args.out 327 | else: 328 | output = f'{args.out}.{sim_number}' 329 | 330 | Write_Decoded_output(output, segments, args.obs, args.admixpop, args.extrainfo) 331 | 332 | 333 | 334 | # Create outgroup snps (set of snps to be removed) 335 | # ------------------------------------------------------------------------------------------------------------ 336 | elif args.mode == 'create_outgroup': 337 | 338 | # Get list of outgroup individuals 339 | outgroup_individuals = handle_individuals_input(args.ind, 'outgroup') 340 | 341 | # Get a list of vcffiles and ancestral files and intersect them 342 | vcffiles = handle_infiles(args.vcf) 343 | ancestralfiles = handle_infiles(args.ancestral) 344 | refgenomefiles = handle_infiles(args.refgenome) 345 | 346 | ancestralfiles, vcffiles = combined_files(ancestralfiles, vcffiles) 347 | refgenomefiles, vcffiles = combined_files(refgenomefiles, vcffiles) 348 | 349 | print('-' * 40) 350 | print('> Outgroup individuals:', len(outgroup_individuals)) 351 | print('> Using vcf and ancestral files') 352 | for vcffile, ancestralfile, reffile in zip(vcffiles, ancestralfiles, refgenomefiles): 353 | print('vcffile:',vcffile, 'ancestralfile:',ancestralfile, 'reffile:', reffile) 354 | print() 355 | print('> Callability file:', args.weights) 356 | print(f'> Writing output to:', args.out) 357 | print('-' * 40) 358 | 359 | make_out_group(outgroup_individuals, args.weights, vcffiles, args.out, ancestralfiles, refgenomefiles) 360 | 361 | 362 | # Create ingroup observations 363 | # ------------------------------------------------------------------------------------------------------------ 364 | elif args.mode == 'create_ingroup': 365 | 366 | # Get a list of ingroup individuals 367 | ingroup_individuals = handle_individuals_input(args.ind,'ingroup') 368 | 369 | # Get a list of vcffiles and ancestral files and intersect them 370 | vcffiles = handle_infiles(args.vcf) 371 | ancestralfiles = handle_infiles(args.ancestral) 372 | 373 | ancestralfiles, vcffiles = combined_files(ancestralfiles, vcffiles) 374 | 375 | print('-' * 40) 376 | print('> Ingroup individuals:', 
376 |         print('> Ingroup individuals:', len(ingroup_individuals))
377 |         print('> Using vcf and ancestral files')
378 |         for vcffile, ancestralfile in zip(vcffiles, ancestralfiles):
379 |             print('vcffile:',vcffile, 'ancestralfile:',ancestralfile)
380 |         print()
381 |         print('> Using outgroup variants from:', args.outgroup)
382 |         print('> Callability file:', args.weights)
383 |         print(f'> Writing output to file with prefix: {args.out}.<individual>.txt')
384 |         print('-' * 40)
385 |
386 |         make_ingroup_obs(ingroup_individuals, args.weights, vcffiles, args.out, args.outgroup, ancestralfiles)
387 |
388 |
389 |     # Estimate mutation rate
390 |     # ------------------------------------------------------------------------------------------------------------
391 |     elif args.mode == 'mutation_rate':
392 |         print('-' * 40)
393 |         print('> Outgroupfile:', args.outgroup)
394 |         print('> Outputfile is:', args.out)
395 |         print('> Callability file is:', args.weights)
396 |         print('> Window size:', args.window_size)
397 |         print('-' * 40)
398 |
399 |         make_mutation_rate(args.outgroup, args.out, args.weights, args.window_size)
400 |
401 |
402 |     # Find best alphas and make artemis plots
403 |     # ------------------------------------------------------------------------------------------------------------
404 |     elif args.mode == 'artemis':
405 |
406 |         hmm_parameters = read_HMM_parameters_from_file(args.param)
407 |
408 |         print('-' * 40)
409 |         print(hmm_parameters)
410 |         print(f'> Save data to {args.out}')
411 |         print(f'> Save plot to {args.out_plot}')
412 |         print(f'> Number of windows: {args.windows}')
413 |         print(f'> Number of iterations: {args.iterations}')
414 |         print(f'> Test {args.steps} alphas between {args.start} and {args.end}:')
415 |         print(f'> Seed is {args.seed}')
416 |         print('-' * 40)
417 |
418 |         Find_best_alpha(hmm_parameters, args.windows, args.out, args.out_plot, args.iterations, args.start, args.end, args.steps, args.seed)
419 |
420 |
421 |     # Print usage
422 |     # ------------------------------------------------------------------------------------------------------------
423 |     else:
424 |         print(print_script_usage())
425 |
426 |
427 | if __name__ == "__main__":
428 |     main()
429 |
430 |
--------------------------------------------------------------------------------
/src/make_mutationrate.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from collections import defaultdict
3 |
4 | from helper_functions import sortby, make_callability_from_bed, Make_folder_if_not_exists
5 |
6 | def make_mutation_rate(freqfile, outfile, callablefile, window_size):
7 |
8 |     snps_counts_window = defaultdict(lambda: defaultdict(int))
9 |     with open(freqfile) as data:
10 |         for line in data:
11 |             if not line.startswith('chrom'):
12 |                 chrom, pos = line.strip().split()[0:2]
13 |                 pos = int(pos)
14 |                 window = pos - pos%window_size
15 |                 snps_counts_window[chrom][window] += 1
16 |
17 |
18 |     mutations = []
19 |     genome_positions = []
20 |     for chrom in sorted(snps_counts_window, key=sortby):
21 |         lastwindow = max(snps_counts_window[chrom]) + window_size
22 |
23 |         for window in range(0, lastwindow, window_size):
24 |             mutations.append(snps_counts_window[chrom][window])
25 |             genome_positions.append([chrom, window, window + window_size])
26 |
27 |     mutations = np.array(mutations)
28 |
29 |     if callablefile is not None:
30 |         callability = make_callability_from_bed(callablefile, window_size)
31 |         callable_region = []
32 |         for chrom in sorted(snps_counts_window, key=sortby):
33 |             lastwindow = max(snps_counts_window[chrom]) + window_size
34 |             for window in range(0, lastwindow, window_size):
35 |                 callable_region.append(callability[chrom][window]/window_size)
36 |     else:
37 |         callable_region = np.ones(len(mutations)) * window_size  # uniform callability; the constant scale cancels in genome_mean below
38 |
39 |     genome_mean = np.sum(mutations) / np.sum(callable_region)
40 |
41 |     Make_folder_if_not_exists(outfile)
42 |     with open(outfile,'w') as out:
43 |         print('chrom', 'start', 'end', 'mutationrate', sep = '\t', file = out)
44 |         for genome_pos, mut, call in zip(genome_positions, mutations, callable_region):
45 |             chrom, start, end = genome_pos
46 |             if mut * call == 0:
47 |                 ratio = 0
48 |             else:
49 |                 ratio = round(mut/call/genome_mean, 2)
50 |
51 |             print(chrom, start, end, ratio, sep = '\t', file = out)
52 |
--------------------------------------------------------------------------------
/src/make_test_data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numba import njit
3 |
4 | from hmm_functions import HMMParam, write_HMM_to_file, Simulate_transition
5 | from helper_functions import find_runs
6 |
7 |
8 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
9 | # Make test data
10 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
11 |
12 | @njit
13 | def set_seed(value):
14 |     np.random.seed(value)
15 |
16 |
17 | @njit
18 | def simulate_mutation_position(n_mutations):
19 |     return np.random.choice(1000, n_mutations)
20 |
21 |
22 | @njit
23 | def simulate_poisson(lam):
24 |     return np.random.poisson(lam)
25 |
26 |
27 | def simulate_path(data_set_length, n_chromosomes, hmm_parameters, SEED):
28 |     '''Create test data set of size data_set_length. Also create uniform weights and uniform mutation rates'''
29 |
30 |     # Config
31 |     np.random.seed(SEED)
32 |     set_seed(SEED)
33 |
34 |     total_size = data_set_length * n_chromosomes
35 |
36 |     state_values = [x for x in range(len(hmm_parameters.state_names))]
37 |     n_states = len(state_values)
38 |     path = np.zeros(total_size, dtype = int)
39 |
40 |     observations = np.zeros(total_size, dtype = int)
41 |     weights = np.ones(total_size)
42 |     mutrates = np.ones(total_size)
43 |
44 |     for index in range(total_size):
45 |
46 |         # Use prior dist if starting window
47 |         if index == 0:
48 |             current_state = np.random.choice(state_values, p=hmm_parameters.starting_probabilities)
49 |         else:
50 |             current_state = Simulate_transition(n_states, hmm_parameters.transitions[prevstate,:], prevstate)
51 |
52 |
53 |         observations[index] = simulate_poisson(hmm_parameters.emissions[current_state])
54 |         path[index] = current_state
55 |         prevstate = current_state
56 |
57 |
58 |     return observations, mutrates, weights, path
59 |
60 |
61 | def write_data(path, obs, data_set_length, n_chromosomes, hmm_parameters, SEED):
62 |
63 |     # Config
64 |     np.random.seed(SEED)
65 |     set_seed(SEED)
66 |
67 |     window_size = 1000
68 |     bases = np.array(['A','C','G','T'])
69 |
70 |     CHROMOSOMES = [f'chr{x + 1}' for x in range(n_chromosomes)]
71 |     CHROMOSOME_RUNS = []
72 |     previous_start = 0
73 |     for chrom in CHROMOSOMES:
74 |         CHROMOSOME_RUNS.append([chrom, previous_start, data_set_length])
75 |         previous_start += data_set_length
76 |
77 |
78 |
79 |     # Make obs file and true simulated segments
80 |     with open('obs.txt','w') as obs_file, open('simulated_segments.txt', 'w') as out:
81 |         print('chrom', 'pos', 'ancestral_base', 'genotype', sep = '\t', file = obs_file)
82 |         print('chrom', 'start', 'end', 'length', 'state', sep = '\t', file = out)
83 |
84 |         for (chrom, chrom_start_index, chrom_length_index) in CHROMOSOME_RUNS:
85 |             for (state_id, start_index, length_index) in find_runs(path[chrom_start_index:chrom_start_index + chrom_length_index]):
86 |
87 |                 state = hmm_parameters.state_names[state_id]
88 |                 genome_start = start_index * window_size
89 |                 genome_length = length_index * window_size
90 |                 genome_end = genome_start + genome_length
91 |                 print(chrom, genome_start, genome_end, genome_length, state, sep = '\t', file = out)
92 |
93 |                 # write mutations
94 |                 n_mutations_segment = obs[(chrom_start_index + start_index):(chrom_start_index + start_index + length_index)]
95 |                 for index, n_mutations in enumerate(n_mutations_segment):
96 |
97 |                     if n_mutations == 0:
98 |                         continue
99 |
100 |                     for random_int in simulate_mutation_position(n_mutations):
101 |                         mutation = (start_index + index) * window_size + random_int
102 |                         ancestral_base, derived_base = np.random.choice(bases, 2, replace = False)
103 |
104 |                         print(chrom, mutation, ancestral_base, ancestral_base + derived_base, sep = '\t', file = obs_file)
105 |
106 |     # Make weights file and mutation file
107 |     with open('weights.bed','w') as weights_file, open('mutrates.bed','w') as mutrates_file:
108 |         for chrom in CHROMOSOMES:
109 |             print(chrom, 0, data_set_length * window_size, sep = '\t', file = weights_file)
110 |             print(chrom, 0, data_set_length * window_size, 1, sep = '\t', file = mutrates_file)
111 |
112 |     # Make initial guesses
113 |     initial_guess = HMMParam(['Human', 'Archaic'], [0.5, 0.5], [[0.99,0.01],[0.02,0.98]], [0.03, 0.3])
114 |     write_HMM_to_file(initial_guess, 'Initialguesses.json')
115 |
116 |
117 |     return
118 |
119 |
--------------------------------------------------------------------------------
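
The `mutation_rate` mode in `src/main.py` is a thin wrapper around `make_mutation_rate` from `src/make_mutationrate.py`, so the same computation can be driven directly from Python. Below is a minimal sketch of such a call; the input file names (`outgroup.txt`, `callable.bed`) are hypothetical placeholders, and only the function signature shown in the source above is assumed.

```python
# Minimal sketch (not part of the repository): calling make_mutation_rate
# directly instead of via `hmmix mutation_rate`. File names are hypothetical.
from make_mutationrate import make_mutation_rate

make_mutation_rate(
    freqfile='outgroup.txt',      # variants in the outgroup; chrom and pos in the first two columns
    outfile='mutationrate.bed',   # tab-separated output: chrom, start, end, mutationrate
    callablefile='callable.bed',  # BED file of callable regions, or None to treat every window as fully callable
    window_size=1_000_000,        # bin size in bp
)
```

Each reported rate is the window's SNP density divided by the genome-wide mean density, so 1.0 marks a window of average mutability and 0 marks windows with no SNPs or no callable sequence.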
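
Likewise, `src/make_test_data.py` can be used as a library instead of through `hmmix make_test_data`. The sketch below simulates two chromosomes of 50,000 windows each and writes obs.txt, simulated_segments.txt, weights.bed, mutrates.bed and Initialguesses.json; the `HMMParam` values mirror the initial guesses in the source above, and it is assumed that `HMMParam` stores its fields as numpy arrays (which the `transitions[prevstate,:]` indexing in `simulate_path` requires).

```python
# Minimal sketch (not part of the repository): generating simulated data by
# calling the functions from src/make_test_data.py directly.
from hmm_functions import HMMParam
from make_test_data import simulate_path, write_data

# Two-state human/archaic parameters, mirroring the initial guesses above
hmm_parameters = HMMParam(['Human', 'Archaic'],          # state names
                          [0.5, 0.5],                    # starting probabilities
                          [[0.99, 0.01], [0.02, 0.98]],  # transition matrix
                          [0.03, 0.3])                   # Poisson emission rates per 1 kb window

obs, mutrates, weights, path = simulate_path(data_set_length=50_000,
                                             n_chromosomes=2,
                                             hmm_parameters=hmm_parameters,
                                             SEED=42)
write_data(path, obs, 50_000, 2, hmm_parameters, SEED=42)
```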