├── LICENSE
├── README.md
├── setup.py
└── src
    ├── bcf_vcf.py
    ├── helper_functions.py
    ├── hmm_functions.py
    ├── main.py
    ├── make_mutationrate.py
    └── make_test_data.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 LauritsSkov
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Introgression detection (hmmix)
2 |
3 | ## Contact
4 |
5 |
6 |
7 | ---
8 |
9 | ## Helpful files
10 |
11 | The outgroup files, mutation rate files, reference genomes, ancestral allele files and callability files are now premade!
12 |
13 | (hg19 and hg38)
14 |
15 | Using these files I have already called archaic segments in the 1000 genomes and HGDP datasets (hg38 reference coordinate system)
16 |
17 | (archaic introgression callsets for HGDP and 1000 genomes in hg38)
18 |
19 | VCF file containing 4 high-coverage archaic genomes (Altai, Vindija and Chagyrskaya Neanderthals and Denisovan) here:
20 |
21 | (hg19)
22 |
23 | (hg38)
24 |
25 | ---
26 |
27 | If you are working with archaic introgression into present-day humans of non-African ancestry you can use these files and skip the following steps:
28 | Find derived variants in outgroup and Estimate local mutation rate.
29 |
30 | These are the scripts needed to infer archaic introgression in modern populations using an unadmixed outgroup.
31 |
32 | 1. [Installation](#installation)
33 | 2. [Usage](#usage)
34 | 3. [Quick tutorial](#quick-tutorial)
35 | 4. [1000 genomes tutorial](#example-with-1000-genomes-data)
36 |    - [Get data](#getting-data)
37 |    - [Find derived variants in outgroup](#finding-snps-which-are-derived-in-the-outgroup)
38 |    - [Estimate local mutation rate](#estimating-mutation-rate-across-genome)
39 |    - [Find variants in ingroup](#find-a-set-of-variants-which-are-not-derived-in-the-outgroup)
40 |    - [Train the HMM](#training)
41 |    - [Decoding](#decoding)
42 |    - [Phased data](#training-and-decoding-with-phased-data)
43 |    - [Annotate](#annotate-with-known-admixing-population)
44 | 5.
[Run in python](#annotate-with-known-admixing-population)
45 |
46 | ---
47 |
48 | ## Installation
49 |
50 | Run the following command to install:
51 |
52 | ```bash
53 | pip install hmmix
54 | ```
55 |
56 | If you want to work with bcf/vcf files you should also install vcftools and bcftools. You can either use conda or visit their websites.
57 |
58 | ```bash
59 | conda install -c bioconda vcftools bcftools
60 | ```
61 |
62 | ![Overview of model](https://user-images.githubusercontent.com/30321818/43464826-4d11d46c-94dc-11e8-8f1a-6851aa5d9125.jpg)
63 |
64 | The model works by removing variation found in an outgroup population and then using the remaining variants to group the genome into regions of different variant density. If the model works well we would expect that introgressed regions have a higher variant density than non-introgressed regions - because they have spent more time accumulating variation that is not found in the outgroup.
65 |
66 | An example on simulated data is provided below:
67 |
68 | ![het_vs_archaic](https://user-images.githubusercontent.com/30321818/46877046-217eff80-ce40-11e8-9010-edb544e3e1ee.png)
69 |
70 | In this example we zoom in on 1 Mb of simulated data for a haploid genome. The top panel shows the coalescence times with the outgroup across the region, and the green segment is an archaic introgressed segment. Notice how much deeper the coalescence time with the outgroup is. The second panel shows the probability of being in the archaic state. We can see that the probability is much higher in the archaic segment, demonstrating that in this toy example the model is working like we would hope. The next panel is the SNP density if you don't remove the SNPs found in the outgroup. By looking at this one can't tell where the archaic segment begins and ends, or even if there is one. The bottom panel is the SNP density when all variation in the outgroup is removed. Notice that now it is much clearer where the archaic segment begins and ends!
71 |
72 | The method is published in PLOS Genetics and can be found here: [Detecting archaic introgression using an unadmixed outgroup](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007641). The paper describes and evaluates the method.
73 |
74 | ---
75 |
76 | ## Usage
77 |
78 | ```note
79 | Script for identifying introgressed archaic segments
80 |
81 | > Tutorial:
82 | hmmix make_test_data
83 | hmmix train -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json
84 | hmmix decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json
85 |
86 |
87 | Different modes (you can also see the options for each by writing hmmix make_test_data -h):
88 | > make_test_data
89 | -windows Number of Kb windows to create (defaults to 50,000 per chromosome)
90 | -chromosomes Number of chromosomes to simulate (defaults to 2)
91 | -nooutfiles Don't create obs.txt, mutrates.bed, weights.bed, Initialguesses.json, simulated_segments.txt (defaults to yes)
92 | -param markov parameters file (default is human/neanderthal like parameters)
93 |
94 | > mutation_rate
95 | -outgroup [required] path to variants found in outgroup
96 | -out outputfile (defaults to mutationrate.bed)
97 | -weights file with callability (defaults to all positions being called)
98 | -window_size size of bins (defaults to 1 Mb)
99 |
100 | > create_outgroup
101 | -ind [required] ingroup/outgroup list (json file) or comma-separated list e.g.
ind1,ind2
102 | -vcf [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
103 | -weights file with callability (defaults to all positions being called)
104 | -out outputfile (defaults to stdout)
105 | -ancestral fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
106 | -refgenome fasta file with reference genome - comma-separated list or wildcards like vcf argument (default none)
107 |
108 | > create_ingroup
109 | -ind [required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2
110 | -vcf [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
111 | -outgroup [required] path to variants found in outgroup
112 | -weights file with callability (defaults to all positions being called)
113 | -out outputfile prefix (default is a file named obs.<ind>.txt where <ind> is the name of the individual in the ingroup/outgroup list)
114 | -ancestral fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
115 |
116 | > train
117 | -obs [required] file with observation data
118 | -chrom Subset to chromosome or comma separated list of chromosomes e.g. chr1 or chr1,chr2,chr3 (default is to use all chromosomes)
119 | -weights file with callability (defaults to all positions being called)
120 | -mutrates file with mutation rates (default is mutation rate is uniform)
121 | -param markov parameters file (default is human/neanderthal like parameters)
122 | -out outputfile (default is a file named trained.json)
123 | -window_size size of bins (default is 1000 bp)
124 | -haploid Change from using diploid data to haploid data (default is diploid)
125 |
126 | > decode
127 | -obs [required] file with observation data
128 | -chrom Subset to chromosome or comma separated list of chromosomes e.g. chr1 or chr1,chr2,chr3 (default is to use all chromosomes)
129 | -weights file with callability (defaults to all positions being called)
130 | -mutrates file with mutation rates (default is mutation rate is uniform)
131 | -param markov parameters file (default is human/neanderthal like parameters)
132 | -out outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)
133 | -window_size size of bins (default is 1000 bp)
134 | -haploid Change from using diploid data to haploid data (default is diploid)
135 | -admixpop Annotate using vcffile with admixing population (default is none)
136 | -extrainfo Add variant position for each SNP (default is off)
137 | -viterbi decode using the viterbi algorithm (default is posterior decoding)
138 |
139 | > inhomogeneous
140 | -obs [required] file with observation data
141 | -chrom Subset to chromosome or comma separated list of chromosomes e.g. chr1 or chr1,chr2,chr3 (default is to use all chromosomes)
142 | -weights file with callability (defaults to all positions being called)
143 | -mutrates file with mutation rates (default is mutation rate is uniform)
144 | -param markov parameters file (default is human/neanderthal like parameters)
145 | -out outputfile prefix .hap1_sim(0-n).txt and .hap2_sim(0-n).txt if -haploid option is used or .diploid_(0-n).txt (default is stdout)
146 | -window_size size of bins (default is 1000 bp)
147 | -haploid Change from using diploid data to haploid data (default is diploid)
148 | -samples Number of simulated paths for the inhomogeneous markov chain (default is 100)
149 | -admixpop Annotate using vcffile with admixing population
(default is none)
150 | -extrainfo Add variant position for each SNP (default is off)
151 | ```
152 |
153 | ---
154 |
155 | ## Quick tutorial
156 |
157 | Here is how we can simulate test data using hmmix. Let's make some test data and start using the program.
158 |
159 | ```note
160 | > hmmix make_test_data
161 | > creating 2 chromosomes each with 50000 kb of test data with the following parameters..
162 | > hmm parameters file: None
163 | > state_names = ['Human', 'Archaic']
164 | > starting_probabilities = [0.98, 0.02]
165 | > transitions = [[1.0, 0.0], [0.02, 0.98]]
166 | > emissions = [0.04, 0.4]
167 | > Seed is 42
168 | ```
169 |
170 | This will generate 5 files: obs.txt, weights.bed, mutrates.bed, simulated_segments.txt and Initialguesses.json. obs.txt contains the mutations that are left after removing variants which are found in the outgroup.
171 |
172 | ```note
173 | chrom pos ancestral_base genotype
174 | chr1 17102 C CT
175 | chr1 34435 C CT
176 | chr1 69860 T TA
177 | chr1 122270 C CA
178 | chr1 181106 G GC
179 | chr1 218071 A AC
180 | chr1 220700 T TG
181 | chr1 231020 A AG
182 | chr1 235614 T TG
183 | ```
184 |
185 | weights.bed. These are the parts of the genome that we can accurately map to - in this case we have simulated the data and can accurately access the entire genome.
186 |
187 | ```note
188 | chr1 0 50000000
189 | chr2 0 50000000
190 | ```
191 |
192 | mutrates.bed. This is the normalized mutation rate across the genome.
193 |
194 | ```note
195 | chr1 0 50000000 1
196 | chr2 0 50000000 1
197 | ```
198 |
199 | Initialguesses.json. These are our initial guesses when training the model - note they are different from the values we simulated from.
200 |
201 | ```json
202 | {
203 |   "state_names": ["Human","Archaic"],
204 |   "starting_probabilities": [0.5,0.5],
205 |   "transitions": [[0.99,0.01],[0.02,0.98]],
206 |   "emissions": [0.03,0.3]
207 | }
208 | ```
209 |
210 | The simulated_segments.txt file contains the simulated states which generated the data (you can compare this to the decoded results later and see that it matches).
211 |
212 | ```note
213 | chrom start end length state
214 | chr1 0 22980000 22980000 Human
215 | chr1 22980000 23071000 91000 Archaic
216 | chr1 23071000 43905000 20834000 Human
217 | chr1 43905000 43911000 6000 Archaic
218 | chr1 43911000 47419000 3508000 Human
219 | chr1 47419000 47443000 24000 Archaic
220 | chr1 47443000 50000000 2557000 Human
221 | chr2 0 16378000 16378000 Human
222 | chr2 16378000 16492000 114000 Archaic
223 | chr2 16492000 19478000 2986000 Human
224 | chr2 19478000 19512000 34000 Archaic
225 | chr2 19512000 37728000 18216000 Human
226 | chr2 37728000 37751000 23000 Archaic
227 | chr2 37751000 46777000 9026000 Human
228 | chr2 46777000 46791000 14000 Archaic
229 | chr2 46791000 50000000 3209000 Human
230 | ```
231 |
232 | We can find the best fitting parameters using Baum-Welch training. Here is how you use it - note you can try to omit the weights and mutrates arguments, since this is simulated data the mutation rate is constant across the genome and we can access the entire genome. Also notice how the parameters approach the parameters the data was generated from (hooray!).
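As a side note, here is a minimal sketch of the quantity that Baum-Welch training optimizes: the likelihood of the windowed SNP counts under a two-state hidden Markov model. This is not the hmmix source code (the real model also scales the emissions by callability and local mutation rate); it is just a scaled forward pass with Poisson emissions, using the parameter values from Initialguesses.json:

```python
import numpy as np
from scipy.stats import poisson

# initial guesses copied from Initialguesses.json
starting_probabilities = np.array([0.5, 0.5])          # Human, Archaic
transitions = np.array([[0.99, 0.01], [0.02, 0.98]])   # state switching probabilities
emissions = np.array([0.03, 0.3])                      # mean number of snps per 1 kb window

def forward_loglikelihood(snp_counts):
    # probability of each window's snp count under each state (one row per window)
    probs = poisson.pmf(snp_counts[:, None], emissions)
    alpha = starting_probabilities * probs[0]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()                        # rescale to avoid underflow
    for p in probs[1:]:
        alpha = (alpha @ transitions) * p
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

# toy example: ten 1 kb windows with one snp-dense stretch in the middle
print(forward_loglikelihood(np.array([0, 1, 0, 0, 4, 5, 3, 0, 0, 1])))
```

Each Baum-Welch iteration runs a forward pass like this (plus a matching backward pass) and updates the parameters, so the loglikelihood column in the output below increases until it converges. The actual training run looks like this: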
233 |
234 | ```note
235 | > hmmix train -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json
236 | ----------------------------------------
237 | > state_names = ['Human', 'Archaic']
238 | > starting_probabilities = [0.5, 0.5]
239 | > transitions = [[0.99, 0.01], [0.02, 0.98]]
240 | > emissions = [0.03, 0.3]
241 | > chromosomes to use: All
242 | > number of windows: 100000. Number of snps = 4116
243 | > total callability: 100000000 bp (100.0 %)
244 | > average mutation rate per bin: 1.0
245 | > Output is trained.json
246 | > Window size is 1000 bp
247 | > Haploid False
248 | ----------------------------------------
249 | iteration loglikelihood start1 start2 emis1 emis2 trans1_1 trans2_2
250 | 0 -17905.0945 0.5 0.5 0.03 0.3 0.99 0.98
251 | 1 -17259.7101 0.96 0.04 0.0346 0.2009 0.9968 0.9217
252 | 2 -17244.4109 0.969 0.031 0.0365 0.1861 0.9971 0.9105
253 | ...
254 | 29 -17196.1361 0.997 0.003 0.04 0.4477 0.9999 0.9802
255 | 30 -17196.1324 0.997 0.003 0.04 0.4482 0.9999 0.9806
256 | 31 -17196.1316 0.997 0.003 0.04 0.4485 0.9999 0.9808
257 |
258 |
259 | # run without mutrate and weights (only do this for simulated data)
260 | > hmmix train -obs=obs.txt -param=Initialguesses.json -out=trained.json
261 | ```
262 |
263 | We can now decode the data with the best parameters that maximize the likelihood and find the archaic segments. Please note it is the weights file that determines the ends of the chromosomes. If you do not provide a weights file then the last window will be the last window with a SNP, because hmmix then uses the position of the last SNP to determine the length of the chromosome. So using the test data above the decoded output would end at window 49,985,000 for chromosome 1 and 49,997,000 for chromosome 2, since the last SNP on chromosome 1 is at 49,984,119 and the last SNP on chromosome 2 is at 49,996,253.
264 |
265 | ```note
266 | > hmmix decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json
267 | ----------------------------------------
268 | > state_names = ['Human', 'Archaic']
269 | > starting_probabilities = [0.997, 0.003]
270 | > transitions = [[1.0, 0.0], [0.019, 0.981]]
271 | > emissions = [0.04, 0.449]
272 | > chromosomes to use: All
273 | > number of windows: 100000. Number of snps = 4116
274 | > total callability: 100000000 bp (100.0 %)
275 | > average mutation rate per bin: 1.0
276 | > Output prefix is /dev/stdout
277 | > Window size is 1000 bp
278 | > Haploid False
279 | > Decode with posterior decoding
280 | ----------------------------------------
281 | chrom start end length state mean_prob snps
282 | chr1 0 22979000 22979000 Human 0.99989 903
283 | chr1 22979000 23066000 87000 Archaic 0.96368 32
284 | chr1 23066000 47418000 24352000 Human 0.99975 935
285 | chr1 47418000 47443000 25000 Archaic 0.88235 10
286 | chr1 47443000 50000000 2557000 Human 0.99934 96
287 | chr2 0 16381000 16381000 Human 0.99981 653
288 | chr2 16381000 16492000 111000 Archaic 0.99166 60
289 | chr2 16492000 19478000 2986000 Human 0.99883 134
290 | chr2 19478000 19512000 34000 Archaic 0.96454 18
291 | chr2 19512000 50000000 30488000 Human 0.99981 1275
292 |
293 | ```
294 |
295 | ---
296 |
297 | ## Example with 1000 genomes data
298 |
299 | ---
300 |
301 | The whole pipeline we will run looks like this.
In the following section we will go through all the steps along the way.
302 |
303 | NOTE: The outgroup files, mutation rate files, reference genomes, ancestral allele files and callability files are now premade!
304 | They can be downloaded in hg38 and hg19 here:
305 |
306 | But keep reading along if you want to know HOW the files were generated! Another important thing to note is that hmmix relies on VCFtools, which only supports VCF files up to format version 4.2 - so if you have VCF files in version 4.3 you will need to change this in your header!
307 |
308 | ```note
309 | hmmix create_outgroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa -refgenome=hg19_refgenome/*fa
310 | hmmix mutation_rate -outgroup=outgroup.txt -weights=strickmask.bed -window_size=1000000 -out mutationrate.bed
311 | hmmix create_ingroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa
312 | hmmix train -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.json
313 | hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json
314 | ```
315 |
316 | ### Getting data
317 |
318 | I thought it would be nice to have an entirely reproducible example of how to use this model - from a common starting point such as a VCF file (well, a BCF file in this case) to the final output. The reason for using BCF files is that it is MUCH faster to extract data for each individual. You can convert a vcf file to a bcf file like this:
319 |
320 | ```note
321 | bcftools view file.vcf -l 1 -O b > file.bcf
322 | bcftools index file.bcf
323 | ```
324 |
325 | In this example I will analyse an individual (HG00096) from the 1000 genomes project phase 3. All analyses were run on my Lenovo ThinkPad (8th gen) laptop, so it should run on yours too!
326 |
327 | First we will need to know 1) which bases can be called in the genome and 2) which variants are found in the outgroup. So let's start out by downloading the files from the following directories.
328 | To download callability regions, ancestral allele information and ingroup/outgroup information use these commands:
329 |
330 | ```bash
331 | # bcffiles (hg19)
332 | ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/
333 |
334 | # callability (remember to remove chr in the beginning of each line to make it compatible with hg19 e.g. chr1 > 1)
335 | ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed
336 | sed 's/^chr\|%$//g' 20141020.strict_mask.whole_genome.bed | awk '{print $1"\t"$2"\t"$3}' | grep -v Y > strickmask.bed
337 |
338 | # outgroup information
339 | ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
340 |
341 | # Ancestral information
342 | ftp://ftp.ensembl.org/pub/release-74/fasta/ancestral_alleles/hg19_ancestral.tar.bz2
343 |
344 | # Reference genome
345 | wget 'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz' -O chromFa.tar.gz
346 |
347 | # Archaic variants (Altai, Vindija, Chagyrskaya and Denisova in hg19)
348 | https://zenodo.org/records/7246376
349 |
350 | ```
351 |
352 | For this example we will use all individuals from 'YRI', 'MSL' and 'ESN' as outgroup individuals - one way of building the individuals file from the panel file is sketched below.
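This small helper is not part of hmmix; it assumes the panel file downloaded above is whitespace-separated with the sample name and population in the first two columns:

```python
import json

# populations used as outgroup in this example; the samples we want to decode go in the ingroup
outgroup_pops = {'YRI', 'MSL', 'ESN'}
ingroup_samples = {'HG00096', 'HG00097'}

ingroup, outgroup = [], []
with open('integrated_call_samples_v3.20130502.ALL.panel') as data:
    for line in data:
        if line.startswith('sample'):   # skip the header line if present
            continue
        sample, population = line.split()[0:2]
        if population in outgroup_pops:
            outgroup.append(sample)
        elif sample in ingroup_samples:
            ingroup.append(sample)

with open('individuals.json', 'w') as out:
    json.dump({'ingroup': sorted(ingroup), 'outgroup': sorted(outgroup)}, out, indent=2)
```

The resulting file has the structure shown below.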
While we will only be decoding HG00096 in this example, you can add as many individuals as you want to the ingroup.
353 |
354 | ```json
355 | {
356 |   "ingroup": [
357 |     "HG00096",
358 |     "HG00097"
359 |   ],
360 |   "outgroup": [
361 |     "HG02922",
362 |     "HG02923",
363 | ...
364 |     "HG02944",
365 |     "HG02946"]
366 | }
367 | ```
368 |
369 | ---
370 |
371 | ### Finding snps which are derived in the outgroup
372 |
373 | First we need to find a set of variants found in the outgroup. We can use the wildcard character to loop through all bcf files. It is best if you have files with the ancestral alleles (in FASTA format) and the reference genome (in FASTA format), but the program will run without them.
374 |
375 | Something to note is that if you use an outgroup vcffile (like 1000 genomes) and an ingroup vcf file from a different dataset (like SGDP) there is an edge case which could occur. There could be recurrent mutations where every individual in 1000 genomes has the derived variant while in one individual in SGDP the derived variant has mutated back to the ancestral allele. This means that this position will not be present in the outgroup file. However if a recurrent mutation occurs it will look like multiple individuals in the ingroup file have the mutation. This does not happen often, but it is why I recommend having files with the ancestral allele and reference genome information.
376 |
377 | ```note
378 | # Recommended usage (if you want to remove sites which are fixed derived in your outgroup/ingroup). This is the file from zenodo.
379 | (took two hours) > hmmix create_outgroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa -refgenome=hg19_refgenome/*fa
380 | ----------------------------------------
381 | > Outgroup individuals: 292
382 | > Using vcf and ancestral files
383 | vcffile: chr1.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_1.fa reffile: hg19_refgenome/chr1.fa
384 | vcffile: chr2.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_2.fa reffile: hg19_refgenome/chr2.fa
385 | vcffile: chr3.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_3.fa reffile: hg19_refgenome/chr3.fa
386 | vcffile: chr4.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_4.fa reffile: hg19_refgenome/chr4.fa
387 | vcffile: chr5.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_5.fa reffile: hg19_refgenome/chr5.fa
388 | vcffile: chr6.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_6.fa reffile: hg19_refgenome/chr6.fa
389 | vcffile: chr7.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_7.fa reffile: hg19_refgenome/chr7.fa
390 | vcffile: chr8.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_8.fa reffile: hg19_refgenome/chr8.fa
391 | vcffile: chr9.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_9.fa reffile: hg19_refgenome/chr9.fa
392 | vcffile: chr10.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_10.fa reffile: hg19_refgenome/chr10.fa
393 | vcffile: chr11.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_11.fa reffile: hg19_refgenome/chr11.fa
394 | vcffile: chr12.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_12.fa reffile: hg19_refgenome/chr12.fa
395 | vcffile: chr13.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_13.fa reffile: hg19_refgenome/chr13.fa
396 | vcffile: chr14.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_14.fa reffile: hg19_refgenome/chr14.fa
397 | vcffile: chr15.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_15.fa reffile: hg19_refgenome/chr15.fa
398 | vcffile: chr16.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_16.fa reffile: hg19_refgenome/chr16.fa
399 | vcffile: chr17.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_17.fa reffile: hg19_refgenome/chr17.fa
400 | vcffile: chr18.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_18.fa reffile: hg19_refgenome/chr18.fa
401 | vcffile: chr19.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_19.fa reffile: hg19_refgenome/chr19.fa
402 | vcffile: chr20.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_20.fa reffile: hg19_refgenome/chr20.fa
403 | vcffile: chr21.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_21.fa reffile: hg19_refgenome/chr21.fa
404 | vcffile: chr22.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_22.fa reffile: hg19_refgenome/chr22.fa
405 | vcffile: chrX.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_X.fa reffile: hg19_refgenome/chrX.fa
406 |
407 | > Callability file: strickmask.bed
408 | > Writing output to: outgroup.txt
409 | ----------------------------------------
410 | ```
411 |
412 | Here it is important to check that hmmix matches up the reference, ancestral and vcf files correctly, e.g. chr1.bcf should pair with hg19_ancestral/homo_sapiens_ancestor_1.fa and hg19_refgenome/chr1.fa. If you see an issue here it's better to give the files as comma-separated values.
413 |
414 | ```note
415 | hmmix create_outgroup -ind=individuals.json \
416 | -vcf=chr1.bcf,chr2.bcf,chr3.bcf,chr4.bcf,chr5.bcf,chr6.bcf,chr7.bcf,chr8.bcf,chr9.bcf,chr10.bcf,chr11.bcf,chr12.bcf,chr13.bcf,chr14.bcf,chr15.bcf,chr16.bcf,chr17.bcf,chr18.bcf,chr19.bcf,chr20.bcf,chr21.bcf,chr22.bcf,chrX.bcf \
417 | -ancestral=hg19_ancestral/homo_sapiens_ancestor_1.fa,hg19_ancestral/homo_sapiens_ancestor_2.fa,hg19_ancestral/homo_sapiens_ancestor_3.fa,hg19_ancestral/homo_sapiens_ancestor_4.fa,hg19_ancestral/homo_sapiens_ancestor_5.fa,hg19_ancestral/homo_sapiens_ancestor_6.fa,hg19_ancestral/homo_sapiens_ancestor_7.fa,hg19_ancestral/homo_sapiens_ancestor_8.fa,hg19_ancestral/homo_sapiens_ancestor_9.fa,hg19_ancestral/homo_sapiens_ancestor_10.fa,hg19_ancestral/homo_sapiens_ancestor_11.fa,hg19_ancestral/homo_sapiens_ancestor_12.fa,hg19_ancestral/homo_sapiens_ancestor_13.fa,hg19_ancestral/homo_sapiens_ancestor_14.fa,hg19_ancestral/homo_sapiens_ancestor_15.fa,hg19_ancestral/homo_sapiens_ancestor_16.fa,hg19_ancestral/homo_sapiens_ancestor_17.fa,hg19_ancestral/homo_sapiens_ancestor_18.fa,hg19_ancestral/homo_sapiens_ancestor_19.fa,hg19_ancestral/homo_sapiens_ancestor_20.fa,hg19_ancestral/homo_sapiens_ancestor_21.fa,hg19_ancestral/homo_sapiens_ancestor_22.fa,hg19_ancestral/homo_sapiens_ancestor_X.fa \
418 | -refgenome=hg19_refgenome/chr1.fa,hg19_refgenome/chr2.fa,hg19_refgenome/chr3.fa,hg19_refgenome/chr4.fa,hg19_refgenome/chr5.fa,hg19_refgenome/chr6.fa,hg19_refgenome/chr7.fa,hg19_refgenome/chr8.fa,hg19_refgenome/chr9.fa,hg19_refgenome/chr10.fa,hg19_refgenome/chr11.fa,hg19_refgenome/chr12.fa,hg19_refgenome/chr13.fa,hg19_refgenome/chr14.fa,hg19_refgenome/chr15.fa,hg19_refgenome/chr16.fa,hg19_refgenome/chr17.fa,hg19_refgenome/chr18.fa,hg19_refgenome/chr19.fa,hg19_refgenome/chr20.fa,hg19_refgenome/chr21.fa,hg19_refgenome/chr22.fa,hg19_refgenome/chrX.fa \
419 | -weights=strickmask.bed \
420 | -out=outgroup.txt
421 | ```
422 |
423 | ---
424 |
425 | ### Estimating mutation rate across genome
426 |
427 | We can use the number of variants in the outgroup to estimate the substitution rate as a proxy for mutation rate.
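A minimal sketch of the idea (not the hmmix implementation, which also corrects for callability using the weights file): count outgroup variants in 1 Mb windows and normalize by the genome-wide average, so a window with an average substitution rate gets the value 1.

```python
from collections import defaultdict

window_size = 1000000
counts = defaultdict(lambda: defaultdict(int))

# outgroup.txt as produced by create_outgroup: a header line, then chrom and pos in the first two columns
with open('outgroup.txt') as data:
    next(data)  # skip header
    for line in data:
        chrom, pos = line.split()[0:2]
        counts[chrom][(int(pos) - 1) // window_size] += 1

all_windows = [count for chrom in counts for count in counts[chrom].values()]
genome_average = sum(all_windows) / len(all_windows)

with open('mutationrate.sketch.bed', 'w') as out:
    for chrom in counts:
        for window, count in sorted(counts[chrom].items()):
            start = window * window_size
            print(chrom, start, start + window_size, round(count / genome_average, 2), sep='\t', file=out)
```

The real command to run is: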
428 |
429 | ```note
430 | (took 30 sec) > hmmix mutation_rate -outgroup=outgroup.txt -weights=strickmask.bed -window_size=1000000 -out mutationrate.bed
431 | ----------------------------------------
432 | > Outgroupfile: outgroup.txt
433 | > Outputfile is: mutationrate.bed
434 | > Callability file is: strickmask.bed
435 | > Window size: 1000000
436 | ----------------------------------------
437 | ```
438 |
439 | ---
440 |
441 | ### Find a set of variants which are not derived in the outgroup
442 |
443 | Keep the variants that are not found to be derived in the outgroup for each individual in the ingroup. You can also specify a single individual or a comma-separated list of individuals.
444 |
445 | ```note
446 | (took 20 min) > hmmix create_ingroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa
447 | ----------------------------------------
448 | > Ingroup individuals: 2
449 | > Using vcf and ancestral files
450 | vcffile: chr1.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_1.fa
451 | vcffile: chr2.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_2.fa
452 | vcffile: chr3.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_3.fa
453 | vcffile: chr4.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_4.fa
454 | vcffile: chr5.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_5.fa
455 | vcffile: chr6.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_6.fa
456 | vcffile: chr7.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_7.fa
457 | vcffile: chr8.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_8.fa
458 | vcffile: chr9.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_9.fa
459 | vcffile: chr10.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_10.fa
460 | vcffile: chr11.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_11.fa
461 | vcffile: chr12.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_12.fa
462 | vcffile: chr13.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_13.fa
463 | vcffile: chr14.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_14.fa
464 | vcffile: chr15.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_15.fa
465 | vcffile: chr16.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_16.fa
466 | vcffile: chr17.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_17.fa
467 | vcffile: chr18.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_18.fa
468 | vcffile: chr19.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_19.fa
469 | vcffile: chr20.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_20.fa
470 | vcffile: chr21.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_21.fa
471 | vcffile: chr22.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_22.fa
472 | vcffile: chrX.bcf ancestralfile: hg19_ancestral/homo_sapiens_ancestor_X.fa
473 |
474 | > Using outgroup variants from: outgroup.txt
475 | > Callability file: strickmask.bed
476 | > Writing output to file with prefix: obs.<ind>.txt
477 |
478 | ----------------------------------------
479 | Running command:
480 | bcftools view -m2 -M2 -v snps -s HG00096 -T strickmask.bed chr1.bcf | vcftools --vcf - --exclude-positions outgroup.txt --recode --stdout
481 | ...
482 | bcftools view -m2 -M2 -v snps -s HG00097 -T strickmask.bed chr22.bcf | vcftools --vcf - --exclude-positions outgroup.txt --recode --stdout
483 |
484 |
485 | # Different way to define which individuals are in the ingroup
486 | (took 20 min) > hmmix create_ingroup -ind=HG00096,HG00097 -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=hg19_ancestral/homo_sapiens_ancestor_*.fa
487 | ```
488 |
489 | ---
490 |
491 | ### Training
492 |
493 | Now we train the HMM parameters and decode.
494 |
495 | ```note
496 | (took 2 min) > hmmix train -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.json
497 | ----------------------------------------
498 | > state_names = ['Human', 'Archaic']
499 | > starting_probabilities = [0.98, 0.02]
500 | > transitions = [[1.0, 0.0], [0.02, 0.98]]
501 | > emissions = [0.04, 0.4]
502 | > chromosomes to use: All
503 | > number of windows: 3034097. Number of snps = 129803
504 | > total callability: 2178532324 bp (71.8 %)
505 | > average mutation rate per bin: 1.0
506 | > Output is trained.HG00096.json
507 | > Window size is 1000 bp
508 | > Haploid False
509 | ----------------------------------------
510 | iteration loglikelihood start1 start2 emis1 emis2 trans1_1 trans2_2
511 | 0 -495723.941 0.98 0.02 0.04 0.4 0.9999 0.98
512 | 1 -493161.0783 0.964 0.036 0.0459 0.3894 0.9995 0.9859
513 | 2 -492985.5422 0.959 0.041 0.0454 0.3847 0.9993 0.9834
514 | ...
515 | 20 -492843.1842 0.954 0.046 0.0441 0.3724 0.9989 0.9768
516 | 21 -492843.1828 0.954 0.046 0.0441 0.3724 0.9989 0.9768
517 | 22 -492843.182 0.954 0.046 0.0441 0.3724 0.9989 0.9768
518 | ```
519 |
520 | ---
521 |
522 | ### Decoding
523 |
524 | ```note
525 | (took 30 sec) > hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json
526 | ----------------------------------------
527 | > state_names = ['Human', 'Archaic']
528 | > starting_probabilities = [0.954, 0.046]
529 | > transitions = [[0.999, 0.001], [0.023, 0.977]]
530 | > emissions = [0.044, 0.372]
531 | > chromosomes to use: All
532 | > number of windows: 3034097. Number of snps = 129803
533 | > total callability: 2178532324 bp (71.8 %)
534 | > average mutation rate per bin: 1.0
535 | > Output prefix is /dev/stdout
536 | > Window size is 1000 bp
537 | > Haploid False
538 | ----------------------------------------
539 | chrom start end length state mean_prob snps
540 | 1 0 2988000 2988000 Human 0.9843 91
541 | 1 2988000 2997000 9000 Archaic 0.76267 6
542 | 1 2997000 3425000 428000 Human 0.98774 30
543 | 1 3425000 3452000 27000 Archaic 0.95818 22
544 | 1 3452000 4302000 850000 Human 0.97914 36
545 | 1 4302000 4361000 59000 Archaic 0.86728 20
546 | 1 4361000 4500000 139000 Human 0.9685 4
547 | 1 4500000 4510000 10000 Archaic 0.85533 7
548 | ```
549 |
550 | You can also save to an output file with the command:
551 |
552 | ```note
553 | hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json -out=HG00096.decoded
554 | ```
555 |
556 | This will create a file named HG00096.decoded.diploid.txt because the default is to treat the data as diploid (more on haploid decoding in the next section).
557 |
558 | ---
559 |
560 | ### Training and decoding with phased data
561 |
562 | It is also possible to tell the model that the data is phased with the -haploid parameter. For that we first need to train the parameters for haploid data and then decode.
Training the model on phased data is done like this - and we also remember to change the name of the parameter file to include "phased" so future versions of ourselves don't forget. Another thing to note is that the number of SNPs is larger than before (135,483 vs 129,803). This is because the program counts SNPs on both haplotypes, so homozygous derived sites are counted twice! Also the number of windows is now doubled, because we are looking at each chromosome in the pair separately.
563 |
564 | ```note
565 | (took 4 min) > hmmix train -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.phased.json -haploid
566 | ----------------------------------------
567 | > state_names = ['Human', 'Archaic']
568 | > starting_probabilities = [0.98, 0.02]
569 | > transitions = [[1.0, 0.0], [0.02, 0.98]]
570 | > emissions = [0.04, 0.4]
571 | > chromosomes to use: All
572 | > number of windows: 6068194. Number of snps = 135483
573 | > total callability: 4357064649 bp (71.8 %)
574 | > average mutation rate per bin: 1.0
575 | > Output is trained.HG00096.phased.json
576 | > Window size is 1000 bp
577 | > Haploid True
578 | ----------------------------------------
579 | iteration loglikelihood start1 start2 emis1 emis2 trans1_1 trans2_2
580 | 0 -605546.7352 0.98 0.02 0.04 0.4 0.9999 0.98
581 | 1 -589566.629 0.985 0.015 0.0248 0.3999 0.9998 0.9851
582 | 2 -588897.0833 0.98 0.02 0.0238 0.3671 0.9996 0.9825
583 | ...
584 | 20 -588529.8136 0.973 0.027 0.0227 0.3266 0.9993 0.9755
585 | 21 -588529.8124 0.973 0.027 0.0227 0.3265 0.9993 0.9755
586 | 22 -588529.8117 0.973 0.027 0.0227 0.3265 0.9993 0.9755
587 | ```
588 |
589 | Below I am only showing the first archaic segments on chromosome 1 for each haplotype (note you have to scroll down past chromosome X before the second haplotype begins). They seem to fall more or less in the same places as when we used diploid data.
590 |
591 | ```note
592 | (took 30 sec) > hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.phased.json -haploid
593 | ----------------------------------------
594 | > state_names = ['Human', 'Archaic']
595 | > starting_probabilities = [0.973, 0.027]
596 | > transitions = [[0.999, 0.001], [0.024, 0.976]]
597 | > emissions = [0.023, 0.327]
598 | > chromosomes to use: All
599 | > number of windows: 6068194. Number of snps = 135483
600 | > total callability: 4357064649 bp (71.8 %)
601 | > average mutation rate per bin: 1.0
602 | > Output prefix is /dev/stdout
603 | > Window size is 1000 bp
604 | > Haploid True
605 | ----------------------------------------
606 | hap1
607 | chrom start end length state mean_prob snps
608 | 1 2156000 2185000 29000 Archaic 0.64814 6
609 | 1 3425000 3452000 27000 Archaic 0.96702 22
610 |
611 | ...
612 | hap2
613 | 1 2780000 2803000 23000 Archaic 0.68384 7
614 | 1 4302000 4337000 35000 Archaic 0.94248 13
615 | 1 4500000 4511000 11000 Archaic 0.87943 7
616 | 1 4989000 5001000 12000 Archaic 0.6195 5
617 | ```
618 |
619 | You can also save to an output file with the command:
620 |
621 | ```note
622 | hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.phased.json -haploid -out=HG00096.decoded
623 | ```
624 |
625 | This will create two files named HG00096.decoded.hap1.txt and HG00096.decoded.hap2.txt.
626 |
627 | ---
628 |
629 | ### Annotate with known admixing population
630 |
631 | Even though this method does not use archaic reference genomes for finding segments, you can still use them to annotate your segments.
632 | I have uploaded a VCF file containing 4 high-coverage archaic genomes (3 Neanderthals and 1 Denisovan) here:
633 |
634 | (hg19 - the one I use in this example)
635 |
636 | (hg38)
637 |
638 | If you have a VCF/BCF file from the population that admixed, you can write this:
639 |
640 | ```note
641 | > hmmix decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json -admixpop=archaicvar/*bcf
642 | ----------------------------------------
643 | > state_names = ['Human', 'Archaic']
644 | > starting_probabilities = [0.954, 0.046]
645 | > transitions = [[0.999, 0.001], [0.023, 0.977]]
646 | > emissions = [0.044, 0.372]
647 | > chromosomes to use: All
648 | > number of windows: 3034097. Number of snps = 129803
649 | > total callability: 2178532324 bp (71.8 %)
650 | > average mutation rate per bin: 1.0
651 | > Output prefix is /dev/stdout
652 | > Window size is 1000 bp
653 | > Haploid False
654 | ----------------------------------------
655 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_9.bcf
656 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_19.bcf
657 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_7.bcf
658 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_21.bcf
659 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_20.bcf
660 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_15.bcf
661 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_10.bcf
662 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_3.bcf
663 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_17.bcf
664 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_6.bcf
665 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_X.bcf
666 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_16.bcf
667 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_1.bcf
668 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_18.bcf
669 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_14.bcf
670 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_4.bcf
671 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_2.bcf
672 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_22.bcf
673 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_5.bcf
674 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_8.bcf
675 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_11.bcf
676 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_12.bcf
677 | bcftools view -a -R obs.HG00096.txttemp archaicvar/highcov_ind_13.bcf
678 | chrom start end length state mean_prob snps admixpopvariants AltaiNeandertal Vindija33.19 Denisova Chagyrskaya-Phalanx
679 | 1 2988000 2997000 9000 Archaic 0.76267 6 4 4 4 1 4
680 | 1 3425000 3452000 27000 Archaic 0.95818 22 17 17 15 3 17
681 | 1 4302000 4361000 59000 Archaic 0.86728 20 12 11 12 11 11
682 | 1 4500000 4510000 10000 Archaic 0.85533 7 5 4 5 4 5
683 | 1 5306000 5319000 13000 Archaic 0.55713 4 1 1 1 0 1
684 | 1 5338000 5348000 10000 Archaic 0.65123 5 3 2 3 0 3
685 | 1 9321000 9355000 34000 Archaic 0.86446 9 0 0 0 0 0
686 | 1 12599000 12655000 56000 Archaic 0.91166 18 11 4 11 0 10
687 | ```
688 |
689 | For the first segment there are 6 derived SNPs. Of these, 4 are shared with the Altai, Vindija and Chagyrskaya Neanderthals while only 1 is shared with Denisova, so this segment likely introgressed from Neanderthals.
690 |
691 | ---
692 |
693 | And that is it! Now you have run the model, gotten a set of parameters that you can interpret biologically (see my paper) and obtained a list of segments that belong to the Human and Archaic states.
694 |
695 | If you have any questions about the use of the scripts, if you find errors or if you have feedback you can contact me here (make an issue) or write to:
696 | lauritsskov2@gmail.com
697 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | with open('README.md', 'r') as fh:
4 |     long_description = fh.read()
5 |
6 | setup(
7 |     name='hmmix',
8 |     python_requires='>3.5, <3.10',
9 |     version = '0.8.2',
10 |     description='Find introgressed segments',
11 |     py_modules=['bcf_vcf', 'helper_functions', 'hmm_functions', 'main', 'make_mutationrate', 'make_test_data', 'artemis'],
12 |     package_dir={'': 'src'},
13 |     classifiers=[
14 |         'Programming Language :: Python :: 3',
15 |         'Programming Language :: Python :: 3.5',
16 |         'Programming Language :: Python :: 3.6',
17 |         'Programming Language :: Python :: 3.7',
18 |         'Programming Language :: Python :: 3.8',
19 |         'Programming Language :: Python :: 3.9',
20 |         'License :: OSI Approved :: MIT License',
21 |     ],
22 |     long_description=long_description,
23 |     long_description_content_type='text/markdown',
24 |     url = 'https://github.com/LauritsSkov/Introgression-detection',
25 |     author = 'Laurits Skov and Moises Coll Macia',
26 |     author_email='lauritsskov2@gmail.com',
27 |     entry_points = {
28 |         'console_scripts': [
29 |             'hmmix = main:main'
30 |         ]},
31 |     install_requires=[
32 |         'numpy>=1.15',
33 |         'scipy>=1.5',
34 |         'matplotlib>=3.3',
35 |         'numba'
36 |     ],
37 | )
38 |
39 |
--------------------------------------------------------------------------------
/src/bcf_vcf.py:
--------------------------------------------------------------------------------
1 | import os
2 | from collections import defaultdict
3 |
4 | from helper_functions import sortby, Make_folder_if_not_exists, load_fasta, convert_to_bases, clean_files
5 |
6 |
7 |
8 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
9 | # Make Outgroup
10 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
11 | def make_out_group(individuals_input, bedfile, vcffiles, outputfile, ancestralfiles, refgenomefiles):
12 |
13 |     Make_folder_if_not_exists(outputfile)
14 |     outgroup_individuals = ','.join(individuals_input)
15 |
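    # First pass: write a header plus all candidate outgroup variants (polymorphic
    # sites and, when ancestral/reference information is available, fixed derived
    # sites) to a temporary '.unsorted' file; it is sorted by chromosome and
    # position at the end of this function and the temporary file is then removed.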
16 |     with open(outputfile + '.unsorted', 'w') as out:
17 |
18 |         print('chrom', 'pos', 'ref_allele_info', 'alt_allele_info', 'ancestral_base', sep = '\t', file = out)
19 |
20 |         for vcffile, ancestralfile, reffile in zip(vcffiles, ancestralfiles, refgenomefiles):
21 |
22 |             if ancestralfile is not None:
23 |                 ancestral_allele = load_fasta(ancestralfile)
24 |
25 |             if bedfile is not None:
26 |                 command = f'bcftools view -s {outgroup_individuals} -T {bedfile} {vcffile} | bcftools norm -m -any | bcftools view -v snps | vcftools --vcf - --counts --stdout'
27 |             else:
28 |                 command = f'bcftools view -s {outgroup_individuals} {vcffile} | bcftools norm -m -any | bcftools view -v snps | vcftools --vcf - --counts --stdout'
29 |
30 |             print(f'Processing {vcffile}...')
31 |             print('Running command:')
32 |             print(command, '\n\n')
33 |
34 |             variants_seen = defaultdict(int)
35 |             for index, line in enumerate(os.popen(command)):
36 |                 if not line.startswith('CHROM'):
37 |
38 |                     chrom, pos, _, _, ref_allele_info, alt_allele_info = line.strip().split()
39 |
40 |                     ref_allele, ref_count = ref_allele_info.split(':')
41 |                     alt_allele, alt_count = alt_allele_info.split(':')
42 |                     pos, ref_count, alt_count = int(pos), int(ref_count), int(alt_count)
43 |
44 |                     # Always include polymorphic sites
45 |                     if alt_count * ref_count > 0:
46 |                         ancestral_base = ref_allele if ref_count > alt_count else alt_allele
47 |
48 |                         # Use ancestral base info if available
49 |                         if ancestralfile is not None:
50 |                             ancestral_base_temp = ancestral_allele[pos-1]
51 |                             if ancestral_base_temp in [ref_allele, alt_allele]:
52 |                                 ancestral_base = ancestral_base_temp
53 |
54 |                         print(chrom, pos, ref_allele_info, alt_allele_info, ancestral_base, sep = '\t', file = out)
55 |                         variants_seen[pos-1] = 1
56 |
57 |                     # Fixed sites
58 |                     elif alt_count * ref_count == 0:
59 |                         ancestral_base = ref_allele if ref_count > alt_count else alt_allele
60 |
61 |                         # Use ancestral base info if available
62 |                         if ancestralfile is not None:
63 |                             ancestral_base_temp = ancestral_allele[pos-1]
64 |                             if ancestral_base_temp in [ref_allele, alt_allele]:
65 |                                 ancestral_base = ancestral_base_temp
66 |
67 |                         if ancestral_base == alt_allele:
68 |                             derived_count = ref_count
69 |                         else:
70 |                             derived_count = alt_count
71 |
72 |                         if derived_count > 0:
73 |                             print(chrom, pos, ref_allele_info, alt_allele_info, ancestral_base, sep = '\t', file = out)
74 |                             variants_seen[pos-1] = 1
75 |
76 |
77 |                 if index % 100000 == 0:
78 |                     print(f'at line {index} at chrom {chrom} and position {pos}')
79 |
80 |             # If a reference genome and ancestral sequence are provided then also add positions where the reference and ancestral allele differ AND which are not seen in the outgroup vcf - these sites are fixed derived in the outgroup
81 |             if reffile is not None and ancestralfile is not None:
82 |                 print('Find fixed derived sites')
83 |                 refgenome_allele = load_fasta(reffile)
84 |
85 |                 for index, (refbase, ancbase) in enumerate(zip(refgenome_allele, ancestral_allele)):
86 |                     if ancbase in ['A','C','G','T'] and refbase in ['A','C','G','T']:
87 |                         if refbase != ancbase and variants_seen[index] == 0:
88 |                             print(chrom, index + 1, f'{refbase}:100', f'{ancbase}:0', ancbase, sep = '\t', file = out)
89 |
90 |     # Sort outgroup file
91 |     print('Sorting outgroup file')
92 |     positions_to_sort = defaultdict(lambda: defaultdict(str))
93 |     with open(outputfile + '.unsorted') as data, open(outputfile, 'w') as out:
94 |         for line in data:
95 |             if line.startswith('chrom'):
96 |                 out.write(line)
97 |             else:
98 |                 chrom, pos = line.strip().split()[0:2]
99 |                 positions_to_sort[chrom][int(pos)] = line
100 |
101 |         for chrom in sorted(positions_to_sort,
key=sortby):
102 |             for pos in sorted(positions_to_sort[chrom]):
103 |                 line = positions_to_sort[chrom][pos]
104 |                 out.write(line)
105 |
106 |     # Clean log files generated by vcf and bcf tools
107 |     clean_files(outputfile + '.unsorted')
108 |     clean_files('out.log')
109 |
110 |
111 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
112 | # Make ingroup
113 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
114 | def make_ingroup_obs(ingroup_individuals, bedfile, vcffiles, outprefix, outgroupfile, ancestralfiles):
115 |
116 |     # handle output files
117 |     Make_folder_if_not_exists(outprefix)
118 |     outfile_handler = defaultdict(str)
119 |     for individual in ingroup_individuals:
120 |         outfile_handler[individual] = open(f'{outprefix}.{individual}.txt','w')
121 |         print('chrom', 'pos', 'ancestral_base', 'genotype', sep = '\t', file = outfile_handler[individual])
122 |
123 |     individuals_for_bcf = ','.join(ingroup_individuals)
124 |
125 |     for vcffile, ancestralfile in zip(vcffiles, ancestralfiles):
126 |
127 |         if ancestralfile is not None:
128 |             ancestral_allele = load_fasta(ancestralfile)
129 |
130 |         if bedfile is not None:
131 |             command = f'bcftools view -v snps -s {individuals_for_bcf} -T {bedfile} {vcffile} | bcftools norm -m +any | vcftools --vcf - --exclude-positions {outgroupfile} --recode --stdout'
132 |         else:
133 |             command = f'bcftools view -v snps -s {individuals_for_bcf} {vcffile} | bcftools norm -m +any | vcftools --vcf - --exclude-positions {outgroupfile} --recode --stdout'
134 |
135 |         print('Running command:')
136 |         print(command, '\n\n')
137 |
138 |         for index, line in enumerate(os.popen(command)):
139 |
140 |             if line.startswith('#CHROM'):
141 |                 individuals_in_vcffile = line.strip().split()[9:]
142 |
143 |             if not line.startswith('#'):
144 |
145 |                 chrom, pos, _, ref_allele, alt_allele = line.strip().split()[0:5]
146 |                 pos = int(pos)
147 |                 genotypes = [x.split(':')[0] for x in line.strip().split()[9:]]
148 |                 all_bases = [ref_allele] + alt_allele.split(',')
149 |
150 |                 if ref_allele in ['A','C','G','T']:
151 |
152 |                     for original_genotype, individual in zip(genotypes, individuals_in_vcffile):
153 |                         genotype = convert_to_bases(original_genotype, all_bases)
154 |
155 |                         if ancestralfile is not None:
156 |                             # With ancestral information look for derived alleles
157 |                             ancestral_base = ancestral_allele[pos-1]
158 |                             if ancestral_base in all_bases and genotype.count(ancestral_base) != 2 and genotype != 'NN':
159 |                                 print(chrom, pos, ancestral_base, genotype, sep = '\t', file = outfile_handler[individual])
160 |
161 |                         else:
162 |                             # If no ancestral information is provided only include heterozygous variants
163 |                             if genotype[0] != genotype[1]:
164 |                                 print(chrom, pos, ref_allele, genotype, sep = '\t', file = outfile_handler[individual])
165 |
166 |
167 |             if index % 100000 == 0:
168 |                 print(f'at line {index} at chrom {chrom} and position {pos}')
169 |
170 |     # Clean log files generated by vcf and bcf tools
171 |     clean_files('out.log')
172 |
173 |     for individual in ingroup_individuals:
174 |         outfile_handler[individual].close()
175 |
176 |
177 |
178 |
--------------------------------------------------------------------------------
/src/helper_functions.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import json
3 | from collections import defaultdict
4 | import os, sys
5 | from glob import glob
6 |
7 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
8 | # Functions for handling observations/bed files
9 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
10 | def make_callability_from_bed(bedfile, window_size):
11 |     callability = defaultdict(lambda: defaultdict(float))
12 |     with open(bedfile) as data:
13 |         for line in data:
14 |
15 |             if line.startswith('chrom'):
16 |                 continue
17 |
18 |             if len(line.strip().split('\t')) == 3:
19 |                 chrom, start, end = line.strip().split('\t')
20 |                 value = 1
21 |             elif len(line.strip().split('\t')) > 3:
22 |                 chrom, start, end, value = line.strip().split('\t')[0:4]
23 |                 value = float(value)
24 |
25 |             start, end = int(start), int(end)
26 |
27 |             firstwindow = start - start % window_size
28 |             lastwindow = end - end % window_size
29 |
30 |             # not spanning multiple windows (all is added to same window)
31 |             if firstwindow == lastwindow:
32 |                 callability[chrom][firstwindow] += (end-start+1) * value
33 |
34 |             # spanning multiple windows
35 |             else:
36 |                 # add to end windows
37 |                 firstwindow_fill = window_size - start % window_size
38 |                 lastwindow_fill = end % window_size
39 |
40 |                 callability[chrom][firstwindow] += firstwindow_fill * value
41 |                 callability[chrom][lastwindow] += (lastwindow_fill+1) * value
42 |
43 |                 # fill in windows in the middle
44 |                 for window_tofil in range(firstwindow + window_size, lastwindow, window_size):
45 |                     callability[chrom][window_tofil] += window_size * value
46 |
47 |     return callability
48 |
49 |
50 |
51 | def Load_observations_weights_mutrates(obs_file, weights_file, mutrates_file, window_size = 1000, haploid = False, chrom_to_look_for = 'All'):
52 |
53 |     # get span of data
54 |     chromosome_spans = defaultdict(int)
55 |     with open(obs_file) as data:
56 |         for line in data:
57 |             if line.startswith('chrom'):
58 |                 continue
59 |             chrom, pos, ancestral_base, genotype = line.strip().split()
60 |
61 |             # convert 1-indexed position to 0-indexed position
62 |             zero_based_pos = int(pos) - 1
63 |             rounded_pos = zero_based_pos - zero_based_pos % window_size
64 |             chromosome_spans[chrom] = rounded_pos
65 |
66 |
67 |     # get span of callability
68 |     if weights_file:
69 |         callability = make_callability_from_bed(weights_file, window_size)
70 |         for chrom in callability:
71 |             chromosome_spans[chrom] = max(callability[chrom])
72 |
73 |     # which chromosomes are we interested in
74 |     if chrom_to_look_for != 'All':
75 |         chromosome_list = sorted(chrom_to_look_for.split(','), key=sortby)
76 |     else:
77 |         chromosome_list = sorted(list(chromosome_spans.keys()), key=sortby)
78 |
79 |     obs_counter = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
80 |     haplotypes = defaultdict(int)
81 |     chroms, starts, variants, obs = [], [], [], []
82 |
83 |     # read observation data
84 |     with open(obs_file) as data:
85 |         for line in data:
86 |             if line.startswith('chrom'):
87 |                 continue
88 |
89 |             chrom, pos, ancestral_base, genotype = line.strip().split()
90 |             if not chrom in chromosome_list:
91 |                 continue
92 |
93 |             # convert 1-indexed position to 0-indexed position
94 |             zero_based_pos = int(pos) - 1
95 |             rounded_pos = zero_based_pos - zero_based_pos % window_size
96 |
97 |             if haploid:
98 |                 for i, base in enumerate(genotype):
99 |                     if base != ancestral_base:
100 |                         obs_counter[chrom][rounded_pos][f'_hap{i+1}'].append(pos)
101 |                         haplotypes[f'_hap{i+1}'] += 1
102 |             else:
103 |                 obs_counter[chrom][rounded_pos][''].append(pos)
104 |                 haplotypes[''] += 1
105 |
106 |
107 |     # take care of cases where there are 0 observations
108 |     if len(haplotypes) == 0:
109 |         if haploid:
110 |             sys.exit('Could not determine haploidity because there is no data')
111 |         else:
112 |             haplotypes[''] += 1
113 |
114 |
115 |     for haplotype in sorted(haplotypes, key=sortby_haplotype):
116 |
117 |         for chrom in chromosome_list:
118 |             lastwindow = chromosome_spans[chrom]
119 |
120 |             for window in range(0, lastwindow, window_size):
121 |                 chroms.append(f'{chrom}{haplotype}')
122 |                 starts.append(window)
123 |                 variants.append(','.join(obs_counter[chrom][window][haplotype]))
124 |                 obs.append(len(obs_counter[chrom][window][haplotype]))
125 |
126 |
127 |     # Read weights file if it exists - else set all weights to 1
128 |     if weights_file is None:
129 |         weights = np.ones(len(obs))
130 |     else:
131 |         callability = make_callability_from_bed(weights_file, window_size)
132 |         weights = []
133 |         for haplotype in sorted(haplotypes, key=sortby_haplotype):
134 |
135 |             for chrom in chromosome_list:
136 |                 lastwindow = chromosome_spans[chrom]
137 |
138 |                 for window in range(0, lastwindow, window_size):
139 |                     weights.append(callability[chrom][window] / float(window_size))
140 |
141 |
142 |     # Read mutation rate file if it exists - else set all mutation rates to 1
143 |     if mutrates_file is None:
144 |         mutrates = np.ones(len(obs))
145 |     else:
146 |         callability = make_callability_from_bed(mutrates_file, window_size)
147 |         mutrates = []
148 |         for haplotype in sorted(haplotypes, key=sortby_haplotype):
149 |
150 |             for chrom in chromosome_list:
151 |                 lastwindow = chromosome_spans[chrom]
152 |
153 |                 for window in range(0, lastwindow, window_size):
154 |                     mutrates.append(callability[chrom][window] / float(window_size))
155 |
156 |
157 |
158 |     # Make sure there are no places with obs > 0 and 0 in mutation rate or weight
159 |     for index, (observation, w, m) in enumerate(zip(obs, weights, mutrates)):
160 |         if w*m == 0 and observation != 0:
161 |             print(f'warning, you had {observation} observations but no called bases/no mutation rate at index:{index}. weights:{w}, mutrates:{m}')
162 |             obs[index] = 0
163 |
164 |     return np.array(obs).astype(int), chroms, starts, variants, np.array(mutrates).astype(float), np.array(weights).astype(float)
165 |
166 |
167 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
168 | # For decoding/training
169 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
170 |
171 | def find_runs(inarray):
172 |     """ run length encoding. Partial credit to R rle function.
167 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
168 | # For decoding/training
169 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
170 | 
171 | def find_runs(inarray):
172 |     """ Run length encoding. Partial credit to R rle function.
173 |         Handles multiple data types, including non-NumPy arrays.
174 |         Yields tuples of (value, start_position, run_length) """
175 |     ia = np.asarray(inarray)                  # force numpy
176 |     n = len(ia)
177 |     if n == 0:
178 |         return (None, None, None)
179 |     else:
180 |         y = ia[1:] != ia[:-1]                 # pairwise unequal (string safe)
181 |         i = np.append(np.where(y), n - 1)     # must include last element position
182 |         z = np.diff(np.append(-1, i))         # run lengths
183 |         p = np.cumsum(np.append(0, z))[:-1]   # positions
184 | 
185 |         for (a, b, c) in zip(ia[i], p, z):
186 |             yield (a, b, c)
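# Example (editorial): find_runs yields one (value, start_position, run_length)
# tuple per run, e.g.
#
#   list(find_runs(['1', '1', '1', '2', '2']))  ->  [('1', 0, 3), ('2', 3, 2)]
#
# (the exact element types are numpy scalars, since the input is coerced with np.asarray)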
187 | 
188 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
189 | # Various helper functions
190 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
191 | def load_fasta(fasta_file):
192 |     '''
193 |     Read a fasta file with a single chromosome in and return the sequence as a string
194 |     '''
195 |     fasta_sequence = ''
196 |     with open(fasta_file) as data:
197 |         for line in data:
198 |             if not line.startswith('>'):
199 |                 fasta_sequence += line.strip().upper()
200 | 
201 |     return fasta_sequence
202 | 
203 | def sortby(x):
204 |     '''
205 |     This function is used as a key in the sorted() function. It will sort first by numeric values, then strings, then other symbols
206 | 
207 |     Usage:
208 |     mylist = ['1', '12', '2', '3', 'MT', 'Y']
209 |     sortedlist = sorted(mylist, key=sortby)
210 |     returns ['1', '2', '3', '12', 'MT', 'Y']
211 |     '''
212 | 
213 |     lower_case_letters = 'abcdefghijklmnopqrstuvwxyz'
214 |     if x.isnumeric():
215 |         return int(x)
216 |     elif type(x) == str and len(x) > 0:
217 |         if x[0].lower() in lower_case_letters:
218 |             return 1e6 + lower_case_letters.index(x[0].lower())
219 |         else:
220 |             return 2e6
221 |     else:
222 |         return 3e6
223 | 
224 | def sortby_haplotype(x):
225 |     '''
226 |     This function will sort haplotypes by number
227 |     '''
228 | 
229 |     if '_hap' in x:
230 |         return int(x.replace('_hap', ''))
231 |     else:
232 |         return x
233 | 
234 | 
235 | 
236 | 
237 | def Make_folder_if_not_exists(path):
238 |     '''
239 |     Check if the folder of path exists - otherwise create it
240 |     '''
241 |     path = os.path.dirname(path)
242 |     if path != '':
243 |         if not os.path.exists(path):
244 |             os.makedirs(path)
245 | 
246 | 
247 | 
248 | def Annotate_with_ref_genome(vcffiles, obsfile):
249 |     obs = defaultdict(list)
250 |     shared_with = defaultdict(str)
251 | 
252 |     tempobsfile = obsfile + 'temp'
253 |     with open(obsfile) as data, open(tempobsfile,'w') as out:
254 |         for line in data:
255 |             if not line.startswith('chrom'):
256 |                 out.write(line)
257 |                 chrom, pos, ancestral_base, genotype = line.strip().split()
258 |                 derived_variant = genotype.replace(ancestral_base, '')[0]
259 |                 ID = f'{chrom}_{pos}'
260 |                 obs[ID] = [ancestral_base, derived_variant]
261 | 
262 |     # handle case with 0 SNPs
263 |     if len(obs) == 0:
264 |         for vcffile in handle_infiles(vcffiles):
265 |             command = f'bcftools view -h {vcffile}'
266 |             print('You have no observations!')
267 | 
268 |             for line in os.popen(command):
269 |                 if line.startswith('#CHROM'):
270 |                     individuals_in_vcffile = line.strip().split()[9:]
271 | 
272 |         # Clean log files generated by vcf and bcf tools
273 |         clean_files('out.log')
274 |         clean_files(tempobsfile)
275 | 
276 |         return shared_with, individuals_in_vcffile
277 | 
278 | 
279 |     print('Loading in admixpop snp information')
280 |     for vcffile in handle_infiles(vcffiles):
281 |         command = f'bcftools view -a -R {tempobsfile} {vcffile}'
282 |         print(command)
283 | 
284 |         for line in os.popen(command):
285 |             if line.startswith('#CHROM'):
286 |                 individuals_in_vcffile = line.strip().split()[9:]
287 | 
288 |             if not line.startswith('#'):
289 | 
290 |                 chrom, pos, _, ref_allele, alt_allele = line.strip().split()[0:5]
291 |                 ID = f'{chrom}_{pos}'
292 |                 genotypes = [x.split(':')[0] for x in line.strip().split()[9:]]
293 |                 all_bases = [ref_allele] + alt_allele.split(',')
294 | 
295 |                 ancestral_base, derived_base = obs[ID]
296 |                 found_in = []
297 | 
298 |                 for original_genotype, individual in zip(genotypes, individuals_in_vcffile):
299 | 
300 |                     if '.' not in original_genotype:
301 |                         genotype = convert_to_bases(original_genotype, all_bases)
302 | 
303 |                         if genotype.count(derived_base) > 0:
304 |                             found_in.append(individual)
305 | 
306 |                 if len(found_in) > 0:
307 |                     shared_with[ID] = '|'.join(found_in)
308 | 
309 | 
310 |     # Clean log files generated by vcf and bcf tools
311 |     clean_files('out.log')
312 |     clean_files(tempobsfile)
313 | 
314 |     return shared_with, individuals_in_vcffile
315 | 
316 | def handle_individuals_input(argument, group_to_choose):
317 |     if os.path.exists(argument):
318 |         with open(argument) as json_file:
319 |             data = json.load(json_file)
320 |         return data[group_to_choose]
321 |     else:
322 |         return argument.split(',')
323 | 
324 | 
325 | 
326 | 
327 | # Clean up
328 | def clean_files(filename):
329 |     if os.path.exists(filename):
330 |         os.remove(filename)
331 | 
332 | 
333 | # Join variant positions across windows into one comma-separated string
334 | def flatten_list(variants_list):
335 |     return ','.join([x for x in variants_list if x != ''])
336 | 
337 | 
338 | 
339 | def convert_to_bases(genotype, both_bases):
340 | 
341 |     return_genotype = 'NN'
342 |     separator = None
343 | 
344 |     if '/' in genotype or '|' in genotype:
345 |         separator = '|' if '|' in genotype else '/'
346 | 
347 |         base1, base2 = [x for x in genotype.split(separator)]
348 |         if base1.isnumeric() and base2.isnumeric():
349 |             base1, base2 = int(base1), int(base2)
350 | 
351 |             if both_bases[base1] in ['A','C','G','T'] and both_bases[base2] in ['A','C','G','T']:
352 |                 return_genotype = both_bases[base1] + both_bases[base2]
353 | 
354 |     return return_genotype
355 | 
356 | 
357 | 
358 | 
359 | 
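# Example (editorial): with all_bases = ['A', 'G'] the genotype '0/1' converts to
# 'AG', whereas a missing genotype such as './.' falls through and returns 'NN':
#
#   convert_to_bases('0/1', ['A', 'G'])  ->  'AG'
#   convert_to_bases('./.', ['A', 'G'])  ->  'NN'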
360 | def get_consensus(infiles):
361 |     '''
362 |     Find consensus prefix, suffix and value that changes in set of files:
363 | 
364 |     myfiles = ['chr1.vcf', 'chr2.vcf', 'chr3.vcf']
365 |     prefix, suffix, values = get_consensus(myfiles)
366 | 
367 |     prefix = chr
368 |     suffix = .vcf
369 |     values = [1,2,3]
370 |     '''
371 |     infiles = [str(x) for x in infiles]
372 | 
373 |     if len(infiles) <= 1:
374 |         return None, None, None
375 | 
376 |     # find longest common prefix
377 |     prefix = infiles[0]
378 |     for s in infiles[1:]:
379 |         while not s.startswith(prefix):
380 |             prefix = prefix[:-1]
381 | 
382 |     # find longest common suffix
383 |     suffix = infiles[0]
384 |     for s in infiles[1:]:
385 |         while not s.endswith(suffix):
386 |             suffix = suffix[1:]
387 | 
388 |     values = [x[len(prefix):-len(suffix)] if suffix else x[len(prefix):] for x in infiles]
389 | 
390 |     return prefix, suffix, sorted(values, key=sortby)
391 | 
392 | 
393 | # Check which type of input we are dealing with
394 | def handle_infiles(input):
395 | 
396 |     file_list = glob(input)
397 |     if len(file_list) > 0:
398 |         return file_list
399 |     else:
400 |         return input.split(',')
401 | 
402 | 
403 | 
404 | 
405 | 
406 | # Match ancestral/reference files to vcf files
407 | def combined_files(ancestralfiles, vcffiles):
408 | 
409 |     if len(ancestralfiles) == len(vcffiles) and len(vcffiles) == 1 and ancestralfiles != ['']:
410 |         return ancestralfiles, vcffiles
411 | 
412 |     # Get ancestral and vcf consensus
413 |     prefix1, postfix1, values1 = get_consensus(vcffiles)
414 |     prefix2, postfix2, values2 = get_consensus(ancestralfiles)
415 | 
416 | 
417 |     # No ancestral files (guard against a single vcf file, where get_consensus returns None)
418 |     if ancestralfiles == ['']:
419 | 
420 |         ancestralfiles = [None for _ in vcffiles]
421 |         vcffiles = [f'{prefix1}{x}{postfix1}' for x in values1] if values1 is not None else vcffiles
422 |         return ancestralfiles, vcffiles
423 | 
424 |     # Same length
425 |     elif len(ancestralfiles) == len(vcffiles):
426 | 
427 |         vcffiles = [f'{prefix1}{x}{postfix1}' for x in values1]
428 |         ancestralfiles = [f'{prefix2}{x}{postfix2}' for x in values2]
429 |         return ancestralfiles, vcffiles
430 | 
431 |     # different lengths (both longer than 1)
432 |     elif len(ancestralfiles) > 1 and len(vcffiles) > 1:
433 | 
434 |         vcffiles = []
435 |         ancestralfiles = []
436 | 
437 |         for joined in sorted(set(values1).intersection(set(values2)), key=sortby):
438 |             vcffiles.append(''.join([prefix1, joined, postfix1]))
439 |             ancestralfiles.append(''.join([prefix2, joined, postfix2]))
440 |         return ancestralfiles, vcffiles
441 | 
442 |     # Many ancestral files only one vcf
443 |     elif len(ancestralfiles) > 1 and len(vcffiles) == 1:
444 |         ancestralfiles = []
445 | 
446 |         for key in values2:
447 |             if key in vcffiles[0]:
448 |                 ancestralfiles.append(''.join([prefix2, key, postfix2]))
449 | 
450 |         if len(vcffiles) != len(ancestralfiles):
451 |             sys.exit('Could not resolve ancestral files and vcffiles (try comma separated values)')
452 | 
453 |         return ancestralfiles, vcffiles
454 | 
455 |     # only one ancestral file and many vcf files
456 |     elif len(ancestralfiles) == 1 and len(vcffiles) > 1:
457 |         vcffiles = []
458 | 
459 |         for key in values1:
460 |             if key in ancestralfiles[0]:
461 |                 vcffiles.append(''.join([prefix1, key, postfix1]))
462 | 
463 |         if len(vcffiles) != len(ancestralfiles):
464 |             sys.exit('Could not resolve ancestral files and vcffiles (try comma separated values)')
465 | 
466 | 
467 |         return ancestralfiles, vcffiles
468 |     else:
469 |         sys.exit('Could not resolve ancestral files and vcffiles (try comma separated values)')
--------------------------------------------------------------------------------
/src/hmm_functions.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 | import numpy as np
3 | from numba import njit
4 | import json
5 | 
6 | from helper_functions import find_runs, Annotate_with_ref_genome, Make_folder_if_not_exists, flatten_list
7 | 
8 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
9 | # HMM Parameter Class
10 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
11 | class HMMParam:
12 |     def __init__(self, state_names, starting_probabilities, transitions, emissions): 
13 |         self.state_names = np.array(state_names)
14 |         self.starting_probabilities = np.array(starting_probabilities)
15 |         self.transitions = np.array(transitions)
16 |         self.emissions = np.array(emissions)
17 | 
18 |     def __str__(self):
19 |         out = f'> state_names = {self.state_names.tolist()}\n'
20 |         out += f'> starting_probabilities = {np.matrix.round(self.starting_probabilities, 3).tolist()}\n'
21 |         out += f'> transitions = {np.matrix.round(self.transitions, 3).tolist()}\n'
22 |         out += f'> emissions = 
{np.matrix.round(self.emissions, 3).tolist()}' 23 | return out 24 | 25 | def __repr__(self): 26 | return f'{self.__class__.__name__}({self.state_names}, {self.starting_probabilities}, {self.transitions}, {self.emissions})' 27 | 28 | # Read HMM parameters from a json file 29 | def read_HMM_parameters_from_file(filename): 30 | 31 | if filename is None: 32 | return get_default_HMM_parameters() 33 | 34 | with open(filename) as json_file: 35 | data = json.load(json_file) 36 | 37 | return HMMParam(state_names = data['state_names'], 38 | starting_probabilities = data['starting_probabilities'], 39 | transitions = data['transitions'], 40 | emissions = data['emissions']) 41 | 42 | # Set default parameters 43 | def get_default_HMM_parameters(): 44 | return HMMParam(state_names = ['Human', 'Archaic'], 45 | starting_probabilities = [0.98, 0.02], 46 | transitions = [[0.9999,0.0001],[0.02,0.98]], 47 | emissions = [0.04, 0.4]) 48 | 49 | # Save HMMParam to a json file 50 | def write_HMM_to_file(hmmparam, outfile): 51 | data = {key: value.tolist() for key, value in vars(hmmparam).items()} 52 | json_string = json.dumps(data, indent = 2) 53 | with open(outfile, 'w') as out: 54 | out.write(json_string) 55 | 56 | 57 | 58 | 59 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 60 | # HMM functions 61 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 62 | 63 | @njit 64 | def poisson_probability_underflow_safe(n, lam): 65 | # naive: np.exp(-lam) * lam**n / factorial(n) 66 | 67 | # iterative, to keep the components from getting too large or small: 68 | p = np.exp(-lam) 69 | for i in range(n): 70 | p *= lam 71 | p /= i+1 72 | return p 73 | 74 | @njit 75 | def Emission_probs_poisson(emissions, observations, weights, mutrates): 76 | n = len(observations) 77 | n_states = len(emissions) 78 | 79 | probabilities = np.zeros( (n, n_states) ) 80 | for state in range(n_states): 81 | for index in range(n): 82 | lam = emissions[state] * weights[index] * mutrates[index] 83 | probabilities[index,state] = poisson_probability_underflow_safe(observations[index], lam) 84 | 85 | return probabilities 86 | 87 | @njit 88 | def fwd_step(alpha_prev, E, trans_mat): 89 | alpha_new = (alpha_prev @ trans_mat) * E 90 | n = np.sum(alpha_new) 91 | return alpha_new / n, n 92 | 93 | @njit 94 | def forward(probabilities, transitions, init_start): 95 | n = len(probabilities) 96 | forwards_in = np.zeros( (n, len(init_start)) ) 97 | scale_param = np.ones(n) 98 | 99 | for t in range(n): 100 | if t == 0: 101 | forwards_in[t,:] = init_start * probabilities[t,:] 102 | scale_param[t] = np.sum( forwards_in[t,:]) 103 | forwards_in[t,:] = forwards_in[t,:] / scale_param[t] 104 | else: 105 | forwards_in[t,:], scale_param[t] = fwd_step(forwards_in[t-1,:], probabilities[t,:], transitions) 106 | 107 | return forwards_in, scale_param 108 | 109 | @njit 110 | def bwd_step(beta_next, E, trans_mat, n): 111 | beta = (trans_mat * E) @ beta_next 112 | return beta / n 113 | 114 | @njit 115 | def backward(emissions, transitions, scales): 116 | n, n_states = emissions.shape 117 | beta = np.ones((n, n_states)) 118 | for i in range(n - 1, 0, -1): 119 | beta[i - 1,:] = bwd_step(beta[i,:], emissions[i,:], transitions, scales[i]) 120 | return beta 121 | 122 | 123 | def GetProbability(hmm_parameters, weights, obs, mutrates): 124 | 125 | emissions = 
Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates)
126 |     _, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
127 |     forward_probability_of_obs = np.sum(np.log(scales))
128 | 
129 |     return forward_probability_of_obs
130 | 
131 | 
132 | @njit
133 | def fwd_step_keep_track(alpha_prev, E, trans_mat):
134 | 
135 |     # scaling factor
136 |     n = np.sum((alpha_prev @ trans_mat) * E)
137 | 
138 |     results = np.zeros(len(E))
139 |     back_track_states = np.zeros(len(E))
140 | 
141 |     for current_s in range(len(E)):
142 |         for prev_s in range(len(E)):
143 |             new_prob = alpha_prev[prev_s] * trans_mat[prev_s, current_s] * E[current_s] / n
144 | 
145 |             if new_prob > results[current_s]:
146 |                 results[current_s] = new_prob
147 |                 back_track_states[current_s] = prev_s
148 | 
149 |     return results, back_track_states
150 | 
151 | 
152 | @njit
153 | def viterbi(probabilities, transitions, init_start):
154 |     n = len(probabilities)
155 |     forwards_in = np.zeros( (n, len(init_start)) )
156 |     backtracks = np.zeros( (n, len(init_start)), dtype=np.int32)
157 | 
158 |     for t in range(n):
159 |         if t == 0:
160 |             forwards_in[t,:] = init_start * probabilities[t,:]
161 |             scale_param = np.sum( forwards_in[t,:])
162 |             forwards_in[t,:] = forwards_in[t,:] / scale_param
163 |         else:
164 |             forwards_in[t,:], backtracks[t,:] = fwd_step_keep_track(forwards_in[t-1,:], probabilities[t,:], transitions)
165 | 
166 |     return forwards_in, backtracks
167 | 
168 | 
169 | @njit
170 | def calculate_log(x):
171 |     return np.log(x)
172 | 
173 | 
174 | @njit
175 | def hybrid_step(prev, alpha, em, trans):
176 |     value = prev + alpha * calculate_log(em * trans)
177 |     best_state = np.argmax(value)
178 |     max_prob = value[best_state]
179 | 
180 |     return best_state, max_prob
181 | 
182 | 
183 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
184 | # Train
185 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
186 | 
187 | def logoutput(hmm_parameters, loglikelihood, iteration):
188 | 
189 |     n_states = len(hmm_parameters.emissions)
190 | 
191 |     # Make header
192 |     if iteration == 0:
193 |         print_emissions = '\t'.join(['emis{0}'.format(x + 1) for x in range(n_states)])
194 |         print_starting_probabilities = '\t'.join(['start{0}'.format(x + 1) for x in range(n_states)])
195 |         print_transitions = '\t'.join(['trans{0}_{0}'.format(x + 1) for x in range(n_states)])
196 |         print('iteration', 'loglikelihood', print_starting_probabilities, print_emissions, print_transitions, sep = '\t')
197 | 
198 |     # Print parameters
199 |     print_emissions = '\t'.join([str(x) for x in np.matrix.round(hmm_parameters.emissions, 4)])
200 |     print_starting_probabilities = '\t'.join([str(x) for x in np.matrix.round(hmm_parameters.starting_probabilities, 3)])
201 |     print_transitions = '\t'.join([str(x) for x in np.matrix.round(hmm_parameters.transitions, 4).diagonal()])
202 |     print(iteration, round(loglikelihood, 4), print_starting_probabilities, print_emissions, print_transitions, sep = '\t')
203 | 
204 | 
205 | 
206 | def TrainBaumWelch(hmm_parameters, weights, obs, mutrates):
207 |     """
208 |     Performs a single Baum-Welch update of the parameters, using the forward-backward algorithm.
209 |     """
210 | 
211 |     n_states = len(hmm_parameters.starting_probabilities)
212 | 
213 |     emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates)
214 |     forward_probs, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
215 |     backward_probs = backward(emissions, hmm_parameters.transitions, scales)
216 | 
217 |     # Update starting probs
218 |     posterior_probs = forward_probs * backward_probs
219 |     normalize = np.sum(posterior_probs)
220 |     new_starting_probabilities = np.sum(posterior_probs, axis=0)/normalize
221 | 
222 |     # Update emission
223 |     new_emissions_matrix = np.zeros((n_states))
224 |     for state in range(n_states):
225 |         top = np.sum(posterior_probs[:,state] * obs)
226 |         bottom = np.sum(posterior_probs[:,state] * (weights * mutrates) )
227 |         new_emissions_matrix[state] = top/bottom
228 | 
229 |     # Update Transition probs
230 |     new_transitions_matrix = np.zeros((n_states, n_states))
231 |     for state1 in range(n_states):
232 |         for state2 in range(n_states):
233 |             new_transitions_matrix[state1,state2] = np.sum( forward_probs[:-1,state1] * backward_probs[1:,state2] * hmm_parameters.transitions[state1, state2] * emissions[1:,state2]/ scales[1:] )
234 |     new_transitions_matrix /= new_transitions_matrix.sum(axis=1)[:,np.newaxis]
235 | 
236 |     return HMMParam(hmm_parameters.state_names, new_starting_probabilities, new_transitions_matrix, new_emissions_matrix)
237 | 
238 | 
239 | def TrainModel(obs, mutrates, weights, hmm_parameters, epsilon = 1e-3, maxiterations = 1000):
240 | 
241 |     # Get probability of data with initial parameters
242 |     previous_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates)
243 |     logoutput(hmm_parameters, previous_loglikelihood, 0)
244 | 
245 |     # Train parameters using Baum Welch algorithm
246 |     for i in range(1,maxiterations):
247 |         hmm_parameters = TrainBaumWelch(hmm_parameters, weights, obs, mutrates)
248 |         new_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates)
249 |         logoutput(hmm_parameters, new_loglikelihood, i)
250 | 
251 |         if new_loglikelihood - previous_loglikelihood < epsilon:
252 |             break
253 | 
254 |         previous_loglikelihood = new_loglikelihood
255 | 
256 |     # Write the optimal parameters
257 |     return hmm_parameters
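# Training sketch (editorial; assumes the files created by `hmmix make_test_data` are present):
#
#   from helper_functions import Load_observations_weights_mutrates
#   obs, _, _, _, mutrates, weights = Load_observations_weights_mutrates(
#       'obs.txt', 'weights.bed', 'mutrates.bed', 1000, False, 'All')
#   hmm_parameters = read_HMM_parameters_from_file(None)   # default human/archaic parameters
#   trained = TrainModel(obs, mutrates, weights, hmm_parameters)
#   write_HMM_to_file(trained, 'trained.json')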
209 | """ 210 | 211 | n_states = len(hmm_parameters.starting_probabilities) 212 | 213 | emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates) 214 | forward_probs, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities) 215 | backward_probs = backward(emissions, hmm_parameters.transitions, scales) 216 | 217 | # Update starting probs 218 | posterior_probs = forward_probs * backward_probs 219 | normalize = np.sum(posterior_probs) 220 | new_starting_probabilities = np.sum(posterior_probs, axis=0)/normalize 221 | 222 | # Update emission 223 | new_emissions_matrix = np.zeros((n_states)) 224 | for state in range(n_states): 225 | top = np.sum(posterior_probs[:,state] * obs) 226 | bottom = np.sum(posterior_probs[:,state] * (weights * mutrates) ) 227 | new_emissions_matrix[state] = top/bottom 228 | 229 | # Update Transition probs 230 | new_transitions_matrix = np.zeros((n_states, n_states)) 231 | for state1 in range(n_states): 232 | for state2 in range(n_states): 233 | new_transitions_matrix[state1,state2] = np.sum( forward_probs[:-1,state1] * backward_probs[1:,state2] * hmm_parameters.transitions[state1, state2] * emissions[1:,state2]/ scales[1:] ) 234 | new_transitions_matrix /= new_transitions_matrix.sum(axis=1)[:,np.newaxis] 235 | 236 | return HMMParam(hmm_parameters.state_names,new_starting_probabilities, new_transitions_matrix, new_emissions_matrix) 237 | 238 | 239 | def TrainModel(obs, mutrates, weights, hmm_parameters, epsilon = 1e-3, maxiterations = 1000): 240 | 241 | # Get probability of data with initial parameters 242 | previous_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates) 243 | logoutput(hmm_parameters, previous_loglikelihood, 0) 244 | 245 | # Train parameters using Baum Welch algorithm 246 | for i in range(1,maxiterations): 247 | hmm_parameters = TrainBaumWelsch(hmm_parameters, weights, obs, mutrates) 248 | new_loglikelihood = GetProbability(hmm_parameters, weights, obs, mutrates) 249 | logoutput(hmm_parameters, new_loglikelihood, i) 250 | 251 | if new_loglikelihood - previous_loglikelihood < epsilon: 252 | break 253 | 254 | previous_loglikelihood = new_loglikelihood 255 | 256 | # Write the optimal parameters 257 | return hmm_parameters 258 | 259 | 260 | 261 | 262 | 263 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 264 | # Decode (posterior decoding) 265 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 266 | 267 | def Calculate_Posterior_probabillities(emissions, hmm_parameters): 268 | """Get posterior probability of being in state s at time t""" 269 | 270 | forward_probs, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities) 271 | backward_probs = backward(emissions, hmm_parameters.transitions, scales) 272 | posterior_probabilities = (forward_probs * backward_probs).T 273 | 274 | return posterior_probabilities 275 | 276 | 277 | def PMAP_path(posterior_probabilities): 278 | """Get maximum posterior decoding path""" 279 | path = np.argmax(posterior_probabilities, axis = 0) 280 | return path 281 | 282 | 283 | def Viterbi_path(emissions, hmm_parameters): 284 | """Get Viterbi path - aka most likeli path""" 285 | n_obs, _ = emissions.shape 286 | 287 | viterbi_probs, backtracks = viterbi(emissions, hmm_parameters.transitions, 
283 | def Viterbi_path(emissions, hmm_parameters):
284 |     """Get Viterbi path - aka the most likely path"""
285 |     n_obs, _ = emissions.shape
286 | 
287 |     viterbi_probs, backtracks = viterbi(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
288 | 
289 |     # backtracking
290 |     viterbi_path = np.zeros(n_obs, dtype = int)
291 |     viterbi_path[-1] = np.argmax(viterbi_probs[-1,:])
292 |     for t in range(n_obs - 2, -1, -1):
293 |         viterbi_path[t] = backtracks[t + 1, viterbi_path[t + 1]]
294 | 
295 |     return viterbi_path
296 | 
297 | @njit
298 | def Hybrid_path(emissions, starting_probs, trans_matrix, logged_posterior_probabilities, ALPHA):
299 |     """
300 |     Decodes the model using the hybrid method. When the alpha parameter is 0 this is the same as posterior decoding and when it is 1 it is the same as Viterbi
301 |     """
302 |     n_obs, n_states = emissions.shape
303 |     BETA = 1 - ALPHA
304 | 
305 |     # for backtracking the hybrid method
306 |     delta = np.zeros((n_obs, n_states))
307 |     psi = np.zeros((n_obs - 1, n_states), dtype = np.int32)
308 | 
309 |     # initialize
310 |     delta[0,:] = ALPHA * calculate_log(starting_probs * emissions[0,:]) + BETA * logged_posterior_probabilities[0,:]
311 | 
312 |     for t in range(1, n_obs):
313 |         for state in range(n_states):
314 |             best_state, max_prob = hybrid_step(delta[t-1, :], ALPHA, emissions[t,state], trans_matrix[:, state])
315 |             delta[t,state] = max_prob + BETA * logged_posterior_probabilities[t,state]
316 |             psi[t-1,state] = best_state
317 | 
318 | 
319 |     # backtracking
320 |     path = np.zeros(n_obs, dtype = np.int32)
321 |     path[-1] = np.argmax(delta[-1,:])
322 |     for i in range(n_obs - 2, -1, -1):
323 |         path[i] = psi[i, path[i + 1]]
324 | 
325 |     return path
326 | 
327 | 
328 | 
329 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
330 | # inhomogeneous markov chain
331 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
332 | 
333 | @njit
334 | def Simulate_values(p):
335 |     return np.random.binomial(1, p)
336 | 
337 | def Simulate_transition(n_states, matrix, current_state):
338 | 
339 |     # prob of staying (can be done with numba so it's quick)
340 |     next_state = Simulate_values(matrix[current_state])
341 |     if next_state == 1:
342 |         return current_state
343 |     else:
344 |         if n_states == 2:
345 |             return abs(current_state - 1)
346 |         else:
347 |             new_matrix = [matrix[x] for x in range(n_states) if current_state != x]
348 |             new_matrix /= np.sum(new_matrix)
349 |             new_states = [x for x in range(n_states) if current_state != x]
350 | 
351 |             return np.random.choice(new_states, p=new_matrix)
352 | 
353 | 
354 | 
355 | def Make_inhomogeneous_transition_matrix(emissions, hmm_parameters):
356 |     """
357 |     Calculate transition matrix for each position in the sequence (given the data)
358 |     """
359 | 
360 |     n_obs, n_states = emissions.shape
361 |     _, scales = forward(emissions, hmm_parameters.transitions, hmm_parameters.starting_probabilities)
362 |     backward_probs = backward(emissions, hmm_parameters.transitions, scales)
363 | 
364 |     # Make and initialise new transition matrix
365 |     new_transition_matrix = np.zeros((n_obs, n_states, n_states))
366 | 
367 |     # starting probabilities
368 |     sim_starting_probabilities = np.zeros(n_states)
369 |     for state in range(n_states):
370 |         sim_starting_probabilities[state] = hmm_parameters.starting_probabilities[state] * backward_probs[0, state] * emissions[0, state] / scales[0]
371 | 
372 |     for state in range(n_states):
373 |         for otherstate in range(n_states):
374 |             new_transition_matrix[1:, otherstate, state] = (backward_probs[1:, state] ) / (backward_probs[:-1, 
otherstate] * scales[1:]) * hmm_parameters.transitions[otherstate, state] * emissions[1:, state] 375 | 376 | return sim_starting_probabilities, new_transition_matrix 377 | 378 | 379 | 380 | def Simulate_from_transition_matrix(sim_starting_probabilities, new_transition_matrix): 381 | 382 | number_observations, n_states, _, = new_transition_matrix.shape 383 | sim_path = np.zeros(number_observations, dtype=int) 384 | 385 | # set start state 386 | current_state = np.random.choice(n_states, p=sim_starting_probabilities) 387 | sim_path[0] = current_state 388 | 389 | for t in range(1, number_observations): 390 | next_state = Simulate_transition(n_states, new_transition_matrix[t,current_state,:], current_state) 391 | sim_path[t] = next_state 392 | current_state = next_state 393 | 394 | return sim_path 395 | 396 | 397 | 398 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 399 | # Write segments to output file 400 | # ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 401 | 402 | def Write_inhomogeneous_transition_matrix(chroms, starts, weights, mutrates, variants, hmm_parameters, new_transition_matrix, filename): 403 | n_states = len(hmm_parameters.state_names) 404 | 405 | with open(filename, 'w') as out: 406 | state_combs = [] 407 | for state1 in hmm_parameters.state_names: 408 | for state2 in hmm_parameters.state_names: 409 | state_combs.append(f'{state1}_{state2}') 410 | 411 | state_combs = '\t'.join(state_combs) 412 | print('chrom', 'start', 'called_sequence', 'mutationrate', state_combs,'variants', sep = '\t', file = out) 413 | 414 | for (chrom, start, w, m, transvalues, var) in zip(chroms, starts, weights, mutrates, new_transition_matrix, variants): 415 | posterior_to_print = [] 416 | for state1 in range(n_states): 417 | for state2 in range(n_states): 418 | posterior_to_print.append(str(round(transvalues[state1, state2], 4))) 419 | posterior_to_print = '\t'.join(posterior_to_print) 420 | 421 | print(chrom, start, w, m, posterior_to_print, var, sep = '\t', file = out) 422 | 423 | 424 | def Write_posterior_probs(chroms, starts, weights, mutrates, post_seq, path, variants, hmm_parameters, filename): 425 | post_seq = post_seq.T 426 | 427 | with open(filename, 'w') as out: 428 | state_names = '\t'.join(hmm_parameters.state_names) 429 | print('chrom', 'start', 'called_sequence', 'mutationrate', state_names,'state','variants', sep = '\t', file = out) 430 | 431 | for (chrom, start, w, m, posterior, state, var) in zip(chroms, starts, weights, mutrates, post_seq, path, variants): 432 | posterior_to_print = '\t'.join([str(round(x, 4)) for x in posterior]) 433 | print(chrom, start, w, m, posterior_to_print, hmm_parameters.state_names[state], var, sep = '\t', file = out) 434 | 435 | 436 | 437 | def Convert_genome_coordinates(window_size, CHROMOSOME_BREAKPOINTS, starts, variants, post_seq, path, hmm_parameters, weights, mutrates, obs): 438 | 439 | segments = [] 440 | for (chrom, chrom_start_index, chrom_length_index) in CHROMOSOME_BREAKPOINTS: 441 | 442 | # Diploid or haploid 443 | if '_hap' in chrom: 444 | newchrom, ploidity = chrom.split('_') 445 | else: 446 | ploidity = 'diploid' 447 | newchrom = chrom 448 | 449 | state_with_highest_prob = path[chrom_start_index:chrom_start_index + chrom_length_index] 450 | 451 | for (state, start_index, length_index) in 
find_runs(state_with_highest_prob):
452 | 
453 |             start_index = start_index + chrom_start_index
454 |             end_index = start_index + length_index
455 | 
456 |             genome_start = starts[start_index]
457 |             genome_length = length_index * window_size
458 |             genome_end = genome_start + genome_length
459 | 
460 |             called_sequence = int(np.sum(weights[start_index:end_index]) * window_size)
461 |             average_mutation_rate = round(np.mean(mutrates[start_index:end_index]), 3)
462 | 
463 |             snp_counter = np.sum(obs[start_index:end_index])
464 |             mean_prob = round(np.mean(post_seq[state, start_index:end_index]), 5)
465 |             variants_segment = flatten_list(variants[start_index:end_index])
466 | 
467 |             segments.append([newchrom, genome_start, genome_end, genome_length, hmm_parameters.state_names[state], mean_prob, snp_counter, ploidity, called_sequence, average_mutation_rate, variants_segment])
468 | 
469 |     return segments
470 | 
471 | 
472 | 
473 | def Write_Decoded_output(outputprefix, segments, obs_file = None, admixpop_file = None, extrainfo = False):
474 | 
475 |     # Load archaic data
476 |     if admixpop_file is not None:
477 |         admix_pop_variants, admixpop_names = Annotate_with_ref_genome(admixpop_file, obs_file)
478 | 
479 |     # Are we doing haploid/diploid?
480 |     outfile_mapper = {}
481 |     for _, _, _, _, _, _, _, ploidity, _, _, _ in segments:
482 |         if outputprefix == '/dev/stdout':
483 |             outfile_mapper[ploidity] = '/dev/stdout'
484 |         else:
485 |             outfile_mapper[ploidity] = f'{outputprefix}.{ploidity}.txt'
486 | 
487 | 
488 |     # Make output files and write headers
489 |     outputfiles_handlers = defaultdict(str)
490 |     for ploidity, output in outfile_mapper.items():
491 | 
492 |         Make_folder_if_not_exists(output)
493 |         outputfiles_handlers[ploidity] = open(output, 'w')
494 |         out = outputfiles_handlers[ploidity]
495 | 
496 |         if admixpop_file is not None:
497 |             if extrainfo:
498 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\tadmixpopvariants\t{}\tcalled_sequence\tmutationrate\tvariants\n'.format('\t'.join(admixpop_names)))
499 |             else:
500 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\tadmixpopvariants\t{}\n'.format('\t'.join(admixpop_names)))
501 |         else:
502 |             if extrainfo:
503 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\tcalled_sequence\tmutationrate\tvariants\n')
504 |             else:
505 |                 out.write('chrom\tstart\tend\tlength\tstate\tmean_prob\tsnps\n')
506 | 
507 |     # Go through segments and write to output
508 |     for chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, ploidity, called_sequence, average_mutation_rate, variants in segments:
509 | 
510 |         out = outputfiles_handlers[ploidity]
511 | 
512 |         if admixpop_file is not None:
513 |             archaic_variants_dict = defaultdict(int)
514 |             for snp_position in variants.split(','):
515 |                 carriers = admix_pop_variants[f'{chrom}_{snp_position}']
516 |                 if carriers != '':
517 |                     if '|' in carriers:
518 |                         for ind in carriers.split('|'):
519 |                             archaic_variants_dict[ind] += 1
520 |                     else:
521 |                         archaic_variants_dict[carriers] += 1
522 | 
523 |                     archaic_variants_dict['total'] += 1
524 | 
525 |             archaic_variants = '\t'.join([str(archaic_variants_dict[x]) for x in ['total'] + admixpop_names])
526 | 
527 |             if extrainfo:
528 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, archaic_variants, called_sequence, average_mutation_rate, variants, sep = '\t', file = out)
529 |             else:
530 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, archaic_variants, sep = '\t', file = out)
531 | 
532 |         else:
533 | 
534 |             if extrainfo:
535 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, called_sequence, average_mutation_rate, variants, sep = '\t', file = out)
536 |             else:
537 |                 print(chrom, genome_start, genome_end, genome_length, state, mean_prob, snp_counter, sep = '\t', file = out)
538 | 
539 | 
540 |     # Close output files
541 |     for ploidity, out in outputfiles_handlers.items():
542 |         out.close()
543 | 
544 | 
545 | 
546 | 
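# End-to-end decoding sketch (editorial; mirrors the `decode` mode in main.py,
# file names illustrative):
#
#   from helper_functions import Load_observations_weights_mutrates, find_runs
#   obs, chroms, starts, variants, mutrates, weights = Load_observations_weights_mutrates(
#       'obs.txt', 'weights.bed', 'mutrates.bed', 1000, False, 'All')
#   hmm_parameters = read_HMM_parameters_from_file('trained.json')
#   emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates)
#   posterior_probs = Calculate_Posterior_probabillities(emissions, hmm_parameters)
#   path = PMAP_path(posterior_probs)
#   breakpoints = [x for x in find_runs(chroms)]
#   segments = Convert_genome_coordinates(1000, breakpoints, starts, variants,
#                                         posterior_probs, path, hmm_parameters, weights, mutrates, obs)
#   Write_Decoded_output('/dev/stdout', segments)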
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import numpy as np
3 | import sys
4 | from hmm_functions import TrainModel, write_HMM_to_file, read_HMM_parameters_from_file, Write_Decoded_output, Calculate_Posterior_probabillities, PMAP_path, Viterbi_path, Hybrid_path, Convert_genome_coordinates, Write_posterior_probs, Make_inhomogeneous_transition_matrix, Simulate_from_transition_matrix, Write_inhomogeneous_transition_matrix, Emission_probs_poisson
5 | from bcf_vcf import make_out_group, make_ingroup_obs
6 | from make_test_data import simulate_path, write_data
7 | from make_mutationrate import make_mutation_rate
8 | from helper_functions import Load_observations_weights_mutrates, handle_individuals_input, handle_infiles, combined_files, find_runs
9 | from artemis import Find_best_alpha
10 | 
11 | VERSION = '0.8.2'
12 | 
13 | def print_script_usage():
14 |     toprint = f'''
15 |     Script for identifying introgressed archaic segments (version: {VERSION})
16 | 
17 |     > Tutorial:
18 |     hmmix make_test_data
19 |     hmmix train  -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json
20 |     hmmix decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json
21 | 
22 | 
23 |     Different modes (you can also see the options for each by writing hmmix make_test_data -h):
24 |     > make_test_data
25 |         -windows            Number of Kb windows to create (defaults to 50,000 per chromosome)
26 |         -chromosomes        Number of chromosomes to simulate (defaults to 2)
27 |         -no_out_files       Don't create obs.txt, mutrates.bed, weights.bed, Initialguesses.json (default is to create them)
28 |         -param              markov parameters file (default is human/neanderthal like parameters)
29 |         -seed               Set seed (default is 42)
30 | 
31 |     > mutation_rate
32 |         -outgroup           [required] path to variants found in outgroup
33 |         -out                outputfile (defaults to mutationrate.bed)
34 |         -weights            file with callability (defaults to all positions being called)
35 |         -window_size        size of bins (defaults to 1 Mb)
36 | 
37 |     > create_outgroup
38 |         -ind                [required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2
39 |         -vcf                [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
40 |         -weights            file with callability (defaults to all positions being called)
41 |         -out                outputfile (defaults to stdout)
42 |         -ancestral          fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
43 |         -refgenome          fasta file with reference genome - comma-separated list or wildcards like vcf argument (default none)
44 | 
45 |     > create_ingroup
46 |         -ind                [required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2
47 |         -vcf                [required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf
48 |         -outgroup           [required] path to variants found in outgroup
49 |         -weights            file with callability (defaults to all positions being called)
50 |         -out                outputfile prefix (default is a file named obs.<ind>.txt where <ind> is the name of individual in ingroup/outgroup list)
51 |         -ancestral          fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)
52 | 
53 |     > train
54 |         -obs                [required] file with observation data
55 |         -chrom              Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3 (default is use all chromosomes)
56 |         -weights            file with callability (defaults to all positions being called)
57 |         -mutrates           file with mutation rates (default is mutation rate is uniform)
58 |         -param              markov parameters file (default is human/neanderthal like parameters)
59 |         -out                outputfile (default is a file named trained.json)
60 |         -window_size        size of bins (default is 1000 bp)
61 |         -haploid            Change from using diploid data to haploid data (default is diploid)
62 | 
63 |     > decode
64 |         -obs                [required] file with observation data
65 |         -chrom              Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3 (default is use all chromosomes)
66 |         -weights            file with callability (defaults to all positions being called)
67 |         -mutrates           file with mutation rates (default is mutation rate is uniform)
68 |         -param              markov parameters file (default is human/neanderthal like parameters)
69 |         -out                outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)
70 |         -window_size        size of bins (default is 1000 bp)
71 |         -haploid            Change from using diploid data to haploid data (default is diploid)
72 |         -admixpop           Annotate using vcffile with admixing population (default is none)
73 |         -extrainfo          Add variant position for each SNP (default is off)
74 |         -viterbi            decode using the viterbi algorithm (default is posterior decoding)
75 |         -hybrid             decode using the hybrid algorithm. Set value between 0 and 1 where 0=posterior and 1=viterbi
76 |         -posterior_probs    File location for posterior probs
77 | 
78 |     > inhomogeneous
79 |         -obs                [required] file with observation data
80 |         -chrom              Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3 (default is use all chromosomes)
81 |         -weights            file with callability (defaults to all positions being called)
82 |         -mutrates           file with mutation rates (default is mutation rate is uniform)
83 |         -param              markov parameters file (default is human/neanderthal like parameters)
84 |         -out                outputfile prefix .hap1_sim(0-n).txt and .hap2_sim(0-n).txt if -haploid option is used or .diploid_(0-n).txt (default is stdout)
85 |         -window_size        size of bins (default is 1000 bp)
86 |         -haploid            Change from using diploid data to haploid data (default is diploid)
87 |         -samples            Number of simulated paths for the inhomogeneous markov chain (default is 100)
88 |         -admixpop           Annotate using vcffile with admixing population (default is none)
89 |         -extrainfo          Add variant position for each SNP (default is off)
90 |         -inhomogen_matrix   File location for inhomogeneous transition matrix
91 | 
92 |     > artemis
93 |         -param              [required] markov parameters file (default is human/neanderthal like parameters)
94 |         -out_plot           File path for artemis plot - can be pdf or jpg (default is Artemis_plot.pdf)
95 |         -out                Save alphas, likelihoods and pointwise accuracy to file (default is stdout)
96 |         -windows            Number of Kb windows to create (defaults to 500,000)
97 |         -iterations         Number of iterations (defaults to 10)
98 |         -start              First alpha value to simulate (default is 0)
99 |         -end                Last alpha value to simulate (default is 1)
100 |         -steps              Number of alpha values to simulate between start and end (defaults to 101)
101 |         -seed               Set seed (default is 42)
102 |     '''
103 | 
104 |     return toprint
105 | 
106 | 
107 | 
108 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
109 | # Main
110 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
111 | def main():
112 | 
113 |     parser = argparse.ArgumentParser(description=print_script_usage(), formatter_class=argparse.RawTextHelpFormatter)
114 | 
115 |     subparser = parser.add_subparsers(dest = 'mode')
116 | 
117 |     # Make test data
118 |     test_subparser = subparser.add_parser('make_test_data', help='Create test data')
119 |     test_subparser.add_argument("-windows", metavar='',help="Number of Kb windows to create (defaults to 50,000 per chromosome)", type=int, default = 50000)
120 |     test_subparser.add_argument("-chromosomes", metavar='',help="Number of chromosomes to simulate (defaults to 2)", type=int, default = 2)
121 |     test_subparser.add_argument("-no_out_files",help="Don't create obs.txt, mutrates.bed, weights.bed, Initialguesses.json (default is to create them)", action='store_false', default = True)
122 |     test_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str)
123 |     test_subparser.add_argument("-seed", metavar='',help="set seed", type=int, default=42)
124 | 
125 |     # Make outgroup
126 |     outgroup_subparser = subparser.add_parser('create_outgroup', help='Create outgroup information')
127 |     outgroup_subparser.add_argument("-ind",help="[required] ingroup/outgroup list (json file) or comma-separated list e.g. ind1,ind2", type=str, required = True)
ind1,ind2", type=str, required = True) 128 | outgroup_subparser.add_argument("-vcf",help="[required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. chr*.bcf", type=str, required = True) 129 | outgroup_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 130 | outgroup_subparser.add_argument("-out", metavar='',help="outputfile (defaults to stdout)", default = '/dev/stdout') 131 | outgroup_subparser.add_argument("-ancestral", metavar='',help="fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)", default='') 132 | outgroup_subparser.add_argument("-refgenome", metavar='',help="fasta file with reference genome - comma-separated list or wildcards like vcf argument (default none)", default='') 133 | 134 | # Make mutation rates 135 | mutation_rate = subparser.add_parser('mutation_rate', help='Estimate mutation rate') 136 | mutation_rate.add_argument("-outgroup", help="[required] path to variants found in outgroup", type=str, required = True) 137 | mutation_rate.add_argument("-out", metavar='',help="outputfile (defaults to mutationrate.bed)", default = 'mutationrate.bed') 138 | mutation_rate.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 139 | mutation_rate.add_argument("-window_size", metavar='',help="size of bins (defaults to 1 Mb)", type=int, default = 1000000) 140 | 141 | # Make ingroup observations 142 | create_obs_subparser = subparser.add_parser('create_ingroup', help='Create ingroup data') 143 | create_obs_subparser.add_argument("-ind", help="[required] ingroup/outgrop list (json file) or comma-separated list e.g. ind1,ind2", type=str, required = True) 144 | create_obs_subparser.add_argument("-vcf", help="[required] path to list of comma-separated vcf/bcf file(s) or wildcard characters e.g. 
chr*.bcf", type=str, required = True) 145 | create_obs_subparser.add_argument("-outgroup", help="[required] path to variant found in outgroup", type=str, required = True) 146 | create_obs_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 147 | create_obs_subparser.add_argument("-out", metavar='',help="outputfile prefix (default is a file named obs..txt where ind is the name of individual in ingroup/outgrop list)", default = 'obs') 148 | create_obs_subparser.add_argument("-ancestral", metavar='',help="fasta file with ancestral information - comma-separated list or wildcards like vcf argument (default none)", default='') 149 | 150 | # Train model 151 | train_subparser = subparser.add_parser('train', help='Train HMM') 152 | train_subparser.add_argument("-obs",help="[required] file with observation data", type=str, required = True) 153 | train_subparser.add_argument("-chrom",help="Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3", type=str, default='All') 154 | train_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 155 | train_subparser.add_argument("-mutrates", metavar='',help="file with mutation rates (default is mutation rate is uniform)") 156 | train_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str) 157 | train_subparser.add_argument("-out", metavar='',help="outputfile (default is a file named trained.json)", default = 'trained.json') 158 | train_subparser.add_argument("-window_size", metavar='',help="size of bins (default is 1000 bp)", type=int, default = 1000) 159 | train_subparser.add_argument("-haploid",help="Change from using diploid data to haploid data (default is diploid)", action='store_true', default = False) 160 | 161 | # Decode model 162 | decode_subparser = subparser.add_parser('decode', help='Decode HMM') 163 | decode_subparser.add_argument("-obs",help="[required] file with observation data", type=str, required = True) 164 | decode_subparser.add_argument("-chrom",help="Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3", type=str, default='All') 165 | decode_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)") 166 | decode_subparser.add_argument("-mutrates", metavar='',help="file with mutation rates (default is mutation rate is uniform)") 167 | decode_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str) 168 | decode_subparser.add_argument("-out", metavar='',help="outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)", default = '/dev/stdout') 169 | decode_subparser.add_argument("-window_size", metavar='',help="size of bins (default is 1000 bp)", type=int, default = 1000) 170 | decode_subparser.add_argument("-haploid",help="Change from using diploid data to haploid data (default is diploid)", action='store_true', default = False) 171 | decode_subparser.add_argument("-admixpop",help="Annotate using vcffile with admixing population (default is none)") 172 | decode_subparser.add_argument("-extrainfo",help="Add archaic information on each SNP", action='store_true', default = False) 173 | decode_subparser.add_argument("-viterbi",help="Decode using the Viterbi algorithm", action='store_true', default = 
174 |     decode_subparser.add_argument("-hybrid",help="Decode using the hybrid algorithm. Set value between 0 and 1 where 0=posterior and 1=viterbi", type=float, default = -1)
175 |     decode_subparser.add_argument("-posterior_probs",help="File location for posterior probs", default = None)
176 | 
177 |     # inhomogeneous markov chain
178 |     inhomogen_subparser = subparser.add_parser('inhomogeneous', help='Make inhomogeneous markov chain')
179 |     inhomogen_subparser.add_argument("-obs",help="[required] file with observation data", type=str, required = True)
180 |     inhomogen_subparser.add_argument("-chrom",help="Subset to chromosome or comma separated list of chromosomes e.g chr1 or chr1,chr2,chr3", type=str, default='All')
181 |     inhomogen_subparser.add_argument("-weights", metavar='',help="file with callability (defaults to all positions being called)")
182 |     inhomogen_subparser.add_argument("-mutrates", metavar='',help="file with mutation rates (default is mutation rate is uniform)")
183 |     inhomogen_subparser.add_argument("-param", metavar='',help="markov parameters file (default is human/neanderthal like parameters)", type=str)
184 |     inhomogen_subparser.add_argument("-out", metavar='',help="outputfile prefix .hap1.txt and .hap2.txt if -haploid option is used or .diploid.txt (default is stdout)", default = '/dev/stdout')
185 |     inhomogen_subparser.add_argument("-window_size", metavar='',help="size of bins (default is 1000 bp)", type=int, default = 1000)
186 |     inhomogen_subparser.add_argument("-haploid",help="Change from using diploid data to haploid data (default is diploid)", action='store_true', default = False)
187 |     inhomogen_subparser.add_argument("-samples",help="Number of paths to sample (default is 100)", type=int, default = 100)
188 |     inhomogen_subparser.add_argument("-admixpop",help="Annotate using vcffile with admixing population (default is none)")
189 |     inhomogen_subparser.add_argument("-extrainfo",help="Add archaic information on each SNP", action='store_true', default = False)
190 |     inhomogen_subparser.add_argument("-inhomogen_matrix",help="File location for inhomogeneous transition matrix", default = None)
191 | 
192 |     # Find best alpha (artemis plots)
193 |     artemis_subparser = subparser.add_parser('artemis', help='Find best alphas and make artemis plots')
194 |     artemis_subparser.add_argument("-param", metavar='',help="[required] markov parameters file (default is human/neanderthal like parameters)", type=str, required = True)
195 |     artemis_subparser.add_argument("-out", metavar='',help="Save alphas, likelihoods and pointwise accuracy to file (default is stdout)", default = '/dev/stdout')
196 |     artemis_subparser.add_argument("-out_plot", metavar='',help="File path for artemis plot - can be pdf or jpg (default is Artemis_plot.pdf)", default = 'Artemis_plot.pdf')
197 |     artemis_subparser.add_argument("-windows", metavar='',help="Number of Kb windows to create (defaults to 500,000)", type=int, default = 500000)
198 |     artemis_subparser.add_argument("-iterations",help="Number of iterations", type = int, default = 10)
199 |     artemis_subparser.add_argument("-start",help="First alpha value to simulate (default is 0)", type=float, default = 0.0)
200 |     artemis_subparser.add_argument("-end",help="Last alpha value to simulate (default is 1)", type=float, default = 1.0)
201 |     artemis_subparser.add_argument("-steps",help="Number of steps (values to simulate between start and end)", type=int, default = 101)
202 |     artemis_subparser.add_argument("-seed", metavar='',help="set seed", type=int, 
default=42) 203 | 204 | args = parser.parse_args() 205 | 206 | # Make test data 207 | # ------------------------------------------------------------------------------------------------------------ 208 | if args.mode == 'make_test_data': 209 | 210 | print('-' * 40) 211 | print(f'> creating {args.chromosomes} chromosomes each with {args.windows} kb of test data with the following parameters..') 212 | hmm_parameters = read_HMM_parameters_from_file(args.param) 213 | print(f'> hmm parameters file: {args.param}') 214 | print(hmm_parameters) 215 | print(f'> Seed is {args.seed}') 216 | print('-' * 40) 217 | 218 | obs, mutrates, weights, path = simulate_path(args.windows, args.chromosomes, hmm_parameters, args.seed) 219 | 220 | if args.no_out_files: 221 | write_data(path, obs, args.windows, args.chromosomes, hmm_parameters, args.seed) 222 | 223 | 224 | # Train parameters 225 | # ------------------------------------------------------------------------------------------------------------ 226 | elif args.mode == 'train': 227 | 228 | hmm_parameters = read_HMM_parameters_from_file(args.param) 229 | obs, _, _, _, mutrates, weights = Load_observations_weights_mutrates(args.obs, args.weights, args.mutrates, args.window_size, args.haploid, args.chrom) 230 | 231 | print('-' * 40) 232 | print(hmm_parameters) 233 | print(f'> chromosomes to use: {args.chrom}') 234 | print(f'> number of windows: {len(obs)}. Number of snps = {sum(obs)}') 235 | print(f'> total callability: {int(np.sum(weights) * args.window_size)} bp ({round(np.sum(weights) / len(obs) * 100,2)} %)') 236 | print('> average mutation rate per bin:', round(np.sum(mutrates * weights) / np.sum(weights), 2) ) 237 | print('> Output is',args.out) 238 | print('> Window size is',args.window_size, 'bp') 239 | print('> Haploid',args.haploid) 240 | print('-' * 40) 241 | 242 | hmm_parameters = TrainModel(obs, mutrates, weights, hmm_parameters) 243 | write_HMM_to_file(hmm_parameters, args.out) 244 | 245 | 246 | # Decode observations using parameters 247 | # ------------------------------------------------------------------------------------------------------------ 248 | elif args.mode == 'decode': 249 | 250 | obs, chroms, starts, variants, mutrates, weights = Load_observations_weights_mutrates(args.obs, args.weights, args.mutrates, args.window_size, args.haploid, args.chrom) 251 | hmm_parameters = read_HMM_parameters_from_file(args.param) 252 | CHROMOSOME_BREAKPOINTS = [x for x in find_runs(chroms)] 253 | 254 | print('-' * 40) 255 | print(hmm_parameters) 256 | print(f'> chromosomes to use: {args.chrom}') 257 | print(f'> number of windows: {len(obs)}. 
Number of snps = {sum(obs)}') 258 | print(f'> total callability: {int(np.sum(weights) * args.window_size)} bp ({round(np.sum(weights) / len(obs) * 100,2)} %)') 259 | print('> average mutation rate per bin:', round(np.sum(mutrates * weights) / np.sum(weights), 2) ) 260 | print('> Output prefix is',args.out) 261 | print('> Window size is',args.window_size, 'bp') 262 | print('> Haploid',args.haploid) 263 | 264 | emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates) 265 | posterior_probs = Calculate_Posterior_probabillities(emissions, hmm_parameters) 266 | 267 | if args.hybrid != -1: 268 | if 0 <= args.hybrid <= 1: 269 | print(f'> Decode using hybrid algorithm with parameter: {args.hybrid}') 270 | print('-' * 40) 271 | logged_posterior_probs = np.log(posterior_probs.T) 272 | path = Hybrid_path(emissions, hmm_parameters.starting_probabilities, hmm_parameters.transitions, logged_posterior_probs, args.hybrid) 273 | else: 274 | sys.exit('\n\nERROR! Hybrid parameter must be between 0 and 1\n\n') 275 | else: 276 | if args.viterbi: 277 | print('> Decode using viterbi algorithm') 278 | print('-' * 40) 279 | path = Viterbi_path(emissions, hmm_parameters) 280 | else: 281 | print('> Decode with posterior decoding') 282 | print('-' * 40) 283 | path = PMAP_path(posterior_probs) 284 | 285 | 286 | if args.posterior_probs is not None: 287 | Write_posterior_probs(chroms, starts, weights, mutrates, posterior_probs, path, variants, hmm_parameters, args.posterior_probs) 288 | 289 | segments = Convert_genome_coordinates(args.window_size, CHROMOSOME_BREAKPOINTS, starts, variants, posterior_probs, path, hmm_parameters, weights, mutrates, obs) 290 | Write_Decoded_output(args.out, segments, args.obs, args.admixpop, args.extrainfo) 291 | 292 | 293 | # inhomogeneous markov chain 294 | # ------------------------------------------------------------------------------------------------------------ 295 | elif args.mode == 'inhomogeneous': 296 | 297 | obs, chroms, starts, variants, mutrates, weights = Load_observations_weights_mutrates(args.obs, args.weights, args.mutrates, args.window_size, args.haploid, args.chrom) 298 | hmm_parameters = read_HMM_parameters_from_file(args.param) 299 | CHROMOSOME_BREAKPOINTS = [x for x in find_runs(chroms)] 300 | 301 | print('-' * 40) 302 | print(hmm_parameters) 303 | print(f'> chromosomes to use: {args.chrom}') 304 | print(f'> number of windows: {len(obs)}. 
Number of snps = {sum(obs)}') 305 | print(f'> total callability: {int(np.sum(weights) * args.window_size)} bp ({round(np.sum(weights) / len(obs) * 100,2)} %)') 306 | print('> average mutation rate per bin:', round(np.sum(mutrates * weights) / np.sum(weights), 2) ) 307 | print('> Output prefix is',args.out) 308 | print('> Window size is',args.window_size, 'bp') 309 | print('> Haploid',args.haploid) 310 | print('-' * 40) 311 | 312 | # Find segments and write output 313 | emissions = Emission_probs_poisson(hmm_parameters.emissions, obs, weights, mutrates) 314 | posterior_probs = Calculate_Posterior_probabillities(emissions, hmm_parameters) 315 | starting_probabilities, inhom_transition_matrix = Make_inhomogeneous_transition_matrix(emissions, hmm_parameters) 316 | 317 | if args.inhomogen_matrix is not None: 318 | Write_inhomogeneous_transition_matrix(chroms, starts, weights, mutrates, variants, hmm_parameters, inhom_transition_matrix, args.inhomogen_matrix) 319 | 320 | for sim_number in range(args.samples): 321 | print(f'Running inhomogen markov chain simulation {sim_number + 1}/{args.samples}') 322 | path = Simulate_from_transition_matrix(starting_probabilities, inhom_transition_matrix) 323 | segments = Convert_genome_coordinates(args.window_size, CHROMOSOME_BREAKPOINTS, starts, variants, posterior_probs, path, hmm_parameters, weights, mutrates, obs) 324 | 325 | if args.out == '/dev/stdout': 326 | output = args.out 327 | else: 328 | output = f'{args.out}.{sim_number}' 329 | 330 | Write_Decoded_output(output, segments, args.obs, args.admixpop, args.extrainfo) 331 | 332 | 333 | 334 | # Create outgroup snps (set of snps to be removed) 335 | # ------------------------------------------------------------------------------------------------------------ 336 | elif args.mode == 'create_outgroup': 337 | 338 | # Get list of outgroup individuals 339 | outgroup_individuals = handle_individuals_input(args.ind, 'outgroup') 340 | 341 | # Get a list of vcffiles and ancestral files and intersect them 342 | vcffiles = handle_infiles(args.vcf) 343 | ancestralfiles = handle_infiles(args.ancestral) 344 | refgenomefiles = handle_infiles(args.refgenome) 345 | 346 | ancestralfiles, vcffiles = combined_files(ancestralfiles, vcffiles) 347 | refgenomefiles, vcffiles = combined_files(refgenomefiles, vcffiles) 348 | 349 | print('-' * 40) 350 | print('> Outgroup individuals:', len(outgroup_individuals)) 351 | print('> Using vcf and ancestral files') 352 | for vcffile, ancestralfile, reffile in zip(vcffiles, ancestralfiles, refgenomefiles): 353 | print('vcffile:',vcffile, 'ancestralfile:',ancestralfile, 'reffile:', reffile) 354 | print() 355 | print('> Callability file:', args.weights) 356 | print(f'> Writing output to:', args.out) 357 | print('-' * 40) 358 | 359 | make_out_group(outgroup_individuals, args.weights, vcffiles, args.out, ancestralfiles, refgenomefiles) 360 | 361 | 362 | # Create ingroup observations 363 | # ------------------------------------------------------------------------------------------------------------ 364 | elif args.mode == 'create_ingroup': 365 | 366 | # Get a list of ingroup individuals 367 | ingroup_individuals = handle_individuals_input(args.ind,'ingroup') 368 | 369 | # Get a list of vcffiles and ancestral files and intersect them 370 | vcffiles = handle_infiles(args.vcf) 371 | ancestralfiles = handle_infiles(args.ancestral) 372 | 373 | ancestralfiles, vcffiles = combined_files(ancestralfiles, vcffiles) 374 | 375 | print('-' * 40) 376 | print('> Ingroup individuals:', 
376 |         print('> Ingroup individuals:', len(ingroup_individuals))
377 |         print('> Using vcf and ancestral files')
378 |         for vcffile, ancestralfile in zip(vcffiles, ancestralfiles):
379 |             print('vcffile:',vcffile, 'ancestralfile:',ancestralfile)
380 |         print()
381 |         print('> Using outgroup variants from:', args.outgroup)
382 |         print('> Callability file:', args.weights)
383 |         print(f'> Writing output to file with prefix: {args.out}.<individual>.txt')
384 |         print('-' * 40)
385 |
386 |         make_ingroup_obs(ingroup_individuals, args.weights, vcffiles, args.out, args.outgroup, ancestralfiles)
387 |
388 |
389 |     # Estimate mutation rate
390 |     # ------------------------------------------------------------------------------------------------------------
391 |     elif args.mode == 'mutation_rate':
392 |         print('-' * 40)
393 |         print('> Outgroupfile:', args.outgroup)
394 |         print('> Outputfile is:', args.out)
395 |         print('> Callability file is:', args.weights)
396 |         print('> Window size:', args.window_size)
397 |         print('-' * 40)
398 |
399 |         make_mutation_rate(args.outgroup, args.out, args.weights, args.window_size)
400 |
401 |
402 |     # Find best alphas and make artemis plots
403 |     # ------------------------------------------------------------------------------------------------------------
404 |     elif args.mode == 'artemis':
405 |
406 |         hmm_parameters = read_HMM_parameters_from_file(args.param)
407 |
408 |         print('-' * 40)
409 |         print(hmm_parameters)
410 |         print(f'> Save data to {args.out}')
411 |         print(f'> Save plot to {args.out_plot}')
412 |         print(f'> Number of windows: {args.windows}')
413 |         print(f'> Number of iterations: {args.iterations}')
414 |         print(f'> Test {args.steps} alphas between {args.start} and {args.end}:')
415 |         print(f'> Seed is {args.seed}')
416 |         print('-' * 40)
417 |
418 |         Find_best_alpha(hmm_parameters, args.windows, args.out, args.out_plot, args.iterations, args.start, args.end, args.steps, args.seed)
419 |
420 |
421 |     # Print usage
422 |     # ------------------------------------------------------------------------------------------------------------
423 |     else:
424 |         print(print_script_usage())
425 |
426 |
427 | if __name__ == "__main__":
428 |     main()
429 |
430 |
--------------------------------------------------------------------------------
/src/make_mutationrate.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from collections import defaultdict
3 |
4 | from helper_functions import sortby, make_callability_from_bed, Make_folder_if_not_exists
5 |
6 | def make_mutation_rate(freqfile, outfile, callablefile, window_size):
7 |
8 |     snps_counts_window = defaultdict(lambda: defaultdict(int))
9 |     with open(freqfile) as data:
10 |         for line in data:
11 |             if not line.startswith('chrom'):
12 |                 chrom, pos = line.strip().split()[0:2]
13 |                 pos = int(pos)
14 |                 window = pos - pos%window_size
15 |                 snps_counts_window[chrom][window] += 1
16 |
17 |
18 |     mutations = []
19 |     genome_positions = []
20 |     for chrom in sorted(snps_counts_window, key=sortby):
21 |         lastwindow = max(snps_counts_window[chrom]) + window_size
22 |
23 |         for window in range(0, lastwindow, window_size):
24 |             mutations.append(snps_counts_window[chrom][window])
25 |             genome_positions.append([chrom, window, window + window_size])
26 |
27 |     mutations = np.array(mutations)
28 |
29 |     if callablefile is not None:
30 |         callability = make_callability_from_bed(callablefile, window_size)
31 |         callable_region = []
32 |         for chrom in sorted(snps_counts_window, key=sortby):
33 |             lastwindow = max(snps_counts_window[chrom]) + window_size
34 |             for window in range(0, lastwindow, window_size):
35 |                 callable_region.append(callability[chrom][window]/window_size)
36 |     else:
37 |         callable_region = np.ones(len(mutations)) * window_size  # uniform callability; the constant scale cancels in genome_mean below
38 |
39 |     genome_mean = np.sum(mutations) / np.sum(callable_region)
40 |
41 |     Make_folder_if_not_exists(outfile)
42 |     with open(outfile,'w') as out:
43 |         print('chrom', 'start', 'end', 'mutationrate', sep = '\t', file = out)
44 |         for genome_pos, mut, call in zip(genome_positions, mutations, callable_region):
45 |             chrom, start, end = genome_pos
46 |             if mut * call == 0:
47 |                 ratio = 0
48 |             else:
49 |                 ratio = round(mut/call/genome_mean, 2)
50 |
51 |             print(chrom, start, end, ratio, sep = '\t', file = out)
52 |
--------------------------------------------------------------------------------
/src/make_test_data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numba import njit
3 |
4 | from hmm_functions import HMMParam, write_HMM_to_file, Simulate_transition
5 | from helper_functions import find_runs
6 |
7 |
8 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
9 | # Make test data
10 | # ----------------------------------------------------------------------------------------------------------------------------------------------------------------
11 |
12 | @njit
13 | def set_seed(value):
14 |     np.random.seed(value)
15 |
16 |
17 | @njit
18 | def simulate_mutation_position(n_mutations):
19 |     return np.random.choice(1000, n_mutations)
20 |
21 |
22 | @njit
23 | def simulate_poisson(lam):
24 |     return np.random.poisson(lam)
25 |
26 |
27 | def simulate_path(data_set_length, n_chromosomes, hmm_parameters, SEED):
28 |     '''Create test data set of size data_set_length. Also create uniform weights and uniform mutation rates'''
29 |
30 |     # Config
31 |     np.random.seed(SEED)
32 |     set_seed(SEED)
33 |
34 |     total_size = data_set_length * n_chromosomes
35 |
36 |     state_values = [x for x in range(len(hmm_parameters.state_names))]
37 |     n_states = len(state_values)
38 |     path = np.zeros(total_size, dtype = int)
39 |
40 |     observations = np.zeros(total_size, dtype = int)
41 |     weights = np.ones(total_size)
42 |     mutrates = np.ones(total_size)
43 |
44 |     for index in range(total_size):
45 |
46 |         # Use prior dist if starting window
47 |         if index == 0:
48 |             current_state = np.random.choice(state_values, p=hmm_parameters.starting_probabilities)
49 |         else:
50 |             current_state = Simulate_transition(n_states, hmm_parameters.transitions[prevstate,:], prevstate)
51 |
52 |
53 |         observations[index] = simulate_poisson(hmm_parameters.emissions[current_state])
54 |         path[index] = current_state
55 |         prevstate = current_state
56 |
57 |
58 |     return observations, mutrates, weights, path
59 |
60 |
61 | def write_data(path, obs, data_set_length, n_chromosomes, hmm_parameters, SEED):
62 |
63 |     # Config
64 |     np.random.seed(SEED)
65 |     set_seed(SEED)
66 |
67 |     window_size = 1000
68 |     bases = np.array(['A','C','G','T'])
69 |
70 |     CHROMOSOMES = [f'chr{x + 1}' for x in range(n_chromosomes)]
71 |     CHROMOSOME_RUNS = []
72 |     previous_start = 0
73 |     for chrom in CHROMOSOMES:
74 |         CHROMOSOME_RUNS.append([chrom, previous_start, data_set_length])
75 |         previous_start += data_set_length
76 |
77 |
78 |
79 |     # Make obs file and true simulated segments
80 |     with open('obs.txt','w') as obs_file, open('simulated_segments.txt', 'w') as out:
81 |         print('chrom', 'pos', 'ancestral_base', 'genotype', sep = '\t', file = obs_file)
82 |         print('chrom', 'start', 'end', 'length', 'state', sep = '\t', file = out)
83 |
84 |         for (chrom, chrom_start_index, chrom_length_index) in CHROMOSOME_RUNS:
85 |             for (state_id, start_index, length_index) in find_runs(path[chrom_start_index:chrom_start_index + chrom_length_index]):
86 |
87 |                 state = hmm_parameters.state_names[state_id]
88 |                 genome_start = start_index * window_size
89 |                 genome_length = length_index * window_size
90 |                 genome_end = genome_start + genome_length
91 |                 print(chrom, genome_start, genome_end, genome_length, state, sep = '\t', file = out)
92 |
93 |                 # write mutations
94 |                 n_mutations_segment = obs[(chrom_start_index + start_index):(chrom_start_index + start_index + length_index)]
95 |                 for index, n_mutations in enumerate(n_mutations_segment):
96 |
97 |                     if n_mutations == 0:
98 |                         continue
99 |
100 |                     for random_int in simulate_mutation_position(n_mutations):
101 |                         mutation = (start_index + index) * window_size + random_int
102 |                         ancestral_base, derived_base = np.random.choice(bases, 2, replace = False)
103 |
104 |                         print(chrom, mutation, ancestral_base, ancestral_base + derived_base, sep = '\t', file = obs_file)
105 |
106 |     # Make weights file and mutation file
107 |     with open('weights.bed','w') as weights_file, open('mutrates.bed','w') as mutrates_file:
108 |         for chrom in CHROMOSOMES:
109 |             print(chrom, 0, data_set_length * window_size, sep = '\t', file = weights_file)
110 |             print(chrom, 0, data_set_length * window_size, 1, sep = '\t', file = mutrates_file)
111 |
112 |     # Make initial guesses
113 |     initial_guess = HMMParam(['Human', 'Archaic'], [0.5, 0.5], [[0.99,0.01],[0.02,0.98]], [0.03, 0.3])
114 |     write_HMM_to_file(initial_guess, 'Initialguesses.json')
115 |
116 |
117 |     return
118 |
119 |
--------------------------------------------------------------------------------
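
The `mutation_rate` mode in `src/main.py` is a thin wrapper around `make_mutation_rate` from `src/make_mutationrate.py`, so the same computation can be driven directly from Python. Below is a minimal sketch of such a call; the input file names (`outgroup.txt`, `callable.bed`) are hypothetical placeholders, and only the function signature shown in the source above is assumed.

```python
# Minimal sketch (not part of the repository): calling make_mutation_rate
# directly instead of via `hmmix mutation_rate`. File names are hypothetical.
from make_mutationrate import make_mutation_rate

make_mutation_rate(
    freqfile='outgroup.txt',      # variants in the outgroup; chrom and pos in the first two columns
    outfile='mutationrate.bed',   # tab-separated output: chrom, start, end, mutationrate
    callablefile='callable.bed',  # BED file of callable regions, or None to treat every window as fully callable
    window_size=1_000_000,        # bin size in bp
)
```

Each reported rate is the window's SNP density divided by the genome-wide mean density, so 1.0 marks a window of average mutability and 0 marks windows with no SNPs or no callable sequence.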
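
Likewise, `src/make_test_data.py` can be used as a library instead of through `hmmix make_test_data`. The sketch below simulates two chromosomes of 50,000 windows each and writes obs.txt, simulated_segments.txt, weights.bed, mutrates.bed and Initialguesses.json; the `HMMParam` values mirror the initial guesses in the source above, and it is assumed that `HMMParam` stores its fields as numpy arrays (which the `transitions[prevstate,:]` indexing in `simulate_path` requires).

```python
# Minimal sketch (not part of the repository): generating simulated data by
# calling the functions from src/make_test_data.py directly.
from hmm_functions import HMMParam
from make_test_data import simulate_path, write_data

# Two-state human/archaic parameters, mirroring the initial guesses above
hmm_parameters = HMMParam(['Human', 'Archaic'],          # state names
                          [0.5, 0.5],                    # starting probabilities
                          [[0.99, 0.01], [0.02, 0.98]],  # transition matrix
                          [0.03, 0.3])                   # Poisson emission rates per 1 kb window

obs, mutrates, weights, path = simulate_path(data_set_length=50_000,
                                             n_chromosomes=2,
                                             hmm_parameters=hmm_parameters,
                                             SEED=42)
write_data(path, obs, 50_000, 2, hmm_parameters, SEED=42)
```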