├── .gitignore ├── .gitmodules ├── Dockerfile ├── LICENSE ├── Makefile ├── README.md ├── hapdiff.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "submodules/svim-asm"] 2 | path = submodules/svim-asm 3 | url = https://github.com/fenderglass/svim-asm 4 | [submodule "submodules/minimap2"] 5 | path = submodules/minimap2 6 | url = https://github.com/lh3/minimap2 7 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ubuntu:22.04 2 | MAINTAINER Mikhail Kolmogorov, mikolmogorov@gmail.com 3 | 4 | # update and install dependencies 5 | RUN apt-get update && \ 6 | DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata && \ 7 | apt-get -y install make gcc g++ && \ 8 | apt-get -y install autoconf bzip2 wget tabix libz-dev libncurses5-dev libbz2-dev liblzma-dev && \ 9 | #apt-get -y install samtools && \ 10 | apt-get -y install bedtools && \ 11 | apt-get -y install python3-pip 12 | 13 | RUN python3 --version 14 | RUN python3 -m pip install --upgrade pip 15 | RUN python3 -m pip install pysam scipy edlib matplotlib biopython 16 | 17 | ### samtools 18 | # 1.9 19 | WORKDIR /opt/samtools 20 | RUN wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 && \ 21 | tar xvf samtools-1.9.tar.bz2 && \ 22 | rm -r /opt/samtools/samtools-1.9.tar.bz2 && \ 23 | cd samtools-1.9/ && \ 24 | autoheader && \ 25 | autoconf -Wno-header && \ 26 | ./configure && \ 27 | make && \ 28 | cp samtools /usr/bin/samtools 29 | 30 | COPY . 
/opt/hapdiff 31 | WORKDIR /opt/hapdiff 32 | RUN make 33 | 34 | ENV PATH "/opt/hapdiff:${PATH}" 35 | ENV PYTHONUNBUFFERED "1" 36 | ENV MPLCONFIGDIR "/tmp" 37 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Mikhail Kolmogorov 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: 2 | make -C submodules/minimap2 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # hapdiff 2 | 3 | This is a simple SV calling package for diploid assemblies. 
It uses a modified version of [svim-asm](https://github.com/eldariont/svim-asm). 4 | The package includes its own version of [minimap2](https://github.com/lh3/minimap2) to ensure reproducibility between runs, 5 | as the results may depend on the aligner version and parameters. 6 | 7 | ## Version 0.9 8 | 9 | Quick start 10 | ----------- 11 | 12 | 13 | hapdiff takes as input a reference genome and a pair of haplotypes, and outputs 14 | structural variant calls in VCF format. The recommended way to run it is the Docker distribution. 15 | 16 | The next steps assume that your `ref.fasta`, `hap_1.fasta` and `hap_2.fasta` are in the same directory, 17 | which will also be used for hapdiff output. If this is not the case, you might need to bind additional 18 | directories using Docker's `-v / --volume` argument. The number of threads (`-t` argument) 19 | should be adjusted according to the available resources. 20 | 21 | 22 | ``` 23 | cd directory_with_input 24 | DD_DIR=`pwd` 25 | docker run -v $DD_DIR:$DD_DIR -u `id -u`:`id -g` mkolmogo/hapdiff:0.9 \ 26 | hapdiff.py --reference $DD_DIR/ref.fasta --pat $DD_DIR/hap_1.fasta --mat $DD_DIR/hap_2.fasta --out-dir $DD_DIR/hapdiff -t 20 27 | ``` 28 | 29 | Output files 30 | ------------ 31 | 32 | The output directory will contain `hapdiff_unphased.vcf.gz` and `hapdiff_phased.vcf.gz` files with structural variants. 33 | Both files contain the same SVs, with either unphased or phased genotypes. 34 | 35 | The output also contains `confident_regions.bed`, which marks the reference regions covered by alignments of both haplotypes, where SV calls are comprehensive. 36 | 37 | 38 | Source Installation 39 | ------------------- 40 | 41 | Alternatively, you can run hapdiff locally as follows. 
42 | 43 | ``` 44 | git clone https://github.com/KolmogorovLab/hapdiff 45 | cd hapdiff 46 | git submodule update --init 47 | make 48 | pip install -r requirements.txt 49 | ``` 50 | 51 | In addition, hapdiff requires [samtools](https://github.com/samtools) and [bedtools](https://github.com/arq5x/bedtools2) 52 | to be installed on your system, along with `bgzip` and `tabix`, which `hapdiff.py` calls to compress and index the output VCFs. 53 | 54 | Afterwards, you can execute: 55 | 56 | ``` 57 | ./hapdiff.py --reference ref.fasta --pat hap_1.fasta --mat hap_2.fasta --out-dir out_path -t 20 58 | ``` 59 | 60 | Acknowledgements 61 | ---------------- 62 | 63 | The major parts of the hapdiff pipeline are: 64 | 65 | * [minimap2](https://github.com/lh3/minimap2) 66 | * [svim-asm](https://github.com/eldariont/svim-asm) 67 | 68 | 69 | Authors 70 | ------- 71 | 72 | The pipeline was originally developed at [Paten lab at UC Santa Cruz](https://ucscgenomics.soe.ucsc.edu/). The work continues at [Kolmogorov lab at NCI](https://ccr.cancer.gov/staff-directory/mikhail-kolmogorov). 73 | 74 | Main code contributors: 75 | * Mikhail Kolmogorov 76 | 77 | 78 | License 79 | ------- 80 | 81 | hapdiff is distributed under the MIT license. See the [LICENSE file](LICENSE) for details. 82 | Other software included in this distribution is released under either MIT or BSD licenses. 83 | 84 | 85 | How to get help 86 | --------------- 87 | The preferred way to report problems or ask questions is the 88 | [issue tracker](https://github.com/KolmogorovLab/hapdiff/issues). 
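For the VCFs listed under "Output files", a quick sanity check is to count calls per SV type. The snippet below is a minimal sketch rather than part of hapdiff: the helper name `count_sv_types` is hypothetical, and it assumes each record's INFO column carries an `SVTYPE` key, which is conventional for SV callers but not stated in this README. It uses only Python's standard `gzip` module, which can read bgzip-compressed files.

```python
import gzip
from collections import Counter


def count_sv_types(vcf_gz_path):
    """Count records per SVTYPE in a gzip- or bgzip-compressed VCF.

    Hypothetical helper for illustration; assumes INFO contains SVTYPE=...
    """
    counts = Counter()
    with gzip.open(vcf_gz_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            sv_type = "UNKNOWN"
            for entry in fields[7].split(";"):  # column 8 is INFO
                if entry.startswith("SVTYPE="):
                    sv_type = entry[len("SVTYPE="):]
                    break
            counts[sv_type] += 1
    return counts
```

Calling `count_sv_types("hapdiff/hapdiff_phased.vcf.gz")` returns a `Counter` keyed by SV type. Since both output files describe the same SVs, the counts for the phased and unphased files would be expected to match.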
89 | 90 | 91 | -------------------------------------------------------------------------------- /hapdiff.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from threading import Thread 4 | import sys 5 | import subprocess 6 | import os 7 | import argparse 8 | from distutils import spawn 9 | 10 | from Bio import SeqIO 11 | from Bio.SeqIO import SeqRecord 12 | 13 | 14 | pipeline_dir = os.path.dirname(os.path.realpath(__file__)) 15 | MINIMAP2 = os.path.join(pipeline_dir, "submodules", "minimap2", "minimap2") 16 | SAMTOOLS = "samtools" 17 | BEDTOOLS = "bedtools" 18 | 19 | VERSION = "0.9" 20 | 21 | sys.path.insert(0, os.path.join(pipeline_dir, "submodules", "svim-asm", "src")) 22 | import svim_asm.main as svim 23 | 24 | 25 | def file_check(filename): 26 | if not os.path.isfile(filename) or os.path.getsize(filename) == 0: 27 | raise Exception("File not found, or has zero length:", filename) 28 | 29 | 30 | def generate_alignment(ref_path, asm_path, num_threads, out_bam): 31 | cmd = MINIMAP2 + " -ax asm20 -B 2 -E 3,1 -O 6,100 --cs -t {0} {1} {2} -K 5G | samtools sort -m 4G -@ 8 >{3}" \ 32 | .format(num_threads, ref_path, asm_path, out_bam) 33 | print("Running: " + cmd) 34 | subprocess.check_call(cmd, shell=True) 35 | subprocess.check_call("samtools index -@ 4 {0}".format(out_bam), shell=True) 36 | 37 | 38 | def fragment(input_fasta, output_fasta, frag_size): 39 | with open(output_fasta, "w") as fout: 40 | for seq in SeqIO.parse(input_fasta, "fasta"): 41 | if len(seq.seq) < frag_size: 42 | SeqIO.write(seq, fout, "fasta") 43 | else: 44 | for chunk in range(0, (len(seq.seq) + frag_size - 1) // frag_size): # ceiling division: no empty trailing chunk 45 | chunk_seq = seq.seq[chunk * frag_size : (chunk + 1) * frag_size] 46 | chunk_id = str(seq.id) + "_chunk_" + str(chunk) 47 | SeqIO.write(SeqRecord(seq=chunk_seq, id=chunk_id, description=""), fout, "fasta") 48 | 49 | 50 | def main(): 51 | if sys.version_info < (3,): 52 | raise SystemExit("Requires Python 3") 53 | 54 | 
parser = argparse.ArgumentParser \ 55 | (description="Call structural variants for a diploid assembly") 56 | 57 | parser.add_argument("--reference", dest="reference", 58 | metavar="path", required=True, 59 | help="path to reference file (fasta format)") 60 | parser.add_argument("--pat", dest="hap_pat", required=True, metavar="path", 61 | help="path to paternal haplotype (in fasta format)") 62 | parser.add_argument("--mat", dest="hap_mat", required=True, metavar="path", 63 | help="path to maternal haplotype (in fasta format)") 64 | parser.add_argument("--out-dir", dest="out_dir", 65 | default=None, required=True, 66 | metavar="path", help="Output directory") 67 | parser.add_argument("--tandem-repeats", dest="tandem_repeats", 68 | default=None, required=False, 69 | metavar="path", help="Tandem repeat annotation in bed format") 70 | parser.add_argument("--sample", dest="sample", 71 | default="Sample", required=False, 72 | help="Sample ID [default=Sample]") 73 | parser.add_argument("--sv-size", dest="sv_size", type=int, 74 | default=30, metavar="int", help="minimum SV size [30]") 75 | parser.add_argument("--fragment", dest="fragment", type=int, 76 | default=None, metavar="int", help="fragment query to X Mb to reduce minimap2 memory footprint [None]") 77 | #parser.add_argument("--phased", dest="phased", action="store_true", 78 | # default=False, help="produce phased vcf") 79 | parser.add_argument("-t", "--threads", dest="threads", type=int, 80 | default=10, metavar="int", help="number of parallel threads [10]") 81 | parser.add_argument("-v", "--version", action="version", version=VERSION) 82 | args = parser.parse_args() 83 | 84 | for e in [MINIMAP2, SAMTOOLS, BEDTOOLS]: 85 | if not spawn.find_executable(e): 86 | print("Not installed: " + e, file=sys.stderr) 87 | return 1 88 | 89 | if not os.path.isdir(args.out_dir): 90 | os.mkdir(args.out_dir) 91 | 92 | file_check(args.reference) 93 | file_check(args.hap_pat) 94 | file_check(args.hap_mat) 95 | 96 | prefix = "hapdiff" 97 
| aln_1 = os.path.join(args.out_dir, prefix + "_pat" + ".bam") 98 | aln_2 = os.path.join(args.out_dir, prefix + "_mat" + ".bam") 99 | 100 | fragmented_pat = args.hap_pat 101 | fragmented_mat = args.hap_mat 102 | if args.fragment is not None: 103 | fragmented_pat = os.path.join(args.out_dir, "fragmented_pat.fasta") 104 | fragment(args.hap_pat, fragmented_pat, args.fragment * 1000000) 105 | fragmented_mat = os.path.join(args.out_dir, "fragmented_mat.fasta") 106 | fragment(args.hap_mat, fragmented_mat, args.fragment * 1000000) 107 | 108 | generate_alignment(args.reference, fragmented_pat, args.threads, aln_1) 109 | file_check(aln_1) 110 | generate_alignment(args.reference, fragmented_mat, args.threads, aln_2) 111 | file_check(aln_2) 112 | 113 | def run_svim(out_file, phased): 114 | svim_cmd = ["diploid", args.out_dir, aln_1, aln_2, args.reference, "--min_sv_size", str(args.sv_size), 115 | "--partition_max_distance", "5000", "--max_edit_distance", "0.3", 116 | "--filter_contained", "--query_names", "--sample", args.sample] 117 | if phased: 118 | svim_cmd.append("--phased_gt") 119 | if args.tandem_repeats: 120 | svim_cmd.extend(["--tandem", args.tandem_repeats]) 121 | svim.main(svim_cmd) 122 | 123 | SVIM_OUTPUT = os.path.join(args.out_dir, "variants.vcf") 124 | SV_LENGTHS = os.path.join(args.out_dir, "sv-lengths.png") 125 | out_with_prefix = os.path.join(args.out_dir, out_file) 126 | os.rename(SVIM_OUTPUT, out_with_prefix) 127 | 128 | subprocess.check_call(["bgzip", "-f", out_with_prefix]) 129 | subprocess.check_call(["tabix", "-f", out_with_prefix + ".gz"]) 130 | if os.path.isfile(SV_LENGTHS): 131 | os.remove(SV_LENGTHS) 132 | 133 | run_svim("hapdiff_unphased.vcf", False) 134 | run_svim("hapdiff_phased.vcf", True) 135 | 136 | conf_pat = os.path.join(args.out_dir, "aln_coverage_pat.bed") 137 | conf_mat = os.path.join(args.out_dir, "aln_coverage_mat.bed") 138 | merged_bed = os.path.join(args.out_dir, "confident_regions.bed") 139 | 140 | with open(merged_bed, "w") as 
fout: 141 | fout.write("#SAMPLE:{0}\n".format(args.sample)) 142 | fout.flush() 143 | bedtools_cmd = [BEDTOOLS, "intersect", "-a", conf_pat, "-b", conf_mat, "-sortout", "|", "uniq"] 144 | subprocess.check_call(" ".join(bedtools_cmd), shell=True, stdout=fout) 145 | 146 | return 0 147 | 148 | 149 | if __name__ == "__main__": 150 | sys.exit(main()) 151 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pysam 2 | scipy 3 | edlib 4 | matplotlib 5 | biopython 6 | --------------------------------------------------------------------------------