├── .gitignore
├── .gitmodules
├── .travis.yml
├── LICENSE
├── README.md
├── README_NIM.md
├── SVNibbler.png
├── makefile
├── multimedia
│   ├── nibSV.jpg
│   └── nibSV_presentation.pptx
├── nib.nimble
├── nim.cfg
├── src
│   ├── nibpkg
│   │   ├── captain.nim
│   │   ├── classify.nim
│   │   ├── compose.nim
│   │   ├── kmers.nim
│   │   ├── read.nim
│   │   ├── refmers.nim
│   │   ├── reporter.nim
│   │   ├── svidx.nim
│   │   ├── util.nim
│   │   └── welcome.nim
│   └── nibsv.nim
├── test-data
│   ├── GIAB-chr22.vcf
│   ├── GIAB_PBSV_TRIO_CALLS.vcf
│   ├── GIAB_PBSV_TRIO_CALLS_TEST2.vcf
│   ├── GIAB_PBSV_TRIO_CALLS_TEST2_regions.bed
│   ├── README.md
│   ├── event_four.bam
│   ├── event_four.bam.bai
│   ├── event_one.bam
│   ├── event_one.bam.bai
│   ├── event_three.bam
│   ├── event_three.bam.bai
│   ├── event_two.bam
│   └── event_two.bam.bai
├── tests
│   ├── .gitignore
│   ├── all.nim
│   ├── config.nims
│   ├── foo.fasta
│   ├── foo.fasta.fai
│   ├── makefile
│   ├── nim.cfg
│   ├── t_composer.nim
│   ├── t_kmers.nim
│   ├── t_read.nim
│   ├── t_refmers.nim
│   ├── t_svidx.nim
│   ├── t_util.nim
│   └── t_welcome.nim
└── vendor
    └── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | /nimbleDir
2 | /nibsv
3 | /src/nibsv
4 | .DS_Store
5 | *.dSYM
6 | *.msgpck
7 | tests/test_read
8 | tests/test_svidx
9 |
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "vendor/STRling"]
2 | path = vendor/STRling
3 | url = https://github.com/quinlan-lab/STRling.git
4 | ignore = dirty
5 | [submodule "vendor/threadpools"]
6 | path = vendor/threadpools
7 | url = https://github.com/yglukhov/threadpools.git
8 | ignore = dirty
9 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | sudo: required
2 | services:
3 | - docker
4 | before_install:
5 | - docker pull nimlang/nim
6 | script:
7 | - docker run nimlang/nim nim --version
8 | - docker run -v "$(pwd):/project" -w /project nimlang/nim sh -c "make tests"
9 | # - docker run -v "$(pwd):/project" -w /project nimlang/nim sh -c "find src/ -name '*.nim' -type f -exec nim doc {} \;"
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 collaborativebioinformatics
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # DEPRECATED! Moved to: https://github.com/fritzsedlazeck/nibSV
4 |
5 |
6 |
7 | # NibblerSV
8 |
9 | ## Contributors
10 |
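11 | Brent Pederson<sup>1</sup>, Christopher Dunn<sup>2</sup>, Eric Dawson<sup>3</sup>, Fritz Sedlazeck<sup>4</sup>, Peter Xie<sup>5</sup>, and Zev Kronenberg<sup>2</sup>
12 |
13 | <sup>1</sup>University of Utah; <sup>2</sup>PacBio; <sup>3</sup>Nvidia Corporation; <sup>4</sup>Baylor College of Medicine; <sup>5</sup>JBrowse (UC Berkeley)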
14 |
15 | ## Intro statement
16 | Structural variants (SVs) are the largest source of genetic variation within the human population. Long-read DNA sequencing is becoming the preferred method for discovering structural variants. A structural variant can be longer than a short read (<500 bp), so the SV allele is not fully contained in any single read, which makes detection difficult.
17 |
18 | Long-read sequencing has proven superior for identifying structural variants in individuals. Nevertheless, it is important to obtain accurate allele frequencies of these complex alleles across a population in order to rank and identify potentially pathogenic variants. It is therefore important to be able to genotype SV events in large sets of samples that were previously sequenced with short reads (e.g. 1000genomes, Topmed, CCDG, etc.). Two main approaches have recently been shown to achieve this with high accuracy, even for insertions: Paragraph and VG. However, these methods still take hours per sample, and longer depending on the number of SVs to be genotyped along the genome or in regions. Furthermore, and perhaps more crucially, they rely on precise breakpoints that do not change in other samples. This assumption might be flawed over repetitive regions. In addition, some data sets are mapped to different versions of the reference genome than others (e.g. hg19 vs. GRCh38 vs. CHM13) and require a different VCF catalog to be genotyped.
19 |
20 | # Why NibblerSV
21 | NibblerSV can overcome these challenges. NibblerSV relies on a k-mer based strategy to identify SV breakpoints in short-read data sets. Due to its innovative k-mer design and efficient implementation, NibblerSV is able to run on a 30x CRAM file within minutes with low memory requirements. Its strategy of spaced k-mers relaxes the constraint on the precision of the breakpoint. In addition, because it uses k-mers, NibblerSV is independent of the genomic reference the short reads were aligned to and can even work on raw FASTQ reads. This makes NibblerSV a lightweight, scalable and easy-to-apply method to identify the frequency of structural variants.
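
As a rough illustration of why spaced k-mers tolerate imprecise breakpoints, the sketch below joins two short k-mers separated by a gap into a single 64-bit seed, so bases inside the gap do not influence the seed. It is a simplified, forward-strand-only sketch: `encode2bit` and `spacedSeed` are hypothetical helper names chosen for this example, not part of the NibblerSV API.

```nim
# Hypothetical helpers for illustration; not the NibblerSV implementation.

proc encode2bit(s: string): uint64 =
  ## Pack a DNA string into 2 bits per base (A=0, C=1, G=2, T=3).
  for ch in s:
    let code = case ch
      of 'A', 'a': 0'u64
      of 'C', 'c': 1'u64
      of 'G', 'g': 2'u64
      of 'T', 't': 3'u64
      else: raise newException(ValueError, "non-ACGT base: " & $ch)
    result = (result shl 2) or code

proc spacedSeed(s: string; pos, k, space: int): uint64 =
  ## Join the k-mer starting at `pos` with the k-mer that starts `space`
  ## bases after the first one ends. Bases inside the gap are ignored.
  doAssert k <= 16, "two k-mers must fit into one 64-bit word"
  let left = encode2bit(s[pos ..< pos + k])
  let right = encode2bit(s[pos + k + space ..< pos + 2 * k + space])
  result = (left shl (2 * k)) or right

when isMainModule:
  # The two sequences differ only inside the 4-base gap.
  let a = "ACGTTTTTACGT"
  let b = "ACGTAAAAACGT"
  echo spacedSeed(a, 0, 4, 4) == spacedSeed(b, 0, 4, 4)
```

Because the gap is skipped, a seed placed across a breakpoint stays the same even if the exact breakpoint position shifts by a few bases within the gap.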
22 |
23 |
24 | Who doesn't like to nibble on SV?
25 | # How does it work?
26 | NibblerSV is a lightweight framework to identify the presence or absence of structural variants across a large set of Illumina-sequenced samples. To achieve this, we take a VCF file including all the SVs that should be genotyped. Next, we extract the k-mers of the reference and alternative alleles, including the flanking regions. Subsequently, we count the occurrences of these k-mers in the reference FASTA file. This is necessary to avoid miscounting k-mers that also occur elsewhere in the reference. To let NibblerSV scale to many samples, the results of these two steps are written into a temporary file, which is all that is needed for the actual genotyping step.
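
The allele construction described above can be sketched as follows. This is a minimal illustration that assumes the flanks and the deleted or inserted bases are already available as plain strings; `composeDeletion`, `composeInsertion` and `kmersOf` are hypothetical names for this example, and real k-mers would be 2-bit encoded rather than kept as strings.

```nim
import std/sets

type Allele = tuple[refSeq, altSeq: string]

proc composeDeletion(leftFlank, deleted, rightFlank: string): Allele =
  ## REF keeps the deleted bases; ALT joins the two flanks directly.
  (refSeq: leftFlank & deleted & rightFlank, altSeq: leftFlank & rightFlank)

proc composeInsertion(leftFlank, inserted, rightFlank: string): Allele =
  ## REF joins the flanks directly; ALT carries the inserted bases.
  (refSeq: leftFlank & rightFlank, altSeq: leftFlank & inserted & rightFlank)

proc kmersOf(s: string; k: int): HashSet[string] =
  ## All distinct k-mers of `s` (plain strings, for readability only).
  result = initHashSet[string]()
  for i in 0 .. s.len - k:
    result.incl s[i ..< i + k]

when isMainModule:
  let del = composeDeletion("AAAC", "GGG", "TTTA")
  # K-mers that occur in the ALT allele but not in the REF allele
  # are the ones that span the deletion breakpoint.
  let altOnly = kmersOf(del.altSeq, 4) - kmersOf(del.refSeq, 4)
  echo del.refSeq, " -> ", del.altSeq
  echo altOnly.len, " k-mers are specific to the deletion allele"
```

Only the allele-specific k-mers are informative for genotyping, which is why k-mers that also appear in the reference need to be counted and filtered.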
27 |
28 | During the genotyping step, NibblerSV uses the small temporary file and the BAM/CRAM file of the sample. NibblerSV then identifies the presence or absence of the reference and alternative k-mers across the entire sample. This is very fast and requires only minimal memory, as the number of k-mers is limited. Once NibblerSV has finished scanning the BAM/CRAM file, it reports which SVs have been re-identified by adding a tag to the output VCF file of this sample. The per-sample VCFs can then be merged to obtain population frequencies.
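
The scanning step can be pictured with the sketch below. It assumes, for simplicity, that each SV k-mer maps to exactly one SV id and that reads are plain strings on the forward strand; `countSupport` is a hypothetical name for this example and does not mirror the real index structure.

```nim
import std/[sets, tables]

proc countSupport(reads: openArray[string]; svKmers: Table[string, int];
                  k: int): CountTable[int] =
  ## Count, per SV id, how many reads contain at least one of its k-mers.
  result = initCountTable[int]()
  for read in reads:
    var hits = initHashSet[int]()
    for i in 0 .. read.len - k:
      let kmer = read[i ..< i + k]
      if kmer in svKmers:
        hits.incl(svKmers[kmer])
    # Each read supports a given SV at most once.
    for sv in hits:
      result.inc(sv)

when isMainModule:
  # Toy index: three breakpoint-spanning k-mers, all belonging to SV 0.
  let svKmers = {"AACT": 0, "ACTT": 0, "CTTT": 0}.toTable
  let reads = ["GAAACTTTAG", "AAACGGGTTT"]
  let counts = countSupport(reads, svKmers, 4)
  echo "reads supporting SV 0: ", counts[0]
```

Memory stays small in this scheme because only the SV k-mer table is held in memory while reads are streamed one at a time.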
29 |
30 | 
31 |
32 | # How to use
33 |
34 | To run NibblerSV, execute this example, which uses the provided test data. You need a copy of GRCh38 to run it.
35 | ```
36 | ./src/nibsv main -v test-data/GIAB_PBSV_TRIO_CALLS_TEST2.vcf -r hg38.fa.gz --reads-fn test-data/event_one.bam -p HG02
37 | ```
38 |
39 | Full usage:
40 | ```
41 | (base) ZKRONENBERG-MAC:nibSV zkronenberg$ ./src/nibsv main -h
42 | Usage:
43 | main [required&optional-params]
44 | Generate a SV kmer database, and genotype new samples. If a file called "{prefix}.sv_kmers.msgpack" exists, use it. Otherwise,
45 | generate it.
46 | Options:
47 | -h, --help print this cligen-erated help
48 | --help-syntax advanced: prepend,plurals,..
49 | -v=, --variants-fn= string REQUIRED long read VCF SV calls
50 | -r=, --refSeq-fn= string REQUIRED reference genome FASTA, compressed OK
51 | --reads-fn= string REQUIRED input short-reads in BAM/SAM/CRAM/FASTQ
52 | -p=, --prefix= string "test" output prefix
53 | -k=, --kmer-size= int 25 kmer size, for spaced seeds use <=16 otherwise <=32
54 | -s, --spaced-seeds bool false turn on spaced seeds
55 | --space= int 50 width between spaced kmers
56 | -f=, --flank= int 100 number of bases on either side of ALT/REF in VCF records
57 | -m=, --max-ref-kmer-count= uint32 0 max number of reference kmers allowed in SV event
58 | ```
59 |
60 |
61 | # Quickstart
62 | ## Input
63 | 1. A structural variant VCF
64 | 2. An indexed FASTA file of the reference genome
65 | 3. A BAM/CRAM file (new genome)
66 |
67 | ## Output
68 | A VCF file with a tag in the INFO field identifying the presence/absence of each SV.
69 |
70 | # Testing
71 | We have tested NibblerSV on HG002 from GIAB and various other control data sets.
72 |
73 | # Installation
74 |
75 | ## Install Nim
76 | * https://nim-lang.org/install.html
77 |
78 | See also [README_NIM.md](README_NIM.md)
79 |
80 | ## Install htslib
81 | This needs to be available as a dynamically loadable library
82 | on your system.
83 |
84 | * http://www.htslib.org/download/
85 |
86 | ## Setup and build
87 | ```sh
88 | make setup
89 | make build
90 |
91 | # Or for faster executable
92 | make release
93 | ```
94 |
--------------------------------------------------------------------------------
/README_NIM.md:
--------------------------------------------------------------------------------
1 | ### Installing Nim
2 |
3 | * https://nim-lang.org/install.html
4 |
5 | Then, if you want to control which version you have:
6 |
7 | ```
8 | nimble install nimble
9 | export PATH=~/.nimble/bin:$PATH
10 | choosenim stable
11 | ```
12 |
--------------------------------------------------------------------------------
/SVNibbler.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/collaborativebioinformatics/nibSV/3fec3eb8fda8ee1c21879287f4c5660dd3961998/SVNibbler.png
--------------------------------------------------------------------------------
/makefile:
--------------------------------------------------------------------------------
1 | #NIMBLE_DIR?=${CURDIR}/nimbleDir
2 | #export NIMBLE_DIR
3 | # Alternatively, use --nimbleDir:${NIMBLE_DIR} everywhere
4 | UNAME=$(shell uname)
5 | ifeq (${UNAME},Darwin)
6 | install=install_name_tool -add_rpath /opt/local/lib
7 | else
8 | install=echo
9 | endif
10 |
11 | build:
12 | 	nim c src/nibsv.nim
13 | 	${install} src/nibsv
14 | release:
15 | 	nim c -d:release -d:danger src/nibsv.nim
16 | all:
17 | 	${MAKE} install
18 | quick:
19 | 	nim c -r tests/t_kmers.nim
20 | 	nim c -r tests/t_util.nim
21 | help:
22 | 	nimble -h
23 | 	nimble tasks
24 | tests:
25 | 	@# much faster than nimble
26 | 	${MAKE} -C tests
27 | test:
28 | 	nimble test # uses "tests/" directory by default
29 | integ-test:
30 | 	@echo 'integ-test TBD'
31 | install:
32 | 	nimble install -y
33 | pretty:
34 | 	find src -name '*.nim' | xargs -L1 nimpretty --maxLineLen=1024
35 | 	find tests -name '*.nim' | xargs -L1 nimpretty --maxLineLen=1024
36 | setup: #vendor/threadpools vendor/STRling
37 | 	nimble install --verbose -y hts kmer bitvector cligen msgpack4nim
38 | 	#cd vendor/threadpools; nimble install --verbose -y
39 | 	#cd vendor/STRling; nimble install --verbose -y
40 | vendor/threadpools vendor/STRling:
41 | 	git submodule update --init
42 | rsync: # not used for now
43 | 	mkdir -p ${NIMBLE_DIR}/pkgs/
44 | 	rsync -av vendor/STRling/ ${NIMBLE_DIR}/pkgs/strling-0.3.0/
45 | 	rsync -av vendor/threadpools/ ${NIMBLE_DIR}/pkgs/threadpools-0.1.0/
46 |
47 | .PHONY: test tests
48 |
--------------------------------------------------------------------------------
/multimedia/nibSV.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/collaborativebioinformatics/nibSV/3fec3eb8fda8ee1c21879287f4c5660dd3961998/multimedia/nibSV.jpg
--------------------------------------------------------------------------------
/multimedia/nibSV_presentation.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/collaborativebioinformatics/nibSV/3fec3eb8fda8ee1c21879287f4c5660dd3961998/multimedia/nibSV_presentation.pptx
--------------------------------------------------------------------------------
/nib.nimble:
--------------------------------------------------------------------------------
1 | # Package
2 |
3 | version = "0.2.0"
4 | author = "Zev Kronenberg"
5 | author = "Christopher Dunn"
6 | author = "(Add your name here)"
7 | description = "Structural Variant nibbles"
8 | license = "MIT"
9 | srcDir = "src"
10 | installDirs = @["nibpkg"]
11 | bin = @["nibsv"]
12 |
13 |
14 | # Dependencies
15 |
16 | requires "nim >= 1.2.0", "hts", "kmer", "bitvector >= 0.4.10", "cligen", "msgpack4nim"
17 |
--------------------------------------------------------------------------------
/nim.cfg:
--------------------------------------------------------------------------------
1 | --hint[Conf]:off
2 | --hint[XDeclaredButNotUsed]:off
3 | --hint[Processing]:off
4 | --hint[Name]:off
5 | #--warning[UnusedImport]:off
6 |
--------------------------------------------------------------------------------
/src/nibpkg/captain.nim:
--------------------------------------------------------------------------------
1 | # vim: sw=4 ts=4 sts=4 tw=0 et:
2 | import refmers
3 | import svidx
4 | import strformat
5 | import classify
6 | import reporter
7 | #from ./read import `$`
8 | from os import nil
9 | from tables import len
10 |
11 | proc main_runner*(variants_fn, refSeq_fn, reads_fn: string, prefix = "test", kmer_size: int = 25, spaced_seeds : bool = false, space: int = 0, flank: int = 100, maxRefKmerCount : uint32 = 1 ) =
12 |     ## Generate a SV kmer database, and genotype new samples.
13 |     ## If a file called "{prefix}.sv_kmers.msgpck" exists, use it.
14 |     ## Otherwise, generate it.
15 |     var index_fn = "{prefix}.sv_kmers.msgpck".fmt
16 |     var idx: SvIndex
17 |
18 |     if not os.fileExists(index_fn):
19 |         echo "building an SV kmer DB."
20 |         let sp = if spaced_seeds:
21 |                      space
22 |                  else:
23 |                      0
24 |         idx = buildSvIndex(refSeq_fn, variants_fn, flank, kmer_size, sp)
25 |         echo "updating reference kmer counts."
26 |         updateSvIndex(refSeq_fn, idx, kmer_size, 1000000, sp)
27 |         echo "dumpIndexToFile:'", index_fn, "'"
28 |         dumpIndexToFile(idx, index_fn)
29 |     else:
30 |         echo "loadIndexFromFile:'", index_fn, "'"
31 |         idx = loadIndexFromFile(index_fn, kmer_size)
32 |
33 |     echo "final idx contains: {idx.len} forward and reverse SV kmers.".fmt
34 |
35 |
36 |     filterRefKmers(idx, maxRefKmerCount)
37 |
38 |
39 |     #echo dumpIndexToJson(idx)
40 |
41 |
42 |     let classifyCount = classify_file(reads_fn, idx, kmer_size, spaced_seeds, space)
43 |
44 |     #echo "classifyCount:"
45 |     #echo classifyCount
46 |
47 |
48 |     echo "reporting variants."
49 |
50 |     report(variants_fn, classifyCount, idx, prefix)
51 |
52 |     echo "nibbleSV finished without problems, goodbye!"
53 |
54 |
55 | when isMainModule:
56 |     import cligen
57 |     dispatch(main_runner)
58 |
--------------------------------------------------------------------------------
/src/nibpkg/classify.nim:
--------------------------------------------------------------------------------
1 | import strutils
2 | import tables
3 | import hts
4 | import ./read
5 | import ./svidx
6 | from ./compose import nil
7 |
8 | proc buildSvIndex*(reference_path: string, vcf_path: string, flank: int = 100, k: int = 25, space: int = 0): SvIndex =
9 |   ## Open FASTA index
10 |   var fai: Fai
11 |   doAssert fai.open(reference_path), "Failed to open FASTA file: " & reference_path
12 |
13 |   var variants: VCF
14 |   doAssert(open(variants, vcf_path))
15 |
16 |   result.kmerSize = k.uint8
17 |
18 |   var sv_idx = 0
19 |   echo "flank:", flank
20 |   for v in variants:
21 |     let sv_chrom = $v.CHROM
22 |
23 |     let flanks = compose.retrieve_flanking_sequences_from_fai(fai, sv_chrom, v.start.int, v.stop.int, flank)
24 |     var p = compose.composePositioned(v, flanks.left, flanks.right, k, space)
25 |
26 |     result.insert(p.sequences.alt_seq, k, sv_idx)
27 |     # The insert function allows us to add to the ref count, but refmer also
28 |     # adds the same counts, so for now i'm commenting this out to minimize the
29 |     # lines of code we are debugging. --Zev
30 |     # result.insert(p.sequences.ref_seq, k, -1)
31 |
32 |     sv_idx.inc
33 |
34 | proc classify_bam(filename: string, idx: SvIndex, k: int = 25, spacedSeeds: bool = false, space: int = 50, threads: int = 2): CountTableRef[uint32] =
35 |   new(result)
36 |
37 |   var bamfile: Bam
38 |   open(bamfile, filename, index = false, threads=threads)
39 |   var sequence: string
40 |
41 |   for record in bamfile:
42 |     # NOTE: we may also want to filter record.flag.dup in the future, but
43 |     # that will make results differ between bam and fastq
44 |     if record.flag.secondary or record.flag.supplementary: continue
45 |     record.sequence(sequence)
46 |
47 |     var read_classification = process_read(sequence, idx, k, spacedSeeds, space)
48 |
49 |     #if read_classification.compatible_SVs.len != 0:
50 |     #  echo read_classification
51 |
52 |     filter_read_matches(read_classification, winner_takes_all=false)
53 |     for svId, count in read_classification.compatible_SVs:
54 |       result.inc(svId)
55 |
56 |   #echo result
57 |
58 |
59 | proc classify_file*(filename: string, idx: SvIndex, k: int = 25, spacedSeeds: bool = false, space: int = 50): CountTableRef[uint32] =
60 |   if endsWith(filename, ".bam"):
61 |     return classify_bam(filename, idx, k, spacedSeeds, space)
62 |   else:
63 |     quit("Error: only BAM input currently supported.")
64 |
65 | proc main_classify*(read_file: string, vcf_file: string, ref_file: string, k: int = 25, flank: int = 100) =
66 |   var idx: SvIndex = buildSvIndex(ref_file, vcf_file, flank, k)
67 |   var svCounts: CountTableRef[uint32] = classify_file(read_file, idx, k)
68 |
--------------------------------------------------------------------------------
/src/nibpkg/compose.nim:
--------------------------------------------------------------------------------
1 | # vim: sw=2 ts=2 sts=2 tw=0 et ft=python:
2 | import hts
3 | import kmers
4 |
5 | type
6 |   FlankSeq* = object
7 |     left*, right*: string
8 |
9 | type
10 |   PositionedSequence* = object
11 |     sequences*: tuple[ref_seq: string, alt_seq: string]
12 |     kmers: tuple[ref_kmers: seq[seed_t], alt_kmers: seq[seed_t]]
13 |     chrom: string
14 |     position: int32
15 |
16 | proc retrieve_flanking_sequences_from_fai*(fastaIdx: Fai, chrom: string,
17 |     start_pos: int, end_pos: int, flank: int): FlankSeq =
18 |   ## Fetch up to `flank` bases on each side of the variant; the value is returned via the implicit `result`.
19 |   result.left = fastaIdx.get(chrom, max(0, start_pos - flank), start_pos)
20 |   result.right = fastaIdx.get(chrom, end_pos, end_pos + flank)
21 |
22 | proc kmerize(s: string, k: int = 25, space: int = 0): seq[seed_t] =
23 |   var kmers = Dna(s).dna_to_kmers(k)
24 |   if space > 0:
25 |     kmers = spacing_kmer(kmers, space)
26 |   return kmers.seeds
27 |
28 | proc composePositioned*(variant: Variant, left_flank: string,
29 |     right_flank: string, k: int = 25 ; space: int = 0): PositionedSequence =
30 |   ## Takes in a VCF variant, the 5' and 3' reference flanking sequences,
31 |   ## and a kmer size. Produces a PositionedSequence, which holds the ref/alt
32 |   ## sequences as well as the kmers of those sequences (in addition to
33 |   ## minimal position information)
34 |   var variant_type: string
35 |   doAssert variant.info.get("SVTYPE", variant_type) == Status.OK
36 |   if variant_type == "DEL":
37 |     var deleted_bases: string = $variant.REF ## NOTE: REF is used as-is; its leading reference base is not chopped here.
38 |     result.sequences.ref_seq = left_flank & deleted_bases & right_flank
39 |     result.sequences.alt_seq = left_flank & right_flank
40 |     if k > 0:
41 |       result.kmers.ref_kmers = kmerize(result.sequences.ref_seq, k, space)
42 |       result.kmers.alt_kmers = kmerize(result.sequences.alt_seq, k, space)
43 |   elif variant_type == "INS":
44 |     # the first base in the alt string is ref (silly VCF format). ^1 prevents going off the end of the seq (which ^0 did)
45 |     var inserted_seq: string = variant.ALT[0][1 .. ^1] ## Chop the reference base prefix in the ALT allele.
46 |     result.sequences.ref_seq = left_flank & right_flank
47 |     result.sequences.alt_seq = left_flank & inserted_seq & right_flank
48 |     if k > 0:
49 |       result.kmers.ref_kmers = kmerize(result.sequences.ref_seq, k, space)
50 |       result.kmers.alt_kmers = kmerize(result.sequences.alt_seq, k, space)
51 |   elif variant_type == "INV":
52 |     return
53 |     #raise newException(ValueError,
54 |     #"Error: Inversion processing not implemented.")
55 |
56 |   result.position = int32(variant.start) - int32(len(left_flank))
57 |   result.chrom = $variant.CHROM
58 |
59 |
60 | proc compose_variants*(variant_file: string, reference_file: string; k: int = 31, space: int = 0): seq[
61 |     PositionedSequence] =
62 |   ## function to compose variants from their sequence / FASTA flanking regions
63 |   ## Returns a Sequence of strings representing the DNA sequence of the flanking
64 |   ## regions and variant sequence.
65 |
66 |   var composed_seqs = newSeq[PositionedSequence]()
67 |
68 |   ## Open FASTA index
69 |   var fai: Fai
70 |   if not fai.open(reference_file):
71 |     quit ("Failed to open FASTA file: " & reference_file)
72 |
73 |   var variants: VCF
74 |   doAssert(open(variants, variant_file))
75 |
76 |
77 |   for v in variants:
78 |     var variant_type: string
79 |     if v.info.get("SVTYPE", variant_type) != Status.OK:
80 |       continue
81 |     let sv_chrom = $v.CHROM
82 |     ## Retrieve flanks, either from FAI or string cache
83 |     let flanks = retrieve_flanking_sequences_from_fai(fai, sv_chrom, int(
84 |         v.start), int(v.stop), 100)
85 |     ## Generate a single sequence from variant seq + flank,
86 |     ## taking into account the variant type.
87 |     var variant_seq = composePositioned(v, flanks.left, flanks.right, k, space)
88 |     composed_seqs.add(variant_seq)
89 |
90 |   return composed_seqs
91 |
92 | when isMainModule:
93 |   import cligen
94 |   dispatch(compose_variants)
--------------------------------------------------------------------------------
/src/nibpkg/kmers.nim:
--------------------------------------------------------------------------------
1 | # vim: sw=4 ts=4 sts=4 tw=0 et:
2 | import deques
3 | import tables
4 | #from sets import nil
5 | from algorithm import sort
6 | from hashes import nil
7 | from strutils import format
8 | from ./util import raiseEx, PbError
9 |
10 | export PbError
11 |
12 | type
13 |     Dna* = string # someday, this might be an array
14 |     Bin* = uint64 # compact bitvector of DNA
15 |     ## In bitvector, A is 0, C is 1, G is 2, and T is 3.
16 |
17 |     Min* = uint64 # minimizer
18 |     Strand* = enum
19 |         forward, reverse
20 |
21 |     ## kmer - a uint64 supporting a maximum of 32 DNA bases.
22 |     ## pos - position along the sequence
23 |     seed_t* = object
24 |         kmer*: Bin
25 |         pos*: uint32
26 |         strand*: Strand
27 |
28 |     minimizer_t* = object
29 |         minimizer*: Min
30 |         pos*: uint32
31 |         strand*: Strand
32 |
33 |     ## a & b are two seed_t's designed for matching in the hash lookup
34 |     seed_pair_t* = object
35 |         a*: seed_t
36 |         b*: seed_t
37 |
38 |     Hash* = int
39 |
40 |     ## seeds - a pointer to the kmers
41 |     pot_t* = ref object of RootObj
42 |         word_size*: uint8 # <=32
43 |         seeds*: seq[seed_t]
44 |
45 |     ## searchable seed-pot
46 |     spot_t* = ref object of pot_t
47 |         ht*: tables.TableRef[Bin, int]
48 |
49 | var seq_nt4_table: array[256, int] = [
50 | 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
51 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
52 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
53 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
54 | 4, 0, 4, 1, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4,
55 | 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
56 | 4, 0, 4, 1, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4,
57 | 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
58 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
59 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
60 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
61 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
62 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
63 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
64 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
65 | 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
66 |
67 |
68 | ## @return uninitialized
69 | #
70 | proc newDna(size: int): Dna =
71 |     return newString(size)
72 |
73 | # hashes for sets and tables
74 |
75 | proc hash*(s: kmers.seed_t): hashes.Hash =
76 |     #hashes.hash(s.pos) + hashes.hash(s.kmer shl 8) + hashes.hash(s.strand)
77 |     hashes.hash([s.pos.int64, s.kmer.int64, s.strand.int64])
78 |
79 | proc hash*(p: kmers.seed_pair_t): hashes.Hash =
80 |     hashes.hash([hash(p.a), hash(p.b)])
81 |
82 | # convenience for C coders
83 |
84 | template `<<`(a, b: uint64): uint64 =
85 |     a shl b
86 |
87 | template `>>`(a, b: uint64): uint64 =
88 |     a shr b
89 |
90 | ## Return binary version of kmer exactly matching Dna.
91 | ## Mostly for testing, as this is not efficient for sliding windows.
92 | ## NOT FINISHED.
93 | #
94 | proc encode(sq: Dna) =
95 |     assert sq.len() <= 32
96 |     let k = sq.len()
97 |     let
98 |         shift1: uint64 = 2'u64 * (k - 1).uint64
99 |         mask: uint64 = (1'u64 << (2 * k).uint64) - 1
100 |     var
101 |         forward_bin: Bin = 0
102 |         reverse_bin: Bin = 0
103 |     for i in 0 ..< k:
104 |         let ch = cast[uint8](sq[i])
105 |         let c = seq_nt4_table[ch].uint64
106 |         assert c < 4
107 |         forward_bin = (forward_bin << 2 or c) and mask
108 |         reverse_bin = (reverse_bin >> 2) or (
109 |                 3'u64 xor c) << shift1
110 |
111 | ## Converts a char * into a set of seed_t objects.
112 | ## @param sq - sequence
113 | ## @param k - kmer size (<=32)
114 | ## @return pot
115 | #
116 | proc dna_to_kmers*(sq: Dna; k: int): pot_t =
117 |     if k > 32:
118 |         raiseEx("k > 32")
119 |
120 |     let
121 |         shift1: uint64 = 2'u64 * (k - 1).uint64
122 |         mask: uint64 = (1'u64 << (2 * k).uint64) - 1
123 |     #echo format("shift1=$# mask=$#", shift1, mask)
124 |
125 |     var forward_kmer: seed_t
126 |     var reverse_kmer: seed_t
127 |
128 |     forward_kmer.kmer = 0
129 |     forward_kmer.pos = 0
130 |     reverse_kmer.kmer = 0
131 |     reverse_kmer.pos = 0
132 |     forward_kmer.strand = forward
133 |     reverse_kmer.strand = reverse
134 |
135 |     var kmers: pot_t
136 |     new(kmers)
137 |     kmers.seeds = newSeqOfCap[seed_t](max(sq.len - int(k) + 1,0))
138 |     kmers.word_size = k.uint8
139 |
140 |     ## lk is the length of the kmer being built on the fly. The variable n is the total number of seeds added so far.
141 |     var
142 |         i: int
143 |         lk: int
144 |         n: int
145 |     i = 0
146 |     lk = 0
147 |     n = 0
148 |
149 |     while i < sq.len():
150 |         let ch = cast[uint8](sq[i])
151 |         let c = seq_nt4_table[ch].uint64
152 |         if c < 4:
153 |             forward_kmer.kmer = (forward_kmer.kmer << 2 or c) and mask
154 |             reverse_kmer.kmer = (reverse_kmer.kmer >> 2) or (
155 |                     3'u64 xor c) << shift1
156 |             #echo format("[$#]=$# $#==$#($# $#) f:$# r:$#",
157 |             # i, sq[i], ch, c, (3'u8 xor c), (3'u8 xor c).uint64 shl shift1, forward_kmer.kmer, reverse_kmer.kmer)
158 |             inc(lk)
159 |         else:
160 |             ## advance the window beyond the unknown character
161 |             lk = 0
162 |             inc(i, k)
163 |             inc(forward_kmer.pos, k)
164 |             forward_kmer.kmer = 0
165 |             inc(reverse_kmer.pos, k)
166 |             reverse_kmer.kmer = 0
167 |
168 |         if lk >= k:
169 |             inc(n, 2)
170 |             kmers.seeds.add(forward_kmer)
171 |             kmers.seeds.add(reverse_kmer)
172 |             inc(forward_kmer.pos, 1)
173 |             inc(reverse_kmer.pos, 1)
174 |         inc(i)
175 |
176 |
177 |
178 |
179 |     return kmers
180 |
181 | ## A function to convert the binary DNA back into character
182 | ## @param kmer up to 32 2-bit bases
183 | ## @param k kmer length
184 | ## @param strand If reverse, start at kth bit and go backwards.
185 | ##
186 | ## Zero is A, one is C, G is two, and T is 3
187 | #
188 | proc bin_to_dna*(kmer: Bin; k: uint8; strand: Strand = forward): Dna =
189 |     var lookup: array[4, char] = ['A', 'C', 'G', 'T']
190 |     var mask: uint64 = 3
191 |     var i: uint8 = 0
192 |     var tmp: uint64 = 0
193 |     var offset: uint64 = 0
194 |
195 |     var dna = newDna(k.int)
196 |     i = 0
197 |     while i < k:
198 |         if strand == forward:
199 |             offset = (k - i - 1) * 2
200 |             tmp = kmer >> offset
201 |             dna[i.int] = lookup[mask and tmp]
202 |         else:
203 |             offset = i * 2
204 |             tmp = kmer >> offset
205 |             dna[i.int] = lookup[mask and not tmp]
206 |         inc(i)
207 |
208 |     return dna
209 |
210 | proc nkmers*(pot: pot_t): int =
211 |     return len(pot.seeds)
212 |
213 | ## Prints the pot structure to STDOUT
214 | ## @param pot a ref to the pot
215 | #
216 | proc print_pot*(pot: pot_t) =
217 |     var i: int = 0
218 |
219 |     while i < pot.seeds.len():
220 |         let dna = bin_to_dna(pot.seeds[i].kmer, pot.word_size,
221 |                 pot.seeds[i].strand)
222 |         echo format("pos:$# strand:$# seq:$# bin:$#",
223 |                 pot.seeds[i].pos, pot.seeds[i].strand, dna, pot.seeds[i].kmer)
224 |         inc(i, 1)
225 |
226 | proc get_dnas*(pot: pot_t): seq[Dna] =
227 |     for i in 0 ..< pot.seeds.len():
228 |         let dna = bin_to_dna(pot.seeds[i].kmer, pot.word_size,
229 |                 pot.seeds[i].strand)
230 |         result.add(dna)
231 |
232 | proc cmp_seeds(a, b: seed_t): int =
233 |     let c = a.kmer
234 |     let d = b.kmer
235 |
236 |     if c < d:
237 |         return -1
238 |
239 |     if c == d:
240 |         if a.pos < b.pos:
241 |             return -1
242 |         elif a.pos == b.pos:
243 |             return 0
244 |
245 |     return 1
246 |
247 | # Actual implementation, private.
248 | #
249 | proc make_searchable(seeds: var seq[seed_t]; ht: var tables.TableRef[Bin, int]) =
250 |     seeds.sort(cmp_seeds)
251 |     ht = newTable[Bin, int]()
252 |     #let dups = sets.initHashSet[Bin]()
253 |     var ndups = 0
254 |
255 |     var i: int = 0
256 |     while i < seeds.len():
257 |         let key = seeds[i].kmer
258 |         if ht.hasKeyOrPut(key, i):
259 |             ndups += 1
260 |             #echo format("WARNING: Duplicate seed $# @$#, not re-adding @$#",
261 |             #    key, i, ht[key])
262 |         inc(i)
263 |     #[
264 |     if ndups > 0:
265 |         echo format("WARNING: $# duplicates in kmer table", ndups)
266 |
267 |     ]#
268 |
269 | ## Construct searchable-pot from pot.
270 | ## Move construct seeds (i.e. original is emptied).
271 | ##
272 | ## Sort the seeds and load the kmers into a hash table.
273 | ## For any dups, the table refers to the first seed with that kmer.
274 | #
275 | proc initSpot*(kms: var pot_t): spot_t =
276 |     new(result)
277 |     result.word_size = kms.word_size
278 |     shallowCopy(result.seeds, kms.seeds)
279 |     #kms.seeds = @[]
280 |     kms = nil # simpler, obvious move-construction
281 |     make_searchable(result.seeds, result.ht)
282 |
283 | ## Check for the presence or absence of a kmer in a
284 | ## pot regardless of the position.
285 | ## @param pot_t * - a pointer to a pot_t
286 | ## @return false if kmer doesn't exist
287 | #
288 | proc haskmer*(target: spot_t; query: Bin): bool =
289 |     if target.ht.hasKey(query):
290 |         return true
291 |     return false
292 |
293 | ## Counts the number of shared unique kmers
294 | ## @param pot_t * - a pointer to a pot_t
295 | ## @param pot_t * - a pointer to a pot_t
296 | ## @return int - number of shared kmers
297 | #
298 | proc uniqueShared*(a, b: spot_t): int =
299 |     result = 0
300 |
301 |     for k in a.ht.keys():
302 |         if(haskmer(b, k)):
303 |             inc(result)
304 |
305 | ## Find (target - remove), without altering target.
306 | #
307 | proc difference*(target: pot_t; remove: spot_t): pot_t =
308 |     new(result)
309 |     result.word_size = target.word_size
310 |
311 |     var kmer_stack = newSeq[seed_t]()
312 |
313 |     for i in 0 ..< target.seeds.len():
314 |         if(not haskmer(remove, target.seeds[i].kmer)):
315 |             kmer_stack.add(target.seeds[i])
316 |
317 |     result.seeds = kmer_stack
318 |
319 | ## Return the seeds in the intersection of target and query.
320 | #
321 | proc search*(target: spot_t; query: pot_t): deques.Deque[seed_pair_t] =
322 |     echo format("Searching through $# kmers", query.seeds.len())
323 |     var hit_stack = deques.initDeque[seed_pair_t](128)
324 |     var hit: seed_pair_t
325 |     var hit_index: int
326 |
327 |     var i: int = 0
328 |     #echo format("target.ht=$#", target.ht)
329 |     #echo format("query.ht=$#", query.ht)
330 |     while i < query.seeds.len():
331 |         let key = query.seeds[i].kmer
332 |         if key in target.ht:
333 |             hit_index = target.ht[key]
334 |             #echo format("For $# ($#), ql=$# tl=$#, hit_index=$#", i, key, query.seeds.len(), target.seeds.len(), hit_index)
335 |             while (hit_index < target.seeds.len() and key == target.seeds[
336 |                     hit_index].kmer):
337 |                 #echo format("--For $# ($#), ql=$# tl=$#, hit_index=$#", i, key, query.seeds.len(), target.seeds.len(), hit_index)
338 |                 hit.a = query.seeds[i]
339 |                 hit.b = target.seeds[hit_index]
340 |                 deques.addLast(hit_stack, hit)
341 |                 inc(hit_index, 1)
342 |         inc(i)
343 |
344 |     return hit_stack
345 |
346 | ## Counts the number of unique kmers in a searchable pot.
347 | ## The pot must already be searchable (see initSpot); this proc does not build the table.
348 | ## @param pot - a searchable pot (spot_t)
349 | ## TODO: add test coverage
350 | #
351 | proc nuniq*(pot: spot_t): int =
352 | return len(pot.ht)
353 |
354 | proc spacing_kmer*(pot: pot_t; space: int): pot_t =
355 | #doAssert(space > int(pot.word_size)) # typical, but not necessary
356 | doAssert(pot.word_size <= 16)
357 |
358 | new(result) # the default return variable gets its type from the proc header
359 | result.word_size = pot.word_size*2
360 |
361 | for i in (0 ..< pot.seeds.len - 2*(space + pot.word_size.int)):
362 | let j = i + 2*(space + pot.word_size.int)
363 | assert(j < pot.seeds.len)
364 |
365 | let k1 = pot.seeds[i]
366 | let k2 = pot.seeds[j]
367 | assert(k1.strand == k2.strand)
368 |
369 | # new kmer
370 | var k: seed_t
371 | let (left, right) = if k1.strand == forward:
372 | (k1, k2)
373 | else:
374 | (k2, k1)
375 | k.kmer = left.kmer
376 | k.kmer = k.kmer << 2*pot.word_size
377 | k.kmer = k.kmer or right.kmer
378 | k.strand = left.strand
379 | k.pos = left.pos
380 | result.seeds.add(k)
381 |
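
The spaced-seed construction in spacing_kmer above joins two packed kmers by shifting the left one and OR-ing in the right one. The following is a minimal standalone sketch of that bit-packing idea. The helper names `encode` and `spacedSeed`, and the base encoding A=0, C=1, G=2, T=3, are assumptions made for illustration; they are not taken from kmers.nim.

```nim
# Hypothetical helpers for illustration only; not part of nibpkg.

proc encode(kmer: string): uint64 =
  ## Pack a DNA string into 2 bits per base (assumed: A=0, C=1, G=2, T=3).
  for c in kmer:
    var code: uint64
    case c
    of 'A': code = 0
    of 'C': code = 1
    of 'G': code = 2
    of 'T': code = 3
    else: raise newException(ValueError, "unexpected base: " & $c)
    result = (result shl 2) or code

proc spacedSeed(left, right: uint64; wordSize: int): uint64 =
  ## Concatenate two packed kmers; `left` ends up in the high bits.
  ## Two kmers of up to 16 bases each fit in 64 bits (2 * 16 * 2).
  doAssert wordSize <= 16
  (left shl (2 * wordSize)) or right

# Joining the packed halves gives the same value as packing the joined string.
let seed = spacedSeed(encode("ACGT"), encode("TTGA"), 4)
doAssert seed == encode("ACGTTTGA")
```

This is also why the doAssert on word_size <= 16 matters: the combined seed has twice as many bases as each half and must still fit in one 64-bit word.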
--------------------------------------------------------------------------------
/src/nibpkg/read.nim:
--------------------------------------------------------------------------------
1 | import ./kmers
2 | import tables
3 | export tables
4 | import ./svidx
5 |
6 | type Read* = object
7 | ## key of svid, count of supporting kmers
8 | compatible_SVs*: CountTable[uint32]
9 |
10 | proc process_read*(s: string, idx: SvIndex, k: int = 25, spacedSeeds: bool = false, space: int = 50): Read =
11 | # find SVs with kmers intersecting with those from this read.
12 | var kmers = Dna(s).dna_to_kmers(k)
13 | if(spacedSeeds):
14 | kmers = spacing_kmer(kmers, space)
15 | for kmer in kmers.seeds:
16 | var matching_svs = idx.lookupKmer(kmer)
17 | for svId in matching_svs:
18 | result.compatible_SVs.inc(svId)
19 |
20 |
21 | proc filter_read_matches*(read: var Read, min_matches: int = 2, winner_takes_all: bool = false) =
22 | ## track sv with most kmer matches
23 | var removables: seq[uint32]
24 | var max_sv = int.high
25 | var max_kcnt = 0
26 | for sv, kcnt in read.compatible_SVs:
27 | if kcnt < min_matches:
28 | removables.add(sv)
29 | if kcnt > max_kcnt:
30 | max_sv = sv.int
31 | max_kcnt = kcnt
32 |
33 | if winner_takes_all and max_kcnt > 0:
34 | clear(read.compatible_SVs)
35 | read.compatible_SVs.inc(max_sv.uint32, max_kcnt)
36 | else:
37 | for r in removables:
38 | read.compatible_SVs.del(r)
39 |
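
The per-read bookkeeping in read.nim above amounts to counting kmer hits per SV id and then discarding weakly supported SVs. Below is a small standalone sketch of that pattern using only the standard `tables` module. The proc name `keepSupported` and its input (a flat list of SV ids, one per kmer hit) are assumptions for illustration and do not exist in nibpkg.

```nim
import tables

# Hypothetical helper for illustration only; not part of nibpkg.
proc keepSupported(hits: openArray[uint32]; minMatches: int): CountTable[uint32] =
  ## Count how often each SV id occurs in `hits`, then keep only the
  ## ids seen at least `minMatches` times.
  var counts = initCountTable[uint32]()
  for svId in hits:
    counts.inc(svId)
  # Build a fresh table instead of deleting while iterating.
  result = initCountTable[uint32]()
  for svId, n in counts:
    if n >= minMatches:
      result.inc(svId, n)

let kept = keepSupported([7'u32, 7, 9, 7, 9, 4], minMatches = 2)
doAssert kept[7'u32] == 3
doAssert kept[9'u32] == 2
doAssert 4'u32 notin kept
```

Building a second table avoids mutating the CountTable during iteration, which is the same reason filter_read_matches collects its `removables` first.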
--------------------------------------------------------------------------------
/src/nibpkg/refmers.nim:
--------------------------------------------------------------------------------
1 | # vim: sw=4 ts=4 sts=4 tw=0 et:
2 | import hts
3 | import kmers
4 | import tables
5 | import svidx
6 |
7 | type
8 | Chunk = object
9 | chrom_name: string
10 | chrom_start: int
11 | chrom_end: int
12 |
13 |
14 | iterator createdChunks(fai: Fai, chunk_size: int): Chunk =
15 | for i in 0.. 0. (Try 50.)
28 | var convertedKmers: pot_t = dna_to_kmers(full_sequence, kmer_size)
29 | if space > 0:
30 | convertedKmers = spacing_kmer(convertedKmers, space)
31 | #for seed in convertedKmers.seeds:
32 | # echo "btd:", bin_to_dna(seed.kmer, convertedKmers.word_size, seed.strand), ' ', seed.kmer
33 |
34 | for km in convertedKmers.seeds:
35 | if km.kmer in svKmers.counts:
36 | svKmers.counts[km.kmer].refCount.inc
37 |
38 | proc updateChunk(svKmers: var SvIndex, fai: Fai, chunk: Chunk, kmer_size: int, space: int = 0) =
39 | var sub_seq = fai.get(chunk.chrom_name, chunk.chrom_start, chunk.chrom_end)
40 | addRefCount(svKmers, sub_seq, kmer_size, space)
41 |
42 | proc updateSvIndex*(input_ref_fn: string, svKmers: var SvIndex, kmer_size: int = 25, chunk_size: int = 1_000_000, space: int = 0) =
43 | ## Walk over reference sequences and count kmers.
44 | ## Update any existing svIdx entries with these counts.
45 | ## Use spaced-seeds if space > 0. (Try 50.)
46 | var fai: Fai
47 | if not fai.open(input_ref_fn):
48 | quit "couldn't open fasta"
49 |
50 | for i in createdChunks(fai, chunk_size):
51 | echo " chunk i=", i
52 | updateChunk(svKmers, fai, i, kmer_size, space)
53 |
54 | when isMainModule:
55 | import hts
56 | var fai: Fai
57 | import times
58 |
59 | if not fai.open("/data/human/g1k_v37_decoy.fa"):
60 | quit "bad"
61 |
62 | var s = fai.get("22")
63 | var svkmers: svIdx
64 | new(svkmers)
65 | echo "starting"
66 | for i in countup(0, 100_000_000, 10):
67 | svkmers[i.uint64] = (0'u32, 0'u32, newSeq[uint32]())
68 |
69 | var t0 = cpuTime()
70 | svKmers.addRefCount(s)
71 | echo "time:", cpuTime() - t0
72 |
--------------------------------------------------------------------------------
/src/nibpkg/reporter.nim:
--------------------------------------------------------------------------------
1 | import tables
2 | import hts
3 | import svidx
4 |
5 |
6 | ## N.B.: Add a function that takes a BAM path and returns the sample name
7 | ##
8 | ## TODO: Add a function that handles genotypes using the svIdx's ref/alt count fields.
9 | proc report*(vcf_name : string, sv_read_supports : CountTableRef[uint32], sv_index : SvIndex, sample_name : string="SAMPLE") =
10 | ## Query SV supports for each SV in a VCF, appending the sample name to a field in the INFO fields if
11 | ## the SV is present in the sample (i.e., SV support count > 0)
12 | var variants:VCF
13 | doAssert open(variants, vcf_name)
14 | echo "Writing report to output.vcf"
15 |
16 | var sv_to_kmer = initTable[uint32, seq[uint64]]()
17 | for kmer, support in sv_index.counts:
18 | doAssert(support.svs.len != 0)
19 | for svId in support.svs:
20 | var a = sv_to_kmer.getOrDefault(svId)
21 | a.add(kmer)
22 | sv_to_kmer[svId] = a
23 |
24 | var outputVCF:VCF
25 | doAssert open(outputVCF, "output.vcf", "w")
26 | ## Note: this will overwrite the existing entry if any exist in the VCF
27 | discard variants.header.add_info("NIB_SAMPLES_WITH_SV", ".", "String", "Sample name is present if SV is present in sample.")
28 | discard variants.header.add_info("NIB_READ_SUPPORTS", ".", "Integer", "The number of reads supporting a given SV.")
29 | discard variants.header.add_info("NIB_SV_REF_KMERIDX_COUNT", "1", "Integer", "Number of REF kmers in SV index for SV.")
30 | discard variants.header.add_info("NIB_SV_ALT_KMERIDX_COUNT", "1", "Integer", "Number of ALT kmers in SV index for SV.")
31 |
32 | #discard variants.header.add_info("NIB_ALT_SUPPORTS", ".", "Integer", "The number of reads supporting a given SV alt.")
33 | outputVCF.copy_header(variants.header)
34 | discard outputVCF.write_header()
35 |
36 | var sample_name = sample_name
37 | var sv_id :uint32= 0
38 | for v in variants:
39 | var sv_support_count = sv_read_supports.getOrDefault(sv_id, -1)
40 | var sv_ref_k_count = 0
41 | var sv_alt_k_count = 0
42 | for km in sv_to_kmer.getOrDefault(sv_id):
43 | sv_ref_k_count += sv_index.counts[km].refCount.int
44 | sv_alt_k_count += sv_index.counts[km].altCount.int
45 |
46 | doAssert v.info.set("NIB_SV_REF_KMERIDX_COUNT", sv_ref_k_count) == Status.OK
47 | doAssert v.info.set("NIB_SV_ALT_KMERIDX_COUNT", sv_alt_k_count) == Status.OK
48 | if sv_support_count > 0:
49 | doAssert v.info.set("NIB_SAMPLES_WITH_SV", sample_name) == Status.OK
50 | doAssert v.info.set("NIB_READ_SUPPORTS", sv_support_count) == Status.OK
51 |
52 |
53 | doAssert outputVCF.write_variant(v)
54 |
55 | sv_id.inc
56 |
57 | close(outputVCF)
58 | close(variants)
59 |
--------------------------------------------------------------------------------
/src/nibpkg/svidx.nim:
--------------------------------------------------------------------------------
1 | # vim: sw=4 ts=4 sts=4 tw=0 et:
2 | import tables
3 | from strutils import nil
4 | from strformat import fmt
5 | import msgpack4nim, streams, json
6 | import ./kmers
7 |
8 | type
9 | #SvValue* = tuple[refCount: uint32, altCount: uint32, svs: seq[uint32]]
10 | SvValue* = object
11 | refCount*: uint32
12 | altCount*: uint32
13 | svs*: seq[uint32]
14 |
15 | ## A map from kmer ID -> (number of times the kmer appears in a ref seq, number of times it appears in an alt seq, list of SVs that contain the kmer)
16 | #svIdx* = TableRef[uint64, SvValue]
17 | SvIndex* = object
18 | counts*: Table[uint64, SvValue]
19 | kmerSize*: uint8
20 |
21 | proc len*(idx: SvIndex): int =
22 | return idx.counts.len
23 |
24 |
25 | #Cost savings on allocations?
26 | var empty: seq[uint32]
27 |
28 | proc lookupKmer*(idx: SvIndex, kmer: seed_t): seq[uint32] {.noInit.} =
29 | if kmer.kmer in idx.counts:
30 | return idx.counts[kmer.kmer].svs
31 | return empty
32 |
33 | proc dumpIndexToFile*(idx: SvIndex, fn: string) =
34 | let strm = openFileStream(fn, fmWrite)
35 | strm.pack(idx)
36 | strm.close()
37 |
38 | proc loadIndexFromFile*(fn: string, kmerSize: int): SvIndex =
39 | let strm = openFileStream(fn, fmRead)
40 | strm.unpack(result)
41 | strm.close()
42 | if kmerSize != result.kmerSize.int:
43 | echo "ERROR: Inconsistent SvIndex file '{fn}'\nkmerSize={kmerSize} != SvIndex.kmerSize={result.kmerSize}".fmt
44 | doAssert(kmerSize == result.kmerSize.int)
45 |
46 | proc `%`(idx: SvIndex): JsonNode =
47 | result = json.newJObject()
48 | result["kmerSize"] = %idx.kmerSize
49 | result["counts"] = json.newJObject()
50 | for k, v in idx.counts.pairs():
51 | let val = SvValue(refCount: v.refCount, altCount: v.altCount, svs: v.svs)
52 | result["counts"][$k] = %val
53 |
54 | proc dumpIndexToJson*(idx: SvIndex): string =
55 | return json.pretty(%idx)
56 |
57 | proc loadIndexFromJson*(js: string): SvIndex =
58 | ## Parse an SvIndex from the JSON produced by dumpIndexToJson.
59 | ## JSON object keys are strings, so each kmer key is parsed back to uint64.
60 | let j = json.parseJson(js)
61 | result.kmerSize = j["kmerSize"].getInt().uint8
62 | for key, val in j["counts"]:
63 | let k: uint64 = strutils.parseBiggestUint(key)
64 | let v = json.to(val, SvValue)
65 | result.counts[k] = v
66 |
67 | proc insert*(idx: var SvIndex, sequence: string, k: int, sv_idx: int = -1, space: int = 0) =
68 | ## when inserting reference sequences leave sv_idx as -1
69 | #doAssert(k == idx.kmerSize.int);
70 | var l = Dna(sequence).dna_to_kmers(k.int)
71 | if space > 0:
72 | l = spacing_kmer(l, space)
73 |
74 | # inserting alternates
75 | if sv_idx >= 0:
76 | for kmer in l.seeds:
77 | var kc = idx.counts.getOrDefault(kmer.kmer)
78 | kc.altCount.inc
79 | kc.svs.add(sv_idx.uint32)
80 | idx.counts[kmer.kmer] = kc
81 |
82 | return
83 |
84 | # inserting reference counts iff the kmer was already found as alternate.
85 | for kmer in l.seeds:
86 | # note: sometimes doing double lookup.
87 | if kmer.kmer notin idx.counts: continue
88 | idx.counts[kmer.kmer].refCount.inc
89 |
90 | proc filterRefKmers*(svKmers: var SvIndex, maxRefCount: uint32) =
91 | ## Remove entries in the SV index that have a ref count higher than specified
92 | echo "before:", svKmers.len, " maxRefCount:", maxRefCount
93 | var toRemove: seq[uint64]
94 | for k, v in pairs(svKmers.counts):
95 | if v.refCount > maxRefCount:
96 | toRemove.add(k)
97 | for k in toRemove:
98 | svKmers.counts.del(k)
99 | echo "after:", svKmers.len, " maxRefCount:", maxRefCount
100 |
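
The index life cycle in svidx.nim above has three steps: ALT kmers create entries, reference kmers only bump counts on entries that already exist, and entries that are too common in the reference are dropped. The sketch below walks through those steps on a toy table. The type `KmerInfo` and the procs `addAlt`, `addRef`, and `dropCommon` are hypothetical stand-ins written for this example, and the kmer keys are arbitrary integers rather than encoded sequences.

```nim
import tables

type
  KmerInfo = object  # same shape as SvValue, redeclared to stay standalone
    refCount, altCount: uint32
    svs: seq[uint32]

proc addAlt(index: var Table[uint64, KmerInfo]; kmer: uint64; svId: uint32) =
  ## ALT kmers always create or update an entry.
  var info = index.getOrDefault(kmer)
  info.altCount.inc
  info.svs.add(svId)
  index[kmer] = info

proc addRef(index: var Table[uint64, KmerInfo]; kmer: uint64) =
  ## Reference kmers count only if an ALT allele already introduced the kmer.
  if kmer in index:
    index[kmer].refCount.inc

proc dropCommon(index: var Table[uint64, KmerInfo]; maxRefCount: uint32) =
  ## Remove kmers seen in the reference more than maxRefCount times.
  var toRemove: seq[uint64]
  for k, v in index:
    if v.refCount > maxRefCount:
      toRemove.add(k)
  for k in toRemove:
    index.del(k)

var idx = initTable[uint64, KmerInfo]()
idx.addAlt(10, 0)
idx.addAlt(11, 0)
idx.addAlt(11, 1)
idx.addRef(11)
idx.addRef(11)
idx.addRef(99)    # ignored: kmer 99 was never seen in an ALT allele
idx.dropCommon(1) # kmer 11 has refCount 2, so it is removed
doAssert 10'u64 in idx
doAssert 11'u64 notin idx
doAssert 99'u64 notin idx
```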
--------------------------------------------------------------------------------
/src/nibpkg/util.nim:
--------------------------------------------------------------------------------
1 | # vim: sts=4:ts=4:sw=4:et:tw=0
2 | #from cpuinfo import nil
3 | from math import nil
4 | from os import nil
5 | #from threadpool import nil
6 | from streams import nil
7 | from strformat import fmt
8 | from strutils import nil
9 | import heapqueue
10 | import osproc
11 | import times
12 |
13 | type PbError* = object of CatchableError
14 | type GenomeCoverageError* = object of PbError
15 | type FieldTooLongError* = object of PbError
16 | type TooFewFieldsError* = object of PbError
17 |
18 | proc raiseEx*(msg: string) {.discardable.} =
19 | raise newException(PbError, msg)
20 |
21 | proc isEmptyFile*(fn: string): bool =
22 | var finfo = os.getFileInfo(fn)
23 | if finfo.size == 0:
24 | return true
25 | return false
26 |
27 | #from strformat import fmt
28 | proc isOlderFile*(afn, bfn: string): bool =
29 | ## Return true iff afn is older than bfn.
30 | let
31 | at = os.getLastModificationTime(afn)
32 | bt = os.getLastModificationTime(bfn)
33 | #af = at.format("yyyy-MM-dd'T'HH:mm:ss,ffffffzzz")
34 | #bf = bt.format("yyyy-MM-dd'T'HH:mm:ss,ffffffzzz")
35 | #echo "glmt {afn}: {af}, {bfn}: {bf}".fmt
36 | return at < bt
37 |
38 | template withcd*(newdir: string, statements: untyped) =
39 | let olddir = os.getCurrentDir()
40 | os.setCurrentDir(newdir)
41 | defer: os.setCurrentDir(olddir)
42 | statements
43 |
44 | proc log*(words: varargs[string, `$`]) =
45 | for word in words:
46 | write(stderr, word)
47 | write(stderr, '\l')
48 |
49 | proc logt*(words: varargs[string, `$`]) =
50 | var then {.global.} = times.now()
51 | let
52 | since = times.initDuration(seconds = times.inSeconds(times.now() - then))
53 | dp = times.toParts(since)
54 | prefix = strformat.fmt("{dp[Hours]}:{dp[Minutes]:02d}:{dp[Seconds]:02d}s ")
55 | write(stderr, prefix)
56 | log(words)
57 |
58 | proc adjustThreadPool*(n: int) =
59 | ## n==0 => use ncpus
60 | ## n==-1 => do not alter threadpool size (to avoid a weird problem for now)
61 | log("(ThreadPool is currently not used.)")
62 | #var size = n
63 | #if n == 0:
64 | # size = cpuinfo.countProcessors()
65 | #if size > threadpool.MaxThreadPoolSize:
66 | # size = threadpool.MaxThreadPoolSize
67 | #if size == -1:
68 | # log("ThreadPoolsize=", size,
69 | # " (i.e. do not change)",
70 | # ", MaxThreadPoolSize=", threadpool.MaxThreadPoolSize,
71 | # ", NumCpus=", cpuinfo.countProcessors())
72 | # return
73 | #log("ThreadPoolsize=", size,
74 | # ", MaxThreadPoolSize=", threadpool.MaxThreadPoolSize,
75 | # ", NumCpus=", cpuinfo.countProcessors())
76 | #threadpool.setMaxPoolSize(size)
77 |
78 | iterator walk*(dir: string, followlinks = false, relative = false): string =
79 | ## similar to python os.walk(), but always topdown and no "onerror"
80 | # Slow! 30x slower than Unix find.
81 | let followFilter = if followLinks: {os.pcDir, os.pcLinkToDir} else: {os.pcDir}
82 | let yieldFilter = {os.pcFile, os.pcLinkToFile}
83 | for p in os.walkDirRec(dir, yieldFilter = yieldFilter,
84 | followFilter = followFilter, relative = relative):
85 | yield p
86 |
87 | iterator readProc*(cmd: string): string =
88 | ## Stream from Unix subprocess, e.g. "find .".
89 | ## But if cmd=="-", stream directly from stdin.
90 | if cmd == "-":
91 | log("Reading from stdin...")
92 | for line in lines(stdin):
93 | yield line
94 | else:
95 | log("Reading from '" & cmd & "'...")
96 | var p = osproc.startProcess(cmd, options = {poEvalCommand})
97 | if osproc.peekExitCode(p) > 0:
98 | let msg = "Immediate failure in readProc startProcess('" & cmd & "')"
99 | raiseEx(msg)
100 | defer: osproc.close(p)
101 | for line in streams.lines(osproc.outputStream(p)):
102 | yield line
103 |
104 | iterator readProcInMemory(cmd: string): string =
105 | ## Read from Unix subprocess, e.g. "find .", into memory.
106 | ## But if cmd=="-", stream directly from stdin.
107 | if cmd == "-":
108 | log("Reading from stdin...")
109 | for line in lines(stdin):
110 | yield line
111 | else:
112 | log("Reading from '" & cmd & "'...")
113 | let found = osproc.execProcess(cmd, options = {poEvalCommand})
114 | var sin = streams.newStringStream(found)
115 | for line in streams.lines(sin):
116 | yield line
117 |
118 | proc removeFile*(fn: string, failIfMissing = false) =
119 | if failIfMissing and not os.fileExists(fn):
120 | raiseEx("Cannot remove non-existent file '" & fn & "'")
121 | log("rm -f ", fn)
122 | os.removeFile(fn)
123 |
124 | proc removeFiles*(fns: openarray[string], failIfMissing = false) =
125 | for fn in fns:
126 | removeFile(fn, failIfMissing)
127 |
128 | proc which*(exe: string) =
129 | let cmd = "which " & exe
130 | log(cmd)
131 | discard execCmd(cmd)
132 |
133 | proc thousands*(v: SomeInteger): string =
134 | if v == 0:
135 | return "0"
136 | var i: type(v) = v
137 | let negative = (i < 0)
138 | i = abs(i)
139 | #result = strformat.fmt"{i mod 1000:03}"
140 | #i = i div 1000
141 | while i > 0:
142 | result = strformat.fmt"{i mod 1000:03}," & result
143 | i = i div 1000
144 | # Drop trailing comma.
145 | assert result[^1] == ','
146 | result = result[0 .. ^2]
147 | # Drop leading 0s.
148 | while result[0] == '0':
149 | result = result[1 .. ^1]
150 | if negative:
151 | result = '-' & result
152 |
153 | proc splitWeighted*(n: int, sizes: seq[int]): seq[int] =
154 | # Split sizes into n contiguous subsets, weighted by each size.
155 | # Each elem of result will represent a range of elems of sizes.
156 | # len(result) will be <= n
157 |
158 | if n == 0:
159 | return
160 | var sums: seq[int]
161 | var totalSize = math.sum(sizes)
162 | var remSize = totalSize
163 | var curr = 0
164 | var remN = min(n, len(sizes))
165 | while len(sizes) > curr:
166 | #assert len(sizes) > curr, "not enough elements in sizes {len(sizes)} <= {curr}".fmt
167 | result.add(0)
168 | let approx = int(math.ceil(remSize / remN))
169 | #echo "approx={approx}, remaining={remN}, tot={remSize}".fmt
170 | sums.add(0)
171 | while sums[^1] < approx:
172 | result[^1] += 1
173 | sums[^1] += sizes[curr]
174 | curr += 1
175 | remN -= 1
176 | remSize -= sums[^1]
177 | assert math.sum(result) == len(sizes)
178 | assert math.sum(sizes) == totalSize
179 | assert len(result) <= n
180 |
181 | type
182 | BinSum = object
183 | indices: seq[int]
184 | sum: int64
185 | order: int
186 | WeightedIndex = tuple[index: int, size: int]
187 |
188 | proc `<`(a, b: BinSum): bool =
189 | return a.sum < b.sum or (a.sum == b.sum and a.indices.len() < b.indices.len()) or
190 | (a.sum == b.sum and a.indices.len() == b.indices.len() and a.order < b.order)
191 | proc `<`(a, b: WeightedIndex): bool =
192 | return a.size > b.size or (a.size == b.size and a.index > b.index)
193 |
194 | proc partitionWeighted*(n: int, sizes: seq[int]): seq[seq[int]] =
195 | ## {sizes} is an index; other seqs refer to its indices.
196 | ## The splits for this version are not required to be contiguous.
197 | ## The result has at most n index-seqs, none of which are empty.
198 | var biggest = initHeapQueue[WeightedIndex]()
199 | for i in 0 ..< len(sizes):
200 | let wi: WeightedIndex = (index: i, size: sizes[i])
201 | biggest.push(wi)
202 | var smallest_bin = initHeapQueue[BinSum]()
203 | for x in 0 ..< n:
204 | var bin: BinSum = BinSum(sum: 0, order: x)
205 | smallest_bin.push(bin)
206 | while biggest.len() > 0:
207 | let wi = biggest.pop()
208 | var bin = smallest_bin.pop()
209 | bin.indices.add(wi.index)
210 | bin.sum += wi.size
211 | smallest_bin.push(bin)
212 | while smallest_bin.len() > 0:
213 | let bin = smallest_bin.pop()
214 | if bin.indices.len() > 0:
215 | result.add(bin.indices)
216 | return result
217 |
218 | proc combineToTarget*(target: int64, weights: seq[int64]): seq[seq[int]] =
219 | # Given a seq of weights,
220 | # combine consecutive groups of them until they meet target.
221 | # Return a seq of seqs of those indices. For now,
222 | # the results will always be consecutive, e.g.
223 | # [ [0,1,2], [3], [4], [5,6] ]
224 | var
225 | total = target
226 | n = -1
227 | for i in 0 ..< len(weights):
228 | let next_weight = weights[i]
229 | #echo "i:{i} next:{next_weight} total:{total} n:{n}".fmt
230 | if total >= target:
231 | # new group
232 | result.add(@[i])
233 | n = len(result) - 1
234 | total = next_weight
235 | else:
236 | # current group
237 | result[n].add(i)
238 | total += next_weight
239 |
240 | const
241 | MAX_HEADROOM* = 1024
242 | type
243 | Headroom* = array[MAX_HEADROOM, cchar]
244 |
245 | proc sscanf*(s: cstring, frmt: cstring): cint {.varargs, importc,
246 | header: "<stdio.h>".}
247 |
248 | proc strlen(s: cstring): cint {.importc: "strlen", nodecl.}
249 |
250 | proc strlen(a: var Headroom): int =
251 | let n = strlen(cast[cstring](addr a))
252 | return n
253 |
254 | proc toString*(ins: var Headroom, outs: var string, source: string = "") =
255 | var n = strlen(ins)
256 | if n >= (MAX_HEADROOM - 1):
257 | # Why is max-1 illegal? B/c this is used after sscanf, and that has no way to report
258 | # a buffer-overflow. So a 0 at end-of-buffer is considered too long.
259 | let msg = strformat.fmt"Too many characters in substring (>{MAX_HEADROOM - 1}) from '{source}'"
260 | raise newException(util.FieldTooLongError, msg)
261 | outs.setLen(n)
262 | for i in 0 ..< n:
263 | outs[i] = ins[i]
264 |
265 | proc getNthWord*(line: string, n: Natural, delim: char): string =
266 | ## n is 0-based
267 | var
268 | start = 0
269 | count = 0
270 | found = -1
271 | while count < n:
272 | found = strutils.find(line, delim, start)
273 | if found == -1:
274 | let msg = "Found only {count} < {n} instances of '{delim}' in '{line}'".fmt
275 | raiseEx(msg)
276 | start = found + 1
277 | count += 1
278 | var wordEnd = strutils.find(line, delim, start)
279 | if wordEnd == -1:
280 | wordEnd = line.len()
281 | return line[start..(wordEnd-1)]
282 |
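
partitionWeighted in util.nim above uses a greedy strategy: take the items largest first and always place the next one in the bin with the smallest running sum. The following sketch shows the same idea without heap queues, using a linear scan over the bins. The proc name `greedyPartition` is hypothetical, and the order in which bins are returned differs from partitionWeighted, so treat it as an illustration of the technique rather than a drop-in replacement.

```nim
import algorithm

# Hypothetical helper for illustration only; not part of nibpkg.
proc greedyPartition(n: int; sizes: seq[int]): seq[seq[int]] =
  ## Split the indices of `sizes` into at most `n` non-empty groups with
  ## roughly balanced total size.
  doAssert n > 0
  var order = newSeq[int](sizes.len)
  for i in 0 ..< sizes.len:
    order[i] = i
  # Visit indices from largest size to smallest.
  order.sort(proc (a, b: int): int = cmp(sizes[b], sizes[a]))
  var bins = newSeq[seq[int]](n)
  var sums = newSeq[int](n)
  for i in order:
    # Pick the bin with the smallest running sum (first one wins ties).
    var best = 0
    for b in 1 ..< n:
      if sums[b] < sums[best]:
        best = b
    bins[best].add(i)
    sums[best] += sizes[i]
  for b in bins:
    if b.len > 0:
      result.add(b)

# Sizes 8 and 2 land in one bin (sum 10), sizes 5 and 3 in the other (sum 8).
doAssert greedyPartition(2, @[5, 3, 8, 2]) == @[@[2, 3], @[0, 1]]
```

The linear scan costs O(n) per item, which is fine for a handful of bins; the heap-based version in util.nim is the better choice when n is large.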
--------------------------------------------------------------------------------
/src/nibpkg/welcome.nim:
--------------------------------------------------------------------------------
1 | # vim: sw=4 ts=4 sts=4 tw=0 et:
2 |
3 | proc getWelcomeMessage*(): string =
4 | "Hello, World!"
5 |
--------------------------------------------------------------------------------
/src/nibsv.nim:
--------------------------------------------------------------------------------
1 | from nibpkg/compose import nil
2 | from nibpkg/classify import nil
3 | from nibpkg/captain import nil
4 |
5 | when isMainModule:
6 | import cligen
7 | dispatchMulti(
8 | [compose.compose_variants, cmdName = "compose"],
9 | [classify.buildSvIndex, cmdName = "lookup"],
10 | [classify.main_classify, cmdName = "classify"],
11 | [captain.main_runner, cmdName = "main",
12 | help={
13 | "variants-fn": "long read VCF SV calls",
14 | "refSeq-fn": "reference genome FASTA, compressed OK",
15 | "reads-fn": "input short-reads in BAM/SAM/CRAM/FASTQ",
16 | "prefix" : "output prefix",
17 | "kmer-size" : "kmer size, for spaced seeds use <=16 otherwise <=32",
18 | "spaced-seeds" : "turn on spaced seeds",
19 | "space" : "width between spaced kmers",
20 | "flank" : "number of bases on either side of ALT/REF in VCF records",
21 | "max-ref-kmer-count" : "max number of reference kmers allowed in SV event"
22 | }
23 | ],
24 | )
25 |
--------------------------------------------------------------------------------
/test-data/GIAB-chr22.vcf:
--------------------------------------------------------------------------------
1 | ##fileformat=VCFv4.2
2 | ##fileDate=2020-03-04T19:05:39.98Z
3 | ##source=pbsv 2.3.0 (commit v2.3.0)
4 | ##PG="pbsv call -j 16 -t DEL,INS,INV -m 20 -A 3 -O 3 --call-min-read-perc-one-sample 20 /pbi/dept/secondary/siv/references/human_GRCh38_no_alt_analysis_set/sequence/human_GRCh38_no_alt_analysis_set.fasta /pbi/dept/bifx/awenger/prj/giab/20200303_PacBio_pbsv/svsig/AJTrio_GRCh38.fofn /pbi/dept/bifx/awenger/prj/giab/20200303_PacBio_pbsv/vcf/AJTrio_GRCh38.pbsv.vcf"
5 | ##INFO=
6 | ##INFO=
7 | ##INFO=
8 | ##INFO=
9 | ##INFO=
10 | ##INFO=
11 | ##INFO=
12 | ##INFO=
13 | ##INFO=
14 | ##ALT=
15 | ##ALT=
16 | ##ALT=
17 | ##FILTER=
18 | ##FILTER== 50 Ns) in the reference assembly">
19 | ##FILTER=
20 | ##FILTER=
21 | ##FILTER=
22 | ##FORMAT=
23 | ##FORMAT=
24 | ##FORMAT=
25 | ##FORMAT=
26 | ##FORMAT=
27 | ##reference=file:///pbi/dept/secondary/siv/references/human_GRCh38_no_alt_analysis_set/sequence/human_GRCh38_no_alt_analysis_set.fasta
28 | ##contig=
29 | ##contig=
30 | ##contig=
31 | ##contig=
32 | ##contig=
33 | ##contig=
34 | ##contig=
35 | ##contig=
36 | ##contig=
37 | ##contig=
38 | ##contig=
39 | ##contig=
40 | ##contig=
41 | ##contig=
42 | ##contig=
43 | ##contig=
44 | ##contig=
45 | ##contig=
46 | ##contig=
47 | ##contig=
48 | ##contig=
49 | ##contig=
50 | ##contig=
51 | ##contig=
52 | ##contig=
53 | ##contig=
54 | ##contig=
55 | ##contig=
56 | ##contig=
57 | ##contig=
58 | ##contig=
59 | ##contig=
60 | ##contig=
61 | ##contig=
62 | ##contig=
63 | ##contig=
64 | ##contig=
65 | ##contig=
66 | ##contig=
67 | ##contig=
68 | ##contig=
69 | ##contig=
70 | ##contig=
71 | ##contig=
72 | ##contig=
73 | ##contig=
74 | ##contig=
75 | ##contig=
76 | ##contig=
77 | ##contig=
78 | ##contig=
79 | ##contig=
80 | ##contig=
81 | ##contig=
82 | ##contig=
83 | ##contig=
84 | ##contig=
85 | ##contig=
86 | ##contig=
87 | ##contig=
88 | ##contig=
89 | ##contig=
90 | ##contig=
91 | ##contig=
92 | ##contig=
93 | ##contig=
94 | ##contig=
95 | ##contig=
96 | ##contig=
97 | ##contig=
98 | ##contig=
99 | ##contig=
100 | ##contig=
101 | ##contig=
102 | ##contig=
103 | ##contig=
104 | ##contig=
105 | ##contig=
106 | ##contig=
107 | ##contig=
108 | ##contig=
109 | ##contig=
110 | ##contig=
111 | ##contig=
112 | ##contig=
113 | ##contig=
114 | ##contig=
115 | ##contig=
116 | ##contig=
117 | ##contig=
118 | ##contig=
119 | ##contig=
120 | ##contig=
121 | ##contig=
122 | ##contig=
123 | ##contig=
124 | ##contig=
125 | ##contig=
126 | ##contig=
127 | ##contig=
128 | ##contig=
129 | ##contig=
130 | ##contig=
131 | ##contig=
132 | ##contig=
133 | ##contig=
134 | ##contig=
135 | ##contig=
136 | ##contig=
137 | ##contig=
138 | ##contig=
139 | ##contig=
140 | ##contig=
141 | ##contig=
142 | ##contig=
143 | ##contig=
144 | ##contig=
145 | ##contig=
146 | ##contig=
147 | ##contig=
148 | ##contig=
149 | ##contig=
150 | ##contig=
151 | ##contig=
152 | ##contig=
153 | ##contig=
154 | ##contig=
155 | ##contig=
156 | ##contig=
157 | ##contig=
158 | ##contig=
159 | ##contig=
160 | ##contig=
161 | ##contig=
162 | ##contig=
163 | ##contig=
164 | ##contig=
165 | ##contig=
166 | ##contig=
167 | ##contig=
168 | ##contig=
169 | ##contig=
170 | ##contig=
171 | ##contig=
172 | ##contig=
173 | ##contig=
174 | ##contig=
175 | ##contig=
176 | ##contig=
177 | ##contig=
178 | ##contig=
179 | ##contig=
180 | ##contig=
181 | ##contig=
182 | ##contig=
183 | ##contig=
184 | ##contig=
185 | ##contig=
186 | ##contig=
187 | ##contig=
188 | ##contig=
189 | ##contig=
190 | ##contig=
191 | ##contig=
192 | ##contig=
193 | ##contig=
194 | ##contig=
195 | ##contig=
196 | ##contig=
197 | ##contig=
198 | ##contig=
199 | ##contig=
200 | ##contig=
201 | ##contig=
202 | ##contig=
203 | ##contig=
204 | ##contig=
205 | ##contig=
206 | ##contig=
207 | ##contig=
208 | ##contig=
209 | ##contig=
210 | ##contig=
211 | ##contig=
212 | ##contig=
213 | ##contig=
214 | ##contig=
215 | ##contig=
216 | ##contig=
217 | ##contig=
218 | ##contig=
219 | ##contig=
220 | ##contig=
221 | ##contig=
222 | ##contig=
223 | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002 HG003 HG004
224 | chr22 48910763 pbsv.DEL.66387 CCCCAGATTCTGAAATCTTTCATTGTGGTTGAAGTCTCCCCTCCCGA C . PASS SVTYPE=DEL;END=48910809;SVLEN=-46;SVANN=TANDEM GT:AD:DP 0/1:17,11:28 0/1:16,13:29 0/0:25,0:25
225 |
--------------------------------------------------------------------------------
/test-data/GIAB_PBSV_TRIO_CALLS.vcf:
--------------------------------------------------------------------------------
1 | ##fileformat=VCFv4.2
2 | ##fileDate=2020-03-04T19:05:39.98Z
3 | ##source=pbsv 2.3.0 (commit v2.3.0)
4 | ##PG="pbsv call -j 16 -t DEL,INS,INV -m 20 -A 3 -O 3 --call-min-read-perc-one-sample 20 /pbi/dept/secondary/siv/references/human_GRCh38_no_alt_analysis_set/sequence/human_GRCh38_no_alt_analysis_set.fasta /pbi/dept/bifx/awenger/prj/giab/20200303_PacBio_pbsv/svsig/AJTrio_GRCh38.fofn /pbi/dept/bifx/awenger/prj/giab/20200303_PacBio_pbsv/vcf/AJTrio_GRCh38.pbsv.vcf"
5 | ##INFO=
6 | ##INFO=
7 | ##INFO=
8 | ##INFO=
9 | ##INFO=
10 | ##INFO=
11 | ##INFO=
12 | ##INFO=
13 | ##INFO=
14 | ##ALT=
15 | ##ALT=
16 | ##ALT=
17 | ##FILTER=
18 | ##FILTER== 50 Ns) in the reference assembly">
19 | ##FILTER=
20 | ##FILTER=
21 | ##FILTER=
22 | ##FORMAT=
23 | ##FORMAT=
24 | ##FORMAT=
25 | ##FORMAT=
26 | ##FORMAT=
27 | ##reference=file:///pbi/dept/secondary/siv/references/human_GRCh38_no_alt_analysis_set/sequence/human_GRCh38_no_alt_analysis_set.fasta
28 | ##contig=
29 | ##contig=
30 | ##contig=
31 | ##contig=
32 | ##contig=
33 | ##contig=
34 | ##contig=
35 | ##contig=
36 | ##contig=
37 | ##contig=
38 | ##contig=
39 | ##contig=