├── LICENSE
├── README.md
└── probetools_v_0_1_11.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Kevin Kuchinski
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ProbeTools
2 | ProbeTools is a collection of general-purpose modules for designing hybridization probe panels targeting diverse and hypervariable viral taxa. The objective of ProbeTools is to generate the smallest possible panel of oligo sequences that maximizes coverage of the provided target sequences. It is based on k-mer clustering. In brief, probe-length k-mers are enumerated from the target space, usually spaced one nucleotide apart so that every possible k-mer is captured. The k-mers are then clustered by nucleotide sequence identity to collapse the redundant probes enumerated from conserved genomic loci. Cluster centroids become probe candidates and are ranked by the size of the cluster they represent; centroids representing larger clusters are assumed to make better probes by virtue of their similarity to more sub-sequences in the target space.
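To make the enumeration step concrete, here is a minimal illustrative sketch in Python (not part of the package; the function name is hypothetical) of how probe-length k-mers can be enumerated from a target sequence before clustering:
```
def enumerate_kmers(seq, k=120, step=1):
    """Yield every k-mer of length k from seq, advancing step bases at a time."""
    for i in range(0, len(seq) - k + 1, step):
        yield seq[i:i + k]
```
In ProbeTools itself, the enumerated k-mers are then clustered with VSEARCH at a nucleotide identity threshold (90% by default), and the cluster centroids become the candidate probes.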
3 |
4 | ProbeTools can further optimize probe panel designs by using an incremental strategy. In this strategy, probes are added to the panel in batches. Between the addition of each batch, ProbeTools determines what regions of the target space have achieved coverage and removes them from the target space before designing the next batch. This improves coverage of less-common sub-sequences in the target space and reduces the generation of redundant probes.
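The sketch below is a self-contained toy illustration of this incremental loop, not the actual makeprobes implementation; it substitutes exact k-mer counting for VSEARCH's identity-based clustering and exact substring matching for BLASTn alignment:
```
from collections import Counter

def design_toy_panel(targets, k=8, batch_size=2, max_rounds=10):
    """Toy incremental design: exact k-mer counting stands in for identity-based
    clustering, and exact substring matching stands in for probe-target alignment."""
    panel = []
    covered = {name: [False] * len(seq) for name, seq in targets.items()}
    for _ in range(max_rounds):
        # Enumerate k-mers only from windows that are not yet fully covered
        counts = Counter()
        for name, seq in targets.items():
            for i in range(len(seq) - k + 1):
                if not all(covered[name][i:i + k]):
                    counts[seq[i:i + k]] += 1
        if not counts:
            break
        # The most frequent k-mers become this round's probe batch
        batch = [kmer for kmer, _ in counts.most_common(batch_size)]
        panel.extend(batch)
        # Mark every position covered by an exact match to a new probe
        for name, seq in targets.items():
            for probe in batch:
                start = seq.find(probe)
                while start != -1:
                    for j in range(start, start + k):
                        covered[name][j] = True
                    start = seq.find(probe, start + 1)
    return panel

panel = design_toy_panel({'target_1': 'ATGCGTTGACAGTGCACACG', 'target_2': 'ATGCGTTGACTTTTGCACAC'}, k=8, batch_size=1)
```
In the real tool, clustering near-identical k-mers and accepting imperfect probe-target alignments are what allow a compact panel to cover hypervariable targets.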
5 |
6 | Additional details and discussion about ProbeTools, along with in silico and in vitro validation results, can be found in:
7 |
8 | Kuchinski KS, et al. ProbeTools: designing hybridization probes for targeted genomic sequencing of diverse and hypervariable viral taxa. BMC Genomics. 2022 Aug 12;23(1):579. doi: 10.1186/s12864-022-08790-4. PMID: 35953803; PMCID: PMC9371634.
9 |
10 | # Setup
11 | ProbeTools requires VSEARCH and BLASTn. The ProbeTools package can be installed with these dependencies via Anaconda/Miniconda. It can also be installed separately from its dependencies via the Python Package Index (PyPI).
12 | ## Anaconda/Miniconda
13 | 1. Create a conda environment for ProbeTools (replace env_name with a name of your choice for the ProbeTools environment):
14 | ```
15 | conda create -n env_name -c kevinkuchinski probetools
16 | ```
17 | ## PyPI
18 |
19 | 1. Install Python (version 3.7 or greater) from https://www.python.org/
20 | 2. Install the ProbeTools package:
21 | ```
22 | pip install probetools
23 | ```
24 | 3. Install VSEARCH (version 2.15.2 recommended) from https://github.com/torognes/vsearch
25 | 4. Install BLAST (version 2.12.0 recommended) from https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
26 |
27 |
28 | # Quick-start to probe design
29 | ProbeTools provides the makeprobes module as a user-friendly, general-purpose implementation of the incremental k-mer clustering strategy. Simply provide a FASTA file containing target sequences (-t), the number of probes to add in each batch (-b), and an output path and design name to append to output files (-o):
30 | ```
31 | probetools makeprobes -t target_space_FASTA.fa -b 100 -o demo_probes_dir/demo_probes
32 | ```
33 | makeprobes will add batches of probes to the panel until one of three end points is reached:
34 | 1. The panel achieves a target coverage goal (default: 90% of target sequences have at least 90% of their nucleotide positions covered)
35 | 2. The panel reaches a specific size (default: MAX, i.e. the panel continues to grow until one of the other end points is reached)
36 | 3. No further probe sequences can be designed
37 |
38 | The desired coverage goal and the maximum panel size can be set, along with numerous other parameters (see usage guide below). In general, smaller batch sizes will provide more compact panels but take more rounds of design and, thus, longer to compute.
39 |
40 | # ProbeTools modules
41 | ProbeTools consists of 6 modules:
42 | 1. makeprobes - a user-friendly, general-purpose implementation of the incremental k-mer clustering strategy
43 | 2. clusterkmers - single-batch probe generation using the k-mer clustering algorithm
44 | 3. capture - in silico assessment of how well provided probe sequences cover provided target sequences
45 | 4. getlowcov - uses output of capture to extract low-coverage regions from provided target sequences
46 | 5. stats - uses output of capture to calculate coverage statistics overall and for each provided target sequence
47 | 6. merge - merges output files generated by capture module
48 |
49 | # Usage guide for ProbeTools modules
50 | ## makeprobes
51 | A general-purpose implementation of the incremental k-mer clustering strategy. Probes are added to the panel in batches. Between the addition of each batch, ProbeTools determines what regions of the target space have achieved coverage and removes them from the target space before designing the next batch. Probe sequences are provided in the output_name_probes.fa file with probe sequences ranked in descending order of cluster size. NOTE: for best results, all target sequences should be provided on the same strand/in the same sense.
52 |
53 | Usage example:
54 | ```
55 | $ probetools makeprobes -t <target seqs FASTA> -b <batch size> -o <output directory>/<output name> [<optional arguments>]
56 | ```
57 | Required arguments:
58 |
59 | -t : path to target sequences in FASTA file
60 | -b : number of probes in each batch (min=1)
61 | -o : path to output directory and design name to append to output files
62 |
63 | Optional arguments:
64 |
65 | -m : max number of probes to add to panel (default=MAX, min=1)
66 | -c : target for 10th percentile of probe coverage (default=90, min=1, max=100)
67 | -k : length of probes to generate (default=120, min=32)
68 | -s : number of bases separating each kmer (default=1, min=1)
69 | -d : number of degenerate bases to permit in probes (default=0, min=0)
70 | -i : nucleotide sequence identity (%) threshold used for kmer clustering and probe-target alignments (default=90, min=50, max=100)
71 | -l : minimum length for probe-target alignments (default=60, min=1)
72 | -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)
73 | -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)
74 | -T : number of threads used by VSEARCH and BLASTn for clustering kmers and aligning probes to targets (default=MAX for VSEARCH, default=1 for BLASTn, min=1)
75 |
76 | ## clusterkmers
77 | Enumerate and cluster kmers from target sequences. Extract cluster centroids as probe candidates ranked by cluster size. Probe sequences are provided in the output_name_probes.fa file with probe sequences ranked in descending order of cluster size. NOTE: for best results, all target sequences should be provided on the same strand/in the same sense.
78 |
79 | Usage example:
80 | ```
81 | $ probetools clusterkmers -t <target seqs FASTA> -o <output directory>/<output name> [<optional arguments>]
82 | ```
83 | Required arguments:
84 |
85 | -t : path to target sequences in FASTA file
86 | -o : path to output directory and design name to append to output files
87 |
88 | Optional arguments:
89 |
90 | -k : length of kmers to enumerate (default=120, min=32)
91 | -s : number of bases separating each kmer (default=1, min=1)
92 | -d : number of degenerate bases to permit in probes (default=0, min=0)
93 | -i : nucleotide sequence identity (%) threshold used for kmer clustering (default=90, min=50, max=100)
94 | -p : path to FASTA file containing previously-generated probe sequences to remove from new probes
95 | -n : number of probe candidates to return (default=MAX, min=1)
96 | -T : number of threads used by VSEARCH for clustering kmers (default=MAX, min=1)
97 |
98 | ## capture
99 | Assess probe panel coverage of target sequences. BLASTn is used to align each provided probe sequence against each provided target sequence. BLASTn output is parsed to determine how many probes cover each nucleotide position in target sequences. Results are output to the output_name_capture.pt file (see .pt format specifications below).
100 |
101 | Usage example:
102 | ```
103 | $ probetools capture -t <target seqs FASTA> -p <probe seqs FASTA> -o <output directory>/<output name> [<optional arguments>]
104 | ```
105 | Required arguments:
106 |
107 | -t : path to target sequences in FASTA file
108 | -p : path to probe sequences in FASTA file
109 | -o : path to output directory and design name to append to output files
110 |
111 | Optional arguments:
112 |
113 | -i : nucleotide sequence identity (%) threshold used for probe-target alignments (default=90, min=50, max=100)
114 | -l : minimum length for probe-target alignments (default=60, min=1)
115 | -T : number of threads used by BLASTn for aligning probes to targets (default=1, min=1)
116 |
117 | ## getlowcov
118 | Extract poorly covered sub-sequences from target sequences based on a specific set of capture results. Low-coverage sub-sequences are written to the output_name_low_cov_seqs.fa file.
119 |
120 | Usage example:
121 | ```
122 | $ probetools getlowcov -i <capture results PT file> -o <output directory>/<output name> [<optional arguments>]
123 | ```
124 | Required arguments:
125 |
126 | -i : path to capture results in PT file
127 | -o : path to output directory and design name to append to output files
128 |
129 | Optional arguments:
130 |
131 | -k : minimum sub-sequence length extracted, should be same as kmer length used for making probes (default=120, min=32)
132 | -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)
133 | -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)
134 |
135 | ## stats
136 | Calculate and tabulate probe coverage statistics for target sequences. Overall target space statistics are provided in output_name_summary_stats_report.tsv and statistics for each target sequence are provided in output_name_long_stats_report.tsv.
137 |
138 | Usage example:
139 | ```
140 | $ probetools stats -i <capture results PT file> -o <output directory>/<output name>
141 | ```
142 | Required arguments:
143 |
144 | -i : path to capture results in PT file
145 | -o : path to output directory and design name to append to output files
146 |
147 | ## merge
148 | Combine the results from two capture module output files. This module conducts an outer merge: if an entry with the same header (and an identical nucleotide sequence) appears in both files, its probe depth lists are summed position-by-position. Entries appearing in only one of the files are copied to the merged file unmodified.
149 |
150 | Usage example:
151 | ```
152 | $ probetools merge -i <capture results PT file> -I <other capture results PT file> -o <merged capture results PT file>
153 | ```
154 | Required arguments:
155 |
156 | -i : path to capture results in PT file
157 | -I : path to other capture results in PT file
158 | -o : path to merged capture results PT output file
159 |
160 | # .pt Format Specifications
161 | The .pt format is used for output from the capture module and input for stats and getlowcov modules. The .pt format is largely derived from the FASTA format. Each entry spans three lines, and each line starts with its own identifying character:
162 |
163 | Entry header (>): A text header to describe the sequence. Do not use spaces in the header.
164 |
165 | Entry sequence ($): The nucleotide sequence of the entry.
166 |
167 | Entry probe depths (#): A comma-separated list of the number of probes covering each nucleotide position. The order of the list follows the order of the nucleotide sequence, i.e. the 4th number in the list describes the number of probes covering the 4th nucleotide position of the entry's sequence.
168 |
169 | Example entry:
170 | ```
171 | >Entry_header
172 | $ATGCGTTGACAGTGCACACG
173 | #1,1,1,1,1,2,2,2,2,2,1,1,2,2,2,3,3,3,3,3
174 | ```
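For reference, a .pt file can be parsed with a few lines of Python; the sketch below is illustrative only (read_pt is not a ProbeTools function):
```
def read_pt(path):
    """Parse a .pt file into a dict mapping header -> (sequence, list of probe depths)."""
    entries = {}
    header = None
    with open(path) as pt_file:
        for line in pt_file:
            line = line.strip()
            if line.startswith('>'):
                header = line[1:]
                entries[header] = ['', []]
            elif line.startswith('$'):
                entries[header][0] = line[1:]
            elif line.startswith('#'):
                entries[header][1] = [int(d) for d in line[1:].split(',')]
    return {h: (seq, depths) for h, (seq, depths) in entries.items()}
```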
175 |
--------------------------------------------------------------------------------
/probetools_v_0_1_11.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 |
4 | import sys
5 | import subprocess
6 | import os
7 |
8 |
9 | def main():
10 | version = '0.1.11'
11 | # Parse command line arguments
12 | module, args = parse_args(sys.argv, version)
13 | # Set path to output directory and name to append to output files
14 | out_path, name = os.path.split(args['-o'])
15 | out_path = '.' if out_path == '' else out_path
16 | # Make sure output directory exists
17 |     if os.path.isdir(out_path) == False:
18 |         print(f'\nERROR: Output path {out_path} does not exist or is not a directory.\n')
19 | exit(1)
20 | # Run top-level function for selected module
21 | if module == 'clusterkmers':
22 | print(f'\nProbeTools ClusterKmers v{version}')
23 | print('https://github.com/KevinKuchinski/ProbeTools\n')
24 | cluster_kmers(out_path, name, args['-t'], args['-k'], args['-i'], args['-s'], args['-d'], args['-p'], args['-n'], args['-T'])
25 | elif module == 'capture':
26 | print(f'\nProbeTools Capture v{version}')
27 | print('https://github.com/KevinKuchinski/ProbeTools\n')
28 | capture(out_path, name, args['-t'], args['-p'], args['-i'], args['-l'], args['-T'])
29 | elif module == 'getlowcov':
30 | print(f'\nProbeTools GetLowCov v{version}')
31 | print('https://github.com/KevinKuchinski/ProbeTools\n')
32 | get_low_cov(out_path, name, args['-i'], args['-k'], args['-D'], args['-L'])
33 | elif module == 'stats':
34 | print(f'\nProbeTools Stats v{version}')
35 | print('https://github.com/KevinKuchinski/ProbeTools\n')
36 | stats(out_path, name, args['-i'])
37 | elif module == 'makeprobes':
38 | print(f'\nProbeTools MakeProbes v{version}')
39 | print('https://github.com/KevinKuchinski/ProbeTools\n')
40 | make_probes(out_path, name, args['-t'], args['-b'], args['-m'], args['-c'], args['-k'], args['-i'], args['-s'], args['-d'],
41 | args['-D'], args['-L'], args['-i'], args['-l'], args['-T'])
42 | elif module == 'merge':
43 | print(f'\nProbeTools Merge v{version}')
44 | print('https://github.com/KevinKuchinski/ProbeTools\n')
45 | merge(args['-i'], args['-I'], args['-o'])
46 | print('\nDone.\n')
47 | exit(0)
48 |
49 |
50 | ########## Command line interface functions ##########
51 | def parse_args(args, version):
52 | if len(args) < 2:
53 | print('\nERROR: A ProbeTools module must be selected.\n')
54 | print_usage(None, version)
55 | exit(1)
56 | else:
57 | module = args[1]
58 | arg_values = {}
59 | for arg_1, arg_2 in zip(args[1:-1], args[2:]):
60 | if arg_1[0] == '-':
61 | if arg_2[0] != '-':
62 | arg_values[arg_1] = arg_2
63 | else:
64 | arg_values[arg_1] = ''
65 | if args[-1][0] == '-':
66 | arg_values[args[-1]] = ''
67 | # Set defaults, mins, and maxs depending on selected module
68 | if module == 'clusterkmers':
69 | required_args = {'-t', '-o'}
70 | arg_value_types = {'-t': str, '-o': str, '-k': int, '-s': int, '-d': int, '-i': float, '-p': str, '-n': int, '-T': int}
71 | min_arg_values = {'-k': 32, '-s': 1, '-d': 0, '-i': 50, '-n': 1, '-T': 1}
72 | max_arg_values = {'-i': 100}
73 | default_arg_values = {'-k': 120, '-s': 1, '-d': 0, '-i': 90, '-p': None, '-n': 'MAX', '-T': 0}
74 | elif module == 'capture':
75 | required_args = {'-t', '-p', '-o'}
76 | arg_value_types = {'-t': str, '-p': str, '-o':str, '-i': float, '-l': int, '-T': int}
77 | min_arg_values = {'-i': 50, '-l': 1, '-T': 1}
78 | max_arg_values = {'-i': 100}
79 | default_arg_values = {'-i': 90, '-l': 60, '-T': 1}
80 | elif module == 'getlowcov':
81 | required_args = {'-i', '-o'}
82 | arg_value_types = {'-i': str, '-o': str, '-k': int, '-D': int, '-L': int}
83 | min_arg_values = {'-k': 32, '-D': 0, '-L': 1}
84 | max_arg_values = {}
85 | default_arg_values = {'-k': 120, '-D': 0, '-L': 40}
86 | elif module == 'stats':
87 | required_args = {'-i', '-o'}
88 | arg_value_types = {'-i': str, '-o': str}
89 | min_arg_values = {}
90 | max_arg_values = {}
91 | default_arg_values = {}
92 | elif module == 'makeprobes':
93 | required_args = {'-t', '-b', '-o'}
94 | arg_value_types = {'-t': str, '-b': int, '-o': str, '-m': int, '-c': float, '-k': int,
95 | '-s': int, '-d': int, '-D': int, '-L': int, '-i': float, '-l': int, '-T': int}
96 | min_arg_values = {'-m': 1, '-c': 0, '-k': 32, '-s': 1, '-d': 0, '-D': 0, '-L': 1, '-i': 50, '-l': 1, '-T': 1}
97 | max_arg_values = {'-c': 100, '-i': 100}
98 | default_arg_values = {'-m': 'MAX', '-c': 90, '-k': 120, '-s': 1, '-d': 0, '-D': 0, '-L': 40, '-i': 90, '-l': 60, '-T': 0}
99 | elif module == 'merge':
100 | required_args = {'-i', '-I', '-o'}
101 | arg_value_types = {'-i': str, '-I': str, '-o': str}
102 | min_arg_values = {}
103 | max_arg_values = {}
104 | default_arg_values = {}
105 | else:
106 | print('\nERROR: Module not recognized.')
107 | print_usage(None, version)
108 | exit(1)
109 | # Check if all required arguments were provided
110 | missing_args = set()
111 | for required_arg in required_args:
112 | if required_arg not in arg_values.keys() or arg_values[required_arg] == '':
113 | missing_args = missing_args | {required_arg}
114 | if missing_args != set():
115 |         print(f'\nERROR: Values must be provided for the following arguments: {", ".join(sorted(missing_args))}')
116 | print_usage(module, version)
117 | exit(1)
118 | # Check if unrecognized arguments were provided
119 | recognized_args = required_args | set(arg_value_types.keys())
120 | unrecognized_args = set()
121 | for provided_arg in arg_values.keys():
122 | if provided_arg not in recognized_args:
123 | unrecognized_args = unrecognized_args | {provided_arg}
124 | if unrecognized_args != set():
125 | print(f'\nERROR: The following arguments are not recognized: {", ".join(sorted(unrecognized_args))}')
126 | print_usage(module, version)
127 | exit(1)
128 | # Check if arguments were provided without values
129 | empty_args = set()
130 | for arg, value in arg_values.items():
131 | if value == '':
132 | empty_args = empty_args | {arg}
133 | if empty_args != set():
134 | print(f'\nERROR: The following arguments were provided without values: {", ".join(sorted(empty_args))}')
135 | print_usage(module, version)
136 | exit(1)
137 | # Check if provided values are of the correct type
138 | for arg, value in arg_values.items():
139 | try:
140 | arg_values[arg] = arg_value_types[arg](value)
141 | except ValueError:
142 | print(f'\nERROR: Value for argument {arg} must be of type {str(arg_value_types[arg].__name__)}')
143 | print_usage(module, version)
144 | exit(1)
145 | # Check if provided values are within the correct range
146 | for arg, value in arg_values.items():
147 | if arg in min_arg_values.keys() and value < min_arg_values[arg]:
148 | print(f'\nERROR: Value for argument {arg} must be at least {min_arg_values[arg]}')
149 | print_usage(module, version)
150 | exit(1)
151 | if arg in max_arg_values.keys() and value > max_arg_values[arg]:
152 | print(f'\nERROR: Value for argument {arg} must not exceed {max_arg_values[arg]}')
153 | print_usage(module, version)
154 | exit(1)
155 | # Assign default values to unspecified arguments
156 | for arg, value in default_arg_values.items():
157 | if arg not in arg_values.keys():
158 | arg_values[arg] = value
159 | # Return keyword args and their values
160 | return module, arg_values
161 |
162 |
163 | def print_usage(module, version):
164 | if module == None:
165 | print(f'\nProbeTools v{version}')
166 | print('https://github.com/KevinKuchinski/ProbeTools\n')
167 | print('Available modules:')
168 | print('makeprobes - probe panel design using a general purpose incremental strategy')
169 | print('clusterkmers - enumerate and cluster kmers from target sequences')
170 | print('capture - assess probe panel coverage of target sequences')
171 | print('getlowcov - extract low coverage sequences from target space')
172 | print('stats - calculate probe coverage and depth stats')
173 | print('merge - merge two sets of capture results\n')
174 | elif module == 'clusterkmers':
175 | print(f'\nProbeTools ClusterKmers v{version}')
176 | print('https://github.com/KevinKuchinski/ProbeTools\n')
177 |         print('Usage: probetools clusterkmers -t <target seqs FASTA> -o <output directory>/<output name> [<optional arguments>]\n')
178 | print('Required arguments:')
179 | print(' -t : path to target seqs in FASTA file')
180 | print(' -o : path to output directory and name to append to output files')
181 | print('Optional arguments:')
182 | print(' -k : length of kmers to enumerate (default=120, min=32)')
183 | print(' -s : number of bases separating each kmer (default=1, min=1)')
184 | print(' -d : number of degenerate bases to permit in probes (default=0, min=0)')
185 | print(' -i : nucleotide sequence identity (%) threshold used for kmer clustering (default=90, min=50, max=100)')
186 | print(' -p : path to FASTA file containing previously-generated probe sequences to filter from new probes')
187 | print(' -n : number of probe candidates to return (default=MAX, min=1)')
188 | print(' -T : number of threads used by VSEARCH for clustering kmers (default=MAX, min=1)\n')
189 | elif module == 'capture':
190 | print(f'\nProbeTools Capture v{version}')
191 | print('https://github.com/KevinKuchinski/ProbeTools\n')
192 |         print('Usage: probetools capture -t <target seqs FASTA> -p <probe seqs FASTA> -o <output directory>/<output name> [<optional arguments>]\n')
193 | print('Required arguments:')
194 | print(' -t : path to target seqs in FASTA file')
195 | print(' -p : path to probe sequences in FASTA file')
196 | print(' -o : path to output directory and name to append to output files')
197 | print('Optional arguments:')
198 | print(' -i : nucleotide sequence identity (%) threshold used for probe-target alignments (default=90, min=50, max=100)')
199 | print(' -l : minimum length for probe-target alignments (default=60, min=1)')
200 | print(' -T : number of threads used by BLASTn for aligning probes to targets (default=1, min=1)\n')
201 | elif module == 'getlowcov':
202 | print(f'\nProbeTools GetLowCov v{version}')
203 | print('https://github.com/KevinKuchinski/ProbeTools\n')
204 |         print('Usage: probetools getlowcov -i <capture results PT file> -o <output directory>/<output name> [<optional arguments>]')
205 | print('Required arguments:')
206 | print(' -i : path to capture results in PT file')
207 | print(' -o : path to output directory and name to append to output files')
208 | print('Optional arguments:')
209 | print(' -k : minimum sub-sequence length extracted, should be same as kmer length used for making probes (default=120, min=32)')
210 | print(' -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)')
211 | print(' -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)')
212 | elif module == 'stats':
213 | print(f'\nProbeTools Stats v{version}')
214 | print('https://github.com/KevinKuchinski/ProbeTools\n')
215 |         print('Usage: probetools stats -i <capture results PT file> -o <output directory>/<output name>')
216 | print('Required arguments:')
217 | print(' -i : path to capture results in PT file')
218 | print(' -o : path to output directory and name to append to output files\n')
219 | elif module == 'makeprobes':
220 | print(f'\nProbeTools MakeProbes v{version}')
221 | print('https://github.com/KevinKuchinski/ProbeTools\n')
222 |         print('Usage: probetools makeprobes -t <target seqs FASTA> -b <batch size> -o <output directory>/<output name> [<optional arguments>]')
223 | print('Required arguments:')
224 | print(' -t : path to target sequences in FASTA file')
225 | print(' -b : number of probes in each batch (min=1)')
226 | print(' -o : path to output directory and name to append to output files')
227 | print('Optional arguments:')
228 | print(' -m : max number of probes to add to panel (default=MAX, min=1)')
229 | print(' -c : target for 10th percentile of probe coverage (default=90, min=1, max=100)')
230 | print(' -k : length of probes to generate (default=120, min=32)')
231 | print(' -s : number of bases separating each kmer (default=1, min=1)')
232 | print(' -d : number of degenerate bases to permit in probes (default=0, min=0)')
233 |         print(' -i : nucleotide sequence identity (%) threshold used for kmer clustering and probe-target alignments (default=90, min=50, max=100)')
234 | print(' -l : minimum length for probe-target alignments (default=60, min=1)')
235 | print(' -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)')
236 | print(' -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)')
237 | print(' -T : number of threads used by VSEARCH and BLASTn for clustering kmers and aligning'
238 | f' probes to targets (default=MAX for VSEARCH, default=1 for BLASTn, min=1)\n')
239 | elif module == 'merge':
240 | print(f'\nProbeTools Merge v{version}')
241 | print('https://github.com/KevinKuchinski/ProbeTools\n')
242 |         print('Usage: probetools merge -i <capture results PT file> -I <other capture results PT file> -o <merged capture results PT file>')
243 | print('Required arguments:')
244 | print(' -i : path to capture results in PT file')
245 | print(' -I : path to other capture results in PT file')
246 | print(' -o : path to merged capture results PT output file\n')
247 |
248 |
249 | ########## Helper functions ##########
250 | def check_input(files):
251 | '''Checks list of input file paths to make sure they exist.'''
252 | for file in files:
253 | if os.path.exists(file) == False:
254 | print(f'\nERROR: Input file {file} does not exist.\n')
255 | exit(1)
256 |
257 |
258 | def load_fasta(fasta_path):
259 | '''Loads contents of a FASTA file into two lists (one for headers, and another for their
260 | corresponding seqs).'''
261 | # Check if targets FASTA is valid
262 | check_input([fasta_path])
263 | # Open targets FASTA and load headers and seqs into lists
264 | with open(fasta_path, 'r') as input_file:
265 | entry_counter = 0
266 | header = ''
267 | headers, seqs = [], []
268 | for line in input_file:
269 | if line[0] == '>':
270 | entry_counter += 1
271 | header = line.strip().lstrip('>')
272 | headers.append(header)
273 | seqs.append('')
274 | elif header != '':
275 | seqs[-1] += line.strip()
276 | for header, seq in zip(headers, seqs):
277 | if header == '':
278 |                 print(f'\nWARNING: Seq is present without header in {fasta_path}.')
279 |             if seq == '':
280 |                 print(f'\nWARNING: Header {header} in {fasta_path} has no accompanying seq.')
281 | return headers, seqs
282 |
283 |
284 | def append_fasta(fasta_path, new_fasta_path):
285 | '''Appends contents of new FASTA file to existing FASTA file then deletes new FASTA file.'''
286 | with open(new_fasta_path, 'r') as input_file, open(fasta_path, 'a') as output_file:
287 | for line in input_file:
288 | output_file.write(line)
289 | # GARBAGE COLLECTION - Delete new FASTA file after contents have been appended
290 | os.remove(new_fasta_path)
291 |
292 |
293 | ######### Top-level functions for modules ##########
294 | def cluster_kmers(out_path, name, targets_path, k, cluster_id, step, max_degen, prev_probes_path, num_probes, threads):
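    '''Top-level function for clusterkmers module: enumerates kmers from the target seqs, clusters them with VSEARCH, removes any previously designed probes, and writes ranked probe candidates to FASTA.'''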
295 | check_input([targets_path])
296 | kmers_path = os.path.join(out_path, name + '_kmers.fa')
297 | enum_kmers(targets_path, kmers_path, k, step, max_degen)
298 | centroids_path = os.path.join(out_path, name + '_centroids.fa')
299 | cluster_kmers_with_VSEARCH(kmers_path, centroids_path, cluster_id, threads)
300 | remove_prev_probes(centroids_path, prev_probes_path)
301 | probe_names, probe_seqs = rank_probe_candidates(centroids_path)
302 | probes_path = os.path.join(out_path, name + '_probes.fa')
303 | potential_probes, probes_writen = write_top_probes(probe_names, probe_seqs, num_probes, probes_path, name)
304 | return potential_probes, probes_writen
305 |
306 |
307 | def capture(out_path, name, targets_path, probes_path, min_id, min_length, threads):
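    '''Top-level function for capture module: aligns probes to targets with BLASTn and writes per-position probe depths to a .pt file.'''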
308 | check_input([targets_path, probes_path])
309 | blast_path = os.path.join(out_path, name + '_blast_results.tsv')
310 | align_probes_to_targets_with_BLAST(targets_path, probes_path, blast_path, threads)
311 | capture_data = create_empty_capture_data(targets_path)
312 | capture_data = add_BLAST_results_to_capture_data(blast_path, capture_data, min_id, min_length)
313 | capture_path = os.path.join(out_path, name + '_capture.pt')
314 | write_capture_data(capture_path, capture_data)
315 |
316 |
317 | def get_low_cov(out_path, name, capture_path, k, min_depth, min_length):
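    '''Top-level function for getlowcov module: extracts low coverage sub-sequences from capture results and writes them to FASTA.'''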
318 | check_input([capture_path])
319 | capture_data = load_capture_data(capture_path)
320 | low_cov_path = os.path.join(out_path, name + '_low_cov_seqs.fa')
321 | seqs_writen = write_low_cov_seqs(capture_data, low_cov_path, k, min_depth, min_length)
322 |
323 |
324 | def stats(out_path, name, capture_path):
325 | '''Top-level function for stats module.'''
326 | check_input([capture_path])
327 | capture_data = load_capture_data(capture_path)
328 | report_path = os.path.join(out_path, name + '_summary_stats_report.tsv')
329 | write_summary_report(capture_data, report_path, name)
330 | report_path = os.path.join(out_path, name + '_long_stats_report.tsv')
331 | write_long_report(capture_data, report_path, name)
332 |
333 |
334 | def make_probes(out_path, name, targets_path, batch_size, max_probes, cov_target, k, cluster_id, step, max_degen,
335 | min_depth, min_low_cov_length, min_id, min_capture_length, threads):
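    '''Top-level function for makeprobes module: incrementally designs batches of probes until the coverage target is reached, the panel reaches its maximum size, or no further probes can be designed.'''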
336 | check_input([targets_path])
337 | # Set variable for counting rounds of incremental probe design
338 | round_counter = 0
339 | # Initialize variables for panel size (# probes) and 10th percentile of panel coverage
340 | panel_size, panel_cov = 0, 0
341 | # If panel size is not specified, set max_panel_size as 1 probe larger than current panel size so loop doesn't break
342 | max_panel_size = panel_size + batch_size if max_probes == 'MAX' else max_probes
343 | # Create empty FASTA file for final probe panel
344 | final_probes_path = os.path.join(out_path, name + '_probes.fa')
345 | with open(final_probes_path, 'w') as output_file:
346 | pass
347 | # Create empty capture dict from target space
348 | capture_data = create_empty_capture_data(targets_path)
349 | print()
350 | # Enter main incremental design loop
351 | while panel_size < max_panel_size and panel_cov < cov_target:
352 | round_counter += 1
353 | print('*' * 20, f'ROUND {round_counter}', '*' * 20 + '\n')
354 | # Write low cov seqs to FASTA as target space for this round
355 | low_cov_path = os.path.join(out_path, name + '_low_cov_seqs.fa')
356 | seqs_writen = write_low_cov_seqs(capture_data, low_cov_path, k, min_depth, min_low_cov_length)
357 | print()
358 |         # Break design loop if no low cov seqs were written
359 | if seqs_writen == 0:
360 | break
361 | # Make probes from target space
362 | num_probes = min(batch_size, max_panel_size - panel_size)
363 | round_name = name + f'_round_{round_counter}'
364 | potential_probes, probes_writen = cluster_kmers(out_path, round_name, low_cov_path, k, cluster_id, step, max_degen, final_probes_path, num_probes, threads)
365 | print()
366 | # GARBAGE COLLECTION - Delete low cov seqs
367 | os.remove(low_cov_path)
368 | # Update panel size and max panel size
369 | panel_size += probes_writen
370 | max_panel_size = panel_size + batch_size if max_probes == 'MAX' else max_probes
371 |         # Break design loop if no potential probes or no probes were written
372 | if potential_probes == 0 or probes_writen == 0:
373 | break
374 | # Capture probes against original targets
375 | blast_path = os.path.join(out_path, name + '_blast_results.tsv')
376 | new_probes_path = os.path.join(out_path, round_name + '_probes.fa')
377 | align_probes_to_targets_with_BLAST(targets_path, new_probes_path, blast_path, threads)
378 | capture_data = add_BLAST_results_to_capture_data(blast_path, capture_data, min_id, min_capture_length)
379 | print()
380 | # Append new probes to final panel
381 | append_fasta(final_probes_path, new_probes_path)
382 | # Check if panel cov exceeds cov target
383 | panel_cov = calc_cov_percentiles(capture_data, percentiles=(0.1,))[0]
384 | print(f'10th percentile of target coverage: {panel_cov}%\n')
385 | print('Incremental probe design finished.')
386 | if panel_size >= max_panel_size:
387 | print(' Maximum panel size reached.')
388 | if panel_cov >= cov_target:
389 | print(' Coverage target for 10th percentile of targets reached.')
390 | if seqs_writen == 0 or potential_probes == 0 or probes_writen == 0:
391 | print(' No additional probes could be designed.')
392 | capture_path = os.path.join(out_path, name + '_capture.pt')
393 | print(f'\nWriting capture results to {capture_path}...')
394 | write_capture_data(capture_path, capture_data)
395 | report_path = os.path.join(out_path, name + '_summary_stats_report.tsv')
396 | write_summary_report(capture_data, report_path, name)
397 | report_path = os.path.join(out_path, name + '_long_stats_report.tsv')
398 | write_long_report(capture_data, report_path, name)
399 |
400 |
401 | def merge(capture_path, other_capture_path, merged_capture_path):
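    '''Top-level function for merge module: combines two sets of capture results into one .pt file.'''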
402 | capture_data = load_capture_data(capture_path)
403 | other_capture_data = load_capture_data(other_capture_path)
404 | merged_capture_data = merge_capture_results(capture_data, other_capture_data)
405 | write_capture_data(merged_capture_path, merged_capture_data)
406 |
407 |
408 | ########## clusterkmers functions ##########
409 | def enum_kmers(targets_path, kmers_path, k, step, max_degen):
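    '''Enumerates kmers of length k (advancing step bases at a time) from the target seqs and writes them to a FASTA file, skipping kmers with more than max_degen degenerate bases.'''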
410 | print(f'Enumerating all {k}-mers in {targets_path}...')
411 | # Load contents of targets FASTA
412 | headers, seqs = load_fasta(targets_path)
413 | print(f' Loaded {"{:,}".format(len(seqs))} target seqs in {targets_path}...')
414 | # Create FASTA file for kmers output and enumerate kmers from target seqs
415 | with open(kmers_path, 'w') as output_file:
416 | target_counter = 0
417 | kmer_counter = 0
418 | for header, seq in zip(headers, seqs):
419 | if seq == '':
420 | print(f' WARNING --- Target {header} has no sequence!')
421 | elif len(seq) < k:
422 | print(f' WARNING --- Target {header} is shorter than the desired probe length!')
423 | elif len(seq) >= k and seq != '':
424 | target_counter += 1
425 | for i in range(0, len(seq) - k + 1, step):
426 | kmer = seq[i:i+k].upper()
427 | num_degen = len(kmer) - sum(kmer.count(base) for base in 'ATGC')
428 | if num_degen <= max_degen:
429 | kmer_counter += 1
430 | output_file.write(f'>kmer_{kmer_counter}\n')
431 | output_file.write(kmer + '\n')
432 | print(f' Enumerated {"{:,}".format(kmer_counter)} k-mers from {targets_path}.')
433 |
434 |
435 | def cluster_kmers_with_VSEARCH(kmers_path, centroids_path, cluster_id, threads):
436 | print(f' Clustering k-mers at {cluster_id}% identity...')
437 | # Create and run terminal command for VSEARCH
438 | terminal_command = (f'vsearch --cluster_fast {kmers_path} --id {cluster_id / 100} --centroids {centroids_path}'
439 | f' --fasta_width 0 --sizeout --qmask none --threads {threads}')
440 | finished_process = subprocess.call(terminal_command, stdout=subprocess.DEVNULL,
441 | stderr=subprocess.DEVNULL, shell=True)
442 | if finished_process != 0:
443 | print(f'\nERROR: vsearch cluster_fast terminated with errors while clustering k-mers'
444 | f' (Error code: {finished_process}).\n')
445 | exit(1)
446 | # GARBAGE COLLECTION - Delete kmers FASTA file
447 | os.remove(kmers_path)
448 | print(f' K-mer clustering finished.')
449 |
450 |
451 | def remove_prev_probes(centroids_path, prev_probes_path):
452 | if prev_probes_path != None:
453 | print(f'Removing any previously designed probes...')
454 | prev_probe_headers, prev_probe_seqs = load_fasta(prev_probes_path)
455 | print(f' Loaded {"{:,}".format(len(prev_probe_seqs))} previously designed probes in {prev_probes_path}')
456 | centroid_headers, centroid_seqs = load_fasta(centroids_path)
457 | with open(centroids_path, 'w') as output_file:
458 | for header, seq in zip(centroid_headers, centroid_seqs):
459 | if seq not in prev_probe_seqs:
460 | output_file.write('>' + header + '\n')
461 | output_file.write(seq + '\n')
462 |
463 |
464 | def rank_probe_candidates(centroids_path):
465 | print(f'Ranking probe candidates...')
466 | # Load centroids FASTA file
467 | headers, seqs = load_fasta(centroids_path)
468 | print(f' Loaded {"{:,}".format(len(seqs))} probe candidates.')
469 | # Create dicts for looking up each candidate probe seq's cluster size and probe name
470 | cluster_sizes = {seq: int(header.split('size=')[1].split(';')[0]) for header, seq in zip(headers, seqs)}
471 | probe_names = {seq: header.split(';')[0] for header, seq in zip(headers, seqs)}
472 | # Rank candidate probe seqs based on cluster size
473 | probe_seqs = sorted(seqs, key=lambda seq: cluster_sizes[seq], reverse=True)
474 |     # GARBAGE COLLECTION - Delete centroids FASTA file
475 | os.remove(centroids_path)
476 | return probe_names, probe_seqs
477 |
478 |
479 | def write_top_probes(probe_names, probe_seqs, num_probes, probes_path, name):
480 | print(f'Writing top probe candidates to {probes_path}...')
481 | potential_probes = len(probe_seqs)
482 | # Check if there are any probes to write
483 | if potential_probes == 0:
484 | print(' No probe candidates to write to FASTA.')
485 | return 0, 0
486 | # Determine how many probes to write to FASTA file
487 | if num_probes == 'MAX':
488 | max_probes = len(probe_seqs)
489 | else:
490 | max_probes = num_probes
491 | # Write probes to FASTA file
492 | probe_counter = 0
493 | with open(probes_path, 'w') as output_file:
494 | while probe_counter < len(probe_seqs) and probe_counter < max_probes:
495 | probe_seq = probe_seqs[probe_counter]
496 | probe_counter += 1
497 | output_file.write(f'>{name}_probe_{probe_counter}\n')
498 | output_file.write(probe_seq + '\n')
499 | print(f' Wrote {"{:,}".format(probe_counter)} probes to {probes_path}.')
500 | return potential_probes, probe_counter
501 |
502 |
503 | ########## capture functions ##########
504 | def align_probes_to_targets_with_BLAST(targets_path, probes_path, blast_path, threads):
505 | print(f'Aligning probes in {probes_path} to targets in {targets_path}...')
506 | # Check that BLAST db files exist for target seqs
507 | if any([os.path.exists(targets_path + '.' + suffix) == False for suffix in ['nhr', 'nin' , 'nsq']]):
508 | print(' WARNING: blastn db files do not exist for target seqs. Creating blastn db...')
509 | # Create terminal command for making BLASTn db
510 | terminal_command = f'makeblastdb -in {targets_path} -dbtype nucl'
511 | # Run terminal command and redirect stdout and stderr to DEVNULL
512 | finished_process = subprocess.run(terminal_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, shell=True)
513 | # Exit with errors if BLASTn terminates with errors
514 | if finished_process.returncode != 0:
515 | print(f'\nERROR: makeblastdb terminated with errors while making db for target seqs (Error code: {finished_process.returncode}).\n')
516 | exit(1)
517 | # Count target seqs
518 | headers, seqs = load_fasta(targets_path)
519 | num_targets = len(seqs)
520 | # Create and run terminal command for BLASTn
521 | threads = 1 if threads == 0 else threads
522 | terminal_command = (f"blastn -db {targets_path} -query {probes_path} -max_target_seqs {num_targets}"
523 | f" -num_threads {threads} -outfmt 6 > {blast_path}")
524 | finished_process = subprocess.run(terminal_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, shell=True)
525 | if finished_process.returncode != 0:
526 | print(f'\nERROR: blastn terminated with errors while aligning probes to target seqs (Error code: {finished_process.returncode}).\n')
527 | exit(1)
528 |
529 |
530 | def create_empty_capture_data(targets_path):
531 | print(f'Extracting target names and seqs from {targets_path}...')
532 | with open(targets_path, 'r') as input_file:
533 | headers, seqs = [], []
534 | for line in input_file:
535 | if line[0] == '>':
536 | header = line.strip().lstrip('>')
537 | headers.append(header)
538 | seqs.append('')
539 | else:
540 | seqs[-1] += line.strip()
541 | for header in headers:
542 | if headers.count(header) > 1:
543 | print(f'\nERROR: Header {header} appears more than once in {targets_path}.\n')
544 | exit(1)
545 | if ' ' in header:
546 | print(f'\nERROR: Header {header} contains spaces.\n')
547 | exit(1)
548 | capture_data = {header: (seq, [0] * len(seq)) for header, seq in zip(headers, seqs)}
549 | print(f' Total targets loaded: {"{:,}".format(len(capture_data))}')
550 | return capture_data
551 |
552 |
553 | def load_capture_data(capture_path):
554 | print(f'Loading capture data from {capture_path}...')
555 | with open(capture_path, 'r') as input_file:
556 | headers, seqs, depths = [], [], []
557 | for line in input_file:
558 | if line[0] == '>':
559 | header = line.strip().lstrip('>')
560 | headers.append(header)
561 | seqs.append('')
562 | depths.append('')
563 | elif line[0] == '$':
564 | seqs[-1] += line.strip()
565 | elif line[0] == '#':
566 | depths[-1] += line.strip()
567 | seqs = [seq.lstrip('$') for seq in seqs]
568 | depths = [[int(d) for d in depth.lstrip('#').split(',')] for depth in depths]
569 | if len(set(len(c) for c in [headers, seqs, depths])) != 1:
570 | print(f'\nERROR: The number of headers, seqs, and probe depth lists do not match in {capture_path}.')
571 | exit(1)
572 | for header in headers:
573 | if headers.count(header) > 1:
574 | print(f'\nERROR: Header {header} appears more than once in {capture_path}.\n')
575 | exit(1)
576 | capture_data = {header: (seq, depth) for header, seq, depth in zip(headers, seqs, depths)}
577 | for header in capture_data.keys():
578 | if len(capture_data[header][0]) != len(capture_data[header][1]):
579 | print(f'\nERROR: Seq length and probe depth list length do not match for entry {header}.')
580 | exit(1)
581 | print(f' Total targets loaded: {"{:,}".format(len(capture_data))}')
582 | return capture_data
583 |
584 |
585 | def merge_capture_results(capture_data, other_capture_data):
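    '''Performs an outer merge of two capture dicts: probe depth lists are summed position-by-position for entries shared by both, and unshared entries are copied unmodified.'''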
586 | headers, other_headers = set(capture_data.keys()), set(other_capture_data.keys())
587 | shared_headers = headers & other_headers
588 | merged_capture_data = {}
589 | for header in shared_headers:
590 | if capture_data[header][0] != other_capture_data[header][0]:
591 | print(f'\nERROR: Header {header} appears in both capture results but they do not have the same seq.')
592 | exit(1)
593 | if len(capture_data[header][1]) != len(other_capture_data[header][1]):
594 | print(f'\nERROR: Header {header} appears in both capture results but their probe depth lists are different lengths.')
595 | exit(1)
596 | merged_capture_data[header] = (capture_data[header][0], [a + b for a, b in zip(capture_data[header][1], other_capture_data[header][1])])
597 | for header in headers - shared_headers:
598 | merged_capture_data[header] = capture_data[header]
599 | for header in other_headers - shared_headers:
600 | merged_capture_data[header] = other_capture_data[header]
601 | return merged_capture_data
602 |
603 |
604 | def add_BLAST_results_to_capture_data(blast_path, capture_data, min_id, min_length):
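    '''Parses BLASTn outfmt 6 results and increments the probe depth count at each aligned target position for alignments passing the identity and length thresholds.'''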
605 | print(f'Extracting probe coverage and probe depth from alignment of probes against targets...')
606 | with open(blast_path, 'r') as input_file:
607 | for line in input_file:
608 | line = line.split('\t')
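            # BLAST outfmt 6 columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore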
609 | if float(line[2]) >= min_id and float(line[3]) >= min_length:
610 | target = line[1].strip()
611 | aln_start = int(line[8].strip()) - 1
612 | aln_end = int(line[9].strip()) - 1
613 | aln_start, aln_end = sorted([aln_start, aln_end])
614 | for position in range(aln_start, aln_end + 1):
615 | capture_data[target][1][position] += 1
616 | # GARBAGE COLLECTION - Delete BLAST results
617 | os.remove(blast_path)
618 | return capture_data
619 |
620 |
621 | def write_capture_data(capture_path, capture_data):
622 | print(f'Writing capture results to {capture_path}...')
623 | with open(capture_path, 'w') as output_file:
624 | for header, (seq, depth) in capture_data.items():
625 | output_file.write('>' + header + '\n')
626 | output_file.write('$' + seq + '\n')
627 | output_file.write('#' + ','.join([str(d) for d in depth]) + '\n')
628 | print(f' Wrote capture results for {"{:,}".format(len(capture_data))} targets.')
629 |
630 |
631 | ########## getlowcov functions ##########
632 | def write_low_cov_seqs(capture_data, low_cov_path, k, min_depth, min_length):
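    '''Extracts runs of positions with probe depth at or below min_depth that are longer than min_length and writes them to FASTA; runs shorter than k are extended to length k where possible.'''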
633 | print(f'Extracting low coverage areas...')
634 | # Write low cov seqs to FASTA file
635 | with open(low_cov_path, 'w') as output_file:
636 | total_low_cov_counter = 0
637 | for header, (seq, depth) in capture_data.items():
638 | if len(seq) < k:
639 | print(f' WARNING: Seq {header} is shorter than the window length so low coverage seqs cannot be extracted.')
640 | else:
641 | low_cov_counter = 0
642 | low_cov_start = 0
643 | while low_cov_start < len(seq):
644 | while low_cov_start < len(seq) and depth[low_cov_start] > min_depth:
645 | low_cov_start += 1
646 | low_cov_end = low_cov_start
647 | while low_cov_end < len(seq) and depth[low_cov_end] <= min_depth:
648 | low_cov_end += 1
649 | low_cov_seq = seq[low_cov_start:low_cov_end]
650 | if len(low_cov_seq) > min_length and len(low_cov_seq) >= k:
651 | total_low_cov_counter += 1
652 | low_cov_counter += 1
653 | output_file.write(f'>{header}_position_{low_cov_start + 1}_to_{low_cov_end}\n')
654 | output_file.write(low_cov_seq + '\n')
655 | elif len(low_cov_seq) > min_length and len(low_cov_seq) < k:
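                        # Region qualifies but is shorter than probe length k: extend it symmetrically to length k, clamping at the ends of the target seq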
656 | window_start = low_cov_start - int((k - len(low_cov_seq)) / 2)
657 | window_end = window_start + k
658 | if window_start < 0:
659 | window_end = window_end - window_start
660 | window_start = 0
661 | if window_end > len(seq):
662 | window_start = window_start - window_end + len(seq)
663 | window_end = len(seq)
664 | extended_low_cov_seq = seq[window_start:window_end]
665 | total_low_cov_counter += 1
666 | low_cov_counter += 1
667 | output_file.write(f'>{header}_position_{window_start + 1}_to_{window_end}\n')
668 | output_file.write(extended_low_cov_seq.upper() + '\n')
669 | low_cov_start = low_cov_end
670 | print(f' Wrote {total_low_cov_counter} low coverage seqs from {"{:,}".format(len(capture_data))} targets.')
671 | seqs_writen = total_low_cov_counter
672 | return seqs_writen
673 |
674 |
675 | ########## stats functions ##########
676 | def calc_cov_at_depth(seq, depth, min_depth):
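    '''Returns the percentage of positions in seq covered by at least min_depth probes; uncovered positions with non-ATGC bases are excluded from the denominator.'''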
677 | covered_bases = 0
678 | total_bases = 0
679 | for base, base_depth in zip(seq, depth):
680 | if base_depth >= min_depth:
681 | covered_bases += 1
682 | total_bases += 1
683 | elif base.upper() in 'ATGC':
684 | total_bases += 1
685 | if total_bases == 0:
686 | coverage = 0
687 | else:
688 | coverage = covered_bases * 100 / total_bases
689 | return coverage
690 |
691 |
692 | def calc_cov_percentiles(capture_data, percentiles=(0, 0.05, 0.1, 0.25, 0.5, 0.75, 1)):
693 | """Takes a capture dict and tuple of perentiles, and returns a list of those percentile
694 | values."""
695 | cov_values = []
696 | for header, (seq, depth) in capture_data.items():
697 | cov_values.append(calc_cov_at_depth(seq, depth, 1))
698 | cov_values = sorted(cov_values)
699 | percentile_values = []
700 | for percentile in percentiles:
701 | values_under_percentile = (len(cov_values) - 1) * percentile
702 | if values_under_percentile == int(values_under_percentile):
703 | percentile_values.append(cov_values[int(values_under_percentile)])
704 | else:
705 | percentile_value = cov_values[int(values_under_percentile)]
706 | percentile_value += cov_values[int(values_under_percentile) + 1]
707 | percentile_value = percentile_value / 2
708 | percentile_values.append(percentile_value)
709 | return [round(percentile_value,2) for percentile_value in percentile_values]
710 |
711 |
712 | def calc_total_probe_depth(capture_data):
713 | """Takes a capture dict and returns a tuple containing the percentage of nucleotide positions
714 | in the target space covered by 0, 1, 2, 3, 4, and 5+ probes."""
715 | total = 0
716 | total_0 = 0
717 | total_1 = 0
718 | total_2 = 0
719 | total_3 = 0
720 | total_4 = 0
721 | total_5 = 0
722 | for header, (seq, depth) in capture_data.items():
723 | total += len(depth)
724 | total_0 += depth.count(0)
725 | total_1 += depth.count(1)
726 | total_2 += depth.count(2)
727 | total_3 += depth.count(3)
728 | total_4 += depth.count(4)
729 | total_5 += len([d for d in depth if d >= 5])
730 | total_0 = round(total_0 * 100 / total, 2)
731 | total_1 = round(total_1 * 100 / total, 2)
732 | total_2 = round(total_2 * 100 / total, 2)
733 | total_3 = round(total_3 * 100 / total, 2)
734 | total_4 = round(total_4 * 100 / total, 2)
735 | total_5 = round(total_5 * 100 / total, 2)
736 | return (total_0, total_1, total_2, total_3, total_4, total_5)
737 |
738 |
739 | def write_summary_report(capture_data, report_path, name):
740 | """Takes a capture dict and generates the summary report."""
741 | print(f'Writing probe coverage and probe depth summary stats to {report_path}...')
742 | with open(report_path, 'w') as output_file:
743 | header = ['name', 'total_targets']
744 | header += ['cov_' + s for s in ['min', '5%tile', '10%tile', 'Q1', 'med', 'Q3', 'max']]
745 | header += ['depth_' + s for s in ['0', '1', '2', '3', '4', '5+']]
746 | output_file.write('\t'.join(header) + '\n')
747 | line = [name, str(len(capture_data))]
748 | line += [str(f) for f in calc_cov_percentiles(capture_data)]
749 | line += [str(f) for f in calc_total_probe_depth(capture_data)]
750 | output_file.write('\t'.join(line) + '\n')
751 |
752 |
753 | def write_long_report(capture_data, report_path, name):
754 | """Takes a capture dict and generates the long-form report."""
755 | print(f'Writing probe coverage and probe depth stats for each target to {report_path}...')
756 | with open(report_path, 'w') as output_file:
757 | header = ['name', 'target', 'length', 'ATGCs', '%_ATGCs']
758 | header += ['bases_' + str(d) + 'X' for d in ['0', '1', '2', '3', '4', '5+']]
759 | header += ['%_cov_' + str(d) + 'X' for d in ['1', '2', '3', '4', '5']]
760 | output_file.write('\t'.join(header) + '\n')
761 | for header, (seq, depth) in capture_data.items():
762 | total_ATGC = len([s for s in seq if s.upper() in 'ATGC'])
763 | length = len(depth)
764 | line = [name, header, str(length), str(total_ATGC), str(round(total_ATGC * 100 / length, 2))]
765 | line += [str(depth.count(d)) for d in [0, 1, 2, 3, 4]]
766 | line += [str(len([d for d in depth if d >= 5]))]
767 | line += [str(round(calc_cov_at_depth(seq, depth, min_depth), 1)) for min_depth in [1, 2, 3, 4, 5]]
768 | output_file.write('\t'.join(line) + '\n')
769 |
770 |
771 | ######### call main function ##########
772 | if __name__ == '__main__':
773 | main()
774 |
775 |
--------------------------------------------------------------------------------