├── LICENSE ├── README.md └── probetools_v_0_1_11.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Kevin Kuchinski 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ProbeTools 2 | ProbeTools is a collection of general-purpose modules for designing hybridization probe panels targeting diverse and hypervariable viral taxa. The objective of ProbeTools is to generate the smallest possible panel of oligo sequences that maximizes coverage of provided target sequences. It is based on k-mer clustering. In brief, probe-length k-mers are enumerated from the target space, usually spaced one nucleotide apart so that all possible k-mers are enumerated. 
The k-mers are then clustered based on their nucleotide sequence identity to collapse redundant probes enumerated from conserved genomic loci. Cluster centroids become probe candidates, which are ranked based on the size of the cluster they represent; centroids representing larger clusters are assumed to make better probes by virtue of having similarity to more sub-sequences in the target space. 3 | 4 | ProbeTools can further optimize probe panel designs by using an incremental strategy. In this strategy, probes are added to the panel in batches. Between the addition of each batch, ProbeTools determines what regions of the target space have achieved coverage and removes them from the target space before designing the next batch. This improves coverage of less-common sub-sequences in the target space and reduces the generation of redundant probes. 5 | 6 | Additional details and discussion about ProbeTools, along with in silico and in vitro validation results, can be found in: 7 | 8 | Kuchinski KS et al. ProbeTools: designing hybridization probes for targeted genomic sequencing of diverse and hypervariable viral taxa. BMC Genomics. 2022 Aug 12;23(1):579. doi: 10.1186/s12864-022-08790-4. PMID: 35953803; PMCID: PMC9371634. 9 | 10 | # Setup 11 | ProbeTools requires VSEARCH and BLASTn. The ProbeTools package can be installed with these dependencies via Anaconda/Miniconda. It can also be installed separately from its dependencies via the Python Package Index (PyPI). 12 | ## Anaconda/Miniconda 13 | 1. Create a conda environment for ProbeTools (replace env_name with a name of your choice for the ProbeTools environment): 14 | ``` 15 | conda create -n env_name -c kevinkuchinski probetools 16 | ``` 17 | ## PyPI 18 | 19 | 1. Install Python (version 3.7 or greater) from https://www.python.org/ 20 | 2. Install the ProbeTools package: 21 | ``` 22 | pip install probetools 23 | ``` 24 | 3. Install VSEARCH (version 2.15.2 recommended) from https://github.com/torognes/vsearch 25 | 4. 
Install BLAST (version 2.12.0 recommended) from https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download 26 | 27 | 28 | # Quick-start to probe design 29 | ProbeTools provides the makeprobes module as a user-friendly, general-purpose implementation of the incremental k-mer clustering strategy. Simply indicate a FASTA file containing target sequences (-t), the number of probes to add each batch (-b), and an output path and design name to append to output files (-o): 30 | ``` 31 | probetools makeprobes -t target_space_FASTA.fa -b 100 -o demo_probes_dir/demo_probes 32 | ``` 33 | makeprobes will add batches of probes to the panel until one of three end points is reached: 34 | 1. The panel achieves a target coverage goal (default: 90% of target sequences have at least 90% of their nucleotide positions covered) 35 | 2. The panel reaches a specific size (default: MAX, i.e. the panel continues to grow until one of the other end points is reached) 36 | 3. No further probe sequences can be designed 37 | 38 | The desired coverage goal and the maximum panel size can be set, along with numerous other parameters (see usage guide below). In general, smaller batch sizes will provide more compact panels but take more rounds of design and, thus, longer to compute. 39 | 40 | # ProbeTools modules 41 | ProbeTools consists of 6 modules: 42 | 1. makeprobes - a user-friendly, general-purpose implementation of the incremental k-mer clustering strategy 43 | 2. clusterkmers - single-batch probe generation using the k-mer clustering algorithm 44 | 3. capture - in silico assessment of how well provided probe sequences cover provided target sequences 45 | 4. getlowcov - uses output of capture to extract low-coverage regions from provided target sequences 46 | 5. stats - uses output of capture to calculate coverage statistics overall and for each provided target sequence 47 | 6. 
merge - merges output files generated by capture module 48 | 49 | # Usage guide for ProbeTools modules 50 | ## makeprobes 51 | A general-purpose implementation of the incremental k-mer clustering strategy. Probes are added to the panel in batches. Between the addition of each batch, ProbeTools determines what regions of the target space have achieved coverage and removes them from the target space before designing the next batch. Probe sequences are provided in the output_name_probes.fa file with probe sequences ranked in descending order of cluster size. NOTE: for best results, all target sequences should be provided on the same strand/in the same sense. 52 | 53 | Usage example: 54 | ``` 55 | $ probetools makeprobes -t -b -o / [] 56 | ``` 57 | Required arguments: 58 | 59 | -t : path to target sequences in FASTA file 60 | -b : number of probes in each batch (min=1) 61 | -o : path to output directory and design name to append to output files 62 | 63 | Optional arguments: 64 | 65 | -m : max number of probes to add to panel (default=MAX, min=1) 66 | -c : target for 10th percentile of probe coverage (default=90, min=1, max=100) 67 | -k : length of probes to generate (default=120, min=32) 68 | -s : number of bases separating each kmer (default=1, min=1) 69 | -d : number of degenerate bases to permit in probes (default=0, min=0) 70 | -i : nucleotide sequence identity (%) threshold used for kmer clustering and probe-target alignments (default=90, min=50, max=100) 71 | -l : minimum length for probe-target alignments (default=60, min=1) 72 | -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0) 73 | -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1) 74 | -T : number of threads used by VSEARCH and BLASTn for clustering kmers and aligning probes to targets (default=MAX for VSEARCH, default=1 for BLASTn, min=1) 75 | 76 | ## clusterkmers 77 | Enumerate and 
cluster kmers from target sequences. Extract cluster centroids as probe candidates ranked by cluster size. Probe sequences are provided in the output_name_probes.fa file with probe sequences ranked in descending order of cluster size. NOTE: for best results, all target sequences should be provided on the same strand/in the same sense. 78 | 79 | Usage example: 80 | ``` 81 | $ probetools clusterkmers -t -o / [] 82 | ``` 83 | Required arguments: 84 | 85 | -t : path to target sequences in FASTA file 86 | -o : path to output directory and design name to append to output files 87 | 88 | Optional arguments: 89 | 90 | -k : length of kmers to enumerate (default=120, min=32) 91 | -s : number of bases separating each kmer (default=1, min=1) 92 | -d : number of degenerate bases to permit in probes (default=0, min=0) 93 | -i : nucleotide sequence identity (%) threshold used for kmer clustering (default=90, min=50, max=100) 94 | -p : path to FASTA file containing previously-generated probe sequences to remove from new probes 95 | -n : number of probe candidates to return (default=MAX, min=1) 96 | -T : number of threads used by VSEARCH for clustering kmers (default=MAX, min=1) 97 | 98 | ## capture 99 | Assess probe panel coverage of target sequences. BLASTn is used to align each provided probe sequence against each provided target sequence. BLASTn output is parsed to determine how many probes cover each nucleotide position in target sequences. Results are output to the output_name_capture.pt file (see .pt format specifications below). 
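
As a rough illustration of the depth-counting step described above (a minimal sketch, not the capture module's actual implementation), per-position probe depth can be accumulated from alignment coordinates. The function name `depth_from_alignments` and the tuple layout `(pident, length, sstart, send)` are assumptions made for this example; the default thresholds mirror the documented -i and -l defaults.

```python
def depth_from_alignments(target_len, alignments, min_id=90.0, min_length=60):
    """Count how many alignments cover each position of one target.

    alignments: iterable of (pident, length, sstart, send) tuples, where
    sstart/send are 1-based inclusive target coordinates (hypothetical layout).
    """
    depths = [0] * target_len
    for pident, length, sstart, send in alignments:
        # Skip alignments below the identity or length thresholds
        if pident < min_id or length < min_length:
            continue
        # Coordinates may be reversed for hits on the opposite strand
        lo, hi = sorted((sstart, send))
        for pos in range(lo - 1, hi):
            depths[pos] += 1
    return depths


# Toy data: 20-nt target with short alignments, so min_length is lowered
hits = [(100.0, 10, 1, 10), (95.0, 15, 6, 20), (92.0, 5, 20, 16), (80.0, 20, 1, 20)]
depths = depth_from_alignments(20, hits, min_length=5)
print(','.join(str(d) for d in depths))
```

The last toy alignment falls below the identity threshold, so it does not contribute to any position's depth.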
100 | 101 | Usage example: 102 | ``` 103 | $ probetools capture -t -p -o / [] 104 | ``` 105 | Required arguments: 106 | 107 | -t : path to target sequences in FASTA file 108 | -p : path to probe sequences in FASTA file 109 | -o : path to output directory and design name to append to output files 110 | 111 | Optional arguments: 112 | 113 | -i : nucleotide sequence identity (%) threshold used for probe-target alignments (default=90, min=50, max=100) 114 | -l : minimum length for probe-target alignments (default=60, min=1) 115 | -T : number of threads used by BLASTn for aligning probes to targets (default=1, min=1) 116 | 117 | ## getlowcov 118 | Extract poorly covered sub-sequences from target sequences based on a specific set of capture results. Low-coverage sub-sequences are written to the output_name_low_cov_seqs.fa file. 119 | 120 | Usage example: 121 | ``` 122 | $ probetools getlowcov -i -o / [] 123 | ``` 124 | Required arguments: 125 | 126 | -i : path to capture results in PT file 127 | -o : path to output directory and design name to append to output files 128 | 129 | Optional arguments: 130 | 131 | -k : minimum sub-sequence length extracted, should be same as kmer length used for making probes (default=120, min=32) 132 | -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0) 133 | -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1) 134 | 135 | ## stats 136 | Calculate and tabulate probe coverage statistics for target sequences. Overall target space statistics are provided in output_name_summary_stats_report.tsv and statistics for each target sequence are provided in output_name_long_stats_report.tsv. 
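
To make the idea of per-target coverage statistics concrete, here is a hedged sketch that derives two simple figures from a list of probe depths and then checks how many targets meet a coverage goal. The function names, the returned fields, and the choice of "depth of at least 1" as the definition of a covered position are assumptions for illustration; the actual report columns are not reproduced here.

```python
def coverage_stats(depths, min_depth=1):
    """Return percent of positions covered and mean depth for one target."""
    if not depths:
        return {'length': 0, 'pct_covered': 0.0, 'mean_depth': 0.0}
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        'length': len(depths),
        'pct_covered': 100 * covered / len(depths),
        'mean_depth': sum(depths) / len(depths),
    }


def fraction_meeting_goal(depths_by_target, goal_pct=90.0):
    """Fraction of targets whose percent coverage is at least goal_pct."""
    if not depths_by_target:
        return 0.0
    met = sum(1 for depths in depths_by_target.values()
              if coverage_stats(depths)['pct_covered'] >= goal_pct)
    return met / len(depths_by_target)


# Hypothetical targets with toy depth lists
toy = {'a': [1] * 10, 'b': [1] * 9 + [0], 'c': [0] * 10}
for header, depths in toy.items():
    print(header, coverage_stats(depths))
print(fraction_meeting_goal(toy))
```

This mirrors the default makeprobes goal in spirit (most targets having most of their positions covered), but it is only one plausible way to summarize the depth lists.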
137 | 138 | Usage example: 139 | ``` 140 | $ probetools stats -i -o / 141 | ``` 142 | Required arguments: 143 | 144 | -i : path to capture results in PT file 145 | -o : path to output directory and design name to append to output files 146 | 147 | ## merge 148 | Combine results from two output files from the capture module. This module conducts an outer merge: if entries with the same header (and matching nucleotide sequences) appear in both files, their probe depth lists are summed together position-by-position. Entries appearing in only one or the other file are copied to the new file unmodified. 149 | 150 | Usage example: 151 | ``` 152 | $ probetools merge -i -I -o 153 | ``` 154 | Required arguments: 155 | 156 | -i : path to capture results in PT file 157 | -I : path to other capture results in PT file 158 | -o : path to merged capture results PT file 159 | 160 | # .pt Format Specifications 161 | The .pt format is used for output from the capture module and input for the stats, getlowcov, and merge modules. The .pt format is largely derived from the FASTA format. Each entry spans three lines, and each line starts with its own identifying character: 162 | 163 | Entry header (>): A text header to describe the sequence. Do not use spaces in the header. 164 | 165 | Entry sequence ($): The nucleotide sequence of the entry. 166 | 167 | Entry probe depths (#): A comma-separated list of the number of probes covering each nucleotide position. The order of the list follows the order of the nucleotide sequence, i.e. the 4th number of the list describes the number of probes covering the 4th nucleotide position of the entry's sequence. 
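
The three-line layout described above can be read with a few lines of Python. This is a minimal sketch under the stated format rules, not the loader used by ProbeTools itself; the function name `parse_pt` and the returned dictionary shape are assumptions for illustration.

```python
def parse_pt(lines):
    """Parse .pt text lines into {header: (sequence, depths)}."""
    entries = {}
    header = seq = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Entry header
            header, seq = line[1:], None
        elif line.startswith('$'):
            # Entry sequence
            seq = line[1:]
        elif line.startswith('#'):
            # Entry probe depths, one integer per nucleotide position
            if header is None or seq is None:
                raise ValueError('Depth line found without a header and sequence')
            depths = [int(d) for d in line[1:].split(',')]
            if len(depths) != len(seq):
                raise ValueError(f'{header}: depth list and sequence differ in length')
            entries[header] = (seq, depths)
            header = seq = None
    return entries


pt_lines = [
    '>Entry_header',
    '$ATGCGTTGACAGTGCACACG',
    '#1,1,1,1,1,2,2,2,2,2,1,1,2,2,2,3,3,3,3,3',
]
entries = parse_pt(pt_lines)
seq, depths = entries['Entry_header']
print(seq[3], depths[3])  # 4th nucleotide and the number of probes covering it
```

Checking that the depth list and the sequence have equal lengths is a cheap way to catch truncated or hand-edited files before computing statistics from them.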
168 | 169 | Example entry: 170 | ``` 171 | >Entry_header 172 | $ATGCGTTGACAGTGCACACG 173 | #1,1,1,1,1,2,2,2,2,2,1,1,2,2,2,3,3,3,3,3 174 | ``` 175 | -------------------------------------------------------------------------------- /probetools_v_0_1_11.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | 4 | import sys 5 | import subprocess 6 | import os 7 | 8 | 9 | def main(): 10 | version = '0.1.11' 11 | # Parse command line arguments 12 | module, args = parse_args(sys.argv, version) 13 | # Set path to output directory and name to append to output files 14 | out_path, name = os.path.split(args['-o']) 15 | out_path = '.' if out_path == '' else out_path 16 | # Make sure output directory exists and is a directory 17 | if not os.path.isdir(out_path): 18 | print(f'\nERROR: Output path {out_path} does not exist or is not a directory.\n') 19 | exit(1) 20 | # Run top-level function for selected module 21 | if module == 'clusterkmers': 22 | print(f'\nProbeTools ClusterKmers v{version}') 23 | print('https://github.com/KevinKuchinski/ProbeTools\n') 24 | cluster_kmers(out_path, name, args['-t'], args['-k'], args['-i'], args['-s'], args['-d'], args['-p'], args['-n'], args['-T']) 25 | elif module == 'capture': 26 | print(f'\nProbeTools Capture v{version}') 27 | print('https://github.com/KevinKuchinski/ProbeTools\n') 28 | capture(out_path, name, args['-t'], args['-p'], args['-i'], args['-l'], args['-T']) 29 | elif module == 'getlowcov': 30 | print(f'\nProbeTools GetLowCov v{version}') 31 | print('https://github.com/KevinKuchinski/ProbeTools\n') 32 | get_low_cov(out_path, name, args['-i'], args['-k'], args['-D'], args['-L']) 33 | elif module == 'stats': 34 | print(f'\nProbeTools Stats v{version}') 35 | print('https://github.com/KevinKuchinski/ProbeTools\n') 36 | stats(out_path, name, args['-i']) 37 | elif module == 'makeprobes': 38 | print(f'\nProbeTools MakeProbes v{version}') 39 | 
print('https://github.com/KevinKuchinski/ProbeTools\n') 40 | make_probes(out_path, name, args['-t'], args['-b'], args['-m'], args['-c'], args['-k'], args['-i'], args['-s'], args['-d'], 41 | args['-D'], args['-L'], args['-i'], args['-l'], args['-T']) 42 | elif module == 'merge': 43 | print(f'\nProbeTools Merge v{version}') 44 | print('https://github.com/KevinKuchinski/ProbeTools\n') 45 | merge(args['-i'], args['-I'], args['-o']) 46 | print('\nDone.\n') 47 | exit(0) 48 | 49 | 50 | ########## Command line interface functions ########## 51 | def parse_args(args, version): 52 | if len(args) < 2: 53 | print('\nERROR: A ProbeTools module must be selected.\n') 54 | print_usage(None, version) 55 | exit(1) 56 | else: 57 | module = args[1] 58 | arg_values = {} 59 | for arg_1, arg_2 in zip(args[1:-1], args[2:]): 60 | if arg_1[0] == '-': 61 | if arg_2[0] != '-': 62 | arg_values[arg_1] = arg_2 63 | else: 64 | arg_values[arg_1] = '' 65 | if args[-1][0] == '-': 66 | arg_values[args[-1]] = '' 67 | # Set defaults, mins, and maxs depending on selected module 68 | if module == 'clusterkmers': 69 | required_args = {'-t', '-o'} 70 | arg_value_types = {'-t': str, '-o': str, '-k': int, '-s': int, '-d': int, '-i': float, '-p': str, '-n': int, '-T': int} 71 | min_arg_values = {'-k': 32, '-s': 1, '-d': 0, '-i': 50, '-n': 1, '-T': 1} 72 | max_arg_values = {'-i': 100} 73 | default_arg_values = {'-k': 120, '-s': 1, '-d': 0, '-i': 90, '-p': None, '-n': 'MAX', '-T': 0} 74 | elif module == 'capture': 75 | required_args = {'-t', '-p', '-o'} 76 | arg_value_types = {'-t': str, '-p': str, '-o':str, '-i': float, '-l': int, '-T': int} 77 | min_arg_values = {'-i': 50, '-l': 1, '-T': 1} 78 | max_arg_values = {'-i': 100} 79 | default_arg_values = {'-i': 90, '-l': 60, '-T': 1} 80 | elif module == 'getlowcov': 81 | required_args = {'-i', '-o'} 82 | arg_value_types = {'-i': str, '-o': str, '-k': int, '-D': int, '-L': int} 83 | min_arg_values = {'-k': 32, '-D': 0, '-L': 1} 84 | max_arg_values = {} 85 | 
default_arg_values = {'-k': 120, '-D': 0, '-L': 40} 86 | elif module == 'stats': 87 | required_args = {'-i', '-o'} 88 | arg_value_types = {'-i': str, '-o': str} 89 | min_arg_values = {} 90 | max_arg_values = {} 91 | default_arg_values = {} 92 | elif module == 'makeprobes': 93 | required_args = {'-t', '-b', '-o'} 94 | arg_value_types = {'-t': str, '-b': int, '-o': str, '-m': int, '-c': float, '-k': int, 95 | '-s': int, '-d': int, '-D': int, '-L': int, '-i': float, '-l': int, '-T': int} 96 | min_arg_values = {'-b': 1, '-m': 1, '-c': 0, '-k': 32, '-s': 1, '-d': 0, '-D': 0, '-L': 1, '-i': 50, '-l': 1, '-T': 1} 97 | max_arg_values = {'-c': 100, '-i': 100} 98 | default_arg_values = {'-m': 'MAX', '-c': 90, '-k': 120, '-s': 1, '-d': 0, '-D': 0, '-L': 40, '-i': 90, '-l': 60, '-T': 0} 99 | elif module == 'merge': 100 | required_args = {'-i', '-I', '-o'} 101 | arg_value_types = {'-i': str, '-I': str, '-o': str} 102 | min_arg_values = {} 103 | max_arg_values = {} 104 | default_arg_values = {} 105 | else: 106 | print('\nERROR: Module not recognized.') 107 | print_usage(None, version) 108 | exit(1) 109 | # Check if all required arguments were provided 110 | missing_args = set() 111 | for required_arg in required_args: 112 | if required_arg not in arg_values.keys() or arg_values[required_arg] == '': 113 | missing_args = missing_args | {required_arg} 114 | if missing_args != set(): 115 | print(f'\nERROR: Values must be provided for the following arguments: {", ".join(sorted(missing_args))}') 116 | print_usage(module, version) 117 | exit(1) 118 | # Check if unrecognized arguments were provided 119 | recognized_args = required_args | set(arg_value_types.keys()) 120 | unrecognized_args = set() 121 | for provided_arg in arg_values.keys(): 122 | if provided_arg not in recognized_args: 123 | unrecognized_args = unrecognized_args | {provided_arg} 124 | if unrecognized_args != set(): 125 | print(f'\nERROR: The following arguments are not recognized: {", 
".join(sorted(unrecognized_args))}') 126 | print_usage(module, version) 127 | exit(1) 128 | # Check if arguments were provided without values 129 | empty_args = set() 130 | for arg, value in arg_values.items(): 131 | if value == '': 132 | empty_args = empty_args | {arg} 133 | if empty_args != set(): 134 | print(f'\nERROR: The following arguments were provided without values: {", ".join(sorted(empty_args))}') 135 | print_usage(module, version) 136 | exit(1) 137 | # Check if provided values are of the correct type 138 | for arg, value in arg_values.items(): 139 | try: 140 | arg_values[arg] = arg_value_types[arg](value) 141 | except ValueError: 142 | print(f'\nERROR: Value for argument {arg} must be of type {str(arg_value_types[arg].__name__)}') 143 | print_usage(module, version) 144 | exit(1) 145 | # Check if provided values are within the correct range 146 | for arg, value in arg_values.items(): 147 | if arg in min_arg_values.keys() and value < min_arg_values[arg]: 148 | print(f'\nERROR: Value for argument {arg} must be at least {min_arg_values[arg]}') 149 | print_usage(module, version) 150 | exit(1) 151 | if arg in max_arg_values.keys() and value > max_arg_values[arg]: 152 | print(f'\nERROR: Value for argument {arg} must not exceed {max_arg_values[arg]}') 153 | print_usage(module, version) 154 | exit(1) 155 | # Assign default values to unspecified arguments 156 | for arg, value in default_arg_values.items(): 157 | if arg not in arg_values.keys(): 158 | arg_values[arg] = value 159 | # Return keyword args and their values 160 | return module, arg_values 161 | 162 | 163 | def print_usage(module, version): 164 | if module == None: 165 | print(f'\nProbeTools v{version}') 166 | print('https://github.com/KevinKuchinski/ProbeTools\n') 167 | print('Available modules:') 168 | print('makeprobes - probe panel design using a general purpose incremental strategy') 169 | print('clusterkmers - enumerate and cluster kmers from target sequences') 170 | print('capture - assess probe 
panel coverage of target sequences') 171 | print('getlowcov - extract low coverage sequences from target space') 172 | print('stats - calculate probe coverage and depth stats') 173 | print('merge - merge two sets of capture results\n') 174 | elif module == 'clusterkmers': 175 | print(f'\nProbeTools ClusterKmers v{version}') 176 | print('https://github.com/KevinKuchinski/ProbeTools\n') 177 | print('Usage: probetools clusterkmers -t -o / []\n') 178 | print('Required arguments:') 179 | print(' -t : path to target seqs in FASTA file') 180 | print(' -o : path to output directory and name to append to output files') 181 | print('Optional arguments:') 182 | print(' -k : length of kmers to enumerate (default=120, min=32)') 183 | print(' -s : number of bases separating each kmer (default=1, min=1)') 184 | print(' -d : number of degenerate bases to permit in probes (default=0, min=0)') 185 | print(' -i : nucleotide sequence identity (%) threshold used for kmer clustering (default=90, min=50, max=100)') 186 | print(' -p : path to FASTA file containing previously-generated probe sequences to filter from new probes') 187 | print(' -n : number of probe candidates to return (default=MAX, min=1)') 188 | print(' -T : number of threads used by VSEARCH for clustering kmers (default=MAX, min=1)\n') 189 | elif module == 'capture': 190 | print(f'\nProbeTools Capture v{version}') 191 | print('https://github.com/KevinKuchinski/ProbeTools\n') 192 | print('Usage: probetools capture -t -p -o / []\n') 193 | print('Required arguments:') 194 | print(' -t : path to target seqs in FASTA file') 195 | print(' -p : path to probe sequences in FASTA file') 196 | print(' -o : path to output directory and name to append to output files') 197 | print('Optional arguments:') 198 | print(' -i : nucleotide sequence identity (%) threshold used for probe-target alignments (default=90, min=50, max=100)') 199 | print(' -l : minimum length for probe-target alignments (default=60, min=1)') 200 | print(' -T : 
number of threads used by BLASTn for aligning probes to targets (default=1, min=1)\n') 201 | elif module == 'getlowcov': 202 | print(f'\nProbeTools GetLowCov v{version}') 203 | print('https://github.com/KevinKuchinski/ProbeTools\n') 204 | print('Usage: probetools getlowcov -i -o / []') 205 | print('Required arguments:') 206 | print(' -i : path to capture results in PT file') 207 | print(' -o : path to output directory and name to append to output files') 208 | print('Optional arguments:') 209 | print(' -k : minimum sub-sequence length extracted, should be same as kmer length used for making probes (default=120, min=32)') 210 | print(' -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)') 211 | print(' -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)') 212 | elif module == 'stats': 213 | print(f'\nProbeTools Stats v{version}') 214 | print('https://github.com/KevinKuchinski/ProbeTools\n') 215 | print('Usage: probetools stats -i -o /') 216 | print('Required arguments:') 217 | print(' -i : path to capture results in PT file') 218 | print(' -o : path to output directory and name to append to output files\n') 219 | elif module == 'makeprobes': 220 | print(f'\nProbeTools MakeProbes v{version}') 221 | print('https://github.com/KevinKuchinski/ProbeTools\n') 222 | print('Usage: probetools makeprobes -t -b -o / []') 223 | print('Required arguments:') 224 | print(' -t : path to target sequences in FASTA file') 225 | print(' -b : number of probes in each batch (min=1)') 226 | print(' -o : path to output directory and name to append to output files') 227 | print('Optional arguments:') 228 | print(' -m : max number of probes to add to panel (default=MAX, min=1)') 229 | print(' -c : target for 10th percentile of probe coverage (default=90, min=1, max=100)') 230 | print(' -k : length of probes to generate (default=120, min=32)') 231 | print(' -s : number 
of bases separating each kmer (default=1, min=1)') 232 | print(' -d : number of degenerate bases to permit in probes (default=0, min=0)') 233 | print(' -i : nucleotide sequence identity (%) threshold used for kmer clustering and probe-target alignments (default=90, min=50, max=100)') 234 | print(' -l : minimum length for probe-target alignments (default=60, min=1)') 235 | print(' -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)') 236 | print(' -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)') 237 | print(' -T : number of threads used by VSEARCH and BLASTn for clustering kmers and aligning' 238 | f' probes to targets (default=MAX for VSEARCH, default=1 for BLASTn, min=1)\n') 239 | elif module == 'merge': 240 | print(f'\nProbeTools Merge v{version}') 241 | print('https://github.com/KevinKuchinski/ProbeTools\n') 242 | print('Usage: probetools merge -i -I -o ') 243 | print('Required arguments:') 244 | print(' -i : path to capture results in PT file') 245 | print(' -I : path to other capture results in PT file') 246 | print(' -o : path to merged capture results PT output file\n') 247 | 248 | 249 | ########## Helper functions ########## 250 | def check_input(files): 251 | '''Checks list of input file paths to make sure they exist.''' 252 | for file in files: 253 | if os.path.exists(file) == False: 254 | print(f'\nERROR: Input file {file} does not exist.\n') 255 | exit(1) 256 | 257 | 258 | def load_fasta(fasta_path): 259 | '''Loads contents of a FASTA file into two lists (one for headers, and another for their 260 | corresponding seqs).''' 261 | # Check if targets FASTA is valid 262 | check_input([fasta_path]) 263 | # Open targets FASTA and load headers and seqs into lists 264 | with open(fasta_path, 'r') as input_file: 265 | entry_counter = 0 266 | header = '' 267 | headers, seqs = [], [] 268 | for line in input_file: 269 | if line[0] == '>': 270 
| entry_counter += 1 271 | header = line.strip().lstrip('>') 272 | headers.append(header) 273 | seqs.append('') 274 | elif header != '': 275 | seqs[-1] += line.strip() 276 | for header, seq in zip(headers, seqs): 277 | if header == '': 278 | print(f'\nWARNING: Seq is present without header in {fasta_path}.') 279 | if seq == '': 280 | print(f'\nWARNING: Header {header} in {fasta_path} has no accompanying seq.') 281 | return headers, seqs 282 | 283 | 284 | def append_fasta(fasta_path, new_fasta_path): 285 | '''Appends contents of new FASTA file to existing FASTA file then deletes new FASTA file.''' 286 | with open(new_fasta_path, 'r') as input_file, open(fasta_path, 'a') as output_file: 287 | for line in input_file: 288 | output_file.write(line) 289 | # GARBAGE COLLECTION - Delete new FASTA file after contents have been appended 290 | os.remove(new_fasta_path) 291 | 292 | 293 | ######### Top-level functions for modules ########## 294 | def cluster_kmers(out_path, name, targets_path, k, cluster_id, step, max_degen, prev_probes_path, num_probes, threads): 295 | check_input([targets_path]) 296 | kmers_path = os.path.join(out_path, name + '_kmers.fa') 297 | enum_kmers(targets_path, kmers_path, k, step, max_degen) 298 | centroids_path = os.path.join(out_path, name + '_centroids.fa') 299 | cluster_kmers_with_VSEARCH(kmers_path, centroids_path, cluster_id, threads) 300 | remove_prev_probes(centroids_path, prev_probes_path) 301 | probe_names, probe_seqs = rank_probe_candidates(centroids_path) 302 | probes_path = os.path.join(out_path, name + '_probes.fa') 303 | potential_probes, probes_writen = write_top_probes(probe_names, probe_seqs, num_probes, probes_path, name) 304 | return potential_probes, probes_writen 305 | 306 | 307 | def capture(out_path, name, targets_path, probes_path, min_id, min_length, threads): 308 | check_input([targets_path, probes_path]) 309 | blast_path = os.path.join(out_path, name + '_blast_results.tsv') 310 | 
align_probes_to_targets_with_BLAST(targets_path, probes_path, blast_path, threads) 311 | capture_data = create_empty_capture_data(targets_path) 312 | capture_data = add_BLAST_results_to_capture_data(blast_path, capture_data, min_id, min_length) 313 | capture_path = os.path.join(out_path, name + '_capture.pt') 314 | write_capture_data(capture_path, capture_data) 315 | 316 | 317 | def get_low_cov(out_path, name, capture_path, k, min_depth, min_length): 318 | check_input([capture_path]) 319 | capture_data = load_capture_data(capture_path) 320 | low_cov_path = os.path.join(out_path, name + '_low_cov_seqs.fa') 321 | seqs_writen = write_low_cov_seqs(capture_data, low_cov_path, k, min_depth, min_length) 322 | 323 | 324 | def stats(out_path, name, capture_path): 325 | '''Top-level function for stats module.''' 326 | check_input([capture_path]) 327 | capture_data = load_capture_data(capture_path) 328 | report_path = os.path.join(out_path, name + '_summary_stats_report.tsv') 329 | write_summary_report(capture_data, report_path, name) 330 | report_path = os.path.join(out_path, name + '_long_stats_report.tsv') 331 | write_long_report(capture_data, report_path, name) 332 | 333 | 334 | def make_probes(out_path, name, targets_path, batch_size, max_probes, cov_target, k, cluster_id, step, max_degen, 335 | min_depth, min_low_cov_length, min_id, min_capture_length, threads): 336 | check_input([targets_path]) 337 | # Set variable for counting rounds of incremental probe design 338 | round_counter = 0 339 | # Initialize variables for panel size (# probes) and 10th percentile of panel coverage 340 | panel_size, panel_cov = 0, 0 341 | # If panel size is not specified, set max_panel_size as 1 probe larger than current panel size so loop doesn't break 342 | max_panel_size = panel_size + batch_size if max_probes == 'MAX' else max_probes 343 | # Create empty FASTA file for final probe panel 344 | final_probes_path = os.path.join(out_path, name + '_probes.fa') 345 | with 
open(final_probes_path, 'w') as output_file: 346 | pass 347 | # Create empty capture dict from target space 348 | capture_data = create_empty_capture_data(targets_path) 349 | print() 350 | # Enter main incremental design loop 351 | while panel_size < max_panel_size and panel_cov < cov_target: 352 | round_counter += 1 353 | print('*' * 20, f'ROUND {round_counter}', '*' * 20 + '\n') 354 | # Write low cov seqs to FASTA as target space for this round 355 | low_cov_path = os.path.join(out_path, name + '_low_cov_seqs.fa') 356 | seqs_writen = write_low_cov_seqs(capture_data, low_cov_path, k, min_depth, min_low_cov_length) 357 | print() 358 | # Break design loop if no low seqs were writen 359 | if seqs_writen == 0: 360 | break 361 | # Make probes from target space 362 | num_probes = min(batch_size, max_panel_size - panel_size) 363 | round_name = name + f'_round_{round_counter}' 364 | potential_probes, probes_writen = cluster_kmers(out_path, round_name, low_cov_path, k, cluster_id, step, max_degen, final_probes_path, num_probes, threads) 365 | print() 366 | # GARBAGE COLLECTION - Delete low cov seqs 367 | os.remove(low_cov_path) 368 | # Update panel size and max panel size 369 | panel_size += probes_writen 370 | max_panel_size = panel_size + batch_size if max_probes == 'MAX' else max_probes 371 | # Break design loop if no potential probes or no probes writen 372 | if potential_probes == 0 or probes_writen == 0: 373 | break 374 | # Capture probes against original targets 375 | blast_path = os.path.join(out_path, name + '_blast_results.tsv') 376 | new_probes_path = os.path.join(out_path, round_name + '_probes.fa') 377 | align_probes_to_targets_with_BLAST(targets_path, new_probes_path, blast_path, threads) 378 | capture_data = add_BLAST_results_to_capture_data(blast_path, capture_data, min_id, min_capture_length) 379 | print() 380 | # Append new probes to final panel 381 | append_fasta(final_probes_path, new_probes_path) 382 | # Check if panel cov exceeds cov target 383 | 
        panel_cov = calc_cov_percentiles(capture_data, percentiles=(0.1,))[0]
        print(f'10th percentile of target coverage: {panel_cov}%\n')
    print('Incremental probe design finished.')
    if panel_size >= max_panel_size:
        print(' Maximum panel size reached.')
    if panel_cov >= cov_target:
        print(' Coverage target for 10th percentile of targets reached.')
    if seqs_writen == 0 or potential_probes == 0 or probes_writen == 0:
        print(' No additional probes could be designed.')
    capture_path = os.path.join(out_path, name + '_capture.pt')
    print(f'\nWriting capture results to {capture_path}...')
    write_capture_data(capture_path, capture_data)
    report_path = os.path.join(out_path, name + '_summary_stats_report.tsv')
    write_summary_report(capture_data, report_path, name)
    report_path = os.path.join(out_path, name + '_long_stats_report.tsv')
    write_long_report(capture_data, report_path, name)


def merge(capture_path, other_capture_path, merged_capture_path):
    capture_data = load_capture_data(capture_path)
    other_capture_data = load_capture_data(other_capture_path)
    merged_capture_data = merge_capture_results(capture_data, other_capture_data)
    write_capture_data(merged_capture_path, merged_capture_data)


########## clusterkmers functions ##########
def enum_kmers(targets_path, kmers_path, k, step, max_degen):
    print(f'Enumerating all {k}-mers in {targets_path}...')
    # Load contents of targets FASTA
    headers, seqs = load_fasta(targets_path)
    print(f' Loaded {"{:,}".format(len(seqs))} target seqs in {targets_path}...')
    # Create FASTA file for kmers output and enumerate kmers from target seqs
    with open(kmers_path, 'w') as output_file:
        target_counter = 0
        kmer_counter = 0
        for header, seq in zip(headers, seqs):
            if seq == '':
                print(f' WARNING --- Target {header} has no sequence!')
            elif len(seq) \
                    < k:
                print(f' WARNING --- Target {header} is shorter than the desired probe length!')
            elif len(seq) >= k and seq != '':
                target_counter += 1
                for i in range(0, len(seq) - k + 1, step):
                    kmer = seq[i:i+k].upper()
                    # Count bases that are not A, T, G, or C (degenerate or ambiguous bases)
                    num_degen = len(kmer) - sum(kmer.count(base) for base in 'ATGC')
                    if num_degen <= max_degen:
                        kmer_counter += 1
                        output_file.write(f'>kmer_{kmer_counter}\n')
                        output_file.write(kmer + '\n')
    print(f' Enumerated {"{:,}".format(kmer_counter)} k-mers from {targets_path}.')


def cluster_kmers_with_VSEARCH(kmers_path, centroids_path, cluster_id, threads):
    print(f' Clustering k-mers at {cluster_id}% identity...')
    # Create and run terminal command for VSEARCH
    terminal_command = (f'vsearch --cluster_fast {kmers_path} --id {cluster_id / 100} --centroids {centroids_path}'
                        f' --fasta_width 0 --sizeout --qmask none --threads {threads}')
    finished_process = subprocess.call(terminal_command, stdout=subprocess.DEVNULL,
                                       stderr=subprocess.DEVNULL, shell=True)
    if finished_process != 0:
        print(f'\nERROR: vsearch cluster_fast terminated with errors while clustering k-mers'
              f' (Error code: {finished_process}).\n')
        exit(1)
    # GARBAGE COLLECTION - Delete kmers FASTA file
    os.remove(kmers_path)
    print(f' K-mer clustering finished.')


def remove_prev_probes(centroids_path, prev_probes_path):
    if prev_probes_path is not None:
        print(f'Removing any previously designed probes...')
        prev_probe_headers, prev_probe_seqs = load_fasta(prev_probes_path)
        print(f' Loaded {"{:,}".format(len(prev_probe_seqs))} previously designed probes in {prev_probes_path}')
        # Use a set so each membership test below is a constant-time lookup
        prev_probe_seqs = set(prev_probe_seqs)
        centroid_headers, centroid_seqs = load_fasta(centroids_path)
        with open(centroids_path, 'w') as output_file:
            for header, seq in zip(centroid_headers, centroid_seqs):
                if seq not in prev_probe_seqs:
                    output_file.write('>' + header +
                                      '\n')
                    output_file.write(seq + '\n')


def rank_probe_candidates(centroids_path):
    print(f'Ranking probe candidates...')
    # Load centroids FASTA file
    headers, seqs = load_fasta(centroids_path)
    print(f' Loaded {"{:,}".format(len(seqs))} probe candidates.')
    # Create dicts for looking up each candidate probe seq's cluster size and probe name
    cluster_sizes = {seq: int(header.split('size=')[1].split(';')[0]) for header, seq in zip(headers, seqs)}
    probe_names = {seq: header.split(';')[0] for header, seq in zip(headers, seqs)}
    # Rank candidate probe seqs based on cluster size
    probe_seqs = sorted(seqs, key=lambda seq: cluster_sizes[seq], reverse=True)
    # GARBAGE COLLECTION - Delete centroids FASTA file
    os.remove(centroids_path)
    return probe_names, probe_seqs


def write_top_probes(probe_names, probe_seqs, num_probes, probes_path, name):
    print(f'Writing top probe candidates to {probes_path}...')
    potential_probes = len(probe_seqs)
    # Check if there are any probes to write
    if potential_probes == 0:
        print(' No probe candidates to write to FASTA.')
        return 0, 0
    # Determine how many probes to write to FASTA file
    if num_probes == 'MAX':
        max_probes = len(probe_seqs)
    else:
        max_probes = num_probes
    # Write probes to FASTA file
    probe_counter = 0
    with open(probes_path, 'w') as output_file:
        while probe_counter < len(probe_seqs) and probe_counter < max_probes:
            probe_seq = probe_seqs[probe_counter]
            probe_counter += 1
            output_file.write(f'>{name}_probe_{probe_counter}\n')
            output_file.write(probe_seq + '\n')
    print(f' Wrote {"{:,}".format(probe_counter)} probes to {probes_path}.')
    return potential_probes, probe_counter


########## capture functions ##########
def align_probes_to_targets_with_BLAST(targets_path, probes_path, blast_path, threads):
    print(f'Aligning probes in {probes_path} to targets in {targets_path}...')
    # Check that BLAST db files exist for target seqs
    if any(not os.path.exists(targets_path + '.' + suffix) for suffix in ['nhr', 'nin', 'nsq']):
        print(' WARNING: blastn db files do not exist for target seqs. Creating blastn db...')
        # Create terminal command for making BLASTn db
        terminal_command = f'makeblastdb -in {targets_path} -dbtype nucl'
        # Run terminal command and redirect stdout and stderr to DEVNULL
        finished_process = subprocess.run(terminal_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, shell=True)
        # Exit with errors if makeblastdb terminates with errors
        if finished_process.returncode != 0:
            print(f'\nERROR: makeblastdb terminated with errors while making db for target seqs (Error code: {finished_process.returncode}).\n')
            exit(1)
    # Count target seqs
    headers, seqs = load_fasta(targets_path)
    num_targets = len(seqs)
    # Create and run terminal command for BLASTn
    threads = 1 if threads == 0 else threads
    terminal_command = (f"blastn -db {targets_path} -query {probes_path} -max_target_seqs {num_targets}"
                        f" -num_threads {threads} -outfmt 6 > {blast_path}")
    finished_process = subprocess.run(terminal_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, shell=True)
    if finished_process.returncode != 0:
        print(f'\nERROR: blastn terminated with errors while aligning probes to target seqs (Error code: {finished_process.returncode}).\n')
        exit(1)


def create_empty_capture_data(targets_path):
    print(f'Extracting target names and seqs from {targets_path}...')
    with open(targets_path, 'r') as input_file:
        headers, seqs = [], []
        for line in input_file:
            if line[0] == '>':
                header = line.strip().lstrip('>')
                headers.append(header)
                seqs.append('')
            else:
                seqs[-1] += \
                    line.strip()
    for header in headers:
        if headers.count(header) > 1:
            print(f'\nERROR: Header {header} appears more than once in {targets_path}.\n')
            exit(1)
        if ' ' in header:
            print(f'\nERROR: Header {header} contains spaces.\n')
            exit(1)
    capture_data = {header: (seq, [0] * len(seq)) for header, seq in zip(headers, seqs)}
    print(f' Total targets loaded: {"{:,}".format(len(capture_data))}')
    return capture_data


def load_capture_data(capture_path):
    print(f'Loading capture data from {capture_path}...')
    with open(capture_path, 'r') as input_file:
        headers, seqs, depths = [], [], []
        for line in input_file:
            if line[0] == '>':
                header = line.strip().lstrip('>')
                headers.append(header)
                seqs.append('')
                depths.append('')
            elif line[0] == '$':
                seqs[-1] += line.strip()
            elif line[0] == '#':
                depths[-1] += line.strip()
    seqs = [seq.lstrip('$') for seq in seqs]
    depths = [[int(d) for d in depth.lstrip('#').split(',')] for depth in depths]
    if len(set(len(c) for c in [headers, seqs, depths])) != 1:
        print(f'\nERROR: The number of headers, seqs, and probe depth lists do not match in {capture_path}.')
        exit(1)
    for header in headers:
        if headers.count(header) > 1:
            print(f'\nERROR: Header {header} appears more than once in {capture_path}.\n')
            exit(1)
    capture_data = {header: (seq, depth) for header, seq, depth in zip(headers, seqs, depths)}
    for header in capture_data.keys():
        if len(capture_data[header][0]) != len(capture_data[header][1]):
            print(f'\nERROR: Seq length and probe depth list length do not match for entry {header}.')
            exit(1)
    print(f' Total targets loaded: {"{:,}".format(len(capture_data))}')
    return capture_data


def merge_capture_results(capture_data, other_capture_data):
    headers, other_headers = set(capture_data.keys()), \
        set(other_capture_data.keys())
    shared_headers = headers & other_headers
    merged_capture_data = {}
    for header in shared_headers:
        if capture_data[header][0] != other_capture_data[header][0]:
            print(f'\nERROR: Header {header} appears in both capture results but they do not have the same seq.')
            exit(1)
        if len(capture_data[header][1]) != len(other_capture_data[header][1]):
            print(f'\nERROR: Header {header} appears in both capture results but their probe depth lists are different lengths.')
            exit(1)
        merged_capture_data[header] = (capture_data[header][0], [a + b for a, b in zip(capture_data[header][1], other_capture_data[header][1])])
    for header in headers - shared_headers:
        merged_capture_data[header] = capture_data[header]
    for header in other_headers - shared_headers:
        merged_capture_data[header] = other_capture_data[header]
    return merged_capture_data


def add_BLAST_results_to_capture_data(blast_path, capture_data, min_id, min_length):
    print(f'Extracting probe coverage and probe depth from alignment of probes against targets...')
    with open(blast_path, 'r') as input_file:
        for line in input_file:
            # BLAST outfmt 6 columns: [1] subject id, [2] % identity, [3] alignment length,
            # [8] subject start, [9] subject end (1-based, inclusive)
            line = line.split('\t')
            if float(line[2]) >= min_id and float(line[3]) >= min_length:
                target = line[1].strip()
                aln_start = int(line[8].strip()) - 1
                aln_end = int(line[9].strip()) - 1
                # Subject start > subject end for reverse-strand hits, so sort the coordinates
                aln_start, aln_end = sorted([aln_start, aln_end])
                for position in range(aln_start, aln_end + 1):
                    capture_data[target][1][position] += 1
    # GARBAGE COLLECTION - Delete BLAST results
    os.remove(blast_path)
    return capture_data


def write_capture_data(capture_path, capture_data):
    print(f'Writing capture results to {capture_path}...')
    with open(capture_path, 'w') as output_file:
        for header, (seq, depth) in capture_data.items():
            output_file.write('>' + header + '\n')
            output_file.write('$' + seq + '\n')
            output_file.write('#' + ','.join([str(d) for d in depth]) + '\n')
    print(f' Wrote capture results for {"{:,}".format(len(capture_data))} targets.')


########## getlowcov functions ##########
def write_low_cov_seqs(capture_data, low_cov_path, k, min_depth, min_length):
    print(f'Extracting low coverage areas...')
    # Write low cov seqs to FASTA file
    with open(low_cov_path, 'w') as output_file:
        total_low_cov_counter = 0
        for header, (seq, depth) in capture_data.items():
            if len(seq) < k:
                print(f' WARNING: Seq {header} is shorter than the window length so low coverage seqs cannot be extracted.')
            else:
                low_cov_counter = 0
                low_cov_start = 0
                while low_cov_start < len(seq):
                    # Skip positions with sufficient depth to find the start of the next low cov run
                    while low_cov_start < len(seq) and depth[low_cov_start] > min_depth:
                        low_cov_start += 1
                    # Extend the run while depth stays at or below min_depth
                    low_cov_end = low_cov_start
                    while low_cov_end < len(seq) and depth[low_cov_end] <= min_depth:
                        low_cov_end += 1
                    low_cov_seq = seq[low_cov_start:low_cov_end]
                    if len(low_cov_seq) > min_length and len(low_cov_seq) >= k:
                        total_low_cov_counter += 1
                        low_cov_counter += 1
                        output_file.write(f'>{header}_position_{low_cov_start + 1}_to_{low_cov_end}\n')
                        output_file.write(low_cov_seq + '\n')
                    elif len(low_cov_seq) > min_length and len(low_cov_seq) < k:
                        # Run is shorter than k: centre a k-length window on it, shifting the window to stay within the seq
                        window_start = low_cov_start - int((k - len(low_cov_seq)) / 2)
                        window_end = window_start + k
                        if window_start < 0:
                            window_end = window_end - window_start
                            window_start = 0
                        if window_end > len(seq):
                            window_start = window_start - window_end + len(seq)
                            window_end = len(seq)
                        extended_low_cov_seq = seq[window_start:window_end]
                        total_low_cov_counter += 1
                        low_cov_counter += 1
                        output_file.write(f'>{header}_position_{window_start + 1}_to_{window_end}\n')
                        output_file.write(extended_low_cov_seq.upper() + '\n')
                    low_cov_start = low_cov_end
    print(f' Wrote {total_low_cov_counter} low coverage seqs from {"{:,}".format(len(capture_data))} targets.')
    seqs_writen = total_low_cov_counter
    return seqs_writen


########## stats functions ##########
def calc_cov_at_depth(seq, depth, min_depth):
    covered_bases = 0
    total_bases = 0
    for base, base_depth in zip(seq, depth):
        if base_depth >= min_depth:
            covered_bases += 1
            total_bases += 1
        elif base.upper() in 'ATGC':
            total_bases += 1
    if total_bases == 0:
        coverage = 0
    else:
        coverage = covered_bases * 100 / total_bases
    return coverage


def calc_cov_percentiles(capture_data, percentiles=(0, 0.05, 0.1, 0.25, 0.5, 0.75, 1)):
    """Takes a capture dict and tuple of percentiles, and returns a list of those percentile
    values."""
    cov_values = []
    for header, (seq, depth) in capture_data.items():
        cov_values.append(calc_cov_at_depth(seq, depth, 1))
    cov_values = sorted(cov_values)
    percentile_values = []
    for percentile in percentiles:
        values_under_percentile = (len(cov_values) - 1) * percentile
        if values_under_percentile == int(values_under_percentile):
            percentile_values.append(cov_values[int(values_under_percentile)])
        else:
            # Rank falls between two values: take their midpoint (not a linear interpolation)
            percentile_value = cov_values[int(values_under_percentile)]
            percentile_value += cov_values[int(values_under_percentile) + 1]
            percentile_value = percentile_value / 2
            percentile_values.append(percentile_value)
    return [round(percentile_value, 2) for percentile_value in percentile_values]


def calc_total_probe_depth(capture_data):
    """Takes a capture dict and returns a tuple containing the percentage of nucleotide positions
    in the target space covered by 0, 1, 2, 3, 4, and 5+ probes."""
    total = 0
    total_0 = 0
    total_1 = 0
    total_2 = 0
    total_3 = 0
    total_4 = 0
    total_5 = 0
    for header, (seq, depth) in capture_data.items():
        total += len(depth)
        total_0 += depth.count(0)
        total_1 += depth.count(1)
        total_2 += depth.count(2)
        total_3 += depth.count(3)
        total_4 += depth.count(4)
        total_5 += len([d for d in depth if d >= 5])
    # Avoid division by zero when the target space contains no nucleotide positions
    if total == 0:
        return (0, 0, 0, 0, 0, 0)
    total_0 = round(total_0 * 100 / total, 2)
    total_1 = round(total_1 * 100 / total, 2)
    total_2 = round(total_2 * 100 / total, 2)
    total_3 = round(total_3 * 100 / total, 2)
    total_4 = round(total_4 * 100 / total, 2)
    total_5 = round(total_5 * 100 / total, 2)
    return (total_0, total_1, total_2, total_3, total_4, total_5)


def write_summary_report(capture_data, report_path, name):
    """Takes a capture dict and generates the summary report."""
    print(f'Writing probe coverage and probe depth summary stats to {report_path}...')
    with open(report_path, 'w') as output_file:
        header = ['name', 'total_targets']
        header += ['cov_' + s for s in ['min', '5%tile', '10%tile', 'Q1', 'med', 'Q3', 'max']]
        header += ['depth_' + s for s in ['0', '1', '2', '3', '4', '5+']]
        output_file.write('\t'.join(header) + '\n')
        line = [name, str(len(capture_data))]
        line += [str(f) for f in calc_cov_percentiles(capture_data)]
        line += [str(f) for f in calc_total_probe_depth(capture_data)]
        output_file.write('\t'.join(line) + '\n')


def write_long_report(capture_data, report_path, name):
    """Takes a capture dict and generates the long-form report."""
    print(f'Writing probe coverage and probe depth stats for each target to {report_path}...')
    with open(report_path, 'w') as output_file:
        header = ['name', 'target', 'length', 'ATGCs', '%_ATGCs']
        header += ['bases_' + str(d) + 'X' for d in ['0', '1', '2', '3', '4', '5+']]
        header += ['%_cov_' + str(d) + 'X' for d in ['1', '2', '3', '4', '5']]
        output_file.write('\t'.join(header) + '\n')
        for header, (seq, depth) in capture_data.items():
            total_ATGC = len([s for s in seq if s.upper() in 'ATGC'])
            length = len(depth)
            # Guard against zero-length targets to avoid division by zero
            percent_ATGC = round(total_ATGC * 100 / length, 2) if length > 0 else 0
            line = [name, header, str(length), str(total_ATGC), str(percent_ATGC)]
            line += [str(depth.count(d)) for d in [0, 1, 2, 3, 4]]
            line += [str(len([d for d in depth if d >= 5]))]
            line += [str(round(calc_cov_at_depth(seq, depth, min_depth), 1)) for min_depth in [1, 2, 3, 4, 5]]
            output_file.write('\t'.join(line) + '\n')


######### call main function ##########
if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
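
The capture functions above store, for each target, a tuple of the sequence and a per-position probe depth list. The sketch below is a minimal, standalone illustration of that bookkeeping on a toy target. The helper names (`add_hit`, `percent_covered`, `low_cov_runs`) are hypothetical and are not part of ProbeTools; they only mirror the logic of `add_BLAST_results_to_capture_data`, `calc_cov_at_depth`, and the run detection in `write_low_cov_seqs` under simplified assumptions (no files, no BLAST, no k-length window extension).

```python
# Illustrative sketch only: these helpers are hypothetical and not part of ProbeTools.

def add_hit(capture_data, target, sstart, send):
    """Increment probe depth over a 1-based, inclusive subject interval (BLAST outfmt 6 style)."""
    # Reverse-strand hits report sstart > send, so sort after converting to 0-based indices
    start, end = sorted([sstart - 1, send - 1])
    seq, depth = capture_data[target]
    for position in range(start, end + 1):
        depth[position] += 1


def percent_covered(seq, depth, min_depth=1):
    """Percentage of positions with depth >= min_depth; uncovered non-ATGC bases are not counted."""
    covered = sum(1 for d in depth if d >= min_depth)
    countable = sum(1 for base, d in zip(seq, depth) if d >= min_depth or base.upper() in 'ATGC')
    return covered * 100 / countable if countable else 0


def low_cov_runs(depth, min_depth=0):
    """Return 0-based, half-open (start, end) runs where depth <= min_depth."""
    runs, start = [], None
    for i, d in enumerate(depth):
        if d <= min_depth and start is None:
            start = i
        elif d > min_depth and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(depth)))
    return runs


# Toy target space: one 10-base target with zero depth everywhere
capture_data = {'target_1': ('ATGCNNATGC', [0] * 10)}
add_hit(capture_data, 'target_1', 1, 4)   # forward-strand hit on bases 1-4
add_hit(capture_data, 'target_1', 8, 3)   # reverse-strand hit on bases 3-8
seq, depth = capture_data['target_1']
print(depth)                        # [1, 1, 2, 2, 1, 1, 1, 1, 0, 0]
print(percent_covered(seq, depth))  # 80.0
print(low_cov_runs(depth))          # [(8, 10)]
```

In this toy case the two uncovered bases at the end of the target form the only low coverage run, which is what the next design round would take as its target space.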