├── .gitmodules ├── README.md ├── fastq_len_filter.py ├── license.txt ├── medians.csv ├── test ├── demo_SRR829034.fastq.bz2 └── test.sh └── viromeQC.py /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "cmseq"] 2 | path = cmseq 3 | url = https://github.com/SegataLab/cmseq.git 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ViromeQC # 2 | 3 | ## Description ## 4 | 5 | * Provides an enrichment score for VLP viromes with respect to metagenomes 6 | * Useful benchmark for the quality of enrichment of a virome 7 | * Tested on Linux Ubuntu Server 16.04 LTS and on Linux Mint 19 8 | 9 | **Requires:** 10 | 11 | * [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) >= v. 2.3.4 12 | * [Samtools](http://samtools.sourceforge.net/) >= 1.3.1 13 | * [Biopython](https://github.com/biopython/biopython) >= 1.69 14 | * [Pysam](http://pysam.readthedocs.io/en/latest/) >= 0.14 15 | * [Diamond](http://github.com/bbuchfink/diamond) (tested on v.0.9.9 and 0.9.29) 16 | * Python3 (tested on 3.6) 17 | * [pandas](https://pandas.pydata.org) >= 0.20 18 | 19 | **Update:** _ViromeQC_ now works with newer versions of diamond (e.g. v0.9.29) 20 | Thanks to Ryan Cook ([@RyanCookAMR](https://twitter.com/RyanCookAMR)) for the new diamond db 21 | 22 | ## Usage ## 23 | 24 | ### Step 1: clone or download the repository ### 25 | 26 | `git clone --recurse-submodules https://github.com/SegataLab/viromeqc.git` 27 | 28 | or download the repository from the **[releases](https://github.com/SegataLab/viromeqc/releases)** page 29 | 30 | ### Step 2: install the database: ### 31 | 32 | This step downloads the database file. This needs to be done only the first time you run ViromeQC. This may require a few minutes, depending on your internet connection. 
33 | 34 | `viromeQC.py --install` 35 | 36 | Alternatively, you can also download the database files from [Zenodo](https://zenodo.org/record/4020594#.X1jxgGMzZDM). Once you have downloaded the files, create a folder named `index/` in the ViromeQC installation folder and unzip all the files in this folder. 37 | 38 | ### Step 3: Run on your sample ### 39 | 40 | `viromeQC.py -i -o ` 41 | 42 | *Please Note:* 43 | You can pass more than one file as input (e.g. for multiple runs or paired end reads). However, you can process only one sample at a time with this command. If you want to parallelize the execution, this can be easily done with [Parallel](https://www.gnu.org/software/parallel/) or equivalent tools. 44 | 45 | You can try the test example (`test/test.sh`) which analyzes 10,000 reads from the sample `SRR829034`. This should take approximately 1 or 2 minutes. 46 | 47 | Parameters: 48 | 49 | ``` 50 | usage: viromeQC.py -i -o 51 | 52 | optional arguments: 53 | -h, --help show this help message and exit 54 | -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]] 55 | Raw Reads in FASTQ format. Supports multiple inputs 56 | (plain, gz or bz2) (default: None) 57 | -o OUTPUT, --output OUTPUT 58 | output file (default: None) 59 | --minlen MINLEN Minimum Read Length allowed (default: 75) 60 | --minqual MINQUAL Minimum Read Average Phred quality (default: 20) 61 | --bowtie2_threads BOWTIE2_THREADS 62 | Number of Threads to use with Bowtie2 (default: 4) 63 | --diamond_threads DIAMOND_THREADS 64 | Number of Threads to use with Diamond (default: 4) 65 | -w {human,environmental}, --enrichment_preset {human,environmental} 66 | Calculate the enrichment basing on human or 67 | environmental metagenomes. 
Default: human-microbiome 68 | (default: human) 69 | --bowtie2_path BOWTIE2_PATH 70 | Full path to the bowtie2 command to use, default 71 | assumes that bowtie2 is present in the system path 72 | (default: bowtie2) 73 | --diamond_path DIAMOND_PATH 74 | Full path to the diamond command to use, default 75 | assumes that diamond is present in the system path 76 | (default: diamond) 77 | --version Prints version information (default: False) 78 | --install Downloads database files (default: False) 79 | --sample_name SAMPLE_NAME 80 | Optional label for the sample to be included in the 81 | output file (default: None) 82 | --tempdir TEMPDIR Temporary Directory override (default is the system 83 | temp. directory) (default: None) 84 | ``` 85 | 86 | ### Pipeline structure ### 87 | 88 | ViromeQC starts from FASTQ files (compressed files are supported), and will: 89 | 90 | 1. Eliminate short and low quality reads 91 | - *adjust the `minqual` and `minlen` parameters if you want to change the thresholds* 92 | 2. Map the reads against a curated collection of rRNAs and single-copy bacterial markers 93 | 3. Filter the reads to remove short and divergent alignments 94 | 4. Compute the enrichment value of the sample, compared to the median observed in human metagenomes 95 | - use `-w environmental` for environmental reads 96 | - reference medians for un-enriched metagenomes are taken from `medians.csv`; you can provide your own data to ViromeQC by changing this file accordingly 97 | 5. 
Produce a report file with the alignment rates and the final enrichment score (which is the minimum enrichment observed across SSU-rRNA, LSU-rRNA and single-copy markers) 98 | 99 | 100 | ### Output ### 101 | 102 | Output is given as a TSV file with the following structure: 103 | 104 | 105 | | Sample | Reads | Reads_HQ | SSU rRNA alignment (%) | LSU rRNA alignment (%) | Bacterial_Markers alignment (%) | total enrichmnet score 106 | |---|---|---|---|---|---|---| 107 | | your_sample.fq | 40000 | 39479 | 0.00759898 | 0.0227969 | 0.01266496 | 5.795329 108 | 109 | 110 | - An enrichment score of 5.8 means that the virome is 5.8 times more enriched than a comparable metagenome 111 | - High scores (e.g. 10-50) reflect high VLP enrichment 112 | 113 | 114 | ## Citation ## 115 | 116 | If you find this tool useful, please cite: 117 | 118 | *Zolfo, M., Pinto, F., Asnicar, F., Manghi, P., Tett, A., Segata, N.* **[Detecting contamination in viromes using ViromeQC](https://www.nature.com/articles/s41587-019-0334-5)**, *Nature Biotechnology* 37, 1408–1412 (2019) 119 | 120 | -------------------------------------------------------------------------------- /fastq_len_filter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | from Bio import SeqIO 5 | import argparse as ap 6 | import sys 7 | import os 8 | import numpy as np 9 | 10 | def read_params(args): 11 | p = ap.ArgumentParser(description = 'fastq_len_filter.py Parameters\n') 12 | p.add_argument('--min_len', required = True, default = None, type = int) 13 | p.add_argument('--min_qual', default = 0, type = int) 14 | p.add_argument('--no_anonim', action="store_true") 15 | p.add_argument('--count') 16 | p.add_argument('-i','--input', required = True, help="Input File (FASTQ)") 17 | p.add_argument('-o','--output', required = True, help="Output File (FASTQ)") 18 | 19 | 20 | return vars(p.parse_args()) 21 | 22 | screenQual=range(20,31) 23 | qualCounter = 
dict((k,0) for k in screenQual) 24 | 25 | counter = 0 26 | allCounter = 0 27 | rpl=[] 28 | 29 | 30 | 31 | if __name__ == '__main__': 32 | 33 | 34 | args = read_params(sys.argv) 35 | 36 | if not os.path.isfile(args['input']): 37 | print("Error: file "+args['input']+' is not accessible!') 38 | sys.exit(1) 39 | 40 | if args['input'].endswith('.gz'): 41 | import gzip 42 | from functools import partial 43 | _open = partial(gzip.open, mode='rt') 44 | elif args['input'].endswith('.bz2'): 45 | import bz2 46 | from functools import partial 47 | 48 | _open = partial(bz2.open, mode='rt') 49 | else: 50 | _open = open 51 | 52 | 53 | 54 | min_len = args['min_len'] 55 | with open(args['output'],'w') as outf: 56 | 57 | with _open(args['input']) as f: 58 | for r in SeqIO.parse(f, "fastq"): 59 | 60 | avQual= np.mean(r.letter_annotations['phred_quality']) 61 | 62 | if avQual >= args['min_qual'] and len(r) >= min_len: 63 | if not args['no_anonim']: 64 | r.id = r.id+'_'+str(allCounter) 65 | 66 | counter+=1 67 | rpl.append(r) 68 | 69 | for qu in screenQual: 70 | if avQual >= qu: 71 | qualCounter[qu]+=1 72 | 73 | 74 | if len(rpl) % 30000 == 0: 75 | 76 | SeqIO.write(rpl, outf, "fastq") 77 | rpl=[] 78 | 79 | allCounter+=1 80 | 81 | 82 | if len(rpl) > 0: 83 | SeqIO.write(rpl, outf, "fastq") 84 | rpl=[] 85 | 86 | if args['count']: 87 | outCount = open(args['count'],'w') 88 | outCount.write(str(counter)+'\t'+str(allCounter)+'\t'+'\t'.join([str(ke)+':'+str(val) for ke,val in qualCounter.items()])) 89 | outCount.close() 90 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2019, Moreno Zolfo and Nicola Segata 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, 
copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | -------------------------------------------------------------------------------- /medians.csv: -------------------------------------------------------------------------------- 1 | parameter environmental human 2 | AMPHORA2 0.481685106521252 0.700892560801689 3 | rRNA_LSU 0.13211572831416 0.530267135901026 4 | rRNA_SSU 0.076231996313743 0.247254267758797 -------------------------------------------------------------------------------- /test/demo_SRR829034.fastq.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SegataLab/viromeqc/2939a6af18b5973bfcba1d668c255b7db285f150/test/demo_SRR829034.fastq.bz2 -------------------------------------------------------------------------------- /test/test.sh: -------------------------------------------------------------------------------- 1 | if [ -f demo_SRR829034.fastq.bz2 ]; then 2 | bunzip2 demo_SRR829034.fastq.bz2 3 | fi; 4 | ../viromeQC.py -i demo_SRR829034.fastq -o out.txt 5 | -------------------------------------------------------------------------------- /viromeQC.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env 
python3 2 | import os 3 | import sys 4 | import argparse 5 | import zipfile 6 | import time 7 | import tempfile 8 | import subprocess 9 | import pandas as pd 10 | 11 | 12 | 13 | 14 | __author__ = 'Moreno Zolfo (moreno.zolfo@unitn.it)' 15 | __version__ = '1.0.2' 16 | __date__ = '15 Nov. 2022' 17 | 18 | 19 | 20 | def byte_to_megabyte(byte): 21 | """ 22 | Convert byte value to megabyte 23 | """ 24 | 25 | return byte / (1024.0**2) 26 | 27 | 28 | class ReportHook(): 29 | def __init__(self): 30 | self.start_time = time.time() 31 | 32 | def report(self, blocknum, block_size, total_size): 33 | """ 34 | Print download progress message 35 | """ 36 | 37 | if blocknum == 0: 38 | self.start_time = time.time() 39 | if total_size > 0: 40 | sys.stderr.write("Downloading file of size: {:.2f} MB\n" 41 | .format(byte_to_megabyte(total_size))) 42 | else: 43 | total_downloaded = blocknum * block_size 44 | status = "{:3.2f} MB ".format(byte_to_megabyte(total_downloaded)) 45 | 46 | if total_size > 0: 47 | percent_downloaded = total_downloaded * 100.0 / total_size 48 | # use carriage return plus sys.stderr to overwrite stderr 49 | download_rate = total_downloaded / (time.time() - self.start_time) 50 | estimated_time = (total_size - total_downloaded) / download_rate 51 | estimated_minutes = int(estimated_time / 60.0) 52 | estimated_seconds = estimated_time - estimated_minutes * 60.0 53 | status += ("{:3.2f} % {:5.2f} MB/sec {:2.0f} min {:2.0f} sec " 54 | .format(percent_downloaded, 55 | byte_to_megabyte(download_rate), 56 | estimated_minutes, estimated_seconds)) 57 | 58 | status += " \r" 59 | sys.stderr.write(status) 60 | 61 | 62 | def download(url, download_file): 63 | """ 64 | Download a file from a url 65 | """ 66 | # try to import urllib.request.urlretrieve for python3 67 | try: 68 | from urllib.request import urlretrieve 69 | except ImportError: 70 | from urllib import urlretrieve 71 | 72 | if not os.path.isfile(download_file): 73 | try: 74 | sys.stderr.write("\nDownloading " + url + 
"\n") 75 | file, headers = urlretrieve(url, download_file, 76 | reporthook=ReportHook().report) 77 | except EnvironmentError: 78 | sys.stderr.write("\nWarning: Unable to download " + url + "\n") 79 | else: 80 | sys.stderr.write("\nFile {} already present!\n".format(download_file)) 81 | 82 | 83 | 84 | def print_version(): 85 | print ("Version:\t"+__version__) 86 | print ("Author:\t\t"+__author__) 87 | print ("Software:\t"+'Virome QC') 88 | sys.exit(0) 89 | 90 | class bcolors: 91 | HEADER = '\033[95m' 92 | OKBLUE = '\033[94m' 93 | OKGREEN = '\033[92m' 94 | WARNING = '\033[93m' 95 | FAIL = '\033[91m' 96 | ENDC = '\033[0m' 97 | OKGREEN2 = '\033[42m\033[30m' 98 | RED = '\033[1;91m' 99 | CYAN = '\033[0;37m' 100 | 101 | 102 | def fancy_print(mesg,label,type,reline=False,newLine=False): 103 | opening = "\r" if reline else '' 104 | ending = "\r\n" if not reline or newLine else '' 105 | 106 | if len(mesg) < 65: 107 | 108 | sys.stdout.write(opening+mesg.ljust(66)+(type+'[ - '+label.center(5)+' - ]'+bcolors.ENDC).ljust(14)+ending) 109 | else: 110 | c=0 111 | wds = [] 112 | lines=[] 113 | for word in mesg.split(' '): 114 | 115 | if c + len(word)+2 > 65: 116 | print (' '.join(wds)) 117 | c=0 118 | wds=[word] 119 | continue 120 | c = c+len(word)+2 121 | wds.append(word) 122 | sys.stdout.write(opening+(' '.join(wds)).ljust(66)+(type+'[ - '+label.center(5)+' - ]'+bcolors.ENDC).ljust(14)+ending) 123 | 124 | sys.stdout.flush() 125 | 126 | 127 | def check_install(req_dmd_db_filename,source): 128 | 129 | 130 | try: 131 | #download indexes if you don't have it 132 | to_download=[] 133 | fancy_print("Checking Database Files",'...',bcolors.OKBLUE,reline=True) 134 | 135 | if not os.path.isdir(INDEX_PATH): 136 | os.mkdir(INDEX_PATH) 137 | 138 | 139 | remote_links = { 140 | 'dropbox': { \ 141 | 'silva_LSU_clean' : ['https://www.dropbox.com/s/c0nbhkw0ww3lm97/SILVA_132_LSURef_tax_silva.clean.zip?dl=1'], \ 142 | 'silva_SSU_clean': [ 143 | 
'https://www.dropbox.com/s/mb5a0g7utmcupje/SILVA_132_SSURef_Nr99_tax_silva.clean_1.zip?dl=1', \ 144 | 'https://www.dropbox.com/s/qqqokke8r26e8ve/SILVA_132_SSURef_Nr99_tax_silva.clean_2.zip?dl=1', \ 145 | 'https://www.dropbox.com/s/idmbwbavqalse9q/SILVA_132_SSURef_Nr99_tax_silva.clean_3.zip?dl=1' 146 | ], 147 | 'amph_dmd': [ 148 | 'https://www.dropbox.com/s/rfer26hdoj3nsm0/amphora_bacteria.dmnd.zip?dl=1', \ 149 | 'https://www.dropbox.com/s/43nu0l6zkiw2las/amphora_bacteria_294.dmnd.zip?dl=1' 150 | ] 151 | }, 152 | 'zenodo': { \ 153 | 'silva_LSU_clean' : ['https://zenodo.org/record/4020594/files/SILVA_132_LSURef_tax_silva_clean.zip?download=1'], \ 154 | 'silva_SSU_clean': ['https://zenodo.org/record/4020594/files/SILVA_132_SSURef_Nr99_tax_silva.clean.zip?download=1'], 155 | 'amph_dmd': ['https://zenodo.org/record/4020594/files/amphora_markers.zip?download=1'] 156 | }, 157 | } 158 | 159 | if source in remote_links: 160 | if not os.path.isfile(INDEX_PATH+'/SILVA_132_LSURef_tax_silva.clean.1.bt2'): 161 | to_download.append(remote_links[source]['silva_LSU_clean']) 162 | 163 | if not os.path.isfile(INDEX_PATH+'/SILVA_132_SSURef_Nr99_tax_silva.clean.1.bt2'): 164 | to_download.append(remote_links[source]['silva_SSU_clean']) 165 | 166 | if not os.path.isfile(INDEX_PATH+'/'+req_dmd_db_filename): 167 | to_download.append(remote_links[source]['amph_dmd']) 168 | 169 | fancy_print("Checking Database Files",'OK',bcolors.OKGREEN,reline=True,newLine=True) 170 | 171 | 172 | if(to_download): 173 | to_download = [_ for grp in to_download for _ in grp] 174 | fancy_print("Using {} as download source for ViromeQC db".format(source),'...',bcolors.OKBLUE,newLine=True) 175 | fancy_print("Need to download {} files".format(len(to_download)),'...',bcolors.OKBLUE,reline=True) 176 | for downloadable in to_download: 177 | 178 | download(downloadable, INDEX_PATH+'/tmp.zip') 179 | zipDB = zipfile.ZipFile(INDEX_PATH+'/tmp.zip', 'r') 180 | zipDB.extractall(INDEX_PATH) 181 | zipDB.close() 182 | 
os.remove(INDEX_PATH+'/tmp.zip') 183 | fancy_print("Uncompressing DB ({} files)".format(len(to_download)),'DONE',bcolors.OKGREEN,reline=True,newLine=True) 184 | 185 | 186 | except IOError: 187 | print("Failed to retrieve DB") 188 | fancy_print("Failed to retrieve DB",'FAIL',bcolors.FAIL) 189 | sys.exit(1) 190 | 191 | 192 | 193 | def no_fq_extension(string): 194 | z=[] 195 | for p in string.split('.'): 196 | if any([q in p for q in ['bz2','fq','fastq','gz']]): continue 197 | else: 198 | z.append(p) 199 | 200 | return '.'.join(z) 201 | 202 | 203 | 204 | try: 205 | from Bio import SeqIO 206 | from Bio.Seq import Seq 207 | from Bio.SeqRecord import SeqRecord 208 | except ImportError as e: 209 | fancy_print("Failed in importing Biopython. Please check Biopython is installed properly on your system!",'FAIL',bcolors.FAIL) 210 | sys.exit(1) 211 | 212 | try: 213 | import pysam 214 | except ImportError as e: 215 | fancy_print("Failed in importing pysam. Please check pysam is installed properly on your system!",'FAIL',bcolors.FAIL) 216 | sys.exit(1) 217 | 218 | 219 | CHECKER_PATH=os.path.abspath(os.path.dirname(os.path.realpath(__file__))) 220 | INDEX_PATH=CHECKER_PATH+"/index/" 221 | LIMIT_OF_DETECTION = 1e-6 222 | 223 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, 224 | description='Checks a virome FASTQ file for enrichment efficiency') 225 | 226 | parser.add_argument("-i","--input", required=all([x not in sys.argv for x in ['--install','--version']]), nargs="*", help="Raw Reads in FASTQ format. 
Supports multiple inputs (plain, gz or bz2)") 227 | parser.add_argument("-o",'--output',required=all([x not in sys.argv for x in ['--install','--version']]), help="output file") 228 | 229 | parser.add_argument("--minlen", help="Minimum Read Length allowed",default='75') 230 | parser.add_argument("--minqual", help="Minimum Read Average Phred quality",default='20') 231 | 232 | parser.add_argument("--minlen_SSU", help="Minimum alignment length when considering SSU rRNA gene",default='50') 233 | parser.add_argument("--minlen_LSU", help="Minimum alignment length when considering LSU rRNA gene",default='50') 234 | 235 | parser.add_argument("--bowtie2_threads", help="Number of Threads to use with Bowtie2",default='4') 236 | parser.add_argument("--diamond_threads", help="Number of Threads to use with Diamond",default='4') 237 | 238 | parser.add_argument("-w","--enrichment_preset", choices=['human','environmental'], help="Calculate the enrichment basing on human or environmental metagenomes. Default: human-microbiome",default='human') 239 | parser.add_argument('--medians', type=str, default=CHECKER_PATH+'/medians.csv', help="File containing reference medians to calculate the enrichment. Default is medians.csv in the script directory. 
You can specify a different file with this parameter.") 240 | 241 | parser.add_argument('--bowtie2_path', type=str, default='bowtie2', 242 | help="Full path to the bowtie2 command to use, default assumes " 243 | "that bowtie2 is present in the system path") 244 | parser.add_argument('--diamond_path', type=str, default='diamond', 245 | help="Full path to the diamond command to use, default assumes " 246 | "that diamond is present in the system path") 247 | 248 | 249 | 250 | parser.add_argument("--version", help="Prints version information", action='store_true') 251 | parser.add_argument("--debug", help="Prints error messages in case of debug", action='store_true') 252 | 253 | parser.add_argument("--install", help="Downloads database files", action='store_true') 254 | parser.add_argument("--zenodo", help="Use Zenodo instead of Dropbox to download the DB", action='store_true') 255 | parser.add_argument("--sample_name", help="Optional label for the sample to be included in the output file") 256 | parser.add_argument("--tempdir", help="Temporary Directory override (default is the system temp directory)") 257 | 258 | 259 | args=parser.parse_args() 260 | 261 | medians = pd.read_csv(args.medians,sep='\t') 262 | dwl_source = 'zenodo' if args.zenodo else 'dropbox' 263 | 264 | try: 265 | diamond_command = [args.diamond_path,'--version'] 266 | 267 | with open(os.devnull, 'w') as devnull: 268 | ps1 = subprocess.Popen(diamond_command, stdout=subprocess.PIPE,stderr=devnull) 269 | 270 | dmd_version = str(ps1.communicate()[0].strip()).strip("'").split(' ')[-1] 271 | 272 | dmd_v_split = dmd_version.split('.') 273 | 274 | if int(dmd_v_split[1]) == 9 and int(dmd_v_split[2]) < 19: 275 | req_dmd_db_filename='amphora_bacteria.dmnd' 276 | else: 277 | req_dmd_db_filename='amphora_bacteria_294.dmnd' 278 | except Exception: 279 | fancy_print("Failed to detect diamond version",'FAIL',bcolors.FAIL) 280 | sys.exit(1) 281 | 282 | if args.version: print_version() 283 | 284 | 285 | 286 | if args.install: 287 
| check_install(req_dmd_db_filename, source=dwl_source) 288 | sys.exit(0) 289 | 290 | #pre-flight check 291 | for inputFile in args.input: 292 | if not os.path.isfile(inputFile): 293 | fancy_print("Error: file "+inputFile+" does not exist",'ERROR',bcolors.FAIL) 294 | 295 | commands = [['zcat', '-h'],['bzcat', '-h'],[args.bowtie2_path, '-h'],[args.diamond_path, 'help']] 296 | 297 | 298 | for sw in commands: 299 | try: 300 | with open(os.devnull, 'w') as devnull: 301 | subprocess.check_call(sw, stdout=devnull, stderr=devnull) 302 | 303 | except Exception as e: 304 | fancy_print("Error, command not found: "+sw[0],'ERROR',bcolors.FAIL) 305 | 306 | check_install(req_dmd_db_filename,source=dwl_source) 307 | 308 | 309 | if args.tempdir: 310 | tempfile.tempdir=args.tempdir 311 | 312 | try: 313 | tmpdir = tempfile.TemporaryDirectory() 314 | tmpdirname = tmpdir.name 315 | except Exception as e: 316 | fancy_print("Could not create temp folder in "+str(tempfile.tempdir),'FAIL',bcolors.FAIL) 317 | sys.exit(1) 318 | 319 | if len(args.input) > 1: 320 | fancy_print('Merging '+str(len(args.input))+' files','...',bcolors.OKBLUE,reline=True) 321 | with open(tmpdirname+'/combined.fastq','a') as combinedFastq: 322 | for infile in args.input: 323 | #print(['cat',infile,'>>',tmpdirname+'/combined.fastq']) 324 | if infile.endswith('.gz'): 325 | uncompression_cmd = 'zcat' 326 | elif infile.endswith('.bz2'): 327 | uncompression_cmd = 'bzcat' 328 | else: 329 | uncompression_cmd = 'cat' 330 | 331 | subprocess.check_call([uncompression_cmd,infile],stdout=combinedFastq) 332 | 333 | inputFile = tmpdirname+'/combined.fastq' 334 | workingName = args.sample_name if args.sample_name else ','.join([ no_fq_extension(os.path.basename(x)) for x in args.input]) 335 | 336 | fancy_print('Merging '+str(len(args.input))+' files','DONE',bcolors.OKGREEN,reline=True,newLine=True) 337 | else: 338 | inputFile = args.input[0] 339 | workingName = args.sample_name if args.sample_name else 
no_fq_extension(os.path.basename(inputFile)) 340 | 341 | 342 | 343 | fileName = no_fq_extension(os.path.basename(inputFile)) 344 | 345 | 346 | fastq_len_cmd = [CHECKER_PATH+'/fastq_len_filter.py', '--min_len',args.minlen, '--min_qual',args.minqual,'--count',tmpdirname+'/'+fileName+'.nreads','-i',inputFile,'-o',tmpdirname+'/'+fileName+'.filter.fastq'] 347 | try: 348 | fancy_print('[fastq_len_filter] | filtering HQ reads','...',bcolors.OKBLUE,reline=True) 349 | 350 | subprocess.check_call(fastq_len_cmd) 351 | 352 | 353 | with open(tmpdirname+'/'+fileName+'.nreads') as readCounts: 354 | HQReads, totalReads = [line.strip().split('\t')[0:2] for line in readCounts][0] 355 | 356 | 357 | filteredFile = tmpdirname+'/'+fileName+'.filter.fastq' 358 | fancy_print('[fastq_len_filter] | '+HQReads+' / '+totalReads+' ('+str(round(float(HQReads)/float(totalReads),2)*100)+'%) reads selected','DONE',bcolors.OKGREEN,reline=True,newLine=True) 359 | 360 | except Exception as e: 361 | fancy_print('Fatal error running fastq_len_filter. 
Error message: '+str(e),'FAIL',bcolors.FAIL) 362 | sys.exit(1) 363 | 364 | 365 | try: 366 | 367 | fancy_print('[SILVA_SSU] | Bowtie2 Aligning','...',bcolors.OKBLUE,reline=True) 368 | 369 | 370 | bt2_command = [args.bowtie2_path,'--quiet','-p',args.bowtie2_threads,'--very-sensitive-local','-x',INDEX_PATH+'/SILVA_132_SSURef_Nr99_tax_silva.clean','--no-unal','-U',filteredFile,'-S','-'] 371 | if args.debug: print(' '.join(bt2_command)) 372 | p4 = subprocess.Popen(['wc','-l'], stdin=subprocess.PIPE,stdout=subprocess.PIPE) 373 | p3 = subprocess.Popen(['samtools','view','-'], stdin=subprocess.PIPE,stdout=p4.stdin) 374 | p2 = subprocess.Popen([CHECKER_PATH+'/cmseq/cmseq/filter.py','--minlen',args.minlen_SSU,'--minqual','20','--maxsnps','0.075'],stdin=subprocess.PIPE,stdout=p3.stdin) 375 | p1 = subprocess.Popen(bt2_command, stdout=p2.stdin) 376 | 377 | p1.wait() 378 | p2.communicate() 379 | p3.communicate() 380 | 381 | SSU_reads = int(p4.communicate()[0]) 382 | SSU_reads_rate = max(LIMIT_OF_DETECTION,float(SSU_reads)/float(HQReads)*100) 383 | enrichment_SSU = min(100,float(medians.loc[medians['parameter']=='rRNA_SSU',args.enrichment_preset]) / float(SSU_reads_rate)) 384 | 385 | fancy_print('[SILVA_SSU] | Bowtie2 Alignment rate: '+str(round(SSU_reads_rate,4))+'% (~'+str(round(enrichment_SSU,1))+'x)','DONE',bcolors.OKGREEN,reline=True,newLine=True) 386 | 387 | if(SSU_reads_rate <= LIMIT_OF_DETECTION): 388 | fancy_print('[SILVA_SSU] | Value is below limit-of-detection ('+str(SSU_reads)+' SSU reads)','!!',bcolors.WARNING) 389 | 390 | except Exception as e: 391 | fancy_print('Fatal error running Bowtie2 on SSU rRNA. 
Error message: '+str(e),'FAIL',bcolors.FAIL) 392 | sys.exit(1) 393 | 394 | 395 | try: 396 | 397 | fancy_print('[SILVA_LSU] | Bowtie2 Aligning','...',bcolors.OKBLUE,reline=True) 398 | 399 | bt2_command = [args.bowtie2_path,'--quiet','-p',args.bowtie2_threads,'--very-sensitive-local','-x',INDEX_PATH+'/SILVA_132_LSURef_tax_silva.clean','--no-unal','-U',filteredFile,'-S','-'] 400 | if args.debug: print(' '.join(bt2_command)) 401 | 402 | 403 | p4 = subprocess.Popen(['wc','-l'], stdin=subprocess.PIPE,stdout=subprocess.PIPE) 404 | p3 = subprocess.Popen(['samtools','view','-'], stdin=subprocess.PIPE,stdout=p4.stdin) 405 | p2 = subprocess.Popen([CHECKER_PATH+'/cmseq/cmseq/filter.py','--minlen',args.minlen_LSU,'--minqual','20','--maxsnps','0.075'],stdin=subprocess.PIPE,stdout=p3.stdin) 406 | p1 = subprocess.Popen(bt2_command, stdout=p2.stdin) 407 | 408 | p1.wait() 409 | p2.communicate() 410 | p3.communicate() 411 | 412 | LSU_reads = int(p4.communicate()[0]) 413 | LSU_reads_rate = max(LIMIT_OF_DETECTION,float(LSU_reads)/float(HQReads)*100) 414 | 415 | enrichment_LSU = min(100,float(medians.loc[medians['parameter']=='rRNA_LSU',args.enrichment_preset]) / float(LSU_reads_rate)) 416 | 417 | fancy_print('[SILVA_LSU] | Bowtie2 Alignment rate: '+str(round(LSU_reads_rate,4))+'% (~'+str(round(enrichment_LSU,1))+'x)','DONE',bcolors.OKGREEN,reline=True,newLine=True) 418 | 419 | 420 | if(LSU_reads_rate <= LIMIT_OF_DETECTION): 421 | fancy_print('[SILVA_LSU] | Value is below limit-of-detection ('+str(LSU_reads)+' LSU reads)','!!',bcolors.WARNING) 422 | 423 | 424 | 425 | except Exception as e: 426 | fancy_print('Fatal error running Bowtie2 on LSU rRNA. 
Error message: '+str(e),'FAIL',bcolors.FAIL) 427 | sys.exit(1) 428 | 429 | 430 | try: 431 | 432 | fancy_print('[SC-Markers] | Diamond Aligning','...',bcolors.OKBLUE,reline=True) 433 | 434 | diamond_command = [args.diamond_path,'blastx','-q',filteredFile,'--threads',args.diamond_threads,'--outfmt','6','--db',INDEX_PATH+'/'+req_dmd_db_filename,'--id','50','--max-hsps','35','-k','0','--quiet'] 435 | p2 = subprocess.Popen('cut -f1 | sort | uniq | wc -l',shell=True, stdin=subprocess.PIPE,stdout=subprocess.PIPE) 436 | 437 | if args.debug: 438 | p1 = subprocess.Popen(diamond_command, stdout=p2.stdin) 439 | else: 440 | if args.debug: 441 | p1 = subprocess.Popen(diamond_command, stdout=p2.stdin) 442 | else: 443 | with open(os.devnull, 'w') as devnull: 444 | p1 = subprocess.Popen(diamond_command, stdout=p2.stdin,stderr=devnull) 445 | 446 | singleCopyMarkers_reads = int(p2.communicate()[0]) 447 | singleCopyMarkers_reads_rate = max(LIMIT_OF_DETECTION,float(singleCopyMarkers_reads)/float(HQReads)*100) 448 | 449 | enrichment_singleCopyMarkers = min(100,float(medians.loc[medians['parameter']=='AMPHORA2',args.enrichment_preset]) / float(singleCopyMarkers_reads_rate)) 450 | 451 | fancy_print('[SC-Markers] | Diamond Alignment rate: '+str(round(singleCopyMarkers_reads_rate,4))+'% (~'+str(round(enrichment_singleCopyMarkers,1))+'x)','DONE',bcolors.OKGREEN,reline=True,newLine=True) 452 | 453 | if(singleCopyMarkers_reads_rate <= LIMIT_OF_DETECTION): 454 | fancy_print('[SC-Markers] | Value is below limit-of-detection ('+str(singleCopyMarkers_reads)+' reads)','!!',bcolors.WARNING) 455 | 456 | except Exception as e: 457 | fancy_print('Fatal error running Diamond on Single-Copy-Proteins. 
Error message: '+str(e),'FAIL',bcolors.FAIL) 458 | sys.exit(1) 459 | 460 | overallEnrichmentScore = min(enrichment_SSU,enrichment_LSU,enrichment_singleCopyMarkers) 461 | to_out=[workingName,totalReads,HQReads,SSU_reads_rate,LSU_reads_rate,singleCopyMarkers_reads_rate,overallEnrichmentScore] 462 | 463 | 464 | outFile = open(args.output,'w') 465 | outFile.write("Sample\tReads\tReads_HQ\tSSU rRNA alignment rate\tLSU rRNA alignment rate\tBacterial_Markers alignment rate\ttotal enrichmnet score\n") 466 | outFile.write('\t'.join([str(x) for x in to_out])+'\n') 467 | outFile.close() 468 | 469 | fancy_print('Finished','',bcolors.ENDC) 470 | fancy_print(' | Overall Enrichment Score: ~'+str(round(overallEnrichmentScore,1))+'x','.',bcolors.ENDC) 471 | fancy_print(' | Output File: '+args.output,'.',bcolors.ENDC) 472 | fancy_print('Have a nice day! ','DONE',bcolors.OKGREEN) 473 | 474 | tmpdir.cleanup() 475 | --------------------------------------------------------------------------------
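The score arithmetic at the end of `viromeQC.py` can be sketched in isolation, without running Bowtie2 or Diamond. The block below is a minimal sketch and not part of the repository: the function name `enrichment_score` and the `MEDIANS` dictionary layout are hypothetical. The median values are copied from `medians.csv`, and the `1e-6` floor and the cap of 100 mirror `LIMIT_OF_DETECTION` and the `min(100, ...)` calls in the script. Input rates are assumed to be percentages of high-quality reads, as in the report file.

```python
# Reference medians (percent of reads aligned), copied from medians.csv.
# The nested-dict layout is an assumption made for this sketch.
MEDIANS = {
    "human": {
        "AMPHORA2": 0.700892560801689,
        "rRNA_LSU": 0.530267135901026,
        "rRNA_SSU": 0.247254267758797,
    },
    "environmental": {
        "AMPHORA2": 0.481685106521252,
        "rRNA_LSU": 0.13211572831416,
        "rRNA_SSU": 0.076231996313743,
    },
}

LIMIT_OF_DETECTION = 1e-6  # floor applied to every alignment rate
MAX_ENRICHMENT = 100       # cap applied to every per-marker enrichment


def enrichment_score(rates, preset="human"):
    """Return (overall score, per-marker enrichments).

    `rates` maps the marker names used in medians.csv to the observed
    alignment rate in percent. The overall score is the minimum of the
    per-marker enrichments, as in viromeQC.py.
    """
    per_marker = {}
    for marker, rate in rates.items():
        # Avoid division by zero when no read aligned to this marker set.
        floored = max(LIMIT_OF_DETECTION, float(rate))
        per_marker[marker] = min(MAX_ENRICHMENT, MEDIANS[preset][marker] / floored)
    return min(per_marker.values()), per_marker


if __name__ == "__main__":
    # Illustrative rates only; these are not measurements from a real sample.
    score, detail = enrichment_score(
        {"rRNA_SSU": 0.01, "rRNA_LSU": 0.02, "AMPHORA2": 0.05}
    )
    print("overall enrichment: ~{:.1f}x".format(score))
    for marker, value in sorted(detail.items()):
        print("  {}: ~{:.1f}x".format(marker, value))
```

Taking the minimum across the three marker sets makes the score conservative: a sample is reported as well enriched only if rRNA genes and single-copy markers are all depleted relative to the reference metagenomes.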