├── .gitmodules
├── README.md
├── fastq_len_filter.py
├── license.txt
├── medians.csv
├── test
    ├── demo_SRR829034.fastq.bz2
    └── test.sh
└── viromeQC.py


/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "cmseq"]
2 | 	path = cmseq
3 | 	url = https://github.com/SegataLab/cmseq.git
4 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # ViromeQC #
  2 |  
  3 | ## Description ##
  4 |  
  5 | * Provides an enrichment score for VLP viromes with respect to metagenomes
  6 | * Useful benchmark for the quality of enrichment of a virome
  7 | * Tested on Linux Ubuntu Server 16.04 LTS and on Linux Mint 19
  8 | 
  9 | **Requires:**
 10 | 
 11 | * [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) >= v. 2.3.4
 12 | * [Samtools](http://samtools.sourceforge.net/) >= 1.3.1
 13 | * [Biopython](https://github.com/biopython/biopython) >= 1.69
 14 | * [Pysam](http://pysam.readthedocs.io/en/latest/) >= 0.14
 15 | * [Diamond](http://github.com/bbuchfink/diamond) (tested on v.0.9.9 and 0.9.29)
 16 | * Python3 (tested on 3.6)
 17 | * [pandas](https://pandas.pydata.org) >= 0.20
 18 | 
 19 | **Update:** _ViromeQC_ now works with newer versions of diamond (e.g. v0.9.29) 
 20 | Thanks to Ryan Cook ([@RyanCookAMR](https://twitter.com/RyanCookAMR)) for the new diamond db
 21 | 
 22 | ## Usage ##
 23 | 
 24 | ### Step 1: clone or download the repository ###
 25 | 
 26 | `git clone --recurse-submodules https://github.com/SegataLab/viromeqc.git`
 27 | 
 28 | or download the repository from the **[releases](https://github.com/SegataLab/viromeqc/releases)** page
 29 | 
 30 | ### Step 2: install the database: ###
 31 | 
 32 | This step downloads the database file. It only needs to be done the first time you run ViromeQC and may take a few minutes, depending on your internet connection.
 33 | 
 34 | `viromeQC.py --install`
 35 | 
 36 | Alternatively, you can download the database files from [Zenodo](https://zenodo.org/record/4020594#.X1jxgGMzZDM). Once the files are downloaded, create a folder named `index/` in the ViromeQC installation folder and unzip all the files into it.
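
The manual installation described above can be sketched in Python as follows. This is only an illustration, not part of ViromeQC: the helper name `unpack_db_archives` and the folder names are hypothetical, and the sketch assumes the Zenodo archives are ordinary `.zip` files that have already been downloaded.

```python
import zipfile
from pathlib import Path


def unpack_db_archives(archive_dir, install_dir):
    """Extract every .zip found in archive_dir into <install_dir>/index/.

    Hypothetical helper for illustration; ViromeQC itself does this
    automatically when run with --install.
    """
    index_dir = Path(install_dir) / "index"
    index_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(index_dir)
            extracted.extend(zf.namelist())
    return index_dir, extracted
```

For example, `unpack_db_archives("downloads", "/path/to/viromeqc")` would place the unzipped files in `/path/to/viromeqc/index/` (both paths are placeholders).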
 37 | 
 38 | ### Step 3: Run on your sample ###
 39 | 
 40 | `viromeQC.py -i <input_virome_file(s)> -o <report_file.txt>`
 41 | 
 42 | *Please Note:* 
 43 | You can pass more than one file as input (e.g. for multiple runs or paired end reads). However, you can process only one sample at a time with this command. If you want to parallelize the execution, this can be easily done with [Parallel](https://www.gnu.org/software/parallel/) or equivalent tools.
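
As an alternative to GNU Parallel, several samples can be processed concurrently from Python. The sketch below is a minimal example under stated assumptions: `viromeQC.py` is reachable through the system path, and the helper names `build_command` and `run_samples` and the `<sample>_report.txt` naming scheme are hypothetical. Only the `-i`, `-o` and `--sample_name` options of ViromeQC are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(sample_name, fastq_files, out_dir="."):
    """Build the viromeQC.py command line for one sample (hypothetical helper)."""
    report = "{}/{}_report.txt".format(out_dir, sample_name)
    return (["viromeQC.py", "-i"] + list(fastq_files)
            + ["-o", report, "--sample_name", sample_name])


def run_samples(samples, max_workers=2):
    """Run one viromeQC.py process per sample, max_workers at a time.

    `samples` maps a sample name to the list of its FASTQ files.
    Returns a dict mapping each sample name to the process exit code.
    """
    def run_one(item):
        name, files = item
        return name, subprocess.run(build_command(name, files)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, samples.items()))
```

Each ViromeQC run starts Bowtie2 and Diamond with 4 threads by default, so `max_workers` should be chosen with the number of available cores in mind.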
 44 | 
 45 | You can try the test example (`test/test.sh`), which analyzes 10,000 reads from the sample `SRR829034`. This should take approximately 1 or 2 minutes.
 46 | 
 47 | Parameters:
 48 | 
 49 | ```
 50 | usage: viromeQC.py -i <input_virome_file> -o <report_file.txt>
 51 | 
 52 | optional arguments:
 53 |   -h, --help            show this help message and exit
 54 |   -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
 55 |                         Raw Reads in FASTQ format. Supports multiple inputs
 56 |                         (plain, gz or bz2) (default: None)
 57 |   -o OUTPUT, --output OUTPUT
 58 |                         output file (default: None)
 59 |   --minlen MINLEN       Minimum Read Length allowed (default: 75)
 60 |   --minqual MINQUAL     Minimum Read Average Phred quality (default: 20)
 61 |   --bowtie2_threads BOWTIE2_THREADS
 62 |                         Number of Threads to use with Bowtie2 (default: 4)
 63 |   --diamond_threads DIAMOND_THREADS
 64 |                         Number of Threads to use with Diamond (default: 4)
 65 |   -w {human,environmental}, --enrichment_preset {human,environmental}
 66 |                         Calculate the enrichment based on human or
 67 |                         environmental metagenomes. Default: human-microbiome
 68 |                         (default: human)
 69 |   --bowtie2_path BOWTIE2_PATH
 70 |                         Full path to the bowtie2 command to use, default
 71 |                         assumes that bowtie2 is present in the system path
 72 |                         (default: bowtie2)
 73 |   --diamond_path DIAMOND_PATH
 74 |                         Full path to the diamond command to use, default
 75 |                         assumes that diamond is present in the system path
 76 |                         (default: diamond)
 77 |   --version             Prints version information (default: False)
 78 |   --install             Downloads database files (default: False)
 79 |   --sample_name SAMPLE_NAME
 80 |                         Optional label for the sample to be included in the
 81 |                         output file (default: None)
 82 |   --tempdir TEMPDIR     Temporary Directory override (default is the system
 83 |                         temp. directory) (default: None)
 84 | ```
 85 | 
 86 | ### Pipeline structure ###
 87 | 
 88 | ViromeQC starts from FASTQ files (compressed files are supported), and will:
 89 | 
 90 | 1. Eliminate short and low-quality reads
 91 |     - *adjust the `minqual` and `minlen` parameters if you want to change the thresholds*
 92 | 2. Map the reads against a curated collection of rRNAs and single-copy bacterial markers
 93 | 3. Filter the reads to remove short and divergent alignments
 94 | 4. Compute the enrichment value of the sample, compared to the median observed in human metagenomes
 95 |     - use `-w environmental` for environmental reads
 96 |     - reference medians for un-enriched metagenomes are taken from `medians.csv`; you can provide your own data to ViromeQC by changing this file accordingly or by pointing `--medians` to a different file
 97 | 5. Produce a report file with the alignment rates and the final enrichment score (which is the minimum enrichment observed across SSU-rRNA, LSU-rRNA and single-copy markers)
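
Steps 4 and 5 can be sketched as follows. The constants mirror those in `viromeQC.py` and the human column of `medians.csv` in this repository; the function names are hypothetical and the snippet is a simplified illustration of the arithmetic, not a replacement for the pipeline.

```python
LIMIT_OF_DETECTION = 1e-6  # lower bound for alignment rates (%), as in viromeQC.py
MAX_ENRICHMENT = 100       # each per-marker enrichment is capped at 100x

# Reference medians (%) for human metagenomes, copied from medians.csv
HUMAN_MEDIANS = {
    "rRNA_SSU": 0.247254267758797,
    "rRNA_LSU": 0.530267135901026,
    "AMPHORA2": 0.700892560801689,
}


def alignment_rate(aligned_reads, hq_reads):
    """Percentage of high-quality reads that aligned, floored at the limit of detection."""
    return max(LIMIT_OF_DETECTION, aligned_reads / hq_reads * 100)


def enrichment(rate, median):
    """Ratio between the metagenome median and the observed rate, capped at 100."""
    return min(MAX_ENRICHMENT, median / rate)


def overall_enrichment(aligned, hq_reads, medians=HUMAN_MEDIANS):
    """Final score: the minimum enrichment across all marker sets.

    `aligned` maps each marker set name to its number of aligned reads.
    """
    return min(
        enrichment(alignment_rate(aligned[name], hq_reads), medians[name])
        for name in medians
    )
```

Taking the minimum makes the score conservative: a sample only receives a high score if all three marker sets are depleted.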
 98 | 
 99 | 
100 | ### Output ###
101 | 
102 | Output is given as a TSV file with the following structure:
103 | 
104 | 
105 | |    Sample    |    Reads    |    Reads_HQ    |    SSU rRNA alignment (%)    |    LSU rRNA alignment (%)   |    Bacterial_Markers alignment (%)   |    total enrichmnet score
106 | |---|---|---|---|---|---|---|
107 | |    your_sample.fq | 40000 | 39479 | 0.00759898  | 0.0227969 | 0.01266496  | 5.795329
108 | 
109 | 
110 | - An enrichment score of 5.8 means that the virome is 5.8 times more enriched than a comparable metagenome
111 | - High scores (e.g. 10-50) reflect high VLP enrichment
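
A report can be loaded for downstream analysis with a few lines of Python. This is a minimal sketch that assumes the layout shown above (one header line, sample name in the first column, enrichment score in the last column). The helper name `read_report` is hypothetical, and columns are read by position so the sketch does not depend on the exact header spelling.

```python
import csv


def read_report(path):
    """Return (sample, enrichment_score) pairs from a ViromeQC report file.

    Assumes a tab-separated file with one header line, the sample name
    in the first column and the enrichment score in the last column.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    return [(row[0], float(row[-1])) for row in rows[1:] if row]
```

Calling `read_report("report_file.txt")` on the report of a single run returns a one-element list.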
112 | 
113 | 
114 | ## Citation ##
115 | 
116 | If you find this tool useful, please cite:
117 | 
118 | *Zolfo, M., Pinto, F., Asnicar, F., Manghi, P., Tett A., Segata N.* **[Detecting contamination in viromes using ViromeQC](https://www.nature.com/articles/s41587-019-0334-5)**, *Nature Biotechnology* 37, 1408–1412 (2019)
119 | 
120 | 


--------------------------------------------------------------------------------
/fastq_len_filter.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | import os
 4 | from Bio import SeqIO
 5 | import argparse as ap
 6 | import sys
 8 | import numpy as np 
 9 | 
10 | def read_params(args):
11 | 	p = ap.ArgumentParser(description = 'fastq_len_filter.py Parameters\n')
12 | 	p.add_argument('--min_len', required = True, default = None, type = int)
13 | 	p.add_argument('--min_qual', default = 0, type = int)
14 | 	p.add_argument('--no_anonim', action="store_true")
15 | 	p.add_argument('--count')
16 | 	p.add_argument('-i','--input', required = True, help="Input File (FASTQ)")
17 | 	p.add_argument('-o','--output', required = True, help="Output File (FASTQ)")
18 | 
19 | 
20 | 	return vars(p.parse_args())
21 | 
22 | screenQual=range(20,31)
23 | qualCounter = dict((k,0) for k in screenQual)
24 | 
25 | counter = 0
26 | allCounter = 0
27 | rpl=[]
28 | 
29 | 
30 | 
31 | if __name__ == '__main__':
32 | 
33 | 
34 | 	args = read_params(sys.argv)
35 | 
36 | 	if not os.path.isfile(args['input']):
37 | 		print("Error: file "+args['input']+' is not accessible!')
38 | 		sys.exit(1)
39 | 
40 | 	if args['input'].endswith('.gz'):
41 | 		import gzip
42 | 		from functools import partial
43 | 		_open = partial(gzip.open, mode='rt')
44 | 	elif args['input'].endswith('.bz2'):
45 | 		import bz2
46 | 		from functools import partial
47 | 
48 | 		_open = partial(bz2.open, mode='rt')
49 | 	else:
50 | 		_open = open
51 | 
52 | 
53 | 
54 | 	min_len = args['min_len']
55 | 	with open(args['output'],'w') as outf:
56 | 
57 | 		with _open(args['input']) as f:
58 | 			for r in SeqIO.parse(f, "fastq"):
59 | 
60 | 				avQual= np.mean(r.letter_annotations['phred_quality'])
61 | 				
62 | 				if avQual >= args['min_qual'] and len(r) >= min_len:	 
63 | 					if not args['no_anonim']:
64 | 						r.id = r.id+'_'+str(allCounter)
65 | 
66 | 					counter+=1
67 | 					rpl.append(r)
68 | 
69 | 					for qu in screenQual:
70 | 						if avQual >= qu:
71 | 							qualCounter[qu]+=1
72 | 
73 | 					
74 | 					if len(rpl) % 30000 == 0:
75 | 				
76 | 						SeqIO.write(rpl, outf, "fastq")
77 | 						rpl=[]
78 | 
79 | 				allCounter+=1
80 | 
81 | 
82 | 			if len(rpl) > 0:
83 | 				SeqIO.write(rpl, outf, "fastq")
84 | 				rpl=[]
85 | 				 
86 | 			if args['count']:
87 | 				outCount = open(args['count'],'w')
88 | 				outCount.write(str(counter)+'\t'+str(allCounter)+'\t'+'\t'.join([str(ke)+':'+str(val) for ke,val in qualCounter.items()]))
89 | 				outCount.close()
90 | 


--------------------------------------------------------------------------------
/license.txt:
--------------------------------------------------------------------------------
1 | Copyright (c) 2019, Moreno Zolfo and Nicola Segata
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8 | 


--------------------------------------------------------------------------------
/medians.csv:
--------------------------------------------------------------------------------
1 | parameter	environmental	human
2 | AMPHORA2	0.481685106521252	0.700892560801689
3 | rRNA_LSU	0.13211572831416	0.530267135901026
4 | rRNA_SSU	0.076231996313743	0.247254267758797


--------------------------------------------------------------------------------
/test/demo_SRR829034.fastq.bz2:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SegataLab/viromeqc/2939a6af18b5973bfcba1d668c255b7db285f150/test/demo_SRR829034.fastq.bz2


--------------------------------------------------------------------------------
/test/test.sh:
--------------------------------------------------------------------------------
1 | if [ -f demo_SRR829034.fastq.bz2 ]; then
2 | 	bunzip2 demo_SRR829034.fastq.bz2
3 | fi;
4 | ../viromeQC.py -i demo_SRR829034.fastq -o out.txt
5 | 


--------------------------------------------------------------------------------
/viromeQC.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | import os
  3 | import sys
  4 | import argparse
  5 | import zipfile
  6 | import time
  7 | import tempfile
  8 | import subprocess
  9 | import pandas as pd
 10 | 
 11 | 
 12 | 
 13 | 
 14 | __author__ = 'Moreno Zolfo (moreno.zolfo@unitn.it)'
 15 | __version__ = '1.0.2'
 16 | __date__ = '15 Nov. 2022'
 17 | 
 18 | 
 19 | 
 20 | def byte_to_megabyte(byte):
 21 |     """
 22 |     Convert byte value to megabyte
 23 |     """
 24 | 
 25 |     return byte / (1024.0**2)
 26 | 
 27 | 
 28 | class ReportHook():
 29 |     def __init__(self):
 30 |         self.start_time = time.time()
 31 | 
 32 |     def report(self, blocknum, block_size, total_size):
 33 |         """
 34 |         Print download progress message
 35 |         """
 36 | 
 37 |         if blocknum == 0:
 38 |             self.start_time = time.time()
 39 |             if total_size > 0:
 40 |                 sys.stderr.write("Downloading file of size: {:.2f} MB\n"
 41 |                                  .format(byte_to_megabyte(total_size)))
 42 |         else:
 43 |             total_downloaded = blocknum * block_size
 44 |             status = "{:3.2f} MB ".format(byte_to_megabyte(total_downloaded))
 45 | 
 46 |             if total_size > 0:
 47 |                 percent_downloaded = total_downloaded * 100.0 / total_size
 48 |                 # use carriage return plus sys.stderr to overwrite stderr
 49 |                 download_rate = total_downloaded / (time.time() - self.start_time)
 50 |                 estimated_time = (total_size - total_downloaded) / download_rate
 51 |                 estimated_minutes = int(estimated_time / 60.0)
 52 |                 estimated_seconds = estimated_time - estimated_minutes * 60.0
 53 |                 status += ("{:3.2f} %  {:5.2f} MB/sec {:2.0f} min {:2.0f} sec "
 54 |                            .format(percent_downloaded,
 55 |                                    byte_to_megabyte(download_rate),
 56 |                                    estimated_minutes, estimated_seconds))
 57 | 
 58 |             status += "        \r"
 59 |             sys.stderr.write(status)
 60 | 
 61 | 
 62 | def download(url, download_file):
 63 |     """
 64 |     Download a file from a url
 65 |     """
 66 |     # try to import urllib.request.urlretrieve for python3
 67 |     try:
 68 |         from urllib.request import urlretrieve
 69 |     except ImportError:
 70 |         from urllib import urlretrieve
 71 | 
 72 |     if not os.path.isfile(download_file):
 73 |         try:
 74 |             sys.stderr.write("\nDownloading " + url + "\n")
 75 |             file, headers = urlretrieve(url, download_file,
 76 |                                         reporthook=ReportHook().report)
 77 |         except EnvironmentError:
 78 |             sys.stderr.write("\nWarning: Unable to download " + url + "\n")
 79 |     else:
 80 |         sys.stderr.write("\nFile {} already present!\n".format(download_file))
 81 | 
 82 | 
 83 | 
 84 | def print_version():
 85 | 	print ("Version:\t"+__version__)
 86 | 	print ("Author:\t\t"+__author__)
 87 | 	print ("Software:\t"+'Virome QC')
 88 | 	sys.exit(0)
 89 | 
 90 | class bcolors:
 91 | 	HEADER = '\033[95m'
 92 | 	OKBLUE = '\033[94m'
 93 | 	OKGREEN = '\033[92m'
 94 | 	WARNING = '\033[93m'
 95 | 	FAIL = '\033[91m'
 96 | 	ENDC = '\033[0m'
 97 | 	OKGREEN2 = '\033[42m\033[30m'
 98 | 	RED = '\033[1;91m'
 99 | 	CYAN = '\033[0;37m'
100 | 
101 | 
102 | def fancy_print(mesg,label,type,reline=False,newLine=False):
103 | 	opening = "\r" if reline else ''
104 | 	ending = "\r\n" if not reline or newLine else ''
105 | 
106 | 	if len(mesg) < 65:
107 | 	
108 | 		sys.stdout.write(opening+mesg.ljust(66)+(type+'[ - '+label.center(5)+' - ]'+bcolors.ENDC).ljust(14)+ending)
109 | 	else: 
110 | 		c=0
111 | 		wds = []
112 | 		lines=[]
113 | 		for word in mesg.split(' '):
114 | 
115 | 				if c + len(word)+2 > 65:
116 | 					print (' '.join(wds))
117 | 					c=0
118 | 					wds=[word]
119 | 					continue
120 | 				c = c+len(word)+2
121 | 				wds.append(word)
122 | 		sys.stdout.write(opening+(' '.join(wds)).ljust(66)+(type+'[ - '+label.center(5)+' - ]'+bcolors.ENDC).ljust(14)+ending)
123 | 
124 | 	sys.stdout.flush()
125 | 
126 | 
127 | def check_install(req_dmd_db_filename,source):
128 |  
129 |   
130 | 	try:
131 | 		#download indexes if you don't have it
132 | 		to_download=[]
133 | 		fancy_print("Checking Database Files",'...',bcolors.OKBLUE,reline=True)
134 | 
135 | 		if not os.path.isdir(INDEX_PATH):
136 | 			os.mkdir(INDEX_PATH)
137 | 
138 | 
139 | 		remote_links = {
140 | 		'dropbox': { \
141 | 			'silva_LSU_clean' : ['https://www.dropbox.com/s/c0nbhkw0ww3lm97/SILVA_132_LSURef_tax_silva.clean.zip?dl=1'], \
142 | 		 	'silva_SSU_clean':  [
143 | 		 						'https://www.dropbox.com/s/mb5a0g7utmcupje/SILVA_132_SSURef_Nr99_tax_silva.clean_1.zip?dl=1', \
144 | 		 						'https://www.dropbox.com/s/qqqokke8r26e8ve/SILVA_132_SSURef_Nr99_tax_silva.clean_2.zip?dl=1', \
145 | 		 						'https://www.dropbox.com/s/idmbwbavqalse9q/SILVA_132_SSURef_Nr99_tax_silva.clean_3.zip?dl=1'
146 | 		 						],
147 | 		 	'amph_dmd': [
148 | 		 				'https://www.dropbox.com/s/rfer26hdoj3nsm0/amphora_bacteria.dmnd.zip?dl=1', \
149 | 		 				'https://www.dropbox.com/s/43nu0l6zkiw2las/amphora_bacteria_294.dmnd.zip?dl=1'
150 | 		 				]
151 | 				   },
152 | 		'zenodo': { \
153 | 					'silva_LSU_clean' : ['https://zenodo.org/record/4020594/files/SILVA_132_LSURef_tax_silva_clean.zip?download=1'], \
154 | 				 	'silva_SSU_clean':  ['https://zenodo.org/record/4020594/files/SILVA_132_SSURef_Nr99_tax_silva.clean.zip?download=1'],
155 | 				 	'amph_dmd': ['https://zenodo.org/record/4020594/files/amphora_markers.zip?download=1']
156 | 				   },
157 | 		}
158 | 
159 | 		if source in remote_links:
160 | 			if not os.path.isfile(INDEX_PATH+'/SILVA_132_LSURef_tax_silva.clean.1.bt2'):
161 | 				to_download.append(remote_links[source]['silva_LSU_clean'])
162 | 
163 | 			if not os.path.isfile(INDEX_PATH+'/SILVA_132_SSURef_Nr99_tax_silva.clean.1.bt2'):
164 | 				to_download.append(remote_links[source]['silva_SSU_clean'])
165 | 		 
166 | 			if not os.path.isfile(INDEX_PATH+'/'+req_dmd_db_filename):
167 | 				to_download.append(remote_links[source]['amph_dmd'])
168 | 
169 | 		fancy_print("Checking Database Files",'OK',bcolors.OKGREEN,reline=True,newLine=True)
170 | 
171 | 
172 | 		if(to_download):
173 | 			to_download = [_ for grp in to_download for _ in grp]
174 | 			fancy_print("Using {} as download source for ViromeQC db".format(source),'...',bcolors.OKBLUE,newLine=True)
175 | 			fancy_print("Need to download {} files".format(len(to_download)),'...',bcolors.OKBLUE,reline=True)
176 | 			for downloadable in to_download:
177 | 
178 | 				download(downloadable, INDEX_PATH+'/tmp.zip')
179 | 				zipDB = zipfile.ZipFile(INDEX_PATH+'/tmp.zip', 'r')
180 | 				zipDB.extractall(INDEX_PATH)
181 | 				zipDB.close()
182 | 				os.remove(INDEX_PATH+'/tmp.zip')
183 | 			fancy_print("Uncompressing DB ({} files)".format(len(to_download)),'DONE',bcolors.OKGREEN,reline=True,newLine=True)
184 | 		 
185 | 
186 | 	except IOError: 
187 | 		print("Failed to retrieve DB")
188 | 		fancy_print("Failed to retrieve DB",'FAIL',bcolors.FAIL)
189 | 		sys.exit(1)
190 | 
191 | 
192 | 
193 | def no_fq_extension(string):
194 | 	z=[]
195 | 	for p in string.split('.'):
196 | 		if p in ('bz2','fq','fastq','gz'): continue
197 | 		else: 
198 | 			z.append(p)
199 | 	
200 | 	return '.'.join(z)
201 | 
202 | 	
203 | 
204 | try:
205 | 	from Bio import SeqIO
206 | 	from Bio.Seq import Seq
207 | 	from Bio.SeqRecord import SeqRecord
208 | except ImportError as e:
209 | 	fancy_print("Failed in importing Biopython. Please check Biopython is installed properly on your system!",'FAIL',bcolors.FAIL)
210 | 	sys.exit(1)
211 | 
212 | try:
213 | 	import pysam
214 | except ImportError as e:
215 | 	fancy_print("Failed in importing pysam. Please check pysam is installed properly on your system!",'FAIL',bcolors.FAIL)
216 | 	sys.exit(1)
217 | 
218 | 
219 | CHECKER_PATH=os.path.abspath(os.path.dirname(os.path.realpath(__file__)))
220 | INDEX_PATH=CHECKER_PATH+"/index/"
221 | LIMIT_OF_DETECTION = 1e-6
222 | 
223 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter,
224 | 		description='Checks a virome FASTQ file for enrichment efficiency')
225 | 
226 | parser.add_argument("-i","--input", required=all([x not in sys.argv for x in ['--install','--version']]), nargs="*", help="Raw Reads in FASTQ format. Supports multiple inputs (plain, gz or bz2)")
227 | parser.add_argument("-o",'--output',required=all([x not in sys.argv for x in ['--install','--version']]), help="output file")
228 | 
229 | parser.add_argument("--minlen", help="Minimum Read Length allowed",default='75')
230 | parser.add_argument("--minqual", help="Minimum Read Average Phred quality",default='20')
231 | 
232 | parser.add_argument("--minlen_SSU", help="Minimum alignment length when considering SSU rRNA gene",default='50')
233 | parser.add_argument("--minlen_LSU", help="Minimum alignment length when considering LSU rRNA gene",default='50')
234 | 
235 | parser.add_argument("--bowtie2_threads", help="Number of Threads to use with Bowtie2",default='4') 
236 | parser.add_argument("--diamond_threads", help="Number of Threads to use with Diamond",default='4') 
237 | 
238 | parser.add_argument("-w","--enrichment_preset", choices=['human','environmental'], help="Calculate the enrichment based on human or environmental metagenomes. Default: human-microbiome",default='human')
239 | parser.add_argument('--medians', type=str, default=CHECKER_PATH+'/medians.csv', help="File containing reference medians to calculate the enrichment. Default is medians.csv in the script directory. You can specify a different file with this parameter.") 
240 | 
241 | parser.add_argument('--bowtie2_path', type=str, default='bowtie2',
242 |         help="Full path to the bowtie2 command to use, default assumes "
243 |              "that bowtie2 is present in the system path")
244 | parser.add_argument('--diamond_path', type=str, default='diamond',
245 |         help="Full path to the diamond command to use, default assumes "
246 |              "that diamond is present in the system path")
247 | 
248 | 
249 | 
250 | parser.add_argument("--version", help="Prints version information", action='store_true')
251 | parser.add_argument("--debug", help="Prints error messages in case of debug", action='store_true')
252 | 
253 | parser.add_argument("--install", help="Downloads database files", action='store_true')
254 | parser.add_argument("--zenodo", help="Use Zenodo instead of Dropbox to download the DB", action='store_true')
255 | parser.add_argument("--sample_name", help="Optional label for the sample to be included in the output file")
256 | parser.add_argument("--tempdir", help="Temporary Directory override (default is the system temp directory)")
257 | 
258 | 
259 | args=parser.parse_args()
260 | 
261 | medians = pd.read_csv(args.medians,sep='\t')
262 | dwl_source = 'zenodo' if args.zenodo else 'dropbox'
263 | 
264 | try:
265 | 	diamond_command = [args.diamond_path,'--version']
266 | 		
267 | 	# Discard diamond's stderr and decode the version string printed on stdout
268 | 	ps1 = subprocess.Popen(diamond_command, stdout=subprocess.PIPE,stderr=subprocess.DEVNULL)
269 | 
270 | 	dmd_version = ps1.communicate()[0].decode().strip().split(' ')[-1]
271 | 
272 | 	dmd_v_split = dmd_version.split('.')
273 | 	
274 | 	if int(dmd_v_split[1]) == 9 and int(dmd_v_split[2]) < 19:
275 | 		req_dmd_db_filename='amphora_bacteria.dmnd'
276 | 	else:
277 | 		req_dmd_db_filename='amphora_bacteria_294.dmnd'
278 | except:
279 | 	fancy_print("Failed to detect diamond version",'FAIL',bcolors.FAIL)
280 | 	sys.exit(1)
281 | 
282 | if args.version: print_version()
283 | 
284 | 
285 | 
286 | if args.install:
287 | 	check_install(req_dmd_db_filename, source=dwl_source)
288 | 	sys.exit(0)
289 | 
290 | #pre-flight check
291 | for inputFile in args.input:
292 | 	if not os.path.isfile(inputFile):
293 | 		fancy_print("Error: file "+inputFile+" does not exist",'ERROR',bcolors.FAIL)
294 | 
295 | commands = [['zcat', '-h'],['bzcat', '-h'],[args.bowtie2_path, '-h'],[args.diamond_path, 'help']]
296 | 
297 | 
298 | for sw in commands:
299 | 	try: 
300 | 		with open(os.devnull, 'w') as devnull:
301 | 			subprocess.check_call(sw, stdout=devnull, stderr=devnull)
302 | 
303 | 	except Exception as e:
304 | 		fancy_print("Error, command not found: "+sw[0],'ERROR',bcolors.FAIL)
305 | 
306 | check_install(req_dmd_db_filename,source=dwl_source)
307 | 
308 | 
309 | if args.tempdir: 
310 | 	tempfile.tempdir=args.tempdir
311 | 
312 | try:
313 | 	tmpdir = tempfile.TemporaryDirectory()
314 | 	tmpdirname = tmpdir.name
315 | except Exception as e:
316 | 	fancy_print("Could not create temp folder in "+str(tempfile.tempdir),'FAIL',bcolors.FAIL)
317 | 	sys.exit(1)
318 | 
319 | if len(args.input) > 1:
320 | 	fancy_print('Merging '+str(len(args.input))+' files','...',bcolors.OKBLUE,reline=True)
321 | 	with open(tmpdirname+'/combined.fastq','a') as combinedFastq:
322 | 		for infile in args.input:
323 | 			#print(['cat',infile,'>>',tmpdirname+'/combined.fastq'])
324 | 			if infile.endswith('.gz'):
325 | 				uncompression_cmd = 'zcat'
326 | 			elif infile.endswith('.bz2'):
327 | 				uncompression_cmd = 'bzcat'
328 | 			else:
329 | 				uncompression_cmd = 'cat'
330 | 
331 | 			subprocess.check_call([uncompression_cmd,infile],stdout=combinedFastq)
332 | 
333 | 	inputFile = tmpdirname+'/combined.fastq' 
334 | 	workingName = args.sample_name if args.sample_name else ','.join([ no_fq_extension(os.path.basename(x)) for x in args.input])
335 | 
336 | 	fancy_print('Merging '+str(len(args.input))+' files','DONE',bcolors.OKGREEN,reline=True,newLine=True)
337 | else:
338 | 	inputFile = args.input[0]
339 | 	workingName = args.sample_name if args.sample_name else no_fq_extension(os.path.basename(inputFile))
340 | 
341 | 
342 | 
343 | fileName = no_fq_extension(os.path.basename(inputFile))
344 | 
345 | 
346 | fastq_len_cmd = [CHECKER_PATH+'/fastq_len_filter.py', '--min_len',args.minlen, '--min_qual',args.minqual,'--count',tmpdirname+'/'+fileName+'.nreads','-i',inputFile,'-o',tmpdirname+'/'+fileName+'.filter.fastq']
347 | try: 
348 | 	fancy_print('[fastq_len_filter] | filtering HQ reads','...',bcolors.OKBLUE,reline=True)
349 | 
350 | 	subprocess.check_call(fastq_len_cmd)
351 | 
352 | 
353 | 	with open(tmpdirname+'/'+fileName+'.nreads') as readCounts:
354 | 		HQReads, totalReads = [line.strip().split('\t')[0:2] for line in readCounts][0]
355 | 
356 | 	
357 | 	filteredFile = tmpdirname+'/'+fileName+'.filter.fastq' 
358 | 	fancy_print('[fastq_len_filter] | '+HQReads+' / '+totalReads+' ('+str(round(float(HQReads)/float(totalReads)*100,2))+'%) reads selected','DONE',bcolors.OKGREEN,reline=True,newLine=True)
359 | 
360 | except Exception as e: 
361 | 	fancy_print('Fatal error running fastq_len_filter. Error message: '+str(e),'FAIL',bcolors.FAIL)
362 | 	sys.exit(1)
363 | 
364 | 
365 | try: 
366 | 	
367 | 	fancy_print('[SILVA_SSU]   | Bowtie2 Aligning','...',bcolors.OKBLUE,reline=True)
368 | 	
369 | 
370 | 	bt2_command = [args.bowtie2_path,'--quiet','-p',args.bowtie2_threads,'--very-sensitive-local','-x',INDEX_PATH+'/SILVA_132_SSURef_Nr99_tax_silva.clean','--no-unal','-U',filteredFile,'-S','-']
371 | 	if args.debug: print(' '.join(bt2_command))
372 | 	p4 = subprocess.Popen(['wc','-l'], stdin=subprocess.PIPE,stdout=subprocess.PIPE)
373 | 	p3 = subprocess.Popen(['samtools','view','-'], stdin=subprocess.PIPE,stdout=p4.stdin)
374 | 	p2 = subprocess.Popen([CHECKER_PATH+'/cmseq/cmseq/filter.py','--minlen',args.minlen_SSU,'--minqual','20','--maxsnps','0.075'],stdin=subprocess.PIPE,stdout=p3.stdin)
375 | 	p1 = subprocess.Popen(bt2_command, stdout=p2.stdin)
376 | 
377 | 	p1.wait() 
378 | 	p2.communicate()
379 | 	p3.communicate()
380 | 	
381 | 	SSU_reads = int(p4.communicate()[0])
382 | 	SSU_reads_rate = max(LIMIT_OF_DETECTION,float(SSU_reads)/float(HQReads)*100)
383 | 	enrichment_SSU = min(100,float(medians.loc[medians['parameter']=='rRNA_SSU',args.enrichment_preset].iloc[0]) / float(SSU_reads_rate))
384 | 
385 | 	fancy_print('[SILVA_SSU]   | Bowtie2 Alignment rate: '+str(round(SSU_reads_rate,4))+'% (~'+str(round(enrichment_SSU,1))+'x)','DONE',bcolors.OKGREEN,reline=True,newLine=True)
386 | 
387 | 	if(SSU_reads_rate <= LIMIT_OF_DETECTION):
388 | 		fancy_print('[SILVA_SSU]   | Value is below limit-of-detection ('+str(SSU_reads)+' SSU reads)','!!',bcolors.WARNING)
389 | 
390 | except Exception as e: 
391 | 	fancy_print('Fatal error running Bowtie2 on SSU rRNA. Error message: '+str(e),'FAIL',bcolors.FAIL)
392 | 	sys.exit(1)
393 | 
394 | 
395 | try: 
396 | 	 
397 | 	fancy_print('[SILVA_LSU]   | Bowtie2 Aligning','...',bcolors.OKBLUE,reline=True)
398 | 	
399 | 	bt2_command = [args.bowtie2_path,'--quiet','-p',args.bowtie2_threads,'--very-sensitive-local','-x',INDEX_PATH+'/SILVA_132_LSURef_tax_silva.clean','--no-unal','-U',filteredFile,'-S','-']
400 | 	if args.debug: print(' '.join(bt2_command))
401 | 	#print("AAA")
402 | 	
403 | 	p4 = subprocess.Popen(['wc','-l'], stdin=subprocess.PIPE,stdout=subprocess.PIPE)
404 | 	p3 = subprocess.Popen(['samtools','view','-'], stdin=subprocess.PIPE,stdout=p4.stdin)
405 | 	p2 = subprocess.Popen([CHECKER_PATH+'/cmseq/cmseq/filter.py','--minlen',args.minlen_LSU,'--minqual','20','--maxsnps','0.075'],stdin=subprocess.PIPE,stdout=p3.stdin)
406 | 	p1 = subprocess.Popen(bt2_command, stdout=p2.stdin)
407 | 
408 | 	p1.wait() 
409 | 	p2.communicate()
410 | 	p3.communicate()
411 | 	
412 | 	LSU_reads = int(p4.communicate()[0]) 
413 | 	LSU_reads_rate = max(LIMIT_OF_DETECTION,float(LSU_reads)/float(HQReads)*100)
414 | 
415 | 	enrichment_LSU = min(100,float(medians.loc[medians['parameter']=='rRNA_LSU',args.enrichment_preset].iloc[0]) / float(LSU_reads_rate))
416 | 	
417 | 	fancy_print('[SILVA_LSU]   | Bowtie2 Alignment rate: '+str(round(LSU_reads_rate,4))+'% (~'+str(round(enrichment_LSU,1))+'x)','DONE',bcolors.OKGREEN,reline=True,newLine=True)
418 | 
419 | 
420 | 	if(LSU_reads_rate <= LIMIT_OF_DETECTION):
421 | 		fancy_print('[SILVA_LSU]   | Value is below limit-of-detection ('+str(LSU_reads)+' LSU reads)','!!',bcolors.WARNING)
422 | 
423 | 
424 | 
425 | except Exception as e: 
426 | 	fancy_print('Fatal error running Bowtie2 on LSU rRNA. Error message: '+str(e),'FAIL',bcolors.FAIL)
427 | 	sys.exit(1)
428 | 
429 | 
430 | try:
431 | 	 
432 | 	fancy_print('[SC-Markers]  | Diamond Aligning','...',bcolors.OKBLUE,reline=True)
433 | 	
434 | 	diamond_command = [args.diamond_path,'blastx','-q',filteredFile,'--threads',args.diamond_threads,'--outfmt','6','--db',INDEX_PATH+'/'+req_dmd_db_filename,'--id','50','--max-hsps','35','-k','0','--quiet']
435 | 	p2 = subprocess.Popen('cut -f1 | sort | uniq | wc -l',shell=True, stdin=subprocess.PIPE,stdout=subprocess.PIPE)
436 | 	
437 | 	# Show Diamond's stderr only in debug mode; otherwise discard it
438 | 	if args.debug:
439 | 		p1 = subprocess.Popen(diamond_command, stdout=p2.stdin)
440 | 	else:
441 | 		p1 = subprocess.Popen(diamond_command, stdout=p2.stdin,stderr=subprocess.DEVNULL)
445 | 
446 | 	singleCopyMarkers_reads = int(p2.communicate()[0])
447 | 	singleCopyMarkers_reads_rate = max(LIMIT_OF_DETECTION,float(singleCopyMarkers_reads)/float(HQReads)*100)
448 | 
449 | 	enrichment_singleCopyMarkers = min(100,float(medians.loc[medians['parameter']=='AMPHORA2',args.enrichment_preset].iloc[0]) / float(singleCopyMarkers_reads_rate))
450 | 
451 | 	fancy_print('[SC-Markers]  | Diamond Alignment rate: '+str(round(singleCopyMarkers_reads_rate,4))+'% (~'+str(round(enrichment_singleCopyMarkers,1))+'x)','DONE',bcolors.OKGREEN,reline=True,newLine=True)
452 | 
453 | 	if(singleCopyMarkers_reads_rate <= LIMIT_OF_DETECTION):
454 | 		fancy_print('[SC-Markers]  | Value is below limit-of-detection ('+str(singleCopyMarkers_reads)+' reads)','!!',bcolors.WARNING)
455 | 
456 | except Exception as e: 
457 | 	fancy_print('Fatal error running Diamond on Single-Copy-Proteins. Error message: '+str(e),'FAIL',bcolors.FAIL)
458 | 	sys.exit(1)
459 | 
460 | overallEnrichmenScore = min(enrichment_SSU,enrichment_LSU,enrichment_singleCopyMarkers)
461 | to_out=[workingName,totalReads,HQReads,SSU_reads_rate,LSU_reads_rate,singleCopyMarkers_reads_rate,overallEnrichmenScore]
462 | 
463 | 
464 | outFile = open(args.output,'w')
465 | outFile.write("Sample\tReads\tReads_HQ\tSSU rRNA alignment rate\tLSU rRNA alignment rate\tBacterial_Markers alignment rate\ttotal enrichmnet score\n")
466 | outFile.write('\t'.join([str(x) for x in to_out])+'\n')
467 | outFile.close()
468 | 
469 | fancy_print('Finished','',bcolors.ENDC)
470 | fancy_print('              | Overall Enrichment Score: ~'+str(round(overallEnrichmenScore,1))+'x','.',bcolors.ENDC)
471 | fancy_print('              | Output File: '+args.output,'.',bcolors.ENDC)
472 | fancy_print('Have a nice day! ','DONE',bcolors.OKGREEN)
473 | 
474 | tmpdir.cleanup()
475 | 


--------------------------------------------------------------------------------