├── README.md
├── job_scripts
│   ├── README.md
│   ├── fastqc.sh
│   ├── rcorector.sh
│   └── rm_rcorrector_unfixable.sh
├── test
│   ├── fastqc_data_r1.txt
│   ├── fastqc_data_r1_nooverrep.txt
│   ├── fastqc_data_r2.txt
│   ├── fastqc_data_r2_nooverrep.txt
│   ├── test_R1_adaptor-trimmed.fq.gz
│   └── test_R2_adaptor-trimmed.fq.gz
└── utilities
    ├── FilterListConverBam2Fastq.py
    ├── FilterUncorrectabledPEfastq.py
    ├── README.md
    ├── RemoveFastqcOverrepSequenceReads.py
    └── RepairSRAHeadersForTrinity.py

/README.md:
--------------------------------------------------------------------------------
# TranscriptomeAssemblyTools
A collection of scripts for pre-processing fastq files prior to *de novo* transcriptome assembly, as well as SLURM scripts for running what we view as a reasonable best-practice workflow, the steps of which are:

1. run *fastqc* on raw fastq reads to identify potential issues with Illumina sequencing libraries, such as retained adapter, over-represented (often low-complexity) sequences, and a greater-than-expected decrease in base quality with read cycle
1. perform kmer-based read corrections with [rCorrector](https://github.com/mourisl/Rcorrector), see [Song and Florea 2015, Gigascience](https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0089-y)
1. remove read pairs where at least one read has been flagged by *rCorrector* as containing an erroneous kmer, and where it was not possible to computationally correct the errors
1. remove read pairs where at least one read contains an over-represented sequence
1. perform light quality trimming with [TrimGalore](https://github.com/FelixKrueger/TrimGalore)
1. optionally, map the remaining reads to the [SILVA rRNA database](https://www.arb-silva.de/), then filter out read pairs where at least one read aligns to an rRNA sequence
1. assemble reads with [Trinity](https://github.com/trinityrnaseq/trinityrnaseq)

From our years of experience troubleshooting and evaluating *de novo* transcriptome assemblies, we have identified a number of statistical issues regarding the robustness of downstream analyses based upon them. A summary of these issues can be found in [Freedman et al. 2020, *Molecular Ecology Resources*](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13156). Given these issues and the rapidly decreasing cost of generating a genome assembly, we suggest that, during the study design phase of your project, you consider the feasibility of assembling and annotating a genome before choosing to generate a *de novo* transcriptome assembly.

## Workflow steps
To the extent possible, to improve reproducibility (not to mention ease of implementation!), we run analyses from within conda environments. Below, we explain how to execute particular steps, assuming that a separate conda environment is created for each step, with job scripts designed to be run with the SLURM job scheduler. These can easily be modified to work with other schedulers such as SGE and LSF.

### 1. Running fastqc
*Fastqc* generates a number of quality metrics, such as base quality by cycle, adapter sequence frequency by cycle, and over-represented sequences. This information can provide an initial picture of potential issues with sequencing library quality, and may highlight the importance of particular downstream filtering steps.

We can create a conda environment for *fastqc* as follows:
```bash
conda create -n fastqc -c bioconda fastqc
```
Then, we can run *fastqc*. Assuming we run each file as a separate job, the command line looks like this:
```bash
fastqc --outdir `pwd`/fastqc <fastq file>
```
where *--outdir* simply indicates where you want to store the output. In this case, we are storing it in a directory called fastqc, nested inside the current working directory.

This command can be submitted as a SLURM job with [fastqc.sh](https://github.com/harvardinformatics/TranscriptomeAssemblyTools/blob/master/job_scripts/fastqc.sh)
```bash
sbatch fastqc.sh <fastq file>
```

Note that if one supplies more threads to *fastqc* with the *-t* switch, one can supply multiple fastq files at once. When choosing whether or not to run *fastqc* in multi-threaded mode, remember that the program only allocates one thread per file, i.e. there is no benefit to specifying more than one thread when only one file is being quality-checked.
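For example, both mates of a paired-end library (hypothetical file names here) could be quality-checked in a single two-threaded run:
```bash
# one thread is allocated per file, so two threads for two files
fastqc -t 2 --outdir `pwd`/fastqc sample_R1.fastq.gz sample_R2.fastq.gz
```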
### 2. Kmer-based error corrections with *rCorrector*
For a given RNA-seq experiment, kmers that are observed very rarely in the reads are likely to be those containing (at least one) sequencing error. Such errors can negatively impact the quality of a *de novo* transcriptome assembly, so ideally we either correct the errors or remove erroneous reads that are computationally "uncorrectable". We can do this with *rCorrector*, a bioinformatics tool that first builds a kmer library for a set of reads, then identifies rare kmers, attempts to find a more frequent kmer that does not differ from it too much, and corrects the read to that more frequent kmer. If there is no detectable correction for a flagged rare kmer, *rCorrector* flags the read as unfixable. *rCorrector* takes as input all of the paired-end reads that are to be used to generate the assembly.

We can create a conda environment as follows:
```bash
conda create -n rcorrector -c bioconda rcorrector
```

Then, we can run the following command line:
```bash
run_rcorrector.pl -t 16 -1 <comma-separated list of R1 fastq files> -2 <comma-separated list of R2 fastq files>
```
where *-t* is a command line switch to specify how many threads to use in the analysis.

This command can be submitted as a SLURM job with [rcorector.sh](https://github.com/harvardinformatics/TranscriptomeAssemblyTools/blob/master/job_scripts/rcorector.sh)
```bash
sbatch rcorector.sh <R1 list> <R2 list>
```
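For example, assuming two hypothetical samples A and B, each with a pair of fastq files, the submission would look like:
```bash
# comma-separated lists, with R1 and R2 files given in the same sample order
sbatch rcorector.sh sampleA_R1.fq.gz,sampleB_R1.fq.gz sampleA_R2.fq.gz,sampleB_R2.fq.gz
```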
--------------------------------------------------------------------------------
/job_scripts/README.md:
--------------------------------------------------------------------------------
## Executing workflow steps in an HPC environment
In this directory, we provide example sbatch scripts for submitting workflow steps to the SLURM scheduler. These should be easily adaptable for running in a High Performance Computing (HPC) environment that uses a different scheduler, such as LSF or SGE.
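As a rough sketch of what such an adaptation might look like, the SLURM header of fastqc.sh could be translated into SGE directives along these lines (the queue name `short.q` is a hypothetical placeholder; consult your site's documentation for actual queue names and resource syntax):
```bash
#$ -q short.q             # queue to submit to, analogous to SBATCH -p
#$ -pe smp 1              # number of cores, analogous to SBATCH -n
#$ -l h_rt=03:00:00       # runtime limit, analogous to SBATCH -t
#$ -l h_vmem=6G           # memory request, analogous to SBATCH --mem
#$ -N FastQC              # job name, analogous to SBATCH -J
#$ -o FastQC.$JOB_ID.out  # file to which standard out will be written
#$ -e FastQC.$JOB_ID.err  # file to which standard err will be written
```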
--------------------------------------------------------------------------------
/job_scripts/fastqc.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p                 # put comma-separated list of available partitions here
#SBATCH -n 1               # Number of cores
#SBATCH -t 0-3:00          # Runtime in days-hours:minutes
#SBATCH --mem 6000         # Memory in MB
#SBATCH -J FastQC          # Job name
#SBATCH -o FastQC.%A.out   # File to which standard out will be written
#SBATCH -e FastQC.%A.err   # File to which standard err will be written
##SBATCH --mail-type=ALL   # Type of email notification: BEGIN,END,FAIL,ALL
##SBATCH --mail-user=      # Email to which notifications will be sent

# For this script to initialize a conda environment, a version of python that
# supports anaconda or mamba will need to be in PATH. Where the HPC environment
# has python available as a loadable module, such as the Harvard Cannon cluster,
# this simply requires adding: module load python

source activate fastqc

infile=$1  # $1 represents the first (and in this case only) command line argument supplied to the script
echo "infile is $infile"

fastqc --outdir `pwd`/fastqc $infile

--------------------------------------------------------------------------------
/job_scripts/rcorector.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p                    # put comma-separated list of available partitions here
#SBATCH -n 16                 # Number of cores
#SBATCH -t 23:00:00           # Runtime in hours:minutes:seconds
#SBATCH --mem 48000           # Memory in MB
#SBATCH -J rcorrector         # Job name
#SBATCH -o rcorrector.%A.out  # File to which standard out will be written
#SBATCH -e rcorrector.%A.err  # File to which standard err will be written
#SBATCH --mail-type=ALL       # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=          # Email to which notifications will be sent

# For this script to initialize a conda environment, a version of python that
# supports anaconda or mamba will need to be in PATH. Where the HPC environment
# has python available as a loadable module, such as the Harvard Cannon cluster,
# this simply requires adding: module load python

source activate rcorrector

# R1 and R2 are the first and second command line arguments that follow "sbatch rcorector.sh".
# R1 is a comma-separated list of the left (R1) fastq files, and R2 is a similar list for the right (R2) reads.

R1=$1
R2=$2

run_rcorrector.pl -t 16 -1 $R1 -2 $R2

--------------------------------------------------------------------------------
/job_scripts/rm_rcorrector_unfixable.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p                 # put comma-separated list of available partitions here
#SBATCH -n 1               # Number of cores
#SBATCH -t 0-3:00          # Runtime in days-hours:minutes
#SBATCH --mem 6000         # Memory in MB
#SBATCH -J rmunfix         # Job name
#SBATCH -o rmunfix.%A.out  # File to which standard out will be written
#SBATCH -e rmunfix.%A.err  # File to which standard err will be written
##SBATCH --mail-type=ALL   # Type of email notification: BEGIN,END,FAIL,ALL
##SBATCH --mail-user=      # Email to which notifications will be sent

r1=$1           # rCorrector-corrected left (R1) fastq file
r2=$2           # rCorrector-corrected right (R2) fastq file
sample_name=$3  # sample name used as prefix for outfiles
echo "input read pair files are: $r1 and $r2"

python FilterUncorrectabledPEfastq.py -1 $r1 -2 $r2 -s $sample_name

--------------------------------------------------------------------------------
/test/test_R1_adaptor-trimmed.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harvardinformatics/TranscriptomeAssemblyTools/a16497457ff09e34743d0d31d3b2674e3b6e014a/test/test_R1_adaptor-trimmed.fq.gz
--------------------------------------------------------------------------------
/test/test_R2_adaptor-trimmed.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harvardinformatics/TranscriptomeAssemblyTools/a16497457ff09e34743d0d31d3b2674e3b6e014a/test/test_R2_adaptor-trimmed.fq.gz
--------------------------------------------------------------------------------
/utilities/FilterListConverBam2Fastq.py:
--------------------------------------------------------------------------------
from os.path import basename,dirname
import argparse
from subprocess import Popen,PIPE
import glob


# sam flag bits for reads in properly mapped pairs
mate_to_bits={'R1':'0x42','R2':'0x82'}

def sortbam(unsorted_bam):
    """
    coordinate sorts bam file
    """
    print('coordinate sorting %s\n' % unsorted_bam)
    cmd='samtools sort %s > %s/sorted_%s' % (unsorted_bam,dirname(unsorted_bam),basename(unsorted_bam))
    samsort=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    samsort_stdout,samsort_stderr=samsort.communicate()
    if samsort.returncode==0:
        return True
    else:
        return False


def extractbamheader(bamin):
    cmd='samtools view -H %s > %s/header%s.sam' % (bamin,dirname(bamin),basename(bamin)[:-4])
    headgrab=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    headout,headerr=headgrab.communicate()
    if headgrab.returncode==0:
        headpass=True
        headerr=''
    else:
        headpass=False
    return headpass,headerr


def addheadertosam(samin):
    concatcmd='cat %s/header%s.sam %s/namesort_%s > %s/wheader_%s' % (dirname(samin),basename(samin)[:-4],dirname(samin),basename(samin),dirname(samin),basename(samin))
    doconcat=Popen(concatcmd,shell=True,stderr=PIPE,stdout=PIPE)
    concatout,concaterr=doconcat.communicate()
    if doconcat.returncode==0:
        concatpass=True
    else:
        concatpass=False
        raise ValueError('%s\n' % concaterr)
    return concatpass

def sam2bam(samin):
    bamcmd='samtools view -Sbh %s > %s/%s.bam' % (samin,dirname(samin),basename(samin)[:-4])
    makebam=Popen(bamcmd,shell=True,stderr=PIPE,stdout=PIPE)
    bamout,bamerr=makebam.communicate()
    if makebam.returncode==0:
        makebampass=True
    else:
        makebampass=False
        raise ValueError('%s\n' % bamerr)
    return makebampass


def namesortsam(samin):
    """
    takes filtered sam file, name sorts it,
    and writes it, with header, as a sam file
    """
    sortcmd='samtools sort -n -o %s/namesort_%s %s' % (dirname(samin),basename(samin),samin)
    dosort=Popen(sortcmd,shell=True,stderr=PIPE,stdout=PIPE)
    sortout,sorterr=dosort.communicate()
    if dosort.returncode==0:
        sortpass=True
        sorterr=''
    else:
        sortpass=False
    return sortpass,sorterr

def indexbam(sorted_bam):
    print('indexing %s\n' % sorted_bam)
    cmd='samtools index %s' % sorted_bam
    samindex=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    samindex_stdout,samindex_stderr=samindex.communicate()
    if samindex.returncode==0:
        return True,''
    else:
        return False,samindex_stderr

def makefilterset(filterfile):
    filterset=set()
    with open(filterfile,'r') as fopen:
        for line in fopen:
            contig=line.strip()
            filterset.add(contig)
    return filterset

def write_mate_sams(bamin,hexval,mate):
    """
    extract properly paired, mapped reads from a
    coord-sorted bam file in a mate-specific fashion;
    contig filtering happens downstream in sam2fastq
    """
    cmd='samtools view -h -f %s %s > %s/%s_%s.sam' % (hexval,bamin,dirname(bamin),basename(bamin)[:-4],mate)
    getreads=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    readsout,readserr=getreads.communicate()
    if getreads.returncode==0:
        return True,''
    else:
        return False,readserr

def sam2fastq(samfile,contigskeep):
    counter=0
    fopen=open(samfile,'r')
    fqout=open('%s.fq' % samfile[:-4],'w')
    lastid=''
    for line in fopen:
        linelist=line.strip().split('\t')
        if line[0]!='@' and linelist[2] in contigskeep:
            counter+=1
            if counter%100000==0:
                print('processing read %s' % counter)
            # write each query name only once; SAM QNAMEs lack the '@' prefix
            # required by the fastq format, so it is added here
            if lastid=='' or lastid!=linelist[0]:
                lastid=linelist[0]
                fqout.write('%s\n' % '\n'.join(['@'+linelist[0],linelist[9],'+',linelist[10]]))

    fqout.close()


if __name__=="__main__":
    parser = argparse.ArgumentParser(description="using contig keep list to extract fastq files")
    parser.add_argument('-s','--is_coord_sorted',action='store_true',help='flag for whether bam is already coordinate sorted')
    parser.add_argument('-b','--bam_in',dest='bamin',type=str,help='compressed read alignment infile')
    parser.add_argument('-k','--contig_keep_file',dest='keeps',type=str,help='list of contigs whose mapped reads to keep')
    opts = parser.parse_args()

    print(opts)

    ### make sure input bam is coord sorted to enable contig searches
    if opts.is_coord_sorted:
        sortedbam=opts.bamin
    else:
        bamsorter=sortbam(opts.bamin)
        if not bamsorter:
            raise ValueError('coordinate sorting of %s failed\n' % opts.bamin)
        sortedbam='sorted_%s' % basename(opts.bamin)

    ### verify bam file index is present, or make it
    if glob.glob('%s.bai' % sortedbam)==[]:
        indexflag,indexerr=indexbam(sortedbam)
        if indexflag==False:
            raise ValueError('%s\n' % indexerr)

    ### make set of pass-filter contigs
    good_contigs=makefilterset(opts.keeps)

    ### write a contig-filtered, name-sorted sam file for each mate in a pair,
    ### then extract the retained reads to fastq
    for mate in ['R1','R2']:
        print('writing %s sam file' % mate)
        mateflag,materr=write_mate_sams(sortedbam,mate_to_bits[mate],mate)
        if mateflag==False:
            raise ValueError('%s\n' % materr)

        print('name-sorting %s sam file' % mate)
        sortflag,sorterr=namesortsam('%s_%s.sam' % (sortedbam[:-4],mate))
        if sortflag==False:
            raise ValueError('%s\n' % sorterr)

        print('writing unique %s reads to fastq' % mate)
        sam2fastq('%s/namesort_%s_%s.sam' % (dirname(sortedbam),basename(sortedbam)[:-4],mate),good_contigs)
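# Example usage (hypothetical file names; requires samtools in PATH):
#   python FilterListConverBam2Fastq.py -b alignments.bam -k contigs_to_keep.txt
# add -s if alignments.bam is already coordinate sorted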
--------------------------------------------------------------------------------
/utilities/FilterUncorrectabledPEfastq.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3
"""
author: adam h freedman
afreedman405 at gmail.com
date: Thursday Oct 12 EDT 2023

This script takes as input Rcorrector error-corrected Illumina paired-end reads
in fastq format and:

1. Removes any reads that Rcorrector identifies as containing an error,
   but can't be corrected, typically low complexity sequences. For these,
   the header contains 'unfixable'.

2. Strips the ' cor' from headers of reads that Rcorrector fixed, to avoid
   issues created by certain header formats for downstream tools.

3. Writes a log with counts of (a) read pairs that were removed because one end
   was unfixable, (b) corrected left and right reads, (c) total number of
   read pairs containing at least one corrected read.

Currently, this script only handles paired-end data, and handles either unzipped
or gzipped files on the fly, so long as the gzipped file names end with 'gz'.
"""
import argparse
import gzip
import os

try:
    from itertools import izip_longest
except ImportError:
    from itertools import zip_longest as izip_longest


def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        # open uncompressed files in binary mode too, so that
        # reads can be decoded uniformly downstream
        r1handle=open(r1file,'rb')
        r2handle=open(r2file,'rb')

    return r1handle,r2handle


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)


if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for filtering and logging rCorrector fastq outputs")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    parser.add_argument('-s','--sample_id',dest='id',type=str,help='sample name to write to log file')
    opts = parser.parse_args()

    output_dir = os.path.dirname(opts.leftreads)
    r1out = open(os.path.join(output_dir, "unfixrm_{}".format(
        os.path.basename(opts.leftreads).replace('.gz', ''))), 'w')
    r2out = open(os.path.join(output_dir, "unfixrm_{}".format(
        os.path.basename(opts.rightreads).replace('.gz', ''))), 'w')


    r1_cor_count=0
    r2_cor_count=0
    pair_cor_count=0
    unfix_r1_count=0
    unfix_r2_count=0
    unfix_both_count=0

    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)

    with r1_stream as f1, r2_stream as f2:
        R1=grouper(f1,4)
        R2=grouper(f2,4)
        counter=0
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print("%s reads processed" % counter)

            head1,seq1,placeholder1,qual1=[i.decode('ASCII').strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.decode('ASCII').strip() for j in next(R2)]

            if 'unfixable' in head1 and 'unfixable' not in head2:
                unfix_r1_count+=1
            elif 'unfixable' in head2 and 'unfixable' not in head1:
                unfix_r2_count+=1
            elif 'unfixable' in head1 and 'unfixable' in head2:
                unfix_both_count+=1
            else:
                if 'cor' in head1:
                    r1_cor_count+=1
                if 'cor' in head2:
                    r2_cor_count+=1
                if 'cor' in head1 or 'cor' in head2:
                    pair_cor_count+=1

                r1out.write('%s\n' % '\n'.join([head1.replace(' cor',''),seq1,placeholder1,qual1]))
                r2out.write('%s\n' % '\n'.join([head2.replace(' cor',''),seq2,placeholder2,qual2]))

    total_unfixable = unfix_r1_count+unfix_r2_count+unfix_both_count
    total_retained = counter - total_unfixable

    unfix_log = open(os.path.join(
        output_dir, "rmunfixable_{}.log".format(opts.id)), 'w')
    unfix_log.write('total PE reads:%s\nremoved PE reads:%s\nretained PE reads:%s\nR1 corrected:%s\nR2 corrected:%s\npairs corrected:%s\nR1 unfixable:%s\nR2 unfixable:%s\nboth reads unfixable:%s\n' % (counter,total_unfixable,total_retained,r1_cor_count,r2_cor_count,pair_cor_count,unfix_r1_count,unfix_r2_count,unfix_both_count))

    r1out.close()
    r2out.close()
    unfix_log.close()

    outfiles = [os.path.join(output_dir, "unfixrm_{}".format(os.path.basename(opts.leftreads).replace('.gz', ''))),
                os.path.join(output_dir, "unfixrm_{}".format(os.path.basename(opts.rightreads).replace('.gz', '')))]
    for outfile in outfiles:
        print("gzipping %s" % outfile)
        os.system("gzip %s" % outfile)
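# Example usage (hypothetical file names):
#   python FilterUncorrectabledPEfastq.py -1 sample_R1.cor.fq.gz -2 sample_R2.cor.fq.gz -s sample
# writes gzipped unfixrm_* fastq files and rmunfixable_sample.log alongside the inputs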
--------------------------------------------------------------------------------
/utilities/README.md:
--------------------------------------------------------------------------------
# Utilities for pre-processing fastq files prior to assembly
A collection of scripts for processing fastq files in ways that improve de novo transcriptome assemblies, and for evaluating those assemblies.

## FilterUncorrectabledPEfastq.py
Takes paired-end Illumina fastq files generated from an RNA-seq library that have been error corrected with [rCorrector](https://github.com/mourisl/Rcorrector) (see [Song and Florea 2015, Gigascience](https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0089-y)), removes reads with errors that are unfixable, and strips the 'cor' flags from the headers of reads that were corrected.

## RemoveFastqcOverrepSequenceReads.py
Parses the fastqc output files to retrieve over-represented sequences, and uses these to remove read pairs where either read has a sequence match to an over-represented sequence.
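Usage sketches for both scripts, with hypothetical input file names (run `python <script> -h` for the full option list):

```bash
# remove pairs with an unfixable read; writes unfixrm_* fastq files and a log
python FilterUncorrectabledPEfastq.py -1 sample_R1.cor.fq.gz -2 sample_R2.cor.fq.gz -s sample

# remove pairs matching fastqc over-represented sequences; the fastqc_data.txt
# files come from the fastqc output for each mate
python RemoveFastqcOverrepSequenceReads.py -1 sample_R1.fq.gz -2 sample_R2.fq.gz \
    -fql sample_R1_fastqc/fastqc_data.txt -fqr sample_R2_fastqc/fastqc_data.txt -o sample
```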
--------------------------------------------------------------------------------
/utilities/RemoveFastqcOverrepSequenceReads.py:
--------------------------------------------------------------------------------
import gzip
from os.path import basename
import argparse
import re
try:
    from itertools import izip_longest
except ImportError:
    from itertools import zip_longest as izip_longest


def seqsmatch(overreplist,read):
    """
    returns True if any over-represented
    sequence is a substring of the read
    """
    flag=False
    for seq in overreplist:
        if seq in read:
            flag=True
            break
    return flag

def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        # open uncompressed files in binary mode too, so that
        # reads can be decoded uniformly downstream
        r1handle=open(r1file,'rb')
        r2handle=open(r2file,'rb')

    return r1handle,r2handle

def FastqIterate(iterable,fillvalue=None):
    "Grab one 4-line fastq read at a time"
    args = [iter(iterable)] * 4
    return izip_longest(fillvalue=fillvalue, *args)

def ParseFastqcLog(fastqclog):
    """
    extract over-represented sequences from the 'Overrepresented sequences'
    module of a fastqc data file; returns an empty list if the module
    is absent or empty
    """
    seqs=[]
    with open(fastqclog) as fp:
        for result in re.findall('Overrepresented sequences(.*?)END_MODULE', fp.read(), re.S):
            seqs.extend([i.split('\t')[0] for i in result.split('\n')[2:-1]])
    return seqs

if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for removing reads with over-represented sequences")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    parser.add_argument('-fql','--fastqc_left',dest='l_fastqc',type=str,help='fastqc text file for R1')
    parser.add_argument('-fqr','--fastqc_right',dest='r_fastqc',type=str,help='fastqc text file for R2')
    parser.add_argument('-o','--output-log-prefix',dest='logprefix',type=str,help='prefix for logfile summarizing filtering')
    opts = parser.parse_args()

    logout = open('%s_rmoverrep.log' % opts.logprefix,'w')
    leftseqs=ParseFastqcLog(opts.l_fastqc)
    if leftseqs == []:
        print("no overrepresented sequences in R1")
    rightseqs=ParseFastqcLog(opts.r_fastqc)
    if rightseqs == []:
        print("no overrepresented sequences in R2")

    r1_out=open('rmoverrep_'+basename(opts.leftreads).replace('.gz',''),'w')
    r2_out=open('rmoverrep_'+basename(opts.rightreads).replace('.gz',''),'w')

    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)

    counter=0
    failcounter=0

    with r1_stream as f1, r2_stream as f2:
        R1=FastqIterate(f1)
        R2=FastqIterate(f2)
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print("%s reads processed" % counter)

            head1,seq1,placeholder1,qual1=[i.decode('ASCII').strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.decode('ASCII').strip() for j in next(R2)]

            flagleft,flagright=seqsmatch(leftseqs,seq1),seqsmatch(rightseqs,seq2)

            # keep the pair only if neither read matches an over-represented sequence
            if True not in (flagleft,flagright):
                r1_out.write('%s\n' % '\n'.join([head1,seq1,'+',qual1]))
                r2_out.write('%s\n' % '\n'.join([head2,seq2,'+',qual2]))
            else:
                failcounter+=1

    logout.write('n_reads_eval\tn_reads_retained\tn_pe_reads_filtered\n')
    logout.write('%s\t%s\t%s\n' % (counter,counter-failcounter,failcounter))
    logout.close()

    r1_out.close()
    r2_out.close()
--------------------------------------------------------------------------------
/utilities/RepairSRAHeadersForTrinity.py:
--------------------------------------------------------------------------------
import gzip
from os.path import basename
import argparse

try:
    from itertools import izip_longest
except ImportError:
    from itertools import zip_longest as izip_longest

def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        # open uncompressed files in binary mode too, so that
        # reads can be decoded uniformly downstream
        r1handle=open(r1file,'rb')
        r2handle=open(r2file,'rb')

    return r1handle,r2handle


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)


if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for repairing SRA-style fastq headers for Trinity")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    opts = parser.parse_args()

    r1out=open('sraheaderfixed_%s' % basename(opts.leftreads).replace('.gz',''),'w')
    r2out=open('sraheaderfixed_%s' % basename(opts.rightreads).replace('.gz',''),'w')
    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)
    with r1_stream as f1, r2_stream as f2:
        R1=grouper(f1,4)
        R2=grouper(f2,4)
        counter=0
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print("%s reads processed" % counter)

            head1,seq1,placeholder1,qual1=[i.decode('ASCII').strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.decode('ASCII').strip() for j in next(R2)]
            # Trinity expects paired-read headers to end in /1 and /2
            head1=head1.split()[0]+'/1'
            head2=head2.split()[0]+'/2'
            placeholder1 = '+'
            placeholder2 = '+'
            r1out.write('%s\n' % '\n'.join([head1,seq1,placeholder1,qual1]))
            r2out.write('%s\n' % '\n'.join([head2,seq2,placeholder2,qual2]))

    r1out.close()
    r2out.close()
--------------------------------------------------------------------------------