├── README.md
├── job_scripts
│   ├── README.md
│   ├── fastqc.sh
│   ├── rcorector.sh
│   └── rm_rcorrector_unfixable.sh
├── test
│   ├── fastqc_data_r1.txt
│   ├── fastqc_data_r1_nooverrep.txt
│   ├── fastqc_data_r2.txt
│   ├── fastqc_data_r2_nooverrep.txt
│   ├── test_R1_adaptor-trimmed.fq.gz
│   └── test_R2_adaptor-trimmed.fq.gz
└── utilities
    ├── FilterListConverBam2Fastq.py
    ├── FilterUncorrectabledPEfastq.py
    ├── README.md
    ├── RemoveFastqcOverrepSequenceReads.py
    └── RepairSRAHeadersForTrinity.py

/README.md:
--------------------------------------------------------------------------------
# TranscriptomeAssemblyTools
A collection of scripts for pre-processing fastq files prior to *de novo* transcriptome assembly, as well as SLURM scripts for running what we view as a reasonable best-practice workflow, the steps of which are:

1. run *fastqc* on raw fastq reads to identify potential issues with Illumina sequencing libraries, such as retained adapter, over-represented (often low-complexity) sequences, and a greater-than-expected decrease in base quality with read cycle
1. perform kmer-based read corrections with [rCorrector](https://github.com/mourisl/Rcorrector), see [Song and Florea 2015, Gigascience](https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0089-y)
1. remove read pairs where at least one read has been flagged by *rCorrector* as containing an erroneous kmer, and where it was not possible to computationally correct the errors
1. remove read pairs where at least one read contains an over-represented sequence
1. perform light quality trimming with [TrimGalore](https://github.com/FelixKrueger/TrimGalore)
1. optionally, map the remaining reads to the [SILVA rRNA database](https://www.arb-silva.de/), then filter out read pairs where at least one read aligns to an rRNA sequence
1. assemble reads with [Trinity](https://github.com/trinityrnaseq/trinityrnaseq)

From our years of experience troubleshooting and evaluating *de novo* transcriptome assemblies, we have identified a number of statistical issues regarding the robustness of downstream analyses based upon them. A summary of these issues can be found in [Freedman et al. 2020, *Molecular Ecology Resources*](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13156). Given these issues and the rapidly decreasing cost of generating a genome assembly, we suggest that, during the study design phase of your project, you consider the feasibility of assembling and annotating a genome before choosing to generate a *de novo* transcriptome assembly.

## Workflow steps
To the extent possible, to improve reproducibility (not to mention ease of implementation!), we run analyses from within conda environments. Below, we explain how to execute particular steps, assuming that a separate conda environment is created for each step, with job scripts designed to be run with the SLURM job scheduler. These can easily be modified to work with other schedulers such as SGE and LSF.

### 1. Running fastqc
*Fastqc* generates a number of quality metrics, such as base quality by cycle, adapter sequence frequency by cycle, and over-represented sequences. This information can provide an initial picture of potential issues with sequencing library quality, and may highlight the importance of particular downstream filtering steps.

We can create a conda environment for *fastqc* as follows:
```bash
conda create -n fastqc -c bioconda fastqc
```
Then, we can run *fastqc*. Assuming we run each file as a separate job, the command line looks like this:
```bash
fastqc --outdir `pwd`/fastqc <fastq file>
```
where *--outdir* simply indicates where you want to store the output. In this case, we are storing it in a directory called fastqc, nested inside the current working directory.

This command can be submitted as a SLURM job with [fastqc.sh](https://github.com/harvardinformatics/TranscriptomeAssemblyTools/blob/master/job_scripts/fastqc.sh)
```bash
sbatch fastqc.sh <fastq file>
```

Note that if one supplies more threads to *fastqc* with the *-t* switch, one can supply multiple fastq files at once. When choosing whether or not to run *fastqc* in multi-threaded mode, remember that the program only allocates one thread per file, i.e. there is no benefit to specifying more than one thread when only one file is being quality-checked.
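For example, both mates of a paired-end library (hypothetical file names here) could be quality-checked in a single two-threaded run:
```bash
# one thread is allocated per file, so two threads for two files
fastqc -t 2 --outdir `pwd`/fastqc sample_R1.fastq.gz sample_R2.fastq.gz
```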
### 2. Kmer-based error corrections with *rCorrector*
For a given RNA-seq experiment, kmers that are observed very rarely in the reads are likely to be those containing (at least one) sequencing error. Such errors can negatively impact the quality of a *de novo* transcriptome assembly, so ideally we either correct the errors or remove erroneous reads that are computationally "uncorrectable". We can do this with *rCorrector*, a bioinformatics tool that first builds a kmer library for a set of reads, then identifies rare kmers, attempts to find a more frequent kmer that does not differ from it too much, and corrects the read to that more frequent kmer. If there is no detectable correction for a flagged rare kmer, *rCorrector* flags the read as unfixable. *rCorrector* takes as input all of the paired-end reads that are to be used to generate the assembly.

We can create a conda environment as follows:
```bash
conda create -n rcorrector -c bioconda rcorrector
```

Then, we can run the following command line:
```bash
run_rcorrector.pl -t 16 -1 <comma-separated list of R1 fastq files> -2 <comma-separated list of R2 fastq files>
```
where *-t* is a command line switch to specify how many threads to use in the analysis.

This command can be submitted as a SLURM job with [rcorector.sh](https://github.com/harvardinformatics/TranscriptomeAssemblyTools/blob/master/job_scripts/rcorector.sh)
```bash
sbatch rcorector.sh <R1 list> <R2 list>
```
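For example, assuming two hypothetical samples A and B, each with a pair of fastq files, the submission would look like:
```bash
# comma-separated lists, with R1 and R2 files given in the same sample order
sbatch rcorector.sh sampleA_R1.fq.gz,sampleB_R1.fq.gz sampleA_R2.fq.gz,sampleB_R2.fq.gz
```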
--------------------------------------------------------------------------------
/job_scripts/README.md:
--------------------------------------------------------------------------------
## Executing workflow steps in an HPC environment
In this directory, we provide example sbatch scripts for submitting workflow steps to the SLURM scheduler. These should be easily adaptable for running in a High Performance Computing (HPC) environment that uses a different scheduler, such as LSF or SGE.
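As a rough sketch of what such an adaptation might look like, the SLURM header of fastqc.sh could be translated into SGE directives along these lines (the queue name `short.q` is a hypothetical placeholder; consult your site's documentation for actual queue names and resource syntax):
```bash
#$ -q short.q             # queue to submit to, analogous to SBATCH -p
#$ -pe smp 1              # number of cores, analogous to SBATCH -n
#$ -l h_rt=03:00:00       # runtime limit, analogous to SBATCH -t
#$ -l h_vmem=6G           # memory request, analogous to SBATCH --mem
#$ -N FastQC              # job name, analogous to SBATCH -J
#$ -o FastQC.$JOB_ID.out  # file to which standard out will be written
#$ -e FastQC.$JOB_ID.err  # file to which standard err will be written
```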
--------------------------------------------------------------------------------
/job_scripts/fastqc.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p                 # put comma-separated list of available partitions here
#SBATCH -n 1               # Number of cores
#SBATCH -t 0-3:00          # Runtime in days-hours:minutes
#SBATCH --mem 6000         # Memory in MB
#SBATCH -J FastQC          # Job name
#SBATCH -o FastQC.%A.out   # File to which standard out will be written
#SBATCH -e FastQC.%A.err   # File to which standard err will be written
##SBATCH --mail-type=ALL   # Type of email notification: BEGIN,END,FAIL,ALL
##SBATCH --mail-user=      # Email to which notifications will be sent

# For this script to initialize a conda environment, a version of python that
# supports anaconda or mamba will need to be in PATH. Where the HPC environment
# has python available as a loadable module, such as the Harvard Cannon cluster,
# this simply requires adding: module load python

source activate fastqc

infile=$1  # $1 represents the first (and in this case only) command line argument supplied to the script
echo "infile is $infile"

fastqc --outdir `pwd`/fastqc $infile

--------------------------------------------------------------------------------
/job_scripts/rcorector.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p                    # put comma-separated list of available partitions here
#SBATCH -n 16                 # Number of cores
#SBATCH -t 23:00:00           # Runtime in hours:minutes:seconds
#SBATCH --mem 48000           # Memory in MB
#SBATCH -J rcorrector         # Job name
#SBATCH -o rcorrector.%A.out  # File to which standard out will be written
#SBATCH -e rcorrector.%A.err  # File to which standard err will be written
#SBATCH --mail-type=ALL       # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=          # Email to which notifications will be sent

# For this script to initialize a conda environment, a version of python that
# supports anaconda or mamba will need to be in PATH. Where the HPC environment
# has python available as a loadable module, such as the Harvard Cannon cluster,
# this simply requires adding: module load python

source activate rcorrector

# R1 and R2 are the first and second command line arguments that follow "sbatch rcorector.sh".
# R1 is a comma-separated list of the left (R1) fastq files, and R2 is a similar list for the right (R2) reads.

R1=$1
R2=$2

run_rcorrector.pl -t 16 -1 $R1 -2 $R2

--------------------------------------------------------------------------------
/job_scripts/rm_rcorrector_unfixable.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p                 # put comma-separated list of available partitions here
#SBATCH -n 1               # Number of cores
#SBATCH -t 0-3:00          # Runtime in days-hours:minutes
#SBATCH --mem 6000         # Memory in MB
#SBATCH -J rmunfix         # Job name
#SBATCH -o rmunfix.%A.out  # File to which standard out will be written
#SBATCH -e rmunfix.%A.err  # File to which standard err will be written
##SBATCH --mail-type=ALL   # Type of email notification: BEGIN,END,FAIL,ALL
##SBATCH --mail-user=      # Email to which notifications will be sent

r1=$1           # rCorrector-corrected left (R1) fastq file
r2=$2           # rCorrector-corrected right (R2) fastq file
sample_name=$3  # sample name used as prefix for outfiles
echo "input read pair files are: $r1 and $r2"

python FilterUncorrectabledPEfastq.py -1 $r1 -2 $r2 -s $sample_name

--------------------------------------------------------------------------------
/test/test_R1_adaptor-trimmed.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harvardinformatics/TranscriptomeAssemblyTools/a16497457ff09e34743d0d31d3b2674e3b6e014a/test/test_R1_adaptor-trimmed.fq.gz
--------------------------------------------------------------------------------
/test/test_R2_adaptor-trimmed.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harvardinformatics/TranscriptomeAssemblyTools/a16497457ff09e34743d0d31d3b2674e3b6e014a/test/test_R2_adaptor-trimmed.fq.gz
--------------------------------------------------------------------------------
/utilities/FilterListConverBam2Fastq.py:
--------------------------------------------------------------------------------
from os.path import basename,dirname
import argparse
from subprocess import Popen,PIPE
import glob


# sam flag bits for reads in properly mapped pairs
mate_to_bits={'R1':'0x42','R2':'0x82'}

def sortbam(unsorted_bam):
    """
    coordinate sorts bam file
    """
    print('coordinate sorting %s\n' % unsorted_bam)
    cmd='samtools sort %s > %s/sorted_%s' % (unsorted_bam,dirname(unsorted_bam),basename(unsorted_bam))
    samsort=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    samsort_stdout,samsort_stderr=samsort.communicate()
    if samsort.returncode==0:
        return True
    else:
        return False


def extractbamheader(bamin):
    cmd='samtools view -H %s > %s/header%s.sam' % (bamin,dirname(bamin),basename(bamin)[:-4])
    headgrab=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    headout,headerr=headgrab.communicate()
    if headgrab.returncode==0:
        headpass=True
        headerr=''
    else:
        headpass=False
    return headpass,headerr


def addheadertosam(samin):
    concatcmd='cat %s/header%s.sam %s/namesort_%s > %s/wheader_%s' % (dirname(samin),basename(samin)[:-4],dirname(samin),basename(samin),dirname(samin),basename(samin))
    doconcat=Popen(concatcmd,shell=True,stderr=PIPE,stdout=PIPE)
    concatout,concaterr=doconcat.communicate()
    if doconcat.returncode==0:
        concatpass=True
    else:
        concatpass=False
        raise ValueError('%s\n' % concaterr)
    return concatpass

def sam2bam(samin):
    bamcmd='samtools view -Sbh %s > %s/%s.bam' % (samin,dirname(samin),basename(samin)[:-4])
    makebam=Popen(bamcmd,shell=True,stderr=PIPE,stdout=PIPE)
    bamout,bamerr=makebam.communicate()
    if makebam.returncode==0:
        makebampass=True
    else:
        makebampass=False
        raise ValueError('%s\n' % bamerr)
    return makebampass


def namesortsam(samin):
    """
    takes filtered sam file, name sorts it,
    and writes it, with header, as a sam file
    """
    sortcmd='samtools sort -n -o %s/namesort_%s %s' % (dirname(samin),basename(samin),samin)
    dosort=Popen(sortcmd,shell=True,stderr=PIPE,stdout=PIPE)
    sortout,sorterr=dosort.communicate()
    if dosort.returncode==0:
        sortpass=True
        sorterr=''
    else:
        sortpass=False
    return sortpass,sorterr

def indexbam(sorted_bam):
    print('indexing %s\n' % sorted_bam)
    cmd='samtools index %s' % sorted_bam
    samindex=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    samindex_stdout,samindex_stderr=samindex.communicate()
    if samindex.returncode==0:
        return True,''
    else:
        return False,samindex_stderr

def makefilterset(filterfile):
    filterset=set()
    with open(filterfile,'r') as fopen:
        for line in fopen:
            contig=line.strip()
            filterset.add(contig)
    return filterset

def write_mate_sams(bamin,hexval,mate):
    """
    extract properly paired, mapped reads from a
    coord-sorted bam file in a mate-specific fashion;
    contig filtering happens downstream in sam2fastq
    """
    cmd='samtools view -h -f %s %s > %s/%s_%s.sam' % (hexval,bamin,dirname(bamin),basename(bamin)[:-4],mate)
    getreads=Popen(cmd,shell=True,stderr=PIPE,stdout=PIPE)
    readsout,readserr=getreads.communicate()
    if getreads.returncode==0:
        return True,''
    else:
        return False,readserr

def sam2fastq(samfile,contigskeep):
    counter=0
    fopen=open(samfile,'r')
    fqout=open('%s.fq' % samfile[:-4],'w')
    lastid=''
    for line in fopen:
        linelist=line.strip().split('\t')
        if line[0]!='@' and linelist[2] in contigskeep:
            counter+=1
            if counter%100000==0:
                print('processing read %s' % counter)
            # write each query name only once; SAM QNAMEs lack the '@' prefix
            # required by the fastq format, so it is added here
            if lastid=='' or lastid!=linelist[0]:
                lastid=linelist[0]
                fqout.write('%s\n' % '\n'.join(['@'+linelist[0],linelist[9],'+',linelist[10]]))

    fqout.close()


if __name__=="__main__":
    parser = argparse.ArgumentParser(description="using contig keep list to extract fastq files")
    parser.add_argument('-s','--is_coord_sorted',action='store_true',help='flag for whether bam is already coordinate sorted')
    parser.add_argument('-b','--bam_in',dest='bamin',type=str,help='compressed read alignment infile')
    parser.add_argument('-k','--contig_keep_file',dest='keeps',type=str,help='list of contigs whose mapped reads to keep')
    opts = parser.parse_args()

    print(opts)

    ### make sure input bam is coord sorted to enable contig searches
    if opts.is_coord_sorted:
        sortedbam=opts.bamin
    else:
        bamsorter=sortbam(opts.bamin)
        if not bamsorter:
            raise ValueError('coordinate sorting of %s failed\n' % opts.bamin)
        sortedbam='sorted_%s' % basename(opts.bamin)

    ### verify bam file index is present, or make it
    if glob.glob('%s.bai' % sortedbam)==[]:
        indexflag,indexerr=indexbam(sortedbam)
        if indexflag==False:
            raise ValueError('%s\n' % indexerr)

    ### make set of pass-filter contigs
    good_contigs=makefilterset(opts.keeps)

    ### write a contig-filtered, name-sorted sam file for each mate in a pair,
    ### then extract the retained reads to fastq
    for mate in ['R1','R2']:
        print('writing %s sam file' % mate)
        mateflag,materr=write_mate_sams(sortedbam,mate_to_bits[mate],mate)
        if mateflag==False:
            raise ValueError('%s\n' % materr)

        print('name-sorting %s sam file' % mate)
        sortflag,sorterr=namesortsam('%s_%s.sam' % (sortedbam[:-4],mate))
        if sortflag==False:
            raise ValueError('%s\n' % sorterr)

        print('writing unique %s reads to fastq' % mate)
        sam2fastq('%s/namesort_%s_%s.sam' % (dirname(sortedbam),basename(sortedbam)[:-4],mate),good_contigs)
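# Example usage (hypothetical file names; requires samtools in PATH):
#   python FilterListConverBam2Fastq.py -b alignments.bam -k contigs_to_keep.txt
# add -s if alignments.bam is already coordinate sorted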
--------------------------------------------------------------------------------
/utilities/FilterUncorrectabledPEfastq.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3
"""
author: adam h freedman
afreedman405 at gmail.com
date: Thursday Oct 12 EDT 2023

This script takes as input Rcorrector error-corrected Illumina paired-end reads
in fastq format and:

1. Removes any reads that Rcorrector identifies as containing an error,
   but can't be corrected, typically low complexity sequences. For these,
   the header contains 'unfixable'.

2. Strips the ' cor' from headers of reads that Rcorrector fixed, to avoid
   issues created by certain header formats for downstream tools.

3. Writes a log with counts of (a) read pairs that were removed because one end
   was unfixable, (b) corrected left and right reads, (c) total number of
   read pairs containing at least one corrected read.

Currently, this script only handles paired-end data, and handles either unzipped
or gzipped files on the fly, so long as the gzipped file names end with 'gz'.
"""
import argparse
import gzip
import os

try:
    from itertools import izip_longest
except ImportError:
    from itertools import zip_longest as izip_longest


def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        # open uncompressed files in binary mode too, so that
        # reads can be decoded uniformly downstream
        r1handle=open(r1file,'rb')
        r2handle=open(r2file,'rb')

    return r1handle,r2handle


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)


if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for filtering and logging rCorrector fastq outputs")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    parser.add_argument('-s','--sample_id',dest='id',type=str,help='sample name to write to log file')
    opts = parser.parse_args()

    output_dir = os.path.dirname(opts.leftreads)
    r1out = open(os.path.join(output_dir, "unfixrm_{}".format(
        os.path.basename(opts.leftreads).replace('.gz', ''))), 'w')
    r2out = open(os.path.join(output_dir, "unfixrm_{}".format(
        os.path.basename(opts.rightreads).replace('.gz', ''))), 'w')


    r1_cor_count=0
    r2_cor_count=0
    pair_cor_count=0
    unfix_r1_count=0
    unfix_r2_count=0
    unfix_both_count=0

    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)

    with r1_stream as f1, r2_stream as f2:
        R1=grouper(f1,4)
        R2=grouper(f2,4)
        counter=0
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print("%s reads processed" % counter)

            head1,seq1,placeholder1,qual1=[i.decode('ASCII').strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.decode('ASCII').strip() for j in next(R2)]

            if 'unfixable' in head1 and 'unfixable' not in head2:
                unfix_r1_count+=1
            elif 'unfixable' in head2 and 'unfixable' not in head1:
                unfix_r2_count+=1
            elif 'unfixable' in head1 and 'unfixable' in head2:
                unfix_both_count+=1
            else:
                if 'cor' in head1:
                    r1_cor_count+=1
                if 'cor' in head2:
                    r2_cor_count+=1
                if 'cor' in head1 or 'cor' in head2:
                    pair_cor_count+=1

                r1out.write('%s\n' % '\n'.join([head1.replace(' cor',''),seq1,placeholder1,qual1]))
                r2out.write('%s\n' % '\n'.join([head2.replace(' cor',''),seq2,placeholder2,qual2]))

    total_unfixable = unfix_r1_count+unfix_r2_count+unfix_both_count
    total_retained = counter - total_unfixable

    unfix_log = open(os.path.join(
        output_dir, "rmunfixable_{}.log".format(opts.id)), 'w')
    unfix_log.write('total PE reads:%s\nremoved PE reads:%s\nretained PE reads:%s\nR1 corrected:%s\nR2 corrected:%s\npairs corrected:%s\nR1 unfixable:%s\nR2 unfixable:%s\nboth reads unfixable:%s\n' % (counter,total_unfixable,total_retained,r1_cor_count,r2_cor_count,pair_cor_count,unfix_r1_count,unfix_r2_count,unfix_both_count))

    r1out.close()
    r2out.close()
    unfix_log.close()

    outfiles = [os.path.join(output_dir, "unfixrm_{}".format(os.path.basename(opts.leftreads).replace('.gz', ''))),
                os.path.join(output_dir, "unfixrm_{}".format(os.path.basename(opts.rightreads).replace('.gz', '')))]
    for outfile in outfiles:
        print("gzipping %s" % outfile)
        os.system("gzip %s" % outfile)
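# Example usage (hypothetical file names):
#   python FilterUncorrectabledPEfastq.py -1 sample_R1.cor.fq.gz -2 sample_R2.cor.fq.gz -s sample
# writes gzipped unfixrm_* fastq files and rmunfixable_sample.log alongside the inputs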
--------------------------------------------------------------------------------
/utilities/README.md:
--------------------------------------------------------------------------------
# Utilities for pre-processing fastq files prior to assembly
A collection of scripts for processing fastq files in ways that improve de novo transcriptome assemblies, and for evaluating those assemblies.

## FilterUncorrectabledPEfastq.py
Takes paired-end Illumina fastq files generated from an RNA-seq library that have been error corrected with [rCorrector](https://github.com/mourisl/Rcorrector) (see [Song and Florea 2015, Gigascience](https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0089-y)), removes reads with errors that are unfixable, and strips the 'cor' flags from the headers of reads that were corrected.

## RemoveFastqcOverrepSequenceReads.py
Parses the fastqc output files to retrieve over-represented sequences, and uses these to remove read pairs where either read has a sequence match to an over-represented sequence.
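Usage sketches for both scripts, with hypothetical input file names (run `python <script> -h` for the full option list):

```bash
# remove pairs with an unfixable read; writes unfixrm_* fastq files and a log
python FilterUncorrectabledPEfastq.py -1 sample_R1.cor.fq.gz -2 sample_R2.cor.fq.gz -s sample

# remove pairs matching fastqc over-represented sequences; the fastqc_data.txt
# files come from the fastqc output for each mate
python RemoveFastqcOverrepSequenceReads.py -1 sample_R1.fq.gz -2 sample_R2.fq.gz \
    -fql sample_R1_fastqc/fastqc_data.txt -fqr sample_R2_fastqc/fastqc_data.txt -o sample
```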
--------------------------------------------------------------------------------
/utilities/RemoveFastqcOverrepSequenceReads.py:
--------------------------------------------------------------------------------
import gzip
from os.path import basename
import argparse
import re
try:
    from itertools import izip_longest
except ImportError:
    from itertools import zip_longest as izip_longest


def seqsmatch(overreplist,read):
    """
    returns True if any over-represented
    sequence is a substring of the read
    """
    flag=False
    for seq in overreplist:
        if seq in read:
            flag=True
            break
    return flag

def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        # open uncompressed files in binary mode too, so that
        # reads can be decoded uniformly downstream
        r1handle=open(r1file,'rb')
        r2handle=open(r2file,'rb')

    return r1handle,r2handle

def FastqIterate(iterable,fillvalue=None):
    "Grab one 4-line fastq read at a time"
    args = [iter(iterable)] * 4
    return izip_longest(fillvalue=fillvalue, *args)

def ParseFastqcLog(fastqclog):
    """
    extract over-represented sequences from the 'Overrepresented sequences'
    module of a fastqc data file; returns an empty list if the module
    is absent or empty
    """
    seqs=[]
    with open(fastqclog) as fp:
        for result in re.findall('Overrepresented sequences(.*?)END_MODULE', fp.read(), re.S):
            seqs.extend([i.split('\t')[0] for i in result.split('\n')[2:-1]])
    return seqs

if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for removing reads with over-represented sequences")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    parser.add_argument('-fql','--fastqc_left',dest='l_fastqc',type=str,help='fastqc text file for R1')
    parser.add_argument('-fqr','--fastqc_right',dest='r_fastqc',type=str,help='fastqc text file for R2')
    parser.add_argument('-o','--output-log-prefix',dest='logprefix',type=str,help='prefix for logfile summarizing filtering')
    opts = parser.parse_args()

    logout = open('%s_rmoverrep.log' % opts.logprefix,'w')
    leftseqs=ParseFastqcLog(opts.l_fastqc)
    if leftseqs == []:
        print("no overrepresented sequences in R1")
    rightseqs=ParseFastqcLog(opts.r_fastqc)
    if rightseqs == []:
        print("no overrepresented sequences in R2")

    r1_out=open('rmoverrep_'+basename(opts.leftreads).replace('.gz',''),'w')
    r2_out=open('rmoverrep_'+basename(opts.rightreads).replace('.gz',''),'w')

    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)

    counter=0
    failcounter=0

    with r1_stream as f1, r2_stream as f2:
        R1=FastqIterate(f1)
        R2=FastqIterate(f2)
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print("%s reads processed" % counter)

            head1,seq1,placeholder1,qual1=[i.decode('ASCII').strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.decode('ASCII').strip() for j in next(R2)]

            flagleft,flagright=seqsmatch(leftseqs,seq1),seqsmatch(rightseqs,seq2)

            # keep the pair only if neither read matches an over-represented sequence
            if True not in (flagleft,flagright):
                r1_out.write('%s\n' % '\n'.join([head1,seq1,'+',qual1]))
                r2_out.write('%s\n' % '\n'.join([head2,seq2,'+',qual2]))
            else:
                failcounter+=1

    logout.write('n_reads_eval\tn_reads_retained\tn_pe_reads_filtered\n')
    logout.write('%s\t%s\t%s\n' % (counter,counter-failcounter,failcounter))
    logout.close()

    r1_out.close()
    r2_out.close()
--------------------------------------------------------------------------------
/utilities/RepairSRAHeadersForTrinity.py:
--------------------------------------------------------------------------------
import gzip
from os.path import basename
import argparse

try:
    from itertools import izip_longest
except ImportError:
    from itertools import zip_longest as izip_longest

def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        # open uncompressed files in binary mode too, so that
        # reads can be decoded uniformly downstream
        r1handle=open(r1file,'rb')
        r2handle=open(r2file,'rb')

    return r1handle,r2handle


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)


if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for repairing SRA-style fastq headers for Trinity")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    opts = parser.parse_args()

    r1out=open('sraheaderfixed_%s' % basename(opts.leftreads).replace('.gz',''),'w')
    r2out=open('sraheaderfixed_%s' % basename(opts.rightreads).replace('.gz',''),'w')
    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)
    with r1_stream as f1, r2_stream as f2:
        R1=grouper(f1,4)
        R2=grouper(f2,4)
        counter=0
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print("%s reads processed" % counter)

            head1,seq1,placeholder1,qual1=[i.decode('ASCII').strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.decode('ASCII').strip() for j in next(R2)]
            # Trinity expects paired-read headers to end in /1 and /2
            head1=head1.split()[0]+'/1'
            head2=head2.split()[0]+'/2'
            placeholder1 = '+'
            placeholder2 = '+'
            r1out.write('%s\n' % '\n'.join([head1,seq1,placeholder1,qual1]))
            r2out.write('%s\n' % '\n'.join([head2,seq2,placeholder2,qual2]))

    r1out.close()
    r2out.close()
--------------------------------------------------------------------------------