├── Classify_bins.sh
├── README.md
├── diamond2vizbin.R
├── fastqCombinePairedEnd.py
└── sample_size.sh
/Classify_bins.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# This script classifies selected bins with phylosift.

# Call this script from where you placed the binning output
# and have stored the bin ID file. The bin ID file is a text file
# with on each new line a bin ID (e.g. 304). The fasta files for the bins are assumed
# to be stored in a subdirectory of the main directory (folder/binFolder).

# created by Ruben Props

#######################################
##### MAKE PARAMETER CHANGES HERE #####
#######################################

# Make sure you named the binning folder according to the
# parameterization (e.g., k4_L3000 for kmer=4 and length threshold=3000)
k=4
L=3000
folder=k${k}_L${L}
binFolder=fasta-bins
input=bins2classify.txt

# fasta extension
ext=fa

# Number of threads
threads=20

####################################################
##### DO NOT MAKE ANY CHANGES BEYOND THIS LINE #####
##### unless you know what you're doing        #####
####################################################

cat $input | while read ID
do
echo "[`date`] Starting with bin ${ID}"
echo ./${folder}/${binFolder}/${ID}.${ext}
phylosift all --threads $threads --output ./${folder}/phylosift_bin_${ID} ./${folder}/${binFolder}/${ID}.${ext}
done

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
*******************

Up-to-date workflow and wiki can be found [here](https://github.com/rprops/MetaG_analysis_workflow/wiki)

*******************

# Metagenomic analysis workflow

### Install DESMAN
```
module load gsl python-anaconda2
cd
git clone https://github.com/chrisquince/DESMAN.git
cd DESMAN
CFLAGS="-I$GSL_INCLUDE -L$GSL_LIB" python setup.py install --user
```
It is now installed and can be called like this:
```
~/.local/bin/desman -h
```
Or add it to your path:
```
export PATH=~/.local/bin:$PATH
```
### Other software
Look [here](https://github.com/rprops/DESMAN/wiki/Software-installations) to find out if you need to install anything else for the analysis. Make sure all these modules are loaded.

### Step 1: Quality trimming of reads

Make sure that you have non-interleaved fastq.gz files of forward and reverse reads. These should have an *R1* tag in their filename and be saved in a directory called *sample*, e.g.: sample/sample.R1.fastq.gz.

**IMPORTANT** The adapter trimming in qc.sh is set by default for TruSeq paired-end libraries. If your libraries use different adapters, change the path in the shell script to the correct adapter fasta file located at /home/your_username/Trimmomatic-0.36/adapters.

**IMPORTANT** In case you are unsure which adapters are present in the sequences, you can download bbtools
(https://sourceforge.net/projects/bbmap/), unzip the tar.gz and add the directory to your PATH.
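A minimal sketch of that install step (the archive name and the unpack directory ./bbmap are assumptions; adjust them to the version you downloaded):
```
tar -xzf BBMap_*.tar.gz
export PATH=$PATH:$(pwd)/bbmap
```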
Then run the following code on a subsample of a sample (e.g., 1m reads). The resulting consensus sequences of the adapters will be stored in adapters.fa.
```
bbmerge.sh in1=$(echo *R1.fastq) in2=$(echo *R2.fastq) outa=adapters.fa reads=1m
```
Run quality trimming:
```
bash /nfs/vdenef-lab/Shared/Ruben/scripts_metaG/wrappers/Assembly/qc.sh sample_directory
```
Alternatively, modify the run_quality.sh or run_quality.pbs scripts to run sequentially.

Copy the FastQC files to a new folder (adjust the paths):
```
rsync -a --include '*/' --include '*fastqc.html' --exclude '*' /scratch/vdenef_fluxm/rprops/DESMAN/metaG/Nextera /scratch/vdenef_fluxm/rprops/DESMAN/metaG/FASTQC --progress
```
#### Random subsampling of reads (useful if you are interested in abundant taxa)

You can check the number of reads in the interleaved fasta file with the sample_size.sh script. This will store the sample sizes in the sample_sizes.txt file. Run it in the directory where your samples are located. Then use seqtk to randomly subsample your files to the desired sample size (-s sets the seed for random sampling); run it per sample and write the output to a new file:
```
bash sample_size.sh
seqtk sample -s 777 sample.fastq 5000000 > sample.sub.fastq
```
### Step 2: Start co-assembly
At this point you have multiple assemblers to choose from (IDBA-UD/Megahit/...). Here we choose IDBA-UD.

#### IDBA-UD assembly
Merge all interleaved files into one file with the following shell script:
```
bash assembly_prep.sh
```
**Optional:** normalize data based on coverage (BBNorm)
This can be required for co-assemblies that are too big (see [here](http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/)). First estimate the memory requirements based on the number of unique kmers using loglog.sh. Run this in a pbs script (due to memory requirements).
```
loglog.sh in=merged_dt_int.fasta
bbnorm.sh in=merged_dt_int.fasta out=merged_dt_int_normalized.100.5.fasta target=100 min=5
```
Due to the normalization some paired reads will have lost their mate. Therefore we must split the normalized fasta file and recombine the reads into a single fasta file (without the unpaired sequences). Run the code below in a pbs script. Adjust the patterns " 1:" and " 2:" to reflect the R1/R2 read names in your sequence data, and adjust the file names if yours differ.
```
# Generate list of R1 and R2 reads
grep " 1:" merged_dt_int_normalized.100.5.fasta | sed "s/>//g" > list.R1
grep " 2:" merged_dt_int_normalized.100.5.fasta | sed "s/>//g" > list.R2
# Extract R1 and R2 reads from interleaved file using the filterbyname.sh script from BBMap
filterbyname.sh in=merged_dt_int_normalized.100.5.fasta names=list.R1 out=merged_dt_int_normalized.100.5.R1.fasta -include t
filterbyname.sh in=merged_dt_int_normalized.100.5.fasta names=list.R2 out=merged_dt_int_normalized.100.5.R2.fasta -include t
# Interleave R1/R2 reads using in-house perl script (author: Sunit Jain)
/nfs/vdenef-lab/Shared/Ruben/scripts_metaG/SeqTools/interleave.pl -fwd merged_dt_int_normalized.100.5.R1.fasta -rev merged_dt_int_normalized.100.5.R2.fasta -o remerged_dt_int_normalized.100.5
```

Now run idba_ud (you may have to adjust the parameters):
```
idba_ud -o idba_k52_100_s8 -r remerged_dt_int_normalized.100.5.fasta --num_threads ${PBS_NP} --mink 52 --maxk 100 --step 8
```
If you run this on Flux, you'll have to add the following line above your actual code to allow OpenMP to run multithreaded.
```
# On Flux, to use OpenMP, you need to explicitly tell OpenMP how many threads to use.
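# PBS_NP is set by the PBS scheduler to the number of cores allocated to the job.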
export OMP_NUM_THREADS=${PBS_NP}
```

### IMPORTANT
Some assemblers (such as the newest IDBA version) add extra information to the fasta headers, which can be problematic for other software (e.g., anvio and phylosift). To avoid this we can use an anvio script to make the headers compatible with all downstream software.
```
module load anvio
anvi-script-reformat-fasta contig.fa -o contigs-fixed.fa -l 0 --simplify-names
mv contigs-fixed.fa contig.fa
```

### Step 3: Generating assembly stats
We use quast for this purpose (adjust the path for your specific assembly):
```
quast.py -f --meta -t 20 -l "Contigs" megahit_assembly_sensitive/final.contigs.fa
```
For IDBA-UD this can be run on both the scaffolds and the contigs. In general, the contigs are more reliable.

### Step 4A: CONCOCT binning
CONCOCT binning is not optimal for large co-assemblies since it combines coverage and kmer/GC% information in one step, which results in very slow clustering for complex data sets. If that is the case, go for BinSanity instead. We will focus on the IDBA-UD assembly here; for megahit you can find the analogous workflow [here](https://github.com/rprops/DESMAN/wiki/Megahit-analysis).

#### IDBA-UD
IDBA-UD will also make scaffolds based on the contigs, so the two primary output files are:
```
scaffold.fa
contig.fa
```
First cut up your large contigs before running CONCOCT (make sure CONCOCT is in your path).
```
module load python-anaconda2/201607
module load gsl

mkdir contigs
cut_up_fasta.py -c 10000 -o 0 -m idba_k52_100_s8/contig.fa > contigs/final_contigs_c10K.fa
```
Be aware that this introduces dots in the contig names (e.g., contig-100_1647.1). This can pose an issue for taxonomic classifications through Phylosift. Replace the dots with underscores as follows:
```
sed "s/\./_/g" final_contigs_c10K.fa > final_contigs_c10K_nodots.fa
```
**Optional/Necessary** You can at this point already remove short contigs (e.g., < 1000 bp); this will save time during mapping as well as when generating the coverage file.
```
reformat.sh in=final_contigs_c10K.fa out=final_contigs_c10K_1000.fa minlength=1000
```
Make an index for your contigs fasta.
```
cd contigs
bwa index final_contigs_c10K.fa
cd -
```
Then map the reads (**warning:** fastq files must not contain unpaired reads).
```
# Create the output directory for the SAM files
mkdir -p Map

for file in *R1.fastq
do

   stub=${file%_R1.fastq}

   echo $stub

   file2=${stub}_R2.fastq

   bwa mem -t 20 contigs/final_contigs_c10K.fa $file $file2 > Map/${stub}.sam
done

```
After the mapping, create .bam files using samtools. Make sure you install bedtools2 for the next step.
**IMPORTANT:** Make sure you have samtools >v1.3!
Use -@ to multithread and -m to adjust memory usage.
```
#!/bin/bash

set -e

for file in Map/*.sam
do
    stub=${file%.sam}
    stub2=${stub#Map\/}
    echo $stub
    samtools view -h -b -S $file > ${stub}.bam
    samtools view -b -F 4 ${stub}.bam > ${stub}.mapped.bam
    samtools sort -m 1000000000 ${stub}.mapped.bam -o ${stub}.mapped.sorted.bam -@ 8
    samtools index ${stub}.mapped.sorted.bam
    bedtools genomecov -ibam ${stub}.mapped.sorted.bam -g contigs/final_contigs_c10K.len > ${stub}_cov.txt
done

for i in Map/*_cov.txt
do
   echo $i
   stub=${i%_cov.txt}
   stub=${stub#Map\/}
   echo $stub
   awk -F"\t" '{l[$1]=l[$1]+($2 *$3);r[$1]=$4} END {for (i in l){print i","(l[i]/r[i])}}' $i > Map/${stub}_cov.csv
done

~/DESMAN/scripts/Collate.pl Map | tr "," "\t" > Coverage.tsv
```
Once you've formatted the coverage files, we can start binning using CONCOCT (use 40 cores in the pbs script to make full use of CONCOCT's multithreading). We perform the binning on contigs > 3000 bp and set the kmer signature length to 4.
```
module load gsl
mkdir Concoct
cd Concoct
mv ../Coverage.tsv .
concoct --coverage_file Coverage.tsv --composition_file ../contigs/final_contigs_c10K.fa -s 777 --no_original_data -l 3000 -k 4
cd ..
```
After binning we can quickly evaluate the clusters:
```
mkdir evaluation-output
Rscript ~/CONCOCT/scripts/ClusterPlot.R -c clustering_gt3000.csv -p PCA_transformed_data_gt3000.csv -m pca_means_gt3000.csv -r pca_variances_gt3000_dim -l -o evaluation-output/ClusterPlot.pdf
```
CONCOCT only outputs a list of bins with the associated contig ids. We use the extract_fasta_bins.py script that ships with CONCOCT to extract the corresponding fasta file for each bin (run in a pbs script). We need these fasta files later on for CheckM.
```
mkdir fasta-bins
extract_fasta_bins.py ../contigs/final_contigs_c10K.fa ./k4_L3000_diginorm/clustering_gt2000.csv --output_path ./k4_L3000_diginorm/evaluation-output
```
Then run CheckM; this requires pplacer, hmmer and prodigal to be loaded. It will generate an output file bin_stats_mega_k4_L3000.tsv and a plot describing these results, both stored in the evaluation-output folder.
```
checkm lineage_wf --pplacer_threads 10 -t 10 -x fa -f bin_stats_mega_k4_L3000.tsv ./k4_L3000/fasta-bins ./k4_L3000/tree_folder
checkm bin_qa_plot ./k4_L3000/tree_folder ./k4_L3000/fasta-bins ./k4_L3000/evaluation-output -x fa --dpi 250
```
Next we select some bins and classify them with the following bash script (classify_bins.sh):
```
#!/bin/bash

# This script classifies selected bins with phylosift.

# Call this script from where you placed the binning output
# and have stored the bin ID file. The bin ID file is a text file
# with on each new line a bin ID (e.g. 304). The fasta files for the bins are assumed
# to be stored in a subdirectory of the main directory (folder/binFolder).
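#
# Example bins2classify.txt (one bin ID per line; the IDs below are hypothetical):
# 304
# 317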

# created by Ruben Props

#######################################
##### MAKE PARAMETER CHANGES HERE #####
#######################################

# Make sure you named the binning folder according to the
# parameterization (e.g., k4_L3000 for kmer=4 and length threshold=3000)
k=4
L=3000
folder=k${k}_L${L}
binFolder=fasta-bins
input=bins2classify.txt

# fasta extension
ext=fa

# Number of threads
threads=20

####################################################
##### DO NOT MAKE ANY CHANGES BEYOND THIS LINE #####
##### unless you know what you're doing        #####
####################################################

cat $input | while read ID
do
echo "[`date`] Starting with bin ${ID}"
echo ./${folder}/${binFolder}/${ID}.${ext}
phylosift all --threads $threads --output ./${folder}/phylosift_bin_${ID} ./${folder}/${binFolder}/${ID}.${ext}
done
```

--------------------------------------------------------------------------------
/diamond2vizbin.R:
--------------------------------------------------------------------------------
### Script for translating Diamond classification output to VizBin annotation
### Output is a file with contig ID and label at the requested taxonomic rank
### The user will still need to sort the list so that the rows exactly match
### with the contigs used to generate the binning.
userprefs <- commandArgs(trailingOnly = TRUE)

library("dplyr")

file <- userprefs[1]
reference <- userprefs[2]

file1 <- read.delim(file,
                    stringsAsFactors = FALSE, header = FALSE)
reference <- read.delim(reference,
                        stringsAsFactors = FALSE, header = FALSE)

# Contigs in the reference that received no Diamond classification are labeled UNCLASSIFIED
tmp <- anti_join(reference, file1, by = "V1")
tmp <- data.frame(V1 = tmp$V1, V2 = rep("UNCLASSIFIED", nrow(tmp)))
result <- rbind(file1, tmp)
result <- result[match(reference$V1, as.character(result$V1)), ]
write.csv(data.frame(label = as.character(result$V2)), file = "Annotation_vizbin.csv", row.names = FALSE, quote = FALSE)
--------------------------------------------------------------------------------
/fastqCombinePairedEnd.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
"""Resynchronize 2 fastq or fastq.gz files (R1 and R2) after they have been
trimmed and cleaned.
Credit to: https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py
WARNING! This program assumes that the fastq file uses EXACTLY four lines per
sequence.

Three output files are generated. The first two files contain the reads of the
pairs that match and the third contains the solitary reads.

Usage:
    python fastqCombinePairedEnd.py input1 input2 separator

    input1 = LEFT fastq or fastq.gz file (R1)
    input2 = RIGHT fastq or fastq.gz file (R2)
    separator = character that separates the name of the read from the part that
        describes if it goes on the left or right, usually with characters '1' or
        '2'. The separator is often a space, but could be another character. A
        space is used by default.
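
Example (hypothetical file names; the default space separator is used):
    python fastqCombinePairedEnd.py sample.R1.fastq.gz sample.R2.fastq.gz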
"""

# Importing modules
import gzip
import sys

# Parsing user input
try:
    in1 = sys.argv[1]
    in2 = sys.argv[2]
except:
    print(__doc__)
    sys.exit(1)

try:
    separator = sys.argv[3]
except:
    separator = " "

# Defining classes
class Fastq(object):
    """Fastq object with name and sequence
    """

    def __init__(self, name, seq, name2, qual):
        self.name = name
        self.seq = seq
        self.name2 = name2
        self.qual = qual

    def getShortname(self, separator):
        self.temp = self.name.split(separator)
        del(self.temp[-1])
        return separator.join(self.temp)

    def write_to_file(self, handle):
        handle.write(self.name + "\n")
        handle.write(self.seq + "\n")
        handle.write(self.name2 + "\n")
        handle.write(self.qual + "\n")

# Defining functions
def myopen(infile, mode="r"):
    if infile.endswith(".gz"):
        return gzip.open(infile, mode=mode)
    else:
        return open(infile, mode=mode)

def fastq_parser(infile):
    """Takes a fastq file infile and returns a fastq object iterator
    """

    with myopen(infile) as f:
        while True:
            name = f.readline().strip()
            if not name:
                break

            seq = f.readline().strip()
            name2 = f.readline().strip()
            qual = f.readline().strip()
            yield Fastq(name, seq, name2, qual)

# Main
if __name__ == "__main__":
    seq1_dict = {}
    seq2_dict = {}
    seq1 = fastq_parser(in1)
    seq2 = fastq_parser(in2)
    s1_finished = False
    s2_finished = False

    if in1.endswith('.gz'):
        outSuffix='.fastq.gz'
    else:
        outSuffix='.fastq'

    with myopen(in1 + "_pairs_R1" + outSuffix, "w") as out1:
        with myopen(in2 + "_pairs_R2" + outSuffix, "w") as out2:
            with myopen(in1 + "_singles" + outSuffix, "w") as out3:
                while not (s1_finished and s2_finished):
                    try:
                        s1 = seq1.next()
                    except:
                        s1_finished = True
                    try:
                        s2 = seq2.next()
                    except:
                        s2_finished = True

                    # Add new sequences to hashes
                    if not s1_finished:
                        seq1_dict[s1.getShortname(separator)] = s1
                    if not s2_finished:
                        seq2_dict[s2.getShortname(separator)] = s2

                    if not s1_finished and s1.getShortname(separator) in seq2_dict:
                        seq1_dict[s1.getShortname(separator)].write_to_file(out1)
                        seq1_dict.pop(s1.getShortname(separator))
                        seq2_dict[s1.getShortname(separator)].write_to_file(out2)
                        seq2_dict.pop(s1.getShortname(separator))

                    if not s2_finished and s2.getShortname(separator) in seq1_dict:
                        seq2_dict[s2.getShortname(separator)].write_to_file(out2)
                        seq2_dict.pop(s2.getShortname(separator))
                        seq1_dict[s2.getShortname(separator)].write_to_file(out1)
                        seq1_dict.pop(s2.getShortname(separator))

                # Treat all unpaired reads
                for r in seq1_dict.values():
                    r.write_to_file(out3)

                for r in seq2_dict.values():
                    r.write_to_file(out3)

--------------------------------------------------------------------------------
/sample_size.sh:
--------------------------------------------------------------------------------
#!/bin/bash
set -e
# Run from the directory that contains the sample subdirectories.
# Read counts are obtained by counting HISEQ header lines in each
# interleaved fasta file; adjust the pattern if your reads come from
# a different instrument.
for i in */; do
    cd $i
    grep -c HISEQ dt_int.fasta | awk -v pwd="${PWD##*/}" '{print pwd, $1}' >> ../sample_sizes.txt
    cd -
done

--------------------------------------------------------------------------------