├── CHANGES.md
├── CITATAION
├── INSTALLATION.md
├── LICENSE
├── README.md
├── Rscripts
    ├── Ordination_plots.R
    ├── README.md
    └── Rarefaction-curves.R
├── VERSION
├── kraken-multiple-taxa.py
└── kraken-multiple.py

/CHANGES.md:
--------------------------------------------------------------------------------
1 | Contains a list of the differences between the different versions
2 | 
--------------------------------------------------------------------------------
/CITATAION:
--------------------------------------------------------------------------------
1 | If you use atavide, please cite
2 | 
3 | DOI:10.5281/zenodo.6000803
4 | 
--------------------------------------------------------------------------------
/INSTALLATION.md:
--------------------------------------------------------------------------------
1 | **Pre-requisites**
2 | 
3 | Run the Kraken2 command and generate a report
4 | 
5 | kraken2 --db $KRAKEN_DB --paired read_1.fastq read_2.fastq --threads 1 --use-names --report kraken_report --report-zero-counts --output kraken.out
6 | 
7 | Make sure to add the parameter **--report-zero-counts** to the kraken2 command. If this parameter is omitted, the result from this script will be hard to parse.
8 | 
9 | **Dependencies**
10 | 
11 | - Python 3
12 | 
13 | 
14 | - Python packages
15 | - numpy
16 | - scipy
17 | - argparse
18 | - pandas
19 | - collections
20 | 
21 | **Kraken-output-manipulation**
22 | 
23 | git clone https://github.com/npbhavya/Kraken2-output-manipulation.git
24 | 
25 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | Copyright © 2022 Bhavya Papudeshi
3 | 
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software
5 | and associated documentation files (the “Software”), to deal in the Software without restriction,
6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do
8 | so, subject to the following conditions:
9 | 
10 | The above copyright notice and this permission notice shall be included in all copies or substantial
11 | portions of the Software.
12 | 
13 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
18 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![DOI](https://zenodo.org/badge/216017473.svg)](https://zenodo.org/badge/latestdoi/216017473)
2 | 
3 | # Kraken2-output-manipulation
4 | 
5 | Kraken2 generates a report for each dataset. This script takes these individual reports and combines them into one file in the format
6 | 
7 | |Taxa ID| sample1| sample2| sample3| sample4| .....|
8 | |--------|-------|------------|------------|------------|------|
9 | |1234 | 1909| 10| 100 | 0 | ..... |
10 | 
11 | or output
12 | 
13 | |Taxa |sample1 |sample2 |sample3 | sample4 |..... |
14 | |--------|-------|------------|------------|------------|------|
15 | |Pseudomonas | 1909 | 10 | 100 | 0 | .....|
16 | 
17 | 
18 | The Taxa ID number is the same as column 5 in the kraken2 output report, "NCBI taxonomic ID number".
19 | The numbers under sample1, sample2, ...
etc. are set by the user, who selects which of the following report columns is reported
20 | - col1: Percentage of fragments covered by the clade rooted at this taxon
21 | - col2: Number of fragments covered by the clade rooted at this taxon
22 | - col3: Number of fragments assigned directly to this taxon
23 | - col4: A rank code
24 | - col5: NCBI taxonomic ID number (it doesn't make sense to report this for every column, but it's possible)
25 | - col6: Indented scientific name
26 | 
27 | For Kraken2 installation instructions and the manual, see the Kraken2 documentation
28 | - https://ccb.jhu.edu/software/kraken2/index.shtml?t=manual
29 | - https://ncgas.org/Blog_Posts/Metagenomic%20taxa%20analysis.php
30 | 
31 | ## Pre-requisites
32 | Run the Kraken2 command and generate a report \
33 | *kraken2 --db $KRAKEN_DB --paired read_1.fastq read_2.fastq --threads 1 --use-names --report kraken_report --report-zero-counts --output kraken.out*
34 | 
35 | Make sure to add the parameter *--report-zero-counts* to the kraken2 command. If this parameter is omitted, the result from this script will be hard to parse.
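
To see why *--report-zero-counts* matters: the scripts in this repository append one value per report to a per-taxon list, so a taxon that is missing from one report leaves its list shorter than the others and its values no longer line up with the sample columns. The sketch below illustrates the idea with made-up counts. It is not part of the repository, and `merge_aligned` is a hypothetical helper that pads missing taxa with zeros.

```python
def merge_aligned(samples):
    """Merge per-sample {taxon: count} dicts into {taxon: [count per sample]}.

    Taxa absent from a sample get 0, so every list has one entry per sample.
    """
    taxa = sorted({taxon for counts in samples for taxon in counts})
    return {taxon: [counts.get(taxon, 0) for counts in samples] for taxon in taxa}


# Made-up counts: sample2 has no Vibrionaceae row, as can happen when a
# report is generated without --report-zero-counts.
sample1 = {"Vibrionaceae": 45, "Budviciaceae": 80}
sample2 = {"Budviciaceae": 359}

table = merge_aligned([sample1, sample2])
print(table)
```

With zero counts reported by Kraken2 itself, every report contains the same taxa and no padding step like this is needed.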
36 | 
37 | ## Dependencies
38 | **Python 3** \
39 | Python packages
40 | - numpy
41 | - scipy
42 | - argparse
43 | - pandas
44 | - collections
45 | 
46 | ## Usage
47 | *python kraken-multiple.py --help \
48 | usage: kraken-multiple.py [-h] [-d DIRECTORY] [-r {U,R,D,K,P,C,O,F,G,S}] [-c {1,2,3,4,5,6}] [-o OUTPUT]*
49 | 
50 | Take multiple kraken output files and consolidate them to one output
51 | 
52 | optional arguments: \
53 | -h, --help show this help message and exit \
54 | -d DIRECTORY Enter a directory with kraken summary reports \
55 | -r {U,R,D,K,P,C,O,F,G,S} Enter a rank code \
56 | -c {1,2,3,4,5,6} Enter the column number in the report you would like to include in the output \
57 | -o OUTPUT Enter the output file name
58 | 
59 | #### For getting taxa information instead of taxa ID
60 | 
61 | *python kraken-multiple-taxa.py --help \
62 | usage: kraken-multiple-taxa.py [-h] [-d DIRECTORY] [-r {U,R,D,K,P,C,O,F,G,S}] [-c {1,2,3,4,5,6}] [-o OUTPUT]*
63 | 
64 | Take multiple kraken output files and consolidate them to one output
65 | 
66 | optional arguments: \
67 | -h, --help show this help message and exit \
68 | -d DIRECTORY Enter a directory with kraken summary reports \
69 | -r {U,R,D,K,P,C,O,F,G,S} Enter a rank code \
70 | -c {1,2,3,4,5,6} Enter the column number in the report you would like to include in the output \
71 | -o OUTPUT Enter the output file name
72 | 
73 | #### Detailed usage description
74 | The input for this script is
75 | - directory with kraken reports only. Use the -d flag to point to this directory.
76 | The format of each kraken report should be
77 | 39.87 290930 290930 U 0 unclassified
78 | 60.13 438756 117 R 1 root
79 | 59.67 435435 723 R1 131567 cellular organisms
80 | 58.38 425979 4039 D 2 Bacteria
81 | 33.55 244810 2293 P 1224 Proteobacteria
82 | 16.06 117202 1091 C 28211 Alphaproteobacteria
83 | 
84 | - rank, since the kraken report includes results for each rank ranging from "(U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies." The user can define the level of rank they would like to look at. The -r flag sets this parameter.
85 | 
86 | - The -c flag sets which column the user would like reported in the final report.
87 | - col1: Percentage of fragments covered by the clade rooted at this taxon
88 | - col2: Number of fragments covered by the clade rooted at this taxon
89 | - col3: Number of fragments assigned directly to this taxon
90 | - col4: A rank code
91 | - col5: NCBI taxonomic ID number (it doesn't make sense to report this for every column, but it's possible)
92 | - col6: Indented scientific name
93 | 
94 | - output. The -o flag sets the name of the output file to which the final report is written.
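
The rank and column selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's script: the function name `select_column` and the in-memory sample report are hypothetical, and each report is assumed to be tab-separated with six columns as in the example above.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for one Kraken2 report (tab-separated, six columns).
SAMPLE_REPORT = (
    " 39.87\t290930\t290930\tU\t0\tunclassified\n"
    " 58.38\t425979\t4039\tD\t2\t    Bacteria\n"
    " 33.55\t244810\t2293\tP\t1224\t      Proteobacteria\n"
)


def select_column(report_text, rank, col):
    """Return {taxonomic ID: value of column `col`} for rows with rank code `rank`.

    `col` is 1-based, matching the -c flag.
    """
    selected = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip anything that is not a six-column report row
        if fields[3].strip() == rank:
            selected[fields[4].strip()] = fields[col - 1].strip()
    return selected


# Consolidate two samples (here the same text twice) into one list per taxon.
merged = defaultdict(list)
for report in (SAMPLE_REPORT, SAMPLE_REPORT):
    for taxid, value in select_column(report, rank="P", col=2).items():
        merged[taxid].append(value)

print(dict(merged))
```

Keying on column 5 gives the taxonomic-ID layout; keying on the stripped column 6 instead would give the taxa-name layout.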
95 | 
96 | ## Example command
97 | python kraken-multiple.py -d kraken_report/ -r F -c 2 -o kraken-report-final
98 | 
99 | ## Output
100 | Taking a look at the kraken-report-final \
101 | **TaxaID ['sample1','sample2','sample3','sample4','sample5','sample6']** \
102 | 135621 ['210', '859', '2843', '595', '281', '1064'] \
103 | 468 ['80', '359', '1054', '361', '164', '299'] \
104 | 72275 ['66', '1838', '4664', '462', '75', '2074'] \
105 | 267888 ['45', '1407', '59440', '930', '120', '79']
106 | 
107 | If you ran kraken-multiple-taxa.py, then the output will be \
108 | **Taxa ['sample1','sample2','sample3','sample4','sample5','sample6']** \
109 | Actinomycetaceae ['210', '859', '2843', '595', '281', '1064'] \
110 | Budviciaceae ['80', '359', '1054', '361', '164', '299'] \
111 | Mycoplasmataceae ['66', '1838', '4664', '462', '75', '2074'] \
112 | Vibrionaceae ['45', '1407', '59440', '930', '120', '79']
113 | 
114 | ## Downstream
115 | Run the bash command to generate a csv file that can be easily imported to R
116 | 
117 | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" kraken_report_all >kraken_report_all_table.csv
118 | 
119 | |TaxaID |sample1|sample2|sample3|sample4|sample5|sample6|
120 | |---------|---------|----------|---------|---------|---------|---------|
121 | |135621 |210|859| 2843| 595| 281| 1064|
122 | |468 |80|359|1054|361| 164 |299|
123 | |72275 |66|1838|4664|462|75| 2074|
124 | |267888 |45|1407|59440|930|120|79|
125 | 
126 | ## Visualization
127 | In the directory Rscripts, there are two scripts
128 | - Rarefaction-curves.R : takes the output from this program and plots the rarefaction curve for the samples
129 | - Ordination_plots.R : takes the output from this program and plots ordination plots for the samples
130 | 
--------------------------------------------------------------------------------
/Rscripts/Ordination_plots.R:
--------------------------------------------------------------------------------
1 | #Importing the data
2 | 
data=read.csv("kraken-report-genus-col2-taxa.csv", header=TRUE, row.names = 1)
3 | data_t=t(data)
4 | data_matrix=as.matrix(data_t)
5 | library(vegan) #needed below for specnumber() and rrarefy()
6 | #data normalization using rarefaction method
7 | data_t=as.data.frame(t(data))
8 | S <- specnumber(data_t)
9 | raremax <-min(rowSums(data_t))
10 | Srare <- rrarefy(data_t, raremax)
11 | 
12 | ##PCA plots
13 | library(devtools)
14 | #install_github("vqv/ggbiplot") #to install the ggbiplot
15 | library(ggbiplot)
16 | 
17 | #step to remove the variables with zero variance
18 | Srare_clean=Srare[,apply(Srare, 2,var, na.rm=TRUE)!=0]
19 | 
20 | data.pca=prcomp(Srare_clean, center=TRUE, scale. = TRUE)
21 | summary(data.pca)
22 | #View(data.pca)
23 | 
24 | ggbiplot(data.pca)
25 | #location=c("add in factors like environment etc here")
26 | ggbiplot(data.pca, varname.size = 0, ellipse = TRUE, groups = location)
27 | 
28 | 
29 | ## NMDS plots
30 | #install.packages("vegan")
31 | library(vegan)
32 | 
33 | d=vegdist(Srare, method="bray")
34 | d_matrix=as.matrix(d, labels=T)
35 | mds=metaMDS(d, distance = "bray", k=3) #k=number of dimensions
36 | mds_data=as.data.frame(mds$points)
37 | mds_data$SampleID =rownames(mds_data)
38 | 
39 | #one way to visualize the MDS plots
40 | ggplot(mds_data, aes(x=MDS1, y=MDS2, color= location)) +geom_point()
41 | 
42 | #another way to visualize the MDS plots
43 | stressplot(mds)
44 | plot(mds)
45 | ordihull(mds, location, display = "sites", draw=c("polygon"), col=NULL, border=c("blue", "red", "green", "orange", "yellow"),
46 |          lty = c(1, 2, 1, 2), lwd = 2.5, label = TRUE)
47 | 
48 | ## Beta diversity index and PCoA plots
49 | ## Calculate multivariate dispersions
50 | mod_loc <- betadisper(d, location)
51 | mod_water <- betadisper(d, water)
52 | 
53 | #Statistical test
54 | anova(mod_loc)
55 | anova(mod_water)
56 | 
57 | ## Permutation test for F
58 | permutest(mod_loc, pairwise = TRUE, permutations = 99)
59 | permutest(mod_water, pairwise = TRUE, permutations = 99)
60 | 
61 | #Post hoc Tukey HSD test
62 | 
(mod_loc.HSD <- TukeyHSD(mod_loc))
63 | plot(mod_loc.HSD)
64 | 
65 | (mod_water.HSD <- TukeyHSD(mod_water))
66 | plot(mod_water.HSD)
67 | 
68 | #Plot the groups and distances to centroids with data ellipses instead of hulls - PCoA plots
69 | plot(mod_loc, ellipse = TRUE, hull = FALSE) # 1 sd data ellipse
70 | plot(mod_loc, ellipse = TRUE, hull = FALSE, conf = 0.90) # 90% data ellipse
71 | 
72 | plot(mod_water, ellipse = TRUE, hull = FALSE) # 1 sd data ellipse
73 | plot(mod_water, ellipse = TRUE, hull = FALSE, conf = 0.90) # 90% data ellipse
74 | 
--------------------------------------------------------------------------------
/Rscripts/README.md:
--------------------------------------------------------------------------------
1 | ## R scripts references
2 | 
3 | - Most of the code was written with the help of the NCGAS R workshop - https://ncgas.org/R%20for%20Biologists%20Workshop.php. \
4 | The chapter on Ordination plots
5 | 
6 | - R documentation
7 | 
8 | - NMDS tutorial in R https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/
9 | 
10 | - Diversity statistics - https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/07--diversity_stats.html
11 | 
--------------------------------------------------------------------------------
/Rscripts/Rarefaction-curves.R:
--------------------------------------------------------------------------------
1 | library(vegan)
2 | 
3 | #importing the file and parsing the file correctly
4 | # Replace the filename below with the actual filename.
5 | Data=read.table("kraken_report_all_R.csv", sep=",", row.names = 1, header=TRUE)
6 | Data_t=as.data.frame(t(Data))
7 | 
8 | #count the number of species
9 | S <- specnumber(Data_t)
10 | raremax <-min(rowSums(Data_t))
11 | 
12 | #Rarefaction of the samples
13 | Srare <- rarefy(Data_t, raremax)
14 | 
15 | #plotting the rarefaction curves
16 | plot(S, Srare, xlab = "Observed No. of Species", ylab = "Rarefied No. 
of Species")
17 | abline(0, 1)
18 | pdf("Rarefaction_curve.pdf")
19 | rarecurve(Data_t, step =20, sample = raremax, col = "blue", cex = 0.4)
20 | dev.off()
21 | 
22 | #PCA plots; `groups` is not defined in this script and must be supplied by the user (presumably one label per sample)
23 | Data_add=cbind(Data_t,groups)
24 | pca=rda(Data_t)
25 | points(pca, display=c("sites"), pch=20, col=factor(Data_add$groups))
26 | 
27 | summary(pca)
28 | biplot(pca, display=c("sites","species"), type=c("points"), xlab="PC1 (2.654508e+09%)", ylab="PC2 (1.748614e+09%)")
29 | points(pca, display=c("sites"),pch=20, col=factor(Data_add$groups))
30 | 
31 | groupnames=levels(Data_add$groups)
32 | legend("topright",
33 |        col = factor(Data_add$groups),
34 |        lty = 1,
35 |        legend = groupnames)
36 | 
--------------------------------------------------------------------------------
/VERSION:
--------------------------------------------------------------------------------
1 | v1.0
2 | 
--------------------------------------------------------------------------------
/kraken-multiple-taxa.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import sys
4 | import os
5 | import argparse
6 | import numpy as np
7 | import pandas as pd
8 | from collections import defaultdict
9 | 
10 | '''
11 | writing a script to take kraken2 summary files and make one file in the format
12 | taxaID taxa sample1 sample2 sample3 sample4 ....
13 | '''
14 | 
15 | #concatenating the kraken reports from multiple files using a dictionary
16 | def kraken_cat_report(dir, rank, col, out):
17 |     filelist=input_dir(dir)
18 |     h=defaultdict(list)
19 |     for file in filelist:
20 |         openfi=open(file, 'r')
21 |         for line in openfi:
22 |             fields=line.split('\t')
23 |             if (fields[3] == rank):
24 |                 taxa=fields[5].strip()
25 |                 h[taxa].append(fields[col-1].strip()) #strip padding and the trailing newline
26 | 
27 |     print ("Writing output to a file")
28 |     with open (out, 'w') as fout:
29 |         fout.write('Taxa\t%s\n' %filelist)
30 |         for key,value in h.items():
31 |             #print (key, len(value))
32 |             fout.write('%s\t%s\n' %(key, value))
33 | 
34 | #function that opens the directory, confirms there are files, and checks that the files are summary files
35 | def input_dir(dir):
36 |     files=os.listdir(dir)
37 |     assert (len(files)!=0), "The directory is empty"
38 |     path=[]
39 |     for f in files:
40 |         fipath=os.path.join(dir,f)
41 |         path.append(fipath)
42 |         report=open(fipath, 'r')
43 |         for line in report:
44 |             fields=line.split('\t')
45 |             cols=len(fields)
46 |             assert (cols==6), "The %s file is not a kraken2 summary report" %fipath
47 |             break
48 |     print ("Checking input file done")
49 |     return (path)
50 | 
51 | 
52 | if __name__=='__main__' :
53 |     parser=argparse.ArgumentParser(description="Take multiple kraken output files and consolidate them to one output")
54 |     parser.add_argument ('-d', dest='directory', help='Enter a directory with kraken summary reports')
55 |     parser.add_argument ('-r', dest='rank', choices=['U','R','D','K','P','C','O','F','G','S'], help='Enter a rank code')
56 |     parser.add_argument ('-c', dest='column', type=int, choices=range(1,7), help="Enter the column number in the report you would like to include in the output")
57 |     parser.add_argument ('-o', dest='output', help='Enter the output file name')
58 |     results=parser.parse_args()
59 |     #input_dir(results.directory)
60 |     kraken_cat_report(results.directory, results.rank, results.column, results.output)
61 | 
62 | 
--------------------------------------------------------------------------------
/kraken-multiple.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import sys
4 | import os
5 | import argparse
6 | import numpy as np
7 | import pandas as pd
8 | from collections import defaultdict
9 | 
10 | '''
11 | writing a script to take kraken2 summary files and make one file in the format
12 | taxaID taxa sample1 sample2 sample3 sample4 ....
13 | '''
14 | 
15 | #concatenating the kraken reports from multiple files using a dictionary
16 | def kraken_cat_report(dir, rank, col, out):
17 |     filelist=input_dir(dir)
18 |     h=defaultdict(list)
19 |     for file in filelist:
20 |         openfi=open(file, 'r')
21 |         for line in openfi:
22 |             fields=line.split('\t')
23 |             if (fields[3] == rank):
24 |                 h[fields[4]].append(fields[col-1].strip()) #strip padding and the trailing newline
25 | 
26 |     print ("Writing output to a file")
27 |     with open (out, 'w') as fout:
28 |         fout.write('TaxaID\t%s\n' %filelist)
29 |         for key,value in h.items():
30 |             #print (key, len(value))
31 |             fout.write('%s\t%s\n' %(key, value))
32 | 
33 | 
34 | #function that opens the directory, confirms there are files, and checks that the files are summary files
35 | def input_dir(dir):
36 |     files=os.listdir(dir)
37 |     assert (len(files)!=0), "The directory is empty"
38 |     path=[]
39 |     for f in files:
40 |         fipath=os.path.join(dir,f)
41 |         path.append(fipath)
42 |         report=open(fipath, 'r')
43 |         for line in report:
44 |             fields=line.split('\t')
45 |             cols=len(fields)
46 |             assert (cols==6), "The %s file is not a kraken2 summary report" %fipath
47 |             break
48 |     print ("Checking input file done")
49 |     return (path)
50 | 
51 | 
52 | if __name__=='__main__' :
53 |     parser=argparse.ArgumentParser(description="Take multiple kraken output files and consolidate them to one output")
54 |     parser.add_argument ('-d', dest='directory', help='Enter a directory with kraken summary reports')
55 |     parser.add_argument ('-r', dest='rank',
choices=['U','R','D','K','P','C','O','F','G','S'], help='Enter a rank code')
56 |     parser.add_argument ('-c', dest='column', type=int, choices=range(1,7), help="Enter the column number in the report you would like to include in the output")
57 |     parser.add_argument ('-o', dest='output', help='Enter the output file name')
58 |     results=parser.parse_args()
59 |     #input_dir(results.directory)
60 |     kraken_cat_report(results.directory, results.rank, results.column, results.output)
61 | 
62 | 
--------------------------------------------------------------------------------
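
The `sed` conversion in the README's Downstream section can also be done in Python. The following is a minimal sketch, assuming the output layout written by the two scripts above (a key, a tab, then a Python-style list). The function name `rows_to_csv` is hypothetical and not part of the repository.

```python
import ast
import csv
import io


def rows_to_csv(lines):
    """Convert 'key<TAB>[list]' lines, as written by the scripts, to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        key, _, values = line.rstrip("\n").partition("\t")
        # The list part is a Python literal such as ['210', '859'].
        writer.writerow([key] + list(ast.literal_eval(values)))
    return buffer.getvalue()


# Made-up two-sample example in the same layout as the README output.
example = [
    "TaxaID\t['sample1', 'sample2']\n",
    "135621\t['210', '859']\n",
]
print(rows_to_csv(example))
```

Parsing the list with `ast.literal_eval` rather than deleting brackets and quotes keeps any taxon or file name that itself contains a bracket or an apostrophe intact, and the `csv` module quotes names that contain commas.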