├── CHANGES.md
├── CITATAION
├── INSTALLATION.md
├── LICENSE
├── README.md
├── Rscripts
    ├── Ordination_plots.R
    ├── README.md
    └── Rarefaction-curves.R
├── VERSION
├── kraken-multiple-taxa.py
└── kraken-multiple.py


/CHANGES.md:
--------------------------------------------------------------------------------
1 | Contains a list of the differences between versions
2 | 


--------------------------------------------------------------------------------
/CITATAION:
--------------------------------------------------------------------------------
1 | If you use atavide, please cite
2 | 
3 | DOI:10.5281/zenodo.6000803
4 | 


--------------------------------------------------------------------------------
/INSTALLATION.md:
--------------------------------------------------------------------------------
 1 | **Pre-requisites**
 2 | 
 3 | Run Kraken2 command and generate a report
 4 | 
 5 |     kraken2 --db $KRAKEN_DB --paired read_1.fastq read_2.fastq --threads 1 --use-names --report kraken_report --report-zero-counts --output kraken.out
 6 | 
 7 | Make sure to add the parameter **--report-zero-counts** to the kraken2 command. Without it, taxa with no reads are omitted from the report, and the output of this script becomes hard to parse.
 8 | 
 9 | **Dependencies**
10 | 
11 | - Python 3
12 | 
13 | 
14 | - Python packages
15 |     - numpy
16 |     - scipy
17 |     - argparse
18 |     - pandas
19 |     - collections
20 | 
21 | **Kraken-output-manipulation**
22 | 
23 |     git clone https://github.com/npbhavya/Kraken2-output-manipulation.git
24 |     
25 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | Copyright © 2022 Bhavya Papudeshi
 3 | 
 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software 
 5 | and associated documentation files (the “Software”), to deal in the Software without restriction, 
 6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 
 7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do 
 8 | so, subject to the following conditions:
 9 | 
10 | The above copyright notice and this permission notice shall be included in all copies or substantial 
11 | portions of the Software.
12 | 
13 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT 
14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 
16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 
17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | [![DOI](https://zenodo.org/badge/216017473.svg)](https://zenodo.org/badge/latestdoi/216017473)
  2 | 
  3 | # Kraken2-output-manipulation
  4 | 
  5 | Kraken2 generates a report for each dataset. This script takes these individual reports and combines them into one file in the format 
  6 | 
  7 | |Taxa ID| sample1|     sample2|     sample3|     sample4| .....| 
  8 | |--------|-------|------------|------------|------------|------|
  9 | |1234   |    1909|          10|        100 |         0  |     ..... |
 10 | 
 11 | or output 
 12 | 
 13 | |Taxa     |sample1     |sample2     |sample3    | sample4     |..... | 
 14 | |--------|-------|------------|------------|------------|------|
 15 | |Pseudomonas | 1909    |      10    |    100    |     0       | .....| 
 16 | 
 17 | 
 18 | The Taxa ID is column 5 of the kraken2 report, "NCBI taxonomic ID number". 
 19 | The values under sample1, sample2, etc. are taken from a report column chosen by the user: 
 20 | - col1: Percentage of fragments covered by the clade rooted at this taxon
 21 | - col2: Number of fragments covered by the clade rooted at this taxon
 22 | - col3: Number of fragments assigned directly to this taxon
 23 | - col4: A rank code
 24 | - col5: NCBI taxonomic ID number (reporting this for every sample is rarely useful, but it is possible)
 25 | - col6: Indented scientific name
 26 | 
 27 | For Kraken2 installation instructions and the manual, see the Kraken2 documentation: 
 28 | - https://ccb.jhu.edu/software/kraken2/index.shtml?t=manual
 29 | - https://ncgas.org/Blog_Posts/Metagenomic%20taxa%20analysis.php
 30 | 
 31 | ## Pre-requisites 
 32 | Run Kraken2 command and generate a report \
 33 | *kraken2 --db $KRAKEN_DB --paired read_1.fastq read_2.fastq --threads 1 --use-names --report kraken_report --report-zero-counts --output kraken.out*
 34 | 
 35 | Make sure to add the parameter *--report-zero-counts* to the kraken2 command. Without it, taxa with no reads are omitted from the report, and the output of this script becomes hard to parse. 
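
The scripts append one value per report to a per-taxon list, so a taxon that is missing from one report ends up with a shorter list whose values no longer line up with the sample columns. The following is a minimal sketch of that effect, not code from this repository: the helper `collect`, the taxon IDs, and the report lines are all made up for illustration.

```python
from collections import defaultdict


def collect(reports, rank, col):
    """Append column `col` (1-based) of every line with rank code `rank`, keyed by taxon ID."""
    merged = defaultdict(list)
    for lines in reports:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if fields[3] == rank:
                merged[fields[4]].append(fields[col - 1])
    return dict(merged)


# Hypothetical report lines for two samples (illustrative values only).
sample1 = ["1.00\t10\t10\tF\t111\tTaxonA\n", "2.00\t20\t20\tF\t222\tTaxonB\n"]
# Sample 2 as written with --report-zero-counts: TaxonA is present with a count of 0.
sample2_zero = ["0.00\t0\t0\tF\t111\tTaxonA\n", "3.00\t30\t30\tF\t222\tTaxonB\n"]
# Sample 2 as written without the flag: the TaxonA line is absent.
sample2_nozero = ["3.00\t30\t30\tF\t222\tTaxonB\n"]

with_zero = collect([sample1, sample2_zero], "F", 2)
without_zero = collect([sample1, sample2_nozero], "F", 2)

print(with_zero)     # every taxon has one value per sample
print(without_zero)  # taxon 111 has only one value, so its row is misaligned
```

With the flag, every row has the same number of values as there are samples. Without it, a short row gives no indication of which sample the missing value belongs to.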
 36 | 
 37 | ## Dependencies 
 38 | **Python 3** \
 39 | Python packages 
 40 | - numpy 
 41 | - scipy
 42 | - argparse
 43 | - pandas 
 44 | - collections 
 45 | 
 46 | ## Usage 
 47 | *python kraken-multiple.py --help \
 48 | usage: kraken-multiple.py [-h] [-d DIRECTORY] [-r {U,R,D,K,P,C,O,F,G,S}] [-c {1,2,3,4,5,6}] [-o OUTPUT]* 
 49 | 
 50 | Take multiple kraken output files and consolidate them to one output 
 51 | 
 52 | optional arguments: \
 53 |   -h, --help                show this help message and exit \
 54 |   -d DIRECTORY              Enter a directory with kraken summary reports \
 55 |   -r {U,R,D,K,P,C,O,F,G,S}  Enter a rank code \
 56 |   -c {1,2,3,4,5,6}          Enter the column number in the report you would like to include in the output \
 57 |   -o OUTPUT                 Enter the output file name 
 58 | 
 59 | #### For getting taxa information instead of taxa ID 
 60 | 
 61 | *python kraken-multiple-taxa.py --help \
 62 | usage: kraken-multiple-taxa.py [-h] [-d DIRECTORY] [-r {U,R,D,K,P,C,O,F,G,S}] [-c {1,2,3,4,5,6}] [-o OUTPUT]* 
 63 | 
 64 | Take multiple kraken output files and consolidate them to one output
 65 | 
 66 | optional arguments: \
 67 |   -h, --help                show this help message and exit \
 68 |   -d DIRECTORY              Enter a directory with kraken summary reports \
 69 |   -r {U,R,D,K,P,C,O,F,G,S}  Enter a rank code \
 70 |   -c {1,2,3,4,5,6}          Enter the column number in the report you would like to include in the output \
 71 |   -o OUTPUT                 Enter the output file name 
 72 | 
 73 | #### Detailed usage description 
 74 |  The inputs for this script are 
 75 |  - a directory containing only kraken reports. Use the -d flag to point to this directory. 
 76 |  Each report should have the format 
 77 |  39.87  290930  290930  U       0       unclassified
 78 |  60.13  438756  117     R       1       root
 79 |  59.67  435435  723     R1      131567    cellular organisms
 80 |  58.38  425979  4039    D       2           Bacteria
 81 |  33.55  244810  2293    P       1224          Proteobacteria
 82 |  16.06  117202  1091    C       28211           Alphaproteobacteria
 83 | 
 84 | - rank: the kraken report includes results for each rank, "(U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies." The -r flag sets the rank to report. 
 85 | 
 86 | - column: the -c flag sets which column of the report is written to the final output. 
 87 |   - col1: Percentage of fragments covered by the clade rooted at this taxon
 88 |   - col2: Number of fragments covered by the clade rooted at this taxon
 89 |   - col3: Number of fragments assigned directly to this taxon
 90 |   - col4: A rank code
 91 |   - col5: NCBI taxonomic ID number (reporting this for every sample is rarely useful, but it is possible)
 92 |   - col6: Indented scientific name
 93 |   
 94 | - output: the -o flag sets the name of the file to which the final report is written. 
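
To make the six columns concrete, here is a small sketch of how one report line can be split into named fields and filtered by rank code. It assumes tab-separated report lines, as the scripts do; the `ReportLine` type and `parse_report_line` helper are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class ReportLine(NamedTuple):
    percent: float         # col1: percentage of fragments covered by the clade
    clade_fragments: int   # col2: fragments covered by the clade
    direct_fragments: int  # col3: fragments assigned directly to this taxon
    rank: str              # col4: rank code
    taxid: str             # col5: NCBI taxonomic ID number
    name: str              # col6: scientific name, indentation removed


def parse_report_line(line):
    """Split one tab-separated report line into its six columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated columns, got %d" % len(fields))
    return ReportLine(float(fields[0]), int(fields[1]), int(fields[2]),
                      fields[3], fields[4], fields[5].strip())


# Two lines taken from the example report above, written with explicit tabs.
lines = [
    " 58.38\t425979\t4039\tD\t2\t    Bacteria\n",
    " 33.55\t244810\t2293\tP\t1224\t      Proteobacteria\n",
]

# Equivalent of "-r P -c 2": keep phylum-level lines and read column 2.
phyla = [rec for rec in map(parse_report_line, lines) if rec.rank == "P"]
print(phyla[0].name, phyla[0].clade_fragments)
```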
 95 | 
 96 | ## Example command 
 97 | python kraken-multiple.py -d kraken_report/ -r F -c 2 -o kraken-report-final
 98 | 
 99 | ## Output 
100 | Taking a look at the kraken-report-final  \
101 | **TaxaID  ['sample1','sample2','sample3','sample4','sample5','sample6']** \
102 | 135621  ['210', '859', '2843', '595', '281', '1064'] \
103 | 468     ['80', '359', '1054', '361', '164', '299'] \
104 | 72275   ['66', '1838', '4664', '462', '75', '2074'] \
105 | 267888  ['45', '1407', '59440', '930', '120', '79'] 
106 | 
107 | If you ran the kraken-multiple-taxa.py, then the output will be  \
108 | **Taxa           ['sample1','sample2','sample3','sample4','sample5','sample6']** \
109 | Actinomycetaceae ['210', '859', '2843', '595', '281', '1064'] \
110 | Budviciaceae     ['80', '359', '1054', '361', '164', '299'] \
111 | Mycoplasmataceae ['66', '1838', '4664', '462', '75', '2074'] \
112 | Vibrionaceae     ['45', '1407', '59440', '930', '120', '79'] 
113 | 
114 | ## Downstream 
115 | Run the following bash command to generate a CSV file that can be easily imported into R 
116 | 
117 |     sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" kraken_report_all >kraken_report_all_table.csv
118 | 
119 | |TaxaID  |sample1|sample2|sample3|sample4|sample5|sample6|
120 | |---------|---------|----------|---------|---------|---------|---------|
121 | |135621  |210|859| 2843| 595| 281| 1064|
122 | |468     |80|359|1054|361| 164 |299| 
123 | |72275   |66|1838|4664|462|75| 2074|
124 | |267888  |45|1407|59440|930|120|79|
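
If sed is not available, the same conversion can be sketched in Python. This assumes each line of the combined file is a key, a tab, and a Python-style list, as in the Output section above; `table_to_csv` is a hypothetical helper, not part of this repository.

```python
import ast
import csv
import io


def table_to_csv(lines, out):
    """Write 'key<TAB>[list]' lines to `out` as comma-separated rows."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        key, values = line.rstrip("\n").split("\t", 1)
        # ast.literal_eval safely turns the printed list back into a Python list.
        row = [str(value).strip() for value in ast.literal_eval(values)]
        writer.writerow([key] + row)


# Shortened example in the same shape as the combined report.
example = [
    "TaxaID\t['sample1', 'sample2']\n",
    "135621\t['210', '859']\n",
]

buffer = io.StringIO()
table_to_csv(example, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, pass an open file handle for both arguments instead of the in-memory list and `StringIO` buffer.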
125 | 
126 | ## Visualization 
127 | In the directory Rscripts, there are two scripts 
128 | - Rarefaction-curves.R : that takes the output from this program to plot the rarefaction curve for the samples 
129 | - Ordination_plots.R : that takes the output from this program to plot ordination plots for the samples
130 | 


--------------------------------------------------------------------------------
/Rscripts/Ordination_plots.R:
--------------------------------------------------------------------------------
 1 | #Importing the data 
 2 | data=read.csv("kraken-report-genus-col2-taxa.csv", header=TRUE, row.names = 1)
 3 | data_t=t(data)
 4 | data_matrix=as.matrix(data_t)
 5 | library(vegan) #needed for specnumber() and rrarefy() below
 6 | #data normalization using rarefaction method
 7 | data_t=as.data.frame(t(data))
 8 | S <- specnumber(data_t)
 9 | raremax <-min(rowSums(data_t))
10 | Srare <- rrarefy(data_t, raremax)
11 | 
12 | ##PCA plots 
13 | library(devtools)
14 | #install_github("vqv/ggbiplot") #to install the ggbiplot
15 | library(ggbiplot)
16 | 
17 | #step to remove the variables with zero variance (prcomp cannot scale them)
18 | Srare_clean=Srare[,apply(Srare, 2,var, na.rm=TRUE)!=0]
19 | 
20 | data.pca=prcomp(Srare_clean, center=TRUE, scale. = TRUE)
21 | summary(data.pca)
22 | #View(data.pca)
23 | 
24 | ggbiplot(data.pca)
25 | #location=c("add in factors like environment etc here")
26 | ggbiplot(data.pca, varname.size = 0, ellipse = TRUE, groups = location)
27 | 
28 | 
29 | ## NMDS plots 
30 | #install.packages("vegan")
31 | library(vegan)
32 | 
33 | d=vegdist(Srare, method="bray")
34 | d_matrix=as.matrix(d, labels=T)
35 | mds=metaMDS(d, distance = "bray", k=3) #k=number of dimensions
36 | mds_data=as.data.frame(mds$points)
37 | mds_data$SampleID =rownames(mds_data)
38 | 
39 | #one way to visualize the MDS plots 
40 | ggplot(mds_data, aes(x=MDS1, y=MDS2, color= location)) +geom_point()
41 | 
42 | #another way to visualize the MDS plots 
43 | stressplot(mds)
44 | plot(mds)  
45 | ordihull(mds, location, display = "sites", draw=c("polygon"), col=NULL, border=c("blue", "red", "green", "orange", "yellow"), 
46 |          lty = c(1, 2, 1, 2), lwd = 2.5, label = TRUE)
47 | 
48 | ## Beta diversity index and PcoA plots 
49 | ## Calculate multivariate dispersions
50 | mod_loc <- betadisper(d, location)
51 | mod_water <- betadisper(d, water)
52 | 
53 | #Statistical test 
54 | anova(mod_loc)
55 | anova(mod_water)
56 | 
57 | ## Permutation test for F
58 | permutest(mod_loc, pairwise = TRUE, permutations = 99)
59 | permutest(mod_water, pairwise = TRUE, permutations = 99)
60 | 
61 | #Post hoc Tukey HSD test 
62 | (mod_loc.HSD <- TukeyHSD(mod_loc))
63 | plot(mod_loc.HSD)
64 | 
65 | (mod_water.HSD <- TukeyHSD(mod_water))
66 | plot(mod_water.HSD)
67 | 
68 | #Plot the groups and distances to centroids with data ellipses instead of hulls - PCoA plots 
69 | plot(mod_loc, ellipse = TRUE, hull = FALSE) # 1 sd data ellipse
70 | plot(mod_loc, ellipse = TRUE, hull = FALSE, conf = 0.90) # 90% data ellipse
71 | 
72 | plot(mod_water, ellipse = TRUE, hull = FALSE) # 1 sd data ellipse
73 | plot(mod_water, ellipse = TRUE, hull = FALSE, conf = 0.90) # 90% data ellipse
74 | 


--------------------------------------------------------------------------------
/Rscripts/README.md:
--------------------------------------------------------------------------------
 1 | ## R scripts references 
 2 | 
 3 | - Most of the code was written with the help of NCGAS R workshop- https://ncgas.org/R%20for%20Biologists%20Workshop.php. \
 4 | The chapter on Ordination plots
 5 | 
 6 | - R documentation 
 7 | 
 8 | - NMDS tutorial in R https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/
 9 | 
10 | - Diversity statistics - https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/07--diversity_stats.html
11 | 


--------------------------------------------------------------------------------
/Rscripts/Rarefaction-curves.R:
--------------------------------------------------------------------------------
 1 | library(vegan)
 2 | 
 3 | #importing the file and parsing the file correctly
 4 | # Replace kraken_report_all_R.csv with the actual filename.
 5 | Data=read.table("kraken_report_all_R.csv", sep=",", row.names = 1, header=TRUE)
 6 | Data_t=as.data.frame(t(Data))
 7 | 
 8 | #count the number of species
 9 | S <- specnumber(Data_t)
10 | raremax <-min(rowSums(Data_t))
11 | 
12 | #Rarefaction of the samples
13 | Srare <- rarefy(Data_t, raremax)
14 | 
15 | #plotting the rarefaction curves
16 | plot(S, Srare, xlab = "Observed No. of Species", ylab = "Rarefied No. of Species")
17 | abline(0, 1)
18 | pdf("Rarefaction_curve.pdf")
19 | rarecurve(Data_t, step = 20, sample = raremax, col = "blue", cex = 0.4)
20 | dev.off()
21 | 
22 | #PCA plots (define groups, a vector with one label per sample, before running this section)
23 | Data_add=cbind(Data_t,groups)
24 | pca=rda(Data_t)
25 | points(pca, display=c("sites"), pch=20, col=factor(Data_add$groups))
26 | 
27 | summary(pca)
28 | biplot(pca, display=c("sites","species"), type=c("points"), xlab="PC1 (2.654508e+09%)", ylab="PC2 (1.748614e+09%)")
29 | points(pca, display=c("sites"),pch=20, col=factor(Data_add$groups))
30 | 
31 | groupnames=levels(Data_add$groups)
32 | legend("topright",
33 |        col = factor(Data_add$groups),
34 |        lty = 1,
35 |        legend = groupnames)
36 | 


--------------------------------------------------------------------------------
/VERSION:
--------------------------------------------------------------------------------
1 | v1.0
2 | 


--------------------------------------------------------------------------------
/kraken-multiple-taxa.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys 
 4 | import os
 5 | import argparse
 6 | import numpy as np
 7 | import pandas as pd
 8 | from collections import defaultdict
 9 | 
10 | '''
11 | writing a script to take kraken2 summary files and make one file in the format 
12 | taxa	sample1	sample2	sample3	sample4 ....
13 | '''
14 | 
15 | #concatenating the kraken reports from multiple file using a dictionary
16 | def kraken_cat_report(dir, rank, col, out):
17 | 	filelist=input_dir(dir)
18 | 	h=defaultdict(list)
19 | 	for file in filelist:
20 | 		with open(file, 'r') as openfi: #close each report after reading
21 | 			for line in openfi:
22 | 				fields=line.split('\t')
23 | 				if (fields[3] == rank):
24 | 					taxa=fields[5].strip()
25 | 					h[taxa].append(fields[col-1].strip())
26 | 
27 | 	print ("Writing output to a file")
28 | 	with open (out, 'w') as fout:
29 | 		fout.write('Taxa\t%s\n' %filelist)
30 | 		for key,value in h.items():
31 | 			#print (key, len(value))
32 | 			fout.write('%s\t%s\n' %(key, value))
33 | 			
34 | #function that open the directory and confirms there are files, and checks to see that the files are summary files 
35 | def input_dir(dir):
36 | 	files=sorted(os.listdir(dir)) #sorted so the sample column order is reproducible
37 | 	assert (len(files)!=0), "The directory is empty"
38 | 	path=[]
39 | 	for f in files:
40 | 		fipath=os.path.join(dir,f)
41 | 		path.append(fipath)
42 | 		with open(fipath, 'r') as report: #only the first line is inspected
43 | 			for line in report:
44 | 				fields=line.split('\t')
45 | 				cols=len(fields)
46 | 				assert (cols==6), "The %s file is not a kraken2 summary report" %fipath
47 | 				break
48 | 	print ("Checking input file done")	
49 | 	return (path)
50 | 
51 | 
52 | if __name__=='__main__' :
53 | 	parser=argparse.ArgumentParser(description="Take multiple kraken output files and consolidate them to one output")
54 | 	parser.add_argument ('-d', dest='directory', help='Enter a directory with kraken summary reports')
55 | 	parser.add_argument ('-r', dest='rank', choices=['U','R','D','K','P','C','O','F','G','S'], help='Enter a rank code')
56 | 	parser.add_argument ('-c', dest='column', type=int, choices=range(1,7), help="Enter the column number in the report you would like to include in the output")
57 | 	parser.add_argument ('-o', dest='output', help='Enter the output file name')
58 | 	results=parser.parse_args()
59 | 	#input_dir(results.directory)
60 | 	kraken_cat_report(results.directory, results.rank, results.column, results.output)
61 | 
62 | 


--------------------------------------------------------------------------------
/kraken-multiple.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys 
 4 | import os
 5 | import argparse
 6 | import numpy as np
 7 | import pandas as pd
 8 | from collections import defaultdict
 9 | 
10 | '''
11 | writing a script to take kraken2 summary files and make one file in the format 
12 | taxaID	sample1	sample2	sample3	sample4 ....
13 | '''
14 | 
15 | #concatenating the kraken reports from multiple file using a dictionary
16 | def kraken_cat_report(dir, rank, col, out):
17 | 	filelist=input_dir(dir)
18 | 	h=defaultdict(list)
19 | 	for file in filelist:
20 | 		with open(file, 'r') as openfi: #close each report after reading
21 | 			for line in openfi:
22 | 				fields=line.split('\t')
23 | 				if (fields[3] == rank):
24 | 					h[fields[4]].append(fields[col-1].strip())
25 | 				
26 | 	print ("Writing output to a file")
27 | 	with open (out, 'w') as fout:
28 | 		fout.write('TaxaID\t%s\n' %filelist)
29 | 		for key,value in h.items():
30 | 			#print (key, len(value))
31 | 			fout.write('%s\t%s\n' %(key, value))
32 | 		
33 | 							
34 | #function that open the directory and confirms there are files, and checks to see that the files are summary files 
35 | def input_dir(dir):
36 | 	files=sorted(os.listdir(dir)) #sorted so the sample column order is reproducible
37 | 	assert (len(files)!=0), "The directory is empty"
38 | 	path=[]
39 | 	for f in files:
40 | 		fipath=os.path.join(dir,f)
41 | 		path.append(fipath)
42 | 		with open(fipath, 'r') as report: #only the first line is inspected
43 | 			for line in report:
44 | 				fields=line.split('\t')
45 | 				cols=len(fields)
46 | 				assert (cols==6), "The %s file is not a kraken2 summary report" %fipath
47 | 				break
48 | 	print ("Checking input file done")	
49 | 	return (path)
50 | 
51 | 
52 | if __name__=='__main__' :
53 | 	parser=argparse.ArgumentParser(description="Take multiple kraken output files and consolidate them to one output")
54 | 	parser.add_argument ('-d', dest='directory', help='Enter a directory with kraken summary reports')
55 | 	parser.add_argument ('-r', dest='rank', choices=['U','R','D','K','P','C','O','F','G','S'], help='Enter a rank code')
56 | 	parser.add_argument ('-c', dest='column', type=int, choices=range(1,7), help="Enter the column number in the report you would like to include in the output")
57 | 	parser.add_argument ('-o', dest='output', help='Enter the output file name')
58 | 	results=parser.parse_args()
59 | 	#input_dir(results.directory)
60 | 	kraken_cat_report(results.directory, results.rank, results.column, results.output)
61 | 
62 | 


--------------------------------------------------------------------------------