├── DESCRIPTION ├── NAMESPACE ├── R ├── bclust.R ├── bdist.R ├── binomix.R ├── blasting.R ├── domainclust.R ├── entrez.R ├── extern.R ├── extractPanGenes.R ├── genomedistances.R ├── hmmer3.R ├── micropan.R ├── panmat.R ├── panpca.R ├── panprep.R ├── powerlaw.R ├── rarefaction.R ├── xmpl.R └── xz.R ├── Readme.Rmd ├── Readme.md ├── Readme_files ├── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-15-1.png │ ├── unnamed-chunk-16-1.png │ ├── unnamed-chunk-20-1.png │ ├── unnamed-chunk-21-1.png │ ├── unnamed-chunk-26-1.png │ ├── unnamed-chunk-27-1.png │ ├── unnamed-chunk-28-1.png │ ├── unnamed-chunk-29-1.png │ ├── unnamed-chunk-30-1.png │ ├── unnamed-chunk-31-1.png │ ├── unnamed-chunk-32-1.png │ ├── unnamed-chunk-8-1.png │ └── unnamed-chunk-9-1.png └── figure-markdown_github │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-14-1.png │ ├── unnamed-chunk-16-1.png │ ├── unnamed-chunk-18-1.png │ ├── unnamed-chunk-20-1.png │ ├── unnamed-chunk-21-1.png │ ├── unnamed-chunk-25-1.png │ ├── unnamed-chunk-26-1.png │ ├── unnamed-chunk-27-1.png │ ├── unnamed-chunk-28-1.png │ ├── unnamed-chunk-29-1.png │ ├── unnamed-chunk-30-1.png │ ├── unnamed-chunk-31-1.png │ ├── unnamed-chunk-32-1.png │ ├── unnamed-chunk-7-1.png │ ├── unnamed-chunk-8-1.png │ └── unnamed-chunk-9-1.png ├── data ├── xmpl.bclst.rda ├── xmpl.bdist.rda └── xmpl.panmat.rda ├── inst └── extdata │ ├── GID1_vs_GID1_.txt.xz │ ├── GID1_vs_microfam.hmm.txt.xz │ ├── GID2_vs_GID1_.txt.xz │ ├── GID2_vs_GID2_.txt.xz │ ├── GID2_vs_microfam.hmm.txt.xz │ ├── GID3_vs_GID1_.txt.xz │ ├── GID3_vs_GID2_.txt.xz │ ├── GID3_vs_GID3_.txt.xz │ ├── GID3_vs_microfam.hmm.txt.xz │ ├── microfam.hmm.h3f.xz │ ├── microfam.hmm.h3i.xz │ ├── microfam.hmm.h3m.xz │ ├── microfam.hmm.h3p.xz │ ├── xmpl.faa.xz │ ├── xmpl_GID1.faa.xz │ ├── xmpl_GID2.faa.xz │ └── xmpl_GID3.faa.xz ├── man ├── bClust.Rd ├── bDist.Rd ├── binomixEstimate.Rd ├── blastpAllAll.Rd ├── chao.Rd ├── dClust.Rd ├── distJaccard.Rd ├── distManhattan.Rd ├── entrezDownload.Rd ├── extractPanGenes.Rd ├── fluidity.Rd ├── geneFamilies2fasta.Rd ├── geneWeights.Rd ├── getAccessions.Rd ├── heaps.Rd ├── hmmerCleanOverlap.Rd ├── hmmerScan.Rd ├── isOrtholog.Rd ├── micropan.Rd ├── panMatrix.Rd ├── panPca.Rd ├── panPrep.Rd ├── rarefaction.Rd ├── readBlastSelf.Rd ├── readHmmer.Rd ├── xmpl.Rd └── xz.Rd └── vignettes ├── vignette-concordance.tex ├── vignette.Rnw ├── vignette.aux ├── vignette.log ├── vignette.pdf └── vignette.tex /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: micropan 2 | Type: Package 3 | Title: Microbial Pan-Genome Analysis 4 | Version: 2.2.1 5 | Date: 2022-04-12 6 | Author: Lars Snipen and Kristian Hovde Liland 7 | Maintainer: Lars Snipen 8 | Description: A collection of functions for computations and visualizations of microbial pan-genomes. 9 | License: GPL-2 10 | Depends: 11 | R (>= 4.0.0), 12 | microseq, 13 | dplyr, 14 | stringr, 15 | igraph 16 | Imports: 17 | tibble, 18 | rlang 19 | LazyData: FALSE 20 | ZipData: TRUE 21 | RoxygenNote: 7.1.1 22 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(bClust) 4 | export(bDist) 5 | export(binomixEstimate) 6 | export(blastpAllAll) 7 | export(chao) 8 | export(dClust) 9 | export(distJaccard) 10 | export(distManhattan) 11 | export(entrezDownload) 12 | export(extractPanGenes) 13 | export(fluidity) 14 | export(geneFamilies2fasta) 15 | export(geneWeights) 16 | export(getAccessions) 17 | export(heaps) 18 | export(hmmerCleanOverlap) 19 | export(hmmerScan) 20 | export(isOrtholog) 21 | export(panMatrix) 22 | export(panPca) 23 | export(panPrep) 24 | export(rarefaction) 25 | export(readBlastPair) 26 | export(readBlastSelf) 27 | export(readHmmer) 28 | export(xzcompress) 29 | export(xzuncompress) 30 | importFrom(dplyr,"%>%") 31 | importFrom(dplyr,arrange) 32 | importFrom(dplyr,bind_cols) 33 | importFrom(dplyr,bind_rows) 34 | importFrom(dplyr,desc) 35 | importFrom(dplyr,distinct) 36 | importFrom(dplyr,filter) 37 | importFrom(dplyr,group_by) 38 | importFrom(dplyr,mutate) 39 | importFrom(dplyr,n) 40 | importFrom(dplyr,rename) 41 | importFrom(dplyr,right_join) 42 | importFrom(dplyr,select) 43 | importFrom(dplyr,slice) 44 | importFrom(dplyr,summarize) 45 | importFrom(igraph,components) 46 | importFrom(igraph,degree) 47 | importFrom(igraph,graph_from_edgelist) 48 | importFrom(microseq,readFasta) 49 | importFrom(microseq,writeFasta) 50 | importFrom(rlang,.data) 51 | importFrom(stats,as.dendrogram) 52 | importFrom(stats,as.dist) 53 | importFrom(stats,constrOptim) 54 | importFrom(stats,cutree) 55 | importFrom(stats,dendrapply) 56 | importFrom(stats,dist) 57 | importFrom(stats,hclust) 58 | importFrom(stats,is.leaf) 59 | importFrom(stats,optim) 60 | importFrom(stats,prcomp) 61 | importFrom(stats,sd) 62 | importFrom(stringr,str_c) 63 | importFrom(stringr,str_detect) 64 | importFrom(stringr,str_extract) 65 | importFrom(stringr,str_extract_all) 66 | importFrom(stringr,str_length) 67 | importFrom(stringr,str_remove) 68 | importFrom(stringr,str_remove_all) 69 | importFrom(stringr,str_replace) 70 | importFrom(stringr,str_replace_all) 71 | importFrom(stringr,str_split) 72 | importFrom(stringr,str_sub) 73 | importFrom(tibble,as_tibble) 74 | importFrom(tibble,tibble) 75 | importFrom(utils,read.table) 76 | -------------------------------------------------------------------------------- /R/bclust.R: -------------------------------------------------------------------------------- 1 | #' @name bClust 2 | #' @title Clustering sequences based on pairwise distances 3 | #' 4 | #' @description Sequences are clustered by hierarchical clustering based on a set of pariwise distances. 5 | #' The distances must take values between 0.0 and 1.0, and all pairs \emph{not} listed are assumed to 6 | #' have distance 1.0. 7 | #' 8 | #' @param dist.tbl A \code{tibble} with pairwise distances. 9 | #' @param linkage A text indicating what type of clustering to perform, either \samp{complete} (default), 10 | #' \samp{average} or \samp{single}. 11 | #' @param threshold Specifies the maximum size of a cluster. Must be a distance, i.e. a number between 12 | #' 0.0 and 1.0. 13 | #' @param verbose Logical, turns on/off text output during computations. 14 | #' 15 | #' @details Computing clusters (gene families) is an essential step in many comparative studies. 16 | #' \code{\link{bClust}} will assign sequences into gene families by a hierarchical clustering approach. 17 | #' Since the number of sequences may be huge, a full all-against-all distance matrix will be impossible 18 | #' to handle in memory. However, most sequence pairs will have an \sQuote{infinite} distance between them, 19 | #' and only the pairs with a finite (smallish) distance need to be considered. 20 | #' 21 | #' This function takes as input the distances in \code{dist.tbl} where only the relevant distances are 22 | #' listed. The columns \samp{Query} and \samp{Hit} contain tags identifying pairs of sequences. The column 23 | #' \samp{Distance} contains the distances, always a number from 0.0 to 1.0. Typically, this is the output 24 | #' from \code{\link{bDist}}. All pairs of sequences \emph{not} listed are assumed to have distance 1.0, 25 | #' which is considered the \sQuote{infinite} distance. 26 | #' All sequences must be listed at least once in ceither column \samp{Query} or \samp{Hit} of the \code{dist.tbl}. 27 | #' This should pose no problem, since all sequences must have distance 0.0 to themselves, and should be listed 28 | #' with this distance once (\samp{Query} and \samp{Hit} containing the same tag). 29 | #' 30 | #' The \samp{linkage} defines the type of clusters produced. The \samp{threshold} indicates the size of 31 | #' the clusters. A \samp{single} linkage clustering means all members of a cluster have at least one other 32 | #' member of the same cluster within distance \samp{threshold} from itself. An \samp{average} linkage means 33 | #' all members of a cluster are within the distance \samp{threshold} from the center of the cluster. A 34 | #' \samp{complete} linkage means all members of a cluster are no more than the distance \samp{threshold} 35 | #' away from any other member of the same cluster. 36 | #' 37 | #' Typically, \samp{single} linkage produces big clusters where members may differ a lot, since they are 38 | #' only required to be close to something, which is close to something,...,which is close to some other 39 | #' member. On the other extreme, \samp{complete} linkage will produce small and tight clusters, since all 40 | #' must be similar to all. The \samp{average} linkage is between, but closer to \samp{complete} linkage. If 41 | #' you want the \samp{threshold} to specify directly the maximum distance tolerated between two members of 42 | #' the same gene family, you must use \samp{complete} linkage. The \samp{single} linkage is the fastest 43 | #' alternative to compute. Using the default setting of \samp{single} linkage and maximum \samp{threshold} 44 | #' (1.0) will produce the largest and fewest clusters possible. 45 | #' 46 | #' @return The function returns a vector of integers, indicating the cluster membership of every unique 47 | #' sequence from the \samp{Query} or \samp{Hit} columns of the input \samp{dist.tbl}. The name 48 | #' of each element indicates the sequence. The numerical values have no meaning as such, they are simply 49 | #' categorical indicators of cluster membership. 50 | #' 51 | #' @author Lars Snipen and Kristian Hovde Liland. 52 | #' 53 | #' @seealso \code{\link{bDist}}, \code{\link{hclust}}, \code{\link{dClust}}, \code{\link{isOrtholog}}. 54 | #' 55 | #' @examples 56 | #' # Loading example BLAST distances 57 | #' data(xmpl.bdist) 58 | #' 59 | #' # Clustering with default settings 60 | #' clst <- bClust(xmpl.bdist) 61 | #' # Other settings, and verbose 62 | #' clst <- bClust(xmpl.bdist, linkage = "average", threshold = 0.5, verbose = TRUE) 63 | #' 64 | #' @importFrom igraph graph_from_edgelist components degree 65 | #' @importFrom stats hclust as.dist cutree 66 | #' @importFrom dplyr filter %>% 67 | #' @importFrom rlang .data 68 | #' 69 | #' @export bClust 70 | #' 71 | bClust <- function(dist.tbl, linkage = "complete", threshold = 0.75, verbose = TRUE){ 72 | if(verbose) cat("bClust:\n") 73 | linknum <- grep(linkage, c("single", "average", "complete")) 74 | dist.tbl %>% 75 | filter(.data$Distance < threshold) -> dist.tbl 76 | utag <- sort(unique(c(dist.tbl$Dbase, dist.tbl$Query))) # Important to sort here! 77 | 78 | if(verbose) cat("...constructing graph with", length(utag), "sequences (nodes) and", nrow(dist.tbl), "distances (edges)\n") 79 | M <- matrix(as.numeric(factor(c(dist.tbl$Dbase, dist.tbl$Query), levels = utag)), ncol = 2, byrow = F) 80 | g <- graph_from_edgelist(M, directed = F) 81 | cls <- components(g) 82 | if(verbose) cat("...found", cls$no, "single linkage clusters\n") 83 | tibble(cluster = cls$membership, 84 | tag = utag) -> cls.tbl 85 | 86 | if(linknum > 1){ 87 | ucls <- sort(unique(cls.tbl$cluster)) 88 | incomplete <- which(sapply(ucls, function(j){ 89 | v <- which(cls$membership == j) 90 | degg <- degree(g, v) 91 | return(min(degg) < (length(degg) + 1)) 92 | })) 93 | if(verbose) cat("...found", length(incomplete), "incomplete clusters\n") 94 | if(length(incomplete) > 0){ 95 | cls.tbl %>% 96 | filter(.data$cluster %in% incomplete) %>% 97 | mutate(cluster = .data$cluster * 1000) -> inc.tbl 98 | cls.tbl %>% 99 | filter(!(.data$cluster %in% incomplete)) -> cls.tbl 100 | ucls.c <- unique(inc.tbl$cluster) 101 | for(i in 1:length(ucls.c)){ # for each incomplete cluster 102 | inc.tbl %>% 103 | filter(.data$cluster == ucls.c[i]) -> tbl 104 | D <- matrix(1, nrow = nrow(tbl), ncol = nrow(tbl)) 105 | rownames(D) <- colnames(D) <- tbl$tag 106 | dist.tbl %>% 107 | filter(.data$Dbase %in% tbl$tag | .data$Query %in% tbl$tag) %>% 108 | mutate(Dbase = factor(.data$Dbase, levels = tbl$tag)) %>% 109 | mutate(Query = factor(.data$Query, levels = tbl$tag)) -> d.tbl 110 | M <- matrix(c(as.integer(d.tbl$Dbase), as.integer(d.tbl$Query)), ncol = 2, byrow = F) 111 | D[M] <- d.tbl$Distance 112 | D[M[,c(2,1)]] <- d.tbl$Distance 113 | if(linknum == 2){ 114 | clst <- hclust(as.dist(D), method = "average") 115 | } else { 116 | clst <- hclust(as.dist(D), method = "complete") 117 | } 118 | tbl %>% 119 | mutate(cluster = .data$cluster + cutree(clst, h = threshold)) %>% 120 | bind_rows(cls.tbl) -> cls.tbl 121 | if(verbose) cat(i, "/", length(ucls.c), "\r") 122 | } 123 | } 124 | } 125 | clustering <- as.integer(factor(cls.tbl$cluster)) # to get values 1,2,3,... 126 | names(clustering) <- cls.tbl$tag 127 | if(verbose) cat("\n...ended with", length(unique(clustering)), 128 | "clusters, largest cluster has", 129 | max(table(clustering)), "members\n") 130 | return(sort(clustering)) 131 | } 132 | 133 | 134 | #' @name isOrtholog 135 | #' @title Identifies orthologs in gene clusters 136 | #' 137 | #' @description Finds the ortholog sequences in every cluster based on pairwise distances. 138 | #' 139 | #' @param clustering A vector of integers indicating the cluster for every sequence. Sequences with 140 | #' the same number belong to the same cluster. The name of each element is the tag identifying the sequence. 141 | #' @param dist.tbl A \code{tibble} with pairwise distances. The columns \samp{Query} and 142 | #' \samp{Hit} contain tags identifying pairs of sequences. The column \samp{Distance} contains the 143 | #' distances, always a number from 0.0 to 1.0. 144 | #' 145 | #' @details The input \code{clustering} is typically produced by \code{\link{bClust}}. The input 146 | #' \code{dist.tbl} is typically produced by \code{\link{bDist}}. 147 | #' 148 | #' The concept of orthologs is difficult for prokaryotes, and this function finds orthologs in a 149 | #' simplistic way. For a given cluster, with members from many genomes, there is one ortholog from every 150 | #' genome. In cases where a genome has two or more members in the same cluster, only one of these is an 151 | #' ortholog, the rest are paralogs. 152 | #' 153 | #' Consider all sequences from the same genome belonging to the same cluster. The ortholog is defined as 154 | #' the one having the smallest sum of distances to all other members of the same cluster, i.e. the one 155 | #' closest to the \sQuote{center} of the cluster. 156 | #' 157 | #' Note that the status as ortholog or paralog depends greatly on how clusters are defined in the first 158 | #' place. If you allow large and diverse (and few) clusters, many sequences will be paralogs. If you define 159 | #' tight and homogenous (and many) clusters, almost all sequences will be orthologs. 160 | #' 161 | #' @return A vector of logicals with the same number of elements as the input \samp{clustering}, indicating 162 | #' if the corresponding sequence is an ortholog (\code{TRUE}) or not (\code{FALSE}). The name of each 163 | #' element is copied from \samp{clustering}. 164 | #' 165 | #' @author Lars Snipen and Kristian Hovde Liland. 166 | #' 167 | #' @seealso \code{\link{bDist}}, \code{\link{bClust}}. 168 | #' 169 | #' @examples 170 | #' \dontrun{ 171 | #' # Loading distance data and their clustering results 172 | #' data(list = c("xmpl.bdist","xmpl.bclst")) 173 | #' 174 | #' # Finding orthologs 175 | #' is.ortholog <- isOrtholog(xmpl.bclst, xmpl.bdist) 176 | #' # The orthologs are 177 | #' which(is.ortholog) 178 | #' } 179 | #' 180 | #' @export isOrtholog 181 | #' 182 | isOrtholog <- function(clustering, dist.tbl){ 183 | uclst <- unique(clustering) 184 | tags <- names(clustering) 185 | is.ortholog <- rep(F, length(clustering)) 186 | names(is.ortholog) <- tags 187 | for(i in 1:length(uclst)){ 188 | idx <- which(clustering == uclst[i]) 189 | idd <- which((dist.tbl$Query %in% tags[idx]) & (dist.tbl$Hit %in% tags[idx])) 190 | gidz <- str_extract(tags[idx], "GID[0-9]+") 191 | if(max(table(gidz)) > 1){ 192 | D <- matrix(1, nrow = length(idx), ncol = length(idx)) 193 | a <- as.numeric(factor(dist.tbl$Query[idd], levels = tags[idx])) 194 | b <- as.numeric(factor(dist.tbl$Hit[idd], levels = tags[idx])) 195 | D[matrix(c(a,b), ncol = 2, byrow = F)] <- dist.tbl$Distance[idd] 196 | D[matrix(c(b,a), ncol = 2, byrow = F)] <- dist.tbl$Distance[idd] 197 | ixx <- order(rowSums(D)) 198 | ixd <- which(!duplicated(gidz[ixx])) 199 | is.ortholog[idx[ixx[ixd]]] <- T 200 | } else { 201 | is.ortholog[idx] <- T 202 | } 203 | cat(i, "/", length(uclst), "\r") 204 | } 205 | return(is.ortholog) 206 | } 207 | 208 | 209 | 210 | -------------------------------------------------------------------------------- /R/bdist.R: -------------------------------------------------------------------------------- 1 | #' @name bDist 2 | #' @title Computes distances between sequences 3 | #' 4 | #' @description Computes distance between all sequences based on the BLAST bit-scores. 5 | #' 6 | #' @param blast.files A text vector of BLAST result filenames. 7 | #' @param blast.tbl A table with BLAST results. 8 | #' @param e.value A threshold E-value to immediately discard (very) poor BLAST alignments. 9 | #' @param verbose Logical, indicating if textual output should be given to monitor the progress. 10 | #' 11 | #' @details The essential input is either a vector of BLAST result filenames (\code{blast.files}) or a 12 | #' table of the BLAST results (\code{blast.tbl}). It is no point in providing both, then \code{blast.tbl} is ignored. 13 | #' 14 | #' For normal sized data sets (e.g. less than 100 genomes), you would provide the BLAST filenames as the argument 15 | #' \code{blast.files} to this function. 16 | #' Then results are read, and distances are computed. Only if you have huge data sets, you may find it more efficient to 17 | #' read the files using \code{\link{readBlastSelf}} and \code{\link{readBlastPair}} separately, and then provide as the 18 | #' argument \code{blast.tbl]} the table you get from binding these results. In all cases, the BLAST result files must 19 | #' have been produced by \code{\link{blastpAllAll}}. 20 | #' 21 | #' Setting a small \samp{e.value} threshold can speed up the computation and reduce the size of the 22 | #' output, but you may loose some alignments that could produce smallish distances for short sequences. 23 | #' 24 | #' The distance computed is based on alignment bitscores. Assume the alignment of query A against hit B 25 | #' has a bitscore of S(A,B). The distance is D(A,B)=1-2*S(A,B)/(S(A,A)+S(B,B)) where S(A,A) and S(B,B) are 26 | #' the self-alignment bitscores, i.e. the scores of aligning against itself. A distance of 27 | #' 0.0 means A and B are identical. The maximum possible distance is 1.0, meaning there is no BLAST between A and B. 28 | #' 29 | #' This distance should not be interpreted as lack of identity! A distance of 0.0 means 100\% identity, 30 | #' but a distance of 0.25 does \emph{not} mean 75\% identity. It has some resemblance to an evolutinary 31 | #' (raw) distance, but since it is based on protein alignments, the type of mutations plays a significant 32 | #' role, not only the number of mutations. 33 | #' 34 | #' @return The function returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore} 35 | #' and \samp{Distance}. Each row corresponds to a pair of sequences (Dbase and Query sequences) having at least 36 | #' one BLAST hit between 37 | #' them. All pairs \emph{not} listed in the output have distance 1.0 between them. 38 | #' 39 | #' @author Lars Snipen and Kristian Hovde Liland. 40 | #' 41 | #' @seealso \code{\link{blastpAllAll}}, \code{\link{readBlastSelf}}, \code{\link{readBlastPair}}, 42 | #' \code{\link{bClust}}, \code{\link{isOrtholog}}. 43 | #' 44 | #' @examples 45 | #' # Using BLAST result files in this package... 46 | #' prefix <- c("GID1_vs_GID1_", 47 | #' "GID2_vs_GID1_", 48 | #' "GID3_vs_GID1_", 49 | #' "GID2_vs_GID2_", 50 | #' "GID3_vs_GID2_", 51 | #' "GID3_vs_GID3_") 52 | #' bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz")) 53 | #' 54 | #' # We need to uncompress them first... 55 | #' blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz") 56 | #' ok <- file.copy(from = bf, to = blast.files) 57 | #' blast.files <- unlist(lapply(blast.files, xzuncompress)) 58 | #' 59 | #' # Computing pairwise distances 60 | #' blast.dist <- bDist(blast.files) 61 | #' 62 | #' # Read files separately, then use bDist 63 | #' self.tbl <- readBlastSelf(blast.files) 64 | #' pair.tbl <- readBlastPair(blast.files) 65 | #' blast.dist <- bDist(blast.tbl = bind_rows(self.tbl, pair.tbl)) 66 | #' 67 | #' # ...and cleaning... 68 | #' ok <- file.remove(blast.files) 69 | #' 70 | #' # See also example for blastpAl 71 | #' 72 | #' @importFrom tibble tibble 73 | #' @importFrom stringr str_extract_all 74 | #' @importFrom dplyr %>% rename filter arrange mutate distinct select bind_rows desc 75 | #' @importFrom utils read.table 76 | #' @importFrom rlang .data 77 | #' 78 | #' @export bDist 79 | #' 80 | bDist <- function(blast.files = NULL, blast.tbl = NULL, e.value = 1, verbose = TRUE){ 81 | if(!is.null(blast.files)){ 82 | readBlastSelf(blast.files, e.value = e.value, verbose = verbose) %>% 83 | filter(.data$Evalue <= e.value) %>% 84 | arrange(desc(.data$Bitscore)) %>% 85 | mutate(Pair = sortPaste(.data$Dbase, .data$Query)) %>% 86 | distinct(.data$Pair, .keep_all = TRUE) %>% 87 | select(.data$Dbase, .data$Query, .data$Bitscore) -> self.tbl 88 | readBlastPair(blast.files, e.value = e.value, verbose = verbose) %>% 89 | filter(.data$Evalue <= e.value) %>% 90 | select(.data$Dbase, .data$Query, .data$Bitscore) %>% 91 | arrange(desc(.data$Bitscore)) %>% 92 | distinct(.data$Dbase, .data$Query, .keep_all = TRUE) %>% 93 | bind_rows(self.tbl) -> blast.tbl 94 | } else if(is.null(blast.tbl)){ 95 | stop("Needs either blast.files or blast.tbl as input") 96 | } else { 97 | blast.tbl %>% 98 | filter(.data$Evalue <= e.value) %>% 99 | select(-.data$Evalue) %>% 100 | arrange(desc(.data$Bitscore)) %>% 101 | mutate(GIDd = str_extract(.data$Dbase, "GID[0-9]+")) %>% 102 | mutate(GIDq = str_extract(.data$Query, "GID[0-9]+")) -> blast.tbl 103 | blast.tbl %>% 104 | filter(.data$GIDd == .data$GIDq) %>% 105 | mutate(Pair = sortPaste(.data$Dbase, .data$Query)) %>% 106 | distinct(.data$Pair, .keep_all = TRUE) %>% 107 | select(-.data$Pair) -> self.tbl 108 | blast.tbl %>% 109 | filter(.data$GIDd != .data$GIDq) %>% 110 | bind_rows(self.tbl) %>% 111 | select(-.data$GIDd, -.data$GIDq) %>% 112 | distinct(.data$Dbase, .data$Query, .keep_all = TRUE) -> blast.tbl 113 | } 114 | if(verbose) cat("bDist:\n ...found", nrow(blast.tbl), "alignments...\n") 115 | blast.tbl %>% 116 | filter(.data$Dbase == .data$Query) -> self.tbl 117 | if(verbose) cat(" ...where", nrow(self.tbl), "are self-alignments...\n") 118 | idx.d <- match(blast.tbl$Dbase, self.tbl$Dbase) 119 | idd <- which(is.na(idx.d)) 120 | if(length(idd) > 0) stop("No self-alignment for sequences: ", str_c(unique(blast.tbl$Dbase[idd]), collapse = ",")) 121 | idx.q <- match(blast.tbl$Query, self.tbl$Query) 122 | idd <- which(is.na(idx.q)) 123 | if(length(idd) > 0) stop("No self-alignment for sequences: ", str_c(unique(blast.tbl$Query[idd]), collapse = ",")) 124 | 125 | blast.tbl %>% 126 | mutate(Distance = 1 - (2 * .data$Bitscore) / (self.tbl$Bitscore[idx.d] + self.tbl$Bitscore[idx.q])) %>% 127 | arrange(.data$Dbase, .data$Query) -> dist.tbl 128 | return(dist.tbl) 129 | } 130 | 131 | 132 | # Local function 133 | sortPaste <- function(q, h){ 134 | M <- matrix(c(q, h), ncol = 2, byrow = F) 135 | pp <- apply(M, 1, function(x){paste(sort(x), collapse = ":")}) 136 | return(pp) 137 | } 138 | 139 | 140 | 141 | #' @name readBlastSelf 142 | #' @aliases readBlastSelf readBlastPair 143 | #' @title Reads BLAST result files 144 | #' 145 | #' @description Reads files from a search with blastpAllAll 146 | #' 147 | #' @param blast.files A text vector of filenames. 148 | #' @param e.value A threshold E-value to immediately discard (very) poor BLAST alignments. 149 | #' @param verbose Logical, indicating if textual output should be given to monitor the progress. 150 | #' 151 | #' @details The filenames given as input must refer to BLAST result files produced by \code{\link{blastpAllAll}}. 152 | #' 153 | #' With \code{readBlastSelf} you only read the self-alignment results, i.e. blasting a genome against itself. With 154 | #' \code{readBlastPair} you read all the other files, i.e. different genomes compared. You may use all blast file 155 | #' names as input to both, they will select the proper files based on their names, e.g. GID1_vs_GID1.txt is read 156 | #' by \code{readBlastSelf} while GID2_vs_GID1.txt is read by \code{readBlastPair}. 157 | #' 158 | #' Setting a small \samp{e.value} threshold will filter the alignment, and may speed up this and later processing, 159 | #' but you may also loose some important alignments for short sequences. 160 | #' 161 | #' Both these functions are used by \code{\link{bDist}}. The reason we provide them separately is to allow the user 162 | #' to complete this file reading before calling \code{\link{bDist}}. If you have a huge number of files, a 163 | #' skilled user may utilize parallell processing to speed up the reading. For normal size data sets (e.g. less than 100 genomes) 164 | #' you should probably use \code{\link{bDist}} directly. 165 | #' 166 | #' @return The functions returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore} 167 | #' and \samp{Distance}. Each row corresponds to a pair of sequences (a Dbase and a Query sequence) having at least 168 | #' one BLAST hit between 169 | #' them. All pairs \emph{not} listed have distance 1.0 between them. You should normally bind the output from 170 | #' \code{readBlastSelf} to the ouptut from \code{readBlastPair} and use the result as input to \code{\link{bDist}}. 171 | #' 172 | #' @author Lars Snipen. 173 | #' 174 | #' @seealso \code{\link{bDist}}, \code{\link{blastpAllAll}}. 175 | #' 176 | #' @examples 177 | #' # Using BLAST result files in this package... 178 | #' prefix <- c("GID1_vs_GID1_", 179 | #' "GID2_vs_GID1_", 180 | #' "GID3_vs_GID1_", 181 | #' "GID2_vs_GID2_", 182 | #' "GID3_vs_GID2_", 183 | #' "GID3_vs_GID3_") 184 | #' bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz")) 185 | #' 186 | #' # We need to uncompress them first... 187 | #' blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz") 188 | #' ok <- file.copy(from = bf, to = blast.files) 189 | #' blast.files <- unlist(lapply(blast.files, xzuncompress)) 190 | #' 191 | #' # Reading self-alignment files, then the other files 192 | #' self.tbl <- readBlastSelf(blast.files) 193 | #' pair.tbl <- readBlastPair(blast.files) 194 | #' 195 | #' # ...and cleaning... 196 | #' ok <- file.remove(blast.files) 197 | #' 198 | #' # See also examples for bDist 199 | #' 200 | #' @importFrom stringr str_extract_all 201 | #' @importFrom dplyr %>% rename filter bind_rows 202 | #' @importFrom utils read.table 203 | #' @importFrom rlang .data 204 | #' 205 | #' @export readBlastSelf readBlastPair 206 | #' 207 | readBlastSelf <- function(blast.files, e.value = 1, verbose = TRUE){ 208 | blast.files <- normalizePath(blast.files) 209 | if(verbose) cat("readBlastSelf:\n ...received", length(blast.files), "blast-files...\n") 210 | gids <- str_extract_all(blast.files, "GID[0-9]+", simplify = TRUE) 211 | self.idx <- which(gids[,1] == gids[,2]) 212 | if(verbose) cat(" ...found", length(self.idx), "self-alignment files...\n") 213 | lapply(blast.files[self.idx], read.table, header = FALSE, sep = "\t", strip.white = TRUE, stringsAsFactors = FALSE) %>% 214 | bind_rows() %>% 215 | rename(Dbase = .data$V1, Query = .data$V2, Evalue = .data$V3, Bitscore = .data$V4) %>% 216 | filter(.data$Evalue <= e.value) -> self.tbl 217 | if(verbose) cat(" ...returns", nrow(self.tbl), "alignment results\n") 218 | return(self.tbl) 219 | } 220 | readBlastPair <- function(blast.files, e.value = 1, verbose = TRUE){ 221 | blast.files <- normalizePath(blast.files) 222 | if(verbose) cat("readBlastPairs:\n ...received", length(blast.files), "blast-files...\n") 223 | gids <- str_extract_all(blast.files, "GID[0-9]+", simplify = TRUE) 224 | pair.idx <- which(gids[,1] != gids[,2]) 225 | if(verbose) cat(" ...found", length(pair.idx), "alignment files who are NOT self-alignments...\n") 226 | lapply(blast.files[pair.idx], read.table, header = FALSE, sep = "\t", strip.white = TRUE, stringsAsFactors = FALSE) %>% 227 | bind_rows() %>% 228 | rename(Dbase = .data$V1, Query = .data$V2, Evalue = .data$V3, Bitscore = .data$V4) %>% 229 | filter(.data$Evalue <= e.value) -> pair.tbl 230 | if(verbose) cat(" ...returns", nrow(pair.tbl), "alignment results\n") 231 | return(pair.tbl) 232 | } 233 | 234 | -------------------------------------------------------------------------------- /R/binomix.R: -------------------------------------------------------------------------------- 1 | #' @name binomixEstimate 2 | #' @aliases binomixEstimate 3 | #' 4 | #' @title Binomial mixture model estimates 5 | #' 6 | #' @description Fits binomial mixture models to the data given as a pan-matrix. From the fitted models 7 | #' both estimates of pan-genome size and core-genome size are available. 8 | #' 9 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 10 | #' @param K.range The range of model complexities to explore. The vector of integers specify the number 11 | #' of binomial densities to combine in the mixture models. 12 | #' @param core.detect.prob The detection probability of core genes. This should almost always be 1.0, 13 | #' since a core gene is by definition always present in all genomes, but can be set fractionally smaller. 14 | #' @param verbose Logical indicating if textual output should be given to monitor the progress of the 15 | #' computations. 16 | #' 17 | #' @details A binomial mixture model can be used to describe the distribution of gene clusters across 18 | #' genomes in a pan-genome. The idea and the details of the computations are given in Hogg et al (2007), 19 | #' Snipen et al (2009) and Snipen & Ussery (2012). 20 | #' 21 | #' Central to the concept is the idea that every gene has a detection probability, i.e. a probability of 22 | #' being present in a genome. Genes who are always present in all genomes are called core genes, and these 23 | #' should have a detection probability of 1.0. Other genes are only present in a subset of the genomes, and 24 | #' these have smaller detection probabilities. Some genes are only present in one single genome, denoted 25 | #' ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated 26 | #' is large these latter must have a very small detection probability. 27 | #' 28 | #' A binomial mixture model with \samp{K} components estimates \samp{K} detection probabilities from the 29 | #' data. The more components you choose, the better you can fit the (present) data, at the cost of less 30 | #' precision in the estimates due to less degrees of freedom. \code{\link{binomixEstimate}} allows you to 31 | #' fit several models, and the input \samp{K.range} specifies which values of \samp{K} to try out. There no 32 | #' real point using \samp{K} less than 3, and the default is \samp{K.range=3:5}. In general, the more genomes 33 | #' you have the larger you can choose \samp{K} without overfitting. Computations will be slower for larger 34 | #' values of \samp{K}. In order to choose the optimal value for \samp{K}, \code{\link{binomixEstimate}} 35 | #' computes the BIC-criterion, see below. 36 | #' 37 | #' As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a 38 | #' \samp{K}-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet 39 | #' observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer 40 | #' core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene 41 | #' clusters, i.e. those still left after having observed \sQuote{infinite} many genomes. 42 | #' 43 | #' The detection probability of core genes should be 1.0, but can at times be set fractionally smaller. 44 | #' This means you accept that even core genes are not always detected in every genome, e.g. they may be 45 | #' there, but your gene prediction has missed them. Notice that setting the \samp{core.detect.prob} to less 46 | #' than 1.0 may affect the core gene size estimate dramatically. 47 | #' 48 | #' @return \code{\link{binomixEstimate}} returns a \code{list} with two components, the \samp{BIC.tbl} 49 | #' and \samp{Mix.tbl}. 50 | #' 51 | #' The \samp{BIC.tbl} is a \code{tibble} listing, in each row, the results for each number of components 52 | #' used, given by the input \samp{K.range}. The column \samp{Core.size} is the estimated number of 53 | #' core gene families, the column \samp{Pan.size} is the estimated pan-genome size. The column 54 | #' \samp{BIC} is the Bayesian Information Criterion (Schwarz, 1978) that should be used to choose the 55 | #' optimal component number (\samp{K}). The number of components where \samp{BIC} is minimized is the 56 | #' optimum. If minimum \samp{BIC} is reached for the largest \samp{K} value you should extend the 57 | #' \samp{K.range} to larger values and re-fit. The function will issue 58 | #' a \code{warning} to remind you of this. 59 | #' 60 | #' The \samp{Mix.tbl} is a \code{tibble} with estimates from the mixture models. The column \samp{Component} 61 | #' indicates the model, i.e. all rows where \samp{Component} has the same value are from the same model. 62 | #' There will be 3 rows for 3-component model, 4 rows for 4-component, etc. The column \samp{Detection.prob} 63 | #' contain the estimated detection probabilities for each component of the mixture models. A 64 | #' \samp{Mixing.proportion} is the proportion of the gene clusters having the corresponding \samp{Detection.prob}, 65 | #' i.e. if core genes have \samp{Detection.prob} 1.0, the corresponding \samp{Mixing.proportion} (same row) 66 | #' indicates how large fraction of the gene families are core genes. 67 | #' 68 | #' @references 69 | #' Hogg, J.S., Hu, F.Z, Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J.C., Ehrlich, G.D. (2007). 70 | #' Characterization and modeling of the Haemophilus influenzae core- and supra-genomes based on the 71 | #' complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biology, 8:R103. 72 | #' 73 | #' Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial 74 | #' mixture models. BMC Genomics, 10:385. 75 | #' 76 | #' Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to 77 | #' Escherichia coli. F1000 Research, 1:19. 78 | #' 79 | #' Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464. 80 | #' 81 | #' @author Lars Snipen and Kristian Hovde Liland. 82 | #' 83 | #' @seealso \code{\link{panMatrix}}, \code{\link{chao}}. 84 | #' 85 | #' @examples 86 | #' # Loading an example pan-matrix 87 | #' data(xmpl.panmat) 88 | #' 89 | #' # Estimating binomial mixture models 90 | #' binmix.lst <- binomixEstimate(xmpl.panmat, K.range = 3:8) 91 | #' print(binmix.lst$BIC.tbl) # minimum BIC at 3 components 92 | #' 93 | #' \dontrun{ 94 | #' # The pan-genome gene distribution as a pie-chart 95 | #' library(ggplot2) 96 | #' ncomp <- 3 97 | #' binmix.lst$Mix.tbl %>% 98 | #' filter(Components == ncomp) %>% 99 | #' ggplot() + 100 | #' geom_col(aes(x = "", y = Mixing.proportion, fill = Detection.prob)) + 101 | #' coord_polar(theta = "y") + 102 | #' labs(x = "", y = "", title = "Pan-genome gene distribution") + 103 | #' scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue")) 104 | #' 105 | #' # The distribution in an average genome 106 | #' binmix.lst$Mix.tbl %>% 107 | #' filter(Components == ncomp) %>% 108 | #' mutate(Single = Mixing.proportion * Detection.prob) %>% 109 | #' ggplot() + 110 | #' geom_col(aes(x = "", y = Single, fill = Detection.prob)) + 111 | #' coord_polar(theta = "y") + 112 | #' labs(x = "", y = "", title = "Average genome gene distribution") + 113 | #' scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue")) 114 | #' } 115 | #' 116 | #' @importFrom stringr str_c 117 | #' @importFrom tibble tibble 118 | #' 119 | #' @export binomixEstimate 120 | #' 121 | binomixEstimate <- function(pan.matrix, K.range = 3:5, core.detect.prob = 1.0, verbose = TRUE){ 122 | pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1 123 | y <- table(factor(colSums(pan.matrix), levels = 1:nrow(pan.matrix))) 124 | bic.mat <- matrix(c(K.range, rep(0, 3*length(K.range))), ncol = 4) 125 | colnames(bic.mat) <- c("K.range", "Core.size", "Pan.size", "BIC") 126 | mix.tbl <- NULL 127 | for(i in 1:length(K.range)){ 128 | if(verbose) cat("binomixEstimate: Fitting", K.range[i], "component model...\n") 129 | lst <- binomixMachine(y, K.range[i], core.detect.prob) 130 | bic.mat[i,-1] <- lst[[1]] 131 | mix.tbl <- bind_rows(mix.tbl, lst[[2]]) 132 | } 133 | bic.tbl <- as_tibble(bic.mat) 134 | if(bic.tbl[length(K.range),3] == min(bic.tbl[,3])) warning("Minimum BIC at maximum K, increase upper limit of K.range") 135 | return(list(BIC.tbl = bic.tbl, Mix.tbl = mix.tbl)) 136 | } 137 | 138 | 139 | #' @importFrom stats constrOptim as.dendrogram dendrapply dist is.leaf optim prcomp sd 140 | #' @importFrom tibble tibble 141 | binomixMachine <- function(y, K, core.detect.prob = 1.0){ 142 | n <- sum(y) 143 | G <- length(y) 144 | ctr <- list(maxit = 300, reltol = 1e-6) 145 | np <- K - 1 146 | 147 | pmix0 <- rep(1, np)/K # flat mixture proportions 148 | pdet0 <- (1:np)/(np+1) # "all" possible detection probabilities 149 | p.initial <- c(pmix0, pdet0) # initial values for parameters 150 | # the inequality constraints... 151 | A <- rbind(c( rep(1, np), rep(0, np)), c(rep(-1, np), rep(0, np)), diag(np+np), -1*diag(np+np)) 152 | b <- c(0, -1, rep(0, np+np), rep(-1, np+np)) 153 | 154 | # The estimation, maximizing the negative truncated log-likelihood function 155 | est <- constrOptim(theta = p.initial, f = negTruncLogLike, grad = NULL, method = "Nelder-Mead", control = ctr, ui = A, ci = b, 156 | y = y, core.p = core.detect.prob) 157 | 158 | estimates <- numeric(3) 159 | names(estimates) <- c("Core.size", "Pan.size", "BIC") 160 | estimates[3] <- 2*est$value + log(n)*(np+K) # the BIC-criterion 161 | p.mix <- c(1 - sum(est$par[1:np]), est$par[1:np]) # the mixing proportions 162 | p.det <- c(core.detect.prob, est$par[(np+1):length( est$par )]) # the detection probabilities 163 | ixx <- order(p.det) 164 | p.det <- p.det[ixx] 165 | p.mix <- p.mix[ixx] 166 | 167 | theta_0 <- choose(G, 0) * sum(p.mix * (1-p.det)^G) 168 | y_0 <- n * theta_0/(1-theta_0) 169 | estimates[2] <- n + round(y_0) 170 | ixx <- which(p.det >= core.detect.prob) 171 | estimates[1] <- round(estimates[2] * sum(p.mix[ixx])) 172 | mix.tbl <- tibble(Components = rep(K, length(p.det)), 173 | Detection.prob = p.det, 174 | Mixing.proportion = p.mix) 175 | return(list(estimates, mix.tbl)) 176 | } 177 | 178 | negTruncLogLike <- function(p, y, core.p){ 179 | np <- length(p)/2 180 | p.det <- c(core.p, p[(np+1):length(p)]) 181 | p.mix <- c(1-sum(p[1:np]), p[1:np]) 182 | G <- length(y) 183 | K <- length(p.mix) 184 | n <- sum(y) 185 | 186 | theta_0 <- choose(G, 0) * sum(p.mix * (1-p.det)^G) 187 | L <- -n * log(1 - theta_0) 188 | for(g in 1:G){ 189 | theta_g <- choose(G, g) * sum(p.mix * p.det^g * (1-p.det)^(G-g)) 190 | L <- L + y[g] * log(theta_g) 191 | } 192 | return(-L) 193 | } 194 | -------------------------------------------------------------------------------- /R/blasting.R: -------------------------------------------------------------------------------- 1 | #' @name blastpAllAll 2 | #' @title Making BLAST search all against all genomes 3 | #' 4 | #' @description Runs a reciprocal all-against-all BLAST search to look for similarity of proteins 5 | #' within and across genomes. 6 | #' 7 | #' @param prot.files A vector with FASTA filenames. 8 | #' @param out.folder The folder where the result files should end up. 9 | #' @param e.value The chosen E-value threshold in BLAST. 10 | #' @param job An integer to separate multiple jobs. 11 | #' @param start.at An integer to specify where in the file-list to start BLASTing. 12 | #' @param threads The number of CPU's to use. 13 | #' @param verbose Logical, if \code{TRUE} some text output is produced to monitor the progress. 14 | #' 15 | #' @details A basic step in pangenomics and many other comparative studies is to cluster proteins into 16 | #' groups or families. One commonly used approach is based on BLASTing. This function uses the 17 | #' \samp{blast+} software available for free from NCBI (Camacho et al, 2009). More precisely, the blastp 18 | #' algorithm with the BLOSUM45 scoring matrix and all composition based statistics turned off. 19 | #' 20 | #' A vector listing FASTA files of protein sequences is given as input in \samp{prot.files}. These files 21 | #' must have the genome_id in the first token of every header, and in their filenames as well, i.e. all input 22 | #' files should first be prepared by \code{\link{panPrep}} to ensure this. Note that only protein sequences 23 | #' are considered here. If your coding genes are stored as DNA, please translate them to protein prior to 24 | #' using this function, see \code{\link[microseq]{translate}}. 25 | #' 26 | #' In the first version of this package we used reciprocal BLASTing, i.e. we computed both genome A against 27 | #' B and B against A. This may sometimes produce slightly different results, but in reality this is too 28 | #' costly compared to its gain, and we now only make one of the above searches. This basically halves the 29 | #' number of searches. This step is still very time consuming for larger number of genomes. Note that the 30 | #' protein files are sorted by the genome_id (part of filename) inside this function. This is to ensure a 31 | #' consistent ordering irrespective of how they are enterred. 32 | #' 33 | #' For every pair of genomes a result file is produced. If two genomes have genome_id's \samp{GID111}, 34 | #' and \samp{GID222} then the result file \samp{GID222_vs_GID111.txt} will 35 | #' be found in \samp{out.folder} after the completion of this search. The last of the two genome_id is always 36 | #' the first in alphabetical order of the two. 37 | #' 38 | #' The \samp{out.folder} is scanned for already existing result files, and \code{\link{blastpAllAll}} never 39 | #' overwrites an existing result file. If a file with the name \samp{GID111_vs_GID222.txt} already exists in 40 | #' the \samp{out.folder}, this particular search is skipped. This makes it possible to run multiple jobs in 41 | #' parallell, writing to the same \samp{out.folder}. It also makes it possible to add new genomes, and only 42 | #' BLAST the new combinations without repeating previous comparisons. 43 | #' 44 | #' This search can be slow if the genomes contain many proteins and it scales quadratically in the number of 45 | #' input files. It is best suited for the study of a smaller number of genomes. By 46 | #' starting multiple R sessions, you can speed up the search by running \code{\link{blastpAllAll}} from each R 47 | #' session, using the same \samp{out.folder} but different integers for the \code{job} option. At the same 48 | #' time you may also want to start the BLASTing at different places in the file-list, by giving larger values 49 | #' to the argument \code{start.at}. This is 1 by default, i.e. the BLASTing starts at the first protein file. 50 | #' If you are using a multicore computer you can also increase the number of CPUs by increasing \code{threads}. 51 | #' 52 | #' The result files are tab-separated text files, and can be read into R, but more 53 | #' commonly they are used as input to \code{\link{bDist}} to compute distances between sequences for subsequent 54 | #' clustering. 55 | #' 56 | #' @return The function produces a result file for each pair of files listed in \samp{prot.files}. 57 | #' These result files are located in \code{out.folder}. Existing files are never overwritten by 58 | #' \code{\link{blastpAllAll}}, if you want to re-compute something, delete the corresponding result files first. 59 | #' 60 | #' @references Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. 61 | #' (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421. 62 | #' 63 | #' @note The \samp{blast+} software must be installed on the system for this function to work, i.e. the command 64 | #' \samp{system("makeblastdb -help")} must be recognized as valid commands if you 65 | #' run them in the Console window. 66 | #' 67 | #' @author Lars Snipen and Kristian Hovde Liland. 68 | #' 69 | #' @seealso \code{\link{panPrep}}, \code{\link{bDist}}. 70 | #' 71 | #' @examples 72 | #' \dontrun{ 73 | #' # This example requires the external BLAST+ software 74 | #' # Using protein files in this package 75 | #' pf <- file.path(path.package("micropan"), "extdata", 76 | #' str_c("xmpl_GID", 1:3, ".faa.xz")) 77 | #' 78 | #' # We need to uncompress them first... 79 | #' prot.files <- tempfile(fileext = c("_GID1.faa.xz","_GID2.faa.xz","_GID3.faa.xz")) 80 | #' ok <- file.copy(from = pf, to = prot.files) 81 | #' prot.files <- unlist(lapply(prot.files, xzuncompress)) 82 | #' 83 | #' # Blasting all versus all 84 | #' out.dir <- "." 85 | #' blastpAllAll(prot.files, out.folder = out.dir) 86 | #' 87 | #' # Reading results, and computing blast.distances 88 | #' blast.files <- list.files(out.dir, pattern = "GID[0-9]+_vs_GID[0-9]+.txt") 89 | #' blast.distances <- bDist(file.path(out.dir, blast.files)) 90 | #' 91 | #' # ...and cleaning... 92 | #' ok <- file.remove(prot.files) 93 | #' ok <- file.remove(file.path(out.dir, blast.files)) 94 | #' } 95 | #' 96 | #' @importFrom stringr str_extract str_c 97 | #' 98 | #' @export blastpAllAll 99 | blastpAllAll <- function(prot.files, out.folder, e.value = 1, job = 1, 100 | threads = 1, start.at = 1, verbose = TRUE){ 101 | if(available.external("blast+")){ 102 | N <- length(prot.files) 103 | genome_id <- str_extract(prot.files, "GID[0-9]+") 104 | prot.files <- prot.files[order(genome_id)] 105 | genome_id <- genome_id[order(genome_id)] 106 | prot.files <- prot.files[start.at:N] 107 | genome_id <- genome_id[start.at:N] 108 | N <- length(prot.files) 109 | if(.Platform$OS.type == "windows"){ 110 | outfmt <- "\"6 qseqid sseqid evalue bitscore\"" 111 | } else { 112 | outfmt <- "'6 qseqid sseqid evalue bitscore'" 113 | } 114 | #out.folder <- normalizePath(out.folder) # do NOT normalize this path! 115 | file.tbl <- data.frame(Dbase = rep("", (N^2 + N)/2), 116 | Query = rep("", (N^2 + N)/2), 117 | Res.file = rep("", (N^2 + N)/2), 118 | stringsAsFactors = FALSE) 119 | cc <- 1 120 | for(i in 1:N){ 121 | for(j in i:N){ 122 | file.tbl$Dbase[cc] <- prot.files[i] 123 | file.tbl$Query[cc] <- prot.files[j] 124 | file.tbl$Res.file[cc] <- str_c(genome_id[j], "_vs_", genome_id[i], ".txt") 125 | cc <- cc + 1 126 | } 127 | } 128 | existing.files <- list.files(out.folder, pattern = "txt$") 129 | file.tbl %>% 130 | filter(!(.data$Res.file %in% existing.files)) -> file.tbl 131 | if(nrow(file.tbl) > 0){ 132 | dbases <- unique(file.tbl$Dbase) 133 | for(i in 1:length(dbases)){ 134 | log.fil <- file.path(out.folder, str_c("log", job, ".txt")) 135 | db.fil <- file.path(out.folder, str_c("blastDB", job)) 136 | if(verbose) cat("blastpAllAll: Making BLAST database of", dbases[i], "\n") 137 | system(paste("makeblastdb -logfile", log.fil, "-dbtype prot -out", db.fil, "-in", dbases[i])) 138 | file.tbl %>% 139 | filter(.data$Dbase == dbases[i]) -> tbl 140 | for(j in 1:nrow(tbl)){ 141 | out.file <- file.path(out.folder, tbl$Res.file[j]) 142 | if(!file.exists(out.file)){ 143 | if(verbose) cat(" ", tbl$Res.file[j], "\n") 144 | cmd <- paste("blastp", 145 | "-matrix BLOSUM45", 146 | "-evalue", e.value, 147 | "-num_threads", threads, 148 | "-comp_based_stats", "F", 149 | "-num_alignments", 1000, 150 | "-outfmt", outfmt, 151 | "-query", tbl$Query[j], 152 | "-db", db.fil, 153 | "-out", out.file) 154 | system(cmd) 155 | } 156 | } 157 | } 158 | ok <- file.remove(list.files(out.folder, pattern = str_c("blastDB", job), full.names = T)) 159 | ok <- file.remove(log.fil) 160 | if(file.exists(str_c(log.fil, ".perf"))) ok <- file.remove(str_c(log.fil, ".perf")) 161 | } 162 | invisible(TRUE) 163 | } 164 | } 165 | -------------------------------------------------------------------------------- /R/domainclust.R: -------------------------------------------------------------------------------- 1 | #' @name dClust 2 | #' @title Clustering sequences based on domain sequence 3 | #' 4 | #' @description Proteins are clustered by their sequence of protein domains. A domain sequence is the 5 | #' ordered sequence of domains in the protein. All proteins having identical domain sequence are assigned 6 | #' to the same cluster. 7 | #' 8 | #' @param hmmer.tbl A \code{tibble} of results from a \code{\link{hmmerScan}} against a domain database. 9 | #' 10 | #' @details A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins 11 | #' contain known domains, but those who do will have from one to several domains, and these can be ordered 12 | #' forming a sequence. Since domains can be more or less conserved, two proteins can be quite different in 13 | #' their amino acid sequence, and still share the same domains. Describing, and grouping, proteins by their 14 | #' domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise 15 | #' alignments, see \code{\link{bClust}}. Domain sequence clusters are less influenced by gene prediction errors. 16 | #' 17 | #' The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}. Typically, it is the 18 | #' result of scanning proteins (using \code{\link{hmmerScan}}) against Pfam-A or any other HMMER3 database 19 | #' of protein domains. It is highly reccomended that you remove overlapping hits in \samp{hmmer.tbl} before 20 | #' you pass it as input to \code{\link{dClust}}. Use the function \code{\link{hmmerCleanOverlap}} for this. 21 | #' Overlapping hits are in some cases real hits, but often the poorest of them are artifacts. 22 | #' 23 | #' @return The output is a numeric vector with one element for each unique sequence in the \samp{Query} 24 | #' column of the input \samp{hmmer.tbl}. Sequences with identical number belong to the same cluster. The 25 | #' name of each element identifies the sequence. 26 | #' 27 | #' This vector also has an attribute called \samp{cluster.info} which is a character vector containing the 28 | #' domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc. 29 | #' In this way you can, in addition to clustering the sequences, also see which domains the sequences of a 30 | #' particular cluster share. 31 | #' 32 | #' @references Snipen, L. Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications 33 | #' to Escherichia coli. F1000 Research, 1:19. 34 | #' 35 | #' @author Lars Snipen and Kristian Hovde Liland. 36 | #' 37 | #' @seealso \code{\link{panPrep}}, \code{\link{hmmerScan}}, \code{\link{readHmmer}}, 38 | #' \code{\link{hmmerCleanOverlap}}, \code{\link{bClust}}. 39 | #' 40 | #' @examples 41 | #' # HMMER3 result files in this package 42 | #' hf <- file.path(path.package("micropan"), "extdata", 43 | #' str_c("GID", 1:3, "_vs_microfam.hmm.txt.xz")) 44 | #' 45 | #' # We need to uncompress them first... 46 | #' hmm.files <- tempfile(fileext = rep(".xz", length(hf))) 47 | #' ok <- file.copy(from = hf, to = hmm.files) 48 | #' hmm.files <- unlist(lapply(hmm.files, xzuncompress)) 49 | #' 50 | #' # Reading the HMMER3 results, cleaning overlaps... 51 | #' hmmer.tbl <- NULL 52 | #' for(i in 1:3){ 53 | #' readHmmer(hmm.files[i]) %>% 54 | #' hmmerCleanOverlap() %>% 55 | #' bind_rows(hmmer.tbl) -> hmmer.tbl 56 | #' } 57 | #' 58 | #' # The clustering 59 | #' clst <- dClust(hmmer.tbl) 60 | #' 61 | #' # ...and cleaning... 62 | #' ok <- file.remove(hmm.files) 63 | #' 64 | #' @importFrom dplyr arrange group_by summarize mutate 65 | #' @importFrom rlang .data 66 | #' 67 | #' @export dClust 68 | #' 69 | dClust <- function(hmmer.tbl){ 70 | hmmer.tbl %>% 71 | arrange(.data$Start) %>% 72 | group_by(.data$Query) %>% 73 | summarize(Dom.seq = str_c(.data$Hit, collapse = ",")) %>% 74 | mutate(Cluster = as.integer(factor(.data$Dom.seq, levels = unique(.data$Dom.seq)))) -> tbl 75 | 76 | dsc <- tbl$Cluster 77 | names(dsc) <- tbl$Query 78 | attr(dsc, "cluster.info") <- unique(tbl$Dom.seq) 79 | return(dsc) 80 | } 81 | 82 | 83 | #' @name hmmerCleanOverlap 84 | #' @title Removing overlapping hits from HMMER3 scans 85 | #' 86 | #' @description Removing hits to avoid overlapping HMMs on the same protein sequence. 87 | #' 88 | #' @param hmmer.tbl A table (\code{tibble}) with \code{\link{hmmerScan}} results, see \code{\link{readHmmer}}. 89 | #' 90 | #' @details When scanning sequences against a profile HMM database using \code{\link{hmmerScan}}, we 91 | #' often find that several patterns (HMMs) match in the same region of the query sequence, i.e. we have 92 | #' overlapping hits. The function \code{\link{hmmerCleanOverlap}} will remove the poorest overlapping hit 93 | #' in a recursive way such that all overlaps are eliminated. 94 | #' 95 | #' The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}. 96 | #' 97 | #' @return A \code{tibble} which is a subset of the input, where some rows may have been deleted to 98 | #' avoid overlapping hits. 99 | #' 100 | #' @author Lars Snipen and Kristian Hovde Liland. 101 | #' 102 | #' @seealso \code{\link{hmmerScan}}, \code{\link{readHmmer}}, \code{\link{dClust}}. 103 | #' 104 | #' @examples # See the example in the Help-file for dClust. 105 | #' 106 | #' @importFrom dplyr filter select %>% slice 107 | #' @importFrom rlang .data 108 | #' 109 | #' @export hmmerCleanOverlap 110 | #' 111 | hmmerCleanOverlap <- function(hmmer.tbl){ 112 | qt <- table(hmmer.tbl$Query) 113 | if(max(qt) > 1){ 114 | multi <- names(qt[qt > 1]) 115 | hmmer.tbl$Keep <- TRUE 116 | for(i in 1:length(multi)){ 117 | idx <- which(hmmer.tbl$Query == multi[i]) 118 | hmmer.tbl$Keep[idx] <- keeper(hmmer.tbl[idx,]) 119 | } 120 | hmmer.tbl %>% 121 | filter(.data$Keep) %>% 122 | select(-.data$Keep) -> hmmer.tbl 123 | } 124 | return(hmmer.tbl) 125 | } 126 | 127 | 128 | 129 | # Local functions 130 | keeper<- function(hmmer.tbl){ 131 | hmmer.tbl$Overlaps <- overlapper(hmmer.tbl) 132 | while((sum(hmmer.tbl$Keep) > 1) & (sum(hmmer.tbl$Overlaps) > 0)){ 133 | idx <- which(hmmer.tbl$Overlaps) 134 | idd <- which(hmmer.tbl$Evalue[idx] == max(hmmer.tbl$Evalue[idx])) 135 | hmmer.tbl$Keep[idx[idd[1]]] <- F 136 | hmmer.tbl$Overlaps <- overlapper(hmmer.tbl) 137 | } 138 | return(hmmer.tbl$Keep) 139 | } 140 | overlapper <- function(hmmer.tbl){ 141 | olaps <- rep(FALSE, nrow(hmmer.tbl)) 142 | idx <- which(hmmer.tbl$Keep) 143 | if(length(idx) > 1){ 144 | ht <- slice(hmmer.tbl, idx) 145 | for(i in 1:nrow(ht)){ 146 | ovr <- ((ht$Start[i] <= ht$Stop[-i]) & (ht$Start[i] >= ht$Start[-i])) | 147 | ((ht$Stop[i] <= ht$Stop[-i]) & (ht$Stop[i] >= ht$Start[-i])) | 148 | ((ht$Start[i] <= ht$Start[-i]) & (ht$Stop[i] >= ht$Stop[-i])) 149 | olaps[idx[i]] <- (length(which(ovr)) > 0) 150 | } 151 | } 152 | return(olaps) 153 | } 154 | -------------------------------------------------------------------------------- /R/entrez.R: -------------------------------------------------------------------------------- 1 | #' @name entrezDownload 2 | #' @title Downloading genome data 3 | #' 4 | #' @description Retrieving genomes from NCBI using the Entrez programming utilities. 5 | #' 6 | #' @param accession A character vector containing a set of valid accession numbers at the NCBI 7 | #' Nucleotide database. 8 | #' @param out.file Name of the file where downloaded sequences should be written in FASTA format. 9 | #' @param verbose Logical indicating if textual output should be given during execution, to monitor 10 | #' the download progress. 11 | #' 12 | #' @details The Entrez programming utilities is a toolset for automatic download of data from the 13 | #' NCBI databases, see \href{https://www.ncbi.nlm.nih.gov/books/NBK25500/}{E-utilities Quick Start} 14 | #' for details. \code{\link{entrezDownload}} can be used to download genomes from the NCBI Nucleotide 15 | #' database through these utilities. 16 | #' 17 | #' The argument \samp{accession} must be a set of valid accession numbers at NCBI Nucleotide, typically 18 | #' all accession numbers related to a genome (chromosomes, plasmids, contigs, etc). For completed genomes, 19 | #' where the number of sequences is low, \samp{accession} is typically a single text listing all accession 20 | #' numbers separated by commas. In the case of some draft genomes having a large number of contigs, the 21 | #' accession numbers must be split into several comma-separated texts. The reason for this is that Entrez 22 | #' will not accept too many queries in one chunk. 23 | #' 24 | #' The downloaded sequences are saved in \samp{out.file} on your system. This will be a FASTA formatted file. 25 | #' Note that all downloaded sequences end up in this file. If you want to download multiple genomes, 26 | #' you call \code{\link{entrezDownload}} multiple times and store in multiple files. 27 | #' 28 | #' @return The name of the resulting FASTA file is returned (same as \code{out.file}), but the real result of 29 | #' this function is the creation of the file itself. 30 | #' 31 | #' @author Lars Snipen and Kristian Liland. 32 | #' 33 | #' @seealso \code{\link{getAccessions}}, \code{\link[microseq]{readFasta}}. 34 | #' 35 | #' @examples 36 | #' \dontrun{ 37 | #' # Accession numbers for the chromosome and plasmid of Buchnera aphidicola, strain APS 38 | #' acc <- "BA000003.2,AP001071.1" 39 | #' genome.file <- tempfile(pattern = "Buchnera_aphidicola", fileext = ".fna") 40 | #' txt <- entrezDownload(acc, out.file = genome.file) 41 | #' 42 | #' # ...cleaning... 43 | #' ok <- file.remove(genome.file) 44 | #' } 45 | #' 46 | #' @importFrom stringr str_c 47 | #' 48 | #' @export entrezDownload 49 | #' 50 | entrezDownload <- function(accession, out.file, verbose = TRUE){ 51 | if(verbose) cat("Downloading genome...") 52 | connect <- file(out.file, open = "w") 53 | for(j in 1:length(accession)){ 54 | adr <- str_c("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide", 55 | "&id=", accession[j], 56 | "&retmode=text", 57 | "&rettype=fasta") 58 | entrez <- url(adr, open = "rt") 59 | if(isOpen(entrez)){ 60 | lines <- readLines(entrez) 61 | writeLines(lines, con = connect) 62 | close(entrez) 63 | } else { 64 | cat("Download failed: Could not open connection\n") 65 | } 66 | } 67 | close(connect) 68 | if(verbose) cat("...sequences saved in", out.file, "\n") 69 | return(out.file) 70 | } 71 | 72 | 73 | #' @name getAccessions 74 | #' @title Collecting contig accession numbers 75 | #' 76 | #' @description Retrieving the accession numbers for all contigs from a master record GenBank file. 77 | #' 78 | #' @param master.record.accession The accession number (single text) to a master record GenBank file having 79 | #' the WGS entry specifying the accession numbers to all contigs of the WGS genome. 80 | #' @param chunk.size The maximum number of accession numbers returned in one text. 81 | #' 82 | #' @details In order to download a WGS genome (draft genome) using \code{\link{entrezDownload}} you will 83 | #' need the accession number of every contig. This is found in the master record GenBank file, which is 84 | #' available for every WGS genome. \code{\link{getAccessions}} will extract these from the GenBank file and 85 | #' return them in the apropriate way to be used by \code{\link{entrezDownload}}. 86 | #' 87 | #' The download API at NCBI will not tolerate too many accessions per query, and for this reason you need 88 | #' to split the accessions for many contigs into several texts using \code{chunk.size}. 89 | #' 90 | #' @return A character vector where each element is a text listing the accession numbers separated by comma. 91 | #' Each vector element will contain no more than \code{chunk.size} accession numbers, see 92 | #' \code{\link{entrezDownload}} for details on this. The vector returned by \code{\link{getAccessions}} 93 | #' is typically used as input to \code{\link{entrezDownload}}. 94 | #' 95 | #' @author Lars Snipen and Kristian Liland. 96 | #' 97 | #' @seealso \code{\link{entrezDownload}}. 98 | #' 99 | #' @examples 100 | #' \dontrun{ 101 | #' # The master record accession for the WGS genome Mycoplasma genitalium, strain G37 102 | #' acc <- getAccessions("AAGX00000000") 103 | #' # Then we use this to download all contigs and save them 104 | #' genome.file <- tempfile(fileext = ".fna") 105 | #' txt <- entrezDownload(acc, out.file = genome.file) 106 | #' 107 | #' # ...cleaning... 108 | #' ok <- file.remove(genome.file) 109 | #' } 110 | #' 111 | #' @importFrom stringr str_c str_detect str_extract str_sub str_remove str_split 112 | #' 113 | #' @export getAccessions 114 | #' 115 | getAccessions <- function(master.record.accession, chunk.size = 99){ 116 | adrId <- str_c("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore", 117 | "&term=", master.record.accession) 118 | idSearch <- url(adrId, open = "rt") 119 | if(isOpen(idSearch)){ 120 | idDoc <- readLines(idSearch) 121 | idLine <- which(str_detect(idDoc, "+"))[1] 122 | id <- str_sub(idDoc[idLine], 5, -6) 123 | close(idSearch) 124 | } else { 125 | cat("Download failed: Could not open connection\n") 126 | close(idSearch) 127 | return(NULL) 128 | } 129 | 130 | adr <- str_c("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore", 131 | "&id=", id, 132 | "&retmode=text", 133 | "&rettype=gb") 134 | entrez <- url(adr, open = "rt") 135 | accessions <- "" 136 | if(isOpen(entrez)){ 137 | lines <- readLines(entrez) 138 | close(entrez) 139 | wgs.line <- str_remove(lines[str_detect(lines, pattern = "WGS ")], "WGS[ ]+") 140 | ss <- str_split(wgs.line, pattern = "-", simplify = T) 141 | head <- str_extract(ss[1], "[A-Z]+[0]+") 142 | ss.num <- as.numeric(str_remove(ss, "^[A-Z]+[0]+")) 143 | if(length(ss.num) > 1){ 144 | range <- ss.num[1]:ss.num[2] 145 | } else { 146 | range <- ss.num 147 | } 148 | ns <- ceiling(length(range)/chunk.size) 149 | accessions <- character(ns) 150 | for(j in 1:ns){ 151 | s1 <- (j-1) * chunk.size + 1 152 | s2 <- min(j * chunk.size, length(range)) 153 | accessions[j] <- str_c(str_c(head, range[s1]:range[s2]), collapse = ",") 154 | } 155 | } else { 156 | close(entrez) 157 | cat("Download failed: Could not open connection\n") 158 | } 159 | return(accessions) 160 | } 161 | -------------------------------------------------------------------------------- /R/extern.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Non-exported function to gracefully fail when external dependencies are missing. 4 | available.external <- function(what){ 5 | if(what == "hmmer"){ 6 | chr <- NULL 7 | try(chr <- system('hmmscan -h', intern = TRUE), silent = TRUE) 8 | if(is.null(chr)){ 9 | stop(paste('hmmer was not found by R.', 10 | 'Please install hmmer from: http://hmmer.org/download.html', 11 | 'After installation, re-start R and make sure the hmmer softwares can be run from R by', 12 | 'the command \'system("hmmscan -h")\'.', sep = '\n')) 13 | return(FALSE) 14 | } else { 15 | return(TRUE) 16 | } 17 | } else if(what == "blast+"){ 18 | chr <- NULL 19 | try(chr <- system('makeblastdb -help', intern = TRUE), silent = TRUE) 20 | if(is.null(chr)){ 21 | stop(paste('blast+ was not found by R.', 22 | 'Please install blast+ from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/', 23 | 'After installation, re-start R and make sure the blast+ softwares can be run from R by', 24 | 'the command \'system("makeblastdb -help")\'.', sep = '\n')) 25 | return(FALSE) 26 | } else { 27 | return(TRUE) 28 | } 29 | } 30 | } 31 | -------------------------------------------------------------------------------- /R/extractPanGenes.R: -------------------------------------------------------------------------------- 1 | #' @name extractPanGenes 2 | #' @title Extracting genes of same prevalence 3 | #' 4 | #' @description Based on a clustering of genes, this function extracts the genes 5 | #' occurring in the same number of genomes. 6 | #' 7 | #' @param clustering Named vector of clustering 8 | #' @param N.genomes Vector specifying the number of genomes the genes should be in 9 | #' 10 | #' @details Pan-genome studies focus on the gene families obtained by some clustering, 11 | #' see \code{\link{bClust}} or \code{\link{dClust}}. This function will extract the individual genes from 12 | #' each genome belonging to gene families found in \code{N.genomes} genomes specified by the user. 13 | #' Only the sequence tag for each gene is extracted, but the sequences can be added easily, see examples 14 | #' below. 15 | #' 16 | #' @return A table with columns 17 | #' \itemize{ 18 | #' \item cluster. The gene family (integer) 19 | #' \item seq_tag. The sequence tag identifying each sequence (text) 20 | #' \item N_genomes. The number of genomes in which it is found (integer) 21 | #' } 22 | #' 23 | #' @author Lars Snipen. 24 | #' 25 | #' @seealso \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{geneFamilies2fasta}}. 26 | #' 27 | #' @examples 28 | #' # Loading clustering data in this package 29 | #' data(xmpl.bclst) 30 | #' 31 | #' # Finding genes in 5 genomes 32 | #' core.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 5) 33 | #' #...or in a single genome 34 | #' orfan.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1) 35 | #' 36 | #' \dontrun{ 37 | #' # To add the sequences, assume all protein fasta files are in a folder named faa: 38 | #' lapply(list.files("faa", full.names = T), readFasta) %>% 39 | #' bind_rows() %>% 40 | #' mutate(seq_tag = word(Header, 1, 1)) %>% 41 | #' right_join(orfan.tbl, by = "seq_tag") -> orfan.tbl 42 | #' # The resulting table can be written to fasta file directly using writeFasta() 43 | #' # See also geneFamilies2fasta() 44 | #' } 45 | #' 46 | #' @importFrom dplyr distinct group_by summarize filter %>% mutate select arrange 47 | #' @importFrom tibble tibble 48 | #' @importFrom stringr str_extract 49 | #' @importFrom rlang .data 50 | #' 51 | #' @export extractPanGenes 52 | #' 53 | extractPanGenes <- function(clustering, N.genomes = 1:2){ 54 | tibble(cluster = clustering, 55 | seq_tag = names(clustering), 56 | genome_id = str_extract(names(clustering), "GID[0-9]+")) -> tbl 57 | names(tbl$cluster) <- NULL 58 | tbl %>% 59 | distinct(cluster, genome_id) %>% 60 | group_by(cluster) %>% 61 | summarize(n.genomes = n()) %>% 62 | filter(n.genomes %in% N.genomes) -> trg.tbl 63 | 64 | out.tbl <- NULL 65 | for(i in 1:length(N.genomes)){ 66 | idx <- which(trg.tbl$n.genomes == N.genomes[i]) 67 | tbl %>% 68 | filter(cluster %in% trg.tbl$cluster[idx]) %>% 69 | mutate(N_genomes = N.genomes[i]) %>% 70 | select(cluster, seq_tag, N_genomes) %>% 71 | arrange(cluster, seq_tag) %>% 72 | bind_rows(out.tbl) -> out.tbl 73 | } 74 | return(out.tbl) 75 | } 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | #' @name geneFamilies2fasta 85 | #' @title Write gene families to files 86 | #' 87 | #' @description Writes specified gene families to separate fasta files. 88 | #' 89 | #' @param pangene.tbl A table listing gene families (clusters). 90 | #' @param fasta.folder The folder containing the fasta files with all sequences. 91 | #' @param out.folder The folder to write to. 92 | #' @param file.ext The file extension to recognize the fasta files in \code{fasta.folder}. 93 | #' @param verbose Logical to allow text ouput during processing 94 | #' 95 | #' @details The argument \code{pangene.tbl} should be produced by \code{\link{extractPanGenes}} in order to 96 | #' contain the columns \code{cluster}, \code{seq_tag} and \code{N_genomes} required by this function. The 97 | #' files in \code{fasta.folder} must have been prepared by \code{\link{panPrep}} in order to have the proper 98 | #' sequence tag information. They may contain protein sequences or DNA sequences. 99 | #' 100 | #' If you already added the \code{Header} and \code{Sequence} information to \code{pangene.tbl} these will be 101 | #' used instead of reading the files in \code{fasta.folder}, but a warning is issued. 102 | #' 103 | #' @author Lars Snipen. 104 | #' 105 | #' @seealso \code{\link{extractPanGenes}}, \code{\link{writeFasta}}. 106 | #' 107 | #' @examples 108 | #' # Loading clustering data in this package 109 | #' data(xmpl.bclst) 110 | #' 111 | #' # Finding genes in 1,..,5 genomes (all genes) 112 | #' all.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1:5) 113 | #' 114 | #' \dontrun{ 115 | #' # All protein fasta files are in a folder named faa, and we write to the current folder: 116 | #' clusters2fasta(all.tbl, fasta.folder = "faa", out.folder = ".") 117 | #' 118 | #' # use pipe, write to folder "orfans" 119 | #' extractPanGenes(xmpl.bclst, N.genomes = 1)) %>% 120 | #' geneFamilies2fasta(fasta.folder = "faa", out.folder = "orfans") 121 | #' } 122 | #' 123 | #' @importFrom dplyr bind_rows distinct filter %>% mutate select right_join 124 | #' @importFrom microseq writeFasta readFasta 125 | #' @importFrom stringr str_extract 126 | #' @importFrom rlang .data 127 | #' 128 | #' @export geneFamilies2fasta 129 | #' 130 | geneFamilies2fasta <- function(pangene.tbl, fasta.folder, out.folder, file.ext = "fasta$|faa$|fna$|fa$", 131 | verbose = TRUE){ 132 | has.Header <- exists("Header", pangene.tbl) 133 | has.Sequence <- exists("Sequence", pangene.tbl) 134 | if(has.Header & has.Sequence){ 135 | warning("pangene.tbl already has sequences, ignoring fasta.folder") 136 | } else { 137 | if(has.Header) pangene.tbl %>% select(-Header) -> pangene.tbl 138 | if(has.Sequence) pangene.tbl %>% select(-Sequence) -> pangene.tbl 139 | fasta.files <- list.files(fasta.folder, pattern = file.ext, full.names = T) 140 | if(length(fasta.files) == 0) stop("Found no fasta files in fasta.folder") 141 | if(verbose) cat("clusters2fasta:\n found", length(fasta.files), "fasta files...\n") 142 | lapply(fasta.files, readFasta) %>% 143 | bind_rows() %>% 144 | mutate(seq_tag = word(Header, 1, 1)) %>% 145 | right_join(pangene.tbl, by = "seq_tag") -> pangene.tbl 146 | if(sum(is.na(pangene.tbl$Sequence))) stop("Cannot bind these sequences to the supplied genes, mismatching sequence tags") 147 | } 148 | pangene.tbl %>% 149 | distinct(cluster, N_genomes) %>% 150 | by(1:nrow(.), function(x){str_c("Genome=", x[2], "_Cluster=", x[1], ".fasta")}) %>% 151 | as.character() -> filenames 152 | ucls <- unique(pangene.tbl$cluster) 153 | if(verbose) cat(" writing", length(ucls), "clusters to files...\n") 154 | for(i in 1:length(ucls)){ 155 | pangene.tbl %>% 156 | filter(cluster == ucls[i]) %>% 157 | writeFasta(out.file = file.path(out.folder, filenames[i])) 158 | } 159 | } -------------------------------------------------------------------------------- /R/genomedistances.R: -------------------------------------------------------------------------------- 1 | #' @name fluidity 2 | #' @title Computing genomic fluidity for a pan-genome 3 | #' 4 | #' @description Computes the genomic fluidity, which is a measure of population diversity. 5 | #' 6 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 7 | #' @param n.sim An integer specifying the number of random samples to use in the computations. 8 | #' 9 | #' @details The genomic fluidity between two genomes is defined as the number of unique gene 10 | #' families divided by the total number of gene families (Kislyuk et al, 2011). This is averaged 11 | #' over \samp{n.sim} random pairs of genomes to obtain a population estimate. 12 | #' 13 | #' The genomic fluidity between two genomes describes their degree of overlap with respect to gene 14 | #' cluster content. If the fluidity is 0.0, the two genomes contain identical gene clusters. If it 15 | #' is 1.0 the two genomes are non-overlapping. The difference between a Jaccard distance (see 16 | #' \code{\link{distJaccard}}) and genomic fluidity is small, they both measure overlap between 17 | #' genomes, but fluidity is computed for the population by averaging over many pairs, while Jaccard 18 | #' distances are computed for every pair. Note that only presence/absence of gene clusters are 19 | #' considered, not multiple occurrences. 20 | #' 21 | #' The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}. 22 | #' 23 | #' @return A vector with two elements, the mean fluidity and its sample standard deviation over 24 | #' the \samp{n.sim} computed values. 25 | #' 26 | #' @references Kislyuk, A.O., Haegeman, B., Bergman, N.H., Weitz, J.S. (2011). Genomic fluidity: 27 | #' an integrative view of gene diversity within microbial populations. BMC Genomics, 12:32. 28 | #' 29 | #' @author Lars Snipen and Kristian Hovde Liland. 30 | #' 31 | #' @seealso \code{\link{panMatrix}}, \code{\link{distJaccard}}. 32 | #' 33 | #' @examples 34 | #' # Loading a pan-matrix in this package 35 | #' data(xmpl.panmat) 36 | #' 37 | #' # Fluidity based on this pan-matrix 38 | #' fluid <- fluidity(xmpl.panmat) 39 | #' 40 | #' @importFrom stats sd 41 | #' 42 | #' @export fluidity 43 | #' 44 | fluidity <- function(pan.matrix, n.sim = 10){ 45 | pan.matrix[which(pan.matrix > 0, arr.ind=T)] <- 1 46 | flu <- rep(0, n.sim) 47 | for(i in 1:n.sim){ 48 | ii <- sample(nrow(pan.matrix), 2) 49 | flu[i] <- (sum(pan.matrix[ii[1],] > 0 & pan.matrix[ii[2],] == 0) 50 | + sum(pan.matrix[ii[1],] == 0 & pan.matrix[ii[2],] > 0)) / (sum(pan.matrix[ii[1],]) + sum(pan.matrix[ii[2],])) 51 | } 52 | flu.vec <- c(Mean = mean(flu), Std = sd(flu)) 53 | return(flu.vec) 54 | } 55 | 56 | 57 | #' @name distJaccard 58 | #' @title Computing Jaccard distances between genomes 59 | #' 60 | #' @description Computes the Jaccard distances between all pairs of genomes. 61 | #' 62 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 63 | #' 64 | #' @details The Jaccard index between two sets is defined as the size of the intersection of 65 | #' the sets divided by the size of the union. The Jaccard distance is simply 1 minus the Jaccard index. 66 | #' 67 | #' The Jaccard distance between two genomes describes their degree of overlap with respect to gene 68 | #' cluster content. If the Jaccard distance is 0.0, the two genomes contain identical gene clusters. 69 | #' If it is 1.0 the two genomes are non-overlapping. The difference between a genomic fluidity (see 70 | #' \code{\link{fluidity}}) and a Jaccard distance is small, they both measure overlap between genomes, 71 | #' but fluidity is computed for the population by averaging over many pairs, while Jaccard distances are 72 | #' computed for every pair. Note that only presence/absence of gene clusters are considered, not multiple 73 | #' occurrences. 74 | #' 75 | #' The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}. 76 | #' 77 | #' @return A \code{dist} object (see \code{\link{dist}}) containing all pairwise Jaccard distances 78 | #' between genomes. 79 | #' 80 | #' @author Lars Snipen and Kristian Hovde Liland. 81 | #' 82 | #' @seealso \code{\link{panMatrix}}, \code{\link{fluidity}}, \code{\link{dist}}. 83 | #' 84 | #' @examples 85 | #' # Loading a pan-matrix in this package 86 | #' data(xmpl.panmat) 87 | #' 88 | #' # Jaccard distances 89 | #' Jdist <- distJaccard(xmpl.panmat) 90 | #' 91 | #' # Making a dendrogram based on the distances, 92 | #' # see example for distManhattan 93 | #' 94 | #' @export distJaccard 95 | #' 96 | distJaccard <- function(pan.matrix){ 97 | pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1 98 | D <- matrix(0, nrow = nrow(pan.matrix), ncol = nrow(pan.matrix)) 99 | rownames(D) <- colnames(D) <- rownames(pan.matrix) 100 | for(i in 1:(nrow(pan.matrix) - 1)){ 101 | for(j in (i+1):nrow(pan.matrix)){ 102 | cs <- pan.matrix[i,] + pan.matrix[j,] 103 | D[j,i] <- D[i,j] <- 1 - sum(cs > 1)/sum(cs > 0) 104 | } 105 | } 106 | return(as.dist(D)) 107 | } 108 | 109 | 110 | #' @name distManhattan 111 | #' @title Computing Manhattan distances between genomes 112 | #' 113 | #' @description Computes the (weighted) Manhattan distances beween all pairs of genomes. 114 | #' 115 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 116 | #' @param scale An optional scale to control how copy numbers should affect the distances. 117 | #' @param weights Vector of optional weights of gene clusters. 118 | #' 119 | #' @details The Manhattan distance is defined as the sum of absolute elementwise differences between 120 | #' two vectors. Each genome is represented as a vector (row) of integers in \samp{pan.matrix}. The 121 | #' Manhattan distance between two genomes is the sum of absolute difference between these rows. If 122 | #' two rows (genomes) of the \samp{pan.matrix} are identical, the corresponding Manhattan distance 123 | #' is \samp{0.0}. 124 | #' 125 | #' The \samp{scale} can be used to control how copy number differences play a role in the distances 126 | #' computed. Usually we assume that going from 0 to 1 copy of a gene is the big change of the genome, 127 | #' and going from 1 to 2 (or more) copies is less. Prior to computing the Manhattan distance, the 128 | #' \samp{pan.matrix} is transformed according to the following affine mapping: If the original value in 129 | #' \samp{pan.matrix} is \samp{x}, and \samp{x} is not 0, then the transformed value is \samp{1 + (x-1)*scale}. 130 | #' Note that with \samp{scale=0.0} (default) this will result in 1 regardless of how large \samp{x} was. 131 | #' In this case the Manhattan distance only distinguish between presence and absence of gene clusters. 132 | #' If \samp{scale=1.0} the value \samp{x} is left untransformed. In this case the difference between 1 133 | #' copy and 2 copies is just as big as between 1 copy and 0 copies. For any \samp{scale} between 0.0 and 134 | #' 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still 135 | #' present. In this way you can decide if the distances between genomes should be affected, and to what 136 | #' degree, by differences in copy numbers beyond 1. Notice that as long as \samp{scale=0.0} (and no 137 | #' weighting) the Manhattan distance has a nice interpretation, namely the number of gene clusters that 138 | #' differ in present/absent status between two genomes. 139 | #' 140 | #' When summing the difference across gene clusters we can also up- or downweight some clusters compared 141 | #' to others. The vector \samp{weights} must contain one value for each column in \samp{pan.matrix}. The 142 | #' default is to use flat weights, i.e. all clusters count equal. See \code{\link{geneWeights}} for 143 | #' alternative weighting strategies. 144 | #' 145 | #' @return A \code{dist} object (see \code{\link{dist}}) containing all pairwise Manhattan distances 146 | #' between genomes. 147 | #' 148 | #' @author Lars Snipen and Kristian Hovde Liland. 149 | #' 150 | #' @seealso \code{\link{panMatrix}}, \code{\link{distJaccard}}, \code{\link{geneWeights}}. 151 | #' 152 | #' @examples 153 | #' # Loading a pan-matrix in this package 154 | #' data(xmpl.panmat) 155 | #' 156 | #' # Manhattan distances between genomes 157 | #' Mdist <- distManhattan(xmpl.panmat) 158 | #' 159 | #' \dontrun{ 160 | #' # Making a dendrogram based on shell-weighted distances 161 | #' library(ggdendro) 162 | #' weights <- geneWeights(xmpl.panmat, type = "shell") 163 | #' Mdist <- distManhattan(xmpl.panmat, weights = weights) 164 | #' ggdendrogram(dendro_data(hclust(Mdist, method = "average")), 165 | #' rotate = TRUE, theme_dendro = FALSE) + 166 | #' labs(x = "Genomes", y = "Shell-weighted Manhattan distance", title = "Pan-genome dendrogram") 167 | #' } 168 | #' 169 | #' @importFrom stats dist 170 | #' 171 | #' @export distManhattan 172 | #' 173 | distManhattan <- function(pan.matrix, scale = 0.0, weights = rep(1, ncol(pan.matrix))){ 174 | if((scale > 1) | (scale < 0)){ 175 | warning( "scale should be between 0.0 and 1.0, using scale = 0.0" ) 176 | scale <- 0.0 177 | } 178 | idx <- which(pan.matrix > 0, arr.ind = T) 179 | pan.matrix[idx] <- 1 + (pan.matrix[idx] - 1) * scale 180 | pan.matrix <- t(t(pan.matrix) * weights) 181 | return(dist(pan.matrix, method = "manhattan")) 182 | } 183 | 184 | 185 | #' @name geneWeights 186 | #' @title Gene cluster weighting 187 | #' 188 | #' @description This function computes weights for gene cluster according to their distribution in a pan-genome. 189 | #' 190 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 191 | #' @param type A text indicating the weighting strategy. 192 | #' 193 | #' @details When computing distances between genomes or a PCA, it is possible to give weights to the 194 | #' different gene clusters, emphasizing certain aspects. 195 | #' 196 | #' As proposed by Snipen & Ussery (2010), we have implemented two types of weighting: The default 197 | #' \samp{"shell"} type means gene families occuring frequently in the genomes, denoted shell-genes, are 198 | #' given large weight (close to 1) while those occurring rarely are given small weight (close to 0). 199 | #' The opposite is the \samp{"cloud"} type of weighting. Genes observed in a minority of the genomes are 200 | #' referred to as cloud-genes. Presumeably, the \samp{"shell"} weighting will give distances/PCA reflecting 201 | #' a more long-term evolution, since emphasis is put on genes who have just barely diverged away from the 202 | #' core. The \samp{"cloud"} weighting emphasizes those gene clusters seen rarely. Genomes with similar 203 | #' patterns among these genes may have common recent history. A \samp{"cloud"} weighting typically gives 204 | #' a more erratic or \sQuote{noisy} picture than the \samp{"shell"} weighting. 205 | #' 206 | #' @return A vector of weights, one for each column in \code{pan.matrix}. 207 | #' 208 | #' @references Snipen, L., Ussery, D.W. (2010). Standard operating procedure for computing pangenome 209 | #' trees. Standards in Genomic Sciences, 2:135-141. 210 | #' 211 | #' @author Lars Snipen and Kristian Hovde Liland. 212 | #' 213 | #' @seealso \code{\link{panMatrix}}, \code{\link{distManhattan}}. 214 | #' 215 | #' @examples 216 | #' # See examples for distManhattan 217 | #' 218 | #' @export geneWeights 219 | #' 220 | geneWeights <- function(pan.matrix, type = c("shell", "cloud")){ 221 | ng <- dim( pan.matrix )[1] 222 | nf <- dim( pan.matrix )[2] 223 | pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1 224 | cs <- colSums(pan.matrix) 225 | 226 | midx <- grep(type[1], c("shell", "cloud")) 227 | if(length(midx) == 0){ 228 | warning("Unknown weighting:", type, ", using shell weights") 229 | midx <- 1 230 | } 231 | W <- rep(1, ncol(pan.matrix)) 232 | x <- 1:nrow(pan.matrix) 233 | ww <- 1 / (1 + exp(((x - 1) - (max(x) - 1)/2) / ((max(x) - 1) / 10))) 234 | if(midx == 1) ww <- 1 - ww 235 | for(i in x) W[cs == i] <- ww[i] 236 | return(W) 237 | } 238 | -------------------------------------------------------------------------------- /R/hmmer3.R: -------------------------------------------------------------------------------- 1 | #' @name hmmerScan 2 | #' @title Scanning a profile Hidden Markov Model database 3 | #' 4 | #' @description Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3 5 | #' software. 6 | #' 7 | #' @param in.files A character vector of file names. 8 | #' @param dbase The full path-name of the database to scan (text). 9 | #' @param out.folder The name of the folder to put the result files. 10 | #' @param threads Number of CPU's to use. 11 | #' @param verbose Logical indicating if textual output should be given to monitor the progress. 12 | #' 13 | #' @details The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM) 14 | #' describing patterns in biological sequences (Eddy, 2008). This function will make calls to the 15 | #' HMMER3 software to scan FASTA files of proteins against a pHMM database. 16 | #' 17 | #' The files named in \samp{in.files} must contain FASTA formatted protein sequences. These files 18 | #' should be prepared by \code{\link{panPrep}} to make certain each sequence, as well as the file name, 19 | #' has a GID-tag identifying their genome. The database named in \samp{db} must be a HMMER3 formatted 20 | #' database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see 21 | #' the HMMER3 documentation for help. 22 | #' 23 | #' \code{\link{hmmerScan}} will query every input file against the named database. The database contains 24 | #' profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every 25 | #' input file is scanned to see if some of the patterns can be matched to some degree. Each input file 26 | #' results in an output file with the same GID-tag in the name. The result files give tabular output, and 27 | #' are plain text files. See \code{\link{readHmmer}} for how to read the results into R. 28 | #' 29 | #' Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set 30 | #' up to use only 1 cpu per scan by default. By increasing \code{threads} you can utilize multiple CPUs, typically 31 | #' on a computing cluster. 32 | #' Our experience is that from a multi-core laptop it is better to start this function in default mode 33 | #' from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel 34 | #' sessions can write results to the same folder. 35 | #' 36 | #' @return This function produces files in the folder specified by \samp{out.folder}. Existing files are 37 | #' never overwritten by \code{\link{hmmerScan}}, if you want to re-compute something, delete the 38 | #' corresponding result files first. 39 | #' 40 | #' @references Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies 41 | #' Statistical Significance Estimation. PLoS Computational Biology, 4(5). 42 | #' 43 | #' @note The HMMER3 software must be installed on the system for this function to work, i.e. the command 44 | #' \samp{system("hmmscan -h")} must be recognized as a valid command if you run it in the Console window. 45 | #' 46 | #' @author Lars Snipen and Kristian Hovde Liland. 47 | #' 48 | #' @seealso \code{\link{panPrep}}, \code{\link{readHmmer}}. 49 | #' 50 | #' @examples 51 | #' \dontrun{ 52 | #' # This example require the external software HMMER 53 | #' # Using example files in this package 54 | #' pf <- file.path(path.package("micropan"), "extdata", "xmpl_GID1.faa.xz") 55 | #' dbf <- file.path(path.package("micropan"), "extdata", 56 | #' str_c("microfam.hmm", c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"))) 57 | #' 58 | #' # We need to uncompress them first... 59 | #' prot.file <- tempfile(pattern = "GID1.faa", fileext=".xz") 60 | #' ok <- file.copy(from = pf, to = prot.file) 61 | #' prot.file <- xzuncompress(prot.file) 62 | #' db.files <- str_c(tempfile(), c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz")) 63 | #' ok <- file.copy(from = dbf, to = db.files) 64 | #' db.files <- unlist(lapply(db.files, xzuncompress)) 65 | #' db.name <- str_remove(db.files[1], "\\.[a-z0-9]+$") 66 | #' 67 | #' # Scanning the FASTA file against microfam.hmm... 68 | #' hmmerScan(in.files = prot.file, dbase = db.name, out.folder = ".") 69 | #' 70 | #' # Reading results 71 | #' hmm.file <- file.path(".", str_c("GID1_vs_", basename(db.name), ".txt")) 72 | #' hmm.tbl <- readHmmer(hmm.file) 73 | #' 74 | #' # ...and cleaning... 75 | #' ok <- file.remove(prot.file) 76 | #' ok <- file.remove(str_remove(db.files, ".xz")) 77 | #' } 78 | #' 79 | #' @importFrom stringr str_c str_extract 80 | #' 81 | #' @export hmmerScan 82 | #' 83 | hmmerScan <- function(in.files, dbase, out.folder, threads = 0, verbose = TRUE){ 84 | if(length(dbase) > 1){ 85 | stop("Argument dbase must be a single text") 86 | } 87 | if(available.external("hmmer")){ 88 | log.fil <- file.path(out.folder, "log.txt") 89 | basic <- paste("hmmscan -o", log.fil,"--cut_ga --noali --cpu", threads) 90 | rbase <- str_c("_vs_", basename(dbase), ".txt") 91 | gids <- str_extract(in.files, "GID[0-9]+") 92 | in.files <- normalizePath(in.files) 93 | for(i in 1:length(in.files)){ 94 | rname <- str_c(gids[i], rbase) 95 | res.files <- list.files(out.folder) 96 | if(!(rname %in% res.files)){ 97 | if(verbose) cat("hmmerScan: Scanning file", i, "out of", length(in.files), "...\r") 98 | cmd <- paste(basic, 99 | "--domtblout", file.path(out.folder, rname), 100 | dbase, 101 | in.files[i]) 102 | system(cmd) 103 | ok <- file.remove(log.fil) 104 | } 105 | } 106 | } 107 | } 108 | 109 | 110 | 111 | #' @name readHmmer 112 | #' @title Reading results from a HMMER3 scan 113 | #' 114 | #' @description Reading a text file produced by \code{\link{hmmerScan}}. 115 | #' 116 | #' @param hmmer.file The name of a \code{\link{hmmerScan}} result file. 117 | #' @param e.value Numeric threshold, hits with E-value above this are ignored (default is 1.0). 118 | #' @param use.acc Logical indicating if accession numbers should be used to identify the hits. 119 | #' 120 | #' @details The function reads a text file produced by \code{\link{hmmerScan}}. By specifying a smaller 121 | #' \samp{e.value} you filter out poorer hits, and fewer results are returned. The option \samp{use.acc} 122 | #' should be turned off (FALSE) if you scan against your own database where accession numbers are lacking. 123 | #' 124 | #' @return The results are returned in a \samp{tibble} with columns \samp{Query}, \samp{Hit}, 125 | #' \samp{Evalue}, \samp{Score}, \samp{Start}, \samp{Stop} and \samp{Description}. \samp{Query} is the tag 126 | #' identifying each query sequence. \samp{Hit} is the name or accession number for a pHMM in the database 127 | #' describing patterns. The \samp{Evalue} is the \samp{ievalue} in the HMMER3 terminology. The \samp{Score} 128 | #' is the HMMER3 score for the match between \samp{Query} and \samp{Hit}. The \samp{Start} and \samp{Stop} 129 | #' are the positions within the \samp{Query} where the \samp{Hit} (pattern) starts and stops. 130 | #' \samp{Description} is the description of the \samp{Hit}. There is one line for each hit. 131 | #' 132 | #' @author Lars Snipen and Kristian Hovde Liland. 133 | #' 134 | #' @seealso \code{\link{hmmerScan}}, \code{\link{hmmerCleanOverlap}}, \code{\link{dClust}}. 135 | #' 136 | #' @examples 137 | #' # See the examples in the Help-files for dClust and hmmerScan. 138 | #' 139 | #' @importFrom stringr str_detect str_split str_c str_replace_all 140 | #' @importFrom tibble tibble 141 | #' @importFrom dplyr %>% 142 | #' @importFrom rlang .data 143 | #' 144 | #' @export readHmmer 145 | #' 146 | readHmmer <- function(hmmer.file, e.value = 1, use.acc = TRUE){ 147 | hmmer.file <- normalizePath(hmmer.file) 148 | lines <- readLines(hmmer.file) 149 | subset(lines, !str_detect(lines, "^\\#")) %>% 150 | str_replace_all("[ ]+", " ") %>% 151 | str_split(pattern = " ") -> lst 152 | if(use.acc){ 153 | hit <- sapply(lst, function(x){x[2]}) 154 | } else { 155 | hit <- sapply(lst, function(x){x[1]}) 156 | } 157 | tibble(Query = sapply(lst, function(x){x[4]}), 158 | Hit = hit, 159 | Evalue = as.numeric(sapply(lst, function(x){x[13]})), 160 | Score = as.numeric(sapply(lst, function(x){x[14]})), 161 | Start = as.numeric(sapply(lst, function(x){x[18]})), 162 | Stop = as.numeric(sapply(lst, function(x){x[19]})), 163 | Description = sapply(lst, function(x){str_c(x[23:length(x)], collapse = " ")})) %>% 164 | filter(.data$Evalue <= e.value) -> hmmer.tbl 165 | return(hmmer.tbl) 166 | } 167 | 168 | -------------------------------------------------------------------------------- /R/micropan.R: -------------------------------------------------------------------------------- 1 | #' @name micropan 2 | #' @aliases micropan-package 3 | #' @title Microbial Pan-Genome Analysis 4 | #' 5 | #' @description A collection of functions for computations and visualizations of microbial pan-genomes. 6 | #' Some of the functions make use of external software that needs to be installed on the system, see the 7 | #' package vignette for more details on this. 8 | #' 9 | #' 10 | #' @author Lars Snipen and Kristian Hovde Liland. 11 | #' 12 | #' Maintainer: Lars Snipen 13 | #' 14 | #' @references 15 | #' Snipen, L., Liland, KH. (2015). micropan: an R-package for microbial pan-genomics. BMC Bioinformatics, 16:79. 16 | #' 17 | NULL -------------------------------------------------------------------------------- /R/panmat.R: -------------------------------------------------------------------------------- 1 | #' @name panMatrix 2 | #' @title Computing the pan-matrix for a set of gene clusters 3 | #' 4 | #' @description A pan-matrix has one row for each genome and one column for each gene cluster, and 5 | #' cell \samp{[i,j]} indicates how many members genome \samp{i} has in gene family \samp{j}. 6 | #' 7 | #' @param clustering A named vector of integers. 8 | #' 9 | #' @details The pan-matrix is a central data structure for pan-genomic analysis. It is a matrix with 10 | #' one row for each genome in the study, and one column for each gene cluster. Cell \samp{[i,j]} 11 | #' contains an integer indicating how many members genome \samp{i} has in cluster \samp{j}. 12 | #' 13 | #' The input \code{clustering} must be a named integer vector with one element for each sequence in the study, 14 | #' typically produced by either \code{\link{bClust}} or \code{\link{dClust}}. The name of each element 15 | #' is a text identifying every sequence. The value of each element indicates the cluster, i.e. those 16 | #' sequences with identical values are in the same cluster. IMPORTANT: The name of each sequence must 17 | #' contain the \samp{genome_id} for each genome, i.e. they must of the form \samp{GID111_seq1}, \samp{GID111_seq2},... 18 | #' where the \samp{GIDxxx} part indicates which genome the sequence belongs to. See \code{\link{panPrep}} 19 | #' for details. 20 | #' 21 | #' The rows of the pan-matrix is named by the \samp{genome_id} for every genome. The columns are just named 22 | #' \samp{Cluster_x} where \samp{x} is an integer copied from \samp{clustering}. 23 | #' 24 | #' @return An integer matrix with a row for each genome and a column for each sequence cluster. 25 | #' The input vector \samp{clustering} is attached as the attribute \samp{clustering}. 26 | #' 27 | #' @author Lars Snipen and Kristian Hovde Liland. 28 | #' 29 | #' @seealso \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{distManhattan}}, 30 | #' \code{\link{distJaccard}}, \code{\link{fluidity}}, \code{\link{chao}}, 31 | #' \code{\link{binomixEstimate}}, \code{\link{heaps}}, \code{\link{rarefaction}}. 32 | #' 33 | #' @examples 34 | #' # Loading clustering data in this package 35 | #' data(xmpl.bclst) 36 | #' 37 | #' # Pan-matrix based on the clustering 38 | #' panmat <- panMatrix(xmpl.bclst) 39 | #' 40 | #' \dontrun{ 41 | #' # Plotting cluster distribution 42 | #' library(ggplot2) 43 | #' tibble(Clusters = as.integer(table(factor(colSums(panmat > 0), levels = 1:nrow(panmat)))), 44 | #' Genomes = 1:nrow(panmat)) %>% 45 | #' ggplot(aes(x = Genomes, y = Clusters)) + 46 | #' geom_col() 47 | #' } 48 | #' 49 | #' @importFrom stringr str_extract str_c 50 | #' 51 | #' @export panMatrix 52 | #' 53 | panMatrix <- function(clustering){ 54 | gids <- str_extract(names(clustering), "GID[0-9]+") 55 | ugids <- sort(unique(gids)) 56 | uclst <- sort(unique(clustering)) 57 | pan.matrix <- matrix(0, nrow = length(ugids), ncol = length(uclst)) 58 | rownames(pan.matrix) <- ugids 59 | colnames(pan.matrix) <- str_c("Cluster", uclst) 60 | for(i in 1:length(ugids)){ 61 | tb <- table(clustering[gids == ugids[i]]) 62 | idd <- as.numeric(names(tb)) 63 | idx <- which(uclst %in% idd) 64 | pan.matrix[i,idx] <- tb 65 | } 66 | attr(pan.matrix, "clustering") <- clustering 67 | return(pan.matrix) 68 | } 69 | 70 | -------------------------------------------------------------------------------- /R/panpca.R: -------------------------------------------------------------------------------- 1 | #' @name panPca 2 | #' @title Principal component analysis of a pan-matrix 3 | #' 4 | #' @description Computes a principal component decomposition of a pan-matrix, with possible 5 | #' scaling and weightings. 6 | #' 7 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 8 | #' @param scale An optional scale to control how copy numbers should affect the distances. 9 | #' @param weights Vector of optional weights of gene clusters. 10 | #' 11 | #' @details A principal component analysis (PCA) can be computed for any matrix, also a pan-matrix. 12 | #' The principal components will in this case be linear combinations of the gene clusters. One major 13 | #' idea behind PCA is to truncate the space, e.g. instead of considering the genomes as points in a 14 | #' high-dimensional space spanned by all gene clusters, we look for a few \sQuote{smart} combinations 15 | #' of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions. 16 | #' 17 | #' The \samp{scale} can be used to control how copy number differences play a role in the PCA. Usually 18 | #' we assume that going from 0 to 1 copy of a gene is the big change of the genome, and going from 1 to 19 | #' 2 (or more) copies is less. Prior to computing the PCA, the \samp{pan.matrix} is transformed according 20 | #' to the following affine mapping: If the original value in \samp{pan.matrix} is \samp{x}, and \samp{x} 21 | #' is not 0, then the transformed value is \samp{1 + (x-1)*scale}. Note that with \samp{scale=0.0} 22 | #' (default) this will result in 1 regardless of how large \samp{x} was. In this case the PCA only 23 | #' distinguish between presence and absence of gene clusters. If \samp{scale=1.0} the value \samp{x} is 24 | #' left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between 25 | #' 1 copy and 0 copies. For any \samp{scale} between 0.0 and 1.0 the transformed value is shrunk towards 26 | #' 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA 27 | #' should be affected, and to what degree, by differences in copy numbers beyond 1. 28 | #' 29 | #' The PCA may also up- or downweight some clusters compared to others. The vector \samp{weights} must 30 | #' contain one value for each column in \samp{pan.matrix}. The default is to use flat weights, i.e. all 31 | #' clusters count equal. See \code{\link{geneWeights}} for alternative weighting strategies. 32 | #' 33 | #' @return A \code{list} with three tables: 34 | #' 35 | #' \samp{Evar.tbl} has two columns, one listing the component number and one listing the relative 36 | #' explained variance for each component. The relative explained variance always sums to 1.0 over 37 | #' all components. This value indicates the importance of each component, and it is always in 38 | #' descending order, the first component being the most important. 39 | #' This is typically the first result you look at after a PCA has been computed, as it indicates 40 | #' how many components (directions) you need to capture the bulk of the total variation in the data. 41 | #' 42 | #' \samp{Scores.tbl} has a column listing the \samp{GID.tag} for each genome, and then one column for each 43 | #' principal component. The columns are ordered corresponding to the elements in \samp{Evar}. The 44 | #' scores are the coordinates of each genome in the principal component space. 45 | #' 46 | #' \samp{Loadings.tbl} is similar to \samp{Scores.tbl} but contain values for each gene cluster 47 | #' instead of each genome. The columns are ordered corresponding to the elements in \samp{Evar}. 48 | #' The loadings are the contributions from each gene cluster to the principal component directions. 49 | #' NOTE: Only gene clusters having a non-zero variance is used in a PCA. Gene clusters with the 50 | #' same value for every genome have no impact and are discarded from the \samp{Loadings}. 51 | #' 52 | #' @author Lars Snipen and Kristian Hovde Liland. 53 | #' 54 | #' @seealso \code{\link{distManhattan}}, \code{\link{geneWeights}}. 55 | #' 56 | #' @examples 57 | #' # Loading a pan-matrix in this package 58 | #' data(xmpl.panmat) 59 | #' 60 | #' # Computing panPca 61 | #' ppca <- panPca(xmpl.panmat) 62 | #' 63 | #' \dontrun{ 64 | #' # Plotting explained variance 65 | #' library(ggplot2) 66 | #' ggplot(ppca$Evar.tbl) + 67 | #' geom_col(aes(x = Component, y = Explained.variance)) 68 | #' # Plotting scores 69 | #' ggplot(ppca$Scores.tbl) + 70 | #' geom_text(aes(x = PC1, y = PC2, label = GID.tag)) 71 | #' # Plotting loadings 72 | #' ggplot(ppca$Loadings.tbl) + 73 | #' geom_text(aes(x = PC1, y = PC2, label = Cluster)) 74 | #' } 75 | #' 76 | #' @importFrom tibble as_tibble tibble 77 | #' 78 | #' @export panPca 79 | #' 80 | panPca <- function(pan.matrix, scale = 0.0, weights = rep(1, ncol(pan.matrix))){ 81 | if((scale > 1) | (scale < 0)){ 82 | warning("scale should be between 0.0 and 1.0, using scale=0.0") 83 | scale <- 0.0 84 | } 85 | idx <- which(pan.matrix > 0, arr.ind = T) 86 | pan.matrix[idx] <- 1 + (pan.matrix[idx] - 1) * scale 87 | pan.matrix <- t(t(pan.matrix) * weights) 88 | X <- pan.matrix[,which(apply(pan.matrix, 2, sd) > 0)] 89 | pca <- prcomp(X) 90 | pca.lst <- list(Evar.tbl = tibble(Component = 1:length(pca$sdev), 91 | Explained.variance = pca$sdev^2/sum(pca$sdev^2)), 92 | Scores.tbl = as_tibble(pca$x, rownames = "GID.tag"), 93 | Loadings.tbl = as_tibble(pca$rotation, rownames = "Cluster")) 94 | return(pca.lst) 95 | } 96 | 97 | -------------------------------------------------------------------------------- /R/panprep.R: -------------------------------------------------------------------------------- 1 | #' @name panPrep 2 | #' @title Preparing FASTA files for pan-genomics 3 | #' 4 | #' @description Preparing a FASTA file before starting comparisons of sequences. 5 | #' 6 | #' @usage panPrep(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = "") 7 | #' 8 | #' @param in.file The name of a FASTA formatted file with protein or nucleotide sequences for coding 9 | #' genes in a genome. 10 | #' @param genome_id The Genome Identifier, see below. 11 | #' @param out.file Name of file where the prepared sequences will be written. 12 | #' @param protein Logical, indicating if the \samp{in.file} contains protein (\code{TRUE}) or 13 | #' nucleotide (\code{FALSE}) sequences. 14 | #' @param min.length Minimum sequence length 15 | #' @param discard A text, a regular expression, and sequences having a match against this in their 16 | #' headerline will be discarded. 17 | #' 18 | #' @details This function will read the \code{in.file} and produce another, slightly modified, FASTA file 19 | #' which is prepared for the comparisons using \code{\link{blastpAllAll}}, \code{\link{hmmerScan}} 20 | #' or any other method. 21 | #' 22 | #' The main purpose of \code{\link{panPrep}} is to make certain every sequence is labeled with a tag 23 | #' called a \samp{genome_id} identifying the genome from which it comes. This text contains the text 24 | #' \dQuote{GID} followed by an integer. This integer can be any integer as long as it is unique to every 25 | #' genome in the study. If a genome has the text \dQuote{GID12345} as identifier, then the 26 | #' sequences in the file produced by \code{\link{panPrep}} will have headerlines starting with 27 | #' \dQuote{GID12345_seq1}, \dQuote{GID12345_seq2}, \dQuote{GID12345_seq3}...etc. This makes it possible 28 | #' to quickly identify which genome every sequence belongs to. 29 | #' 30 | #' The \samp{genome_id} is also added to the file name specified in \samp{out.file}. For this reason the 31 | #' \samp{out.file} must have a file extension containing letters only. By convention, we expect FASTA 32 | #' files to have one of the extensions \samp{.fsa}, \samp{.faa}, \samp{.fa} or \samp{.fasta}. 33 | #' 34 | #' \code{\link{panPrep}} will also remove sequences shorter than \code{min.length}, removing stop codon 35 | #' symbols (\samp{*}), replacing alien characters with \samp{X} and converting all sequences to upper-case. 36 | #' If the input \samp{discard} contains a regular expression, any sequences having a match to this in their 37 | #' headerline are also removed. Example: If we use the \code{prodigal} software (see \code{\link[microseq]{findGenes}}) 38 | #' to find proteins in a genome, partially predicted genes will have the text \samp{partial=10} or 39 | #' \samp{partial=01} in their headerline. Using \samp{discard= "partial=01|partial=10"} will remove 40 | #' these from the data set. 41 | #' 42 | #' @return This function produces a FASTA formatted sequence file, and returns the name of this file. 43 | #' 44 | #' @author Lars Snipen and Kristian Liland. 45 | #' 46 | #' @seealso \code{\link{hmmerScan}}, \code{\link{blastpAllAll}}. 47 | #' 48 | #' @examples 49 | #' # Using a protein file in this package 50 | #' # We need to uncompress it first... 51 | #' pf <- file.path(path.package("micropan"),"extdata","xmpl.faa.xz") 52 | #' prot.file <- tempfile(fileext = ".xz") 53 | #' ok <- file.copy(from = pf, to = prot.file) 54 | #' prot.file <- xzuncompress(prot.file) 55 | #' 56 | #' # Prepping it, using the genome_id "GID123" 57 | #' prepped.file <- panPrep(prot.file, genome_id = "GID123", out.file = tempfile(fileext = ".faa")) 58 | #' 59 | #' # Reading the prepped file 60 | #' prepped <- readFasta(prepped.file) 61 | #' head(prepped) 62 | #' 63 | #' # ...and cleaning... 64 | #' ok <- file.remove(prot.file, prepped.file) 65 | #' 66 | #' @importFrom microseq readFasta writeFasta 67 | #' @importFrom dplyr mutate filter %>% n 68 | #' @importFrom stringr str_remove_all str_length str_c str_extract str_replace 69 | #' @importFrom rlang .data 70 | #' 71 | #' @export panPrep 72 | #' 73 | panPrep <- function(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = ""){ 74 | if(protein){ 75 | alien <- "[^ARNDCQEGHILKMFPSTWYV]" 76 | } else { 77 | alien <- "[^ACGT]" 78 | } 79 | readFasta(normalizePath(in.file)) %>% 80 | mutate(Sequence = toupper(.data$Sequence)) %>% 81 | mutate(Sequence = str_remove_all(.data$Sequence, "\\*")) %>% 82 | mutate(Length = str_length(.data$Sequence)) %>% 83 | filter(.data$Length >= min.length) %>% 84 | mutate(Sequence = str_replace_all(.data$Sequence, pattern = alien, "X")) %>% 85 | mutate(Header = str_c(genome_id, "_seq", 1:n(), " ", .data$Header)) -> fdta 86 | if(str_length(discard) > 0){ 87 | fdta %>% 88 | filter(!str_detect(.data$Header, pattern = discard)) -> fdta 89 | } 90 | out.file <- file.path(normalizePath(dirname(out.file)), 91 | basename(out.file)) 92 | fext <- str_extract(out.file, "\\.[a-zA-Z]+$") 93 | out.file <- str_replace(out.file, str_c(fext, "$"), str_c("_", genome_id, fext)) 94 | writeFasta(fdta, out.file = out.file) 95 | return(out.file) 96 | } -------------------------------------------------------------------------------- /R/powerlaw.R: -------------------------------------------------------------------------------- 1 | #' @name chao 2 | #' @title The Chao lower bound estimate of pan-genome size 3 | #' 4 | #' @description Computes the Chao lower bound estimated number of gene clusters in a pan-genome. 5 | #' 6 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 7 | #' 8 | #' @details The size of a pan-genome is the number of gene clusters in it, both those observed and those 9 | #' not yet observed. 10 | #' 11 | #' The input \samp{pan.matrix} is a a matrix with one row for each 12 | #' genome and one column for each observed gene cluster in the pan-genome. See \code{\link{panMatrix}} 13 | #' for how to construct this. 14 | #' 15 | #' The number of observed gene clusters is simply the number of columns in \samp{pan.matrix}. The 16 | #' number of gene clusters not yet observed is estimated by the Chao lower bound estimator (Chao, 1987). 17 | #' This is based solely on the number of clusters observed in 1 and 2 genomes. It is a very simple and 18 | #' conservative estimator, i.e. it is more likely to be too small than too large. 19 | #' 20 | #' @return The function returns an integer, the estimated pan-genome size. This includes both the number 21 | #' of gene clusters observed so far, as well as the estimated number not yet seen. 22 | #' 23 | #' @references Chao, A. (1987). Estimating the population size for capture-recapture data with unequal 24 | #' catchability. Biometrics, 43:783-791. 25 | #' 26 | #' @author Lars Snipen and Kristian Hovde Liland. 27 | #' 28 | #' @seealso \code{\link{panMatrix}}, \code{\link{binomixEstimate}}. 29 | #' 30 | #' @examples 31 | #' # Loading a pan-matrix in this package 32 | #' data(xmpl.panmat) 33 | #' 34 | #' # Estimating the pan-genome size using the Chao estimator 35 | #' chao.pansize <- chao(xmpl.panmat) 36 | #' 37 | #' @export chao 38 | #' 39 | chao <- function(pan.matrix){ 40 | y <- table(factor(colSums(pan.matrix > 0), levels = 1:nrow(pan.matrix))) 41 | if(y[2] == 0){ 42 | stop( "Cannot compute Chao estimate since there are 0 gene clusters observed in 2 genomes!\n" ) 43 | } else { 44 | pan.size <- round(sum(y) + y[1]^2/(2*y[2])) 45 | names(pan.size) <- NULL 46 | return(pan.size) 47 | } 48 | } 49 | 50 | 51 | #' @name heaps 52 | #' @title Heaps law estimate 53 | #' 54 | #' @description Estimating if a pan-genome is open or closed based on a Heaps law model. 55 | #' 56 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 57 | #' @param n.perm The number of random permutations of genome ordering. 58 | #' 59 | #' @details An open pan-genome means there will always be new gene clusters observed as long as new genomes 60 | #' are being sequenced. This may sound controversial, but in a pragmatic view, an open pan-genome indicates 61 | #' that the number of new gene clusters to be observed in future genomes is \sQuote{large} (but not literally 62 | #' infinite). Opposite, a closed pan-genome indicates we are approaching the end of new gene clusters. 63 | #' 64 | #' This function is based on a Heaps law approach suggested by Tettelin et al (2008). The Heaps law model 65 | #' is fitted to the number of new gene clusters observed when genomes are ordered in a random way. The model 66 | #' has two parameters, an intercept and a decay parameter called \samp{alpha}. If \samp{alpha>1.0} the 67 | #' pan-genome is closed, if \samp{alpha<1.0} it is open. 68 | #' 69 | #' The number of permutations, \samp{n.perm}, should be as large as possible, limited by computation time. 70 | #' The default value of 100 is certainly a minimum. 71 | #' 72 | #' Word of caution: The Heaps law assumes independent sampling. If some of the genomes in the data set 73 | #' form distinct sub-groups in the population, this may affect the results of this analysis severely. 74 | #' 75 | #' @return A vector of two estimated parameters: The \samp{Intercept} and the decay parameter \samp{alpha}. 76 | #' If \samp{alpha<1.0} the pan-genome is open, if \samp{alpha>1.0} it is closed. 77 | #' 78 | #' @references Tettelin, H., Riley, D., Cattuto, C., Medini, D. (2008). Comparative genomics: the 79 | #' bacterial pan-genome. Current Opinions in Microbiology, 12:472-477. 80 | #' 81 | #' @author Lars Snipen and Kristian Hovde Liland. 82 | #' 83 | #' @seealso \code{\link{binomixEstimate}}, \code{\link{chao}}, \code{\link{rarefaction}}. 84 | #' 85 | #' @examples 86 | #' # Loading a pan-matrix in this package 87 | #' data(xmpl.panmat) 88 | #' 89 | #' # Estimating population openness 90 | #' h.est <- heaps(xmpl.panmat, n.perm = 500) 91 | #' print(h.est) 92 | #' # If alpha < 1 it indicates an open pan-genome 93 | #' 94 | #' @importFrom stats optim 95 | #' 96 | #' @export 97 | heaps <- function(pan.matrix, n.perm = 100){ 98 | pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1 99 | ng <- dim(pan.matrix)[1] 100 | nmat <- matrix(0, nrow = nrow(pan.matrix) - 1, ncol = n.perm) 101 | for(i in 1:n.perm){ 102 | cm <- apply(pan.matrix[sample(nrow(pan.matrix)),], 2, cumsum) 103 | nmat[,i] <- rowSums((cm == 1)[2:ng,] & (cm == 0)[1:(ng-1),]) 104 | cat(i, "/", n.perm, "\r") 105 | } 106 | x <- rep((2:nrow(pan.matrix)), times = n.perm) 107 | y <- as.numeric(nmat) 108 | p0 <- c(mean(y[which(x == 2)] ), 1) 109 | fit <- optim(p0, objectFun, gr = NULL, x, y, method = "L-BFGS-B", lower = c(0, 0), upper = c(10000, 2)) 110 | p.hat <- fit$par 111 | names(p.hat) <- c("Intercept", "alpha") 112 | return(p.hat) 113 | } 114 | 115 | objectFun <- function(p, x, y){ 116 | y.hat <- p[1] * x^(-p[2]) 117 | J <- sqrt(sum((y - y.hat)^2))/length(x) 118 | return(J) 119 | } 120 | -------------------------------------------------------------------------------- /R/rarefaction.R: -------------------------------------------------------------------------------- 1 | #' @name rarefaction 2 | #' @title Rarefaction curves for a pan-genome 3 | #' 4 | #' @description Computes rarefaction curves for a number of random permutations of genomes. 5 | #' 6 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details. 7 | #' @param n.perm The number of random genome orderings to use. If \samp{n.perm=1} the fixed order of 8 | #' the genomes in \samp{pan.matrix} is used. 9 | #' 10 | #' @details A rarefaction curve is simply the cumulative number of unique gene clusters we observe as 11 | #' more and more genomes are being considered. The shape of this curve will depend on the order of the 12 | #' genomes. This function will typically compute rarefaction curves for a number of (\samp{n.perm}) 13 | #' orderings. By using a large number of permutations, and then averaging over the results, the effect 14 | #' of any particular ordering is smoothed. 15 | #' 16 | #' The averaged curve illustrates how many new gene clusters we observe for each new genome. If this 17 | #' levels out and becomes flat, it means we expect few, if any, new gene clusters by sequencing more 18 | #' genomes. The function \code{\link{heaps}} can be used to estimate population openness based on this 19 | #' principle. 20 | #' 21 | #' @return A table with the curves in the columns. The first column is the number of genomes, while 22 | #' all other columns are the cumulative number of clusters, one column for each permutation. 23 | #' 24 | #' @author Lars Snipen and Kristian Hovde Liland. 25 | #' 26 | #' @seealso \code{\link{heaps}}, \code{\link{panMatrix}}. 27 | #' 28 | #' @examples 29 | #' # Loading a pan-matrix in this package 30 | #' data(xmpl.panmat) 31 | #' 32 | #' # Rarefaction 33 | #' rar.tbl <- rarefaction(xmpl.panmat, n.perm = 1000) 34 | #' 35 | #' \dontrun{ 36 | #' # Plotting 37 | #' library(ggplot2) 38 | #' library(tidyr) 39 | #' rar.tbl %>% 40 | #' gather(key = "Permutation", value = "Clusters", -Genome) %>% 41 | #' ggplot(aes(x = Genome, y = Clusters, group = Permutation)) + 42 | #' geom_line() 43 | #' } 44 | #' 45 | #' @importFrom dplyr %>% bind_cols 46 | #' @importFrom tibble tibble as_tibble 47 | #' 48 | #' @export rarefaction 49 | #' 50 | rarefaction <- function(pan.matrix, n.perm = 1){ 51 | pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1 52 | nmat <- matrix(0, nrow = nrow(pan.matrix), ncol = n.perm) 53 | cm <- apply(pan.matrix, 2, cumsum) 54 | nmat[,1] <- rowSums(cm > 0) 55 | if(n.perm > 1){ 56 | for(i in 2:n.perm){ 57 | cm <- apply(pan.matrix[sample(nrow(pan.matrix)),], 2, cumsum) 58 | nmat[,i] <- rowSums(cm > 0) 59 | cat(i, "/", n.perm, "\r") 60 | } 61 | } 62 | nmat <- rbind(rep(0, ncol(nmat)), nmat) 63 | tibble(Genomes = 0:nrow(pan.matrix)) %>% 64 | bind_cols(as_tibble(nmat, .name_repair = "minimal")) -> rtbl 65 | colnames(rtbl) <- c("Genome", str_c("Perm", 1:n.perm)) 66 | return(rtbl) 67 | } 68 | 69 | -------------------------------------------------------------------------------- /R/xmpl.R: -------------------------------------------------------------------------------- 1 | #' @name xmpl 2 | #' @aliases xmpl xmpl.bdist xmpl.bclst xmpl.panmat 3 | #' @docType data 4 | #' @title Data sets for use in examples 5 | #' 6 | #' @description This data set contains several files with various objects used in examples 7 | #' in some of the functions in the \code{micropan} package. 8 | #' 9 | #' @usage 10 | #' data(xmpl.bdist) 11 | #' data(xmpl.bclst) 12 | #' data(xmpl.panmat) 13 | #' 14 | #' @details 15 | #' \samp{xmpl.bdist} is a \code{tibble} with 4 columns holding all 16 | #' BLAST distances between pairs of proteins in an example with 10 small genomes. 17 | #' 18 | #' \samp{xmpl.bclst} is a clustering vector of all proteins in the 19 | #' genomes from \samp{xmpl.bdist}. 20 | #' 21 | #' \samp{xmpl.panmat} is a pan-matrix with 10 rows and 1210 columns 22 | #' computed from \samp{xmpl.bclst}. 23 | #' 24 | #' @author Lars Snipen and Kristian Hovde Liland. 25 | #' 26 | #' @examples 27 | #' 28 | #' # BLAST distances, only the first 20 are displayed 29 | #' data(xmpl.bdist) 30 | #' head(xmpl.bdist) 31 | #' 32 | #' # Clustering vector 33 | #' data(xmpl.bclst) 34 | #' print(xmpl.bclst[1:30]) 35 | #' 36 | #' # Pan-matrix 37 | #' data(xmpl.panmat) 38 | #' head(xmpl.panmat) 39 | #' 40 | NULL -------------------------------------------------------------------------------- /R/xz.R: -------------------------------------------------------------------------------- 1 | #' @rdname xz 2 | #' @name xzcompress 3 | #' @title Compressing and uncompressing text files 4 | #' 5 | #' @description These functions are adapted from the \code{R.utils} package from gzip to xz. Internally 6 | #' \code{xzfile()} (see connections) is used to read (write) chunks to (from) the xz file. If the 7 | #' process is interrupted before completed, the partially written output file is automatically removed. 8 | #' 9 | #' @param filename Path name of input file. 10 | #' @param destname Pathname of output file. 11 | #' @param temporary If TRUE, the output file is created in a temporary directory. 12 | #' @param skip If TRUE and the output file already exists, the output file is returned as is. 13 | #' @param overwrite If TRUE and the output file already exists, the file is silently overwritting, 14 | #' otherwise an exception is thrown (unless skip is TRUE). 15 | #' @param remove If TRUE, the input file is removed afterward, otherwise not. 16 | #' @param BFR.SIZE The number of bytes read in each chunk. 17 | #' @param compression The compression level used (1-9). 18 | #' @param ... Not used. 19 | #' 20 | #' @return Returns the pathname of the output file. The number of bytes processed is returned as an attribute. 21 | #' 22 | #' @author Kristian Hovde Liland. 23 | #' 24 | #' @examples 25 | #' # Creating small file 26 | #' tf <- tempfile() 27 | #' cat(file=tf, "Hello world!") 28 | #' 29 | #' # Compressing 30 | #' tf.xz <- xzcompress(tf) 31 | #' print(file.info(tf.xz)) 32 | #' 33 | #' # Uncompressing 34 | #' tf <- xzuncompress(tf.xz) 35 | #' print(file.info(tf)) 36 | #' file.remove(tf) 37 | #' 38 | #' @export xzcompress 39 | #' 40 | xzcompress <- function(filename, destname = sprintf("%s.xz", filename), temporary = FALSE, 41 | skip = FALSE, overwrite = FALSE, remove = TRUE, BFR.SIZE = 1e+07, compression = 6, 42 | ...){ 43 | if(!file.exists(filename)){ 44 | stop("No such file: ", filename) 45 | } 46 | if(temporary){ 47 | destname <- file.path(tempdir(), basename(destname)) 48 | } 49 | attr(destname, "temporary") <- temporary 50 | if(filename == destname) 51 | stop(sprintf("Argument 'filename' and 'destname' are identical: %s", filename)) 52 | if(file.exists(destname)){ 53 | if(skip){ 54 | return(destname) 55 | } else if(!overwrite){ 56 | stop(sprintf("File already exists: %s", destname)) 57 | } 58 | } 59 | destpath <- dirname(destname) 60 | if(!file.info(destpath)$isdir) 61 | dir.create(destpath) 62 | inn <- file(filename, open = "rb") 63 | on.exit(if(!is.null(inn)) close(inn)) 64 | outComplete <- FALSE 65 | out <- xzfile(destname, open = "wb", compression = compression, ...) 66 | on.exit({ 67 | close(out) 68 | if(!outComplete){ 69 | file.remove(destname) 70 | } 71 | }, add = TRUE) 72 | nbytes <- 0L 73 | repeat{ 74 | bfr <- readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) 75 | n <- length(bfr) 76 | if(n == 0L) 77 | break 78 | nbytes <- nbytes + n 79 | writeBin(bfr, con = out, size = 1L) 80 | bfr <- NULL 81 | } 82 | outComplete <- TRUE 83 | if(remove){ 84 | close(inn) 85 | inn <- NULL 86 | file.remove(filename) 87 | } 88 | attr(destname, "nbrOfBytes") <- nbytes 89 | invisible(destname) 90 | } 91 | #' @rdname xz 92 | #' @export xzuncompress 93 | xzuncompress <- function(filename, destname = gsub("[.]xz$", "", filename, ignore.case = TRUE), 94 | temporary = FALSE, skip = FALSE, overwrite = FALSE, remove = TRUE, 95 | BFR.SIZE = 1e+07, ...){ 96 | if(!file.exists(filename)){ 97 | stop("No such file: ", filename) 98 | } 99 | if(temporary){ 100 | destname <- file.path(tempdir(), basename(destname)) 101 | } 102 | attr(destname, "temporary") <- temporary 103 | if(filename == destname){ 104 | stop(sprintf("Argument 'filename' and 'destname' are identical: %s", filename)) 105 | } 106 | if(file.exists(destname)){ 107 | if(skip){ 108 | return(destname) 109 | } else if(!overwrite){ 110 | stop(sprintf("File already exists: %s", destname)) 111 | } 112 | } 113 | destpath <- dirname(destname) 114 | if(!file.info(destpath)$isdir) 115 | dir.create(destpath) 116 | inn <- xzfile(filename, open = "rb") 117 | on.exit(if(!is.null(inn)) close(inn)) 118 | outComplete <- FALSE 119 | out <- file(destname, open = "wb") 120 | on.exit({ 121 | close(out) 122 | if (!outComplete) { 123 | file.remove(destname) 124 | } 125 | }, add = TRUE) 126 | nbytes <- 0L 127 | repeat{ 128 | bfr <- readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) 129 | n <- length(bfr) 130 | if(n == 0L) 131 | break 132 | nbytes <- nbytes + n 133 | writeBin(bfr, con = out, size = 1L) 134 | bfr <- NULL 135 | } 136 | outComplete <- TRUE 137 | if(remove){ 138 | close(inn) 139 | inn <- NULL 140 | file.remove(filename) 141 | } 142 | attr(destname, "nbrOfBytes") <- nbytes 143 | invisible(destname) 144 | } 145 | -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-15-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-15-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-20-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-20-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-21-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-21-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-26-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-26-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-27-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-27-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-28-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-28-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-29-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-29-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-30-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-30-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-31-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-31-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-32-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-32-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /Readme_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-18-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-18-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-20-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-20-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-21-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-21-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-25-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-25-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-26-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-26-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-27-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-27-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-28-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-28-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-29-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-29-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-30-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-30-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-31-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-31-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-32-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-32-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /Readme_files/figure-markdown_github/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /data/xmpl.bclst.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/data/xmpl.bclst.rda -------------------------------------------------------------------------------- /data/xmpl.bdist.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/data/xmpl.bdist.rda -------------------------------------------------------------------------------- /data/xmpl.panmat.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/data/xmpl.panmat.rda -------------------------------------------------------------------------------- /inst/extdata/GID1_vs_GID1_.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID1_vs_GID1_.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID1_vs_microfam.hmm.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID1_vs_microfam.hmm.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID2_vs_GID1_.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID2_vs_GID1_.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID2_vs_GID2_.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID2_vs_GID2_.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID2_vs_microfam.hmm.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID2_vs_microfam.hmm.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID3_vs_GID1_.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_GID1_.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID3_vs_GID2_.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_GID2_.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID3_vs_GID3_.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_GID3_.txt.xz -------------------------------------------------------------------------------- /inst/extdata/GID3_vs_microfam.hmm.txt.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_microfam.hmm.txt.xz -------------------------------------------------------------------------------- /inst/extdata/microfam.hmm.h3f.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3f.xz -------------------------------------------------------------------------------- /inst/extdata/microfam.hmm.h3i.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3i.xz -------------------------------------------------------------------------------- /inst/extdata/microfam.hmm.h3m.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3m.xz -------------------------------------------------------------------------------- /inst/extdata/microfam.hmm.h3p.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3p.xz -------------------------------------------------------------------------------- /inst/extdata/xmpl.faa.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl.faa.xz -------------------------------------------------------------------------------- /inst/extdata/xmpl_GID1.faa.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl_GID1.faa.xz -------------------------------------------------------------------------------- /inst/extdata/xmpl_GID2.faa.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl_GID2.faa.xz -------------------------------------------------------------------------------- /inst/extdata/xmpl_GID3.faa.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl_GID3.faa.xz -------------------------------------------------------------------------------- /man/bClust.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/bclust.R 3 | \name{bClust} 4 | \alias{bClust} 5 | \title{Clustering sequences based on pairwise distances} 6 | \usage{ 7 | bClust(dist.tbl, linkage = "complete", threshold = 0.75, verbose = TRUE) 8 | } 9 | \arguments{ 10 | \item{dist.tbl}{A \code{tibble} with pairwise distances.} 11 | 12 | \item{linkage}{A text indicating what type of clustering to perform, either \samp{complete} (default), 13 | \samp{average} or \samp{single}.} 14 | 15 | \item{threshold}{Specifies the maximum size of a cluster. Must be a distance, i.e. a number between 16 | 0.0 and 1.0.} 17 | 18 | \item{verbose}{Logical, turns on/off text output during computations.} 19 | } 20 | \value{ 21 | The function returns a vector of integers, indicating the cluster membership of every unique 22 | sequence from the \samp{Query} or \samp{Hit} columns of the input \samp{dist.tbl}. The name 23 | of each element indicates the sequence. The numerical values have no meaning as such, they are simply 24 | categorical indicators of cluster membership. 25 | } 26 | \description{ 27 | Sequences are clustered by hierarchical clustering based on a set of pariwise distances. 28 | The distances must take values between 0.0 and 1.0, and all pairs \emph{not} listed are assumed to 29 | have distance 1.0. 30 | } 31 | \details{ 32 | Computing clusters (gene families) is an essential step in many comparative studies. 33 | \code{\link{bClust}} will assign sequences into gene families by a hierarchical clustering approach. 34 | Since the number of sequences may be huge, a full all-against-all distance matrix will be impossible 35 | to handle in memory. However, most sequence pairs will have an \sQuote{infinite} distance between them, 36 | and only the pairs with a finite (smallish) distance need to be considered. 37 | 38 | This function takes as input the distances in \code{dist.tbl} where only the relevant distances are 39 | listed. The columns \samp{Query} and \samp{Hit} contain tags identifying pairs of sequences. The column 40 | \samp{Distance} contains the distances, always a number from 0.0 to 1.0. Typically, this is the output 41 | from \code{\link{bDist}}. All pairs of sequences \emph{not} listed are assumed to have distance 1.0, 42 | which is considered the \sQuote{infinite} distance. 43 | All sequences must be listed at least once in ceither column \samp{Query} or \samp{Hit} of the \code{dist.tbl}. 44 | This should pose no problem, since all sequences must have distance 0.0 to themselves, and should be listed 45 | with this distance once (\samp{Query} and \samp{Hit} containing the same tag). 46 | 47 | The \samp{linkage} defines the type of clusters produced. The \samp{threshold} indicates the size of 48 | the clusters. A \samp{single} linkage clustering means all members of a cluster have at least one other 49 | member of the same cluster within distance \samp{threshold} from itself. An \samp{average} linkage means 50 | all members of a cluster are within the distance \samp{threshold} from the center of the cluster. A 51 | \samp{complete} linkage means all members of a cluster are no more than the distance \samp{threshold} 52 | away from any other member of the same cluster. 53 | 54 | Typically, \samp{single} linkage produces big clusters where members may differ a lot, since they are 55 | only required to be close to something, which is close to something,...,which is close to some other 56 | member. On the other extreme, \samp{complete} linkage will produce small and tight clusters, since all 57 | must be similar to all. The \samp{average} linkage is between, but closer to \samp{complete} linkage. If 58 | you want the \samp{threshold} to specify directly the maximum distance tolerated between two members of 59 | the same gene family, you must use \samp{complete} linkage. The \samp{single} linkage is the fastest 60 | alternative to compute. Using the default setting of \samp{single} linkage and maximum \samp{threshold} 61 | (1.0) will produce the largest and fewest clusters possible. 62 | } 63 | \examples{ 64 | # Loading example BLAST distances 65 | data(xmpl.bdist) 66 | 67 | # Clustering with default settings 68 | clst <- bClust(xmpl.bdist) 69 | # Other settings, and verbose 70 | clst <- bClust(xmpl.bdist, linkage = "average", threshold = 0.5, verbose = TRUE) 71 | 72 | } 73 | \seealso{ 74 | \code{\link{bDist}}, \code{\link{hclust}}, \code{\link{dClust}}, \code{\link{isOrtholog}}. 75 | } 76 | \author{ 77 | Lars Snipen and Kristian Hovde Liland. 78 | } 79 | -------------------------------------------------------------------------------- /man/bDist.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/bdist.R 3 | \name{bDist} 4 | \alias{bDist} 5 | \title{Computes distances between sequences} 6 | \usage{ 7 | bDist(blast.files = NULL, blast.tbl = NULL, e.value = 1, verbose = TRUE) 8 | } 9 | \arguments{ 10 | \item{blast.files}{A text vector of BLAST result filenames.} 11 | 12 | \item{blast.tbl}{A table with BLAST results.} 13 | 14 | \item{e.value}{A threshold E-value to immediately discard (very) poor BLAST alignments.} 15 | 16 | \item{verbose}{Logical, indicating if textual output should be given to monitor the progress.} 17 | } 18 | \value{ 19 | The function returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore} 20 | and \samp{Distance}. Each row corresponds to a pair of sequences (Dbase and Query sequences) having at least 21 | one BLAST hit between 22 | them. All pairs \emph{not} listed in the output have distance 1.0 between them. 23 | } 24 | \description{ 25 | Computes distance between all sequences based on the BLAST bit-scores. 26 | } 27 | \details{ 28 | The essential input is either a vector of BLAST result filenames (\code{blast.files}) or a 29 | table of the BLAST results (\code{blast.tbl}). It is no point in providing both, then \code{blast.tbl} is ignored. 30 | 31 | For normal sized data sets (e.g. less than 100 genomes), you would provide the BLAST filenames as the argument 32 | \code{blast.files} to this function. 33 | Then results are read, and distances are computed. Only if you have huge data sets, you may find it more efficient to 34 | read the files using \code{\link{readBlastSelf}} and \code{\link{readBlastPair}} separately, and then provide as the 35 | argument \code{blast.tbl]} the table you get from binding these results. In all cases, the BLAST result files must 36 | have been produced by \code{\link{blastpAllAll}}. 37 | 38 | Setting a small \samp{e.value} threshold can speed up the computation and reduce the size of the 39 | output, but you may loose some alignments that could produce smallish distances for short sequences. 40 | 41 | The distance computed is based on alignment bitscores. Assume the alignment of query A against hit B 42 | has a bitscore of S(A,B). The distance is D(A,B)=1-2*S(A,B)/(S(A,A)+S(B,B)) where S(A,A) and S(B,B) are 43 | the self-alignment bitscores, i.e. the scores of aligning against itself. A distance of 44 | 0.0 means A and B are identical. The maximum possible distance is 1.0, meaning there is no BLAST between A and B. 45 | 46 | This distance should not be interpreted as lack of identity! A distance of 0.0 means 100\% identity, 47 | but a distance of 0.25 does \emph{not} mean 75\% identity. It has some resemblance to an evolutinary 48 | (raw) distance, but since it is based on protein alignments, the type of mutations plays a significant 49 | role, not only the number of mutations. 50 | } 51 | \examples{ 52 | # Using BLAST result files in this package... 53 | prefix <- c("GID1_vs_GID1_", 54 | "GID2_vs_GID1_", 55 | "GID3_vs_GID1_", 56 | "GID2_vs_GID2_", 57 | "GID3_vs_GID2_", 58 | "GID3_vs_GID3_") 59 | bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz")) 60 | 61 | # We need to uncompress them first... 62 | blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz") 63 | ok <- file.copy(from = bf, to = blast.files) 64 | blast.files <- unlist(lapply(blast.files, xzuncompress)) 65 | 66 | # Computing pairwise distances 67 | blast.dist <- bDist(blast.files) 68 | 69 | # Read files separately, then use bDist 70 | self.tbl <- readBlastSelf(blast.files) 71 | pair.tbl <- readBlastPair(blast.files) 72 | blast.dist <- bDist(blast.tbl = bind_rows(self.tbl, pair.tbl)) 73 | 74 | # ...and cleaning... 75 | ok <- file.remove(blast.files) 76 | 77 | # See also example for blastpAl 78 | 79 | } 80 | \seealso{ 81 | \code{\link{blastpAllAll}}, \code{\link{readBlastSelf}}, \code{\link{readBlastPair}}, 82 | \code{\link{bClust}}, \code{\link{isOrtholog}}. 83 | } 84 | \author{ 85 | Lars Snipen and Kristian Hovde Liland. 86 | } 87 | -------------------------------------------------------------------------------- /man/binomixEstimate.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/binomix.R 3 | \name{binomixEstimate} 4 | \alias{binomixEstimate} 5 | \title{Binomial mixture model estimates} 6 | \usage{ 7 | binomixEstimate( 8 | pan.matrix, 9 | K.range = 3:5, 10 | core.detect.prob = 1, 11 | verbose = TRUE 12 | ) 13 | } 14 | \arguments{ 15 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 16 | 17 | \item{K.range}{The range of model complexities to explore. The vector of integers specify the number 18 | of binomial densities to combine in the mixture models.} 19 | 20 | \item{core.detect.prob}{The detection probability of core genes. This should almost always be 1.0, 21 | since a core gene is by definition always present in all genomes, but can be set fractionally smaller.} 22 | 23 | \item{verbose}{Logical indicating if textual output should be given to monitor the progress of the 24 | computations.} 25 | } 26 | \value{ 27 | \code{\link{binomixEstimate}} returns a \code{list} with two components, the \samp{BIC.tbl} 28 | and \samp{Mix.tbl}. 29 | 30 | The \samp{BIC.tbl} is a \code{tibble} listing, in each row, the results for each number of components 31 | used, given by the input \samp{K.range}. The column \samp{Core.size} is the estimated number of 32 | core gene families, the column \samp{Pan.size} is the estimated pan-genome size. The column 33 | \samp{BIC} is the Bayesian Information Criterion (Schwarz, 1978) that should be used to choose the 34 | optimal component number (\samp{K}). The number of components where \samp{BIC} is minimized is the 35 | optimum. If minimum \samp{BIC} is reached for the largest \samp{K} value you should extend the 36 | \samp{K.range} to larger values and re-fit. The function will issue 37 | a \code{warning} to remind you of this. 38 | 39 | The \samp{Mix.tbl} is a \code{tibble} with estimates from the mixture models. The column \samp{Component} 40 | indicates the model, i.e. all rows where \samp{Component} has the same value are from the same model. 41 | There will be 3 rows for 3-component model, 4 rows for 4-component, etc. The column \samp{Detection.prob} 42 | contain the estimated detection probabilities for each component of the mixture models. A 43 | \samp{Mixing.proportion} is the proportion of the gene clusters having the corresponding \samp{Detection.prob}, 44 | i.e. if core genes have \samp{Detection.prob} 1.0, the corresponding \samp{Mixing.proportion} (same row) 45 | indicates how large fraction of the gene families are core genes. 46 | } 47 | \description{ 48 | Fits binomial mixture models to the data given as a pan-matrix. From the fitted models 49 | both estimates of pan-genome size and core-genome size are available. 50 | } 51 | \details{ 52 | A binomial mixture model can be used to describe the distribution of gene clusters across 53 | genomes in a pan-genome. The idea and the details of the computations are given in Hogg et al (2007), 54 | Snipen et al (2009) and Snipen & Ussery (2012). 55 | 56 | Central to the concept is the idea that every gene has a detection probability, i.e. a probability of 57 | being present in a genome. Genes who are always present in all genomes are called core genes, and these 58 | should have a detection probability of 1.0. Other genes are only present in a subset of the genomes, and 59 | these have smaller detection probabilities. Some genes are only present in one single genome, denoted 60 | ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated 61 | is large these latter must have a very small detection probability. 62 | 63 | A binomial mixture model with \samp{K} components estimates \samp{K} detection probabilities from the 64 | data. The more components you choose, the better you can fit the (present) data, at the cost of less 65 | precision in the estimates due to less degrees of freedom. \code{\link{binomixEstimate}} allows you to 66 | fit several models, and the input \samp{K.range} specifies which values of \samp{K} to try out. There no 67 | real point using \samp{K} less than 3, and the default is \samp{K.range=3:5}. In general, the more genomes 68 | you have the larger you can choose \samp{K} without overfitting. Computations will be slower for larger 69 | values of \samp{K}. In order to choose the optimal value for \samp{K}, \code{\link{binomixEstimate}} 70 | computes the BIC-criterion, see below. 71 | 72 | As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a 73 | \samp{K}-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet 74 | observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer 75 | core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene 76 | clusters, i.e. those still left after having observed \sQuote{infinite} many genomes. 77 | 78 | The detection probability of core genes should be 1.0, but can at times be set fractionally smaller. 79 | This means you accept that even core genes are not always detected in every genome, e.g. they may be 80 | there, but your gene prediction has missed them. Notice that setting the \samp{core.detect.prob} to less 81 | than 1.0 may affect the core gene size estimate dramatically. 82 | } 83 | \examples{ 84 | # Loading an example pan-matrix 85 | data(xmpl.panmat) 86 | 87 | # Estimating binomial mixture models 88 | binmix.lst <- binomixEstimate(xmpl.panmat, K.range = 3:8) 89 | print(binmix.lst$BIC.tbl) # minimum BIC at 3 components 90 | 91 | \dontrun{ 92 | # The pan-genome gene distribution as a pie-chart 93 | library(ggplot2) 94 | ncomp <- 3 95 | binmix.lst$Mix.tbl \%>\% 96 | filter(Components == ncomp) \%>\% 97 | ggplot() + 98 | geom_col(aes(x = "", y = Mixing.proportion, fill = Detection.prob)) + 99 | coord_polar(theta = "y") + 100 | labs(x = "", y = "", title = "Pan-genome gene distribution") + 101 | scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue")) 102 | 103 | # The distribution in an average genome 104 | binmix.lst$Mix.tbl \%>\% 105 | filter(Components == ncomp) \%>\% 106 | mutate(Single = Mixing.proportion * Detection.prob) \%>\% 107 | ggplot() + 108 | geom_col(aes(x = "", y = Single, fill = Detection.prob)) + 109 | coord_polar(theta = "y") + 110 | labs(x = "", y = "", title = "Average genome gene distribution") + 111 | scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue")) 112 | } 113 | 114 | } 115 | \references{ 116 | Hogg, J.S., Hu, F.Z, Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J.C., Ehrlich, G.D. (2007). 117 | Characterization and modeling of the Haemophilus influenzae core- and supra-genomes based on the 118 | complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biology, 8:R103. 119 | 120 | Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial 121 | mixture models. BMC Genomics, 10:385. 122 | 123 | Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to 124 | Escherichia coli. F1000 Research, 1:19. 125 | 126 | Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464. 127 | } 128 | \seealso{ 129 | \code{\link{panMatrix}}, \code{\link{chao}}. 130 | } 131 | \author{ 132 | Lars Snipen and Kristian Hovde Liland. 133 | } 134 | -------------------------------------------------------------------------------- /man/blastpAllAll.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/blasting.R 3 | \name{blastpAllAll} 4 | \alias{blastpAllAll} 5 | \title{Making BLAST search all against all genomes} 6 | \usage{ 7 | blastpAllAll( 8 | prot.files, 9 | out.folder, 10 | e.value = 1, 11 | job = 1, 12 | threads = 1, 13 | start.at = 1, 14 | verbose = TRUE 15 | ) 16 | } 17 | \arguments{ 18 | \item{prot.files}{A vector with FASTA filenames.} 19 | 20 | \item{out.folder}{The folder where the result files should end up.} 21 | 22 | \item{e.value}{The chosen E-value threshold in BLAST.} 23 | 24 | \item{job}{An integer to separate multiple jobs.} 25 | 26 | \item{threads}{The number of CPU's to use.} 27 | 28 | \item{start.at}{An integer to specify where in the file-list to start BLASTing.} 29 | 30 | \item{verbose}{Logical, if \code{TRUE} some text output is produced to monitor the progress.} 31 | } 32 | \value{ 33 | The function produces a result file for each pair of files listed in \samp{prot.files}. 34 | These result files are located in \code{out.folder}. Existing files are never overwritten by 35 | \code{\link{blastpAllAll}}, if you want to re-compute something, delete the corresponding result files first. 36 | } 37 | \description{ 38 | Runs a reciprocal all-against-all BLAST search to look for similarity of proteins 39 | within and across genomes. 40 | } 41 | \details{ 42 | A basic step in pangenomics and many other comparative studies is to cluster proteins into 43 | groups or families. One commonly used approach is based on BLASTing. This function uses the 44 | \samp{blast+} software available for free from NCBI (Camacho et al, 2009). More precisely, the blastp 45 | algorithm with the BLOSUM45 scoring matrix and all composition based statistics turned off. 46 | 47 | A vector listing FASTA files of protein sequences is given as input in \samp{prot.files}. These files 48 | must have the genome_id in the first token of every header, and in their filenames as well, i.e. all input 49 | files should first be prepared by \code{\link{panPrep}} to ensure this. Note that only protein sequences 50 | are considered here. If your coding genes are stored as DNA, please translate them to protein prior to 51 | using this function, see \code{\link[microseq]{translate}}. 52 | 53 | In the first version of this package we used reciprocal BLASTing, i.e. we computed both genome A against 54 | B and B against A. This may sometimes produce slightly different results, but in reality this is too 55 | costly compared to its gain, and we now only make one of the above searches. This basically halves the 56 | number of searches. This step is still very time consuming for larger number of genomes. Note that the 57 | protein files are sorted by the genome_id (part of filename) inside this function. This is to ensure a 58 | consistent ordering irrespective of how they are enterred. 59 | 60 | For every pair of genomes a result file is produced. If two genomes have genome_id's \samp{GID111}, 61 | and \samp{GID222} then the result file \samp{GID222_vs_GID111.txt} will 62 | be found in \samp{out.folder} after the completion of this search. The last of the two genome_id is always 63 | the first in alphabetical order of the two. 64 | 65 | The \samp{out.folder} is scanned for already existing result files, and \code{\link{blastpAllAll}} never 66 | overwrites an existing result file. If a file with the name \samp{GID111_vs_GID222.txt} already exists in 67 | the \samp{out.folder}, this particular search is skipped. This makes it possible to run multiple jobs in 68 | parallell, writing to the same \samp{out.folder}. It also makes it possible to add new genomes, and only 69 | BLAST the new combinations without repeating previous comparisons. 70 | 71 | This search can be slow if the genomes contain many proteins and it scales quadratically in the number of 72 | input files. It is best suited for the study of a smaller number of genomes. By 73 | starting multiple R sessions, you can speed up the search by running \code{\link{blastpAllAll}} from each R 74 | session, using the same \samp{out.folder} but different integers for the \code{job} option. At the same 75 | time you may also want to start the BLASTing at different places in the file-list, by giving larger values 76 | to the argument \code{start.at}. This is 1 by default, i.e. the BLASTing starts at the first protein file. 77 | If you are using a multicore computer you can also increase the number of CPUs by increasing \code{threads}. 78 | 79 | The result files are tab-separated text files, and can be read into R, but more 80 | commonly they are used as input to \code{\link{bDist}} to compute distances between sequences for subsequent 81 | clustering. 82 | } 83 | \note{ 84 | The \samp{blast+} software must be installed on the system for this function to work, i.e. the command 85 | \samp{system("makeblastdb -help")} must be recognized as valid commands if you 86 | run them in the Console window. 87 | } 88 | \examples{ 89 | \dontrun{ 90 | # This example requires the external BLAST+ software 91 | # Using protein files in this package 92 | pf <- file.path(path.package("micropan"), "extdata", 93 | str_c("xmpl_GID", 1:3, ".faa.xz")) 94 | 95 | # We need to uncompress them first... 96 | prot.files <- tempfile(fileext = c("_GID1.faa.xz","_GID2.faa.xz","_GID3.faa.xz")) 97 | ok <- file.copy(from = pf, to = prot.files) 98 | prot.files <- unlist(lapply(prot.files, xzuncompress)) 99 | 100 | # Blasting all versus all 101 | out.dir <- "." 102 | blastpAllAll(prot.files, out.folder = out.dir) 103 | 104 | # Reading results, and computing blast.distances 105 | blast.files <- list.files(out.dir, pattern = "GID[0-9]+_vs_GID[0-9]+.txt") 106 | blast.distances <- bDist(file.path(out.dir, blast.files)) 107 | 108 | # ...and cleaning... 109 | ok <- file.remove(prot.files) 110 | ok <- file.remove(file.path(out.dir, blast.files)) 111 | } 112 | 113 | } 114 | \references{ 115 | Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. 116 | (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421. 117 | } 118 | \seealso{ 119 | \code{\link{panPrep}}, \code{\link{bDist}}. 120 | } 121 | \author{ 122 | Lars Snipen and Kristian Hovde Liland. 123 | } 124 | -------------------------------------------------------------------------------- /man/chao.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/powerlaw.R 3 | \name{chao} 4 | \alias{chao} 5 | \title{The Chao lower bound estimate of pan-genome size} 6 | \usage{ 7 | chao(pan.matrix) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | } 12 | \value{ 13 | The function returns an integer, the estimated pan-genome size. This includes both the number 14 | of gene clusters observed so far, as well as the estimated number not yet seen. 15 | } 16 | \description{ 17 | Computes the Chao lower bound estimated number of gene clusters in a pan-genome. 18 | } 19 | \details{ 20 | The size of a pan-genome is the number of gene clusters in it, both those observed and those 21 | not yet observed. 22 | 23 | The input \samp{pan.matrix} is a a matrix with one row for each 24 | genome and one column for each observed gene cluster in the pan-genome. See \code{\link{panMatrix}} 25 | for how to construct this. 26 | 27 | The number of observed gene clusters is simply the number of columns in \samp{pan.matrix}. The 28 | number of gene clusters not yet observed is estimated by the Chao lower bound estimator (Chao, 1987). 29 | This is based solely on the number of clusters observed in 1 and 2 genomes. It is a very simple and 30 | conservative estimator, i.e. it is more likely to be too small than too large. 31 | } 32 | \examples{ 33 | # Loading a pan-matrix in this package 34 | data(xmpl.panmat) 35 | 36 | # Estimating the pan-genome size using the Chao estimator 37 | chao.pansize <- chao(xmpl.panmat) 38 | 39 | } 40 | \references{ 41 | Chao, A. (1987). Estimating the population size for capture-recapture data with unequal 42 | catchability. Biometrics, 43:783-791. 43 | } 44 | \seealso{ 45 | \code{\link{panMatrix}}, \code{\link{binomixEstimate}}. 46 | } 47 | \author{ 48 | Lars Snipen and Kristian Hovde Liland. 49 | } 50 | -------------------------------------------------------------------------------- /man/dClust.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/domainclust.R 3 | \name{dClust} 4 | \alias{dClust} 5 | \title{Clustering sequences based on domain sequence} 6 | \usage{ 7 | dClust(hmmer.tbl) 8 | } 9 | \arguments{ 10 | \item{hmmer.tbl}{A \code{tibble} of results from a \code{\link{hmmerScan}} against a domain database.} 11 | } 12 | \value{ 13 | The output is a numeric vector with one element for each unique sequence in the \samp{Query} 14 | column of the input \samp{hmmer.tbl}. Sequences with identical number belong to the same cluster. The 15 | name of each element identifies the sequence. 16 | 17 | This vector also has an attribute called \samp{cluster.info} which is a character vector containing the 18 | domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc. 19 | In this way you can, in addition to clustering the sequences, also see which domains the sequences of a 20 | particular cluster share. 21 | } 22 | \description{ 23 | Proteins are clustered by their sequence of protein domains. A domain sequence is the 24 | ordered sequence of domains in the protein. All proteins having identical domain sequence are assigned 25 | to the same cluster. 26 | } 27 | \details{ 28 | A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins 29 | contain known domains, but those who do will have from one to several domains, and these can be ordered 30 | forming a sequence. Since domains can be more or less conserved, two proteins can be quite different in 31 | their amino acid sequence, and still share the same domains. Describing, and grouping, proteins by their 32 | domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise 33 | alignments, see \code{\link{bClust}}. Domain sequence clusters are less influenced by gene prediction errors. 34 | 35 | The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}. Typically, it is the 36 | result of scanning proteins (using \code{\link{hmmerScan}}) against Pfam-A or any other HMMER3 database 37 | of protein domains. It is highly reccomended that you remove overlapping hits in \samp{hmmer.tbl} before 38 | you pass it as input to \code{\link{dClust}}. Use the function \code{\link{hmmerCleanOverlap}} for this. 39 | Overlapping hits are in some cases real hits, but often the poorest of them are artifacts. 40 | } 41 | \examples{ 42 | # HMMER3 result files in this package 43 | hf <- file.path(path.package("micropan"), "extdata", 44 | str_c("GID", 1:3, "_vs_microfam.hmm.txt.xz")) 45 | 46 | # We need to uncompress them first... 47 | hmm.files <- tempfile(fileext = rep(".xz", length(hf))) 48 | ok <- file.copy(from = hf, to = hmm.files) 49 | hmm.files <- unlist(lapply(hmm.files, xzuncompress)) 50 | 51 | # Reading the HMMER3 results, cleaning overlaps... 52 | hmmer.tbl <- NULL 53 | for(i in 1:3){ 54 | readHmmer(hmm.files[i]) \%>\% 55 | hmmerCleanOverlap() \%>\% 56 | bind_rows(hmmer.tbl) -> hmmer.tbl 57 | } 58 | 59 | # The clustering 60 | clst <- dClust(hmmer.tbl) 61 | 62 | # ...and cleaning... 63 | ok <- file.remove(hmm.files) 64 | 65 | } 66 | \references{ 67 | Snipen, L. Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications 68 | to Escherichia coli. F1000 Research, 1:19. 69 | } 70 | \seealso{ 71 | \code{\link{panPrep}}, \code{\link{hmmerScan}}, \code{\link{readHmmer}}, 72 | \code{\link{hmmerCleanOverlap}}, \code{\link{bClust}}. 73 | } 74 | \author{ 75 | Lars Snipen and Kristian Hovde Liland. 76 | } 77 | -------------------------------------------------------------------------------- /man/distJaccard.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/genomedistances.R 3 | \name{distJaccard} 4 | \alias{distJaccard} 5 | \title{Computing Jaccard distances between genomes} 6 | \usage{ 7 | distJaccard(pan.matrix) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | } 12 | \value{ 13 | A \code{dist} object (see \code{\link{dist}}) containing all pairwise Jaccard distances 14 | between genomes. 15 | } 16 | \description{ 17 | Computes the Jaccard distances between all pairs of genomes. 18 | } 19 | \details{ 20 | The Jaccard index between two sets is defined as the size of the intersection of 21 | the sets divided by the size of the union. The Jaccard distance is simply 1 minus the Jaccard index. 22 | 23 | The Jaccard distance between two genomes describes their degree of overlap with respect to gene 24 | cluster content. If the Jaccard distance is 0.0, the two genomes contain identical gene clusters. 25 | If it is 1.0 the two genomes are non-overlapping. The difference between a genomic fluidity (see 26 | \code{\link{fluidity}}) and a Jaccard distance is small, they both measure overlap between genomes, 27 | but fluidity is computed for the population by averaging over many pairs, while Jaccard distances are 28 | computed for every pair. Note that only presence/absence of gene clusters are considered, not multiple 29 | occurrences. 30 | 31 | The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}. 32 | } 33 | \examples{ 34 | # Loading a pan-matrix in this package 35 | data(xmpl.panmat) 36 | 37 | # Jaccard distances 38 | Jdist <- distJaccard(xmpl.panmat) 39 | 40 | # Making a dendrogram based on the distances, 41 | # see example for distManhattan 42 | 43 | } 44 | \seealso{ 45 | \code{\link{panMatrix}}, \code{\link{fluidity}}, \code{\link{dist}}. 46 | } 47 | \author{ 48 | Lars Snipen and Kristian Hovde Liland. 49 | } 50 | -------------------------------------------------------------------------------- /man/distManhattan.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/genomedistances.R 3 | \name{distManhattan} 4 | \alias{distManhattan} 5 | \title{Computing Manhattan distances between genomes} 6 | \usage{ 7 | distManhattan(pan.matrix, scale = 0, weights = rep(1, ncol(pan.matrix))) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | 12 | \item{scale}{An optional scale to control how copy numbers should affect the distances.} 13 | 14 | \item{weights}{Vector of optional weights of gene clusters.} 15 | } 16 | \value{ 17 | A \code{dist} object (see \code{\link{dist}}) containing all pairwise Manhattan distances 18 | between genomes. 19 | } 20 | \description{ 21 | Computes the (weighted) Manhattan distances beween all pairs of genomes. 22 | } 23 | \details{ 24 | The Manhattan distance is defined as the sum of absolute elementwise differences between 25 | two vectors. Each genome is represented as a vector (row) of integers in \samp{pan.matrix}. The 26 | Manhattan distance between two genomes is the sum of absolute difference between these rows. If 27 | two rows (genomes) of the \samp{pan.matrix} are identical, the corresponding Manhattan distance 28 | is \samp{0.0}. 29 | 30 | The \samp{scale} can be used to control how copy number differences play a role in the distances 31 | computed. Usually we assume that going from 0 to 1 copy of a gene is the big change of the genome, 32 | and going from 1 to 2 (or more) copies is less. Prior to computing the Manhattan distance, the 33 | \samp{pan.matrix} is transformed according to the following affine mapping: If the original value in 34 | \samp{pan.matrix} is \samp{x}, and \samp{x} is not 0, then the transformed value is \samp{1 + (x-1)*scale}. 35 | Note that with \samp{scale=0.0} (default) this will result in 1 regardless of how large \samp{x} was. 36 | In this case the Manhattan distance only distinguish between presence and absence of gene clusters. 37 | If \samp{scale=1.0} the value \samp{x} is left untransformed. In this case the difference between 1 38 | copy and 2 copies is just as big as between 1 copy and 0 copies. For any \samp{scale} between 0.0 and 39 | 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still 40 | present. In this way you can decide if the distances between genomes should be affected, and to what 41 | degree, by differences in copy numbers beyond 1. Notice that as long as \samp{scale=0.0} (and no 42 | weighting) the Manhattan distance has a nice interpretation, namely the number of gene clusters that 43 | differ in present/absent status between two genomes. 44 | 45 | When summing the difference across gene clusters we can also up- or downweight some clusters compared 46 | to others. The vector \samp{weights} must contain one value for each column in \samp{pan.matrix}. The 47 | default is to use flat weights, i.e. all clusters count equal. See \code{\link{geneWeights}} for 48 | alternative weighting strategies. 49 | } 50 | \examples{ 51 | # Loading a pan-matrix in this package 52 | data(xmpl.panmat) 53 | 54 | # Manhattan distances between genomes 55 | Mdist <- distManhattan(xmpl.panmat) 56 | 57 | \dontrun{ 58 | # Making a dendrogram based on shell-weighted distances 59 | library(ggdendro) 60 | weights <- geneWeights(xmpl.panmat, type = "shell") 61 | Mdist <- distManhattan(xmpl.panmat, weights = weights) 62 | ggdendrogram(dendro_data(hclust(Mdist, method = "average")), 63 | rotate = TRUE, theme_dendro = FALSE) + 64 | labs(x = "Genomes", y = "Shell-weighted Manhattan distance", title = "Pan-genome dendrogram") 65 | } 66 | 67 | } 68 | \seealso{ 69 | \code{\link{panMatrix}}, \code{\link{distJaccard}}, \code{\link{geneWeights}}. 70 | } 71 | \author{ 72 | Lars Snipen and Kristian Hovde Liland. 73 | } 74 | -------------------------------------------------------------------------------- /man/entrezDownload.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/entrez.R 3 | \name{entrezDownload} 4 | \alias{entrezDownload} 5 | \title{Downloading genome data} 6 | \usage{ 7 | entrezDownload(accession, out.file, verbose = TRUE) 8 | } 9 | \arguments{ 10 | \item{accession}{A character vector containing a set of valid accession numbers at the NCBI 11 | Nucleotide database.} 12 | 13 | \item{out.file}{Name of the file where downloaded sequences should be written in FASTA format.} 14 | 15 | \item{verbose}{Logical indicating if textual output should be given during execution, to monitor 16 | the download progress.} 17 | } 18 | \value{ 19 | The name of the resulting FASTA file is returned (same as \code{out.file}), but the real result of 20 | this function is the creation of the file itself. 21 | } 22 | \description{ 23 | Retrieving genomes from NCBI using the Entrez programming utilities. 24 | } 25 | \details{ 26 | The Entrez programming utilities is a toolset for automatic download of data from the 27 | NCBI databases, see \href{https://www.ncbi.nlm.nih.gov/books/NBK25500/}{E-utilities Quick Start} 28 | for details. \code{\link{entrezDownload}} can be used to download genomes from the NCBI Nucleotide 29 | database through these utilities. 30 | 31 | The argument \samp{accession} must be a set of valid accession numbers at NCBI Nucleotide, typically 32 | all accession numbers related to a genome (chromosomes, plasmids, contigs, etc). For completed genomes, 33 | where the number of sequences is low, \samp{accession} is typically a single text listing all accession 34 | numbers separated by commas. In the case of some draft genomes having a large number of contigs, the 35 | accession numbers must be split into several comma-separated texts. The reason for this is that Entrez 36 | will not accept too many queries in one chunk. 37 | 38 | The downloaded sequences are saved in \samp{out.file} on your system. This will be a FASTA formatted file. 39 | Note that all downloaded sequences end up in this file. If you want to download multiple genomes, 40 | you call \code{\link{entrezDownload}} multiple times and store in multiple files. 41 | } 42 | \examples{ 43 | \dontrun{ 44 | # Accession numbers for the chromosome and plasmid of Buchnera aphidicola, strain APS 45 | acc <- "BA000003.2,AP001071.1" 46 | genome.file <- tempfile(pattern = "Buchnera_aphidicola", fileext = ".fna") 47 | txt <- entrezDownload(acc, out.file = genome.file) 48 | 49 | # ...cleaning... 50 | ok <- file.remove(genome.file) 51 | } 52 | 53 | } 54 | \seealso{ 55 | \code{\link{getAccessions}}, \code{\link[microseq]{readFasta}}. 56 | } 57 | \author{ 58 | Lars Snipen and Kristian Liland. 59 | } 60 | -------------------------------------------------------------------------------- /man/extractPanGenes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/extractPanGenes.R 3 | \name{extractPanGenes} 4 | \alias{extractPanGenes} 5 | \title{Extracting genes of same prevalence} 6 | \usage{ 7 | extractPanGenes(clustering, N.genomes = 1:2) 8 | } 9 | \arguments{ 10 | \item{clustering}{Named vector of clustering} 11 | 12 | \item{N.genomes}{Vector specifying the number of genomes the genes should be in} 13 | } 14 | \value{ 15 | A table with columns 16 | \itemize{ 17 | \item cluster. The gene family (integer) 18 | \item seq_tag. The sequence tag identifying each sequence (text) 19 | \item N_genomes. The number of genomes in which it is found (integer) 20 | } 21 | } 22 | \description{ 23 | Based on a clustering of genes, this function extracts the genes 24 | occurring in the same number of genomes. 25 | } 26 | \details{ 27 | Pan-genome studies focus on the gene families obtained by some clustering, 28 | see \code{\link{bClust}} or \code{\link{dClust}}. This function will extract the individual genes from 29 | each genome belonging to gene families found in \code{N.genomes} genomes specified by the user. 30 | Only the sequence tag for each gene is extracted, but the sequences can be added easily, see examples 31 | below. 32 | } 33 | \examples{ 34 | # Loading clustering data in this package 35 | data(xmpl.bclst) 36 | 37 | # Finding genes in 5 genomes 38 | core.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 5) 39 | #...or in a single genome 40 | orfan.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1) 41 | 42 | \dontrun{ 43 | # To add the sequences, assume all protein fasta files are in a folder named faa: 44 | lapply(list.files("faa", full.names = T), readFasta) \%>\% 45 | bind_rows() \%>\% 46 | mutate(seq_tag = word(Header, 1, 1)) \%>\% 47 | right_join(orfan.tbl, by = "seq_tag") -> orfan.tbl 48 | # The resulting table can be written to fasta file directly using writeFasta() 49 | # See also geneFamilies2fasta() 50 | } 51 | 52 | } 53 | \seealso{ 54 | \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{geneFamilies2fasta}}. 55 | } 56 | \author{ 57 | Lars Snipen. 58 | } 59 | -------------------------------------------------------------------------------- /man/fluidity.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/genomedistances.R 3 | \name{fluidity} 4 | \alias{fluidity} 5 | \title{Computing genomic fluidity for a pan-genome} 6 | \usage{ 7 | fluidity(pan.matrix, n.sim = 10) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | 12 | \item{n.sim}{An integer specifying the number of random samples to use in the computations.} 13 | } 14 | \value{ 15 | A vector with two elements, the mean fluidity and its sample standard deviation over 16 | the \samp{n.sim} computed values. 17 | } 18 | \description{ 19 | Computes the genomic fluidity, which is a measure of population diversity. 20 | } 21 | \details{ 22 | The genomic fluidity between two genomes is defined as the number of unique gene 23 | families divided by the total number of gene families (Kislyuk et al, 2011). This is averaged 24 | over \samp{n.sim} random pairs of genomes to obtain a population estimate. 25 | 26 | The genomic fluidity between two genomes describes their degree of overlap with respect to gene 27 | cluster content. If the fluidity is 0.0, the two genomes contain identical gene clusters. If it 28 | is 1.0 the two genomes are non-overlapping. The difference between a Jaccard distance (see 29 | \code{\link{distJaccard}}) and genomic fluidity is small, they both measure overlap between 30 | genomes, but fluidity is computed for the population by averaging over many pairs, while Jaccard 31 | distances are computed for every pair. Note that only presence/absence of gene clusters are 32 | considered, not multiple occurrences. 33 | 34 | The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}. 35 | } 36 | \examples{ 37 | # Loading a pan-matrix in this package 38 | data(xmpl.panmat) 39 | 40 | # Fluidity based on this pan-matrix 41 | fluid <- fluidity(xmpl.panmat) 42 | 43 | } 44 | \references{ 45 | Kislyuk, A.O., Haegeman, B., Bergman, N.H., Weitz, J.S. (2011). Genomic fluidity: 46 | an integrative view of gene diversity within microbial populations. BMC Genomics, 12:32. 47 | } 48 | \seealso{ 49 | \code{\link{panMatrix}}, \code{\link{distJaccard}}. 50 | } 51 | \author{ 52 | Lars Snipen and Kristian Hovde Liland. 53 | } 54 | -------------------------------------------------------------------------------- /man/geneFamilies2fasta.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/extractPanGenes.R 3 | \name{geneFamilies2fasta} 4 | \alias{geneFamilies2fasta} 5 | \title{Write gene families to files} 6 | \usage{ 7 | geneFamilies2fasta( 8 | pangene.tbl, 9 | fasta.folder, 10 | out.folder, 11 | file.ext = "fasta$|faa$|fna$|fa$", 12 | verbose = TRUE 13 | ) 14 | } 15 | \arguments{ 16 | \item{pangene.tbl}{A table listing gene families (clusters).} 17 | 18 | \item{fasta.folder}{The folder containing the fasta files with all sequences.} 19 | 20 | \item{out.folder}{The folder to write to.} 21 | 22 | \item{file.ext}{The file extension to recognize the fasta files in \code{fasta.folder}.} 23 | 24 | \item{verbose}{Logical to allow text ouput during processing} 25 | } 26 | \description{ 27 | Writes specified gene families to separate fasta files. 28 | } 29 | \details{ 30 | The argument \code{pangene.tbl} should be produced by \code{\link{extractPanGenes}} in order to 31 | contain the columns \code{cluster}, \code{seq_tag} and \code{N_genomes} required by this function. The 32 | files in \code{fasta.folder} must have been prepared by \code{\link{panPrep}} in order to have the proper 33 | sequence tag information. They may contain protein sequences or DNA sequences. 34 | 35 | If you already added the \code{Header} and \code{Sequence} information to \code{pangene.tbl} these will be 36 | used instead of reading the files in \code{fasta.folder}, but a warning is issued. 37 | } 38 | \examples{ 39 | # Loading clustering data in this package 40 | data(xmpl.bclst) 41 | 42 | # Finding genes in 1,..,5 genomes (all genes) 43 | all.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1:5) 44 | 45 | \dontrun{ 46 | # All protein fasta files are in a folder named faa, and we write to the current folder: 47 | clusters2fasta(all.tbl, fasta.folder = "faa", out.folder = ".") 48 | 49 | # use pipe, write to folder "orfans" 50 | extractPanGenes(xmpl.bclst, N.genomes = 1)) \%>\% 51 | geneFamilies2fasta(fasta.folder = "faa", out.folder = "orfans") 52 | } 53 | 54 | } 55 | \seealso{ 56 | \code{\link{extractPanGenes}}, \code{\link{writeFasta}}. 57 | } 58 | \author{ 59 | Lars Snipen. 60 | } 61 | -------------------------------------------------------------------------------- /man/geneWeights.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/genomedistances.R 3 | \name{geneWeights} 4 | \alias{geneWeights} 5 | \title{Gene cluster weighting} 6 | \usage{ 7 | geneWeights(pan.matrix, type = c("shell", "cloud")) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | 12 | \item{type}{A text indicating the weighting strategy.} 13 | } 14 | \value{ 15 | A vector of weights, one for each column in \code{pan.matrix}. 16 | } 17 | \description{ 18 | This function computes weights for gene cluster according to their distribution in a pan-genome. 19 | } 20 | \details{ 21 | When computing distances between genomes or a PCA, it is possible to give weights to the 22 | different gene clusters, emphasizing certain aspects. 23 | 24 | As proposed by Snipen & Ussery (2010), we have implemented two types of weighting: The default 25 | \samp{"shell"} type means gene families occuring frequently in the genomes, denoted shell-genes, are 26 | given large weight (close to 1) while those occurring rarely are given small weight (close to 0). 27 | The opposite is the \samp{"cloud"} type of weighting. Genes observed in a minority of the genomes are 28 | referred to as cloud-genes. Presumeably, the \samp{"shell"} weighting will give distances/PCA reflecting 29 | a more long-term evolution, since emphasis is put on genes who have just barely diverged away from the 30 | core. The \samp{"cloud"} weighting emphasizes those gene clusters seen rarely. Genomes with similar 31 | patterns among these genes may have common recent history. A \samp{"cloud"} weighting typically gives 32 | a more erratic or \sQuote{noisy} picture than the \samp{"shell"} weighting. 33 | } 34 | \examples{ 35 | # See examples for distManhattan 36 | 37 | } 38 | \references{ 39 | Snipen, L., Ussery, D.W. (2010). Standard operating procedure for computing pangenome 40 | trees. Standards in Genomic Sciences, 2:135-141. 41 | } 42 | \seealso{ 43 | \code{\link{panMatrix}}, \code{\link{distManhattan}}. 44 | } 45 | \author{ 46 | Lars Snipen and Kristian Hovde Liland. 47 | } 48 | -------------------------------------------------------------------------------- /man/getAccessions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/entrez.R 3 | \name{getAccessions} 4 | \alias{getAccessions} 5 | \title{Collecting contig accession numbers} 6 | \usage{ 7 | getAccessions(master.record.accession, chunk.size = 99) 8 | } 9 | \arguments{ 10 | \item{master.record.accession}{The accession number (single text) to a master record GenBank file having 11 | the WGS entry specifying the accession numbers to all contigs of the WGS genome.} 12 | 13 | \item{chunk.size}{The maximum number of accession numbers returned in one text.} 14 | } 15 | \value{ 16 | A character vector where each element is a text listing the accession numbers separated by comma. 17 | Each vector element will contain no more than \code{chunk.size} accession numbers, see 18 | \code{\link{entrezDownload}} for details on this. The vector returned by \code{\link{getAccessions}} 19 | is typically used as input to \code{\link{entrezDownload}}. 20 | } 21 | \description{ 22 | Retrieving the accession numbers for all contigs from a master record GenBank file. 23 | } 24 | \details{ 25 | In order to download a WGS genome (draft genome) using \code{\link{entrezDownload}} you will 26 | need the accession number of every contig. This is found in the master record GenBank file, which is 27 | available for every WGS genome. \code{\link{getAccessions}} will extract these from the GenBank file and 28 | return them in the apropriate way to be used by \code{\link{entrezDownload}}. 29 | 30 | The download API at NCBI will not tolerate too many accessions per query, and for this reason you need 31 | to split the accessions for many contigs into several texts using \code{chunk.size}. 32 | } 33 | \examples{ 34 | \dontrun{ 35 | # The master record accession for the WGS genome Mycoplasma genitalium, strain G37 36 | acc <- getAccessions("AAGX00000000") 37 | # Then we use this to download all contigs and save them 38 | genome.file <- tempfile(fileext = ".fna") 39 | txt <- entrezDownload(acc, out.file = genome.file) 40 | 41 | # ...cleaning... 42 | ok <- file.remove(genome.file) 43 | } 44 | 45 | } 46 | \seealso{ 47 | \code{\link{entrezDownload}}. 48 | } 49 | \author{ 50 | Lars Snipen and Kristian Liland. 51 | } 52 | -------------------------------------------------------------------------------- /man/heaps.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/powerlaw.R 3 | \name{heaps} 4 | \alias{heaps} 5 | \title{Heaps law estimate} 6 | \usage{ 7 | heaps(pan.matrix, n.perm = 100) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | 12 | \item{n.perm}{The number of random permutations of genome ordering.} 13 | } 14 | \value{ 15 | A vector of two estimated parameters: The \samp{Intercept} and the decay parameter \samp{alpha}. 16 | If \samp{alpha<1.0} the pan-genome is open, if \samp{alpha>1.0} it is closed. 17 | } 18 | \description{ 19 | Estimating if a pan-genome is open or closed based on a Heaps law model. 20 | } 21 | \details{ 22 | An open pan-genome means there will always be new gene clusters observed as long as new genomes 23 | are being sequenced. This may sound controversial, but in a pragmatic view, an open pan-genome indicates 24 | that the number of new gene clusters to be observed in future genomes is \sQuote{large} (but not literally 25 | infinite). Opposite, a closed pan-genome indicates we are approaching the end of new gene clusters. 26 | 27 | This function is based on a Heaps law approach suggested by Tettelin et al (2008). The Heaps law model 28 | is fitted to the number of new gene clusters observed when genomes are ordered in a random way. The model 29 | has two parameters, an intercept and a decay parameter called \samp{alpha}. If \samp{alpha>1.0} the 30 | pan-genome is closed, if \samp{alpha<1.0} it is open. 31 | 32 | The number of permutations, \samp{n.perm}, should be as large as possible, limited by computation time. 33 | The default value of 100 is certainly a minimum. 34 | 35 | Word of caution: The Heaps law assumes independent sampling. If some of the genomes in the data set 36 | form distinct sub-groups in the population, this may affect the results of this analysis severely. 37 | } 38 | \examples{ 39 | # Loading a pan-matrix in this package 40 | data(xmpl.panmat) 41 | 42 | # Estimating population openness 43 | h.est <- heaps(xmpl.panmat, n.perm = 500) 44 | print(h.est) 45 | # If alpha < 1 it indicates an open pan-genome 46 | 47 | } 48 | \references{ 49 | Tettelin, H., Riley, D., Cattuto, C., Medini, D. (2008). Comparative genomics: the 50 | bacterial pan-genome. Current Opinions in Microbiology, 12:472-477. 51 | } 52 | \seealso{ 53 | \code{\link{binomixEstimate}}, \code{\link{chao}}, \code{\link{rarefaction}}. 54 | } 55 | \author{ 56 | Lars Snipen and Kristian Hovde Liland. 57 | } 58 | -------------------------------------------------------------------------------- /man/hmmerCleanOverlap.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/domainclust.R 3 | \name{hmmerCleanOverlap} 4 | \alias{hmmerCleanOverlap} 5 | \title{Removing overlapping hits from HMMER3 scans} 6 | \usage{ 7 | hmmerCleanOverlap(hmmer.tbl) 8 | } 9 | \arguments{ 10 | \item{hmmer.tbl}{A table (\code{tibble}) with \code{\link{hmmerScan}} results, see \code{\link{readHmmer}}.} 11 | } 12 | \value{ 13 | A \code{tibble} which is a subset of the input, where some rows may have been deleted to 14 | avoid overlapping hits. 15 | } 16 | \description{ 17 | Removing hits to avoid overlapping HMMs on the same protein sequence. 18 | } 19 | \details{ 20 | When scanning sequences against a profile HMM database using \code{\link{hmmerScan}}, we 21 | often find that several patterns (HMMs) match in the same region of the query sequence, i.e. we have 22 | overlapping hits. The function \code{\link{hmmerCleanOverlap}} will remove the poorest overlapping hit 23 | in a recursive way such that all overlaps are eliminated. 24 | 25 | The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}. 26 | } 27 | \examples{ 28 | # See the example in the Help-file for dClust. 29 | 30 | } 31 | \seealso{ 32 | \code{\link{hmmerScan}}, \code{\link{readHmmer}}, \code{\link{dClust}}. 33 | } 34 | \author{ 35 | Lars Snipen and Kristian Hovde Liland. 36 | } 37 | -------------------------------------------------------------------------------- /man/hmmerScan.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/hmmer3.R 3 | \name{hmmerScan} 4 | \alias{hmmerScan} 5 | \title{Scanning a profile Hidden Markov Model database} 6 | \usage{ 7 | hmmerScan(in.files, dbase, out.folder, threads = 0, verbose = TRUE) 8 | } 9 | \arguments{ 10 | \item{in.files}{A character vector of file names.} 11 | 12 | \item{dbase}{The full path-name of the database to scan (text).} 13 | 14 | \item{out.folder}{The name of the folder to put the result files.} 15 | 16 | \item{threads}{Number of CPU's to use.} 17 | 18 | \item{verbose}{Logical indicating if textual output should be given to monitor the progress.} 19 | } 20 | \value{ 21 | This function produces files in the folder specified by \samp{out.folder}. Existing files are 22 | never overwritten by \code{\link{hmmerScan}}, if you want to re-compute something, delete the 23 | corresponding result files first. 24 | } 25 | \description{ 26 | Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3 27 | software. 28 | } 29 | \details{ 30 | The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM) 31 | describing patterns in biological sequences (Eddy, 2008). This function will make calls to the 32 | HMMER3 software to scan FASTA files of proteins against a pHMM database. 33 | 34 | The files named in \samp{in.files} must contain FASTA formatted protein sequences. These files 35 | should be prepared by \code{\link{panPrep}} to make certain each sequence, as well as the file name, 36 | has a GID-tag identifying their genome. The database named in \samp{db} must be a HMMER3 formatted 37 | database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see 38 | the HMMER3 documentation for help. 39 | 40 | \code{\link{hmmerScan}} will query every input file against the named database. The database contains 41 | profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every 42 | input file is scanned to see if some of the patterns can be matched to some degree. Each input file 43 | results in an output file with the same GID-tag in the name. The result files give tabular output, and 44 | are plain text files. See \code{\link{readHmmer}} for how to read the results into R. 45 | 46 | Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set 47 | up to use only 1 cpu per scan by default. By increasing \code{threads} you can utilize multiple CPUs, typically 48 | on a computing cluster. 49 | Our experience is that from a multi-core laptop it is better to start this function in default mode 50 | from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel 51 | sessions can write results to the same folder. 52 | } 53 | \note{ 54 | The HMMER3 software must be installed on the system for this function to work, i.e. the command 55 | \samp{system("hmmscan -h")} must be recognized as a valid command if you run it in the Console window. 56 | } 57 | \examples{ 58 | \dontrun{ 59 | # This example require the external software HMMER 60 | # Using example files in this package 61 | pf <- file.path(path.package("micropan"), "extdata", "xmpl_GID1.faa.xz") 62 | dbf <- file.path(path.package("micropan"), "extdata", 63 | str_c("microfam.hmm", c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"))) 64 | 65 | # We need to uncompress them first... 66 | prot.file <- tempfile(pattern = "GID1.faa", fileext=".xz") 67 | ok <- file.copy(from = pf, to = prot.file) 68 | prot.file <- xzuncompress(prot.file) 69 | db.files <- str_c(tempfile(), c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz")) 70 | ok <- file.copy(from = dbf, to = db.files) 71 | db.files <- unlist(lapply(db.files, xzuncompress)) 72 | db.name <- str_remove(db.files[1], "\\\\.[a-z0-9]+$") 73 | 74 | # Scanning the FASTA file against microfam.hmm... 75 | hmmerScan(in.files = prot.file, dbase = db.name, out.folder = ".") 76 | 77 | # Reading results 78 | hmm.file <- file.path(".", str_c("GID1_vs_", basename(db.name), ".txt")) 79 | hmm.tbl <- readHmmer(hmm.file) 80 | 81 | # ...and cleaning... 82 | ok <- file.remove(prot.file) 83 | ok <- file.remove(str_remove(db.files, ".xz")) 84 | } 85 | 86 | } 87 | \references{ 88 | Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies 89 | Statistical Significance Estimation. PLoS Computational Biology, 4(5). 90 | } 91 | \seealso{ 92 | \code{\link{panPrep}}, \code{\link{readHmmer}}. 93 | } 94 | \author{ 95 | Lars Snipen and Kristian Hovde Liland. 96 | } 97 | -------------------------------------------------------------------------------- /man/isOrtholog.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/bclust.R 3 | \name{isOrtholog} 4 | \alias{isOrtholog} 5 | \title{Identifies orthologs in gene clusters} 6 | \usage{ 7 | isOrtholog(clustering, dist.tbl) 8 | } 9 | \arguments{ 10 | \item{clustering}{A vector of integers indicating the cluster for every sequence. Sequences with 11 | the same number belong to the same cluster. The name of each element is the tag identifying the sequence.} 12 | 13 | \item{dist.tbl}{A \code{tibble} with pairwise distances. The columns \samp{Query} and 14 | \samp{Hit} contain tags identifying pairs of sequences. The column \samp{Distance} contains the 15 | distances, always a number from 0.0 to 1.0.} 16 | } 17 | \value{ 18 | A vector of logicals with the same number of elements as the input \samp{clustering}, indicating 19 | if the corresponding sequence is an ortholog (\code{TRUE}) or not (\code{FALSE}). The name of each 20 | element is copied from \samp{clustering}. 21 | } 22 | \description{ 23 | Finds the ortholog sequences in every cluster based on pairwise distances. 24 | } 25 | \details{ 26 | The input \code{clustering} is typically produced by \code{\link{bClust}}. The input 27 | \code{dist.tbl} is typically produced by \code{\link{bDist}}. 28 | 29 | The concept of orthologs is difficult for prokaryotes, and this function finds orthologs in a 30 | simplistic way. For a given cluster, with members from many genomes, there is one ortholog from every 31 | genome. In cases where a genome has two or more members in the same cluster, only one of these is an 32 | ortholog, the rest are paralogs. 33 | 34 | Consider all sequences from the same genome belonging to the same cluster. The ortholog is defined as 35 | the one having the smallest sum of distances to all other members of the same cluster, i.e. the one 36 | closest to the \sQuote{center} of the cluster. 37 | 38 | Note that the status as ortholog or paralog depends greatly on how clusters are defined in the first 39 | place. If you allow large and diverse (and few) clusters, many sequences will be paralogs. If you define 40 | tight and homogenous (and many) clusters, almost all sequences will be orthologs. 41 | } 42 | \examples{ 43 | \dontrun{ 44 | # Loading distance data and their clustering results 45 | data(list = c("xmpl.bdist","xmpl.bclst")) 46 | 47 | # Finding orthologs 48 | is.ortholog <- isOrtholog(xmpl.bclst, xmpl.bdist) 49 | # The orthologs are 50 | which(is.ortholog) 51 | } 52 | 53 | } 54 | \seealso{ 55 | \code{\link{bDist}}, \code{\link{bClust}}. 56 | } 57 | \author{ 58 | Lars Snipen and Kristian Hovde Liland. 59 | } 60 | -------------------------------------------------------------------------------- /man/micropan.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/micropan.R 3 | \name{micropan} 4 | \alias{micropan} 5 | \alias{micropan-package} 6 | \title{Microbial Pan-Genome Analysis} 7 | \description{ 8 | A collection of functions for computations and visualizations of microbial pan-genomes. 9 | Some of the functions make use of external software that needs to be installed on the system, see the 10 | package vignette for more details on this. 11 | } 12 | \references{ 13 | Snipen, L., Liland, KH. (2015). micropan: an R-package for microbial pan-genomics. BMC Bioinformatics, 16:79. 14 | } 15 | \author{ 16 | Lars Snipen and Kristian Hovde Liland. 17 | 18 | Maintainer: Lars Snipen 19 | } 20 | -------------------------------------------------------------------------------- /man/panMatrix.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/panmat.R 3 | \name{panMatrix} 4 | \alias{panMatrix} 5 | \title{Computing the pan-matrix for a set of gene clusters} 6 | \usage{ 7 | panMatrix(clustering) 8 | } 9 | \arguments{ 10 | \item{clustering}{A named vector of integers.} 11 | } 12 | \value{ 13 | An integer matrix with a row for each genome and a column for each sequence cluster. 14 | The input vector \samp{clustering} is attached as the attribute \samp{clustering}. 15 | } 16 | \description{ 17 | A pan-matrix has one row for each genome and one column for each gene cluster, and 18 | cell \samp{[i,j]} indicates how many members genome \samp{i} has in gene family \samp{j}. 19 | } 20 | \details{ 21 | The pan-matrix is a central data structure for pan-genomic analysis. It is a matrix with 22 | one row for each genome in the study, and one column for each gene cluster. Cell \samp{[i,j]} 23 | contains an integer indicating how many members genome \samp{i} has in cluster \samp{j}. 24 | 25 | The input \code{clustering} must be a named integer vector with one element for each sequence in the study, 26 | typically produced by either \code{\link{bClust}} or \code{\link{dClust}}. The name of each element 27 | is a text identifying every sequence. The value of each element indicates the cluster, i.e. those 28 | sequences with identical values are in the same cluster. IMPORTANT: The name of each sequence must 29 | contain the \samp{genome_id} for each genome, i.e. they must of the form \samp{GID111_seq1}, \samp{GID111_seq2},... 30 | where the \samp{GIDxxx} part indicates which genome the sequence belongs to. See \code{\link{panPrep}} 31 | for details. 32 | 33 | The rows of the pan-matrix is named by the \samp{genome_id} for every genome. The columns are just named 34 | \samp{Cluster_x} where \samp{x} is an integer copied from \samp{clustering}. 35 | } 36 | \examples{ 37 | # Loading clustering data in this package 38 | data(xmpl.bclst) 39 | 40 | # Pan-matrix based on the clustering 41 | panmat <- panMatrix(xmpl.bclst) 42 | 43 | \dontrun{ 44 | # Plotting cluster distribution 45 | library(ggplot2) 46 | tibble(Clusters = as.integer(table(factor(colSums(panmat > 0), levels = 1:nrow(panmat)))), 47 | Genomes = 1:nrow(panmat)) \%>\% 48 | ggplot(aes(x = Genomes, y = Clusters)) + 49 | geom_col() 50 | } 51 | 52 | } 53 | \seealso{ 54 | \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{distManhattan}}, 55 | \code{\link{distJaccard}}, \code{\link{fluidity}}, \code{\link{chao}}, 56 | \code{\link{binomixEstimate}}, \code{\link{heaps}}, \code{\link{rarefaction}}. 57 | } 58 | \author{ 59 | Lars Snipen and Kristian Hovde Liland. 60 | } 61 | -------------------------------------------------------------------------------- /man/panPca.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/panpca.R 3 | \name{panPca} 4 | \alias{panPca} 5 | \title{Principal component analysis of a pan-matrix} 6 | \usage{ 7 | panPca(pan.matrix, scale = 0, weights = rep(1, ncol(pan.matrix))) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | 12 | \item{scale}{An optional scale to control how copy numbers should affect the distances.} 13 | 14 | \item{weights}{Vector of optional weights of gene clusters.} 15 | } 16 | \value{ 17 | A \code{list} with three tables: 18 | 19 | \samp{Evar.tbl} has two columns, one listing the component number and one listing the relative 20 | explained variance for each component. The relative explained variance always sums to 1.0 over 21 | all components. This value indicates the importance of each component, and it is always in 22 | descending order, the first component being the most important. 23 | This is typically the first result you look at after a PCA has been computed, as it indicates 24 | how many components (directions) you need to capture the bulk of the total variation in the data. 25 | 26 | \samp{Scores.tbl} has a column listing the \samp{GID.tag} for each genome, and then one column for each 27 | principal component. The columns are ordered corresponding to the elements in \samp{Evar}. The 28 | scores are the coordinates of each genome in the principal component space. 29 | 30 | \samp{Loadings.tbl} is similar to \samp{Scores.tbl} but contain values for each gene cluster 31 | instead of each genome. The columns are ordered corresponding to the elements in \samp{Evar}. 32 | The loadings are the contributions from each gene cluster to the principal component directions. 33 | NOTE: Only gene clusters having a non-zero variance is used in a PCA. Gene clusters with the 34 | same value for every genome have no impact and are discarded from the \samp{Loadings}. 35 | } 36 | \description{ 37 | Computes a principal component decomposition of a pan-matrix, with possible 38 | scaling and weightings. 39 | } 40 | \details{ 41 | A principal component analysis (PCA) can be computed for any matrix, also a pan-matrix. 42 | The principal components will in this case be linear combinations of the gene clusters. One major 43 | idea behind PCA is to truncate the space, e.g. instead of considering the genomes as points in a 44 | high-dimensional space spanned by all gene clusters, we look for a few \sQuote{smart} combinations 45 | of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions. 46 | 47 | The \samp{scale} can be used to control how copy number differences play a role in the PCA. Usually 48 | we assume that going from 0 to 1 copy of a gene is the big change of the genome, and going from 1 to 49 | 2 (or more) copies is less. Prior to computing the PCA, the \samp{pan.matrix} is transformed according 50 | to the following affine mapping: If the original value in \samp{pan.matrix} is \samp{x}, and \samp{x} 51 | is not 0, then the transformed value is \samp{1 + (x-1)*scale}. Note that with \samp{scale=0.0} 52 | (default) this will result in 1 regardless of how large \samp{x} was. In this case the PCA only 53 | distinguish between presence and absence of gene clusters. If \samp{scale=1.0} the value \samp{x} is 54 | left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between 55 | 1 copy and 0 copies. For any \samp{scale} between 0.0 and 1.0 the transformed value is shrunk towards 56 | 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA 57 | should be affected, and to what degree, by differences in copy numbers beyond 1. 58 | 59 | The PCA may also up- or downweight some clusters compared to others. The vector \samp{weights} must 60 | contain one value for each column in \samp{pan.matrix}. The default is to use flat weights, i.e. all 61 | clusters count equal. See \code{\link{geneWeights}} for alternative weighting strategies. 62 | } 63 | \examples{ 64 | # Loading a pan-matrix in this package 65 | data(xmpl.panmat) 66 | 67 | # Computing panPca 68 | ppca <- panPca(xmpl.panmat) 69 | 70 | \dontrun{ 71 | # Plotting explained variance 72 | library(ggplot2) 73 | ggplot(ppca$Evar.tbl) + 74 | geom_col(aes(x = Component, y = Explained.variance)) 75 | # Plotting scores 76 | ggplot(ppca$Scores.tbl) + 77 | geom_text(aes(x = PC1, y = PC2, label = GID.tag)) 78 | # Plotting loadings 79 | ggplot(ppca$Loadings.tbl) + 80 | geom_text(aes(x = PC1, y = PC2, label = Cluster)) 81 | } 82 | 83 | } 84 | \seealso{ 85 | \code{\link{distManhattan}}, \code{\link{geneWeights}}. 86 | } 87 | \author{ 88 | Lars Snipen and Kristian Hovde Liland. 89 | } 90 | -------------------------------------------------------------------------------- /man/panPrep.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/panprep.R 3 | \name{panPrep} 4 | \alias{panPrep} 5 | \title{Preparing FASTA files for pan-genomics} 6 | \usage{ 7 | panPrep(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = "") 8 | } 9 | \arguments{ 10 | \item{in.file}{The name of a FASTA formatted file with protein or nucleotide sequences for coding 11 | genes in a genome.} 12 | 13 | \item{genome_id}{The Genome Identifier, see below.} 14 | 15 | \item{out.file}{Name of file where the prepared sequences will be written.} 16 | 17 | \item{protein}{Logical, indicating if the \samp{in.file} contains protein (\code{TRUE}) or 18 | nucleotide (\code{FALSE}) sequences.} 19 | 20 | \item{min.length}{Minimum sequence length} 21 | 22 | \item{discard}{A text, a regular expression, and sequences having a match against this in their 23 | headerline will be discarded.} 24 | } 25 | \value{ 26 | This function produces a FASTA formatted sequence file, and returns the name of this file. 27 | } 28 | \description{ 29 | Preparing a FASTA file before starting comparisons of sequences. 30 | } 31 | \details{ 32 | This function will read the \code{in.file} and produce another, slightly modified, FASTA file 33 | which is prepared for the comparisons using \code{\link{blastpAllAll}}, \code{\link{hmmerScan}} 34 | or any other method. 35 | 36 | The main purpose of \code{\link{panPrep}} is to make certain every sequence is labeled with a tag 37 | called a \samp{genome_id} identifying the genome from which it comes. This text contains the text 38 | \dQuote{GID} followed by an integer. This integer can be any integer as long as it is unique to every 39 | genome in the study. If a genome has the text \dQuote{GID12345} as identifier, then the 40 | sequences in the file produced by \code{\link{panPrep}} will have headerlines starting with 41 | \dQuote{GID12345_seq1}, \dQuote{GID12345_seq2}, \dQuote{GID12345_seq3}...etc. This makes it possible 42 | to quickly identify which genome every sequence belongs to. 43 | 44 | The \samp{genome_id} is also added to the file name specified in \samp{out.file}. For this reason the 45 | \samp{out.file} must have a file extension containing letters only. By convention, we expect FASTA 46 | files to have one of the extensions \samp{.fsa}, \samp{.faa}, \samp{.fa} or \samp{.fasta}. 47 | 48 | \code{\link{panPrep}} will also remove sequences shorter than \code{min.length}, removing stop codon 49 | symbols (\samp{*}), replacing alien characters with \samp{X} and converting all sequences to upper-case. 50 | If the input \samp{discard} contains a regular expression, any sequences having a match to this in their 51 | headerline are also removed. Example: If we use the \code{prodigal} software (see \code{\link[microseq]{findGenes}}) 52 | to find proteins in a genome, partially predicted genes will have the text \samp{partial=10} or 53 | \samp{partial=01} in their headerline. Using \samp{discard= "partial=01|partial=10"} will remove 54 | these from the data set. 55 | } 56 | \examples{ 57 | # Using a protein file in this package 58 | # We need to uncompress it first... 59 | pf <- file.path(path.package("micropan"),"extdata","xmpl.faa.xz") 60 | prot.file <- tempfile(fileext = ".xz") 61 | ok <- file.copy(from = pf, to = prot.file) 62 | prot.file <- xzuncompress(prot.file) 63 | 64 | # Prepping it, using the genome_id "GID123" 65 | prepped.file <- panPrep(prot.file, genome_id = "GID123", out.file = tempfile(fileext = ".faa")) 66 | 67 | # Reading the prepped file 68 | prepped <- readFasta(prepped.file) 69 | head(prepped) 70 | 71 | # ...and cleaning... 72 | ok <- file.remove(prot.file, prepped.file) 73 | 74 | } 75 | \seealso{ 76 | \code{\link{hmmerScan}}, \code{\link{blastpAllAll}}. 77 | } 78 | \author{ 79 | Lars Snipen and Kristian Liland. 80 | } 81 | -------------------------------------------------------------------------------- /man/rarefaction.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/rarefaction.R 3 | \name{rarefaction} 4 | \alias{rarefaction} 5 | \title{Rarefaction curves for a pan-genome} 6 | \usage{ 7 | rarefaction(pan.matrix, n.perm = 1) 8 | } 9 | \arguments{ 10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.} 11 | 12 | \item{n.perm}{The number of random genome orderings to use. If \samp{n.perm=1} the fixed order of 13 | the genomes in \samp{pan.matrix} is used.} 14 | } 15 | \value{ 16 | A table with the curves in the columns. The first column is the number of genomes, while 17 | all other columns are the cumulative number of clusters, one column for each permutation. 18 | } 19 | \description{ 20 | Computes rarefaction curves for a number of random permutations of genomes. 21 | } 22 | \details{ 23 | A rarefaction curve is simply the cumulative number of unique gene clusters we observe as 24 | more and more genomes are being considered. The shape of this curve will depend on the order of the 25 | genomes. This function will typically compute rarefaction curves for a number of (\samp{n.perm}) 26 | orderings. By using a large number of permutations, and then averaging over the results, the effect 27 | of any particular ordering is smoothed. 28 | 29 | The averaged curve illustrates how many new gene clusters we observe for each new genome. If this 30 | levels out and becomes flat, it means we expect few, if any, new gene clusters by sequencing more 31 | genomes. The function \code{\link{heaps}} can be used to estimate population openness based on this 32 | principle. 33 | } 34 | \examples{ 35 | # Loading a pan-matrix in this package 36 | data(xmpl.panmat) 37 | 38 | # Rarefaction 39 | rar.tbl <- rarefaction(xmpl.panmat, n.perm = 1000) 40 | 41 | \dontrun{ 42 | # Plotting 43 | library(ggplot2) 44 | library(tidyr) 45 | rar.tbl \%>\% 46 | gather(key = "Permutation", value = "Clusters", -Genome) \%>\% 47 | ggplot(aes(x = Genome, y = Clusters, group = Permutation)) + 48 | geom_line() 49 | } 50 | 51 | } 52 | \seealso{ 53 | \code{\link{heaps}}, \code{\link{panMatrix}}. 54 | } 55 | \author{ 56 | Lars Snipen and Kristian Hovde Liland. 57 | } 58 | -------------------------------------------------------------------------------- /man/readBlastSelf.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/bdist.R 3 | \name{readBlastSelf} 4 | \alias{readBlastSelf} 5 | \alias{readBlastPair} 6 | \title{Reads BLAST result files} 7 | \usage{ 8 | readBlastSelf(blast.files, e.value = 1, verbose = TRUE) 9 | } 10 | \arguments{ 11 | \item{blast.files}{A text vector of filenames.} 12 | 13 | \item{e.value}{A threshold E-value to immediately discard (very) poor BLAST alignments.} 14 | 15 | \item{verbose}{Logical, indicating if textual output should be given to monitor the progress.} 16 | } 17 | \value{ 18 | The functions returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore} 19 | and \samp{Distance}. Each row corresponds to a pair of sequences (a Dbase and a Query sequence) having at least 20 | one BLAST hit between 21 | them. All pairs \emph{not} listed have distance 1.0 between them. You should normally bind the output from 22 | \code{readBlastSelf} to the ouptut from \code{readBlastPair} and use the result as input to \code{\link{bDist}}. 23 | } 24 | \description{ 25 | Reads files from a search with blastpAllAll 26 | } 27 | \details{ 28 | The filenames given as input must refer to BLAST result files produced by \code{\link{blastpAllAll}}. 29 | 30 | With \code{readBlastSelf} you only read the self-alignment results, i.e. blasting a genome against itself. With 31 | \code{readBlastPair} you read all the other files, i.e. different genomes compared. You may use all blast file 32 | names as input to both, they will select the proper files based on their names, e.g. GID1_vs_GID1.txt is read 33 | by \code{readBlastSelf} while GID2_vs_GID1.txt is read by \code{readBlastPair}. 34 | 35 | Setting a small \samp{e.value} threshold will filter the alignment, and may speed up this and later processing, 36 | but you may also loose some important alignments for short sequences. 37 | 38 | Both these functions are used by \code{\link{bDist}}. The reason we provide them separately is to allow the user 39 | to complete this file reading before calling \code{\link{bDist}}. If you have a huge number of files, a 40 | skilled user may utilize parallell processing to speed up the reading. For normal size data sets (e.g. less than 100 genomes) 41 | you should probably use \code{\link{bDist}} directly. 42 | } 43 | \examples{ 44 | # Using BLAST result files in this package... 45 | prefix <- c("GID1_vs_GID1_", 46 | "GID2_vs_GID1_", 47 | "GID3_vs_GID1_", 48 | "GID2_vs_GID2_", 49 | "GID3_vs_GID2_", 50 | "GID3_vs_GID3_") 51 | bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz")) 52 | 53 | # We need to uncompress them first... 54 | blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz") 55 | ok <- file.copy(from = bf, to = blast.files) 56 | blast.files <- unlist(lapply(blast.files, xzuncompress)) 57 | 58 | # Reading self-alignment files, then the other files 59 | self.tbl <- readBlastSelf(blast.files) 60 | pair.tbl <- readBlastPair(blast.files) 61 | 62 | # ...and cleaning... 63 | ok <- file.remove(blast.files) 64 | 65 | # See also examples for bDist 66 | 67 | } 68 | \seealso{ 69 | \code{\link{bDist}}, \code{\link{blastpAllAll}}. 70 | } 71 | \author{ 72 | Lars Snipen. 73 | } 74 | -------------------------------------------------------------------------------- /man/readHmmer.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/hmmer3.R 3 | \name{readHmmer} 4 | \alias{readHmmer} 5 | \title{Reading results from a HMMER3 scan} 6 | \usage{ 7 | readHmmer(hmmer.file, e.value = 1, use.acc = TRUE) 8 | } 9 | \arguments{ 10 | \item{hmmer.file}{The name of a \code{\link{hmmerScan}} result file.} 11 | 12 | \item{e.value}{Numeric threshold, hits with E-value above this are ignored (default is 1.0).} 13 | 14 | \item{use.acc}{Logical indicating if accession numbers should be used to identify the hits.} 15 | } 16 | \value{ 17 | The results are returned in a \samp{tibble} with columns \samp{Query}, \samp{Hit}, 18 | \samp{Evalue}, \samp{Score}, \samp{Start}, \samp{Stop} and \samp{Description}. \samp{Query} is the tag 19 | identifying each query sequence. \samp{Hit} is the name or accession number for a pHMM in the database 20 | describing patterns. The \samp{Evalue} is the \samp{ievalue} in the HMMER3 terminology. The \samp{Score} 21 | is the HMMER3 score for the match between \samp{Query} and \samp{Hit}. The \samp{Start} and \samp{Stop} 22 | are the positions within the \samp{Query} where the \samp{Hit} (pattern) starts and stops. 23 | \samp{Description} is the description of the \samp{Hit}. There is one line for each hit. 24 | } 25 | \description{ 26 | Reading a text file produced by \code{\link{hmmerScan}}. 27 | } 28 | \details{ 29 | The function reads a text file produced by \code{\link{hmmerScan}}. By specifying a smaller 30 | \samp{e.value} you filter out poorer hits, and fewer results are returned. The option \samp{use.acc} 31 | should be turned off (FALSE) if you scan against your own database where accession numbers are lacking. 32 | } 33 | \examples{ 34 | # See the examples in the Help-files for dClust and hmmerScan. 35 | 36 | } 37 | \seealso{ 38 | \code{\link{hmmerScan}}, \code{\link{hmmerCleanOverlap}}, \code{\link{dClust}}. 39 | } 40 | \author{ 41 | Lars Snipen and Kristian Hovde Liland. 42 | } 43 | -------------------------------------------------------------------------------- /man/xmpl.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/xmpl.R 3 | \docType{data} 4 | \name{xmpl} 5 | \alias{xmpl} 6 | \alias{xmpl.bdist} 7 | \alias{xmpl.bclst} 8 | \alias{xmpl.panmat} 9 | \title{Data sets for use in examples} 10 | \usage{ 11 | data(xmpl.bdist) 12 | data(xmpl.bclst) 13 | data(xmpl.panmat) 14 | } 15 | \description{ 16 | This data set contains several files with various objects used in examples 17 | in some of the functions in the \code{micropan} package. 18 | } 19 | \details{ 20 | \samp{xmpl.bdist} is a \code{tibble} with 4 columns holding all 21 | BLAST distances between pairs of proteins in an example with 10 small genomes. 22 | 23 | \samp{xmpl.bclst} is a clustering vector of all proteins in the 24 | genomes from \samp{xmpl.bdist}. 25 | 26 | \samp{xmpl.panmat} is a pan-matrix with 10 rows and 1210 columns 27 | computed from \samp{xmpl.bclst}. 28 | } 29 | \examples{ 30 | 31 | # BLAST distances, only the first 20 are displayed 32 | data(xmpl.bdist) 33 | head(xmpl.bdist) 34 | 35 | # Clustering vector 36 | data(xmpl.bclst) 37 | print(xmpl.bclst[1:30]) 38 | 39 | # Pan-matrix 40 | data(xmpl.panmat) 41 | head(xmpl.panmat) 42 | 43 | } 44 | \author{ 45 | Lars Snipen and Kristian Hovde Liland. 46 | } 47 | -------------------------------------------------------------------------------- /man/xz.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/xz.R 3 | \name{xzcompress} 4 | \alias{xzcompress} 5 | \alias{xzuncompress} 6 | \title{Compressing and uncompressing text files} 7 | \usage{ 8 | xzcompress( 9 | filename, 10 | destname = sprintf("\%s.xz", filename), 11 | temporary = FALSE, 12 | skip = FALSE, 13 | overwrite = FALSE, 14 | remove = TRUE, 15 | BFR.SIZE = 1e+07, 16 | compression = 6, 17 | ... 18 | ) 19 | 20 | xzuncompress( 21 | filename, 22 | destname = gsub("[.]xz$", "", filename, ignore.case = TRUE), 23 | temporary = FALSE, 24 | skip = FALSE, 25 | overwrite = FALSE, 26 | remove = TRUE, 27 | BFR.SIZE = 1e+07, 28 | ... 29 | ) 30 | } 31 | \arguments{ 32 | \item{filename}{Path name of input file.} 33 | 34 | \item{destname}{Pathname of output file.} 35 | 36 | \item{temporary}{If TRUE, the output file is created in a temporary directory.} 37 | 38 | \item{skip}{If TRUE and the output file already exists, the output file is returned as is.} 39 | 40 | \item{overwrite}{If TRUE and the output file already exists, the file is silently overwritting, 41 | otherwise an exception is thrown (unless skip is TRUE).} 42 | 43 | \item{remove}{If TRUE, the input file is removed afterward, otherwise not.} 44 | 45 | \item{BFR.SIZE}{The number of bytes read in each chunk.} 46 | 47 | \item{compression}{The compression level used (1-9).} 48 | 49 | \item{...}{Not used.} 50 | } 51 | \value{ 52 | Returns the pathname of the output file. The number of bytes processed is returned as an attribute. 53 | } 54 | \description{ 55 | These functions are adapted from the \code{R.utils} package from gzip to xz. Internally 56 | \code{xzfile()} (see connections) is used to read (write) chunks to (from) the xz file. If the 57 | process is interrupted before completed, the partially written output file is automatically removed. 58 | } 59 | \examples{ 60 | # Creating small file 61 | tf <- tempfile() 62 | cat(file=tf, "Hello world!") 63 | 64 | # Compressing 65 | tf.xz <- xzcompress(tf) 66 | print(file.info(tf.xz)) 67 | 68 | # Uncompressing 69 | tf <- xzuncompress(tf.xz) 70 | print(file.info(tf)) 71 | file.remove(tf) 72 | 73 | } 74 | \author{ 75 | Kristian Hovde Liland. 76 | } 77 | -------------------------------------------------------------------------------- /vignettes/vignette-concordance.tex: -------------------------------------------------------------------------------- 1 | \Sconcordance{concordance:vignette.tex:C:/projects/git/micropan/vignettes/vignette.Rnw:% 2 | 1 31 1 1 2 4 0 1 2 4 1 1 2 4 0 1 2 4 1} 3 | -------------------------------------------------------------------------------- /vignettes/vignette.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | \usepackage{url,Sweave} 3 | %\VignetteIndexEntry{The micropan package vignette} 4 | 5 | \title{The \texttt{micropan} package vignette} 6 | \author{Lars Snipen and Kristian Hovde Liland} 7 | \date{} 8 | 9 | \begin{document} 10 | \SweaveOpts{concordance=TRUE} 11 | %\SweaveOpts{concordance=TRUE} 12 | 13 | \maketitle 14 | 15 | 16 | \section{Using \texttt{dplyr} and \texttt{stringr}} 17 | A major change in the 2.0 version is the use of generic data structures and functions in R instead of creating package specific ones. This makes it possible to use the power of standard data manipulation tools and visualization that R-users are familiar with. 18 | 19 | Compared to previous versions some functions have been moved to the \texttt{microseq} package. 20 | 21 | You will also find no casestudy document or plotting functions. However, if you locate the GitHub site for this package, you will find a tutorial with code making similar plots using \texttt{ggplot} or \texttt{ggdendro}. This is an example of using generic R tools instead of making functions for each special case. 22 | 23 | \subsection{Faster reading of BLAST results} 24 | A major change in the 2.1 version is faster reading of the BLAST result files, see `?bDist` or the tutorial at GitHub mentioned above for more details. 25 | 26 | 27 | \section{External software} 28 | Some functions in this package calls upons external software that must be available on the system. Some of these are 'installed' by simply downloading a binary executable that you put somewhere proper on your computer. To make such programs visible to R, you typically need to update your \texttt{PATH} environment variable, to specify where these executables are located. Try it out, and use google for help! 29 | 30 | 31 | \subsection{Software \texttt{blast+}} 32 | The function \emph{blastpAllAll} uses the free software \texttt{blast+} (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Source code and installers makes it straightforward to install. In the R console the command 33 | <>= 34 | system("blastp -h") 35 | @ 36 | should produce some sensible output. 37 | 38 | 39 | \subsection{Software \texttt{hmmer}} 40 | The functions \emph{hmmerScan()} uses the free software \texttt{hmmer} (http://hmmer.org/). This software is developed for UNIX systems (e.g. Mac or Linux), and Windows users may find it a little difficult to install and run from R. In the R console the command 41 | <>= 42 | system("hmmscan -h") 43 | @ 44 | should produce some sensible output. 45 | 46 | 47 | 48 | \end{document} -------------------------------------------------------------------------------- /vignettes/vignette.aux: -------------------------------------------------------------------------------- 1 | \relax 2 | \@writefile{toc}{\contentsline {section}{\numberline {1}Using \texttt {dplyr} and \texttt {stringr}}{1}{}\protected@file@percent } 3 | \@writefile{toc}{\contentsline {subsection}{\numberline {1.1}Faster reading of BLAST results}{1}{}\protected@file@percent } 4 | \@writefile{toc}{\contentsline {section}{\numberline {2}External software}{1}{}\protected@file@percent } 5 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Software \texttt {blast+}}{1}{}\protected@file@percent } 6 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Software \texttt {hmmer}}{2}{}\protected@file@percent } 7 | \gdef \@abspage@last{2} 8 | -------------------------------------------------------------------------------- /vignettes/vignette.log: -------------------------------------------------------------------------------- 1 | This is pdfTeX, Version 3.14159265-2.6-1.40.21 (MiKTeX 20.10) (preloaded format=pdflatex 2020.10.12) 9 NOV 2020 16:22 2 | entering extended mode 3 | **C:/projects/git/micropan/vignettes/vignette.tex 4 | (C:/projects/git/micropan/vignettes/vignette.tex 5 | LaTeX2e <2020-10-01> patch level 1 6 | L3 programming layer <2020-10-05> xparse <2020-03-03> ("C:\Program Files\MiKTeX 7 | \tex/latex/base\article.cls" 8 | Document Class: article 2020/04/10 v1.4m Standard LaTeX document class 9 | ("C:\Program Files\MiKTeX\tex/latex/base\size10.clo" 10 | File: size10.clo 2020/04/10 v1.4m Standard LaTeX file (size option) 11 | ) 12 | \c@part=\count175 13 | \c@section=\count176 14 | \c@subsection=\count177 15 | \c@subsubsection=\count178 16 | \c@paragraph=\count179 17 | \c@subparagraph=\count180 18 | \c@figure=\count181 19 | \c@table=\count182 20 | \abovecaptionskip=\skip47 21 | \belowcaptionskip=\skip48 22 | \bibindent=\dimen138 23 | ) ("C:\Program Files\MiKTeX\tex/latex/url\url.sty" 24 | \Urlmuskip=\muskip16 25 | Package: url 2013/09/16 ver 3.4 Verb mode for urls, etc. 26 | ) (C:/PROGRA~1/R/R-40~1.3/share/texmf/tex/latex\Sweave.sty 27 | Package: Sweave 28 | ("C:\Program Files\MiKTeX\tex/latex/base\ifthen.sty" 29 | Package: ifthen 2014/09/29 v1.1c Standard LaTeX ifthen package (DPC) 30 | ) ("C:\Program Files\MiKTeX\tex/latex/graphics\graphicx.sty" 31 | Package: graphicx 2020/09/09 v1.2b Enhanced LaTeX Graphics (DPC,SPQR) 32 | ("C:\Program Files\MiKTeX\tex/latex/graphics\keyval.sty" 33 | Package: keyval 2014/10/28 v1.15 key=value parser (DPC) 34 | \KV@toks@=\toks15 35 | ) ("C:\Program Files\MiKTeX\tex/latex/graphics\graphics.sty" 36 | Package: graphics 2020/08/30 v1.4c Standard LaTeX Graphics (DPC,SPQR) 37 | ("C:\Program Files\MiKTeX\tex/latex/graphics\trig.sty" 38 | Package: trig 2016/01/03 v1.10 sin cos tan (DPC) 39 | ) ("C:\Program Files\MiKTeX\tex/latex/graphics-cfg\graphics.cfg" 40 | File: graphics.cfg 2016/06/04 v1.11 sample graphics configuration 41 | ) 42 | Package graphics Info: Driver file: pdftex.def on input line 105. 43 | ("C:\Program Files\MiKTeX\tex/latex/graphics-def\pdftex.def" 44 | File: pdftex.def 2020/10/05 v1.2a Graphics/color driver for pdftex 45 | )) 46 | \Gin@req@height=\dimen139 47 | \Gin@req@width=\dimen140 48 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/fancyvrb\fancyvrb.sty 49 | Package: fancyvrb 2020/05/03 v3.6 verbatim text (tvz,hv) 50 | \FV@CodeLineNo=\count183 51 | \FV@InFile=\read2 52 | \FV@TabBox=\box47 53 | \c@FancyVerbLine=\count184 54 | \FV@StepNumber=\count185 55 | \FV@OutFile=\write3 56 | ) ("C:\Program Files\MiKTeX\tex/latex/base\textcomp.sty" 57 | Package: textcomp 2020/02/02 v2.0n Standard LaTeX package 58 | ) ("C:\Program Files\MiKTeX\tex/latex/base\fontenc.sty" 59 | Package: fontenc 2020/08/10 v2.0s Standard LaTeX package 60 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/ae\ae.sty 61 | Package: ae 2001/02/12 1.3 Almost European Computer Modern 62 | ("C:\Program Files\MiKTeX\tex/latex/base\fontenc.sty" 63 | Package: fontenc 2020/08/10 v2.0s Standard LaTeX package 64 | ))) 65 | LaTeX Font Info: Trying to load font information for T1+aer on input line 9. 66 | 67 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/ae\t1aer.fd 68 | File: t1aer.fd 1997/11/16 Font definitions for T1/aer. 69 | ) ("C:\Program Files\MiKTeX\tex/latex/l3backend\l3backend-pdftex.def" 70 | File: l3backend-pdftex.def 2020-09-24 L3 backend support: PDF output (pdfTeX) 71 | \l__kernel_color_stack_int=\count186 72 | \l__pdf_internal_box=\box48 73 | ) (vignette.aux) 74 | \openout1 = `vignette.aux'. 75 | 76 | LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 9. 77 | LaTeX Font Info: ... okay on input line 9. 78 | LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 9. 79 | LaTeX Font Info: ... okay on input line 9. 80 | LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 9. 81 | LaTeX Font Info: ... okay on input line 9. 82 | LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 9. 83 | LaTeX Font Info: ... okay on input line 9. 84 | LaTeX Font Info: Checking defaults for TS1/cmr/m/n on input line 9. 85 | LaTeX Font Info: ... okay on input line 9. 86 | LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 9. 87 | LaTeX Font Info: ... okay on input line 9. 88 | LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 9. 89 | LaTeX Font Info: ... okay on input line 9. 90 | ("C:\Program Files\MiKTeX\tex/context/base/mkii\supp-pdf.mkii" 91 | [Loading MPS to PDF converter (version 2006.09.02).] 92 | \scratchcounter=\count187 93 | \scratchdimen=\dimen141 94 | \scratchbox=\box49 95 | \nofMPsegments=\count188 96 | \nofMParguments=\count189 97 | \everyMPshowfont=\toks16 98 | \MPscratchCnt=\count190 99 | \MPscratchDim=\dimen142 100 | \MPnumerator=\count191 101 | \makeMPintoPDFobject=\count192 102 | \everyMPtoPDFconversion=\toks17 103 | ) ("C:\Program Files\MiKTeX\tex/latex/epstopdf-pkg\epstopdf-base.sty" 104 | Package: epstopdf-base 2020-01-24 v2.11 Base part for package epstopdf 105 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/infwarerr\infwarerr.sty 106 | Package: infwarerr 2019/12/03 v1.5 Providing info/warning/error messages (HO) 107 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/grfext\grfext.sty 108 | Package: grfext 2019/12/03 v1.3 Manage graphics extensions (HO) 109 | 110 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/kvdefinekeys\kvdefinekeys.s 111 | ty 112 | Package: kvdefinekeys 2019-12-19 v1.6 Define keys (HO) 113 | )) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/kvoptions\kvoptions.sty 114 | Package: kvoptions 2019/11/29 v3.13 Key value format for package options (HO) 115 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/ltxcmds\ltxcmds.sty 116 | Package: ltxcmds 2020-05-10 v1.25 LaTeX kernel commands for general use (HO) 117 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/kvsetkeys\kvsetkeys.sty 118 | Package: kvsetkeys 2019/12/15 v1.18 Key value parser (HO) 119 | )) ("C:\Program Files\MiKTeX\tex/generic/pdftexcmds\pdftexcmds.sty" 120 | Package: pdftexcmds 2020-06-27 v0.33 Utility functions of pdfTeX for LuaTeX (HO 121 | ) 122 | ("C:\Program Files\MiKTeX\tex/generic/iftex\iftex.sty" 123 | Package: iftex 2020/03/06 v1.0d TeX engine tests 124 | ) 125 | Package pdftexcmds Info: \pdf@primitive is available. 126 | Package pdftexcmds Info: \pdf@ifprimitive is available. 127 | Package pdftexcmds Info: \pdfdraftmode found. 128 | ) 129 | Package epstopdf-base Info: Redefining graphics rule for `.eps' on input line 4 130 | 85. 131 | Package grfext Info: Graphics extension search list: 132 | (grfext) [.pdf,.png,.jpg,.mps,.jpeg,.jbig2,.jb2,.PDF,.PNG,.JPG,.JPE 133 | G,.JBIG2,.JB2,.eps] 134 | (grfext) \AppendGraphicsExtensions on input line 504. 135 | ) (vignette-concordance.tex) 136 | LaTeX Font Info: Trying to load font information for T1+aett on input line 1 137 | 3. 138 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/ae\t1aett.fd 139 | File: t1aett.fd 1997/11/16 Font definitions for T1/aett. 140 | ) 141 | LaTeX Font Info: External font `cmex10' loaded for size 142 | (Font) <12> on input line 13. 143 | LaTeX Font Info: External font `cmex10' loaded for size 144 | (Font) <8> on input line 13. 145 | LaTeX Font Info: External font `cmex10' loaded for size 146 | (Font) <6> on input line 13. 147 | 148 | LaTeX Font Warning: Font shape `T1/aett/b/n' undefined 149 | (Font) using `T1/aett/m/n' instead on input line 16. 150 | 151 | 152 | Overfull \hbox (179.69289pt too wide) in paragraph at lines 32--34 153 | \T1/aer/m/n/10 The func-tion \T1/aer/m/it/10 blast-pAl-lAll \T1/aer/m/n/10 uses 154 | the free soft-ware \T1/aett/m/n/10 blast+ \T1/aer/m/n/10 (ftp://ftp.ncbi.nlm.n 155 | ih.gov/blast/executables/blast+/LATEST/). 156 | [] 157 | 158 | [1 159 | 160 | {C:/Users/larssn/AppData/Local/MiKTeX/pdftex/config/pdftex.map}] [2] (vignette. 161 | aux) 162 | 163 | LaTeX Font Warning: Some font shapes were not available, defaults substituted. 164 | 165 | ) 166 | Here is how much of TeX's memory you used: 167 | 2555 strings out of 480236 168 | 37894 string characters out of 2890387 169 | 307402 words of memory out of 3000000 170 | 19045 multiletter control sequences out of 15000+200000 171 | 559697 words of font info for 80 fonts, out of 3000000 for 9000 172 | 1141 hyphenation exceptions out of 8191 173 | 72i,6n,77p,487b,225s stack positions out of 5000i,500n,10000p,200000b,50000s 174 | 181 | Output written on vignette.pdf (2 pages, 100326 bytes). 182 | PDF statistics: 183 | 42 PDF objects out of 1000 (max. 8388607) 184 | 0 named destinations out of 1000 (max. 500000) 185 | 5 words of extra memory for PDF output out of 10000 (max. 10000000) 186 | 187 | -------------------------------------------------------------------------------- /vignettes/vignette.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/vignettes/vignette.pdf -------------------------------------------------------------------------------- /vignettes/vignette.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | \usepackage{url,Sweave} 3 | %\VignetteIndexEntry{The micropan package vignette} 4 | 5 | \title{The \texttt{micropan} package vignette} 6 | \author{Lars Snipen and Kristian Hovde Liland} 7 | \date{} 8 | 9 | \begin{document} 10 | \input{vignette-concordance} 11 | %\SweaveOpts{concordance=TRUE} 12 | 13 | \maketitle 14 | 15 | 16 | \section{Using \texttt{dplyr} and \texttt{stringr}} 17 | A major change in the 2.0 version is the use of generic data structures and functions in R instead of creating package specific ones. This makes it possible to use the power of standard data manipulation tools and visualization that R-users are familiar with. 18 | 19 | Compared to previous versions some functions have been moved to the \texttt{microseq} package. 20 | 21 | You will also find no casestudy document or plotting functions. However, if you locate the GitHub site for this package, you will find a tutorial with code making similar plots using \texttt{ggplot} or \texttt{ggdendro}. This is an example of using generic R tools instead of making functions for each special case. 22 | 23 | \subsection{Faster reading of BLAST results} 24 | A major change in the 2.1 version is faster reading of the BLAST result files, see `?bDist` or the tutorial at GitHub mentioned above for more details. 25 | 26 | 27 | \section{External software} 28 | Some functions in this package calls upons external software that must be available on the system. Some of these are 'installed' by simply downloading a binary executable that you put somewhere proper on your computer. To make such programs visible to R, you typically need to update your \texttt{PATH} environment variable, to specify where these executables are located. Try it out, and use google for help! 29 | 30 | 31 | \subsection{Software \texttt{blast+}} 32 | The function \emph{blastpAllAll} uses the free software \texttt{blast+} (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Source code and installers makes it straightforward to install. In the R console the command 33 | \begin{Schunk} 34 | \begin{Sinput} 35 | > system("blastp -h") 36 | \end{Sinput} 37 | \end{Schunk} 38 | should produce some sensible output. 39 | 40 | 41 | \subsection{Software \texttt{hmmer}} 42 | The functions \emph{hmmerScan()} uses the free software \texttt{hmmer} (http://hmmer.org/). This software is developed for UNIX systems (e.g. Mac or Linux), and Windows users may find it a little difficult to install and run from R. In the R console the command 43 | \begin{Schunk} 44 | \begin{Sinput} 45 | > system("hmmscan -h") 46 | \end{Sinput} 47 | \end{Schunk} 48 | should produce some sensible output. 49 | 50 | 51 | 52 | \end{document} 53 | --------------------------------------------------------------------------------